Restarting a failed Spark streaming job running on top of a yarn cluster


jcgarciam
Hi Folks,

We have a few Spark Streaming jobs running on a YARN cluster, and from
time to time a job needs to be restarted (because it was killed for an
external reason, or for some other reason).

Once we submit the new job, we are faced with the following exception:
 ERROR spark.SparkContext: Failed to add /mnt/data1/yarn/nm/usercache/spark/appcache/*application_1537885048149_15382*/container_e82_1537885048149_15382_01_000001/__app__.jar to Spark environment
java.io.FileNotFoundException: Jar /mnt/data1/yarn/nm/usercache/spark/appcache/application_1537885048149_15382/container_e82_1537885048149_15382_01_000001/__app__.jar not found
        at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1807)
        at org.apache.spark.SparkContext.addJar(SparkContext.scala:1835)
        at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:457)

Of course, we know that *application_1537885048149_15382* corresponds to the
previous job that was killed, and that our YARN setup cleans up the usercache
directory very often to avoid filling the filesystem with unused files.
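
For anyone who wants to check this on their own setup, here is a minimal sketch (not part of our job) that pulls the YARN application ID out of a path like the one in the error and compares it with the ID of the application that is currently running. The helper names `app_id_from_path` and `is_stale_path` are made up for illustration, and the regular expression assumes the usual `application_<timestamp>_<sequence>` form.

```python
import re
from typing import Optional

# Assumption: YARN application IDs have the form application_<timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def app_id_from_path(path: str) -> Optional[str]:
    """Return the first YARN application ID embedded in the path, or None."""
    match = APP_ID_PATTERN.search(path)
    return match.group(0) if match else None


def is_stale_path(path: str, current_app_id: str) -> bool:
    """True when the path belongs to an application other than the current one."""
    found = app_id_from_path(path)
    return found is not None and found != current_app_id


# Path taken from the exception above.
jar_path = (
    "/mnt/data1/yarn/nm/usercache/spark/appcache/"
    "application_1537885048149_15382/"
    "container_e82_1537885048149_15382_01_000001/__app__.jar"
)
```

Calling `is_stale_path(jar_path, new_app_id)` with the ID of the newly submitted application returns `True` in our situation, which is how we know the new job is still pointing at the old container directory.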

However, what would you recommend for long-running jobs that have to be
restarted when the previous context is no longer available because of the
cleanup?


I hope it is clear what I mean; if you need more information, just ask.

Thanks

JC




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/