K8S spark-submit Loses Successful Driver Completion

K8S spark-submit Loses Successful Driver Completion

Marshall Markham



I am running Airflow + Spark + AKS (Azure K8s). Sporadically, when a Spark job completes, my spark-submit process does not notice that the driver has succeeded and continues to track the job as running. Does anyone know how the spark-submit process monitors driver processes on K8s? My expectation was that it monitors them over HTTP, but since we actually deleted the driver pod and the spark-submit process continued to show the job as in progress, I am now questioning this assumption. My end goal is to have spark-submit track driver behavior more accurately.


  • Marshall




Re: K8S spark-submit Loses Successful Driver Completion

Attila Zsolt Piros

I am not using Airflow, but I assume your application is deployed in cluster
mode; in that case the class you are looking for is
"org.apache.spark.deploy.k8s.submit.Client" [1].

If we are talking about the first "spark-submit" used to start the
application and not "spark-submit --status", then it contains a loop where the
application status is logged. This loop stops when the
"LoggingPodStatusWatcher" reports that the app has completed [2] or when
"spark.kubernetes.submission.waitAppCompletion" [3] is false.

And you are right: the monitoring (pod state watching) is done via REST
(HTTPS) and should be detected by the
"io.fabric8.kubernetes.client.Watcher.onClose()" method so by the kubernetes

I hope this helps. Some further questions if you need some more help:

1. What is the Spark version you are running?
2. Does it contain SPARK-24266 [4]?
3. If yes, can you reproduce the issue without Airflow, and do you have the
logs for the issue?
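
To make the wait loop described above a bit more concrete, here is a minimal
Python sketch of the idea. It is a toy model written for this thread, not
Spark's or fabric8's code: the names PodStatusWatcher, event_received,
await_completion and submit are hypothetical, and the rule that a DELETED or
ERROR event, or a Succeeded/Failed phase, ends the wait is an assumption made
for the illustration.

```python
import threading

# Pod phases treated as terminal in this toy model (an assumption).
TERMINAL_PHASES = {"Succeeded", "Failed"}


class PodStatusWatcher:
    """Toy stand-in for a pod status watcher (not Spark's implementation)."""

    def __init__(self):
        self._completed = threading.Event()
        self.phase = "unknown"

    def event_received(self, action, phase):
        # Called once for every watch event delivered for the driver pod.
        self.phase = phase
        if action in ("DELETED", "ERROR") or phase in TERMINAL_PHASES:
            self._completed.set()

    def await_completion(self, timeout=None):
        # Blocks until a completing event was seen; returns False on timeout.
        return self._completed.wait(timeout)


def submit(watcher, wait_app_completion=True, timeout=None):
    # Hypothetical submission step: only the waiting logic is modelled here.
    if not wait_app_completion:
        return "submitted"
    finished = watcher.await_completion(timeout)
    return "finished" if finished else "still waiting"
```

In this model the submitter only learns about completion through events that
actually reach the watcher. If a terminal event is never delivered, the wait
never ends on its own, which would look like the behaviour described in the
original question.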

Best regards,

[1] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L88-L103

[2] https://github.com/apache/spark/blob/8604db28b87b387bbdb3761df85fae292cd402a1/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L162-L166

[3] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala#L112-L114

[4] https://issues.apache.org/jira/browse/SPARK-24266
