Spark Standalone - Spark application "Stuck" doesn't launch an executor on an available worker


Spark Standalone - Spark application "Stuck" doesn't launch an executor on an available worker

Brett Spark
We have seen issues where the Spark master shows an available worker, but the application will not launch a new executor on that worker.
Here is the scenario with Spark 3.0.0.

On the master: the application is shown as running with 0 cores for 57.6 hours.
[screenshot: master UI showing the application running with 0 cores for 57.6 hours]

The worker worker-20210125080015-100.67.94.187-43703 is shown as alive, but it's not being used.
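
As a cross-check, here is a rough sketch (in Scala) of dumping what the master itself reports for its workers and applications; it assumes the master web UI exposes its /json view, 8080 is the default master UI port, and the host below is a placeholder:

import scala.io.Source

object MasterStateCheck extends App {
  // Placeholder host; point this at the actual master. 8080 is the default master UI port.
  val masterUi = "http://localhost:8080"
  // The standalone master UI serves a JSON summary of its state at /json:
  // the list of workers (with their state), cores used, and active/completed applications.
  val state = Source.fromURL(s"$masterUi/json").mkString
  println(state)
  // A worker reported as ALIVE here while the application sits at 0 cores
  // matches the symptom above.
}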

The application page shows the following executor summary:
[screenshot: application executor summary]

I believe that the assigned Spark worker 100.69.189.133 may have died because the backing server was killed.

Looking further at the application, it shows an active job, but it has been stuck for 57 hours:
[screenshot: active job, stuck for 57 hours]

The executors page shows only one executor, and it is dead:
[screenshot: executors page showing one dead executor]

I would expect this Spark application to see that the executor is dead and tell the master to assign a new executor on a new worker, but this does not appear to be the case.
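
For what it's worth, here is a minimal sketch of checking the driver's own view of its executors through the monitoring REST API; the host is a placeholder and 4040 is the default UI port of a running application:

import scala.io.Source

object ExecutorStateCheck extends App {
  // Placeholder host; point this at the driver machine. 4040 is the default application UI port.
  val driverUi = "http://localhost:4040"
  // List the application(s) known to this UI and pick out the app id from the output.
  println(Source.fromURL(s"$driverUi/api/v1/applications").mkString)
  // Then the allexecutors endpoint lists every executor, active and dead:
  //   GET  <driverUi>/api/v1/applications/<app-id>/allexecutors
  // If the only entries are dead ones, the driver is waiting on an executor the
  // master never re-allocated, which is what the executors page above suggests.
}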

Questions:
* Is it possible to set a timeout on a Spark application / task / job so that this doesn't hang forever?
* Are there other Spark confs I can use to make sure that a new executor is re-launched?
* Is there something else I should be checking to see why this behavior is occurring?
* Is this a known issue? Is there any way to force this to complete?
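
For context, here is a sketch of how we would wire up the timeout-related confs we have found so far; the values shown are the documented defaults, the master URL is a placeholder, and as far as I can tell none of these acts as a whole-application timeout, which is part of why I am asking:

import org.apache.spark.sql.SparkSession

object TimeoutConfSketch extends App {
  // Illustrative only: the confs in question, with their documented defaults spelled out.
  val spark = SparkSession.builder()
    .appName("executor-loss-sketch")
    .master("spark://master-host:7077")                 // placeholder standalone master URL
    .config("spark.network.timeout", "120s")            // default timeout for network interactions, incl. heartbeats
    .config("spark.executor.heartbeatInterval", "10s")  // must stay well below spark.network.timeout
    .config("spark.task.maxFailures", "4")              // per-task retries before the job is failed
    .getOrCreate()

  // Cluster-side knobs are set on the master (e.g. via SPARK_MASTER_OPTS), not on the application:
  //   spark.worker.timeout             seconds before the master marks a silent worker as lost
  //   spark.deploy.maxExecutorRetries  back-to-back executor failures tolerated before the app is removed
  spark.stop()
}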

Re: Spark Standalone - Spark application "Stuck" doesn't launch an executor on an available worker

Mich Talebzadeh
Hi Brett,

Is there a particular reason you are using standalone mode? I used it until a couple of years ago and then switched to YARN. Without being critical, standalone is unpredictable at best; I recall I had similar issues on 2.1.4 as well.

Going back to the Spark GUI, I can see that the job has been stuck there, as you pointed out, for more than a day, so it sounds stale. What is jps telling you?
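
Something along these lines on each box will show which Spark JVMs are actually alive; just a sketch, and it assumes the JDK's jps is on the PATH:

import scala.sys.process._

object JpsCheck extends App {
  // -l prints the fully qualified main class of each JVM running on this machine.
  val out = Seq("jps", "-l").!!
  println(out)
  // What you would expect per host in standalone mode:
  //   master host   : org.apache.spark.deploy.master.Master
  //   worker host   : org.apache.spark.deploy.worker.Worker
  //   live executor : org.apache.spark.executor.CoarseGrainedExecutorBackend
  // No CoarseGrainedExecutorBackend on the "alive" worker would confirm that
  // nothing was re-launched there.
}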

HTH


