Properly stop applications or jobs within the application

Behroz Sikander
Hello,
We are using spark-jobserver to spawn jobs in a Spark cluster. We have recently run into zombie jobs in the cluster. This usually happens when a job is accessing external resources such as Kafka or Cassandra (C*) and something goes wrong while consuming from them, for example when a topic that is being consumed is suddenly deleted in Kafka, or when the connection to the whole Kafka cluster breaks.

Within spark-jobserver, we have the option to delete the context/jobs in such scenarios.
When we delete the job, internally context.cancelJobGroup(<jobId>) is used.
When we delete the context, internally context.stop(true,true) is executed.

In both cases, even after we delete the job or context, the application sometimes keeps running on the Spark cluster, and some of its jobs are still being executed within Spark.
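
One plausible mechanism behind this, offered as an assumption rather than a diagnosis: cancelling work on the JVM generally comes down to interrupting a thread, and a thread that swallows the interrupt and retries keeps running after the cancel call has returned. The plain-Java sketch below is not Spark or spark-jobserver code. CancelDemo, cooperative, stubborn and cancelAndCheckStopped are hypothetical names, used only to contrast a task that honours interruption with one that does not.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only: plain JVM threads, no Spark APIs involved.
final class CancelDemo {

    // A task that honours interruption: it returns as soon as it is interrupted.
    static Runnable cooperative() {
        return () -> {
            try {
                while (true) {
                    Thread.sleep(10); // stand-in for a blocking call that reacts to interrupts
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and exit
            }
        };
    }

    // A task that swallows interruption and keeps retrying until `release` is set.
    static Runnable stubborn(AtomicBoolean release) {
        return () -> {
            while (!release.get()) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    // Swallowed: the loop just retries, so a cancel has no effect.
                }
            }
        };
    }

    // Starts the task, cancels it with an interrupt, and reports whether the
    // worker thread actually terminated within waitMs milliseconds.
    static boolean cancelAndCheckStopped(Runnable task, long waitMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        Future<?> future = pool.submit(() -> {
            started.countDown();
            task.run();
        });
        try {
            started.await();     // make sure the task is running before cancelling
            future.cancel(true); // returns immediately, whether or not the task stops
            pool.shutdown();
            return pool.awaitTermination(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        AtomicBoolean release = new AtomicBoolean(false);
        System.out.println("cooperative stopped: " + cancelAndCheckStopped(cooperative(), 2000));
        System.out.println("stubborn stopped: " + cancelAndCheckStopped(stubborn(release), 300));
        release.set(true); // let the stubborn worker exit so the JVM can terminate
    }
}
```

In both cases the cancel call itself returns right away. Only the check afterwards shows whether the work really ended, which is why a "deleted" job can still be alive.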

Here are the logs of one such scenario. The job context was stopped but it kept on running and became a zombie.

2018-02-28 15:36:50,931 INFO ForkJoinPool-3-worker-13 org.apache.kafka.common.utils.AppInfoParser []: Kafka version : 0.11.0.1-SNAPSHOT
2018-02-28 15:36:50,931 INFO ForkJoinPool-3-worker-13 org.apache.kafka.common.utils.AppInfoParser []: Kafka commitId : de8225b66d494cd
2018-02-28 15:36:51,144 INFO dispatcher-event-loop-5 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint []: Registered executor NettyRpcEndpointRef(null) (10.10.10.15:46224) with ID 1
2018-02-28 15:38:58,254 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:41:05,485 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:42:07,074 WARN JobServer-akka.actor.default-dispatcher-3 akka.cluster.ClusterCoreDaemon []: Cluster Node [akka.tcp://JobServer@127.0.0.1:43319] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://JobServer@127.0.0.1:37343, status = Up)]. Node roles [manager]

Later, we see the following logs. It seems that none of the Kafka nodes were reachable from the Spark job. The job kept retrying and became a zombie.

2018-02-28 15:43:12,717 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:45:19,949 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:47:27,180 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:49:34,412 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:51:41,644 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:53:48,877 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:55:56,109 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:58:03,340 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 16:00:10,572 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 16:02:17,804 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
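
The warnings above show the client retrying unreachable brokers for a long time. One possible mitigation, sketched here under assumptions, is to probe the broker addresses with a bounded TCP connect timeout before starting a job and to fail fast when none responds. BrokerPreflight, reachable and anyReachable are hypothetical helper names. The code uses only java.net and does not touch any Kafka API. A successful probe only shows that a port accepts connections at that moment, so it does not help when the brokers disappear mid-run.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

// Hypothetical pre-flight check: plain TCP probes, no Kafka client involved.
final class BrokerPreflight {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean reachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // Refused, timed out, unresolvable host, or port out of range.
            return false;
        }
    }

    // Returns true if at least one "host:port" entry answers within timeoutMs.
    // Entries that do not parse as host:port are skipped.
    static boolean anyReachable(List<String> servers, int timeoutMs) {
        for (String server : servers) {
            int colon = server.lastIndexOf(':');
            if (colon <= 0) {
                continue; // no host or no port separator
            }
            String host = server.substring(0, colon);
            int port;
            try {
                port = Integer.parseInt(server.substring(colon + 1));
            } catch (NumberFormatException e) {
                continue; // port is not a number
            }
            if (reachable(host, port, timeoutMs)) {
                return true;
            }
        }
        return false;
    }
}
```

A job could call anyReachable with its configured broker list and refuse to start when the result is false, instead of entering a retry loop that is hard to cancel later.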



We have a similar scenario for zombie contexts. The logs are in the gist below.

In the gist, you can see that the topic had not been created and the job tried to use it. When we then tried to delete the job, it became a zombie and kept logging:
"Block rdd_13011_0 already exists on this machine; not re-adding it"


So, my question would be, what is the right way to kill the jobs running within
the context or the context/application itself without having these zombies?


Regards,
Behroz

Re: Properly stop applications or jobs within the application

bsikander
It seems to be related to this Kafka issue:
https://issues.apache.org/jira/browse/KAFKA-1894



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Properly stop applications or jobs within the application

bsikander
Any help would be much appreciated. This seems to be a common problem.




Re: Properly stop applications or jobs within the application

sagargrover16
What do you mean by stopping applications?
Do you want to kill a batch application midway, or are you running streaming jobs that you want to kill?

With regards,
Sagar Grover

On Thu, Mar 8, 2018 at 1:45 PM, bsikander <[hidden email]> wrote:
Any help would be much appreciated. This seems to be a common problem.




Re: Properly stop applications or jobs within the application

bsikander
I have scenarios for both.
So, I want to kill both batch and streaming midway, if required.

Use case:
Normally, if everything is okay, we don't kill the application, but sometimes
something goes wrong while accessing external resources (like Kafka). In
that case, the application is no longer doing anything useful, so we want to
kill it midway. When we do, the application sometimes becomes a zombie and
doesn't get killed programmatically (at least, this is what we found). A kill
through the Master UI or a manual kill -9 is required to clean up the zombies.
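
A rough sketch of how the manual step could be automated, under assumptions: attempt the graceful stop with a deadline and escalate to a forced kill only if the deadline passes. StopWithDeadline and stopOrEscalate are hypothetical names. What gracefulStop and forceKill actually do is left to the caller. For example, gracefulStop could wrap the context stop call, and forceKill could call Runtime.getRuntime().halt(1) if the context runs in its own JVM. Neither is shown here, and both are assumptions about the deployment.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounded graceful stop with escalation.
final class StopWithDeadline {

    // Runs gracefulStop on a helper thread and waits up to timeoutMs.
    // Returns true if it completed normally in time. Otherwise runs
    // forceKill and returns false.
    static boolean stopOrEscalate(Runnable gracefulStop, Runnable forceKill, long timeoutMs) {
        ExecutorService stopper = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "graceful-stop");
            t.setDaemon(true); // a hung stop call must not keep the JVM alive
            return t;
        });
        Future<?> attempt = stopper.submit(gracefulStop);
        try {
            attempt.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            attempt.cancel(true); // best effort: interrupt the hung or failed stop
            forceKill.run();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            attempt.cancel(true);
            forceKill.run();
            return false;
        } finally {
            stopper.shutdownNow();
        }
    }
}
```

The helper thread is a daemon so that a stop call that never returns cannot itself become the reason the process stays alive.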




Re: Properly stop applications or jobs within the application

sagargrover16
I am assuming you are running in YARN cluster mode. Have you tried yarn application -kill <application_id>?

With regards,
Sagar Grover

On Thu, Mar 8, 2018 at 4:03 PM, bsikander <[hidden email]> wrote:
I have scenarios for both.
So, I want to kill both batch and streaming midway, if required.

Use case:
Normally, if everything is okay, we don't kill the application, but sometimes
something goes wrong while accessing external resources (like Kafka). In
that case, the application is no longer doing anything useful, so we want to
kill it midway. When we do, the application sometimes becomes a zombie and
doesn't get killed programmatically (at least, this is what we found). A kill
through the Master UI or a manual kill -9 is required to clean up the zombies.




Re: Properly stop applications or jobs within the application

bsikander
I am running in Spark standalone mode. No YARN.

Anyway, yarn application -kill is a manual process, and I do not want that. I
want to properly kill the driver/application programmatically.




Re: Properly stop applications or jobs within the application

Dhaval Modi
@sagar - A YARN kill is not a reliable way to stop Spark Streaming applications.



Regards,
Dhaval Modi
[hidden email]

On 8 March 2018 at 17:18, bsikander <[hidden email]> wrote:
I am running in Spark standalone mode. No YARN.

Anyway, yarn application -kill is a manual process, and I do not want that. I
want to properly kill the driver/application programmatically.


