How to track batch jobs in spark ?


kant kodali
Hi All,

How do I track batch jobs in Spark? For example, is there some id or token I can get after I spawn a batch job that I can use to track its progress or to kill the batch job itself?

For Streaming, we have StreamingQuery.id()

Thanks!

Re: How to track batch jobs in spark ?

pmatpadi
If you are deploying your Spark application on a YARN cluster:
1. SSH into the master node.
2. List the currently running applications and retrieve the application_id (or capture it programmatically, as sketched below):
    yarn application --list
3. Kill the application using the application_id, of the form application_xxxxx_xxxx, taken from the output of the list command:
        yarn application --kill <application_id>
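
If it helps, here is a minimal sketch (my own illustration, not something you must do) of capturing that id from inside the job: on YARN, SparkContext.applicationId returns the application_xxxxx_xxxx id, which you can log somewhere and later feed to yarn application --kill. It assumes a SparkSession named spark and a placeholder app name:

    // Sketch only: log the YARN application id from inside the driver so it
    // does not have to be picked out of `yarn application --list` later.
    import org.apache.spark.sql.SparkSession

    object BatchJobWithAppId {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("my-batch-job").getOrCreate()
        val appId = spark.sparkContext.applicationId // application_xxxxx_xxxx on YARN
        println(s"Track or kill this run with: yarn application --kill $appId")
        // ... the actual batch work goes here ...
        spark.stop()
      }
    }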


Re: How to track batch jobs in spark ?

Mark Hamstra
That will kill an entire Spark application, not a batch Job.


Re: How to track batch jobs in spark ?

kant kodali
Thanks for all the responses.

1) I am not using YARN; I am using Spark Standalone.
2) Yes, I want to be able to kill the whole application.
3) I want to be able to monitor the status of the application, which is running a batch query and is expected to run for an hour or so, so I am looking for some mechanism to monitor its progress, like a percentage or something.

Thanks!



Re: How to track batch jobs in spark ?

Thakrar, Jayesh

See if https://spark.apache.org/docs/latest/monitoring.html helps.

 

Essentially, whether you run an app as spark-shell or via spark-submit (local, Spark standalone cluster, YARN, Kubernetes, Mesos), the driver will provide a UI on port 4040.

You can monitor via the UI and via a REST API.

E.g., running a job locally on your laptop, you can request something like this:

http://127.0.0.1:4040/api/v1/applications

To see the “jobs”, you can use something like this (local-1544110095543 is just the id of my spark-shell on my laptop, which I got from the request above):

http://127.0.0.1:4040/api/v1/applications/local-1544110095543/jobs
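
For illustration, a minimal sketch of hitting those two endpoints from plain Scala; the host, port, and application id here are placeholders to substitute with your own:

    // Sketch: fetch the raw JSON that the driver's REST API returns.
    import scala.io.Source

    object RestMonitorSketch {
      private def get(url: String): String = {
        val src = Source.fromURL(url)
        try src.mkString finally src.close()
      }

      def main(args: Array[String]): Unit = {
        val base = "http://127.0.0.1:4040/api/v1"        // driver UI, placeholder host/port
        println(get(s"$base/applications"))              // list applications (usually just one)
        val appId = "local-1544110095543"                // placeholder id from the call above
        println(get(s"$base/applications/$appId/jobs"))  // list that application's jobs
      }
    }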

 

If you are only looking for job completion, you can just monitor whether anything is still listening on that port: once the job completes or fails, the driver (and with it the listener) exits, which means the job is done.
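
A rough sketch of that check, assuming the default 127.0.0.1:4040 (the driver may move to 4041 etc. if the port is already taken):

    // Crude liveness probe: if nothing accepts a TCP connection on the driver
    // UI port any more, the driver has exited, i.e. the job finished or failed.
    import java.net.{InetSocketAddress, Socket}

    object DriverUiProbe {
      def driverUiUp(host: String = "127.0.0.1", port: Int = 4040): Boolean = {
        val socket = new Socket()
        try {
          socket.connect(new InetSocketAddress(host, port), 2000) // 2 second timeout
          true
        } catch {
          case _: java.io.IOException => false
        } finally socket.close()
      }

      def main(args: Array[String]): Unit =
        println(if (driverUiUp()) "driver still running" else "driver gone: job completed or failed")
    }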

 

As far as a percentage complete for the application goes, there is no such thing, because unlike MapReduce, a Spark app is not a single step/job.

Using the REST API, you can see which job/stage is running and estimate what fraction of your job is complete.

Even in the stage info, you only get the number of tasks completed vs. the total, not a percentage.
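
Tangentially, and not something mentioned above: if you can run code inside the driver (e.g. from spark-shell), the SparkStatusTracker API exposes those same task counts, so you can derive your own rough fraction for whatever stages are currently active. A minimal sketch, assuming a live SparkContext named sc:

    // Rough progress estimate: completed tasks over total tasks across the
    // currently active stages. Only an approximation -- a Spark app is a
    // sequence of jobs/stages, not a single unit of work.
    import org.apache.spark.SparkContext

    def roughProgress(sc: SparkContext): Option[Double] = {
      val tracker = sc.statusTracker
      val stages  = tracker.getActiveStageIds.flatMap(id => tracker.getStageInfo(id))
      val total   = stages.map(_.numTasks).sum
      val done    = stages.map(_.numCompletedTasks).sum
      if (total == 0) None else Some(done.toDouble / total)
    }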

 


Re: How to track batch jobs in spark ?

Gourav Sengupta
Hi Kant,

Why would you want to kill a batch job at all? It leads to half-written data on disk, and sometimes other issues. The general practice is to have exception-handling code.
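
For what it is worth, a minimal sketch of what I mean by exception handling around a batch write; the paths and the choice to delete partial output on failure are just illustrative assumptions:

    // Sketch: wrap the batch write so a failure cleans up half-written output
    // instead of leaving it behind.
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession
    import scala.util.{Failure, Success, Try}

    object SafeBatchWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("safe-batch-write").getOrCreate()
        val outputPath = "/tmp/batch-output" // placeholder path

        val result = Try {
          val df = spark.read.parquet("/tmp/batch-input") // placeholder input
          df.write.mode("overwrite").parquet(outputPath)
        }

        result match {
          case Success(_) =>
            println(s"Batch write completed: $outputPath")
          case Failure(e) =>
            // Remove the partially written directory so reruns start clean.
            val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
            fs.delete(new Path(outputPath), true)
            println(s"Batch write failed, partial output removed: ${e.getMessage}")
        }
        spark.stop()
      }
    }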

In case you are running into scenarios where the code is simply consuming too many resources and you are running the Spark job locally, I would prefer killing the entire Spark application from the command line, using the Unix kill command or stop-all.sh.

Otherwise, as Jayesh mentioned, killing from the application also makes sense.

Regards,
Gourav 

