Spark application takes significant time to succeed even after all jobs are completed


Akshay Mendole
Hi, 
      As you can see in the picture below, the application's last job finished at around 13:45, and I could see the output directory updated with the results. Yet the application took about 20 more minutes to change its status. What could be the reason for this? Is this known behavior? The application has 3 jobs with many stages, each having around 10K tasks. Could the scale be the reason for this? What exactly is the Spark framework doing during this time?

[Attachment: Screen Shot 2018-12-25 at 5.14.26 PM.png]

Thanks,
Akshay


Re: Spark application takes significant time to succeed even after all jobs are completed

Jörn Franke
Do you have a lot of small files? Do you use S3 or similar? It could be that Spark is doing some IO-related tasks.




Re: Spark application takes significant time to succeed even after all jobs are completed

Akshay Mendole
Yes, we have a lot of small files (10K files of around 100 MB each) that we read from and write to HDFS. But the timeline shows the jobs completed quite some time ago, and the output directory was also updated at that time.
Thanks,
Akshay



Re: Spark application takes significant time to succeed even after all jobs are completed

Jörn Franke
It could be that Spark checks each output file after the job, and with 10,000 files on HDFS that can take some time. I think this is also format-specific (e.g., for Parquet it does some extra checks) and does not occur with all formats. This time is not really highlighted in the UI (it may be worth raising an enhancement issue).
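As a rough illustration of why per-file work at job completion matters at this scale, here is a back-of-the-envelope estimate. The file count comes from the thread; the per-file latency is a purely hypothetical figure chosen for illustration, not something measured from this job:

```python
# Rough estimate of time spent if the framework touches each output file
# sequentially after the last job (renames, checks, etc.).
# num_files comes from the thread; rename_ms is an assumed, hypothetical
# per-file metadata-operation latency on a loaded HDFS NameNode.

num_files = 10_000       # output files mentioned in the thread
rename_ms = 120          # assumed per-file latency in ms (hypothetical)

total_minutes = num_files * rename_ms / 1000 / 60
print(f"~{total_minutes:.0f} minutes of sequential per-file work")  # ~20 minutes
```

Under these assumed numbers, sequential per-file work alone accounts for a gap on the order of the 20 minutes observed; the real latency per file could of course be quite different.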

It could also be that you have stragglers (skewed partitions) somewhere, but I assume you checked that already.

The only thing you can do is produce fewer files (for the final output, but also in between) or live with it. There are some other tuning methods as well (a different output committer, etc.), but those would require more in-depth knowledge of your application.
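One commonly mentioned knob along these lines is the Hadoop FileOutputCommitter algorithm version: version 2 moves each task's files into the destination directory at task commit instead of renaming everything serially at job commit. This is only a sketch of how it could be passed at launch (the jar name is a placeholder); whether it is safe and effective depends on the job, the output format, and the storage system:

```shell
# Hypothetical launch sketch: use FileOutputCommitter algorithm version 2
# so task output is committed per task rather than serially at job commit.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your_application.jar
```

Reducing the number of output partitions before the final write (e.g., with coalesce) attacks the same cost from the other side, since fewer files means fewer per-file operations at commit time.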
