Tuning spark job to make count faster.

Krishna Chakka
Hi,

I am working on a Spark job. It takes 10 minutes just for the count() function. My question is: how can I make it faster?

From the above image, what I understood is that 4,001 tasks are running in parallel, out of 76,553 total tasks.

Here are the parameters that I am using for the job (see the sketch after the list):
    - master machine type - e2-standard-16
    - worker machine type - e2-standard-8 (8 vcpus, 32 GB memory)
    - number of workers - 400 
    - spark.executor.cores - 4
    - spark.executor.memory - 11g
    - spark.sql.shuffle.partitions - 10000
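
A minimal sketch of how settings like these might be applied when building a PySpark session; the app name is a placeholder, not from the original post:

    from pyspark.sql import SparkSession

    # Sketch only: mirrors the settings listed above.
    # "count-job" is a placeholder app name.
    spark = (
        SparkSession.builder
        .appName("count-job")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "11g")
        .config("spark.sql.shuffle.partitions", "10000")
        .getOrCreate()
    )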


Please advise on how I can make this faster.

Thanks

Re: Tuning spark job to make count faster.

srowen

Hard to say without a lot more info, but 76.5K tasks is very large. How big are the tasks, and how long do they take? If they are very short, you should repartition down.
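
For example, a minimal sketch, assuming the data is already loaded as a DataFrame df; coalesce() merges partitions without a full shuffle, so it is the cheaper way to cut the task count (the target of 4,000 is illustrative, not a recommendation):

    # Sketch, assuming an existing DataFrame `df` with ~76K partitions.
    # coalesce() reduces the partition count without shuffling;
    # repartition() would do the same but with a full shuffle.
    df_fewer = df.coalesce(4000)  # 4000 is an illustrative target
    print(df_fewer.count())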
Do you end up with 800 executors? If so, why two per machine? That is generally a loss at this scale of worker. I'm also confused because you have 4,000 tasks running, which would be just 10 per executor.
What is the data input format? It's far faster to 'count' Parquet, as it's just a metadata read.
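
As an illustration, a sketch of counting a Parquet source directly (the gs:// path is a placeholder, not from the original post):

    # Sketch: Spark can often answer count() on Parquet from
    # row-group metadata instead of scanning the data.
    # "gs://bucket/path" is a placeholder path.
    df = spark.read.parquet("gs://bucket/path")
    print(df.count())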
Is anything else happening besides count() after the data is read?

