[pyspark 2.4.3] small input csv ~3.4GB gets 40K tasks created


rishishah.star
Hi All,

I am scratching my head over some weird behavior: a DataFrame (read from CSV) of size ~3.4GB, when cross joined with itself, creates 50K tasks! How do I correlate input size with the number of tasks in this case?
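For reference, this is roughly the shape of the job (a minimal sketch; the file path and session name are placeholders, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-count-check").getOrCreate()

# ~3.4GB of CSV; with the default spark.sql.files.maxPartitionBytes of
# 128MB this scans as roughly 27 input splits/partitions.
df = spark.read.csv("/data/input.csv", header=True)
print("input partitions:", df.rdd.getNumPartitions())

# Self cross join; the partition count of the result is what drives
# the task count of the join stage.
joined = df.crossJoin(df)
print("cross-join partitions:", joined.rdd.getNumPartitions())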

--
Regards,

Rishi Shah

Re: [pyspark 2.4.3] small input csv ~3.4GB gets 40K tasks created

Chris Teoh
Look at your DAG. Are there lots of CSV files? Does your input CSV DataFrame have lots of partitions to start with? Bear in mind that a cross join makes the dataset much larger, so expect more tasks.
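It may also help to check the partition math directly. Spark's cartesian product produces left-partitions × right-partitions output partitions, so if each side of the cross join has been shuffled to the default spark.sql.shuffle.partitions of 200, the join stage gets 200 × 200 = 40,000 tasks, which would line up with the 40K in the subject line. A minimal sketch (path and names are placeholders, not from your job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-join-partitions").getOrCreate()
df = spark.read.csv("/data/input.csv", header=True)

print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" unless overridden

# Force 200 partitions on each side to mimic post-shuffle inputs; the
# cartesian product's partition count is the product of its inputs'.
left = df.repartition(200)
right = df.repartition(200)
print(left.crossJoin(right).rdd.getNumPartitions())  # 200 * 200 = 40000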


--
Chris