[Pyspark 2.4] not able to partition the data frame by dates


[Pyspark 2.4] not able to partition the data frame by dates

rishishah.star
Hi All,

I have a dataframe of size 2.7 TB (parquet) which I need to partition by date, however the Spark program below doesn't help - it keeps failing with a "file already exists" exception.

df = spark.read.parquet(INPUT_PATH)
df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)

I did notice that a couple of tasks failed, and that's probably why Spark spun up new attempts which write to the same .staging directory?
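For reference, here is a minimal sketch of the same write with speculative execution kept off and dynamic partition overwrite enabled - both are assumptions about what might be colliding on the staging files, not a confirmed fix, and the paths are placeholders:

from pyspark.sql import SparkSession

# Placeholder paths for illustration only.
INPUT_PATH = "/data/input"
PATH = "/data/output_partitioned"

spark = (
    SparkSession.builder
    .appName("partition-by-date")
    # Speculative duplicates of slow tasks can collide on the same
    # staging files, so keep speculation off (it is off by default).
    .config("spark.speculation", "false")
    # Overwrite only the date partitions present in the incoming data
    # instead of the whole output directory (available since Spark 2.3).
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

df = spark.read.parquet(INPUT_PATH)
df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)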

--
Regards,

Rishi Shah

Re: [Pyspark 2.4] not able to partition the data frame by dates

Gourav Sengupta
Hi Rishi,

there is no such version as 2.4 :) - can you please specify the exact Spark version you are using? How are you starting the Spark session? And what is the environment?

I have seen this issue occur intermittently during large writes to S3; it is related to S3's eventual consistency. Just restarting the job sometimes helps.


Regards,
Gourav Sengupta


Re: [Pyspark 2.4] not able to partition the data frame by dates

rishishah.star
Thanks for your prompt reply, Gourav. I am using Spark 2.4.0 (Cloudera distribution). The job consistently threw this error, so I narrowed down the dataset by adding a date filter (date range: 2018-01-01 to 2018-06-30)... however it's still throwing the same error!

command: spark2-submit --master yarn --deploy-mode client --executor-memory 15G --executor-cores 5 samplerestage.py
cluster: 4 nodes, 32 cores and 256GB RAM each

This is the only job running, with 20 executors...

I would really like to know the best practice around creating a partitioned table using PySpark - every time I need to partition a huge dataset, I run into such issues. Appreciate your help!
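For illustration, one commonly suggested pattern for this kind of partitioned write is to add a small random salt so a single date doesn't funnel into one huge task. This is only a sketch - the salt column and the FILES_PER_DATE value are assumptions, not something tested on this dataset:

from pyspark.sql import functions as F

# Assumed tuning knob: roughly how many tasks (and files) each date is spread across.
FILES_PER_DATE = 8

df = spark.read.parquet(INPUT_PATH)

(
    df.withColumn("salt", (F.rand() * FILES_PER_DATE).cast("int"))
      # Shuffle by (date, salt) so one busy date is split across several
      # tasks instead of landing in a single enormous one.
      .repartition("date_field", "salt")
      .drop("salt")
      .write
      .partitionBy("date_field")
      .mode("overwrite")
      .parquet(PATH)
)

The salt only changes how the shuffle spreads the rows; partitionBy('date_field') still produces one directory per date on disk.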



--
Regards,

Rishi Shah