How to split a dataframe into two dataframes based on count

How to split a dataframe into two dataframes based on count

Mohit
Dear All,

I would like to know how, in Spark 2.0, I can split a dataframe into two dataframes when I know the exact counts the two dataframes should have. I tried using limit but got quite weird results. Also, I am looking for exact counts in the child dfs, not an approximate percentage-based split.

Following is what I have tried:

var dfParent = spark.read.parquet("somelocation") // let's say it has 4000 rows

I want to split the parent into two dfs with the following counts:

var dfChild1Count = 1000 

var dfChild2Count = 3000

I tried this: 

var dfChild1 = dfParent.limit(dfChild1Count);

var dfChild2 = dfParent.except(dfChild1);

and wrote that to output hdfs directories:

dfChild1.write.parquet("/outputfilechild1");

dfChild2.write.parquet("/outputfilechild2");

It turns out this results in some duplicate rows being saved across outputfilechild1 & outputfilechild2. Could anyone explain why there are duplicates?

When I sorted my parent dataframe before the limit, it worked fine:

dfParent = dfParent.orderBy(col("unique_col").desc())

It seems the limit on the parent is executed twice and returns different records each time. I am not sure why it is executed twice when I called it only once.

Also, is there a better way to split a df into multiple dfs when we know the exact counts of the child dfs?




Regards,
Mohit


Re: How to split a dataframe into two dataframes based on count

Vipul Rajan
Hi Mohit,

"It seems the limit on the parent is executed twice and returns different records each time. I am not sure why it is executed twice when I called it only once."

That is to be expected. Spark uses lazy evaluation, which means execution only happens when you call an action, so every action re-runs each step of the plan (mostly; some steps are reused automatically). You can try using this:

var dfChild1 = dfParent.limit(dfChild1Count).cache()
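Note that cache() is itself lazy: nothing is stored until an action runs. A minimal self-contained sketch of the idea, using a small in-memory dataset and a local session in place of the real parquet source (object and column names are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

object CacheBeforeExcept {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("cache-before-except").getOrCreate()
    import spark.implicits._

    // Stand-in for the 4000-row parquet source.
    val dfParent = (1 to 10).toDF("unique_col")

    val dfChild1 = dfParent.limit(3).cache()
    dfChild1.count() // action: materializes the cached rows

    // except() now compares against the cached rows rather than
    // re-evaluating limit() and possibly picking different rows.
    val dfChild2 = dfParent.except(dfChild1)
    println(dfChild1.count() + dfChild2.count()) // 10
    spark.stop()
  }
}
```

With the cache materialized, both writes and the except() read the same three rows, so the children cannot overlap.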

When you sort your dataframe this problem does not occur because sorting causes a data shuffle, and Spark automatically reuses shuffle output across actions, so the sorted rows come back the same each time.

Let me know if you get it to work.
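As for a better way to split by exact counts: one deterministic approach is to assign each row a stable row number over an explicit ordering and filter on it, so both children are derived from the same numbering. A sketch under the same assumptions as above (local session, in-memory stand-in data, a unique ordering column; names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

object SplitByCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("split-by-count").getOrCreate()
    import spark.implicits._

    // Stand-in for the parquet source; split 10 rows into 3 + 7.
    val dfParent = (1 to 10).toDF("unique_col")
    val child1Count = 3

    // A global Window.orderBy funnels data through one partition,
    // much like the sort-based workaround, so it suits modest sizes.
    val w = Window.orderBy(col("unique_col"))
    val numbered = dfParent.withColumn("rn", row_number().over(w))

    val dfChild1 = numbered.filter(col("rn") <= child1Count).drop("rn")
    val dfChild2 = numbered.filter(col("rn") > child1Count).drop("rn")

    println(dfChild1.count()) // 3
    println(dfChild2.count()) // 7
    spark.stop()
  }
}
```

Because every row gets exactly one row number, no row can land in both outputs, no matter how many times each child is evaluated.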

Regards

On Mon, May 18, 2020 at 10:27 PM Mohit Durgapal <[hidden email]> wrote: