Convert each partition of RDD to Dataframe

Convert each partition of RDD to Dataframe

Manjunath Shetty H

Hello All,

In Spark I am creating custom partitions with a custom RDD, and each partition has a different schema. In the transformation step I need to get the schema and run some DataFrame SQL queries per partition, because each partition's data has a different schema.

How can I get a DataFrame per partition of an RDD?

As of now I am doing foreachPartition on the RDD, converting the Iterable<Row> to a List, and converting that to a DataFrame. The problem is that converting the Iterable to a List brings all of the partition's data into memory, which might crash the process.

Is there any known way to do this? Or is there any way to handle custom partitions with DataFrames instead of using an RDD?

I am using Spark version 1.6.2.

Any pointers would be helpful. Thanks in advance.
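
For reference, the pattern I am describing looks roughly like this when spelled out on the driver (a sketch, not my actual code; schemaForPartition stands in for the per-partition schema lookup):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Sketch only (Spark 1.6 APIs): pull one partition at a time into driver memory
// and build a DataFrame from the materialised rows. The whole partition has to
// fit in memory before createDataFrame can run, which is where it can crash.
def dataFramePerPartition(sqlContext: SQLContext,
                          rdd: RDD[Row],
                          schemaForPartition: Int => StructType): Unit = {
  val sc = sqlContext.sparkContext
  (0 until rdd.partitions.length).foreach { i =>
    // collect just partition i into driver memory
    val localRows = sc.runJob(rdd, (it: Iterator[Row]) => it.toArray, Seq(i)).head
    val df = sqlContext.createDataFrame(sc.parallelize(localRows), schemaForPartition(i))
    // ... per-partition DataFrame SQL and persistence would go here ...
  }
}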



Re: Convert each partition of RDD to Dataframe

Charles vinodh
Just split the single RDD into multiple individual RDDs using a filter operation, and then convert each individual RDD to its respective DataFrame.
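
Something along these lines (a rough sketch on the Spark 1.6 APIs, assuming each Row can be mapped back to the table/shard it came from, via the hypothetical tableIdOf function below, and that the schema for every table is known up front):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Sketch: split one mixed RDD into per-table DataFrames via filter.
def splitIntoDataFrames(sqlContext: SQLContext,
                        rdd: RDD[Row],
                        schemas: Map[String, StructType],
                        tableIdOf: Row => String): Map[String, DataFrame] = {
  rdd.cache()  // the source RDD gets scanned once per table, so cache it
  schemas.map { case (tableId, schema) =>
    val tableRows = rdd.filter(row => tableIdOf(row) == tableId)
    tableId -> sqlContext.createDataFrame(tableRows, schema)
  }
}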




Re: Convert each partition of RDD to Dataframe

Manjunath Shetty H
Hi Vinodh,

Thanks for the quick response. I didn't quite get what you meant; any reference or snippet would be helpful.

To explain the problem further:
  • I have 10 partitions, and each partition loads data from a different table and a different SQL shard.
  • Most of the partitions have different schemas.
  • Before persisting the data I want to do some column-level manipulation using a DataFrame.
That is why I want to create 10 DataFrames (based on the partitions), mapping to the 10 different tables/shards, from a single RDD.

Regards
Manjunath


Re: Convert each partition of RDD to Dataframe

Enrico Minack
Hi Manjunath,

Why not create the 10 DataFrames by loading the different tables directly in the first place?

Enrico
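
A minimal sketch of that idea with the Spark 1.6 JDBC reader (the url/table pairs are placeholders for your actual shard layout):

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SQLContext}

// Sketch: read each table from its shard directly as a DataFrame, so the schema
// comes from the source and no RDD-level conversion is needed.
def loadTables(sqlContext: SQLContext,
               sources: Seq[(String, String)],   // (jdbcUrl, tableName) placeholders
               props: Properties): Seq[DataFrame] =
  sources.map { case (url, table) =>
    // column-level manipulation per table can follow before persisting
    sqlContext.read.jdbc(url, table, props)
  }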



Re: Convert each partition of RDD to Dataframe

Manjunath Shetty H
Hi Enrico,

In that case, how do I make effective use of all the nodes in the cluster?

Also, what is your opinion on the two options below?
  • Create 10 DataFrames sequentially in the driver program and transform/write them to HDFS one after the other.
  • Or keep the current approach mentioned in the previous mail.
What would the performance implications be?

Regards
Manjunath



Re: Convert each partition of RDD to Dataframe

prosp4300
There looks to be no obvious relationship between the partitions or tables, so maybe try running them as separate jobs; that way they can run at the same time and make full use of the cluster resources.



Re: Convert each partition of RDD to Dataframe

Enrico Minack
Manjunath,

You can define your DataFrames in parallel in a multi-threaded driver.

Enrico
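
Roughly like this, using plain Scala futures (a sketch; the JDBC sources and output paths are placeholders):

import java.util.Properties
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SQLContext

// Sketch: each future defines one DataFrame and runs its own Spark job from a
// separate driver thread, so the cluster is not idle while one table is written.
def writeTablesInParallel(sqlContext: SQLContext,
                          sources: Seq[(String, String, String)],  // (jdbcUrl, table, outputPath)
                          props: Properties): Unit = {
  val jobs = sources.map { case (url, table, outputPath) =>
    Future {
      val df = sqlContext.read.jdbc(url, table, props)
      // ... per-table column manipulation here ...
      df.write.parquet(outputPath)
    }
  }
  Await.result(Future.sequence(jobs), Duration.Inf)
}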


Re: Convert each partition of RDD to Dataframe

Manjunath Shetty H
Hi Enrico,

Thanks for the suggestion. I wanted to know whether there are any performance implications of running a multi-threaded driver. If I create multiple DataFrames in parallel, will Spark schedule those jobs in parallel?

Thanks
Manjunath
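
For reference, jobs submitted from separate driver threads can run concurrently within a single Spark application; by default they are scheduled FIFO, and the scheduler mode can be switched to FAIR so concurrent jobs share executors more evenly. A minimal configuration sketch (the app name is a placeholder):

import org.apache.spark.SparkConf

// Sketch: enable the fair scheduler so jobs submitted from different driver
// threads share executor resources instead of queueing strictly FIFO.
val conf = new SparkConf()
  .setAppName("multi-table-ingest")          // placeholder
  .set("spark.scheduler.mode", "FAIR")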
