Pyspark Partitioning


dimitris plakas
Hello everyone,

I am trying to split a dataframe into partitions and apply a custom function to every partition. More precisely, I have a dataframe like the one below:

Group_Id | Id  | Points
1        | id1 | Point1
2        | id2 | Point2

I want to have a partition for every Group_Id and to apply a function defined by me to every partition.
I have tried partitionBy('Group_Id').mapPartitions(), but I receive an error.
Could you please advise me how to do it?

Re: Pyspark Partitioning

Riccardo Ferrari
Hi Dimitris,

I believe the methods partitionBy and mapPartitions are specific to RDDs, while you're talking about DataFrames. You have a few options, including:
1. Use the DataFrame.rdd call and process the returned RDD; see the sketch after this list. Please note the return type for this call is an RDD of Row.
2. Use groupBy from DataFrames and start from there; this may involve defining a UDF or leveraging the existing GroupedData functions.
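
A minimal sketch of option 1, reusing the example dataframe from the original post; custom_fn is a hypothetical placeholder for your own per-group logic:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "id1", "Point1"), (2, "id2", "Point2")],
    ["Group_Id", "Id", "Points"],
)

def custom_fn(rows):
    # rows is an iterable of pyspark.sql.Row holding one Group_Id's rows;
    # replace this hypothetical body with your own per-group logic
    return [r.Points for r in rows]

result = (
    df.rdd                              # RDD[Row]
      .keyBy(lambda r: r["Group_Id"])   # pair RDD keyed by group
      .groupByKey()                     # gather each group's rows together
      .mapValues(custom_fn)             # apply the custom function per group
)
print(result.collect())

For option 2, on Spark 2.3+ a grouped-map pandas_udf applied via df.groupBy('Group_Id').apply(...) hands you each group as a pandas DataFrame, which is usually faster than the pure-RDD route.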

It really depends on your use-case and your performance requirements.
HTH


Re: Pyspark Partitioning

ayan guha
Hi

There is a set of functions that can be used with the construct
OVER (PARTITION BY col ORDER BY col).

Search for rank and window functions in the Spark documentation.
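
A minimal sketch of that approach, reusing the example dataframe from the original post (ordering within each group by Id is just an assumption for illustration):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "id1", "Point1"), (2, "id2", "Point2")],
    ["Group_Id", "Id", "Points"],
)

# rank() evaluated OVER (PARTITION BY Group_Id ORDER BY Id)
w = Window.partitionBy("Group_Id").orderBy("Id")
df.withColumn("rank", F.rank().over(w)).show()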

--
Best Regards,
Ayan Guha

Re: Pyspark Partitioning

Gourav Sengupta
In reply to this post by dimitris plakas
Hi,

The simplest option is to create UDFs of these different functions and then use a CASE statement (or similar) in SQL to dispatch between them. But this is low tech; in case you have conditions based on record values that are even more granular, why not use a single UDF and let the conditions be handled inside it?
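
A minimal sketch of this idea, reusing the example dataframe from the original post; process_a and process_b are hypothetical stand-ins for your own functions:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "id1", "Point1"), (2, "id2", "Point2")],
    ["Group_Id", "Id", "Points"],
)
df.createOrReplaceTempView("points")

# register one UDF per custom function (hypothetical bodies)
spark.udf.register("process_a", lambda p: p.upper(), StringType())
spark.udf.register("process_b", lambda p: p.lower(), StringType())

# dispatch between the UDFs with a CASE statement in SQL
spark.sql("""
    SELECT Group_Id, Id,
           CASE WHEN Group_Id = 1 THEN process_a(Points)
                ELSE process_b(Points)
           END AS Processed
    FROM points
""").show()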

But keep in mind that UDFs are not that performant unless you use Scala.

It will be interesting to see if there are other scalable options (which are not RDD based) from the group.

Regards,
Gourav Sengupta
