Pyspark Partitioning


Pyspark Partitioning

dimitris plakas
Hello everyone,

Here is an issue that I am facing when partitioning a dataframe.

I have a dataframe called data_df. It looks like:

Group_Id | Object_Id | Trajectory
       1 | obj1      | Traj1
       2 | obj2      | Traj2
       1 | obj3      | Traj3
       3 | obj4      | Traj4
       2 | obj5      | Traj5

This dataframe has 5045 rows; each row has a Group_Id value from 1 to 7, and the number of rows per Group_Id is arbitrary.
I want to split the RDD produced from this dataframe into 7 partitions, one for each Group_Id, and then apply mapPartitions(), where I call a function custom_func(). How can I create these partitions from this dataframe? Should I first apply a group by (creating grouped_df) in order to get a dataframe with 7 rows, and then call partitioned_rdd = grouped_df.rdd.mapPartitions()?
What is the optimal way to do it?

Thank you in advance

Re: Pyspark Partitioning

Vitaliy Pisarev
groupBy is an operator you would use if you wanted to *aggregate* the values grouped by the specified key.

In your case you want to retain access to the values.
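For instance (a minimal illustration, using the data_df from your mail), groupBy collapses each group into aggregated values, and the individual rows are gone afterwards:

# Only the aggregates remain; the Trajectory values of each
# group are no longer accessible from this result.
counts = data_df.groupBy("Group_Id").count()
counts.show()  # 7 rows: Group_Id, count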

You need to partition by the key, e.g. with rdd.partitionBy, and then you can map the partitions. Of course, you need to be careful of potential skew in the resulting partitions.
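A minimal, untested sketch of that approach, assuming Group_Id is an integer from 1 to 7 as in your example (data_df and custom_func are the names from your mail):

# Assuming data_df (columns Group_Id, Object_Id, Trajectory)
# already exists.
# Key each row by Group_Id, then partition with an explicit
# partition function so each of the 7 group ids lands in its
# own partition.
keyed_rdd = data_df.rdd.map(lambda row: (row["Group_Id"], row))
partitioned_rdd = keyed_rdd.partitionBy(7, lambda group_id: group_id - 1)

def custom_func(rows):
    # rows is an iterator over the (Group_Id, Row) pairs of one
    # partition, i.e. all trajectories of a single group
    for group_id, row in rows:
        yield (group_id, row["Trajectory"])

result = partitioned_rdd.mapPartitions(custom_func)

The explicit partition function matters here: the default hash partitioner could put two of the seven group ids into the same partition, and it is also where the skew shows up if one group is much larger than the others.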
