DataFrame --- join / groupBy-agg question...


DataFrame --- join / groupBy-agg question...

muthu
I may have a naive question on join / groupBy-agg. In the RDD days, whenever I wanted to perform:
a. a groupBy-agg, I used reduceByKey (of PairRDDFunctions) with an optional partitioning strategy (which is a number of partitions or a Partitioner);
b. a join (of PairRDDFunctions) or one of its variants, I had a way to provide the number of partitions.

With DataFrames, how do I specify the number of partitions during these operations? I could use repartition() after the fact, but that would add another stage to the job.

One workaround to increase the number of partitions/tasks during a join is to set 'spark.sql.shuffle.partitions' to some desired number at spark-submit time. Is there a way to provide this programmatically for every groupBy-agg / join step?

The reason to do it programmatically is that, depending on the size of the DataFrame, I can use a larger or smaller number of tasks and avoid an OutOfMemoryError.
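
For concreteness, roughly what I am after (a sketch only; bigDf, smallA and smallB stand in for real DataFrames, and the partition counts are made up). Since 'spark.sql.shuffle.partitions' is a runtime conf, spark.conf.set should be able to change it between queries:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder().appName("per-step-partitions").getOrCreate()

    // DataFrames are lazy, so the value in effect when the action actually
    // runs is the one that counts for that shuffle.
    spark.conf.set("spark.sql.shuffle.partitions", "2000")
    bigDf.groupBy("key").agg(sum("value")).write.parquet("/tmp/big-agg")   // 2000 shuffle tasks

    spark.conf.set("spark.sql.shuffle.partitions", "50")
    smallA.join(smallB, "key").write.parquet("/tmp/small-join")            // 50 shuffle tasks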

Re: DataFrame --- join / groupBy-agg question...

qihuagao
I'm also interested in this.
Does the partition count of a DataFrame depend on the groupBy fields?
Also, is the performance of groupBy-agg comparable to reduceByKey/aggregateByKey?

Re: DataFrame --- join / groupBy-agg question...

muthu
Hello there,

Thank you for looking into the question.

> Does the partition count of a DataFrame depend on the groupBy fields?
Either an absolute partition number or a partition count determined by column value would be fine for me (which is similar to repartition(), I suppose).
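
For example, both forms already exist on DataFrame (a sketch; df and the "key" column are placeholders):

    import org.apache.spark.sql.functions.col

    // Absolute partition number:
    val byNumber = df.repartition(400)

    // Partition count plus hash-partitioning by column value
    // (analogous to passing a Partitioner to reduceByKey):
    val byColumn = df.repartition(400, col("key"))

The catch, as noted above, is that repartition() is its own shuffle stage rather than a hint applied to the groupBy/join shuffle itself.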

> Also, is the performance of groupBy-agg comparable to reduceByKey/aggregateByKey?
In theory the DF/DS APIs should be at least comparable, as they optimize the execution order and so on by building an effective query plan.
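
In particular, a DataFrame groupBy-agg performs a partial (map-side) aggregation before the shuffle, much like reduceByKey's combiner. One can check with explain() (a sketch; the plan below is schematic and the exact output varies by Spark version):

    df.groupBy("key").agg(sum("value")).explain()
    // == Physical Plan == (schematic)
    // HashAggregate(keys=[key], functions=[sum(value)])               <- final aggregation
    // +- Exchange hashpartitioning(key, 200)                          <- shuffle; width = spark.sql.shuffle.partitions
    //    +- HashAggregate(keys=[key], functions=[partial_sum(value)]) <- map-side partial aggregation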

Currently my hack is to spin up a new spark-submit per query request, setting 'spark.sql.shuffle.partitions' each time. Ideally we would have a long-running application that uses the same SparkSession and runs one or more queries in FAIR scheduling mode.
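
A sketch of what that long-running setup might look like (assuming the app is launched with --conf spark.scheduler.mode=FAIR; the pool name, path, and column names are placeholders). SparkSession.newSession() shares the SparkContext but keeps an isolated SQL conf, so concurrent queries should not clobber each other's shuffle-partition setting:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    def runQuery(parent: SparkSession, path: String, numPartitions: Int): Unit = {
      // Child session: shared SparkContext, isolated SQL configuration.
      val session = parent.newSession()
      session.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)

      // FAIR pools are selected per thread via a local property.
      session.sparkContext.setLocalProperty("spark.scheduler.pool", "adhoc")

      session.read.parquet(path)
        .groupBy("key").agg(sum("value"))
        .show()
    }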

Thanks,
Muthu


