SchemaRDD partition on specific column values?

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

SchemaRDD partition on specific column values?

nitin
Hi All,

I want to hash partition (and then cache) a schema RDD in way that partitions are based on hash of the values of a  column ("ID" column in my case).

e.g. if my table has "ID" column with values as 1,2,3,4,5,6,7,8,9 and spark.sql.shuffle.partitions is configured as 3, then there should be 3 partitions and say for ID=1, all the tuples should be present in one particular partition.

My actual use case is that I always get a query in which I have to join 2 cached tables on ID column, so it first partitions both tables on ID and then apply JOIN and I want to avoid the partitioning based on ID by preprocessing it (and then cache it).

Thanks in Advance
Reply | Threaded
Open this post in threaded view
|

Re: SchemaRDD partition on specific column values?

nitin
This post was updated on .
With some quick googling, I learnt that I can provide "distribute by <coulmn_name>" in hive ql to distribute data based on a column values. My question now if I use "distribute by id", will there be any performance improvements? Will I be able to avoid data movement in shuffle(Excahnge before JOIN step) and improve overall performance?
Reply | Threaded
Open this post in threaded view
|

Re: SchemaRDD partition on specific column values?

Michael Armbrust
It does not appear that the in-memory caching currently preserves the information about the partitioning of the data so this optimization will probably not work.

On Thu, Dec 4, 2014 at 8:42 PM, nitin <[hidden email]> wrote:
With some quick googling, I learnt that I can we can provide "distribute by
<coulmn_name>" in hive ql to distribute data based on a column values. My
question now if I use "distribute by id", will there be any performance
improvements? Will I be able to avoid data movement in shuffle(Excahnge
before JOIN step) and improve overall performance?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20424.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: SchemaRDD partition on specific column values?

nitin
This post has NOT been accepted by the mailing list yet.
Can we take this as a performance improvement task in Spark-1.2.1? I can help contribute for this.