How to Join already partitioned data produced by Map Reduce

rajat
Hi,

I have 3 MR jobs that produce hash-partitioned data:

Job 1: 10,000 part files (4 TB)
Job 2: 200 part files (500 GB)
Job 3: 200 part files (500 GB)


I want to join the outputs of all 3 jobs. I know they are all produced by MapReduce and the data is already hash-partitioned on user ID (by the reducers):


Job 1: part-r-00000 to part-r-09999
Job 2: part-r-00000 to part-r-00199
Job 3: part-r-00000 to part-r-00199
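The property being relied on here can be sketched in plain Python. The hash function below is a made-up stand-in for illustration only; Hadoop's default HashPartitioner actually computes `(key.hashCode() & Integer.MAX_VALUE) % numPartitions`:

```python
# Sketch of the invariant a hash partitioner provides: every record
# with the same user ID lands in the same reducer, hence the same
# part-r-NNNNN file. The hash below is a stand-in, not Hadoop's.
def bucket(user_id: str, num_partitions: int) -> int:
    # Polynomial hash (Java-hashCode style) so the demo is deterministic.
    h = sum(ord(c) * 31 ** i for i, c in enumerate(reversed(user_id)))
    return h % num_partitions

# Same key, same bucket -- records for one user never span two files.
assert bucket("user-42", 200) == bucket("user-42", 200)
```

So within each job, one user's records live entirely in one part file; the question is how to exploit that during the join.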

Naive solution: load all the MR outputs into RDDs and join them. But that performs a shuffle join (hash shuffle join), which completely shuffles the data before the join starts.

Hive solution: load the data into Hive tables and do a bucket join. But since I am using Spark, I would then have to depend on MR again (because I would be going through Hive).

A bucket join in Hive joins the data partition by partition:

job1: part-r-00000, part-r-00200, part-r-00400, ...
job2: part-r-00000
job3: part-r-00000

will get joined together, and likewise the other buckets will get joined.
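This grouping is sound because 10,000 is a multiple of 200: for any hash value h, (h % 10000) % 200 == h % 200, so bucket i of the 200-bucket jobs can only share user IDs with buckets i, i+200, i+400, ... of the 10,000-bucket job. A quick check:

```python
# Bucket alignment between a 10,000-bucket job and a 200-bucket job.
def job1_buckets_matching(i, big=10_000, small=200):
    # All job1 buckets whose keys can collide with job2/job3 bucket i.
    return [b for b in range(big) if b % small == i]

assert job1_buckets_matching(0)[:3] == [0, 200, 400]
assert len(job1_buckets_matching(0)) == 50   # 10,000 / 200 files per group

# Core identity: since 200 divides 10,000, reducing mod 10,000 first
# never changes the result mod 200.
for h in range(100_000):
    assert (h % 10_000) % 200 == h % 200
```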

Spark solution: how can I do the above in Spark?

a) One solution is to start 200 different stages, where each stage's input will be:

job1: part-r-00000, part-r-00200, part-r-00400, ...
job2: part-r-00000
job3: part-r-00000

Likewise I would start 200 parallel stages and union the outputs at the end.
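Building the per-stage input file lists is mechanical. A minimal sketch, assuming hypothetical `job1/`, `job2/`, `job3/` output directories (each stage would then read its three path lists, join on user ID, and the 200 results would be unioned):

```python
# Hypothetical helper: input paths for bucket-join stage i.
# Directory names are made up for illustration.
def stage_inputs(i, job1_buckets=10_000, small_buckets=200):
    # Job 1 has 50x as many buckets, so stage i reads every
    # 200th part file starting at i.
    job1 = [f"job1/part-r-{b:05d}" for b in range(i, job1_buckets, small_buckets)]
    job2 = [f"job2/part-r-{i:05d}"]
    job3 = [f"job3/part-r-{i:05d}"]
    return job1, job2, job3

j1, j2, j3 = stage_inputs(0)
assert j1[:2] == ["job1/part-r-00000", "job1/part-r-00200"]
assert j2 == ["job2/part-r-00000"]
```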


Is there a better solution in Spark SQL so that we can do a bucket join the way we do in Hive?

Thanks