Performance issue to handle too many aggregations with group by


KannanSpark
This post has NOT been accepted by the mailing list yet.
Hi Team,

My requirement: I read a table from Hive (around 1.6 TB) and perform more than 200 aggregation operations, mostly avg, sum, and stddev. The Spark application's total execution time is more than 12 hours. To optimize the job I have tried shuffle partition tuning, memory tuning, and so on, but it has not helped.
Note that when I run the same query in Hive on MapReduce, the MR job completes in only about 5 hours. Please let me know whether there is any other way to optimize this, or a more efficient way of handling many aggregation operations.
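For reference, the shuffle-partition tuning mentioned above can be set on the HiveContext itself; a minimal sketch (the value 2000 is purely illustrative, not a recommendation, and the right number depends on cluster cores and per-partition data size):

```scala
// Illustrative only: the value is an assumption, not a recommendation.
// Sets the number of partitions used for the post-groupBy shuffle,
// roughly sized as (shuffled data size / target partition size).
hiveContext.setConf("spark.sql.shuffle.partitions", "2000")
```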
val inputDataDF = hiveContext.read.parquet("/inputparquetData")
val finalDF = inputDataDF
  .groupBy("seq_no", "year", "month", "radius")
  .agg(count($"Dseq"), avg($"Emp"), avg($"Ntw"), avg($"Age"),
       avg($"DAll"), avg($"PAll"), avg($"DSum"), avg($"dol"),
       sum($"sl"), sum($"PA"), sum($"DS")) // ... like 200 columns
finalDF.registerTempTable("tempResult")
hiveContext.sql("create table OutputData as select * from tempResult")
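One way to keep 200-odd aggregations manageable (and to make sure they all run in a single groupBy pass rather than being written out by hand) is to generate the aggregation expressions from lists of column names and submit them as one SQL statement. The sketch below assumes hypothetical helper `buildAggSql` and example column lists; the real lists would come from the actual schema:

```scala
// Sketch: build the SELECT ... GROUP BY statement programmatically from
// lists of columns to average and to sum, instead of hand-writing ~200
// aggregate expressions. All names here are illustrative assumptions.
def buildAggSql(table: String,
                groupCols: Seq[String],
                avgCols: Seq[String],
                sumCols: Seq[String]): String = {
  // One aliased aggregate expression per input column.
  val aggs =
    avgCols.map(c => s"avg($c) AS avg_$c") ++
    sumCols.map(c => s"sum($c) AS sum_$c")
  s"SELECT ${(groupCols ++ aggs).mkString(", ")} FROM $table " +
    s"GROUP BY ${groupCols.mkString(", ")}"
}
```

Usage against the post's data might then look like registering the parquet DataFrame as a temp table and running, e.g., `hiveContext.sql(buildAggSql("inputData", Seq("seq_no", "year", "month", "radius"), Seq("Emp", "Ntw", "Age"), Seq("sl", "PA", "DS")))`. Because everything is one statement, Catalyst plans a single shuffle for all the aggregates, which is the same single-pass behavior as one big `.agg(...)` call.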

Thanks,
Kannan
