Efficient hierarchical aggregation in Spark



deenar.toraskar


Hi

 

I need to aggregate a large data set in Spark across a multi-level (25-level) hierarchy. The simplified data model is as follows:

 

Measures

leafNode      Long
measureType   String
measureValue  Array[Float]

 

Hierarchy (expanded): a typical organisation hierarchy.

leafNode     Long  /* Account, also called level0Node */
level1Node   Long  /* Portfolio */
level2Node   Long  /* Sub Fund */
level3Node   Long  /* Fund */
level4Node   Long
…
…
level25Node  Long  /* Organisation */

 

Alternative representation:

node
parentNode
hierarchylevel
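
To illustrate how the two representations relate, here is a minimal sketch in plain Scala (no Spark dependency) that turns node/parentNode pairs into the expanded form, one ancestor chain per leaf. The names `AncestorPaths`, `pathToRoot` and `parentOf` are hypothetical. It assumes the hierarchy is a tree with no cycles and that every leaf sits at the same depth, so the position in the chain equals the level.

```scala
object AncestorPaths {
  // parentOf maps each node to its parent; the root has no entry (assumption).
  // Returns leaf, parent, grandparent, ... so index i holds the level-i node.
  def pathToRoot(leaf: Long, parentOf: Map[Long, Long]): Vector[Long] = {
    @annotation.tailrec
    def loop(node: Long, acc: Vector[Long]): Vector[Long] =
      parentOf.get(node) match {
        case Some(parent) => loop(parent, acc :+ parent) // climb one level
        case None         => acc                         // reached the root
      }
    loop(leaf, Vector(leaf))
  }
}
```

With a hierarchy that is small relative to the measures, a map built this way could be computed once on the driver and shared with the executors, for example via `sc.broadcast`.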

 

Output format

Level         Int   /* 0-25 */
Node          Long
measureType   String
measureValue  Array[Float]

 

I can do the aggregation by joining the two RDDs and aggregating one level at a time. Is there a more efficient way of doing this in Spark, perhaps a recursive algorithm that traverses the tree?
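
For comparison with the level-by-level approach, here is a sketch of one possible single-pass alternative, written with plain Scala collections so it runs without Spark. Each measure row is expanded into one record per ancestor, keyed by (level, node, measureType), and the records are then reduced by key. The names `HierarchyRollup`, `Measure`, `rollUp` and `addArrays` are hypothetical, and the aggregation is assumed to be an element-wise sum over arrays of equal length, which the data model above does not state. On RDDs the same shape would be a `flatMap` followed by `reduceByKey`.

```scala
object HierarchyRollup {
  final case class Measure(leafNode: Long, measureType: String, measureValue: Array[Float])

  // Element-wise sum; assumes both arrays have the same length.
  def addArrays(a: Array[Float], b: Array[Float]): Array[Float] =
    a.zip(b).map { case (x, y) => x + y }

  // ancestors(leaf)(i) is the level-i node for that leaf; index 0 is the leaf itself.
  def rollUp(
      measures: Seq[Measure],
      ancestors: Map[Long, Vector[Long]]
  ): Map[(Int, Long, String), Array[Float]] = {
    // Emit one record per (level, node) on the path from the leaf to the root.
    val expanded = measures.flatMap { m =>
      // A leaf missing from the hierarchy is kept at level 0 only (assumption).
      ancestors.getOrElse(m.leafNode, Vector(m.leafNode)).zipWithIndex.map {
        case (node, level) => ((level, node, m.measureType), m.measureValue)
      }
    }
    // Combine all records that share a key.
    expanded.groupBy(_._1).map { case (key, rows) =>
      key -> rows.map(_._2).reduce(addArrays)
    }
  }
}
```

The trade-off is that every input row becomes up to 26 records before the reduce, so whether this beats cascading the aggregates from one level up to the next would depend on how much map-side combining reduces the data, and would need measuring.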

Currently the measures data set is loaded in batch, but I am working on receiving incremental feeds of measures using Spark Streaming.
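
For the incremental case, a sketch of how a batch of changes might be folded into existing totals, again in plain Scala. It assumes the feed carries deltas (amounts to add) rather than replacement values, and that the aggregation is an element-wise sum, so a rolled-up batch can simply be added to the running result. `IncrementalMerge`, `Key`, `merge` and `addArrays` are hypothetical names.

```scala
object IncrementalMerge {
  type Key = (Int, Long, String) // (level, node, measureType)

  // Element-wise sum; assumes both arrays have the same length.
  def addArrays(a: Array[Float], b: Array[Float]): Array[Float] =
    a.zip(b).map { case (x, y) => x + y }

  // Adds a batch of already rolled-up deltas to the running totals.
  def merge(totals: Map[Key, Array[Float]], deltas: Map[Key, Array[Float]]): Map[Key, Array[Float]] =
    deltas.foldLeft(totals) { case (acc, (key, delta)) =>
      // Keys not seen before start from the delta itself.
      acc.updated(key, acc.get(key).map(addArrays(_, delta)).getOrElse(delta))
    }
}
```

In Spark Streaming, a stateful operation such as `updateStateByKey` is one place where a merge like this could live. If the feed sends replacement values instead of deltas, the old leaf value would have to be subtracted first.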

 

Deenar

