Understanding Spark execution plans

Daniel Stojanov

When an execution plan is printed, it lists the tree of operations that will be performed when the job runs. The operators have somewhat cryptic names, such as BroadcastHashJoin, Project, and Filter. These do not appear to map directly to the functions that are called on an RDD.

1) Is there a place in which each of these steps is documented?
2) Is there documentation, outside of Spark's source code, that describes the mapping between operations on Spark DataFrames or RDDs and the resulting physical execution plan? Ideally it would be detailed enough to understand the physical execution steps more accurately and to predict the steps that would result from particular actions.