Re: Where is the DAG stored before catalyst gets it?
Hi Jean Georges,
> I am assuming it is still in the master and when catalyst is finished it sends the tasks to the workers.
Sorry to be so direct, but that sentence does not make much sense to me. Again, very sorry for saying so in the very first sentence. Since I know you, Jean Georges, I allowed myself to be more open.
The "the master" part seems to suggest that you use a Spark Standalone cluster. Correct? Other cluster managers use different names for the master/manager node.
"when catalyst is finished": that one is really tough to understand. Do you mean once all the optimizations are applied and the query is ready for execution? The final output of the "query execution pipeline" is an RDD with the right code for execution. At this phase, the query is more an RDD than a Dataset.
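To make the phases a bit more concrete, here is a minimal sketch of how you could look at them yourself. It assumes a local session (master "local[*]"), and the query and the object name are made up purely for illustration. It only reads what QueryExecution exposes and does not say anything about where each plan physically lives.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object QueryExecutionSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("query-execution-sketch")
      .getOrCreate()

    // A made-up query, just to have something to plan.
    val q = spark.range(10).filter("id % 2 = 0")

    val qe = q.queryExecution
    println(qe.analyzed)      // logical plan after analysis
    println(qe.optimizedPlan) // logical plan after the optimizer has run
    println(qe.executedPlan)  // physical plan prepared for execution

    // The RDD[InternalRow] that the physical plan turns into.
    val rdd = qe.toRdd
    println(rdd.toDebugString) // RDD lineage behind the query

    spark.stop()
  }
}
```

Calling q.explain(true) prints the same logical and physical plans in one go, if you only want to eyeball them.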
"it sends the tasks to the workers." Since we're talking about an RDD, this abstraction is planned as a set of tasks (one per partition of the RDD). And yes, the tasks are sent out over the wire to executors. It's been like this since Spark 1.0 (and even earlier).
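If you want to see the "one task per partition" part for yourself, something along these lines should do. Again a sketch only: the local master, the 4 partitions and the object name are my assumptions for the example, not anything from your setup.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object TasksPerPartitionSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tasks-per-partition-sketch")
      .getOrCreate()

    // A made-up Dataset, explicitly requested with 4 partitions.
    val ds = spark.range(0, 100, 1, 4)

    // The RDD behind the query and the number of partitions it has.
    val rdd = ds.queryExecution.toRdd
    println(s"partitions: ${rdd.getNumPartitions}")

    // count() is an action: it submits a job whose single stage
    // runs one task for every partition of this RDD.
    println(s"rows: ${rdd.count()}")

    spark.stop()
  }
}
```

While the job runs (or afterwards in the history), the Stages tab of the web UI shows the number of tasks per stage, which you can compare with the partition count printed above.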