Trying to make sense of the actual executed code


Tom Hubregtsen
Hi,

I am trying to understand how, for instance, the following SQL query is executed in Spark 1.1:
SELECT table.key, table.value, table2.value FROM table2 JOIN table WHERE table2.key = table.key
When I look at the output, I see that there are several stages and several tasks per stage. The tasks have a TID; I do not see such a thing for a stage. I see the input splits of the files, and start, running, and finished messages for the tasks. But what I really want to know is the following: which map, shuffle, and reduce operations are performed, and in which order? Where can I see the code actually executed per task/stage? Seeing the intermediate files/RDDs would be a bonus!

Thanks in advance,

Tom

Re: Trying to make sense of the actual executed code

Michael Armbrust
This may not be exactly what you are asking for, but you might consider looking at queryExecution, a developer API that shows how the query is analyzed and executed:

sql("...").queryExecution
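For example, a sketch against the Spark 1.1-era API, run from a spark-shell session where a `sqlContext` is in scope and `table`/`table2` are registered temporary tables (queryExecution is a developer API, so its exact output varies by version):

```scala
// Sketch: inspecting how Spark SQL plans the query from the original post.
// Assumes `sqlContext` (org.apache.spark.sql.SQLContext) already exists in the shell.
val q = sqlContext.sql(
  "SELECT table.key, table.value, table2.value " +
  "FROM table2 JOIN table WHERE table2.key = table.key")

// Prints the logical, analyzed, optimized, and physical plans --
// the physical plan is what maps onto the stages you see in the logs/UI.
println(q.queryExecution)
```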


--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-make-sense-of-the-actual-executed-code-tp11594.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]



Fwd: Trying to make sense of the actual executed code

Tobias Pfeiffer
In reply to this post by Tom Hubregtsen
(Forgot to include the mailing list in my reply. Here it is.)


Hi,

On Thu, Aug 7, 2014 at 7:55 AM, Tom <[hidden email]> wrote:
When I look at the output, I see that there are several stages and several tasks per stage. The tasks have a TID; I do not see such a thing for a stage.

They should have. In my logs, for example, I see something like

INFO  scheduler.DAGScheduler - Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at SimpleSpark.scala:21), which has no missing parents
INFO  scheduler.DAGScheduler - Submitting Stage 0 (MapPartitionsRDD[6] at reduceByKey at SimpleSpark.scala:21), which is now runnable
 
But what I really want to know is the following: which map, shuffle, and reduce operations are performed, and in which order? Where can I see the code actually executed per task/stage? Seeing the intermediate files/RDDs would be a bonus!
 
I would also be interested in that, although I think it's quite hard to understand what is actually being executed. I dug into that a bit yesterday, and even the simple WordCount (flatMap, map, reduceByKey, max) is already quite tough to understand. For example, reduceByKey consists of three transformations (a local reduceByKey, a repartition by key, and another local reduceByKey), one of which happens in one stage and the other two in a different stage. I would love to see a good visualization of that (I wonder how the developers got their heads around it without such a tool), but I am not aware of any.
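To make those three steps concrete, here is a minimal sketch in plain Scala (no Spark; the explicit partition lists and the hash-based routing are simplified stand-ins for Spark's partitions and HashPartitioner) of how reduceByKey decomposes into a map-side combine, a shuffle by key, and a reduce-side combine:

```scala
object ReduceByKeySketch {
  type Pair = (String, Int)

  // Step 1: map-side combine, run independently within each input partition
  // (this runs in the same stage as the preceding map/flatMap).
  def localCombine(partition: Seq[Pair]): Seq[Pair] =
    partition.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq

  // Step 2: the shuffle -- route each partially combined pair to a reducer
  // partition chosen by hashing the key.
  def repartition(partitions: Seq[Seq[Pair]], numReducers: Int): Seq[Seq[Pair]] =
    (0 until numReducers).map { r =>
      partitions.flatten.filter(p => math.abs(p._1.hashCode) % numReducers == r)
    }

  // Step 3: reduce-side combine, merging the partial sums for each key
  // (this runs in the stage after the shuffle boundary).
  def reduceByKey(partitions: Seq[Seq[Pair]], numReducers: Int): Map[String, Int] = {
    val combined = partitions.map(localCombine)
    val shuffled = repartition(combined, numReducers)
    shuffled.flatMap(localCombine).toMap
  }
}
```

For a tiny word-count input such as `Seq(Seq(("a", 1), ("b", 1), ("a", 1)), Seq(("b", 1), ("a", 1)))` with two reducers, this yields `Map("a" -> 3, "b" -> 2)`; the stage boundary sits between steps 1 and 3, which is why you see the reduceByKey call site reported in two different stages.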

Tobias