Strange WholeStageCodegen UI values


Michal Sankot
Hi,
I'm checking the execution of SQL queries in the Spark UI, trying to find a
bottleneck, and the values displayed in the WholeStageCodegen blocks are
confusing.

In the attached example the whole query took 6.6 minutes, yet the upper-left
WholeStageCodegen block says the median value is 7.8 minutes and the
maximum 7.27 h :O

What do those numbers mean? Do they have any real meaning? Is there a way
to find out how long individual blocks really took?

Thanks,
Michal
Spark 2.4.4 on AWS EMR


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Attachment: Screenshot 2020-07-09 at 16.44.35.png (341K)

Re: Strange WholeStageCodegen UI values

Bobby Evans
Sadly there isn't a lot you can do to fix this. All of the operators take iterators of rows as input and produce iterators of rows as output. For efficiency reasons, the timing is not done for each individual row; in many cases measuring how long something took would take longer than just doing the operation itself. So most operators actually end up measuring the lifetime of the operator, which is often the time of the entire task minus how long it took for the first row to reach that operator. This is also true of WholeStageCodegen.
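To illustrate the point (a minimal sketch, not Spark's actual metric code): the wrapper below times an operator's "lifetime", from the first row pulled to iterator exhaustion, so any time a downstream consumer spends between rows gets charged to the wrapped operator as well.

```python
import time

def lifetime_timed(rows, metrics, name):
    """Yield rows from an iterator, recording its 'lifetime': the span
    from the first pull until exhaustion (coarse operator-level timing,
    rather than per-row timing)."""
    start = time.perf_counter()
    for row in rows:
        yield row
    metrics[name] = time.perf_counter() - start

def slow_consumer(rows):
    """A hypothetical downstream operator doing 10 ms of work per row."""
    out = []
    for row in rows:
        time.sleep(0.01)  # this time is charged to the upstream's lifetime
        out.append(row)
    return out

metrics = {}
result = slow_consumer(lifetime_timed(iter(range(5)), metrics, "scan"))
# The 'scan' lifetime includes the consumer's ~50 ms of sleeping, even
# though the scan itself produced its rows almost instantly.
```

This is why a cheap upstream block can report a large duration: the metric reflects how long the operator was alive inside the task, not how much work it did.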


Re: Strange WholeStageCodegen UI values

srowen
It sounds like you have huge data skew?
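One quick way to check (a sketch with hypothetical per-task durations; in practice the values come from the stage's task table in the Spark UI): compare the median task duration to the maximum. A maximum that dwarfs the median is the classic skew signature.

```python
import statistics

# Hypothetical per-task durations in seconds for one stage; real values
# would be read from the Spark UI's stage page.
task_durations_s = [12, 15, 14, 13, 2100, 16]

mn = min(task_durations_s)
med = statistics.median(task_durations_s)
mx = max(task_durations_s)

# Rough heuristic: flag skew when the slowest task takes far longer
# than the typical (median) task.
skewed = mx > 10 * med
print(f"min={mn}s median={med}s max={mx}s skewed={skewed}")
```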


Re: Strange WholeStageCodegen UI values

Michal Sankot
Hey guys,
Thanks for the insights.

Bobby, I see that it derives those values from the run time of the whole task. But as the whole query took 6.6 minutes, how can it come up with 7.27 hours?

Sean, yes, there is data skew: one task takes tens of minutes while others take tens of seconds. What gave it away in the chart? The wild differences between the min/med/max values of WholeStageCodegen?

Cheers,
Michal
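One hedged way the arithmetic could work out (assumed numbers, purely illustrative): if the metric accumulates operator lifetimes across tasks that ran in parallel, the aggregated figure can far exceed the query's wall-clock time.

```python
# Assumed, illustrative numbers: 400 tasks, each "alive" ~66 s, running
# roughly 100-wide, so the query finishes in minutes of wall-clock time
# while the accumulated operator lifetime adds up to hours.
num_tasks = 400
lifetime_per_task_s = 66
wall_clock_s = 6.6 * 60          # the 6.6-minute query duration

total_metric_s = num_tasks * lifetime_per_task_s
print(f"accumulated: {total_metric_s / 3600:.2f} h "
      f"vs wall clock: {wall_clock_s / 60:.1f} min")
```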

