Lazy execution


Lazy execution

Sebastian Schelter
Hi,

I want to run some benchmarks with Spark, for which I need explicit
control over its lazy execution. My benchmarks consist of two steps:

1. loading and transforming a dataset
2. applying an operation to the transformed dataset, where I want to
measure the runtime

How can I make sure that the operations for step 1 are fully executed
before I start step 2, whose time I'd like to measure?

Would a solution be to invoke cache() and count() on the RDD holding my
dataset after step 1?

--sebastian
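
A minimal sketch of that pattern in Scala (assuming a spark-shell session where sc is already defined; the input path and the parse/expensiveOp functions below are placeholders, not part of the original question):

    // Placeholder functions, only to make the sketch self-contained.
    def parse(line: String): Long = line.length.toLong
    def expensiveOp(x: Long): Long = x * x

    // Step 1: load and transform the dataset, then force materialization.
    val data = sc.textFile("/path/to/input").map(parse)
    data.cache()   // mark the RDD to be kept in memory
    data.count()   // an action: forces step 1 to execute and fill the cache

    // Step 2: time only this operation; step 1's work is already done.
    val start = System.nanoTime()
    val result = data.map(expensiveOp).reduce(_ + _)
    println("step 2 took " + (System.nanoTime() - start) / 1e6 + " ms")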

Re: Lazy execution

Matei Zaharia
If you’re trying to measure performance assuming the dataset is already in memory, then doing cache() and count() would work. However, if you want to measure an end-to-end workflow, it might be better to let the data loading and the operations happen together, as Spark does by default. This gives the engine room to pipeline them, and it might result in a faster time than “loading” first (where you’re I/O-bound) and then computing afterwards (where you’re CPU- or communication-bound).

Matei
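
To make the contrast concrete, here is the end-to-end variant of the same sketch (same placeholder names as above): with no cache()/count() barrier, the single action lets Spark pipeline loading, transformation, and the measured operation in one pass over the data:

    // End-to-end measurement: no explicit barrier, so loading, parsing,
    // and the final operation are pipelined within a single job.
    val start = System.nanoTime()
    val result = sc.textFile("/path/to/input").map(parse).map(expensiveOp).reduce(_ + _)
    println("end-to-end run took " + (System.nanoTime() - start) / 1e6 + " ms")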


Re: Lazy execution

Sebastian Schelter
Hi Matei,

I want to get a feeling for the pros and cons of executing a certain
computation in a certain way (and not really to benchmark Spark). So
I'll go with the cache() and count() approach.

--sebastian
