Multiple transformations without recalculating or caching

4 messages
Multiple transformations without recalculating or caching

Fernando Pereira
Dear Spark users

Is it possible to take the output of a transformation (RDD/DataFrame) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset?

Consider the case of a very large dataset (1+ TB) which has undergone several transformations, and now we want to save it but also calculate some statistics per group.
So the best way to process it would be: for each partition, do task A, then do task B.

I don't see a way of instructing Spark to proceed that way without caching to disk, which seems unnecessarily heavy. And if we don't cache, Spark recalculates every partition all the way from the beginning. In either case huge file reads happen.
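The recompute-or-cache trade-off can be illustrated outside Spark with a plain-Python sketch (an analogy, not Spark code): a lazy generator pipeline plays the role of the uncached lineage, and each independent action re-drives it from the source.

```python
# Plain-Python analogy: a lazy pipeline, like an uncached RDD lineage,
# is re-executed from the beginning by every independent consumer.

compute_calls = 0

def expensive_transform(records):
    """Stand-in for a chain of expensive transformations."""
    global compute_calls
    for r in records:
        compute_calls += 1   # count how often each record is processed
        yield r * 2

source = range(5)

# Two independent "actions" over the same lazy pipeline:
task_a = sum(expensive_transform(source))   # e.g. the save
task_b = max(expensive_transform(source))   # e.g. the statistics

first_run_calls = compute_calls             # 10: each record processed twice

# Materialising once (like .cache()/.persist()) avoids the recomputation:
compute_calls = 0
materialised = list(expensive_transform(source))
task_a = sum(materialised)
task_b = max(materialised)                  # compute_calls is now only 5
```

In Spark the same choice appears as `rdd.persist()` / `df.cache()` versus letting both actions replay the lineage; the sketch only makes the cost visible.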

Any ideas on how to avoid it?

Thanks
Fernando

Re: Multiple transformations without recalculating or caching

sebastian.piu

If you don't want to recalculate, you need to hold the results somewhere. If you need to save the data anyway, why don't you do that first and then read it back to compute your stats?
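In Spark terms, Sebastian's suggestion is `df.write.parquet(path)` followed by `spark.read.parquet(path)` for the statistics pass. A minimal language-level sketch of the same save-then-reload pattern, using plain Python and a temporary JSON file (the record layout and field names here are hypothetical):

```python
import json
import tempfile
from collections import defaultdict

# Hypothetical transformed records (the "1+ TB dataset" in miniature).
transformed = [{"group": "a", "value": 1},
               {"group": "a", "value": 3},
               {"group": "b", "value": 2}]

# Step 1: save the transformed output once (cf. df.write.parquet(path)).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(transformed, f)
    path = f.name

# Step 2: read the saved copy back (cf. spark.read.parquet(path)) and
# compute per-group statistics from it, not from the original lineage.
with open(path) as f:
    reloaded = json.load(f)

stats = defaultdict(int)
for rec in reloaded:
    stats[rec["group"]] += rec["value"]
```

The point of the pattern is that the expensive transformation chain runs exactly once; the second pass reads materialised output rather than replaying the lineage.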


On Fri, 17 Nov 2017, 10:03 Fernando Pereira, <[hidden email]> wrote: [quoted message elided]

Re: Multiple transformations without recalculating or caching

Fernando Pereira
Note that I have 1+ TB. If I didn't mind things being slow I wouldn't be using Spark.

On 17 November 2017 at 11:06, Sebastian Piu <[hidden email]> wrote: [quoted messages elided]


Re: Multiple transformations without recalculating or caching

Phillip Henry
A back-of-a-beermat calculation says that if you have, say, 20 boxes, saving 1 TB should take approximately 15 minutes (with a replication factor of 1, since you don't need it higher for ephemeral data that is relatively easy to regenerate).
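The beermat estimate can be reproduced with a couple of lines of arithmetic. The per-node write throughput is an assumption on my part (roughly what a single spinning disk sustains sequentially), not a figure from the thread:

```python
dataset_bytes = 1e12   # 1 TB to save
nodes = 20             # "20 boxes"
throughput = 60e6      # assumed ~60 MB/s sequential write per node

# All nodes write their partitions in parallel, replication factor 1.
seconds = dataset_bytes / (nodes * throughput)
minutes = seconds / 60   # ~14 minutes, in line with the estimate
```

Doubling the replication factor would roughly double the write volume, which is why keeping it at 1 for ephemeral data matters here.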

This isn't much if the whole job takes hours, and you get the added bonus that you can inspect the interim data, which helps in understanding how the results came to be.

This worked for us. As ever, YMMV.

Phillip


On 17 Nov 2017 11:12, "Fernando Pereira" <[hidden email]> wrote: [quoted messages elided]