understanding spark shuffle file re-use better

understanding spark shuffle file re-use better

Koert Kuipers
is shuffle file re-use based on identity or equality of the dataframe?

for example, if I run the exact same code twice to load data and do transforms (joins, aggregations, etc.), but without re-using any actual dataframes, will I still see skipped stages thanks to shuffle file re-use?
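
concretely, the two cases I mean (the path and column are just for illustration):

```
// case 1: identity -- the same dataframe object is used for both actions
val df = spark.read.parquet("/data/a").groupBy("k").count()
df.collect()
df.collect()

// case 2: equality -- the exact same code runs twice, producing two equal
// but distinct dataframes
spark.read.parquet("/data/a").groupBy("k").count().collect()
spark.read.parquet("/data/a").groupBy("k").count().collect()
```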

thanks!
koert

Re: understanding spark shuffle file re-use better

Jacek Laskowski
Hi,

An interesting question that, I must admit, I'm not sure how to answer myself :)

Off the top of my head, I'd **guess** that unless you cache the first query, the two queries would share nothing. With caching, there is a phase in query execution when a canonicalized version of the query is used to look up any cached queries.

Again, I'm not really sure, but if I had to answer it (e.g. as part of an interview) I'd say nothing would be shared / re-used.
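
A sketch of the caching case I mean (again a guess, not verified: since the lookup is by the canonicalized plan, an equal-but-not-identical query should hit the cache):

```
val q1 = spark.read.text("README.md").groupBy("value").count()
q1.cache()
q1.count()    // materializes the cache

// built from scratch: a different object, but an equal (canonicalized) plan
val q2 = spark.read.text("README.md").groupBy("value").count()
q2.explain()  // if the cache lookup matches, this shows an InMemoryTableScan
```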


Re: understanding spark shuffle file re-use better

Attila Zsolt Piros
In reply to this post by Koert Kuipers
No, it won't be reused.
You should reuse the dataframe to reuse the shuffle blocks (and cached data).

I know this because the two actions will lead to two separate DAGs being built, but I will show you a way to check this on your own (with a small, simple Spark application).

For this you can even use the spark-shell. Start it in a directory where a simple text file is available ("README.md" in my case).

After this the one-liner is:

```
scala> spark.read.text("README.md").selectExpr("length(value) as l", "value").groupBy("l").count.take(1)
```

Now if you check the Stages tab in the UI you will see 3 stages.
After re-executing the same line of code, you can see in the Stages tab that the number of stages has doubled.

So shuffle files are not reused.

Finally, you can delete the file and re-execute our small test. Now it will
produce:

```
org.apache.spark.sql.AnalysisException: Path does not exist:
file:/Users/attilazsoltpiros/git/attilapiros/spark/README.md;
```

So the file would have been opened again to load the data (even in the 3rd run).
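
For contrast, reuse by identity is easy to see at the RDD level (a small sketch: re-running an action on the very same object does skip the shuffle map stage):

```
scala> val rdd = sc.textFile("README.md").map(l => (l.length, 1)).reduceByKey(_ + _)
scala> rdd.count()   // both stages run
scala> rdd.count()   // the shuffle map stage now shows as "skipped" in the UI
```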





Re: understanding spark shuffle file re-use better

Attila Zsolt Piros
A much better one-liner (easier to follow in the UI, because it is a single simple job with 2 stages):

```
spark.read.text("README.md").repartition(2).take(1)
```
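
Running this line a second time again doubles the number of stages in the UI, so the shuffle files from the first run are not picked up.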




Re: understanding spark shuffle file re-use better

Mandloi87-2
Increase or decrease the number of data partitions: since a data partition
represents the quantum of data to be processed together by a single Spark
task, there could be situations (see the sketch after this list):
 (a) where the existing number of data partitions is not sufficient to
maximize the usage of the available resources;
 (b) where the existing data partitions are too heavy to be computed
reliably without memory overruns;
 (c) where the existing number of data partitions is so high that
task-scheduling overhead becomes the bottleneck in the overall
processing time.
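
A minimal sketch of the corresponding knobs (the file and the numbers are just illustrative):

```
val df = spark.read.text("README.md")

// (a), (b): too few or too heavy partitions -> raise the partition count (full shuffle)
val wider = df.repartition(200)

// (c): too many partitions -> lower the count; coalesce avoids a full shuffle
val narrower = df.coalesce(8)

// the partition count Spark SQL uses after a shuffle (default 200)
spark.conf.set("spark.sql.shuffle.partitions", "100")
```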


