repartition in Spark


repartition in Spark

Ashok Kumar-2
Hi,

Just need some advice.

  1. When we have multiple Spark nodes running code, under what conditions does a repartition make sense?
  2. Can we repartition and cache the result --> df = spark.sql("select from ...").repartition(4).cache()
  3. If we choose repartition(4), will that repartition apply across all nodes running the code, and how can one see that?

Thanks



Re: repartition in Spark

Mich Talebzadeh
As a generic answer: in a distributed environment like Spark, making sure that data is distributed evenly among all nodes (assuming every node is the same or similar) can help performance.

repartition thus controls the data distribution among all nodes. However, it is not that straightforward. Your mileage varies, simply because changing the distribution comes at the cost of physically moving data across the cluster nodes (a so-called shuffle).

So there is a cost associated with repartition due to the shuffle it creates. You need to look at the execution plan, either with df.explain() or in the Spark UI, to see the physical plan.
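For example (a minimal PySpark sketch; the table name "sales" is a placeholder, not from the original question):

    df = spark.sql("SELECT * FROM sales").repartition(4).cache()
    df.explain()   # physical plan; the repartition shows up as an Exchange (shuffle) step
    df.count()     # first action materialises the cached, repartitioned data

Note that cache() is lazy: nothing is actually cached until the first action runs, so to your second question, yes, repartition and cache can be chained, but the work only happens at the action.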

In its simplest form, repartition(n) will distribute the data randomly, and I think that is the most common form. However, this also depends on the volume of data. For smaller volumes I don't think it really matters, but for large volumes repartition may be an option, for example if the data being joined is skewed. You need to know the volume of data before deploying repartitioning.
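To your third question, one way to see how the rows actually landed in the partitions is a quick check with the standard spark_partition_id function (a sketch continuing from the df above):

    from pyspark.sql.functions import spark_partition_id

    print(df.rdd.getNumPartitions())   # prints 4 after repartition(4)

    # count rows per partition; a roughly even spread means good distribution
    df.groupBy(spark_partition_id().alias("partition_id")).count().show()

Once the DataFrame is cached, the Storage tab of the Spark UI also shows the cached partitions and their sizes across the executors.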

HTH




LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 





