Configuration for unit testing and sql.shuffle.partitions

Configuration for unit testing and sql.shuffle.partitions

peay
Hello,

I am running unit tests with Spark DataFrames, and I am looking for configuration tweaks that would make tests faster. Usually, I use a local[2] or local[4] master.

Something that has been bothering me is that most of my stages end up using 200 partitions, regardless of whether I repartition the input. This seems overkill for small unit tests that barely have 200 rows per DataFrame.

spark.sql.shuffle.partitions used to control this, I believe, but it seems to be gone, and I could not find any information on what mechanism/setting replaces it or the corresponding JIRA.

Does anyone have experience to share on how best to tune Spark for very small local runs like that?

Thanks!


Re: Configuration for unit testing and sql.shuffle.partitions

Akhil Das-2
spark.sql.shuffle.partitions is still used, I believe. I can see it in the code and on the documentation page.
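For reference, a minimal sketch of setting it when building a test session (the value 4 is an arbitrary illustrative choice; the default is 200):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: a small local session with few shuffle partitions,
// so tiny test DataFrames don't fan out into 200 tasks.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-tests")
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()

// It can also be changed on an existing session at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "4")
```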

On Wed, Sep 13, 2017 at 4:46 AM, peay <[hidden email]> wrote:
--
Cheers!


Re: Configuration for unit testing and sql.shuffle.partitions

femibyte
How are you specifying it? As an option to spark-submit?
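If it is going through spark-submit, a `--conf` flag is one way to pass it (the jar name here is just a placeholder):

```shell
# Sketch: lower the shuffle partition count on the command line.
# "my-tests.jar" is a placeholder for your actual test artifact.
spark-submit \
  --master "local[2]" \
  --conf spark.sql.shuffle.partitions=4 \
  my-tests.jar
```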

On Sat, Sep 16, 2017 at 12:26 PM, Akhil Das <[hidden email]> wrote:

--
"Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein.

Re: Configuration for unit testing and sql.shuffle.partitions

Vadim Semenov
You can create a superclass, "FunSuiteWithSparkContext", that creates the SparkSession, SparkContext, and SQLContext with all the desired properties.
Then you extend it in all the relevant test suites, and that's pretty much it.
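A rough sketch of what such a base suite might look like (the session settings are illustrative, and the `FunSuite` import path depends on your ScalaTest version; newer versions use `org.scalatest.funsuite.AnyFunSuite`):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Sketch of a shared base suite: every test suite that extends
// this gets a local SparkSession tuned for small runs.
abstract class FunSuiteWithSparkContext extends FunSuite with BeforeAndAfterAll {

  @transient protected var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("unit-tests")
      .config("spark.sql.shuffle.partitions", "4") // keep shuffles tiny
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    try spark.stop()
    finally super.afterAll()
  }
}
```

The SparkContext and SQLContext are then available to subclasses as `spark.sparkContext` and `spark.sqlContext`.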

The other option is to pass the settings as VM parameters, like
`-Dspark.driver.memory=2g -Xmx3G -Dspark.master=local[3]`

For example, if you run your tests with sbt:

```
SBT_OPTS="-Xmx3G -Dspark.driver.memory=1536m" sbt test
```

On Sat, Sep 16, 2017 at 2:54 PM, Femi Anthony <[hidden email]> wrote: