Spark Optimization

Pallavi

Hi Team,

 

We are currently working on a POC based on Spark and Scala.

We have to read 18 million records from a Parquet file and perform 25 user-defined aggregations based on grouping keys.

We have used the high-level Spark DataFrame API for the aggregation. On a two-node cluster we could finish the end-to-end job (read + aggregation + write) in 2 minutes.
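
For context, the job roughly follows the pattern below; the column names, aggregation expressions and paths are placeholders rather than our actual schema, and only two of the 25 user-defined aggregations are shown:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("AggregationPOC").getOrCreate()

// Read the input Parquet file (placeholder path)
val input = spark.read.parquet("/data/input.parquet")

// Group by the business keys and aggregate
// (only two aggregations shown; the real job applies 25)
val aggregated = input
  .groupBy("key1", "key2")
  .agg(
    sum("amount").as("total_amount"),
    avg("quantity").as("avg_quantity"))

// Write the result back out as Parquet (placeholder path)
aggregated.write.mode("overwrite").parquet("/data/output.parquet")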

 

Cluster Information:

Number of Nodes: 2

Total Cores: 28

Total RAM: 128 GB

 

Component:

Spark Core

 

Scenario:

How-to

 

Tuning Parameters:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.default.parallelism 24
spark.sql.shuffle.partitions 24
spark.executor.extraJavaOptions -XX:+UseG1GC
spark.speculation true
spark.executor.memory 16G
spark.driver.memory 8G
spark.sql.codegen true
spark.sql.inMemoryColumnarStorage.batchSize 100000
spark.locality.wait 1s
spark.ui.showConsoleProgress false
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
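
For reference, a rough sketch of how we apply a subset of these settings programmatically (the app name is a placeholder; the memory settings are omitted here because they normally have to be set before the JVM starts, e.g. in spark-defaults.conf or on the spark-submit command line):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Mirrors a subset of the settings listed above
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.default.parallelism", "24")
  .set("spark.sql.shuffle.partitions", "24")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  .set("spark.speculation", "true")

val spark = SparkSession.builder()
  .appName("AggregationPOC")
  .config(conf)
  .getOrCreate()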

Please let us know if you have any ideas or tuning parameters that we can use to finish the job in less than one minute.

 

 

Regards,

Pallavi


Re: Spark Optimization

vincent gromakowski
Ideal parallelism is 2-3x the number of cores, but it depends on the number of partitions in your source and on the operations you use (shuffle or not). It can be worth paying the extra cost of an initial repartition to match your cluster (see the sketch below), but that clearly depends on your DAG.
Optimizing Spark apps depends on a lot of things, so it's hard to answer without knowing:
- cluster size
- scheduler
- Spark version
- transformation graph (DAG)
...
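
A minimal sketch of what I mean by an initial repartition (the path, column name, and partition count are placeholders; pick a count around 2-3x your cores and a column that matches your grouping keys):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("RepartitionSketch").getOrCreate()

// Read the source (placeholder path)
val df = spark.read.parquet("/data/input.parquet")

// Repartition on a grouping key before the aggregation;
// 56 partitions would be about 2x the 28 cores you mentioned
val repartitioned = df.repartition(56, col("key1"))

Whether the extra shuffle pays off depends on the aggregation that follows and on how skewed your keys are.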


Re: Spark Optimization

CPC
I would recommend UseParallelGC since this is a batch job. Parallelism should be 2-3x the number of cores. Also, if those are physical machines, I would recommend setting the network MTU to 9000 (jumbo frames). Is that 128 GB per node or 64 GB per node?
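
For example, only the GC line in the settings from the original mail would change (a sketch, same key/value format as that list):

spark.executor.extraJavaOptions -XX:+UseParallelGC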


RE: Spark Optimization

Pallavi

Thanks for your reply.

 

It is 64 GB per node. We will try using UseParallelGC.

 
