Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0


satishjohn
The time taken to complete a Spark job on YARN is about 4x slower than the same job in Spark standalone mode. However, in standalone mode jobs often fail with executor-lost errors.

Hardware configuration


3 nodes (1 master, 2 workers), each with 32 GB RAM, 8 cores (16 hyper-threads), and a 1 TB HDD

Spark configuration:


spark.executor.memory::7g
spark.cores.max::96
spark.driver.memory::5g
spark.driver.maxResultSize::2g
spark.sql.autoBroadcastJoinThreshold::-1 (without this key, the job fails or takes ~50x longer)
Executor instances: 4 per machine

With the above Spark configuration, the standalone-mode job for a business flow of 17 million records completes in 8 minutes.
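For reference, the settings above would look like this in spark-defaults.conf (property names per the Spark 1.6 documentation; the executor-instances value, which interprets "4 per machine" across the 3 nodes, is my assumption):

```
spark.executor.memory                 7g
spark.driver.memory                   5g
spark.driver.maxResultSize            2g
spark.cores.max                       96
spark.sql.autoBroadcastJoinThreshold  -1
# Note: spark.cores.max applies to standalone mode only; under YARN the
# equivalent knobs are spark.executor.instances and spark.executor.cores.
# Assumption: "4 instances per machine" x 3 nodes = 12.
spark.executor.instances              12
```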

Problem Area:


When run in yarn-client mode with the configuration below, the same flow takes 33 to 42 minutes. Here is the yarn-site.xml:

<configuration>
  <property><name>yarn.label.enabled</name><value>true</value></property>
  <property><name>yarn.log-aggregation.enable-local-cleanup</name><value>false</value></property>
  <property><name>yarn.resourcemanager.scheduler.client.thread-count</name><value>64</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>satish-NS1:8031</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>satish-NS1:8030</value></property>
  <property><name>yarn.dispatcher.exit-on-error</name><value>true</value></property>
  <property><name>yarn.nodemanager.container-manager.thread-count</name><value>64</value></property>
  <property><name>yarn.nodemanager.local-dirs</name><value>/home/satish/yarn</value></property>
  <property><name>yarn.nodemanager.localizer.fetch.thread-count</name><value>20</value></property>
  <property><name>yarn.resourcemanager.address</name><value>satish-NS1:8032</value></property>
  <property><name>yarn.scheduler.increment-allocation-mb</name><value>512</value></property>
  <property><name>yarn.log.server.url</name><value>http://satish-NS1:19888/jobhistory/logs</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>28000</value></property>
  <property><name>yarn.nodemanager.labels</name><value>MASTER</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>48</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.localizer.client.thread-count</name><value>20</value></property>
  <property><name>yarn.app.mapreduce.am.labels</name><value>CORE</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>172800</value></property>
  <property><name>yarn.nodemanager.address</name><value>${yarn.nodemanager.hostname}:8041</value></property>
  <property><name>yarn.resourcemanager.hostname</name><value>satish-NS1</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/home/satish/satish/hadoop-yarn/apps</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.client.thread-count</name><value>64</value></property>
  <property><name>yarn.scheduler.maximum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>spark_shuffle,mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.resourcemanager.client.thread-count</name><value>64</value></property>
  <property><name>yarn.nodemanager.container-metrics.enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.log-dirs</name><value>/home/satish/hadoop-yarn/containers</value></property>
  <property><name>yarn.nodemanager.aux-services.spark_shuffle.class</name><value>org.apache.spark.network.yarn.YarnShuffleService</value></property>
  <property><name>yarn.scheduler.minimum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.scheduler.increment-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.resourcemanager.scheduler.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value></property>
  <property><name>yarn.scheduler.fair.preemption</name><value>true</value></property>

</configuration>
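One thing worth checking against this yarn-site.xml: with spark.executor.memory at 7g, the container Spark requests sits right at the 8192 MB yarn.scheduler.maximum-allocation-mb cap. A quick calculation (the overhead formula is Spark 1.6's default of max(384 MB, 10% of executor memory); the increment rounding mirrors yarn.scheduler.increment-allocation-mb):

```python
# Sketch: how YARN sizes a Spark 1.6 executor container under the settings above.
# Values mirror the configuration in this post; the overhead formula is Spark
# 1.6's default for spark.yarn.executor.memoryOverhead.
executor_mem_mb = 7 * 1024                            # spark.executor.memory = 7g
overhead_mb = max(384, int(0.10 * executor_mem_mb))   # default overhead: max(384 MB, 10%)
request_mb = executor_mem_mb + overhead_mb            # what Spark asks YARN for

increment_mb = 512                                    # yarn.scheduler.increment-allocation-mb
rounded_mb = -(-request_mb // increment_mb) * increment_mb  # round up to the increment

print(request_mb, rounded_mb)  # 7884 8192 -- exactly at yarn.scheduler.maximum-allocation-mb
```

Any increase in executor memory or overhead would push the request past the 8192 MB cap and leave executors unschedulable.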

I am also using the DominantResourceCalculator with the capacity scheduler; I have tried the fair and default schedulers as well.

To keep the test simple, I ran a sort job on the same cluster in both yarn-client mode and Spark standalone mode. I can share the data for your comparative analysis as well.

Results:


136 seconds - Yarn-client mode
40 seconds  - Spark Standalone mode

To conclude, I am looking for the reason behind the yarn-client mode performance issue, and for the best possible configuration to get good performance out of YARN.

When I set spark.sql.autoBroadcastJoinThreshold::-1, long-running jobs complete in time and fail far less often; I have a history of problems when running jobs without this option.
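For context, Spark 1.6's default broadcast threshold is 10 MB, and a value of -1 disables automatic broadcast joins entirely, forcing shuffle joins. A small sketch of the decision the planner makes (paraphrased logic for illustration, not Spark's actual code):

```python
# Paraphrased sketch of Spark SQL's broadcast-join decision (not Spark's
# actual implementation). Default threshold in Spark 1.6 is 10 MB.
DEFAULT_THRESHOLD = 10 * 1024 * 1024  # spark.sql.autoBroadcastJoinThreshold default

def will_broadcast(table_size_bytes, threshold=DEFAULT_THRESHOLD):
    # A threshold of -1 disables broadcasting: every join becomes a shuffle join.
    return threshold >= 0 and table_size_bytes <= threshold

print(will_broadcast(5 * 1024 * 1024))      # small table, default threshold -> True
print(will_broadcast(5 * 1024 * 1024, -1))  # threshold -1 -> False (shuffle join)
```

This is why -1 trades some speed on small joins for stability: shuffle joins never pull a whole table onto the driver or executors at once.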

Let me know how to get performance from yarn-client mode similar to Spark standalone mode.