Out of memory caused by a high number of Spark submissions in FIFO mode


sunil_pp
Hi all,

I have written a small ETL Spark application that reads data from GCS, transforms it, and writes the results to another GCS bucket.
I am running this application for many different ids on a Spark cluster in Google's Dataproc, tweaking only the default configuration to use the FAIR scheduler with a FIFO queue via these settings (a fuller sketch of the two files follows below):
  in /etc/hadoop/conf/yarn-site.xml
  yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
  yarn.scheduler.fair.allocation.file = /etc/hadoop/conf/fair-scheduler.xml
  yarn.scheduler.fair.user-as-default-queue = false
  in /etc/hadoop/conf/fair-scheduler.xml, allocations as
  <queueMaxAppsDefault>1</queueMaxAppsDefault>
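
Written out in full, the two files look approximately like this (only the relevant properties; I'm reconstructing from memory since the cluster has been deleted):
'''
<!-- /etc/hadoop/conf/yarn-site.xml (relevant properties only) -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>false</value>
</property>

<!-- /etc/hadoop/conf/fair-scheduler.xml: allow at most one running app per queue -->
<allocations>
  <queueMaxAppsDefault>1</queueMaxAppsDefault>
</allocations>
'''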


The cluster:
  1 master  - 2 cores, 4 GB RAM
  2 workers - 4 cores, 16 GB RAM each
I tested with 5 Spark submissions and everything worked as expected: all the applications ran one after the other without any exceptions.

When I ran the same test with 100 submissions, some of them failed with out-of-memory errors. When I re-ran the failed submissions individually, they completed without any error.
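
For reference, each submission looked roughly like this (the class name, jar path, and id here are placeholders, not my real ones; I used the default client deploy mode):
'''
# hypothetical shape of one submission; the real jar, class, and id differ
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.EtlJob \
  path/to/etl.jar --id 42
'''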

The log of one submission that hit the out-of-memory error:
'''
20/06/05 19:44:23 INFO org.spark_project.jetty.util.log: Logging initialized @5463ms
20/06/05 19:44:24 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/06/05 19:44:24 INFO org.spark_project.jetty.server.Server: Started @5599ms
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
20/06/05 19:44:24 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@723f98fa{HTTP/1.1,[http/1.1]}{0.0.0.0:4045}
20/06/05 19:44:24 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/06/05 19:44:26 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at airf-m-2c-w-4c-4-faff-m/10.160.0.156:8032
20/06/05 19:44:27 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at airf-m-2c-w-4c-4-faff-m/10.160.0.156:10200
20/06/05 19:44:29 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1591383928453_0047
20/06/05 19:46:34 WARN org.apache.spark.sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
20/06/05 19:46:41 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Repairing batch of 24 missing directories.
20/06/05 19:46:44 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Successfully repaired 24/24 implicit directories.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000098200000, 46661632, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 46661632 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/9e22ca5b-5bf8-47b7-12ee-69cd9e37e7c8_spark_submit_20200605_82b0375c/hs_err_pid9917.log
Job output is complete
'''

Also, when I test-ran a single application, I never saw this log line:
  Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
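
If I understand this warning correctly, it means ports 4040-4044 were already taken by other driver UIs, i.e., several driver JVMs were running at once on the master. Next time I could try to confirm that with something like:
'''
# count driver JVMs currently running on the master node
# (assumes client deploy mode, so every spark-submit starts its driver here)
ps -ef | grep -c '[S]parkSubmit'

# and watch how much memory is left on the 4 GB master
free -m
'''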

I am very new to Spark and don't know which configurations might help to debug this; the log above didn't help either.
I lost the hs_err file when the cluster was deleted.
What can I do to debug this?
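
One thing I'm considering (a rough sketch, not something I have verified on Dataproc) is pointing the JVM crash report at a path I can copy off before deleting the cluster, and capping driver memory so 100 drivers can't exhaust the master's 4 GB:
'''
# sketch only: -XX:ErrorFile is a standard HotSpot flag; %p expands to the pid
spark-submit \
  --master yarn \
  --driver-memory 1g \
  --driver-java-options "-XX:ErrorFile=/var/tmp/hs_err_%p.log" \
  path/to/etl.jar --id 42
'''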
Thanks for taking the time to read this.