No space left on device

lordjoe

We are trying to run a job that previously ran on Spark 1.3 on a different cluster. The job was converted to Spark 2.3, and this is a new cluster. The job dies after completing about half a dozen stages with:

java.io.IOException: No space left on device

It appears that the nodes are using local storage as tmp.

I could use help diagnosing the issue and figuring out how to fix it.

Here are the Spark conf properties:
spark.driver.extraJavaOptions=-Djava.io.tmpdir=/scratch/home/int/eva/zorzan/sparktmp/
spark.master=spark://10.141.0.34:7077
spark.mesos.executor.memoryOverhead=3128
spark.shuffle.consolidateFiles=true
spark.shuffle.spill=false
spark.app.name=Anonymous
spark.shuffle.manager=sort
spark.storage.memoryFraction=0.3
spark.jars=file:/home/int/eva/zorzan/bin/SparkHydraV2-master/HydraSparkBuilt.jar
spark.ui.killEnabled=true
spark.shuffle.spill.compress=true
spark.shuffle.sort.bypassMergeThreshold=100
com.lordjoe.distributed.marker_property=spark_property_set
spark.executor.memory=12g
spark.mesos.coarse=true
spark.shuffle.memoryFraction=0.4
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=com.lordjoe.distributed.hydra.HydraKryoSerializer
spark.default.parallelism=360
spark.io.compression.codec=lz4
spark.reducer.maxMbInFlight=128
spark.hadoop.validateOutputSpecs=false
spark.submit.deployMode=client
spark.local.dir=/scratch/home/int/eva/zorzan/sparktmp
spark.shuffle.file.buffer.kb=1024
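
As a first diagnostic, here is a minimal sketch (Scala, Spark 2.x APIs; the partition count of 360 just mirrors spark.default.parallelism above) that reports the free space under spark.local.dir as seen from each executor. This should confirm whether the /scratch volume is really the one filling up:

import java.io.File
import java.net.InetAddress

import org.apache.spark.sql.SparkSession

object LocalDirSpace {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LocalDirSpace").getOrCreate()
    val sc = spark.sparkContext

    // Fall back to /tmp when spark.local.dir is unset -- the same default Spark uses.
    val localDir = sc.getConf.get("spark.local.dir", "/tmp")

    // One task per partition; each task reports its host name and the usable
    // space (in GB) under the shuffle/spill directory on that executor.
    val report = sc
      .parallelize(1 to 360, 360)
      .mapPartitions { _ =>
        val freeGb = new File(localDir).getUsableSpace / (1024L * 1024 * 1024)
        Iterator((InetAddress.getLocalHost.getHostName, freeGb))
      }
      .distinct()
      .collect()

    report.sortBy(_._1).foreach { case (host, gb) => println(s"$host: $gb GB free") }
    spark.stop()
  }
}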


--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: No space left on device

Vitaliy Pisarev
The last time I encountered this, I solved it by throwing more resources at it (a stronger cluster).
I was not able to understand the root cause, though. I'll be happy to hear deeper insight as well.

On Mon, Aug 20, 2018 at 7:08 PM, Steve Lewis <[hidden email]> wrote:

Re: No space left on device

Gourav Sengupta
Hi,

The best part about Spark is that it shows you which configuration to tweak as well. If you are using EMR, check that "spark.local.dir" points to the right location in the cluster. If a disk is mounted across all the systems with a common path (you can do that easily in EMR), then you can point the configuration at that disk location and thereby overcome the issue.
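
For example, "spark.local.dir" accepts a comma-separated list of directories, so (with hypothetical mount points, to be adjusted to your cluster) shuffle files can be spread across several large volumes in spark-defaults.conf:

spark.local.dir=/mnt/vol1/spark-tmp,/mnt/vol2/spark-tmp

Note that on some cluster managers this property is overridden by environment variables set by the manager itself: SPARK_LOCAL_DIRS in standalone and Mesos mode, LOCAL_DIRS on YARN.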

On another note, also try to see why the data is being written to disk: is there too much shuffle? Can you increase the shuffle memory, as suggested by the error message, using "spark.shuffle.memoryFraction"?

By any chance, have you switched from caching to persisting the data frames?
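
For reference, a minimal Scala sketch of the distinction (the input path is hypothetical, and an active SparkSession named "spark" is assumed). On Datasets, cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK), and any disk-backed storage level spills blocks under spark.local.dir, so switching levels can quietly move pressure onto the local disks:

import org.apache.spark.storage.StorageLevel

val df = spark.read.parquet("/data/input")  // hypothetical input path

// MEMORY_AND_DISK (the Dataset cache() default): partitions that do not fit
// in memory are written under spark.local.dir on each executor.
df.persist(StorageLevel.MEMORY_AND_DISK)

// MEMORY_ONLY keeps everything in memory and writes nothing to local disk;
// evicted partitions are recomputed instead.
// df.persist(StorageLevel.MEMORY_ONLY)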


Regards,
Gourav Sengupta



On Tue, Aug 21, 2018 at 12:04 PM Vitaliy Pisarev <[hidden email]> wrote:

Re: No space left on device

Vitaliy Pisarev
Documentation says that 'spark.shuffle.memoryFraction' was deprecated, but it doesn't say what to use instead. Any idea?

On Wed, Aug 22, 2018 at 9:36 AM, Gourav Sengupta <[hidden email]> wrote:

Re: No space left on device

Gourav Sengupta
Hi,

That was just one of the options, and not the first one. Is there any chance of trying out the other options mentioned? For example, pointing the shuffle storage area to a location with larger space, as sketched below?
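
On the deprecation question: since Spark 1.6, "spark.shuffle.memoryFraction" and "spark.storage.memoryFraction" only take effect when spark.memory.useLegacyMode=true; under the default unified memory manager the corresponding knob is "spark.memory.fraction". A minimal spark-defaults.conf sketch (the path is hypothetical and the value is only illustrative, not a recommendation):

spark.local.dir=/mnt/bigvolume/spark-tmp
spark.memory.fraction=0.6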

Regards,
Gourav Sengupta

On Wed, Aug 22, 2018 at 11:15 AM Vitaliy Pisarev <[hidden email]> wrote:
Documentation says that 'spark.shuffle.memoryFraction' was deprecated, but it doesn't say what to use instead. Any idea?
