/tmp fills up to 100GB when using a window function


/tmp fills up to 100GB when using a window function

Mihai Iacob
This code generates files under /tmp...blockmgr... which do not get cleaned up after the job finishes.
 
Is anything wrong with the code below? Or are there any known issues with Spark not cleaning up /tmp files?
 
from pyspark.sql import Window
from pyspark.sql.functions import rank

window = Window.partitionBy('***', 'date_str').orderBy(sqlDf['***'])

sqlDf = sqlDf.withColumn("***", rank().over(window))
df_w_least = sqlDf.filter("***=1")
 
 
 
Regards,
 
Mihai Iacob
DSX Local - Security, IBM Analytics

--------------------------------------------------------------------- To unsubscribe e-mail: [hidden email]

Re: /tmp fills up to 100GB when using a window function

Vadim Semenov
Spark doesn't remove intermediate shuffle files while they're still part of the same job.

On Mon, Dec 18, 2017 at 3:10 PM, Mihai Iacob <[hidden email]> wrote:


Re: /tmp fills up to 100GB when using a window function

Mihai Iacob
When does Spark remove them?
 
Regards,
 
Mihai Iacob
DSX Local - Security, IBM Analytics
 
 
----- Original message -----
From: Vadim Semenov <[hidden email]>
To: Mihai Iacob <[hidden email]>
Cc: user <[hidden email]>
Subject: Re: /tmp fills up to 100GB when using a window function
Date: Tue, Dec 19, 2017 9:46 AM
 

Re: /tmp fills up to 100GB when using a window function

Vadim Semenov
Not until an action (e.g. save/count/reduce) has completed, or unless you explicitly truncate the DAG by checkpointing.

Spark needs to keep all shuffle files because, if some task/stage/node fails, it then only needs to recompute the missing partitions from the parts that were already computed.

On Tue, Dec 19, 2017 at 10:08 AM, Mihai Iacob <[hidden email]> wrote:



Re: /tmp fills up to 100GB when using a window function

Gourav Sengupta
In reply to this post by Mihai Iacob
I do think that there is an option to set the temporary shuffle location to a particular directory. While working with EMR I set it to /mnt1/. Let me know in case you are not able to find it.

On Mon, Dec 18, 2017 at 8:10 PM, Mihai Iacob <[hidden email]> wrote:


Re: /tmp fills up to 100GB when using a window function

Vadim Semenov
Ah, yes, I missed that part.

It's `spark.local.dir`. From the Spark configuration docs:

spark.local.dir (default: /tmp)
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
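For example (the path and script name below are illustrative):

```shell
# Point Spark's scratch space (shuffle and spill files) at a larger,
# fast local volume instead of /tmp.
spark-submit \
  --conf spark.local.dir=/mnt1/spark-scratch \
  my_job.py

# Note: on YARN the cluster manager's LOCAL_DIRS setting overrides this,
# as does SPARK_LOCAL_DIRS on standalone/Mesos.
```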

On Wed, Dec 20, 2017 at 2:58 PM, Gourav Sengupta <[hidden email]> wrote: