is spark.cleaner.ttl safe?

is spark.cleaner.ttl safe?

Michael Allman
Hello,

I've been trying to run an iterative Spark job that spills 1+ GB to disk per iteration on a system with limited disk space. I believe there would be enough space if Spark cleaned up unused data from previous iterations, but as it stands, the number of iterations I can run is limited by the available disk space.

I found a thread on the usage of spark.cleaner.ttl on the old Spark Users
Google group here:

https://groups.google.com/forum/#!topic/spark-users/9ebKcNCDih4

I think this setting may be what I'm looking for; however, the cleaner seems to delete data that's still in use. The effect is that I get bizarre exceptions from Spark complaining about missing broadcast data, or ArrayIndexOutOfBoundsException. When is spark.cleaner.ttl safe to use? Is it supposed to delete in-use data, or is this a bug/shortcoming?

Cheers,

Michael
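
For reference, spark.cleaner.ttl takes a duration in seconds and is enabled through the job configuration. A minimal sketch of turning it on with the SparkConf API of that era (the app name is arbitrary):

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable the periodic metadata cleaner with a one-hour TTL (in seconds).
    // As the replies below note, anything older than the TTL is deleted
    // even if it is still in use.
    val conf = new SparkConf()
      .setAppName("iterative-job")
      .set("spark.cleaner.ttl", "3600")
    val sc = new SparkContext(conf)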



Re: is spark.cleaner.ttl safe?

Mark Hamstra
Actually, TD's work in progress is probably closer to what you want: https://github.com/apache/spark/pull/126




Re: is spark.cleaner.ttl safe?

Aaron Davidson
And to answer your original question: spark.cleaner.ttl is not safe, for exactly the reason you brought up. The PR Mark linked is intended to provide a much cleaner (and safer) solution.




Re: is spark.cleaner.ttl safe?

Sourav Chandra
Yes, we are also facing the same problem. The workaround we came up with is:

 - store the broadcast variable's id when it is first created
 - then create a scheduled job which runs at every (spark.cleaner.ttl - 1 minute) interval and re-creates the same broadcast variable using the same id. This way Spark is happy to find the same broadcast file (broadcast_<id>); a fuller sketch of the scheduling follows below.

    import org.apache.spark.broadcast.HttpBroadcastFactory
    // Re-register under the original id so broadcast_<id> reappears.
    val httpBroadcastFactory = new HttpBroadcastFactory()
    httpBroadcastFactory.newBroadcast(bcastVariable.value, false, id)
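
A minimal sketch of how that scheduled refresh might be wired up (illustrative only; bcastVariable, id, and ttlSeconds are assumed to have been captured when the broadcast was created, per the steps above):

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.broadcast.HttpBroadcastFactory

    // Assumed available: the broadcast variable, its id, and the
    // configured spark.cleaner.ttl in seconds.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // Recreate broadcast_<id> one minute before the TTL fires.
        val factory = new HttpBroadcastFactory()
        factory.newBroadcast(bcastVariable.value, false, id)
      }
    }, ttlSeconds - 60, ttlSeconds - 60, TimeUnit.SECONDS)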





--
Sourav Chandra
Senior Software Engineer
Livestream
www.livestream.com