spark master OOME from maxMbInFlight buffers

stephenh
Hey,

We have a Spark job that is OOMEing on the master, which we haven't
seen before.

The heap dump shows 70 byte[]s, owned by various Akka threads, each 48 MB
(3.3 GB total), which I assume comes from the maxMbInFlight value.

We have 30 slaves in the cluster, spark standalone, running Spark 0.9.

Is it expected to have this many Akka buffers around? I suppose if we
have 30 slaves, and 2 executors/slave, that would be 60
connections/threads. So if it overshoots to 70...
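As a rough sanity check of the numbers above, here is a minimal sketch of the arithmetic. It assumes the "48mb" buffers are 48 MiB each and that every executor holds exactly one connection; both are assumptions for illustration, not something the heap dump confirms.

```python
# Back-of-envelope check of the figures quoted in this post.
# Assumptions (not confirmed by the heap dump): 1 "mb" = 1 MiB,
# and one connection per executor.

SLAVES = 30
EXECUTORS_PER_SLAVE = 2
BUFFERS_SEEN = 70        # byte[] instances found in the heap dump
BUFFER_SIZE_MIB = 48     # size of each byte[]

expected_connections = SLAVES * EXECUTORS_PER_SLAVE
total_mib = BUFFERS_SEEN * BUFFER_SIZE_MIB
total_gib = total_mib / 1024

print(f"expected connections: {expected_connections}")
print(f"buffers seen:         {BUFFERS_SEEN}")
print(f"heap held by buffers: {total_mib} MiB (~{total_gib:.2f} GiB)")
```

Under those assumptions, 60 expected connections against 70 observed buffers leaves about 10 buffers unexplained, which is the overshoot in question.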

We can probably just lower the maxMbInFlight, as we're not pulling any
results back to the master anyway.

Does my reasoning make sense?

Thanks,
Stephen


Re: spark master OOME from maxMbInFlight buffers

stephenh

> We can probably just lower the maxMbInFlight, as we're not pulling any
> results back to the master anyway.

Well, right, I am now doubting my theory...

If the setting is about the reducer buffer, I'm confused why our master
would have any reducer buffers in the first place...

- Stephen


Re: spark master OOME from maxMbInFlight buffers

stephenh

> Well, right, I am now doubting my theory...

I have no solution, but I have at least found the OOME culprit:

14/02/14 01:18:05 INFO spark.MapOutputTrackerMaster: Size of output
statuses for shuffle 2 is 49504507 bytes

That is the ~49 MB byte[] that I have 60+ instances of (3.3 GB) floating
around in RAM, each one being put on the wire by Akka/Netty to a slave.
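To make the size estimate concrete, here is a small sketch that pulls the byte count out of the log line above and multiplies it by the number of copies. The regular expression and the helper names (`parse_status_size`, `total_gib`) are hypothetical and written only for this illustration; the 60 and 70 copy counts are the figures mentioned in this thread.

```python
import re

# The log line quoted above, joined back onto one line.
LOG_LINE = (
    "14/02/14 01:18:05 INFO spark.MapOutputTrackerMaster: Size of output "
    "statuses for shuffle 2 is 49504507 bytes"
)

# Hypothetical pattern matching only this message format.
PATTERN = re.compile(r"Size of output statuses for shuffle (\d+) is (\d+) bytes")


def parse_status_size(line):
    """Return (shuffle_id, size_in_bytes), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def total_gib(size_bytes, copies):
    """Memory held if `copies` separate copies of the array are alive at once."""
    return size_bytes * copies / 2**30


shuffle_id, size_bytes = parse_status_size(LOG_LINE)
print(f"shuffle {shuffle_id}: {size_bytes / 2**20:.1f} MiB per copy")
for copies in (60, 70):
    print(f"{copies} copies: ~{total_gib(size_bytes, copies):.2f} GiB")
```

The result lands in the same range as the 3.3 GB seen in the heap dump, which is consistent with one serialized copy of the statuses being held per outgoing connection.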

I'm a little surprised that Akka/Netty/"someone" couldn't do this with
some sort of zero-copy, although I guess that could get hairy with
NIO/concurrency/etc.

Given the comments around this code path already being space optimized,
I assume it's not surprising to have 50mb of output statuses...

Not sure what our job is doing to trigger this, but that's what I'm now
going to go look into.

Any insight into whether 50mb of map statuses, copied 60-70 times, could
be avoided would be appreciated. :-)

Thanks,
Stephen