FlatMapGroupsFunction Without Running Out of Memory For Large Groups


ddukek
This question is more about a type of processing that I haven't found a good
solution for in Spark than it is about FlatMapGroupsFunction specifically.
However, FlatMapGroupsFunction serves as a good concrete example of what I'm
trying to describe.

In MapReduce, if I want to do something like replicate each record that maps
to a particular key some number of times (100, 1,000, 10,000, etc.), I can:
for each record I just set up a loop over the number of replications I want
and call context.write(outKey, outValue) to serialize out the data. The most
memory I would ever need is the size of one of the objects I'm replicating,
plus some input and output buffer overhead.
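
Roughly what I have in mind on the MapReduce side (a minimal sketch; the
Reducer class name, the Text key/value types, and the hard-coded replication
count are just placeholders, not my real job):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReplicatingReducer extends Reducer<Text, Text, Text, Text> {

    // Hypothetical setting; in practice this could be 100, 1,000, 10,000, ...
    private static final int REPLICATION_COUNT = 1000;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            for (int i = 0; i < REPLICATION_COUNT; i++) {
                // Each write is streamed to the output buffer, so only one
                // record is ever held in memory at a time.
                context.write(key, value);
            }
        }
    }
}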

Now, if I want to do this in Spark, there are a couple of ways I could try,
but I'm not sure any of them would hold up in the limit where I replicate my
data so much that I run an executor out of memory. The interface for
FlatMapGroupsFunction is a single call:
http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/function/FlatMapGroupsFunction.html
The return type is an Iterator<R>, which seems to imply that I would have to
materialize N replications of my data in memory just to return an iterator
over it. Is there a way to return data to Spark such that it buffers my
output records for as long as it has available memory and then spills to
disk itself? This isn't limited to the Dataset API; the API for the PySpark
RDD flatMap function has the same implications.
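
This is roughly how I picture having to write it with FlatMapGroupsFunction
(again just a sketch with placeholder names and a made-up replication count;
the point is that the list backing the returned iterator holds every
replicated record for the group at once):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapGroupsFunction;

public class ReplicatingGroupFunction
        implements FlatMapGroupsFunction<String, String, String> {

    // Hypothetical setting, same idea as the MapReduce example above.
    private static final int REPLICATION_COUNT = 1000;

    @Override
    public Iterator<String> call(String key, Iterator<String> values) {
        // Everything goes into an in-memory list before Spark sees any of
        // it, which is exactly the memory problem I'm asking about.
        List<String> out = new ArrayList<>();
        while (values.hasNext()) {
            String record = values.next();
            for (int i = 0; i < REPLICATION_COUNT; i++) {
                out.add(record);
            }
        }
        return out.iterator();
    }
}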

Again, I'm not trying to build the world's greatest data-cloning application;
it just serves as an example. Basically, I would like a memory-safe way to
let the framework handle my creating more data within a single task context
than an executor can hold.

Thanks in advance for any info people can provide. If you don't have an
answer, I'm happy to brainstorm potential solutions as well.


