I would like to construct an RDD from data I already have in memory as POJOs. Is this possible? For example, is it possible to create an RDD from Iterable<String>?
I'm running Spark from Java as a stand-alone application. The JavaWordCount example runs fine. In the example, the initial RDD is populated from a text file. In my use case, I'm streaming data from a database, but even this is hidden behind an interface which is essentially Iterable<String>.
What I am doing is so basic that I must not understand something obvious. Thanks for any suggestions.
The question would be whether or not Iterable would save memory.
It's trivial for me to build a list out of my iterable. If I understood the code correctly, Spark takes that List and converts it to an array, so I built an ArrayList out of the iterable in the hope that Spark would use the underlying array structure natively in the RDD. If that is the case, then no effort or memory has been wasted (at least on a single node). If (when) that is not the case, and the RDD throws away my array, then it would indeed be more efficient to pass Spark an iterable and let it build whatever internal representation it needs.
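For reference, a minimal sketch of the list-building step described above, using only the JDK. The class and method names (IterableLists, toList) are made up for illustration. The resulting List is the kind of argument that JavaSparkContext.parallelize(List<T>) accepts; the Spark call itself is shown only as a comment because it needs Spark on the classpath and a context variable (here called sc) that is assumed, not defined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spark: copies any Iterable into an ArrayList.
final class IterableLists {

    // Walks the iterable once and collects every element in order.
    static <T> List<T> toList(Iterable<T> source) {
        List<T> out = new ArrayList<>();
        for (T item : source) {
            out.add(item);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the real Iterable<String> that hides the database stream.
        Iterable<String> rows = Arrays.asList("a", "b", "c");
        List<String> list = toList(rows);

        // Assumption: with Spark available and a JavaSparkContext named sc,
        // the list could then be handed over like this:
        // JavaRDD<String> rdd = sc.parallelize(list);
        System.out.println(list);
    }
}
```

Note that this copies the whole stream into driver memory before Spark sees any of it, which is exactly the cost being asked about.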
Our bigger picture is that we have a proprietary streaming system that streams rows of data (a "row" is one of our own Java data structures) in the form of Iterable<ProprietaryRow>. The Iterable<> may invoke other upstream Iterables, which may in turn invoke a streaming read from a database, file, etc. So far we have been careful to avoid collecting the entire stream in memory unless absolutely necessary. By experimenting with Spark and RDDs, we are taking the leap of collecting the entire dataset in (potentially distributed) memory to see if it can help us parallelize and scale.
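One possible middle ground between streaming and collecting everything, sketched here with only the JDK: pull the stream in fixed-size batches, so that at most one batch sits in local memory at a time. The names RowBatches and nextBatch are hypothetical, and String stands in for ProprietaryRow. Whether each batch is then parallelized and combined into one RDD (for example with JavaRDD.union) is a design choice this sketch does not make.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Spark: drains an Iterator in fixed-size chunks.
final class RowBatches {

    // Pulls up to batchSize items from the iterator.
    // Returns an empty list once the iterator is exhausted.
    static <T> List<T> nextBatch(Iterator<T> rows, int batchSize) {
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && rows.hasNext()) {
            batch.add(rows.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        // Stand-in for the upstream Iterable<ProprietaryRow>.
        Iterator<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5").iterator();

        List<String> batch;
        while (!(batch = nextBatch(rows, 2)).isEmpty()) {
            // Assumption: each batch could be passed to sc.parallelize(batch) here.
            System.out.println(batch);
        }
    }
}
```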
In a similar vein, it would be helpful to have an Iterable way to access the data inside an RDD. The collect method takes everything in the RDD and puts it in a list, but this blows up memory. Since everything I want is already inside the RDD, it should be easy to iterate over the content without replicating the array.