How to create RDD from Java in-memory data?


How to create RDD from Java in-memory data?

wallacemann
I would like to construct an RDD from data I already have in memory as POJO objects.  Is this possible?  For example, is it possible to create an RDD from Iterable<String>?

I'm running Spark from Java as a stand-alone application.  The JavaWordCount example runs fine.  In the example, the initial RDD is populated from a text file.  In my use case, I'm streaming data from a database, but even this is hidden behind an interface which is essentially Iterable<String>.

What I am doing is so basic that I must not understand something obvious.  Thanks for any suggestions.

Re: How to create RDD from Java in-memory data?

wallacemann
I was right ... I was missing something obvious.  The answer to my question is to use JavaSparkContext.parallelize which works with List<T> or List<Tuple2<K,V>>.
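A minimal sketch of that approach, assuming the data arrives as an Iterable<String>. The class name IterableToList and the helper toList are hypothetical, not part of Spark. The Spark call itself is shown only as a comment so the snippet runs without Spark on the classpath; sc stands for an already-constructed JavaSparkContext.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IterableToList {
    /** Copies every element of the Iterable into a new ArrayList, in iteration order. */
    public static <T> List<T> toList(Iterable<T> source) {
        List<T> out = new ArrayList<>();
        for (T item : source) {
            out.add(item);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the database-backed Iterable<String> described in the question.
        Iterable<String> rows = Arrays.asList("alpha", "beta", "gamma");
        List<String> materialized = toList(rows);
        System.out.println(materialized); // prints [alpha, beta, gamma]

        // With Spark on the classpath, the list could then be handed over like this
        // (sc is an assumed, already-constructed JavaSparkContext):
        //   JavaRDD<String> rdd = sc.parallelize(materialized);
        // For key/value data, JavaSparkContext also offers parallelizePairs,
        // which takes a List<Tuple2<K,V>> and returns a JavaPairRDD<K,V>.
    }
}
```

Note that this copies the whole stream into driver memory before Spark sees it, so it only suits datasets that fit there.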

Re: How to create RDD from Java in-memory data?

Matei Zaharia
Administrator
Yeah, we could make it take Iterable too if that helped. What data structure did you have here?

Matei

On Mar 10, 2014, at 6:29 PM, wallacemann wrote:

> I was right ... I was missing something obvious.  The answer to my question
> is to use JavaSparkContext.parallelize which works with List<T> or
> List<Tuple2<K,V>>.


Re: How to create RDD from Java in-memory data?

wallacemann
The question would be whether or not Iterable would save memory.  

It's trivial for me to build a list out of my iterable.  If I understood the code correctly, Spark takes that List and converts it to an array, so I built an ArrayList out of the iterable in the hopes that Spark would use the underlying array structure natively in the RDD.  If that is the case, then no effort or memory has been wasted (at least for single node).  If (when) that is not the case, if the RDD throws away my array, then indeed it would be more efficient to pass Spark an iterable and let it build up whatever internal representation that it needs.

Our bigger picture is that we have a proprietary streaming system that streams rows of data (the "row" is a Java data structure) in the form of Iterable<ProprietaryRow>.  The Iterable<> may invoke other upstream Iterables, which may in turn invoke a streaming read from a database, file, etc.  So far we have been careful to avoid collecting the entire stream in memory unless absolutely necessary.  By experimenting with Spark and RDDs, we are taking the leap of collecting the entire dataset in (potentially distributed) memory to see if it can help us parallelize and scale.

Re: How to create RDD from Java in-memory data?

wallacemann
In reply to this post by Matei Zaharia
In a similar vein, it would be helpful to have an Iterable way to access the data inside an RDD.  The collect method takes everything in the RDD and puts it in a list, but this blows up memory.  Since everything I want is already inside the RDD, it should be easy to iterate over the content without replicating the array.
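To illustrate the idea, here is a rough plain-Java sketch of iterating partition by partition instead of collecting everything at once. It is not Spark code: PartitionIterator and the Supplier-based "partition loaders" are hypothetical stand-ins for whatever would fetch one partition's contents on demand. Only one partition's list is held at a time. Later Spark releases expose a toLocalIterator() method on RDDs that serves a similar purpose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/** Iterates over several partitions lazily, loading each one only when it is needed. */
public class PartitionIterator<T> implements Iterator<T> {
    private final Iterator<Supplier<List<T>>> partitions;
    private Iterator<T> current = Collections.emptyIterator();

    public PartitionIterator(List<Supplier<List<T>>> partitionLoaders) {
        this.partitions = partitionLoaders.iterator();
    }

    @Override
    public boolean hasNext() {
        // Load the next partition only once the current one is exhausted;
        // the loop also skips over empty partitions.
        while (!current.hasNext() && partitions.hasNext()) {
            current = partitions.next().get().iterator();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    public static void main(String[] args) {
        // Three hypothetical partitions, one of them empty.
        List<Supplier<List<String>>> loaders = new ArrayList<>();
        loaders.add(() -> Arrays.asList("a", "b"));
        loaders.add(() -> Collections.<String>emptyList());
        loaders.add(() -> Arrays.asList("c"));

        Iterator<String> it = new PartitionIterator<>(loaders);
        while (it.hasNext()) {
            System.out.println(it.next()); // prints a, b, c on separate lines
        }
    }
}
```

The trade-off is that partitions are fetched one after another, so peak memory is bounded by the largest partition rather than the whole dataset, at the cost of less parallelism on the consuming side.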

Re: How to create RDD from Java in-memory data?

Mark Hamstra
https://github.com/apache/incubator-spark/pull/421

Works pretty well, but really needs to be enhanced to work with AsyncRDDActions.


On Tue, Mar 11, 2014 at 4:50 PM, wallacemann wrote:

> In a similar vein, it would be helpful to have an Iterable way to access the
> data inside an RDD.  The collect method takes everything in the RDD and puts
> in a list, but this blows up memory.  Since everything I want is already
> inside the RDD, it could be easy to iterate over the content without
> replicating the array.






Re: How to create RDD from Java in-memory data?

wallacemann
Ah!  Thank you.  That'll work for now.