flatten RDD[RDD[T]]


Cosmin Radoi

I'm trying to flatten an RDD of RDDs. The straightforward approach:

a: RDD[RDD[Int]]
a flatMap { _.collect }

throws a java.lang.NullPointerException at org.apache.spark.rdd.RDD.collect(RDD.scala:602)

In a more complex scenario I also got:
Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)

So I guess this may be related to the context not being available inside the map.

Are nested RDDs not supported?

Thanks,

Cosmin Radoi
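
[Editor's note: nested RDDs are not supported. RDD operations can only be invoked on the driver; inside a `flatMap` closure the inner RDDs are deserialized on workers without a live `SparkContext`, which matches both the `NullPointerException` and the `NotSerializableException` above. The usual workaround is to keep the inner RDDs in a driver-side collection and union them. A minimal sketch, with illustrative names and a local context added for self-containment:]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FlattenSketch {
  def main(args: Array[String]): Unit = {
    // Local context for illustration; in the setting above, `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("flatten-sketch"))

    // Instead of an (unsupported) RDD[RDD[Int]], hold the inner RDDs
    // in an ordinary driver-side collection:
    val parts: Seq[RDD[Int]] =
      Seq(sc.parallelize(1 to 3), sc.parallelize(4 to 6))

    // SparkContext.union flattens them into a single RDD[Int] in one step.
    val flat: RDD[Int] = sc.union(parts)
    // Pairwise equivalent: parts.reduce(_ union _)

    println(flat.collect().sorted.mkString(","))
    sc.stop()
  }
}
```

`sc.union` builds one `UnionRDD` over all partitions at once, which avoids the deep lineage chain that repeated pairwise `union` calls would create for a long sequence.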


Re: flatten RDD[RDD[T]]

Josh Rosen


On Sun, Mar 2, 2014 at 5:37 PM, Cosmin Radoi <[hidden email]> wrote:

> I'm trying to flatten an RDD of RDDs. The straightforward approach:
>
> a: RDD[RDD[Int]]
> a flatMap { _.collect }
>
> throws a java.lang.NullPointerException at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>
> In a more complex scenario I also got:
> Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>
> So I guess this may be related to the context not being available inside the map.
>
> Are nested RDDs not supported?
>
> Thanks,
>
> Cosmin Radoi