RDD Collect returns empty arrays

gaganbm
I am seeing strange behavior with RDDs.

All I want is to persist the contents of an RDD in a single file.

saveAsTextFile() writes one text file per partition, so I tried rdd.coalesce(1, true).saveAsTextFile() instead. This fails with the exception:

org.apache.spark.SparkException: Job aborted: Task 75.0:0 failed 1 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)

Then I tried collecting the RDD contents into an array and writing the array to a file manually. That fails too: collect gives me empty arrays even though the data is there.

// This writes the data out as multiple text files, so the data is definitely there.
rdd.saveAsTextFile(resultDirectory)
// This prints size 0 for every RDD in the stream. Why?
val arr = rdd.collect()
println("SIZE of RDD " + rdd.id + " " + arr.length)
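
For context, the manual write step could be sketched roughly as follows. This is only a minimal illustration, assuming the collected elements are strings and the target is a local file on the driver. SingleFileWriter and writeLines are hypothetical names made up for this sketch, not Spark APIs.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Spark: writes an already-collected array
// to one local file, one element per line.
object SingleFileWriter {
  def writeLines(lines: Array[String], target: Path): Path =
    // Files.write creates the file (or truncates an existing one) and
    // terminates each element with the platform line separator.
    Files.write(target, java.util.Arrays.asList(lines: _*), StandardCharsets.UTF_8)
}
```

It would be called with something like SingleFileWriter.writeLines(rdd.map(_.toString).collect(), path). Since collect brings every element to the driver, this only makes sense when the RDD fits in driver memory.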

Kindly help! I have no idea how to proceed.