Task not serializable (java.io.NotSerializableException)

David Thomas
I'm trying to copy a file from HDFS to a temporary local directory within a map function, using a static method of FileUtil, and I get the error below. Is there a way to get around this?

org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.fs.Path
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
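
For reference: this exception typically means a Hadoop Path built on the driver was captured by the map closure, and org.apache.hadoop.fs.Path does not implement java.io.Serializable. A minimal sketch of the failing pattern and one common workaround (building the Path and FileSystem inside the closure); the RDD name and all paths here are only illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Failing pattern: srcPath is created on the driver and captured by the
// closure, but Path is not serializable, so the task cannot be shipped.
val srcPath = new Path("/hdfs/path/to/file")
rdd.map { x =>
  FileUtil.copy(FileSystem.get(new Configuration()), srcPath,
    new java.io.File("/tmp/local-copy"), false, new Configuration())
  x
}

// Workaround: construct the Configuration, FileSystem and Path inside the
// closure, so nothing non-serializable crosses from the driver to executors.
rdd.map { x =>
  val conf = new Configuration()
  FileUtil.copy(FileSystem.get(conf), new Path("/hdfs/path/to/file"),
    new java.io.File("/tmp/local-copy"), false, conf)
  x
}
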
Re: Task not serializable (java.io.NotSerializableException)

Andrew Ash
Do you want the files scattered across the local temp directories of all your machines or just one of them?  If just one, I'd recommend having your driver program execute hadoop fs -getmerge /path/to/files...  using Scala's external process libraries.
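
For what it's worth, a minimal sketch of that approach with scala.sys.process, run from the driver program; the input and output paths are only placeholders:

import scala.sys.process._

// Run "hadoop fs -getmerge" as an external process on the driver.
val exitCode = Seq("hadoop", "fs", "-getmerge", "/path/to/files", "/tmp/merged-output").!
if (exitCode != 0)
  sys.error(s"hadoop fs -getmerge failed with exit code $exitCode")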


Re: Task not serializable (java.io.NotSerializableException)

David Thomas
I want it to be available on all machines in the cluster.


Re: Task not serializable (java.io.NotSerializableException)

Andrew Ash
Do you want the full file on all the machines, or just to write the partitions that are already on each machine to disk?

If the latter, try rdd.saveAsTextFile("file:///tmp/mydata")
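
In context, assuming the data is already in an RDD (the input path below is illustrative), that would look like:

// Each task writes its own part-XXXXX file under /tmp/mydata on the local
// disk of whichever worker computed that partition.
val rdd = sc.textFile("hdfs:///path/to/files")
rdd.saveAsTextFile("file:///tmp/mydata")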


Re: Task not serializable (java.io.NotSerializableException)

David Thomas
The files in HDFS are pretty heavyweight, so I do not want to create an RDD out of them. Instead, I have another, lightweight RDD, and I want to apply a map function to it, within which I'll load the files onto local disk, perform some operations with the RDD elements against these files, and create another RDD.
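
A sketch of that pattern which also avoids the original serialization error: use mapPartitions and build the Configuration, FileSystem and Path inside the closure, so the copy happens once per partition on the executor. lightweightRdd, the HDFS path, the local directory and process() are all placeholders:

import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val result = lightweightRdd.mapPartitions { iter =>
  // Everything Hadoop-related is constructed here, inside the closure,
  // so nothing non-serializable is captured from the driver.
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val localDir = new File("/tmp/heavy-files")
  localDir.mkdirs()
  FileUtil.copy(fs, new Path("/hdfs/path/to/heavy/file"), localDir, false, conf)
  // ... process the partition's elements against the local copy ...
  iter.map(elem => process(elem, localDir))
}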

