SequenceFileRDDFunctions cannot be used outside of the spark package

Aureliano Buendia
Hi,

I'm trying to create a custom version of saveAsObjectFile(). However, I do not seem to be able to use SequenceFileRDDFunctions in my package.

I simply copy/pasted the saveAsObjectFile() body into my function:

out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
      .saveAsSequenceFile("output")


But that gives me this error:

value saveAsSequenceFile is not a member of org.apache.spark.rdd.RDD[(org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.BytesWritable)]
possible cause: maybe a semicolon is missing before `value saveAsSequenceFile'?
      .saveAsSequenceFile("output")

       ^

The Scala implicit conversion error is not of much help here, so I tried to apply an explicit conversion:

org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable, BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(serialize(x)))))
      .saveAsSequenceFile("output")


Giving me this error:

object SequenceFileRDDFunctions is not a member of package org.apache.spark.rdd
Note: class SequenceFileRDDFunctions exists, but it has no companion object.
    org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable, BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))

                         ^

Is this Scala compiler version-mismatch hell?

Re: SequenceFileRDDFunctions cannot be used outside of the spark package

Xuefeng Wu
Would you try it with new?
new org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable, BytesWritable)](out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(serialize(x)))))
      .saveAsSequenceFile("output")


And did you import org.apache.spark.SparkContext._ so that the implicit conversion works for you?
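
Roughly, a minimal self-contained sketch combining both suggestions -- the import alone is usually enough, because it puts the implicit conversion to SequenceFileRDDFunctions in scope. RDD[String], the path parameter and the plain Java-serialization helper below are only stand-ins for whatever element type and serialize function the original code uses:

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkContext._   // brings the implicit conversion to SequenceFileRDDFunctions into scope
import org.apache.spark.rdd.RDD

def saveAsCustomObjectFile(out: RDD[String], path: String): Unit = {
  // stand-in serializer: plain Java serialization of each batch of 10 elements
  val serialize: Array[String] => Array[Byte] = { batch =>
    val bytes = new java.io.ByteArrayOutputStream()
    val oos = new java.io.ObjectOutputStream(bytes)
    oos.writeObject(batch)
    oos.close()
    bytes.toByteArray
  }

  out.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
    .saveAsSequenceFile(path)   // resolved via the SparkContext._ import, no explicit wrapper needed
}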



--

~Yours, Xuefeng Wu/吴雪峰  敬上


Re: SequenceFileRDDFunctions cannot be used outside of the spark package

Imran Rashid
In reply to this post by Aureliano Buendia
Nice work tracking down the problems with the codec getting applied consistently. I think you're close to the fix; you just need to understand Scala's implicit resolution rules.

I'm not entirely sure what you mean when you say "I simply copy/pasted the saveAsObjectFile() body into my function" -- where does your function live?
Are you trying to modify SequenceFileRDDFunctions and then recompile your own version of Spark? Or are you trying to leave the spark package alone and add your own helper function elsewhere?

If you are modifying SequenceFileRDDFunctions, then you should just be able to drop another function in there no problem.  Just be sure you have the implicit conversions in scope when you try to apply them.  The way to do that is to "import org.apache.spark.SparkContext._".  In the SparkContext *object*, you'll notice a bunch of "implicit def"s -- by importing those, you are telling the scala compiler that it should try to apply those rules when searching for function definitions.

Your attempt at *explicit* conversion doesn't work because you aren't actually doing a conversion -- you are attempting to apply a function. What you have gets desugared to:

org.apache.spark.rdd.SequenceFileRDDFunctions.apply[(NullWritable, BytesWritable)](...)

You'll notice that the compiler even told you it was looking for an object called "SequenceFileRDDFunctions" but didn't find one. You want to create a new instance of the class, which you do by adding *new* in front.

new org.apache.spark.rdd.SequenceFileRDDFunctions[(NullWritable, BytesWritable)](...)

This is really confusing when you are new to Scala -- lots of companion objects have an "apply" method that acts like new, but SequenceFileRDDFunctions doesn't.
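
For example, here is a quick REPL illustration with made-up classes (nothing Spark-specific):

class Foo(val x: Int)              // no companion object: you must write new Foo(1)

class Bar(val x: Int)
object Bar {
  def apply(x: Int) = new Bar(x)   // companion apply: Bar(1) behaves like new Bar(1)
}

val a = new Foo(1)                 // fine
// val b = Foo(1)                  // does not compile: Foo has no companion apply
val c = Bar(1)                     // fine, desugars to Bar.apply(1)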


Then the second part is making your newly added functions available on all RDDs by implicit conversion. First, define a wrapper class with your new function, and a companion object with an implicit conversion:

import org.apache.spark.rdd.RDD

// a wrapper class that carries the new save method
class AwesomeRDD[T](self: RDD[T]) {
  def saveAwesomeObjectFile(path: String) {
    // put your def here
  }
}

// companion object holding the implicit conversion from RDD[T]
object AwesomeRDD {
  implicit def addAwesomeFunctions[T](rdd: RDD[T]) = new AwesomeRDD(rdd)
}


Then just import the implicit conversion wherever you want it:

class Demo {
  val rdd: RDD[String] = ...
  import AwesomeRDD._
  rdd.saveAwesomeObjectFile("/path")
}






Re: SequenceFileRDDFunctions cannot be used outside of the spark package

Aureliano Buendia
Thank you both for the clear explanation. Both using new and importing:

import org.apache.spark.SparkContext._

solved the problem.

When using SequenceFileRDDFunctions, IntelliJ only suggested:

import org.apache.spark.rdd.SequenceFileRDDFunctions

which was not helpful. I guess org.apache.spark.SparkContext._ had to be imported manually.




Re: SequenceFileRDDFunctions cannot be used outside of the spark package

deenar.toraskar
Hi Aureliano

If you have managed to get a custom version of saveAsObjectFile() that handles compression working, I would appreciate it if you could share the code. I have come across the same issue, and it would save me from having to reinvent the wheel.

Deenar
Reply | Threaded
Open this post in threaded view
|

Re: SequenceFileRDDFunctions cannot be used outside of the spark package

Aureliano Buendia



On Fri, Mar 21, 2014 at 5:53 PM, deenar.toraskar <[hidden email]> wrote:
> If you have managed to get a custom version of saveAsObjectFile() that handles
> compression working, I would appreciate it if you could share the code.

My problem was not about compression.


Re: SequenceFileRDDFunctions cannot be used outside of the spark package

Matei Zaharia
Administrator
In reply to this post by deenar.toraskar
To use compression here, you might just have to set the correct Hadoop settings in SparkContext.hadoopConf.
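
Something along these lines, assuming the accessor is sc.hadoopConfiguration and using the old mapred.* property names (the follow-ups below report that these global settings were not being picked up by the saveAs* methods at the time):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "compression-settings")

// global Hadoop output-compression properties (old mapred API names)
sc.hadoopConfiguration.set("mapred.output.compress", "true")
sc.hadoopConfiguration.set("mapred.output.compression.type", "BLOCK")
sc.hadoopConfiguration.set("mapred.output.compression.codec",
  "org.apache.hadoop.io.compress.GzipCodec")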

Matei



Re: SequenceFileRDDFunctions cannot be used outside of the spark package

deenar.toraskar
Matei

It turns out that saveAsObjectFile(), saveAsSequenceFile() and saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano found out in this post:

http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html

Deenar

Re: SequenceFileRDDFunctions cannot be used outside of the spark package

Aureliano Buendia
I think you bumped the wrong thread.

As I mentioned in the other thread:

saveAsHadoopFile only applies compression when the codec is available, and it does not seem to respect the global Hadoop compression properties.

I'm not sure if this is a feature or a bug in Spark.

If this is a feature, the docs should make it clear that the mapred.output.compression.* properties are read-only.
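
For comparison, a sketch of the explicit-codec route, assuming a Spark version whose saveAsSequenceFile overload accepts an optional compression codec (the saveCompressed helper name and the GzipCodec choice are just for illustration):

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// compression requested per call, rather than via the global
// mapred.output.compression.* properties mentioned above
def saveCompressed(pairs: RDD[(NullWritable, BytesWritable)], path: String): Unit = {
  pairs.saveAsSequenceFile(path, Some(classOf[GzipCodec]))
}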




Re: SequenceFileRDDFunctions cannot be used outside of the spark package

pradeeps8
Hi Aureliano,

I followed this thread to create a custom saveAsObjectFile.
The following is the code.
new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](
    saveRDD.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(serialize(x)))))
  .saveAsSequenceFile("objFiles")

But I get the following error when it is executed.

org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Any idea about this error, or is there anything wrong in that line of code?

Thanks,
Pradeep

Re: SequenceFileRDDFunctions cannot be used outside of the spark package

Sonal Goyal
What does your saveRDD contain? If you are using custom objects, they should be serializable.

Best Regards,
Sonal
Nube Technologies 







Re: SequenceFileRDDFunctions cannot be used outside of the spark package

pradeeps8
Hi Sonal,

There are no custom objects in saveRDD; it is of type RDD[(String, String)].

Thanks,
Pradeep