Why does saveAsObjectFile() serialize Array[T] instead of T?

Aureliano Buendia
Hi,

Given an RDD[T] instance, saveAsObjectFile() passes an instance of Array[T] to serialize(), and not an instance of T:

  def saveAsObjectFile(path: String) {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
      .saveAsSequenceFile(path)
  }


Is this array mapping for efficiency reasons, or are there other reasons for this?
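
To make the mechanics concrete, here is a minimal sketch without Spark of what the grouped(10) step does to a single partition, and how the total serialized size compares with serializing each element on its own. Record, serialize() and batch() are hypothetical stand-ins I made up for this illustration (serialize() uses plain Java serialization, which I am assuming is close to what Utils.serialize does). It only shows one possible effect of the batching and does not establish why Spark does it.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object GroupedSerializationDemo {
  // Hypothetical element type standing in for T.
  case class Record(id: Int, name: String)

  // Plain Java serialization; assumed stand-in for Utils.serialize.
  def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  // Same shape as the mapPartitions step: chunk an iterator into arrays
  // of at most `size` elements.
  def batch(partition: Iterator[Record], size: Int = 10): Iterator[Array[Record]] =
    partition.grouped(size).map(_.toArray)

  def main(args: Array[String]): Unit = {
    val elements = (1 to 25).map(i => Record(i, s"record-$i"))

    // One sequence-file value per batch instead of one per element.
    val batches = batch(elements.iterator).toList
    println(s"batches: ${batches.size}, sizes: ${batches.map(_.length)}")

    // Each serialize() call opens a fresh stream, so the stream header and
    // class descriptor are written once per call.
    val batchedBytes = batches.map(b => serialize(b).length).sum
    val perElementBytes = elements.map(e => serialize(e).length).sum
    println(s"bytes when batched: $batchedBytes, bytes per element: $perElementBytes")
  }
}
```

With 25 elements this yields three batches of 10, 10 and 5 elements, so three values would be written instead of 25.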

I'm trying to use saveAsObjectFile() to serialize protobuf messages. Protobuf messages already come with a method that turns them into Array[Byte] (see here); that is, toByteArray() can be called on an instance of T. Considering this, how can a protobuf message instance be serialized in a custom version of saveAsObjectFile()?

Would it be a good idea to drop the array mapping, as in the following?

  def saveAsObjectFile(path: String) {
    this.map(x => (NullWritable.get(), new BytesWritable(x.toByteArray())))
      .saveAsSequenceFile(path)
  }
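
For completeness, this is roughly how I imagine the version without array mapping as a standalone helper rather than a method on RDD. It is only a sketch: ProtobufObjectFile and saveAsProtobufFile are hypothetical names, and I am assuming T can be bounded by com.google.protobuf.MessageLite so that toByteArray is available. It needs Spark, Hadoop and protobuf on the classpath, and I have not checked how the resulting file would be read back.

```scala
import com.google.protobuf.MessageLite
import org.apache.hadoop.io.{BytesWritable, NullWritable}
// Brings the implicit conversion for saveAsSequenceFile into scope on
// older Spark versions.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object ProtobufObjectFile {
  // Hypothetical helper: writes one sequence-file value per message,
  // using protobuf's own encoding instead of Java serialization.
  def saveAsProtobufFile[T <: MessageLite](rdd: RDD[T], path: String): Unit = {
    rdd.map(msg => (NullWritable.get(), new BytesWritable(msg.toByteArray)))
      .saveAsSequenceFile(path)
  }
}
```

Note that a file written this way holds raw protobuf bytes, so I assume it could not be read back with objectFile(), which expects Java-serialized arrays.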