batching the output

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

batching the output

Vipul Pandey
Hi,

I need to batch the values in my final RDD before writing out to hdfs. The idea is to batch multiple "rows" in a protobuf and write those batches out - mostly to save some space as a lot of metadata is the same.
e.g. 1,2,3,4,5,6 just batch them (1,2), (3,4),(5,6) and save three records instead of 6

What I"m doing is that I'm using mapPartitions by using the grouped function of the iterator by giving it a groupSize.

    val protoRDD:RDD[MyProto] = rdd.mapPartitions[Profiles](_.grouped(groupSize).map(seq =>{
        val profiles = MyProto(...)
        seq.foreach(x =>{
          val row = new Row(x._1.toString)
          row.setFloatValue(x._2)
          profiles.addRow(row)
        })
        profiles
      })
    )
I haven't been able to test it out because of a separate issue (protobuf version mismatch - in a different thread)  - but i'm hoping it will work.

Is there a better/straight-forward way of doing this?

Thanks
Vipul
Reply | Threaded
Open this post in threaded view
|

Re: batching the output

Patrick Wendell
Ya this is a good way to do it.


On Sun, Mar 30, 2014 at 10:11 PM, Vipul Pandey <[hidden email]> wrote:
Hi,

I need to batch the values in my final RDD before writing out to hdfs. The idea is to batch multiple "rows" in a protobuf and write those batches out - mostly to save some space as a lot of metadata is the same.
e.g. 1,2,3,4,5,6 just batch them (1,2), (3,4),(5,6) and save three records instead of 6

What I"m doing is that I'm using mapPartitions by using the grouped function of the iterator by giving it a groupSize.

    val protoRDD:RDD[MyProto] = rdd.mapPartitions[Profiles](_.grouped(groupSize).map(seq =>{
        val profiles = MyProto(...)
        seq.foreach(x =>{
          val row = new Row(x._1.toString)
          row.setFloatValue(x._2)
          profiles.addRow(row)
        })
        profiles
      })
    )
I haven't been able to test it out because of a separate issue (protobuf version mismatch - in a different thread)  - but i'm hoping it will work.

Is there a better/straight-forward way of doing this?

Thanks
Vipul