Turning kryo on does not decrease binary output

Turning kryo on does not decrease binary output

Aureliano Buendia
Hi,

I'm trying to call saveAsObjectFile() on an RDD[(Int, Int, Double, Double)], expecting the binary output to be smaller, but it is exactly the same size as when Kryo is not enabled.

I've checked the log, and there is no trace of Kryo-related errors.

The code looks something like:

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.setRegistrationRequired(true)
    kryo.register(classOf[(Int, Int, Double, Double)])
  }
}
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "MyRegistrator")


At the end, I tried to call:

kryo.setRegistrationRequired(true)

to make sure my class gets registered. But I found errors like:

Exception in thread "DAGScheduler" com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered: scala.math.Numeric$IntIsIntegral$
Note: To register this class use: kryo.register(scala.math.Numeric$IntIsIntegral$.class);


It appears many Scala internal types have to be registered in order to get full Kryo support.
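To get past that particular error, the registrator ends up looking something like this sketch (the class name and list are illustrative, not exhaustive; with registration required you typically keep adding entries as new "Class is not registered" errors appear):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class FullRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.setRegistrationRequired(true)
    // The tuple class itself (note the comma between the two Doubles).
    kryo.register(classOf[(Int, Int, Double, Double)])
    // Scala singleton objects have no classOf; look the class up by name,
    // exactly as spelled in the "Class is not registered" message.
    kryo.register(Class.forName("scala.math.Numeric$IntIsIntegral$"))
  }
}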

Any idea why my simple tuple type doesn't get the Kryo benefits?

Re: Turning kryo on does not decrease binary output

Guillaume Pitel
Hi,

I believe Kryo is only used during RDD serialization (i.e. communication between nodes), not for saving. If you want to compress the output, you can use a Gzip or Snappy codec like this:

val codec = "org.apache.hadoop.io.compress.SnappyCodec" // for snappy
val codec = "org.apache.hadoop.io.compress.GzipCodec" // for gzip

System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec", codec)
System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress.type", "BLOCK")

(That's for HDP2; for HDP1 the keys are different.)
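For completeness, a minimal sketch of how these might be wired up, assuming the properties are set before the SparkContext is constructed (master and app name are placeholders):

import org.apache.spark.SparkContext

// Assumption: spark.hadoop.* properties are copied into the Hadoop
// configuration when the SparkContext is created, so set them beforehand.
System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
  "org.apache.hadoop.io.compress.GzipCodec")
System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress.type", "BLOCK")
val sc = new SparkContext("local[2]", "compression-example")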
Regards
Guillaume   



--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80 / +33(0)9 70 44 67 53

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
Re: Turning kryo on does not decrease binary output

Aureliano Buendia
Thanks for clarifying this.

I tried setting the Hadoop properties before constructing the SparkContext, but it had no effect.

Where is the right place to set these properties?



Re: Turning kryo on does not decrease binary output

Andrew Ash
For Hadoop properties, I find the most reliable way is to set them on a Configuration object and use a method on SparkContext that accepts that conf object.

From working code:

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def nlLZOfile(path: String) = {
    val conf = new Configuration
    conf.set("textinputformat.record.delimiter", "\n")
    sc.newAPIHadoopFile(path, classOf[com.hadoop.mapreduce.LzoTextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
}
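A sketch of the same idea on the output side, assuming a pair RDD of Writables named pairs and using the new-API sequence-file output format (the RDD, codec, and path are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
import org.apache.spark.SparkContext._  // pair RDD functions

// Put the compression keys on a Configuration and hand it to the save call
// (new API, hence the mapreduce.* keys).
val outConf = new Configuration
outConf.set("mapreduce.output.fileoutputformat.compress", "true")
outConf.set("mapreduce.output.fileoutputformat.compress.codec",
  "org.apache.hadoop.io.compress.GzipCodec")
outConf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK")

pairs.saveAsNewAPIHadoopFile(
  "hdfs:///tmp/compressed-out",
  classOf[NullWritable],
  classOf[BytesWritable],
  classOf[SequenceFileOutputFormat[NullWritable, BytesWritable]],
  outConf)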


Re: Turning kryo on does not decrease binary output

Aureliano Buendia
Andrew, according to http://stackoverflow.com/a/17241273/1136722 , what you described is the old way of doing this.



Re: Turning kryo on does not decrease binary output

Guillaume Pitel
In reply to this post by Aureliano Buendia
That's the right place. Maybe try with the HDP1 properties:

http://stackoverflow.com/questions/17241185/spark-standalone-mode-how-to-compress-spark-output-written-to-hdfs

About your Kryo error: you can use this if you want coverage of the Scala types: https://github.com/romix/scala-kryo-serialization

Guillaume
Re: Turning kryo on does not decrease binary output

Aureliano Buendia
Even 

someMap.saveAsTextFile("out", classOf[GzipCodec])

has no effect.

Also, I noticed that saving sequence files has no compression option (my original question was about compressing binary output).

Having said this, I still do not understand why Kryo cannot be helpful when saving binary output. Binary output uses Java serialization, which has a pretty hefty overhead.

How can Kryo be applied to T when calling RDD[T]#saveAsObjectFile()?



Re: Turning kryo on does not decrease binary output

Andrew Ash
Hi Aureliano,

First, check out the documentation the team has written up on using Kryo here: http://spark.incubator.apache.org/docs/latest/tuning.html -- specifically the Data Serialization and Serialized RDD Storage sections.

If you want RDDs to spill over to disk if they don't fit in memory (rather than be recalculated), then you must use the MEMORY_AND_DISK storage level -- http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence

That focus is on using Kryo for temporary RDD serialization though, not so much for storing long-term binary output.  It sounds like you're going to need to touch a little of the Hadoop APIs to get this working.
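For the serialized-storage part, a one-line sketch of what that looks like (rdd is a placeholder):

import org.apache.spark.storage.StorageLevel

// Persist partitions in serialized form (Kryo, if configured) and spill the
// ones that don't fit in memory to disk instead of recomputing them.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)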



Hope that helps you down the right path,
Andrew





Re: Turning kryo on does not decrease binary output

Guillaume Pitel
In reply to this post by Aureliano Buendia
Hi,

After a little more thought, I'm not sure anymore whether saveAsObjectFile uses the spark.hadoop.* properties.

Also, I made a mistake earlier: the use of *.mapred.* vs *.mapreduce.* does not depend on the Hadoop version you use, but on the API version you use.

So, I can assure you that if you use saveAsNewAPIHadoopFile with the spark.hadoop.mapreduce.* properties, the compression will be applied.

If you use saveAsHadoopFile, it should be used with the mapred.* properties.

If you use saveAsObjectFile to an HDFS path, I'm not sure whether the output is compressed.

Anyway, saveAsObjectFile should be used for small objects, in my opinion.
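For reference, a sketch of the old-API equivalents (the mapred.* key names, set before the SparkContext is created), in case that is the path saveAsHadoopFile / saveAsSequenceFile take:

// Sketch only: old-API compression keys, analogous to the mapreduce.* ones above.
System.setProperty("spark.hadoop.mapred.output.compress", "true")
System.setProperty("spark.hadoop.mapred.output.compression.codec",
  "org.apache.hadoop.io.compress.GzipCodec")
System.setProperty("spark.hadoop.mapred.output.compression.type", "BLOCK")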

Guillaume
Re: Turning kryo on does not decrease binary output

Aureliano Buendia
RDD only defines saveAsTextFile and saveAsObjectFile. I think saveAsHadoopFile and saveAsNewAPIHadoopFile belong to the older versions.

saveAsObjectFile definitely outputs Hadoop format.

I'm not trying to save big objects with saveAsObjectFile, I'm just trying to minimize the Java serialization overhead when saving to a binary file.

I can see Spark could benefit from something like https://github.com/twitter/chill in this matter.


Re: Turning kryo on does not decrease binary output

Andrew Ash
saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions, which uses some Scala magic to become available when you have an RDD[Key, Value].


Agreed, something like Chill would make this much easier for the default cases.


Re: Turning kryo on does not decrease binary output

Aureliano Buendia



On Fri, Jan 3, 2014 at 7:10 PM, Andrew Ash <[hidden email]> wrote:
saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions, which uses some Scala magic to become available when you have an RDD[Key, Value].


I see. So if my data is of RDD[Value] type, I cannot use compression? Why does it have to be RDD[Key, Value] in order to save it in Hadoop?

Also, doesn't saveAsObjectFile("hdfs://...") save data in Hadoop? This is confusing.

I'm only interested in saving data on S3 ("s3n://..."); does it matter whether I use saveAsHadoopFile or saveAsObjectFile?
 

Agreed, something like Chill would make this much easier for the default cases.

But what we need is something like chill-hadoop:

https://github.com/twitter/chill/tree/develop/chill-hadoop
 



Re: Turning kryo on does not decrease binary output

Guillaume Pitel
Actually, the interesting part of Hadoop files is the SequenceFile format, which allows the data to be split into blocks. Other files in HDFS are single blocks; they do not scale.

An ObjectFile cannot be naturally split.

Usually in Hadoop, when storing a sequence of elements instead of a sequence of (key, value) pairs, the trick is to store (key, null).

I don't know the most effective way to do that in Scala/Spark. Actually, that would be a good thing to add to RDD[U].
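A rough sketch of that trick in Spark terms, assuming the values are Java-serializable: pair each record with a NullWritable key so the sequence-file save methods become available (rdd and the output path are placeholders):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkContext._  // implicit pair / sequence-file RDD functions

// Serialize each value into a BytesWritable and use a NullWritable key,
// mirroring what saveAsObjectFile does internally.
val asPairs = rdd.map { v =>
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(v)
  oos.close()
  (NullWritable.get(), new BytesWritable(bos.toByteArray))
}
asPairs.saveAsSequenceFile("hdfs:///tmp/values-as-seq")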

Guillaume



Re: Turning kryo on does not decrease binary output

Aureliano Buendia



On Fri, Jan 3, 2014 at 7:26 PM, Guillaume Pitel <[hidden email]> wrote:
Actually, the interesting part of Hadoop files is the SequenceFile format, which allows the data to be split into blocks. Other files in HDFS are single blocks; they do not scale.

But the output of saveAsObjectFile looks like part-00000, part-00001, part-00002, ... It does output split data, making it scalable, no?
 

Re: Turning kryo on does not decrease binary output

Guillaume Pitel
I thought it didn't split the files. Seems I'm wrong. Maybe it's a matter of size then.

In this case, yes, it's scalable. After all, it's an RDD initially.


Re: Turning kryo on does not decrease binary output

Imran Rashid
In reply to this post by Aureliano Buendia
I think a lot of the confusion is cleared up with a quick look at the code:

https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L901

saveAsObjectFile is just a thin wrapper around saveAsSequenceFile: it uses a null key and the Java serializer.

If you want to use Kryo, just do the same thing yourself, but use the Kryo serializer in place of the Java one.
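A rough sketch of that (not an official API): reproduce saveAsObjectFile's grouping and null key, but serialize each chunk with Kryo. A real job would reuse Spark's configured KryoSerializer rather than a bare Kryo, which needs registrations or an instantiator strategy for Scala classes; rdd and the path are placeholders.

import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkContext._

val kryoSaved = rdd.mapPartitions(_.grouped(10).map(_.toArray)).map { chunk =>
  // Sketch only: one Kryo instance per chunk is wasteful but keeps the
  // example self-contained.
  val kryo = new Kryo()
  val bos = new ByteArrayOutputStream()
  val out = new Output(bos)
  kryo.writeClassAndObject(out, chunk)
  out.close()
  (NullWritable.get(), new BytesWritable(bos.toByteArray))
}
kryoSaved.saveAsSequenceFile("hdfs:///tmp/kryo-object-file")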






Re: Turning kryo on does not decrease binary output

Aureliano Buendia



Thanks!

But why is it that Hadoop compression doesn't work for saveAsObjectFile(), but does work (according to Guillaume) for saveAsHadoopFile()?
 





Re: Turning kryo on does not decrease binary output

Guillaume Pitel
Have you tried with the mapred.* properties? If saveAsObjectFile uses saveAsSequenceFile, maybe it uses the old API?

Guillaume
Re: Turning kryo on does not decrease binary output

Aureliano Buendia



Neither the spark.hadoop.mapred.* nor the spark.hadoop.mapreduce.* approach causes compression with saveAsObjectFile. (Using Spark 0.8.1.)
 


Re: Turning kryo on does not decrease binary output

Aureliano Buendia
It seems saveAsObjectFile(), saveAsSequenceFile() and saveAsHadoopFile() are written in a rather dirty and inconsistent way.

saveAsObjectFile calls saveAsSequenceFile, but does not pass the codec:

def saveAsObjectFile(path: String) {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
      .saveAsSequenceFile(path)
  }


so the codec is set to None:

def saveAsSequenceFile(path: String, codec: Option[Class[_ <: CompressionCodec]] = None) {
    ...
      self.saveAsHadoopFile(path, keyClass, valueClass, format, jobConf, codec)
    ...
  }

saveAsHadoopFile only applies compression when a codec is explicitly provided, and it does not seem to respect the global Hadoop compression properties:

def saveAsHadoopFile(
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: OutputFormat[_, _]],
      conf: JobConf = new JobConf(self.context.hadoopConfiguration),
      codec: Option[Class[_ <: CompressionCodec]] = None) {
    conf.setOutputKeyClass(keyClass)
    conf.setOutputValueClass(valueClass)
    // conf.setOutputFormat(outputFormatClass) // Doesn't work in Scala 2.9 due to what may be a generics bug
    conf.set("mapred.output.format.class", outputFormatClass.getName)
    for (c <- codec) {
      conf.setCompressMapOutput(true)
      conf.set("mapred.output.compress", "true")
      conf.setMapOutputCompressorClass(c)
      conf.set("mapred.output.compression.codec", c.getCanonicalName)
      conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString)
    }

    conf.setOutputCommitter(classOf[FileOutputCommitter])
    FileOutputFormat.setOutputPath(conf, SparkHadoopWriter.createPathFromString(path, conf))
    saveAsHadoopDataset(conf)
  }
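Given that, a sketch of a workaround: reproduce saveAsObjectFile's two-line body but pass the codec to saveAsSequenceFile yourself. Utils.serialize is Spark-internal, so a small Java-serialization helper stands in for it; rdd and the output path are placeholders.

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.SparkContext._

// Stand-in for the internal Utils.serialize.
def javaSerialize(obj: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj)
  oos.close()
  bos.toByteArray
}

// Same shape as saveAsObjectFile, but with an explicit compression codec.
rdd.mapPartitions(iter => iter.grouped(10).map(_.toArray))
  .map(x => (NullWritable.get(), new BytesWritable(javaSerialize(x))))
  .saveAsSequenceFile("hdfs:///tmp/objects-gz", Some(classOf[GzipCodec]))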



