native-lzo / gpl lib


leosandylh@gmail.com
Hi,
    I am running a query from Shark that reads compressed data from HDFS, but Spark can't find the native-lzo library.
 
14/01/08 22:58:21 ERROR executor.Executor: Exception in task ID 286
java.lang.RuntimeException: native-lzo library not available
at com.hadoop.compression.lzo.LzoCodec.getDecompressorType(LzoCodec.java:175)
at org.apache.hadoop.hive.ql.io.CodecPool.getDecompressor(CodecPool.java:122)
at org.apache.hadoop.hive.ql.io.RCFile$Reader.init(RCFile.java:1299)
at org.apache.hadoop.hive.ql.io.RCFile$Reader.<init>(RCFile.java:1139)
at org.apache.hadoop.hive.ql.io.RCFile$Reader.<init>(RCFile.java:1118)
at org.apache.hadoop.hive.ql.io.RCFileRecordReader.<init>(RCFileRecordReader.java:52)
at org.apache.hadoop.hive.ql.io.RCFileInputFormat.getRecordReader(RCFileInputFormat.java:57)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:93)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:83)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:36)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:29)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
at org.apache.spark.rdd.MapPartitionsWithIndexRDD.compute(MapPartitionsWithIndexRDD.scala:40)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
at org.apache.spark.rdd.MapPartitionsWithIndexRDD.compute(MapPartitionsWithIndexRDD.scala:40)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
at org.apache.spark.scheduler.ResultTask.run(ResultTask.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
 
Can anyone give me a hint?
 
Thank you!
 


Re: native-lzo / gpl lib

Andrew Ash
To get Shark working on LZO files (I have it up and running with CDH 4.4.0), you first need the hadoop-lzo jar on the classpath for Shark (and Spark).  Unlike Hadoop, which can fall back to non-native codecs when it can't find the native ones, hadoop-lzo seems to require its native code component.  So you'll need to add hadoop-lzo's native component to the library path too.

Here's an excerpt from my Puppet module that does these things.  Edit accordingly and put these two lines into your shark-env.sh:

export SPARK_LIBRARY_PATH="<%= scope['common::masterBaseDir'] %>/hadoop-current/lib/native/"
export SPARK_CLASSPATH="<%= scope['common::masterBaseDir'] %>/hadoop-current/lib/hadoop-lzo.jar"
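For reference, here's roughly what those lines look like once the Puppet templating is resolved; a minimal sketch assuming Hadoop lives under /opt/hadoop-current (that path is an assumption, adjust it to your own layout):

# shark-env.sh -- concrete paths are assumptions; point them at your install
export SPARK_LIBRARY_PATH="/opt/hadoop-current/lib/native/"
export SPARK_CLASSPATH="/opt/hadoop-current/lib/hadoop-lzo.jar"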

And here's what I have in hadoop-current/lib/native:

[user@machine hadoop-current]$ ls
bin   hadoop-ant-2.0.0-mr1-cdh4.4.0.jar   hadoop-examples-2.0.0-mr1-cdh4.4.0.jar  hadoop-tools-2.0.0-mr1-cdh4.4.0.jar  lib      logs  webapps
conf  hadoop-core-2.0.0-mr1-cdh4.4.0.jar  hadoop-test-2.0.0-mr1-cdh4.4.0.jar      include                              libexec  sbin
[user@machine hadoop-current]$ ls lib/native/
libgplcompression.a  libgplcompression.la  libgplcompression.so  libgplcompression.so.0  libgplcompression.so.0.0.0  Linux-amd64-64
[user@machine hadoop-current]$
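If the jar and the libgplcompression files are all in place but you still get "native-lzo library not available", one thing worth checking (a common cause in my experience, not something confirmed in this thread) is whether the system LZO library that libgplcompression links against is actually installed.  A quick sketch:

# If liblzo2 is missing (e.g. the lzo/lzo2 package isn't installed),
# ldd will print "not found" next to it in the dependency list.
ldd lib/native/libgplcompression.so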


Does that help?

Andrew

