Problem with HBase external table on freshly created EMR cluster


phil3k
Hi!

I created an EMR cluster with Spark and HBase according to http://aws.amazon.com/articles/4926593393724923, using the --hbase flag to include HBase. Although Spark and Shark both work nicely with the provided S3 examples, there is a problem with external tables that point to the HBase instance.

We create the following external table with shark:

CREATE EXTERNAL TABLE oh (id STRING, name STRING, title STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.zookeeper.quorum" = "172.31.13.161",
  "hbase.zookeeper.property.clientPort" = "2181",
  "hbase.columns.mapping" = ":key,o:OH_Name,o:OH_Title"
)
TBLPROPERTIES ("hbase.table.name" = "objects")

The objects table exists and has all of the columns defined in the DDL.
The ZooKeeper instance for HBase is running on the specified host and port.
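(For completeness, the table can be double-checked from the HBase shell on the master node; the table and column-family names here are the ones from the DDL above:)

describe 'objects'
scan 'objects', { LIMIT => 1 }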

CREATE TABLE oh_cached AS SELECT * FROM OH fails with the following error:

org.apache.spark.SparkException: Job aborted: Task 11.0:0 failed more than 4 times
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)

The log files of the Spark workers are almost empty; however, the stage details in the Spark web console reveal an additional hint:

 0 4 FAILED NODE_LOCAL ip-172-31-10-246.ec2.internal 2014/03/05 13:38:20 java.lang.IllegalStateException (java.lang.IllegalStateException: unread block data)
   java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2420)
   java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1380)
   java.io.ObjectInputStream.skipCustomData(ObjectInputStream.java:1954)
   java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1848)
   java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794)
   java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
   java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
   org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
   org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:199)
   org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
   org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:724)