Problem with HBase external table on freshly created EMR cluster

I created an EMR cluster with Spark and HBase, using the --hbase flag to include HBase. Although Spark and Shark both work nicely with the provided S3 examples, there is a problem with external tables that point to the HBase instance.

I created the following external table with Shark:

CREATE EXTERNAL TABLE oh (id STRING, name STRING, title STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.zookeeper.quorum" = "", "" = "2181", "hbase.columns.mapping" = ":key,o:OH_Name,o:OH_Title")
TBLPROPERTIES ("" = "objects")

The objects table exists in HBase and has all the columns referenced in the DDL.
The ZooKeeper instance for HBase is running on the specified hostname and port.
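
For anyone who wants to double-check the ZooKeeper point from a worker node, here is a minimal sketch in Python. It assumes the ZooKeeper four-letter command "ruok" is enabled on the server (newer ZooKeeper releases may require it to be whitelisted). The function name zookeeper_is_ok is only illustrative, and the hostname has to be filled in by hand. A healthy server answers "imok" and closes the connection.

```python
import socket


def zookeeper_is_ok(host, port=2181, timeout=5.0):
    """Return True if a ZooKeeper server at host:port answers 'ruok' with 'imok'.

    Assumption: the 'ruok' four-letter command is enabled on the server.
    Any connection error or timeout is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            while True:
                data = sock.recv(16)
                if not data:  # server closed the connection after replying
                    break
                chunks.append(data)
            return b"".join(chunks) == b"imok"
    except OSError:
        return False


# Example call (replace the placeholder with the real quorum host):
# zookeeper_is_ok("<zookeeper-host>", 2181)
```

A True result only shows that the port is reachable and the server is responding. It says nothing about whether the Spark workers can deserialize the HBase input splits.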

CREATE TABLE oh_cached AS SELECT * FROM OH fails with the following error:

org.apache.spark.SparkException: Job aborted: Task 11.0:0 failed more than 4 times
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
        at org.apache.spark.scheduler.DAGScheduler$$anon$

The log files of the Spark workers are almost empty. However, the stage information in the Spark web console gives an additional hint:

 0 4 FAILED NODE_LOCAL ip-172-31-10-246.ec2.internal 2014/03/05 13:38:20 java.lang.IllegalStateException (java.lang.IllegalStateException: unread block data)$BlockDataInputStream.setBlockDataMode(