Data exploration - spark-0.9.0 - local


taghrid
I am new to Spark and am going through the data exploration tutorial: http://ampcamp.berkeley.edu/big-data-mini-course/data-exploration-using-spark.html

I am running standalone on my local machine, and I am not sure whether I am starting it correctly to run locally. I launched it with "spark-shell local", as described in the FAQ page. The steps from the data exploration tutorial work fine up to step 6, "reduceByKey", where I get the following exception:
14/03/06 02:20:27 ERROR Executor: Exception in task ID 1
java.lang.ArrayIndexOutOfBoundsException: 3
        at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:16)
        at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:16)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
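
The step I ran is roughly the sketch below; I am paraphrasing from memory, so the data path, field layout, and column indices are my own guesses rather than the tutorial's exact code. As I understand it, each pagecounts line should split into four space-separated fields, and fields(3) would throw ArrayIndexOutOfBoundsException: 3 on any line that splits into fewer tokens:

val pagecounts = sc.textFile("data/pagecounts")

// Assumed line layout: "<project> <page-title> <hits> <bytes>".
// fields(3) fails on any line that splits into fewer than four tokens.
val bytesPerProject = pagecounts
  .map(line => line.split(" "))
  .map(fields => (fields(0), fields(3).toLong))
  .reduceByKey(_ + _)

// Guarded variant I am considering, which drops short lines before indexing
// (I have not confirmed whether the tutorial data is supposed to contain such lines):
val bytesPerProjectSafe = pagecounts
  .map(line => line.split(" "))
  .filter(fields => fields.length >= 4)
  .map(fields => (fields(0), fields(3).toLong))
  .reduceByKey(_ + _)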

I noticed that in the tutorial the returned data types are always spark.RDD, while what I am getting after each operation is org.apache.spark.rdd.RDD. I am guessing this has something to do with the out-of-bounds exception; the RDD indices and console line numbers shown in the tutorial also differ by one or two from the ones I see in my run. Example below:

From the tutorial:
spark.RDD[Array[java.lang.String]] = MappedRDD[3] at map at <console>:16

Returned in my run:
org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14


Thanks,
Taghrid