Data exploration - spark-0.9.0 - local

I am new to Spark and I am going through the data exploration tutorial.

I am running standalone on my local machine, and I am not sure I am starting it correctly to run locally. I started it with "spark-shell local", as described in the FAQ page. The steps from the data exploration tutorial work fine up to step 6, "reduceByKey", where I get the following exception:
14/03/06 02:20:27 ERROR Executor: Exception in task ID 1
java.lang.ArrayIndexOutOfBoundsException: 3
        at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:16)
        at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:16)
        at scala.collection.Iterator$$anon$
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$
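
For what it's worth, an ArrayIndexOutOfBoundsException: 3 thrown inside a map usually means at least one input line splits into fewer than four fields (a header, a blank line, or a malformed record), so indexing field 3 blows up. I don't have the tutorial's exact parsing code in front of me, so the line format and field indices below are assumptions for illustration, but the guarded pattern looks like this in plain Scala (the same chain works on an RDD):

```scala
// Hypothetical input: one well-formed tab-separated row and one short row
// that would throw ArrayIndexOutOfBoundsException: 3 if indexed blindly.
val lines = Seq("2014-03-06\ten\tMain_Page\t42", "malformed line")

val counts = lines
  .map(_.split("\t"))
  .filter(_.length >= 4)                        // drop rows with too few fields
  .map(fields => (fields(1), fields(3).toInt))  // now indexing 3 is safe

println(counts)
```

Only the well-formed row survives the filter; the short row is dropped instead of crashing the task.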

I noticed that in the tutorial the returned data types are always spark.RDD, while what I get after each operation is org.apache.spark.rdd.RDD. I am guessing this has something to do with the out-of-bounds exception. Also, the RDD index and console line number in the tutorial's output are one or two higher than the ones I see in my run. Example below:

from the tutorial:
spark.RDD[Array[java.lang.String]] = MappedRDD[3] at map at <console>:16

returned in my run:
org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14