IllegalArgumentException on calling KMeans.train()


bluejoe2008
What does this exception mean?
 
14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:398)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:372)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:366)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:366)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:389)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:269)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:268)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:268)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:267)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
 
My Spark version: 1.0.0
Java: 1.7
My code:
 
JavaRDD<Vector> docVectors = generateDocVector(...);
int numClusters = 20;
int numIterations = 20;
KMeansModel clusters = KMeans.train(docVectors.rdd(), numClusters, numIterations);
 
Another strange thing is that the mapPartitionsWithIndex() call in generateDocVector() is invoked 3 times...
 
2014-06-04
bluejoe2008

Re: IllegalArgumentException on calling KMeans.train()

Xiangrui Meng
Could you check whether the vectors have the same size? -Xiangrui
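
For example, a quick check along these lines (a minimal sketch, assuming the docVectors JavaRDD from your code) will show whether more than one vector size is present:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;

// Collect the distinct vector sizes in the RDD; KMeans requires a
// single common size, so a list with more than one entry signals the problem.
List<Integer> sizes = docVectors
        .map(new Function<Vector, Integer>() {
            public Integer call(Vector v) {
                return v.size();
            }
        })
        .distinct()
        .collect();
System.out.println("distinct vector sizes: " + sizes);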


Re: Re: IllegalArgumentException on calling KMeans.train()

bluejoe2008

Thank you, 孟祥瑞 (Xiangrui Meng)!
With your help I solved the problem.
 
I constructed SparseVectors in the wrong way. The first parameter of the constructor SparseVector(int size, int[] indices, double[] values) (http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/SparseVector.html) is the dimension of the whole vector; I mistook it for the length of the values array.
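
For anyone who hits the same thing, here is a minimal sketch of the mistake (the vocabulary size, indices, and values below are made up for illustration):

import org.apache.spark.mllib.linalg.SparseVector;

// Hypothetical document vector over a vocabulary of 10000 terms,
// with nonzero weights at term indices 3 and 42.
int vocabularySize = 10000;
int[] indices = new int[] { 3, 42 };
double[] values = new double[] { 1.5, 0.7 };

// Wrong: passing the number of nonzero values as the first argument.
// Each vector then reports a size equal to its own nonzero count, so
// vectors end up with different sizes and the size requirement in
// MLUtils.fastSquaredDistance fails, as in the stack trace above.
SparseVector wrong = new SparseVector(values.length, indices, values);

// Right: the first argument is the dimension of the whole vector.
SparseVector right = new SparseVector(vocabularySize, indices, values);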
 
2014-06-04
bluejoe2008
 