IndexOutOfBoundsException in Catalyst when doing multiple approxDistinctCount

AssafMendelson

Hi,

I am running a large number of aggregations on a DataFrame (without groupBy) to collect some statistics. As part of this, I compute approx_count_distinct(c, 0.01).
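For readers who want to picture the setup, here is a minimal sketch of how such a set of aggregations might be assembled. It is not the code from this post: the helper name `build_distinct_count_exprs`, the alias suffix, and the column names are hypothetical, and it only builds Spark SQL expression strings, so it runs without a Spark installation.

```python
from typing import List


def build_distinct_count_exprs(columns: List[str], rsd: float = 0.01) -> List[str]:
    """Build one Spark SQL approx_count_distinct expression per column.

    `rsd` is the maximum relative standard deviation passed as the second
    argument, matching the approx_count_distinct(c, 0.01) call in the post.
    """
    return [
        f"approx_count_distinct(`{name}`, {rsd}) AS `{name}_approx_distinct`"
        for name in columns
    ]


if __name__ == "__main__":
    # Hypothetical column names, for illustration only.
    for expr in build_distinct_count_exprs(["user_id", "country"]):
        print(expr)
```

Assuming a PySpark DataFrame named `df`, the resulting list could be passed as `df.selectExpr(*exprs)`, which evaluates all expressions as one global aggregation with no groupBy. Running that same statement twice would correspond to the scenario described here.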

Everything works fine, but when I run the same aggregation a second time (for each column), I get the following error:

[Stage 2:>                                                          (0 + 2) / 2]
[WARN] org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Error calculating stats of compiled class.
java.lang.IndexOutOfBoundsException: Index: 4355, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
    at org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
    at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
    at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
    at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
    at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:996)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:993)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:993)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:961)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1027)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1024)
    at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
    at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
    at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
    at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
    at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:906)
    at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:412)
    at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:366)
    at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:890)
    at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:130)
    at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:140)
    at org.apache.spark.sql.execution.aggregate.AggregationIterator.generateResultProjection(AggregationIterator.scala:235)
    at org.apache.spark.sql.execution.aggregate.AggregationIterator.<init>(AggregationIterator.scala:266)
    at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.<init>(SortBasedAggregationIterator.scala:39)
    at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
    at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

[WARN] org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Error calculating stats of compiled class.
java.lang.IndexOutOfBoundsException: Index: 768, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
    at org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
    at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
    at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
    at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
    at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:996)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:993)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:993)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:961)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1027)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1024)
    at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
    at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
    at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
    at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
    at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:906)
    at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:194)
    at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:890)
    at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection$.create(Projection.scala:182)
    at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection$.apply(Projection.scala:175)
    at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.<init>(SortBasedAggregationIterator.scala:98)
    at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
    at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

 

Has anyone run into this, or does anyone know how to fix it?

Thanks,

Assaf.

 
