Parquet files from spark not readable in Cascading


Vikas Gandham-2

Hi,

 

When I tried reading Parquet data that was generated by Spark in Cascading, it threw the following error:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ""
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:103)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:47)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
    at cascading.util.Util.retry(Util.java:1044)
    at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:418)
    at java.util.ArrayList.get(ArrayList.java:431)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:83)
    at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:77)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:293)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)

 

This is mostly seen when the Parquet data has nested structures.

 

I didn't find any solution to this.

 

I see some JIRA issues like https://issues.apache.org/jira/browse/SPARK-10434 (Parquet compatibility/interoperability issues), where Parquet files generated by Spark 1.5 could not be read in Spark 1.4. This was fixed in later Spark versions, but was it fixed in Cascading?

 

I am not sure whether this has to do with the Parquet version, whether Cascading has a bug, or whether Spark is writing the Parquet files in a way that Cascading does not accept.

 

Note: I am trying to read Parquet with an Avro schema in Cascading.
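
For reference, a minimal sketch of how the read is wired up, assuming the parquet-cascading module's ParquetTupleScheme (the paths and the pass-through pipe are illustrative, not my exact job):

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import org.apache.parquet.cascading.ParquetTupleScheme;

public class ReadSparkParquet {
    public static void main(String[] args) {
        // Source tap over the directory of Parquet files that Spark wrote (hypothetical path).
        Tap source = new Hfs(new ParquetTupleScheme(), "hdfs:///data/spark_output");
        // Sink tap that writes the tuples back out as text, just to force a full read.
        Tap sink = new Hfs(new TextDelimited(), "hdfs:///data/cascading_output");

        Pipe pipe = new Pipe("copy");
        Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
        flow.complete(); // the ParquetDecodingException above is thrown while this runs
    }
}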

 

I have posted to the Cascading mailing list too.

 

 


--
Thanks
Vikas Gandham

Re: Parquet files from spark not readable in Cascading

java8964

I don't have experience with Cascading, but we saw a similar issue when importing data generated by Spark into Hive.


Did you try setting "spark.sql.parquet.writeLegacyFormat" to true?


https://stackoverflow.com/questions/44279870/why-cant-impala-read-parquet-files-after-spark-sqls-write
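
Something like this is what I mean; a minimal sketch against the Spark 2.x Java API (the input and output paths are hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class WriteLegacyParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("write-legacy-parquet")
                // Must be set before the write; it switches nested types (arrays, maps)
                // to the older parquet-hive style layout that non-Spark readers expect.
                .config("spark.sql.parquet.writeLegacyFormat", "true")
                .getOrCreate();

        Dataset<Row> df = spark.read().json("hdfs:///input/events"); // hypothetical input
        df.write().mode(SaveMode.Overwrite).parquet("hdfs:///output/events_parquet");
    }
}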






From: Vikas Gandham <[hidden email]>
Sent: Wednesday, November 15, 2017 2:30 PM
To: [hidden email]
Subject: Parquet files from spark not readable in Cascading
 


Re: Parquet files from spark not readable in Cascading

Vikas Gandham-2
I tried setting spark.sql.parquet.writeLegacyFormat to true, but the issue still persists.
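
For anyone checking whether the flag actually changes the layout, a minimal sketch that prints a part file's schema using the parquet-hadoop footer reader (pass the path of one file Spark wrote):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class PrintParquetSchema {
    public static void main(String[] args) throws Exception {
        // Path of a single part file written by Spark, passed on the command line.
        Path file = new Path(args[0]);
        ParquetMetadata footer = ParquetFileReader.readFooter(
                new Configuration(), file, ParquetMetadataConverter.NO_FILTER);
        // With writeLegacyFormat=true, nested array fields should show the older
        // 3-level group layout in the printed schema.
        System.out.println(footer.getFileMetaData().getSchema());
    }
}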

Thanks
Vikas Gandham
