Still incompatible schemas

Still incompatible schemas

Hamish Whittal
Hi folks,

Thanks for the help thus far.

I'm trying to track down the source of this error:
  
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

when doing a message.show()

Basically I'm reading in a single Parquet file (to try to narrow things down).

I'm defining the schema in the beginning and loading the parquet with:
   message = spark\
             .read\
             .schema(myMessageSchema)\
             .format("parquet")\
             .option("mergeSchema", "true")\
             .option("badRecordsPath", "/tmp/badRecords/")\
             .load("hdfs:///user/hadoop/feb20/part-00000-c6da95c9-9c40-4623-a5c5-851188e236ff-c000.snappy.parquet")

[I've tried with and without the mergeSchema option.]
[Sidenote: I was hoping badRecordsPath would help with the truly bad records, but it seems to do nothing.]

I've also tried to cast the potentially problematic columns (Int, Long, Double, etc.) with

  message_1 = message\
    .withColumn('price', col('price').cast('double'))\
    .withColumn('price_eur', col('price_eur').cast('double'))\
    .withColumn('cost_usd', col('cost_usd').cast('double'))\
    .withColumn('adapter_status', col('adapter_status').cast('long'))

Yet I get this error and I can't figure out:
(a) whether it's some record WITHIN the parquet file that's causing it and
(b) if it is a single record (or a few records) then how do I find those particular records?
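One way to attack (b) is to narrow down which *file* is bad before worrying about records. The sketch below assumes a running SparkSession named `spark` and the `myMessageSchema` from above; the helper names are mine and the path list is illustrative. Each part file is read on its own and fully decoded, so only the files whose footer schema clashes with the expected one raise the error:

```python
def is_schema_error(message):
    # Heuristic match on the error text quoted above
    return ("UnsupportedOperationException" in message
            and "PlainValuesDictionary" in message)

def find_bad_files(read_one, paths):
    """read_one(path) should fully decode one file; collect the paths that fail."""
    bad = []
    for path in paths:
        try:
            read_one(path)
        except Exception as exc:          # Py4JJavaError in practice
            if is_schema_error(str(exc)):
                bad.append(path)
            else:
                raise
    return bad

# Usage sketch (not run here):
# read_one = lambda p: (spark.read.schema(myMessageSchema)
#                            .parquet(p)
#                            .foreach(lambda row: None))  # forces decoding
# print(find_bad_files(read_one, part_file_paths))
```

Reading one file at a time is slow, but it turns "somewhere in the prefix" into a concrete list of offending files.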

The previous time I encountered this, there were records that should have had doubles in them (like "price" above) but actually seemed to have nulls.

I did this to fix that particular problem:

if 'price' not in message.columns:
    # lit() comes from pyspark.sql.functions, not from the DataFrame
    message = message.withColumn('price', lit(0.0))

Any suggestions or help would be MOST welcome. I have also tried using pyarrow to inspect the Parquet schema, and it looks fine. That is, the schema in the Parquet file doesn't look like the problem - but of course I'm not ruling that out just yet.
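The pyarrow check can be pushed one step further. This sketch (my own helpers; it assumes pyarrow is installed and the part files are reachable) reads only each file's footer and groups the files by their stringified schema, so any odd one out shows up immediately:

```python
def group_by_schema(path_schema_pairs):
    """Group file paths by their stringified footer schema."""
    groups = {}
    for path, schema in path_schema_pairs:
        groups.setdefault(schema, []).append(path)
    return groups

def footer_schemas(paths):
    # pyarrow.parquet.read_schema touches only the footer, so this is cheap
    import pyarrow.parquet as pq
    return [(p, str(pq.read_schema(p))) for p in paths]

# Usage sketch (not run here):
# for schema, files in group_by_schema(footer_schemas(paths)).items():
#     print(len(files), "file(s) with schema:", schema)
```

A group with a single member is a strong candidate for the incompatible file.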

Thanks for any suggestions,

Hamish
--
Cloud-Fundis.co.za
Cape Town, South Africa
+27 79 614 4913

Re: Still incompatible schemas

Zahid Rahman

This issue has been discussed and resolved on this page:

https://issues.apache.org/jira/browse/SPARK-17557

One person suggests that simply reading the parquet file in a different way, as illustrated, may make the error go away. It appears to me you are reading the parquet file using the command line. Perhaps if you try it programmatically as suggested you may find a resolution.

"I encounter an issue when data resides in Hive as parquet format and when trying to read from Spark (2.2.1), facing the above issue. I notice that in my case there is a date field (containing values such as 2018, 2017) which is written as integer. But when reading in spark as -

val df = spark.sql("SELECT * FROM db.table") 

df.show(3, false)
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)

 
To my surprise when reading same data from s3 location as -
val df = spark.read.parquet("s3://path/file")
df.show(3, false) // this displays the results. "


¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}


On Mon, 9 Mar 2020 at 07:57, Hamish Whittal <[hidden email]> wrote:

Re: Still incompatible schemas

Hamish Whittal
Yeah, thanks Zahid for the reply; but that's not it.

I found two schemas that differ. So I have the sucker(s) now...but how to handle them?

In this case there are two columns: one is a Double and the other is a Decimal(19,5), which in Parquet seems to be represented as FIXED_LEN_BYTE_ARRAY:

price:                         OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(19,5) R:0 D:1
    vs
price:                         OPTIONAL DOUBLE R:0 D:1

(1) First thought is to cast the types after the load:
   message_1 = message\
        .withColumn('price', col('price').cast("double"))\
        .withColumn('price_eur', col('price_eur').cast("double"))

This seems to work if this is the only Parquet file being read from the prefix, i.e. if the file has the same schema as all the other files in that prefix. But it's not: there are other Parquet files in there that have the "correct" schema, so this approach borks. I somehow have to separate these files out.
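One way to do that separation, sketched under the assumption of a SparkSession `spark` (the PySpark calls are shown commented; `columns_to_cast` is my own helper and the type strings are Spark's `simpleString()` names such as "decimal(19,5)"): read each file individually without forcing a schema, cast any decimal column to double so every piece ends up with the same schema, then union the pieces back together.

```python
def columns_to_cast(fields):
    """From (name, simpleString) pairs, pick the decimal columns to cast."""
    return [name for name, type_string in fields
            if type_string.startswith("decimal")]

# Usage sketch (not run here):
# from functools import reduce
# from pyspark.sql.functions import col
# parts = []
# for p in paths:
#     df = spark.read.parquet(p)            # let each file keep its own schema
#     pairs = [(f.name, f.dataType.simpleString()) for f in df.schema.fields]
#     for name in columns_to_cast(pairs):
#         df = df.withColumn(name, col(name).cast("double"))
#     parts.append(df)
# message = reduce(lambda a, b: a.unionByName(b), parts)
```

The per-file cast happens before the union, so the decimal-vs-double mismatch never has to be reconciled inside one read.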

(2) Next, perhaps I can treat these FIXED_LEN_BYTE_ARRAY Parquet files as exceptions and deal with them independently (perhaps copy each file elsewhere and have a separate process handle them). That would be fine if I could figure out how to not die when I hit this error but instead handle it as an exception; I can't seem to figure out how to do that.
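The quarantine idea in (2) can be sketched like this (the helper and directory names are mine; `shutil.move` works for local paths, while hdfs:// paths would need the HDFS filesystem API or an `hdfs dfs -mv` subprocess instead): try each file, and on a read failure move it aside for a separate process rather than dying.

```python
import shutil

def quarantine_bad_files(read_one, paths, bad_dir):
    """read_one(path) should raise on a schema mismatch.

    Returns the paths that read cleanly; any file whose read fails is
    moved into bad_dir for independent handling later.
    """
    good = []
    for path in paths:
        try:
            read_one(path)
            good.append(path)
        except Exception:
            shutil.move(path, bad_dir)
    return good
```

As written this parks a file on *any* read failure; a real version might inspect the exception text first so that, say, a transient I/O error isn't quarantined alongside the genuine schema mismatches.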

More thoughts and suggestions are very welcome.

Thanks folks.

On Mon, Mar 9, 2020 at 11:42 AM Zahid Rahman <[hidden email]> wrote:


--
Cloud-Fundis.co.za
Cape Town, South Africa
+27 79 614 4913