spark-avro aliases incompatible

spark-avro aliases incompatible

Gaspar Muñoz
Hi there, 

I use the Avro format to store historical data because of Avro's schema evolution. I manage the schemas externally and read them using the avroSchema option, so we have been able to add and delete columns.

The problem came when I introduced aliases: the Spark process didn't work as expected, and then I read in the spark-avro library documentation: "At the moment, it ignores docs, aliases and other properties present in the Avro file".

How do you manage aliases and column renaming? Is there any workaround?

Thanks in advance.

Regards

--
Gaspar Muñoz Soria

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473

Re: spark-avro aliases incompatible

Gourav Sengupta
Hi Gaspar,

Can you please provide details about the environment, versions, libraries and code snippets?

For example: Spark version, OS, distribution, whether it is running on YARN, and any other details.


Regards,
Gourav Sengupta


Re: spark-avro aliases incompatible

Gaspar Muñoz
Of course, 

right now I'm testing locally with Spark 2.2.0 and spark-avro 4.0.0. I've just uploaded a snippet: https://gist.github.com/gasparms/5d0740bd61a500357e0230756be963e1

Basically, my Avro schema has a field with an alias, and in the last part of the code spark-avro is not able to read old data, written under the old name, using the alias.

The spark-avro README says that this is not supported, so I am asking whether any of you has a workaround, or how you manage schema evolution.
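
For illustration only, here is a minimal sketch of the kind of setup described above. It is not taken from the gist: the record name `Event`, the field names `userId` and `user_id`, and the object and method names are hypothetical, and it assumes Avro, Spark 2.2.0 and spark-avro 4.0.0 are on the classpath.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.JavaConverters._

object AliasScenario {
  // Hypothetical reader schema: the field was originally written as "user_id"
  // and later renamed to "userId", keeping the old name as an alias.
  val readerSchemaJson: String =
    """{
      |  "type": "record",
      |  "name": "Event",
      |  "fields": [
      |    {"name": "userId", "type": "string", "aliases": ["user_id"]}
      |  ]
      |}""".stripMargin

  val readerSchema: Schema = new Schema.Parser().parse(readerSchemaJson)

  // The Avro library itself keeps the alias on the parsed field.
  val aliases: Set[String] =
    readerSchema.getField("userId").aliases().asScala.toSet

  // Read files with the externally managed schema via the avroSchema option.
  // According to the spark-avro documentation quoted in this thread, the
  // aliases in that schema are ignored, so old files are not matched by alias.
  def readWithSchema(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("com.databricks.spark.avro")
      .option("avroSchema", readerSchema.toString)
      .load(path)
}
```

The point of the sketch is that the alias is present in the parsed schema, so the information needed to map the old column name is available even though the data source does not use it.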

Regards.


Re: spark-avro aliases incompatible

Gourav Sengupta
Hi,

I may be wrong about this, but when you use format("....") you are basically using old Spark classes, which still exist for backward compatibility.

Please refer to the following documentation to take advantage of the recent changes in Spark: https://docs.databricks.com/spark/latest/data-sources/read-avro.html

Kindly let us know how it goes.

Regards,
Gourav Sengupta


Re: spark-avro aliases incompatible

Gaspar Muñoz
In the doc you refer to:

// import needed for the .avro method to be added
import com.databricks.spark.avro._

// The Avro records get converted to Spark types, filtered, and
// then written back out as Avro records
val df = spark.read.avro("/tmp/episodes.avro")
df.filter("doctor > 5").write.avro("/tmp/output")

Alternatively you can specify the format to use instead:

val df = spark.read
    .format("com.databricks.spark.avro")
    .load("/tmp/episodes.avro")

As far as I know, spark-avro is not built into Spark 2.x. That is not the problem anyway, because that Databricks doc also says: "At the moment, it ignores docs, aliases and other properties present in the Avro file."
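
Since the data source ignores aliases, one possible workaround is to apply them by hand after reading. The following is only a sketch under assumptions, not something confirmed in this thread: the object and method names are hypothetical, it handles top-level fields only, and it assumes the old files are read without the avroSchema option so that the columns carry the names they were written with.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.DataFrame
import scala.collection.JavaConverters._

object AliasRename {
  // Map every alias declared on a top-level field of the reader schema
  // to that field's current name.
  def aliasToName(readerSchema: Schema): Map[String, String] =
    readerSchema.getFields.asScala.flatMap { field =>
      field.aliases().asScala.map(alias => alias -> field.name())
    }.toMap

  // Keep only the renames that apply to the given columns, and skip a rename
  // when a column with the target name already exists.
  def renamePlan(readerSchema: Schema, columns: Seq[String]): Seq[(String, String)] = {
    val mapping = aliasToName(readerSchema)
    columns.flatMap { c =>
      mapping.get(c).filterNot(columns.contains).map(newName => c -> newName)
    }
  }

  // Rename aliased columns of a DataFrame to the names used in the reader schema.
  def applyAliases(df: DataFrame, readerSchema: Schema): DataFrame =
    renamePlan(readerSchema, df.columns.toSeq).foldLeft(df) {
      case (acc, (oldName, newName)) => acc.withColumnRenamed(oldName, newName)
    }
}
```

A possible usage would be to load the old files with `spark.read.format("com.databricks.spark.avro").load(...)`, pass the result through `AliasRename.applyAliases` together with the parsed reader schema, and then select the columns in the same order as the new data before combining both. Renamed fields inside nested records would need a recursive variant, which this sketch does not attempt.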

Regards.