[Spark SQL, intermediate+] possible bug or weird behavior of insertInto


[Spark SQL, intermediate+] possible bug or weird behavior of insertInto

Oldrich Vlasic
Hi,

I have encountered a weird and potentially dangerous behaviour of Spark concerning
partial overwrites of partitioned data. I'm not sure whether this is a bug or just an
abstraction leak. I have checked the Spark section of Stack Overflow and haven't found
any relevant questions or answers.

A full minimal working example is provided as an attachment. Tested on Databricks Runtime
7.3 LTS ML (Spark 3.0.1). Short summary:

Write a dataframe partitioned by a column using saveAsTable. Filter out part of the
dataframe, change some values (simulating a new increment of data), and write it again,
overwriting a subset of partitions using insertInto. This operation will either fail on
a schema mismatch or corrupt the data.

Reason: on the first write, the ordering of the columns is changed (the partition column
is moved to the end). The second write does not take this into account: Spark inserts
values into columns by their position, not their name. If the columns have different
types, the insert fails; if not, values are written to the wrong columns, corrupting
the data.
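The swap can be illustrated without Spark at all. The sketch below is a pure-Python model of the positional matching described above (column names and values are illustrative, not Spark API):

```python
# Pure-Python model of insertInto's positional matching (illustrative;
# the real behavior is Spark's, reproduced in the attached example).

# Target table schema after saveAsTable with partitionBy("part"):
# the partition column has been moved to the end.
table_schema = ["value", "part"]

# The incoming dataframe still has its original column order.
df_schema = ["part", "value"]
row = {"part": "2021-03-02", "value": 42}

# insertInto assigns the i-th dataframe column to the i-th table
# column, ignoring column names entirely.
inserted = {table_col: row[df_col]
            for table_col, df_col in zip(table_schema, df_schema)}

print(inserted)  # {'value': '2021-03-02', 'part': 42}
```

Here the types happen to differ, so a real insert would fail; with same-typed columns the swap would go unnoticed.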

My question: is this a bug or intended behaviour? Can anything be done to prevent it?
The issue can be avoided by doing a select with the schema loaded from the target table,
but a user who is unaware of it could face hard-to-track-down errors in the data.

Best regards,
Oldřich Vlašic


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Attachment: mve_insert_into_301.txt (2K)

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

srowen
I don't have a good answer here, but I seem to recall that this is because of SQL semantics, which follow column ordering, not naming, when performing operations like this. It may well be as intended.

On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <[hidden email]> wrote:

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

Russell Spitzer
Yep, this is the behavior for insertInto; the other write APIs do schema matching by name, I believe.

On Mar 3, 2021, at 8:29 AM, Sean Owen <[hidden email]> wrote:



Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

Oldrich Vlasic
Thanks for the reply! Is there anything that can be done, for example setting a config property? I'd like to prevent users (mainly data scientists) from falling victim to this.

From: Russell Spitzer <[hidden email]>
Sent: Wednesday, March 3, 2021 3:31 PM
To: Sean Owen <[hidden email]>
Cc: Oldrich Vlasic <[hidden email]>; user <[hidden email]>; Ondřej Havlíček <[hidden email]>
Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto
 


Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

Jeff Evans
Why not perform a df.select(...) before the final write to ensure a consistent ordering?
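A sketch of that select-based guard (the helper and table names below are illustrative, not part of Spark's API): derive the select order from the target table's schema and fail fast if the dataframe cannot satisfy it.

```python
def align_to_table(df_columns, table_columns):
    """Return the column order matching the target table, raising if
    the dataframe is missing any column the table expects."""
    missing = set(table_columns) - set(df_columns)
    if missing:
        raise ValueError(f"dataframe is missing columns: {sorted(missing)}")
    return list(table_columns)

# With Spark this would be applied roughly as (illustrative):
#   cols = align_to_table(df.columns, spark.table("target").columns)
#   df.select(*cols).write.insertInto("target", overwrite=True)
print(align_to_table(["part", "value"], ["value", "part"]))  # ['value', 'part']
```

Centralizing the reorder in a helper like this means the positional insert always sees columns in the table's own order, regardless of how the incoming dataframe was built.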

On Thu, Mar 4, 2021, 7:39 AM Oldrich Vlasic <[hidden email]> wrote:


Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

Oldrich Vlasic
That certainly is a solution if you know about the issue, and it's the one we used in the end.

I'm trying to find out whether there is a solution that would prevent users who don't know about the issue from accidentally corrupting data, something like an "enable strict schema matching" option.

From: Jeff Evans <[hidden email]>
Sent: Thursday, March 4, 2021 2:55 PM
To: Oldrich Vlasic <[hidden email]>
Cc: Russell Spitzer <[hidden email]>; Sean Owen <[hidden email]>; user <[hidden email]>; Ondřej Havlíček <[hidden email]>
Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto
 