FYI: The evolution on `CHAR` type behavior

FYI: The evolution on `CHAR` type behavior

Dongjoon Hyun-2
Hi, All.

Apache Spark has suffered from a known consistency issue in `CHAR` type behavior across its usages and configurations. However, the evolution has been moving gradually toward consistency inside Apache Spark, because we don't support `CHAR` officially. The following is a summary.

In 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives the following different result.
(`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to Hive behavior.)

    spark-sql> CREATE TABLE t1(a CHAR(3));
    spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
    spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;

    spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
    spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
    spark-sql> INSERT INTO TABLE t3 SELECT 'a ';

    spark-sql> SELECT a, length(a) FROM t1;
    a   3
    spark-sql> SELECT a, length(a) FROM t2;
    a   3
    spark-sql> SELECT a, length(a) FROM t3;
    a 2
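
(An added illustration, not from the original email: a minimal sketch of the Parquet fallback mentioned above, assuming the conf can be toggled per session with `SET`; the padded result is the expected Hive behavior rather than a captured output.)

    spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
    spark-sql> -- Expected: the Hive reader now handles t3, so CHAR(3) is padded on read.
    spark-sql> SELECT a, length(a) FROM t3;
    a   3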

Since 2.4.0, `STORED AS ORC` became consistent.
(`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive behavior.)

    spark-sql> SELECT a, length(a) FROM t1;
    a   3
    spark-sql> SELECT a, length(a) FROM t2;
    a 2
    spark-sql> SELECT a, length(a) FROM t3;
    a 2
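
(Similarly, an added sketch of the ORC fallback under the same assumptions; the padded result is the expected pre-2.4 behavior, not a captured output.)

    spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
    spark-sql> -- Expected: the Hive ORC reader pads CHAR(3) again, as in 1.6.x ~ 2.3.x.
    spark-sql> SELECT a, length(a) FROM t2;
    a   3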

Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became consistent.
(`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a fallback to Hive behavior.)

    spark-sql> SELECT a, length(a) FROM t1;
    a 2
    spark-sql> SELECT a, length(a) FROM t2;
    a 2
    spark-sql> SELECT a, length(a) FROM t3;
    a 2
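
(Again an added sketch, with a hypothetical table `t4`: the legacy conf only affects how new tables are created, so the table must be re-created; assuming the conf is settable per session, the old Hive behavior should come back.)

    spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=true;
    spark-sql> CREATE TABLE t4(a CHAR(3));  -- created as a Hive TEXT table again
    spark-sql> INSERT INTO TABLE t4 SELECT 'a ';
    spark-sql> SELECT a, length(a) FROM t4;
    a   3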

In addition, in 3.0.0, SPARK-31147 aims to ban the `CHAR/VARCHAR` types in the following syntax, to be safe.

    CREATE TABLE t(a CHAR(3));
    https://github.com/apache/spark/pull/27902

This email is sent out to inform you, based on the new policy we voted on.
The recommendation is to always use Apache Spark's native type `String`.

Bests,
Dongjoon.

References:
1. "CHAR implementation?", 2017/09/15
     https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax", 2019/12/06
    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

Re: FYI: The evolution on `CHAR` type behavior

rxin
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of both new and old users?

For old users, their old code that was working for char(3) would now stop working. 

For new users, depending on the underlying metastore, char(3) is either supported but different from ANSI SQL (which is not that big of a deal if we explain it), or not supported at all.


Re: FYI: The evolution on `CHAR` type behavior

Dongjoon Hyun-2
Hi, Reynold.

Please see the following for the context.

"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax"

I raised the above issue according to the new rubric, and the ban was the proposed alternative to reduce the potential issue.

Please give us your opinion since it's still a PR.

Bests,
Dongjoon.


Re: FYI: The evolution on `CHAR` type behavior

rxin
Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html : "Snowflake currently deviates from common CHAR semantics in that strings shorter than the maximum length are not space-padded at the end."

Re: FYI: The evolution on `CHAR` type behavior

Gourav Sengupta
Hi,

100% agree with Reynold.


Regards,
Gourav Sengupta


Re: FYI: The evolution on `CHAR` type behavior

Dongjoon Hyun-2
Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value silently? If so, I'm wondering why we reverted the `TRIM` functions.

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.



Re: FYI: The evolution on `CHAR` type behavior

rxin
I haven't spent enough time thinking about it to give a strong opinion, but this is of course very different from TRIM.

TRIM is a publicly documented function with two arguments, and we silently swapped the two arguments. And trim has also been quite commonly used for a long time.

CHAR is an undocumented data type without clearly defined semantics. It's not great that we are changing the value here, but the value is already fucked up. It depends on the underlying data source, and random configs that are seemingly unrelated (orc) would impact the value.

Re: FYI: The evolution on `CHAR` type behavior

Stephen Coy
In reply to this post by Dongjoon Hyun-2
Hi there,

I’m kind of new around here, but I have had experience with all of the so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server, as well as PostgreSQL.

They all support the notion of “ANSI padding” for CHAR columns - which means that such columns are always space padded, and they default to having this enabled (for ANSI compliance).

MySQL also supports it, but it defaults to leaving it disabled for historical reasons not unlike what we have here.

In my opinion we should push toward standards compliance where possible and then document where it cannot work.

If users don’t like the padding on CHAR columns then they should change to VARCHAR - I believe that was its purpose in the first place, and it does not dictate any sort of “padding”.

I can see why you might “ban” the use of CHAR columns where they cannot be consistently supported, but VARCHAR is a different animal and I would expect it to work consistently everywhere.
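
(An added sketch to illustrate the contrast, with a hypothetical table `t5`; it assumes Spark's VARCHAR, like the databases above, truncates to the maximum length but never pads, so the result should not depend on the data source.)

    spark-sql> CREATE TABLE t5(a VARCHAR(3)) STORED AS PARQUET;
    spark-sql> INSERT INTO TABLE t5 SELECT 'a ';
    spark-sql> -- 'a ' fits within VARCHAR(3) and is stored as-is: no padding, length 2.
    spark-sql> SELECT a, length(a) FROM t5;
    a 2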


Cheers,

Steve C


Re: FYI: The evolution on `CHAR` type behavior

Dongjoon Hyun-2
Thank you, Stephen and Reynold.

To Reynold.

The way I see the following is a little different.

      > CHAR is an undocumented data type without clearly defined semantics.

Let me describe it from an Apache Spark user's point of view.

Apache Spark started to provide `HiveContext` (and the `hql/hiveql` functions) at Apache Spark 1.x without much documentation. In addition, there still exists an effort to keep it alive in the 3.0.0 era.

       https://issues.apache.org/jira/browse/SPARK-31088
       Add back HiveContext and createExternalTable

Historically, we tried to make many SQL-based customers migrate their workloads from Apache Hive to Apache Spark through `HiveContext`.

Although Apache Spark didn't have good documentation about the inconsistent behavior among its data sources, Apache Hive has been providing its documentation, and many customers rely on that behavior.

      - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

At that time, in on-prem Hadoop clusters from well-known vendors, many existing huge tables had been created by Apache Hive, not Apache Spark, and Apache Spark was frequently used to boost SQL performance with its *caching*. This was true because Apache Spark was added to the Hadoop vendors' products later than Apache Hive.

Until the turning point at Apache Spark 2.0, we tried to catch up on features to be consistent, at least for Hive tables, between Apache Hive and Apache Spark, because the two SQL engines share the same tables.

Regarding the following, technically, while Apache Hive hasn't changed its existing behavior in this area, Apache Spark evolves inevitably by moving away from its original old behaviors one by one.

      >  the value is already fucked up

The following is the change log.

      - When we switched the default value of `convertMetastoreParquet` (at Apache Spark 1.2).
      - When we switched the default value of `convertMetastoreOrc` (at Apache Spark 2.4).
      - When we switched `CREATE TABLE` itself (changing `TEXT` tables to `PARQUET` tables at Apache Spark 3.0).
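
(An added sketch for checking which behavior a given session uses; it assumes all three are runtime confs visible to `SET`, and the values shown are illustrative defaults as of 3.0.0-preview2, not captured output.)

    spark-sql> SET spark.sql.hive.convertMetastoreParquet;
    spark.sql.hive.convertMetastoreParquet  true
    spark-sql> SET spark.sql.hive.convertMetastoreOrc;
    spark.sql.hive.convertMetastoreOrc  true
    spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled;
    spark.sql.legacy.createHiveTableByDefault.enabled  false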

To sum up, this has been a well-known issue in the community and among the customers.

Bests,
Dongjoon.


Re: FYI: The evolution on `CHAR` type behavior

Dongjoon Hyun-2
Ur, are you comparing the number of SELECT statements with TRIM to the number of CREATE statements with `CHAR`?

> I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders of magnitude higher usage than char.

We need to discuss more about what to do. This thread is exactly what I expected. :)

> BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was merely pointing out that if we deviate away from SQL standard in any way we are considered "wrong" or "incorrect". That argument itself is flawed when plenty of other popular database systems also deviate away from the standard on this specific behavior.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <[hidden email]> wrote:
BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was merely pointing out that if we deviate away from SQL standard in any way we are considered "wrong" or "incorrect". That argument itself is flawed when plenty of other popular database systems also deviate away from the standard on this specific behavior.
On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <[hidden email]> wrote:
I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders of magnitude higher usage than char.

