[Spark Core] Why no spark.read.delta / df.write.delta?

[Spark Core] Why no spark.read.delta / df.write.delta?

0vbb

Hi there,

 

I’m just wondering whether there is any plan to implement read/write methods in DataFrameReader/DataFrameWriter for Delta, similar to e.g. Parquet?

 

For example, using PySpark, “spark.read.parquet” is available, but “spark.read.delta” is not (same for write).

In my opinion, “spark.read.delta” feels cleaner and more pythonic than “spark.read.format(‘delta’).load()”, especially when more options are involved, like “mode”.

 

Can anyone explain the reasoning behind this? Is it due to the Java nature of Spark?

From a pythonic point of view, I could also imagine a single read/write method that takes the format as an argument and kwargs for the format-specific options.
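
To illustrate (a rough sketch with a made-up path; only the format()/load() style and the format argument to load() exist today, “spark.read.delta” is hypothetical):

# what works today, assuming the Delta package is on the classpath
df = spark.read.format("delta").load("/data/events")
df.write.format("delta").mode("overwrite").save("/data/events")

# load() already accepts the format as an argument plus kwargs for options
df = spark.read.load("/data/events", format="delta")

# the hypothetical shorthand this question is about
df = spark.read.delta("/data/events")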

 

Best,

Michael

 

 

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

Jungtaek Lim-2
Hi,

"spark.read.<format>" is a "shorthand" for "built-in" data sources, not for external data sources. spark.read.format() is still an official way to use it. Delta Lake is not included in Apache Spark so that is indeed not possible for Spark to refer to.

Starting with Spark 3.0, the concept of a "catalog" was introduced: you can simply refer to a table through the catalog (if the external data source provides a catalog implementation) without specifying the format explicitly, since the catalog already knows it.

This session explains the catalog API and how the Cassandra connector leverages it. I see some external data sources starting to support catalogs, and within Spark itself there is ongoing work to support a catalog for JDBC.
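
As a rough PySpark sketch of the catalog route, using Delta Lake's documented catalog and extension classes (the table name is made up, and the Delta package still has to be on the classpath):

from pyspark.sql import SparkSession

# Register Delta's catalog implementation so its tables can be addressed by name.
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# The catalog knows the table's format, so no .format("delta") is needed.
df = spark.table("events")
df.write.mode("append").saveAsTable("events")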

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [Spark Core] Why no spark.read.delta / df.write.delta?

Enrico Minack

Though spark.read.<format> refers to "built-in" data sources, there is nothing that prevents 3rd party libraries from "extending" spark.read in Scala or Python. Since users know the Spark way of reading built-in data sources, it feels natural to hook 3rd party data sources into the same scheme, to give users a holistic and integrated feel.

One Scala example (https://github.com/G-Research/spark-dgraph-connector#spark-dgraph-connector):

import uk.co.gresearch.spark.dgraph.connector._
val triples = spark.read.dgraph.triples("localhost:9080")

and in Python:

from gresearch.spark.dgraph.connector import *
triples = spark.read.dgraph.triples("localhost:9080")
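
Under the hood this boils down to attaching a method (Python) or an implicit class (Scala) to the reader. A rough Python sketch of the general technique, using Delta as the example format (not the connector's actual code, and "delta" still requires the Delta package to be installed):

from pyspark.sql import DataFrameReader

def _read_delta(self, path):
    # Delegate to the official format()/load() API.
    return self.format("delta").load(path)

# Attach the shorthand so that spark.read.delta("/some/path") works after import.
DataFrameReader.delta = _read_delta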

I agree that 3rd parties should also support the official spark.read.format() and the new catalog approaches.

Enrico


Re: [Spark Core] Why no spark.read.delta / df.write.delta?

Jungtaek Lim-2
Sure. My point was that Delta Lake is also a 3rd party library, so there is no way for Apache Spark itself to add such a shorthand. Delta Lake has its own discussion group, and this request is better raised there.
