Why doesn't Apache Spark use Calcite?

newroyker
Was there a qualitative or quantitative benchmark done before the design
decision was made not to use Calcite?

Are there limitations (for heuristic-based, cost-based, *-aware optimizers)
in Calcite, or in frameworks built on top of Calcite, in the context of big
data / TPC-H benchmarks?

I was unable to dig up anything concrete from the user group or Jira. I'd
appreciate it if any Catalyst veteran here could give me pointers; I'm
trying to defend Spark/Catalyst.


Re: Why doesn't Apache Spark use Calcite?

jasonnerothin@gmail.com
The implementation they chose supports push-down predicates, Datasets, and other features that are not available in Calcite:

https://databricks.com/glossary/catalyst-optimizer
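
For a concrete sketch of what that looks like on the Spark side (the file path and column names below are hypothetical), a Dataset filter over a Parquet source surfaces as "PushedFilters" in the physical plan:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pushdown-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Catalyst pushes the predicate into the Parquet scan itself, so
    // non-matching row groups can be skipped at the source.
    val errors = spark.read.parquet("/tmp/events.parquet") // hypothetical path
      .filter($"status" === "ERROR")
      .select("ts", "status")

    // The physical plan should list the predicate under "PushedFilters".
    errors.explain(true)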




--
Thanks,
Jason

Re: Why doesn't Apache Spark use Calcite?

Michael Mior
It's fairly common for adapters (Calcite's abstraction of a data
source) to push down predicates. However, the API certainly looks a
lot different from Catalyst's.
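
For comparison, here is a minimal sketch of the Calcite side (the class and schema are hypothetical, and the actual RexNode translation is elided): an adapter opts into predicate push-down by implementing the FilterableTable interface and removing from the filter list whatever it evaluates itself:

    import java.util.{List => JList}

    import org.apache.calcite.DataContext
    import org.apache.calcite.linq4j.{Enumerable, Linq4j}
    import org.apache.calcite.rel.`type`.{RelDataType, RelDataTypeFactory}
    import org.apache.calcite.rex.RexNode
    import org.apache.calcite.schema.FilterableTable
    import org.apache.calcite.schema.impl.AbstractTable
    import org.apache.calcite.sql.`type`.SqlTypeName

    // Hypothetical two-column in-memory table.
    class InMemoryTable(rows: Seq[Array[AnyRef]]) extends AbstractTable with FilterableTable {

      override def getRowType(typeFactory: RelDataTypeFactory): RelDataType =
        typeFactory.builder()
          .add("ID", SqlTypeName.INTEGER)
          .add("NAME", SqlTypeName.VARCHAR)
          .build()

      // Calcite passes the conjunctive predicates in `filters`; any RexNode the
      // adapter removes from the (mutable) list becomes its responsibility to
      // evaluate, and Calcite applies the rest on top.
      override def scan(root: DataContext, filters: JList[RexNode]): Enumerable[Array[AnyRef]] = {
        // A real adapter would inspect each RexNode (e.g. `ID = 42`), translate
        // the ones it understands, and delete them from `filters`. This sketch
        // pushes nothing down and returns every row.
        Linq4j.asEnumerable(rows.toArray)
      }
    }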
--
Michael Mior
[hidden email]



Re: Why doesn't Apache Spark use Calcite?

Matei Zaharia
I’m pretty sure that Catalyst was built before Calcite, or at least in parallel. Calcite 1.0 was only released in 2015. From a technical standpoint, building Catalyst in Scala also made it more concise and easier to extend than an optimizer written in Java (you can find various presentations about how Catalyst works).
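
To illustrate the extensibility point, here is a minimal sketch (the rule itself is hypothetical) of a user-defined optimizer rule: a few lines of Scala pattern matching over the logical plan, registered without forking Spark:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.types.BooleanType

    // Drop filters whose condition is the constant `true`.
    object RemoveTrivialFilter extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case Filter(Literal(true, BooleanType), child) => child
      }
    }

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Extra rules are appended to Catalyst's optimizer at runtime.
    spark.experimental.extraOptimizations ++= Seq(RemoveTrivialFilter)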

Matei



Re: Why doesn't Apache Spark use Calcite?

newroyker
Thanks all, and Matei.

TL;DR of the conclusion for my particular case:
Qualitatively, while Catalyst[1] tries to mitigate the learning curve and maintenance burden, it lacks the dynamic-programming approach used by Calcite[2] and risks falling into local minima.
Quantitatively, there is no reproducible benchmark that fairly compares optimizer frameworks apples to apples (excluding execution).
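
To make the qualitative point concrete, a toy illustration (plain Scala, not Spark or Calcite code; all cardinalities and selectivities are invented): under a cost model that sums intermediate result sizes, a greedy join-ordering heuristic commits to locally optimal steps and can get stuck, while dynamic programming over relation subsets finds the global optimum:

    object JoinOrderToy extends App {
      // Three relations with invented cardinalities and join selectivities.
      val card = Array(1000.0, 1000.0, 50.0)
      val sel = Map((0, 1) -> 0.0001, (0, 2) -> 0.4, (1, 2) -> 0.5).withDefaultValue(1.0)
      def s(i: Int, j: Int): Double = sel((i min j, i max j))

      // Result size of joining an intermediate covering `left` with relation r.
      def joinSize(size: Double, left: Set[Int], r: Int): Double =
        left.foldLeft(size * card(r))((acc, i) => acc * s(i, r))

      // Cost model: sum of intermediate result sizes of a left-deep join order.
      def cost(order: Seq[Int]): Double = {
        var size = card(order.head); var done = Set(order.head); var total = 0.0
        for (r <- order.tail) { size = joinSize(size, done, r); total += size; done += r }
        total
      }

      val all = card.indices.toSet

      // Greedy heuristic: start from the smallest relation and always pick the
      // join with the smallest next intermediate; each step is locally optimal.
      var greedy = Seq(card.indices.minBy(card(_)))
      var size = card(greedy.head)
      while (greedy.size < card.length) {
        val next = (all -- greedy).minBy(r => joinSize(size, greedy.toSet, r))
        size = joinSize(size, greedy.toSet, next)
        greedy :+= next
      }

      // Selinger-style dynamic programming over relation subsets (left-deep).
      case class Plan(order: Seq[Int], size: Double, cost: Double)
      var dp = card.indices.map(i => Set(i) -> Plan(Seq(i), card(i), 0.0)).toMap
      for (k <- 2 to card.length; subset <- all.subsets(k)) {
        dp += subset -> subset.toSeq.map { r =>
          val p = dp(subset - r)
          val sz = joinSize(p.size, subset - r, r)
          Plan(p.order :+ r, sz, p.cost + sz)
        }.minBy(_.cost)
      }

      println(s"greedy:  ${greedy.mkString(" -> ")}  cost ${cost(greedy)}")
      println(s"optimal: ${dp(all).order.mkString(" -> ")}  cost ${dp(all).cost}")
    }

On these numbers the greedy order starts from the small relation and pays for a large intermediate (cost 21000), while the DP order joins the two big relations first through the highly selective predicate (cost 1100).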




Re: Why doesn't Apache Spark use Calcite?

Xiao Li
In the upcoming Spark 3.0, we introduced a new framework for Adaptive Query Execution in Catalyst, which can adjust plans based on runtime statistics. Based on my understanding, this is missing in Calcite.

Catalyst is also very easy to enhance. We also use a dynamic-programming approach in our cost-based join reordering, and if needed we can improve the existing CBO further in the future and make it more general. The Spark SQL paper was published five years ago, and a lot of great contributions have been made in those five years.
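
Concretely (the table name is hypothetical; the configuration keys are the documented ones), the relevant switches look like this:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Spark 3.0 Adaptive Query Execution: re-plan at shuffle boundaries
    // using runtime statistics.
    spark.conf.set("spark.sql.adaptive.enabled", "true")

    // Cost-based optimization and DP-based join reordering; both rely on
    // statistics collected beforehand with ANALYZE TABLE.
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS") // hypothetical table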

Cheers,

Xiao



Re: Why doesn't Apache Spark use Calcite?

newroyker
Thanks Xiao. A more up-to-date publication in a conference like VLDB would certainly turn the tide for many of us trying to defend Spark's optimizer.
