Apache Spark 3.1 Preparation Status (Oct. 2020)


Apache Spark 3.1 Preparation Status (Oct. 2020)

Dongjoon Hyun
Hi, All.

As of today, the master branch (Apache Spark 3.1.0) has resolved
852+ JIRA issues, and 606+ of those are 3.1.0-only patches.
According to the 3.1.0 release window, branch-3.1 will be
created on November 1st and will then enter the QA period.

Here are some notable updates I've been monitoring.

Language
01. SPARK-25075 Support Scala 2.13
      - Since SPARK-32926, a Scala 2.13 build test has been
        part of the GitHub Actions jobs.
      - After SPARK-33044, the Scala 2.13 tests will also be
        part of the Jenkins jobs.
02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
03. SPARK-32082 Project Zen: Improving Python usability
      - 7 of 16 issues are resolved.
04. SPARK-32073 Drop R < 3.5 support
      - This is done for Spark 3.0.1 and 3.1.0.

Dependency
05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
      - This changes the default distribution for better cloud support.
06. SPARK-32981 Remove hive-1.2 distribution
07. SPARK-20202 Remove references to org.spark-project.hive
      - This will remove Hive 1.2.1 from the source code.
08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)

Core
09. SPARK-27495 Support stage-level resource configuration and scheduling
      - 11 of 15 issues are resolved (see the sketch after this section)
10. SPARK-25299 Use remote storage for persisting shuffle data
      - 8 of 14 issues are resolved
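
For those unfamiliar with (09), here is a minimal sketch of the
stage-level scheduling RDD API as it currently looks on master. The
resource amounts and the input RDD are hypothetical, and the feature
currently requires dynamic allocation (e.g., on YARN); `sc` is an
existing SparkContext, as in spark-shell.

    import org.apache.spark.resource.{ExecutorResourceRequests,
      ResourceProfileBuilder, TaskResourceRequests}

    // Hypothetical resource amounts, for illustration only.
    val execReqs = new ExecutorResourceRequests().cores(4).memory("6g")
    val taskReqs = new TaskResourceRequests().cpus(1)
    val profile = new ResourceProfileBuilder()
      .require(execReqs)
      .require(taskReqs)
      .build()

    // Stages computing this RDD are scheduled with the new profile.
    val rdd = sc.parallelize(1 to 100)
    rdd.withResources(profile).map(_ * 2).collect()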

Resource Manager
11. SPARK-33005 Kubernetes GA preparation
      - It is underway, and we are waiting for more feedback.

SQL
12. SPARK-30648/SPARK-32346 Support filter pushdown
      to JSON/Avro (see the first sketch after this list)
13. SPARK-32948/SPARK-32958 Add JSON expression optimizer
14. SPARK-12312 Support JDBC Kerberos w/ keytab
      - 11 of 17 issues are resolved
        (see the second sketch after this list)
15. SPARK-27589 DSv2 was mostly completed in 3.0,
      and more features were added in 3.1, but we are still missing:
      - All built-in DataSource v2 write paths are disabled,
        and the v1 write path is used instead.
      - Support for partition pruning with subqueries
      - Support for bucketing
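
As a minimal sketch of (12) and (13) in spark-shell — the path and
schema below are hypothetical — a predicate on a JSON read like the
one here can now be pushed down into the scan, and a redundant
to_json/from_json round trip can be simplified by the new optimizer
rule:

    import org.apache.spark.sql.functions.{from_json, struct, to_json}
    import org.apache.spark.sql.types.StructType
    import spark.implicits._

    // Hypothetical path and schema, for illustration only.
    val people = spark.read
      .schema("name STRING, age INT")
      .json("/tmp/people.json")

    // SPARK-30648: this filter can be pushed down into the JSON scan.
    people.filter($"age" > 21).show()

    // SPARK-32948: a from_json(to_json(...)) chain over a matching
    // schema can be collapsed instead of re-serializing every row.
    val schema = StructType.fromDDL("name STRING, age INT")
    people.select(from_json(to_json(struct($"name", $"age")), schema)).explain()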
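
And a sketch of (14), with hypothetical connection details; since the
umbrella is still in progress, treat this as the intended shape of the
API rather than a finished feature. Spark ships the keytab to the
cluster, and the JDBC connection provider logs in with it:

    // Hypothetical URL, table, keytab path, and principal.
    val accounts = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/mydb")
      .option("dbtable", "public.accounts")
      .option("keytab", "/etc/security/keytabs/spark.keytab")
      .option("principal", "spark@EXAMPLE.COM")
      .load()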

We still have one month before the feature freeze
and the start of QA. If you are working on features for 3.1,
please consider the timeline and share your schedule
with the Apache Spark community. Everything else
can go into the 3.2 release, scheduled for June 2021.

Last but not least, I want to emphasize (07) once again.
We need to remove the forked, unofficial Hive eventually.
If you need to build Apache Spark 3.1 from source
against Hive 1.2, please let us know your reasons.


As I wrote in the PR description above, for the older releases,
Apache Spark 2.4 (LTS) and 3.0 (supported through ~Dec. 2021) will
continue to provide Hive 1.2-based distributions.

Bests,
Dongjoon.

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

Mridul Muralidharan

+1 on pushing the branch cut back for increased dev time, to match previous releases.

Regards,
Mridul 

On Sat, Oct 3, 2020 at 10:22 PM Xiao Li <[hidden email]> wrote:
Thank you for your updates. 

Spark 3.0 was released on Jun 18, 2020. If Nov 1st is the target date for the 3.1 branch cut, the feature development window is less than 5 months. This is shorter than what we had for the Spark 2.3 and 2.4 releases.

Below are three highly desirable pieces of feature work I am watching. Hopefully, we can finish them before the branch cut.
Thanks,

Xiao


Hyukjin Kwon <[hidden email]> wrote on Sat, Oct 3, 2020 at 5:41 PM:
Nice summary. Thanks, Dongjoon. One minor correction: I believe we dropped support for R < 3.5 at branch-2.4 as well.


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

Dongjoon Hyun
Thank you all.

BTW, Xiao and Mridul, I'm wondering what date you have in mind, specifically.

Usually, the Christmas and New Year season doesn't give us much additional time.

If you think we need more time, could you make a PR for the Apache Spark website reflecting the dates you expect?


Bests,
Dongjoon.


On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan <[hidden email]> wrote:

+1 on pushing the branch cut for increased dev time to match previous releases.

Regards,
Mridul 

On Sat, Oct 3, 2020 at 10:22 PM Xiao Li <[hidden email]> wrote:
Thank you for your updates. 

Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of the 3.1 branch cut, the feature development time window is less than 5 months. This is shorter than what we did in Spark 2.3 and 2.4 releases. 

Below are three highly desirable feature work I am watching. Hopefully, we can finish them before the branch cut.
Thanks,

Xiao


Hyukjin Kwon <[hidden email]> 于2020年10月3日周六 下午5:41写道:
Nice summary. Thanks Dongjoon. One minor correction -> I believe we dropped R 3.5 and below at branch 2.4 as well.

On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, <[hidden email]> wrote:
Hi, All.

As of today, master branch (Apache Spark 3.1.0) resolved
852+ JIRA issues and 606+ issues are 3.1.0-only patches.
According to the 3.1.0 release window, branch-3.1 will be
created on November 1st and enters QA period.

Here are some notable updates I've been monitoring.

Language
01. SPARK-25075 Support Scala 2.13
      - Since SPARK-32926, Scala 2.13 build test has
        become a part of GitHub Action jobs.
      - After SPARK-33044, Scala 2.13 test will be
        a part of Jenkins jobs.
02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
03. SPARK-32082 Project Zen: Improving Python usability
      - 7 of 16 issues are resolved.
04. SPARK-32073 Drop R < 3.5 support
      - This is done for Spark 3.0.1 and 3.1.0.

Dependency
05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
      - This changes the default dist. for better cloud support
06. SPARK-32981 Remove hive-1.2 distribution
07. SPARK-20202 Remove references to org.spark-project.hive
      - This will remove Hive 1.2.1 from source code
08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)

Core
09. SPARK-27495 Support Stage level resource conf and scheduling
      - 11 of 15 issues are resolved
10. SPARK-25299 Use remote storage for persisting shuffle data
      - 8 of 14 issues are resolved

Resource Manager
11. SPARK-33005 Kubernetes GA preparation
      - It is on the way and we are waiting for more feedback.

SQL
12. SPARK-30648/SPARK-32346 Support filters pushdown
      to JSON/Avro
13. SPARK-32948/SPARK-32958 Add Json expression optimizer
14. SPARK-12312 Support JDBC Kerberos w/ keytab
      - 11 of 17 issues are resolved
15. SPARK-27589 DSv2 was mostly completed in 3.0
      and added more features in 3.1 but still we missed
      - All built-in DataSource v2 write paths are disabled
        and v1 write is used instead.
      - Support partition pruning with subqueries
      - Support bucketing

We still have one month before the feature freeze
and starting QA. If you are working for 3.1,
please consider the timeline and share your schedule
with the Apache Spark community. For the other stuff,
we can put it into 3.2 release scheduled in June 2021.

Last not but least, I want to emphasize (7) once again.
We need to remove the forked unofficial Hive eventually.
Please let us know your reasons if you need to build
from Apache Spark 3.1 source code for Hive 1.2.


As I wrote in the above PR description, for old releases,
Apache Spark 2.4(LTS) and 3.0 (~2021.12) will provide
Hive 1.2-based distribution.

Bests,
Dongjoon.
Reply | Threaded
Open this post in threaded view
|

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

Xiao Li
I think we made a change to the release cadence starting with Spark 2.3. See the commit: https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa. Thus, Spark 3.1 might just follow the release cadence of Spark 2.3/2.4, if we do not want to change the cadence again.

How about moving the code freeze of Spark 3.1 to early Dec 2020 and the RC1 date to early Jan 2021?

Thanks,

Xiao


Dongjoon Hyun <[hidden email]> wrote on Sun, Oct 4, 2020 at 12:44 PM:
Regarding Xiao's comment, I want to point out that Apache Spark 3.1.0 is different from 2.3 or 2.4.

Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.

- Apache Spark 2.0.0 was released on July 26, 2016.
- Apache Spark 2.1.0 was released on December 28, 2016.

Bests,
Dongjoon.

