Should python-2 be supported in Spark 3.0?

Should python-2 be supported in Spark 3.0?

Erik Erlandson-2
On a separate dev@spark thread, I raised the question of whether to continue supporting Python 2 in Apache Spark going forward into Spark 3.0.

Python 2 reaches end of life (EOL) at the end of 2019. The upcoming Spark 3.0 release is an opportunity to make breaking changes to Spark's APIs, so it is a good time to reconsider Python 2 support in PySpark.

Key advantages to dropping Python 2 are:
  • Supporting PySpark becomes significantly easier.
  • We avoid having to support Python 2 until Spark 4.0, which would likely mean supporting it for some time after it goes EOL.
(Note that supporting Python 2 after EOL means, among other things, that PySpark would be supporting a version of Python that no longer receives security patches.)

The main disadvantage is that PySpark users with legacy Python 2 code would have to migrate it to Python 3 to take advantage of Spark 3.0.

This decision obviously has large implications for the Apache Spark community, and we want to solicit community feedback.
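
To make the migration cost concrete, here is a minimal sketch (plain Python rather than the PySpark API) of one semantic change between the two versions that can silently alter job results:

    # Division of two ints changed meaning between Python 2 and Python 3,
    # a classic source of silent numeric bugs in migrated pipelines.
    xs = [1, 2, 3]
    halves = [x / 2 for x in xs]
    print(halves)
    # Python 2 prints [0, 1, 1]       (floor division on ints)
    # Python 3 prints [0.5, 1.0, 1.5] (true division)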

Re: Should python-2 be supported in Spark 3.0?

Erik Erlandson-2
In case this didn't make it onto this thread:

There is a third option: deprecate Py2 in Spark 3.0 and remove it entirely in a later 3.x release.

Re: Should python-2 be supported in Spark 3.0?

Mark Hamstra
We could also deprecate Py2 as early as the 2.4.0 release.

Re: Should python-2 be supported in Spark 3.0?

Felix Cheung
I don't think we should remove any API, even in a major release, without deprecating it first...

 

Re: Should python-2 be supported in Spark 3.0?

Aakash Basu-2
Removing support for an API outright in a major release makes little sense; deprecating it first is always better. Removal can then be done two or three minor releases later.

Re: Should python-2 be supported in Spark 3.0?

Hyukjin Kwon
I think we can deprecate it in a 3.x release and remove it in Spark 4.0.0. Many people still use Python 2. Also, technically, Python 2.7 support has not officially been dropped yet (see https://pythonclock.org/).
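
For reference, the countdown that site tracks runs to January 1, 2020. A quick sketch of that calculation (illustrative only, not part of any Spark code):

    from datetime import date

    # Days remaining until Python 2.7's announced EOL date
    # (January 1, 2020), the date pythonclock.org counts down to.
    eol = date(2020, 1, 1)
    print("Days until Python 2.7 EOL:", (eol - date.today()).days)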


Re: Should python-2 be supported in Spark 3.0?

Erik Erlandson-2
In reply to this post by Mark Hamstra
I like Mark's idea of deprecating Py2 starting with 2.4: EOL may seem like a ways off, but even now some Spark versions may end up supporting Py2 past the point where it is no longer receiving security patches.


Re: Should python-2 be supported in Spark 3.0?

Matei Zaharia
I'd like to understand the maintenance burden of Python 2 before deprecating it. Since it is not EOL yet, it might make sense to deprecate it only once it's EOL (which is still over a year from now). Supporting Python 2 and 3 together seems less burdensome than supporting, say, multiple Scala versions in the same codebase, so what are we losing out on?

The other thing is that even though Python core devs might not support 2.x later, it’s quite possible that various Linux distros will if moving from 2 to 3 remains painful. In that case, we may want Apache Spark to continue releasing for it despite the Python core devs not supporting it.

Basically, I'd suggest deprecating this in Spark 3.0 and then removing it later in 3.x, instead of deprecating it in 2.4. I'd also consider looking at what other data science tools are doing before fully removing it: for example, if Pandas and TensorFlow no longer support Python 2 past some point, that might be a good point to remove it.

Matei

> On Sep 17, 2018, at 11:01 AM, Mark Hamstra <[hidden email]> wrote:
>
> If we're going to do that, then we need to do it right now, since 2.4.0 is already in release candidates.

Re: Should python-2 be supported in Spark 3.0?

Mark Hamstra
What is the disadvantage of deprecating now, in 2.4.0? It doesn't change the code at all; it's just a notification that we will eventually cease supporting Py2. Wouldn't users prefer to get that notification sooner rather than later?
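
For instance, a minimal sketch of what such a notification could look like (hypothetical code, not what Spark actually ships):

    import sys
    import warnings

    # Emit a deprecation notice when PySpark starts under Python 2.
    # Purely informational: runtime behavior is unchanged.
    if sys.version_info[0] < 3:
        warnings.warn(
            "Support for Python 2 is deprecated as of Spark 2.4.0 and "
            "will be removed in a future release; please migrate to "
            "Python 3.",
            DeprecationWarning)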

Re: Should python-2 be supported in Spark 3.0?

Erik Erlandson-2
In reply to this post by Matei Zaharia
FWIW, Pandas is dropping Py2 support at the end of this year. TensorFlow is less clear: it only supports Py3 on Windows, and there is no reference to any Py2 policy on its roadmap or in the TF 2.0 announcement.
Re: Should python-2 be supported in Spark 3.0?

Matei Zaharia
In reply to this post by Mark Hamstra
That's a good point; I'd say the main risk is creating a perception issue. First, some users might feel this means they have to migrate now, before Python itself drops support; they might also be surprised that we did this in a minor release (e.g., might we drop Python 2 altogether in a Spark 2.5, if that later comes out?). Second, contributors might feel this means new features no longer have to work with Python 2, which would be confusing. Maybe it's OK on both fronts, but it just seems scarier for users to do this now if we plan to ship Spark 3.0 in the next six months anyway.

Matei
