Scala vs Python for ETL with Spark

Scala vs Python for ETL with Spark

Mich Talebzadeh
I have come across occasions when teams use Python with Spark for ETL, for example processing data from S3 buckets into Snowflake with Spark.

The only reason I can see for them choosing Python over Scala is that they are more familiar with Python. The fact that Spark itself is written in Scala suggests to me that Scala has an edge.

I have not done a one-to-one comparison of Spark with Scala versus Spark with Python. I understand that for data science purposes most libraries, such as TensorFlow, are Python-oriented, but I am at a loss to understand the case for using Python with Spark for ETL purposes.

This is my understanding rather than established fact, so I would like to get some informed views on it if I can.

Many thanks,

Mich




LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 

Re: Scala vs Python for ETL with Spark

Russell Spitzer
As long as you don't use Python lambdas in your Spark job, there should be almost no difference between the Scala and Python DataFrame code. Once you introduce Python lambdas you hit significant serialization penalties, and the actual work has to run in Python. Without lambdas, everything operates as Catalyst-compiled Java code, so there won't be a big difference between Python and Scala.


Re: Scala vs Python for ETL with Spark

Mich Talebzadeh
Thanks

So, Python lambdas aside, is an individual's familiarity with the language the most important factor? I have also noticed that the Spark documentation now tends to show Python before Scala in its examples. That said, some code, for example JDBC calls, is effectively the same in Scala and Python.

Some sources, such as this website, claim that Scala's performance is an order of magnitude better than Python's, and that Scala is the better choice for concurrency. Perhaps that comparison is simply dated (2018)?

Also (and this may be my ignorance, as I have not researched it): does Spark offer a REPL for Python, analogous to spark-shell?


Regards,

Mich



Re: Scala vs Python for ETL with Spark

Russell Spitzer
Spark in Scala (or Java) is much more performant if you are using RDDs: those operations essentially force you to pass lambdas, incur serialization between Java and Python types, and, yes, hit the Global Interpreter Lock. None of that applies to DataFrames, which generate Java code regardless of the language you use to describe the DataFrame operations, as long as you don't use Python lambdas. A DataFrame operation without Python lambdas should not require any remote Python code execution.

TL;DR: if you are using DataFrames, it doesn't matter whether you use Scala, Java, Python, R, or SQL; the planning and the work all happen in the JVM.

As for a REPL, you can run PySpark, which starts one up. There is also a slew of notebooks that provide interactive Python environments.
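For reference, the typical invocations look like this (the relative paths assume the current directory is the Spark distribution root; this is an invocation sketch, not output from the thread):

```shell
# Scala REPL, bundled with the Spark distribution
./bin/spark-shell

# Python REPL; inside it the `spark` session object is predefined,
# just as in spark-shell
./bin/pyspark

# With a pip-installed pyspark, the bare command also works
pyspark
```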



Re: Scala vs Python for ETL with Spark

Gourav Sengupta
What is the use case?
Unless you have unlimited funding and time to waste, that is usually where you would start.

Regards,
Gourav 


Re: Scala vs Python for ETL with Spark

Wim Van Leuven
Hey Mich,

This is a very fair question. I've seen many data engineering teams start out with Scala because, technically, it is the best choice for many good reasons, and it is what Spark itself is written in.

On the other hand, almost all the use cases we see these days are data science use cases, where people mostly write Python. So if you need those two worlds to collaborate, and even to hand over code to each other, you don't want an ideological battle of Scala versus Python. We chose Python for the sake of everybody speaking the same language.

That only holds, though, if you stick to Spark DataFrames, because then PySpark is a thin layer around everything on the JVM. Even the argument against Python UDFs doesn't always hold up: if it works as a Python function (and most of the time it does), why do Scala? If performance characteristics show you otherwise, implement those UDFs on the JVM.

The problem with Python? Good engineering practices translated into tooling are much rarer: a build tool like Maven for Java or SBT for Scala doesn't exist... yet? You can look at PyBuilder for this.

So, referring to the website you mention: in practice, because of the many data science use cases out there, I see many Spark shops prefer Python over Scala, because Spark gravitates towards DataFrames, where the downsides of Python do not stack up. The performance of the Python driver program, which is just glue code, becomes irrelevant compared to the processing you are doing on the JVM. We also notice that Python is much easier to hire for, and we hear echoes that finding (good?) Scala engineers is hard(er).

So, in conclusion: Python brings data engineers and data scientists together. If you only do data engineering, Scala can be the better choice. It depends on the context.

Hope this helps
-wim


Re: Scala vs Python for ETL with Spark

Jörn Franke
It really depends on what your data scientists speak. I don't think it makes sense to impose a language on them for ad hoc data science work; let them choose.
For more complex AI engineering you can apply different standards and criteria, and then it really depends on architectural aspects, etc.


Re: Scala vs Python for ETL with Spark

Jacek Pliszka
I would not leave the choice to the data scientists unless they will maintain the result.

The key decision in the cases I've seen was usually people cost and availability, with ETL operations cost taken into account.

Often the ETL cloud cost is small and you will not save much there, so it comes down to skills cost and availability. For Python skills you pay less, you can pick people with other useful skills, and you can more easily train the people you already have internally.

Often you also have some simple ETL scripts in place before moving to Spark, and those scripts are usually written in Python.

Best Regards,

Jacek



Re: Scala vs Python for ETL with Spark

Mich Talebzadeh
Many thanks, everyone, for your valuable contributions.

We all started with Spark a few years ago, when Scala was the talk of the town. I agree with the note that as long as Spark stayed niche and elite, someone with Scala knowledge attracted a premium. In fairness, in 2014-2015 there was not much talk of data science input (I may be wrong), but the world has moved on, so to speak.

Python itself has been around a long time (long being relative here). Most people knew UNIX shell, C, Python or Perl, or some combination of these. I recall a director a few years ago who asked our Hadoop admin for the root password to log in to the edge node; he later became head of machine learning somewhere else, and he loved C and Python. So Python was a blessing in disguise. I think Python appeals to those who are very familiar with the CLI and shell programming (not GUI fans).

As some members alluded to, there are more people around with Python knowledge, and most managers choose Python as the unifying development tool because they feel comfortable with it. Frankly, I have not seen a manager who feels at home with Scala. In summary, it is a bit disappointing to abandon Scala and switch to Python just for the sake of it.

Disclaimer: These are opinions and not facts so to speak :)

Cheers,


Mich

 





Re: Scala vs Python for ETL with Spark

Stephen Boesch
I agree with Wim's assessment of data engineering/ETL versus data science. I have written pipelines and frameworks for large companies, and Scala was a much better choice there. But for ad hoc work interfacing directly with data science experiments, PySpark presents less friction.


Re: Scala vs Python for ETL with Spark

Gourav Sengupta
Not quite sure how meaningful this discussion is, but in case someone really is faced with this choice, the question is still: what is the use case?
I am just a bit confused by the one-size-fits-all, deterministic approach here; I thought those days were over almost 10 years ago.
Regards 
Gourav 

On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <[hidden email]> wrote:
I agree with Wim's assessment of data engineering / ETL vs Data Science.    I wrote pipelines/frameworks for large companies and scala was a much better choice. But for ad-hoc work interfacing directly with data science experiments pyspark presents less friction.

On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <[hidden email]> wrote:
Many thanks everyone for their valuable contribution.

We all started with Spark a few years ago where Scala was the talk of the town. I agree with the note that as long as Spark stayed nish and elite, then someone with Scala knowledge was attracting premiums. In fairness in 2014-2015, there was not much talk of Data Science input (I may be wrong). But the world has moved on so to speak. Python itself has been around a long time (long being relative here). Most people either knew UNIX Shell, C, Python or Perl or a combination of all these. I recall we had a director a few years ago who asked our Hadoop admin for root password to log in to the edge node. Later he became head of machine learning somewhere else and he loved C and Python. So Python was a gift in disguise. I think Python appeals to those who are very familiar with CLI and shell programming (Not GUI fan). As some members alluded to there are more people around with Python knowledge. Most managers choose Python as the unifying development tool because they feel comfortable with it. Frankly I have not seen a manager who feels at home with Scala. So in summary it is a bit disappointing to abandon Scala and switch to Python just for the sake of it.

Disclaimer: These are opinions and not facts so to speak :)

Cheers,


Mich

 






Re: Scala vs Python for ETL with Spark

ayan guha
I have one observation: is "Python UDFs are slow due to the deserialization penalty" still relevant, even after Arrow is used for in-memory data management and the heavy investment from the Spark dev community in making pandas a first-class citizen, including UDFs?

As I work with multiple clients, my experience is that org culture and available people are the most important drivers for this choice, regardless of the use case. The use case is relevant only when there is a feature disparity.

On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <[hidden email]> wrote:
Not quite sure how meaningful this discussion is, but in case someone is really faced with this query, the question still is 'what is the use case'?
I am just a bit confused by the one-size-fits-all deterministic approach here; I thought those days were over almost 10 years ago.
Regards 
Gourav 

On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <[hidden email]> wrote:
I agree with Wim's assessment of data engineering / ETL vs data science. I wrote pipelines/frameworks for large companies and Scala was a much better choice. But for ad-hoc work interfacing directly with data science experiments, PySpark presents less friction.

--
Best Regards,
Ayan Guha

Re: Scala vs Python for ETL with Spark

Mich Talebzadeh
Thanks Ayan.

I am not qualified to answer your first point. However, my experience with Spark with Scala or Spark with Python agrees with your assertion that use cases do not come into it. Most DEV/OPS teams dealing with ETL are provided by service companies whose workforce is very familiar with Java, IntelliJ and Maven, and latterly with Scala. Scala is their first choice: they create uber JAR files with IntelliJ and mvn on a MacBook and ship them into sandboxes for continuous tests. I believe this will remain a trend for some time, as considerable investment has already been made there. Then I came across another consultancy tasked with getting raw files from S3 and putting them into Snowflake; they wanted to use Spark with Python. So your mileage varies.


Cheers,


Mich




Re: Scala vs Python for ETL with Spark

ayan guha
But when you have a fairly large volume of data, that is where Spark comes into the party. And I assume the requirement of using Spark is already established in the original question, and the discussion is whether to use Python vs Scala/Java.

On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <[hidden email]> wrote:
If an org has folks that can do Python seriously, why then Spark in the first place? You can do the workflow on your own: streaming, batch, or whatever you want.
I would not do anything else aside from Python, but that is me.


Re: Scala vs Python for ETL with Spark

Mich Talebzadeh
If we take Spark's massive parallel processing and in-memory cache away, then one can argue that almost anything can do the "ETL" job: just write some Java/Scala/SQL/Perl/Python to read data from one DB and write it to another, often over JDBC connections. However, we all concur that may not be good enough with Big Data volumes. Generally speaking, there are two ways of making a process faster:

  1. Do more intelligent work, e.g. creating indexes, cubes etc., thus reducing the processing time
  2. Throw hardware and memory at it, using something like a Spark multi-node cluster on a fully managed cloud service such as Google Dataproc

In general, one would see an order-of-magnitude performance gain.


HTH,


Mich




 




Re: Scala vs Python for ETL with Spark

Gourav Sengupta
So, Mich and the rest,

technology choices are agnostic to use cases, according to you? This is interesting, really interesting. Perhaps I stand corrected.

Regards,
Gourav


Re: Scala vs Python for ETL with Spark

Mich Talebzadeh
Hi,

With regard to your statement below

".technology choices are agnostic to use cases according to you...."

If I may say, I do not think that was the message implied. What was said was that in addition to "best technology fit" there are other factors, "equally important", that need to be considered when a company makes a decision on a given use case.

As others have stated, the technology stack you choose may not be the best available technology but something that provides an adequate solution at a reasonable TCO. Case in point: if Scala is the best fit for a given use case but comes at a higher TCO (labour cost), you may opt for Python or another language because you have those resources available in-house at a lower cost and your data scientists are keen to invest in Python. Companies these days are very careful about where they spend their technology dollars, or they cancel projects altogether. From my experience, the following are crucial in deciding what to invest in:

  • Total Cost of Ownership (TCO)
  • Internal supportability and operability, thus avoiding a single point of failure
  • Maximum leverage: strategic as opposed to tactical (for example, is Python considered more of a strategic product than Scala?)
  • Agile and DevOps compatible
  • Cloud-ready, flexible, scale-out
  • Vendor support
  • Documentation
  • Minimal footprint

I trust this answers your point.


Mich



 



On Sun, 11 Oct 2020 at 17:39, Gourav Sengupta <[hidden email]> wrote:
So Mich and rest,

technology choices are agnostic to use cases according to you? This is interesting, really interesting. Perhaps I stand corrected.

Regards,
Gourav

On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh <[hidden email]> wrote:
if we take Spark and its massive parallel processing and in-memory cache away, then one can argue anything can do the "ETL" job. just write some Java/Scala/SQL/Perl/python to read data and write to from one DB to another often using JDBC connections. However, we all concur that may not be good enough with Big Data volumes. Generally speaking, there are two ways of making a process faster:

  1. Do more intelligent work by creating indexes, cubes etc thus reducing the processing time
  2. Throw hardware and memory at it using something like Spark multi-cluster with fully managed cloud service like Google Dataproc

In general, one would see an order of magnitude performance gains.


HTH,


Mich



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Sun, 11 Oct 2020 at 13:33, ayan guha <[hidden email]> wrote:
But when you have fairly large volume of data that is where spark comes in the party. And I assume the requirement of using spark is already established in the original qs and the discussion is to use python vs scala/java. 

On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <[hidden email]> wrote:
If org has folks that can do python seriously why then spark in the first place. You can do workflow on your own, streaming or batch or what ever you want.
I would not do anything else aside from python, but that is me.

On Sat, Oct 10, 2020, 9:42 PM ayan guha <[hidden email]> wrote:
I have one observation: is "python udf is slow due to deserialization penulty" still relevant? Even after arrow is used as in memory data mgmt and so heavy investment from spark dev community on making pandas first class citizen including Udfs.

As I work with multiple clients, my exp is org culture and available people are most imp driver for this choice regardless the use case. Use case is relevant only when there is a feature imparity

On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <[hidden email]> wrote:
Not quite sure how meaningful this discussion is, but in case someone is really faced with this query the question still is 'what is the use case'?
I am just a bit confused with the one size fits all deterministic approach here thought that those days were over almost 10 years ago. 
Regards 
Gourav 

On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <[hidden email]> wrote:
I agree with Wim's assessment of data engineering / ETL vs Data Science.    I wrote pipelines/frameworks for large companies and scala was a much better choice. But for ad-hoc work interfacing directly with data science experiments pyspark presents less friction.

On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <[hidden email]> wrote:
Many thanks everyone for their valuable contribution.

We all started with Spark a few years ago where Scala was the talk of the town. I agree with the note that as long as Spark stayed nish and elite, then someone with Scala knowledge was attracting premiums. In fairness in 2014-2015, there was not much talk of Data Science input (I may be wrong). But the world has moved on so to speak. Python itself has been around a long time (long being relative here). Most people either knew UNIX Shell, C, Python or Perl or a combination of all these. I recall we had a director a few years ago who asked our Hadoop admin for root password to log in to the edge node. Later he became head of machine learning somewhere else and he loved C and Python. So Python was a gift in disguise. I think Python appeals to those who are very familiar with CLI and shell programming (Not GUI fan). As some members alluded to there are more people around with Python knowledge. Most managers choose Python as the unifying development tool because they feel comfortable with it. Frankly I have not seen a manager who feels at home with Scala. So in summary it is a bit disappointing to abandon Scala and switch to Python just for the sake of it.

Disclaimer: These are opinions and not facts so to speak :)

Cheers,


Mich

 




On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <[hidden email]> wrote:
I have come across occasions when the teams use Python with Spark for ETL, for example processing data from S3 buckets into Snowflake with Spark.

The only reason I think they are choosing Python as opposed to Scala is because they are more familiar with Python. Since Spark is written in Scala, itself is an indication of why I think Scala has an edge.

I have not done one to one comparison of Spark with Scala vs Spark with Python. I understand for data science purposes most libraries like TensorFlow etc. are written in Python but I am at loss to understand the validity of using Python with Spark for ETL purposes.

These are my understanding but they are not facts so I would like to get some informed views on this if I can?

Many thanks,

Mich




LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 

--
Best Regards,
Ayan Guha

Re: Scala vs Python for ETL with Spark

Mich Talebzadeh
Hi,

I spent a few days converting one of my Spark/Scala scripts to Python. It was interesting, but at times it looked like trench warfare. There is a lot of handy stuff in Scala, like case classes for defining column headers, that does not seem to be available in Python (possibly my lack of in-depth Python knowledge). Moreover, the Spark documents frequently state that features are available for Scala and Java but not Python.

Looking around, much of what is written for Spark using Python is a workaround. I am not considering Python for data science, as my focus has been on using Python with Spark for ETL; I published a thread on this today with two examples of the code written in Scala and Python respectively. OK, I admit lambda functions in Python with map are a great feature, but that is all; the rest can be achieved better with Scala. So I buy the view that people tend to use Python with Spark for ETL because (with great respect) they cannot be bothered to pick up Scala (I trust I am not being unkind). When I was converting the code, I remembered that I still use a Nokia 8210 (21-year-old technology) from time to time: old, sturdy, long battery life and very small. Compare that with an iPhone. That is a fair comparison between Spark on Scala and Spark on Python :)
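For what it is worth, Python does have rough stand-ins for Scala case classes when all you need is named, typed record fields, e.g. the stdlib `dataclasses` module. This is a hedged, generic sketch with made-up type and field names, not any particular Spark API (in PySpark proper one would usually define a `StructType` schema or use `Row` instead):

```python
from dataclasses import dataclass

# Hypothetical record type, loosely analogous to a Scala case class
# used to name and type columns. The names here are made up purely
# for illustration.
@dataclass(frozen=True)
class Trade:
    ticker: str
    price: float
    volume: int

t = Trade("ORCL", 12.5, 1000)
# dataclasses give you a readable repr and structural equality for free
print(t)
```

It is not a full substitute (no pattern matching, no automatic Spark schema inference from it), but it covers the "named column headers" convenience mentioned above.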

HTH



On Sun, 11 Oct 2020 at 20:46, Mich Talebzadeh <[hidden email]> wrote:
Hi,

With regard to your statement below:

"...technology choices are agnostic to use cases according to you..."

If I may say so, I do not think that was the message implied. What was said was that, in addition to the "best technology fit", there are other factors, "equally important", that need to be considered when a company makes a decision on a given product use case.

As others have stated, the technology stack you choose may not be the best available technology, but something that provides an adequate solution at a reasonable TCO. Case in point: if Scala is the best fit in a given use case but comes at a higher TCO (labour cost), then you may opt for Python or another language because you have those resources available in-house at lower cost and your Data Scientists are eager to invest in Python. Companies these days are very careful where they spend their technology dollars, or just cancel projects altogether. From my experience, the following are crucial in deciding what to invest in:

  • Total Cost of Ownership
  • Internal supportability and operability, thus avoiding a single point of failure
  • Maximum leverage, strategic as opposed to tactical (for example, is Python or Scala the more strategic product?)
  • Agile and DevOps compatible
  • Cloud-ready, flexible, scale-out
  • Vendor support
  • Documentation
  • Minimal footprint

I trust this answers your point.


Mich





On Sun, 11 Oct 2020 at 17:39, Gourav Sengupta <[hidden email]> wrote:
So Mich and rest,

Technology choices are agnostic to use cases, according to you? This is interesting, really interesting. Perhaps I stand corrected.

Regards,
Gourav

On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh <[hidden email]> wrote:
If we take away Spark's massive parallel processing and in-memory cache, then one can argue that anything can do the "ETL" job: just write some Java/Scala/SQL/Perl/Python to read data from one DB and write it to another, often using JDBC connections. However, we would all concur that this may not be good enough with Big Data volumes. Generally speaking, there are two ways of making a process faster:

  1. Do more intelligent work by creating indexes, cubes, etc., thus reducing the processing time
  2. Throw hardware and memory at it, using something like a multi-node Spark cluster on a fully managed cloud service such as Google Dataproc

In general, one would see order-of-magnitude performance gains.
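To make that concrete, here is a deliberately plain, single-process sketch of the DB-to-DB pattern using only the stdlib sqlite3 module (table and column names are made up for illustration). It does the "ETL" job, just without any of Spark's parallelism or in-memory caching:

```python
import sqlite3

# Two illustrative databases; in practice these would be JDBC/ODBC
# connections to real source and target systems.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")

src.execute("CREATE TABLE sales (item TEXT, amount REAL)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [("a", 10.0), ("b", 20.0), ("a", 5.0)])

dst.execute("CREATE TABLE sales_by_item (item TEXT, total REAL)")

# Extract and transform: aggregate in the source, then load the
# resulting rows into the target.
rows = src.execute(
    "SELECT item, SUM(amount) FROM sales GROUP BY item ORDER BY item")
dst.executemany("INSERT INTO sales_by_item VALUES (?, ?)", rows)

result = list(dst.execute("SELECT * FROM sales_by_item"))
print(result)  # [('a', 15.0), ('b', 20.0)]
```

Fine for small volumes; it is exactly this single-threaded, row-shuffling shape that stops scaling once the data gets big.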


HTH,


Mich





On Sun, 11 Oct 2020 at 13:33, ayan guha <[hidden email]> wrote:
But when you have a fairly large volume of data, that is where Spark comes to the party. And I assume the requirement for using Spark is already established in the original question, and the discussion is about using Python vs Scala/Java.

On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <[hidden email]> wrote:
If an org has folks who can do Python seriously, why use Spark in the first place? You can build the workflow on your own, streaming or batch or whatever you want.
I would not do anything other than Python, but that is me.

On Sat, Oct 10, 2020, 9:42 PM ayan guha <[hidden email]> wrote:
I have one observation: is "Python UDFs are slow due to the deserialization penalty" still relevant? Even now that Arrow is used for in-memory data management, and after heavy investment from the Spark dev community in making pandas a first-class citizen, including UDFs?

As I work with multiple clients, my experience is that org culture and the people available are the most important drivers for this choice, regardless of the use case. The use case is relevant only when there is a feature disparity.
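On the Arrow point, the difference between the two UDF styles can be illustrated with plain pandas (assuming pandas is available; this is a sketch of the idea, not PySpark code). A scalar `apply` crosses the Python function-call boundary once per row, while a vectorized expression touches the whole batch at once, which is what pandas (vectorized) UDFs exploit:

```python
import pandas as pd

s = pd.Series(range(5))

# Row-at-a-time, as with a classic Python UDF: every value crosses
# the Python function-call boundary individually.
row_at_a_time = s.apply(lambda x: x + 1)

# Batch-at-a-time, as with a pandas (vectorized) UDF: one call
# operates on the whole batch.
vectorized = s + 1

# Same answer either way; only the per-row call overhead differs.
assert row_at_a_time.tolist() == vectorized.tolist()
```

The results match; the per-row overhead (and, in Spark, the JVM-to-Python serialization per value) is what the vectorized path avoids.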

On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <[hidden email]> wrote:
Not quite sure how meaningful this discussion is, but in case someone is really faced with this query, the question still is: what is the use case?
I am just a bit confused by the one-size-fits-all deterministic approach here; I thought those days were over almost 10 years ago.
Regards
Gourav

On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <[hidden email]> wrote:
I agree with Wim's assessment of data engineering/ETL vs Data Science. I wrote pipelines/frameworks for large companies and Scala was a much better choice. But for ad-hoc work interfacing directly with data science experiments, PySpark presents less friction.

On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <[hidden email]> wrote:
Many thanks everyone for their valuable contribution.


Re: Scala vs Python for ETL with Spark

Molotch
I would say the pros and cons of Python vs Scala come down to Spark, to the languages themselves, and to what kind of data engineer you will get when you try to hire for the different solutions.

With PySpark you get less functionality and increased complexity from the py4j Java interop compared to vanilla Spark. Why would you want that? Maybe you want the Python ML tools and have a clear use case; then go for it. If not, avoid the increased complexity and reduced functionality of PySpark.

Python vs Scala? Idiomatic Python is a lesson in bad programming habits/ideas; there's no other way to put it. Do you really want programmers who enjoy coding in such a language hacking away at your system?

Scala might be far from perfect, with its plethora of ways to express yourself. But Python < 3.5 is not fit for anything except simple scripting, IMO.

For exploratory data analysis in a Jupyter notebook, PySpark seems like a fine idea. For coding an entire ETL library, including state management and the whole kitchen including the sink: Scala, every day of the week.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Scala vs Python for ETL with Spark

Sasha Kacanski
And you are an expert on Python! "Idiomatic"...
Please do everyone a favor and stop commenting on things you have no idea about...
I built ETL systems in Python that wiped Java commercial stacks left and right. PySpark was, is, and will be a second-class citizen in the Spark world. That has nothing to do with Python.
And as far as Scala is concerned, good luck with it...





On Sat, Oct 17, 2020, 8:53 AM Molotch <[hidden email]> wrote:
