Why Scala?


Why Scala?

Nick Chammas
I recently discovered Hacker News and started reading through older posts about Scala. It looks like the language is fairly controversial on there, and it got me thinking.

Scala appears to be the preferred language to work with in Spark, and Spark itself is written in Scala, right?

I know that oftentimes a successful project evolves gradually out of something small, and that the choice of programming language may not always have been made consciously at the outset.

But pretending that it was, why is Scala the preferred language of Spark?

Nick


Re: Why Scala?

Benjamin Black
HN is a cesspool safely ignored.




Re: Why Scala?

Matei Zaharia
Administrator
Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated big data framework I know of, that begat things like FlumeJava and Crunch). On the JVM, the only language that would offer that kind of API was Scala, due to its ability to capture functions and ship them across the network. Scala’s static typing also made it much easier to control performance compared to, say, Jython or Groovy.
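As a rough illustration of what that concise, closure-based API looks like in Scala -- the `sc` SparkContext handle, the file path and the threshold below are assumed placeholders for the example, not anything from this thread:

    val lines = sc.textFile("hdfs:///path/to/logs")    // assumed input path
    val threshold = 100                                // local value captured by the closures below
    val longErrors = lines
      .filter(line => line.contains("ERROR"))          // these functions are serialized and
      .map(line => line.length)                        //   shipped to the worker nodes
      .filter(len => len > threshold)                  // closes over `threshold`
    println(longErrors.count())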

In terms of usage, however, we see substantial usage of our other languages (Java and Python), and we’re continuing to invest in both. In a user survey we did last fall, about 25% of users used Java and 30% used Python, and I imagine these numbers are growing. With lambda expressions now added to Java 8 (http://databricks.com/blog/2014/04/14/Spark-with-Java-8.html), I think we’ll see a lot more Java. And at Databricks I’ve seen a lot of interest in Python, which is very exciting to us in terms of ease of use.

Matei




Re: Why Scala?

Nick Chammas
Matei,

Thank you for the concise explanation.

I use Python and will definitely add my vote of interest in seeing more of Spark's functionality (especially Spark Streaming) exposed via Python.

Scala seems like an interesting language to learn, if only to unlock more of Spark's functionality for use. I am a total n00b in general, so I'm still learning about the things that distinguish programming languages from one another (e.g. type inference, lambda expressions, etc).


Benjamin,

HN does come off as a "Reddit for nerds", and discussions do seem to descend sometimes into "nerd slapfights", as one person put it. :)

Nick






Re: Why Scala?

Dmitriy Lyubimov
In reply to this post by Nick Chammas
There were a few known concerns about Scala, and some still remain, but having been doing Scala professionally for over two years now, I have learned to master the language and appreciate its advantages.

The major concern, IMO, is Scala in a less-than-scrupulous corporate environment.

First, Scala requires significantly more discipline in commenting and style than Java to stay painlessly readable. People with less-than-stellar code hygiene can easily turn a project into an unmaintainable mess.

Second, from a corporate management perspective, it is (still?) much harder to staff a team with Scala coders than with Java ones.

All of these things are a headache for corporate bosses, but for public and academic projects with thorough peer review, where contributors want to look clean in public, it works out quite well and the strengths really shine.

Spark specifically builds around FP patterns -- such as monads and functors -- which were absent in Java prior to Java 8 (and I am not sure they are as well worked out in the Java 8 collections even now, compared to the Scala collections). So Java 8 simply comes a little late to the show in that department.

Also, FP is not the only Scala feature Spark uses. Spark also relies on things like implicits and the Akka framework for IPC. FP, important as it is, is only one of many stories in Scala in the grand scheme of things.
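To make the FP point concrete, here is a small REPL-style sketch in plain Scala of the style being referred to (the Event and RichLong names are invented for the illustration; they are not Spark APIs):

    case class Event(user: String, bytes: Long)

    val events = List(Event("a", 10), Event("b", 250), Event("a", 40))

    // map / flatMap / filter are the functor- and monad-style operations
    // that Spark's RDD API mirrors on distributed data:
    val bytesPerUser: Map[String, Long] =
      events
        .filter(_.bytes > 20)                                    // keep the larger events
        .groupBy(_.user)                                         // Map[String, List[Event]]
        .map { case (user, es) => user -> es.map(_.bytes).sum }  // total bytes per user

    // An implicit class, the kind of mechanism Spark uses to add extra
    // methods (e.g. on pair RDDs) without touching the original type:
    implicit class RichLong(private val n: Long) {
      def kb: Long = n * 1024
    }
    val limit = 64L.kb   // reads like a tiny DSL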




Re: Why Scala?

Marek Kolodziej-2
In reply to this post by Nick Chammas
I would disagree that Scala is controversial. It's less controversial than Java was when it came out in 1995. Scala's been around since 2004, and over the past couple of years, it saw major adoption at LinkedIn, Twitter, FourSquare, Netflix, Tumblr, The Guardian, Airbnb, Meetup.com, Coursera, UBS, Ask.com, AT&T, Bloomberg, eBay, The Weather Channel, etc. It's not merely academic.

It's pretty obvious that Java has many major shortcomings, especially in the functional programming realm. Java 8 added lambdas, but it didn't add currying, partial application, tail call optimization, and so on. Java's "BoilerPlate boilerPlate = new BoilerPlateImpl()" is poorly suited for data science and other cases that require expressivity. Scala's type system is both stronger than Java's (e.g. Scala's arrays are invariant while Java's are covariant, which was an error in language design) and more flexible (declaration-site variance annotations as well as bounds, not just bounds as in Java). Scala's type inference cuts out the boilerplate. Implicit conversions make domain-specific languages possible. Pattern matching allows decomposition that's much more expressive than Java's "instanceof", switch/case and if/else. ClassTags also let you combat type erasure - how can you check whether something is a List<Integer> or a List<String> at runtime if the types are erased (a major sin that Java committed and C# didn't)? The list goes on and on.
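Two of those points, pattern matching and ClassTags, in a minimal REPL-style sketch (the Shape classes are made up for the illustration):

    import scala.reflect.ClassTag

    sealed trait Shape
    case class Circle(r: Double) extends Shape
    case class Rect(w: Double, h: Double) extends Shape

    // Pattern matching with decomposition, instead of instanceof plus casts:
    def area(s: Shape): Double = s match {
      case Circle(r)  => math.Pi * r * r
      case Rect(w, h) => w * h
    }

    // A ClassTag carries the erased element type to runtime, e.g. so a
    // generic method can still build an Array[T]:
    def toArrayOf[T: ClassTag](xs: Seq[T]): Array[T] = xs.toArray

    area(Circle(1.0))         // 3.141592653589793
    toArrayOf(Seq(1, 2, 3))   // an Array[Int] despite erasure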

Since Scala compiles to Java bytecode, you have all the Java libraries at your disposal, but there's no question that the language is better: more type safe, more expressive, more concise, etc. The functional programming features are so much better than what Java 8 *stole* from Scala (look how they even copied method names such as compose/andThen from Scala) that it's hard to even begin to compare. You can still use your old Java tools like Maven and JUnit/TestNG, though Scala tools such as sbt and ScalaTest are much more feature-rich.

I'm not saying that Scala is perfect, but it's very good, and I would advise others to form their opinions based on experiencing it for themselves, rather than reading what random people say on Hacker News. :)

Marek






Re: Why Scala?

Marek Kolodziej-2
Also, regarding "why the JVM in general": it's worth remembering that the JVM has excellent garbage collection, and the just-in-time (JIT) compiler can make repetitive code run almost as fast as native C++ code. Then there's the concurrency aspect, which is broken in both Python and Ruby (the GIL). There are other JVM languages, of course, but Scala is both more intuitive and more flexible than, say, Clojure. When one writes big applications or uses new APIs, having a statically typed language not only lets the IDE help you, it also catches many bugs before runtime. Given type inference, type safety is less painful to have than in Java, whereas in Clojure, Python and Ruby you don't have type safety at all (well, you do in Cython, but that's a different story).

Of course the JVM has warts, such as no unsigned numeric types, boxed numeric types for generics, and painful interaction with native code if you need to reuse existing native libraries (JNI, JNA). Nothing is perfect, I guess. However, Java's scalability, efficiency, concurrency and portability did show that the JVM is a great compromise - and it's not surprising that languages that were traditionally not on the JVM now are (Jython, JRuby, etc.).
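A tiny REPL-style illustration of that inference-plus-static-checking point (the values are made up):

    val ports = List(8080, 8443, 9000)   // inferred as List[Int], no annotations needed
    val doubled = ports.map(_ * 2)       // still statically typed: List[Int]

    // The compiler rejects this before anything runs; in Python or Ruby the
    // equivalent mistake only surfaces when the line actually executes:
    // ports.map(p => p.toUpperCase)     // error: value toUpperCase is not a member of Int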

Marek







Re: Why Scala?

Nick Chammas
In reply to this post by Marek Kolodziej-2
Thank you for the specific points about the advantages Scala provides over other languages. Looking at several code samples, the reduction of boilerplate compared to Java is one of the biggest pluses for me.

On Thu, May 29, 2014 at 8:10 PM, Marek Kolodziej <[hidden email]> wrote:
I would advise others to form their opinions based on experiencing it for themselves, rather than reading what random people say on Hacker News. :)

Just a nitpick here: What I said was "It looks like the language is fairly controversial on [Hacker News.]" That was just an observation of what I saw on HN, not a statement of my opinion. I know very little about Scala (or Java, for that matter) and definitely don't have a well-formed opinion on the matter.

Nick

Re: Why Scala?

Krishna Sankar
Nicholas,
   Good question. A couple of thoughts from my practical experience:
  • Coming from R, Scala feels more natural than other languages. The functional style and succinctness of Scala are better suited to data science than other languages. In short, Scala + Spark makes sense for data science, ML, data exploration, et al.
  • Having said that, occasionally practicality does trump the choice of language - last time I really wanted to use Scala but ended up writing in Python! I hope for a better result this time.
  • Language evolution plays out over the long term - we tend to underestimate its velocity and impact. I have seen evolutions through languages starting from Cobol, CCP/M Basic, Turbo Pascal, ... I think Scala will find its equilibrium sooner than we think ...
Cheers
<k/> 




Re: Why Scala?

John Omernik
In reply to this post by Matei Zaharia
So Python is supported in many of the Spark ecosystem components, but not in Streaming at this point. Is there a roadmap for Python APIs in Spark Streaming? Any time frame on this?

Thanks!

John






Re: Why Scala?

Matei Zaharia
Administrator
We are definitely investigating a Python API for Streaming, but no announced deadline at this point.

Matei






Re: Why Scala?

John Omernik
Thank you for the response. If it helps at all: I demoed the Spark platform for our data science team today. The idea of moving code from batch testing to machine learning systems, GraphX, and then to near-real-time models with streaming was cheered by the team as an efficiency they would love. That said, most folks on our team are Python junkies, and they love that Spark seems to be committing to Python; they would REALLY love to see Python in Streaming, as it would make the platform feel complete for them. Using Scala is still awesome, and many will learn it, but full Python integration/support, if possible, would be a home run.










Re: Why Scala?

Jeremy Lee
I'm still a Spark newbie, but I have a heavy background in languages and compilers... so take this with a barrel of salt...

Scala, to me, is the heart and soul of Spark. Spark couldn't work without it. Procedural languages like Python, Java, and all the rest are lovely when you have a couple of processors, but they don't scale (pun intended). It's the same reason they had to invent a slew of 'shader' languages for GPU programming. In fact, that's how I see Scala: as the "CUDA" or "GLSL" of cluster computing.

Now, Scala isn't perfect. It could learn a thing or two from OCCAM about interprocess communication (and from node.js about package management). But functional programming becomes essential for highly parallel code, because the primary difference is that functional code declares _what_ you want to do, while procedural code declares _how_ you want to do it.
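A tiny illustration of that "what" versus "how" distinction, in plain Scala with made-up data:

    val temps = Vector(21.5, 35.2, 19.0, 40.1, 28.3)

    // Procedural: spell out how to walk the data, one element at a time, in a fixed order.
    var hot = 0
    var i = 0
    while (i < temps.length) {
      if (temps(i) > 30.0) hot += 1
      i += 1
    }

    // Functional: declare what you want; the runtime (or a cluster scheduler)
    // is free to decide how and where to evaluate it.
    val hotCount = temps.count(_ > 30.0)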

Since you rarely know the shape of the cluster/graph ahead of time, functional programming becomes the superior paradigm, especially for the "outermost" parts of the program that interface with the scheduler. Python might be fine for the granular fragments, but you would have to export all those independent functions somehow, and define the scheduling and connective structure (the DAG) elsewhere, in yet another language or library. 

To fit neatly into GraphX, Python would probably have to be warped in the same way that GLSL is a stricter sub-set of C. You'd probably lose everything you like about the language, in order to make it seamless. 

I'm pretty agnostic about the whole Spark stack and its components (e.g. every time I run sbt/sbt assembly, Stuart Feldman dies a little inside and I get time to write another long email), but Scala is the one thing that gives it legs. I wish the rest of Spark were more like it (i.e. 'no ceremony').

Scala might seem 'weird', but that's because it directly exposes parallelism, and the ways to cope with it. I've done enough distributed programming that the advantages are obvious, for that domain. You're not being asked to re-wire your thinking for Scala's benefit, but to solve the underlying problem. (But you are still being asked to turn your thinking sideways, I will admit.)

People love Python because it 'fits' its intended domain perfectly. That doesn't mean you'll love it just as much for embedded hardware, or GPU shader development, or telecoms, or Spark.

Then again, give me another week with the language, and see what I'm screaming about then ;-)











--
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers

Re: Why Scala?

Nick Chammas
To add another note on the benefits of using Scala to build Spark, here is a very interesting and well-written post on the Databricks blog about how Scala 2.10's runtime reflection enables some significant performance optimizations in Spark SQL.
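For anyone curious, here is a minimal REPL-style sketch of the Scala 2.10 runtime-reflection facility (TypeTags) that the post builds on. This only illustrates the mechanism, it is not code from the post, and the Person class is made up:

    import scala.reflect.runtime.universe._

    case class Person(name: String, age: Int)

    // Inspect a case class's fields and their types at runtime -- roughly the
    // kind of information an engine like Spark SQL can use to derive a schema:
    def fieldTypes[T: TypeTag]: List[(String, Type)] =
      typeOf[T].members.collect {
        case m: MethodSymbol if m.isCaseAccessor => m.name.toString -> m.returnType
      }.toList

    fieldTypes[Person]   // e.g. List((age, Int), (name, String)); ordering is not guaranteed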

