Noob Spark questions

8 messages

Noob Spark questions

od
Hello, I am new to Spark. I have installed it and played with it a bit, and I am mostly reading through the "Fast Data Processing with Spark" book.

One of the first things I realized is that I will have to learn Scala; the real-time data analytics part is not supported by the Python API, correct? I don't mind, Scala seems to be a lovely language! :)

Anyway, I would like to set up a data analysis pipeline. I have already done the job of exposing a port on the internet (via an Amazon Elastic Load Balancer) that feeds real-time data from tens to hundreds of thousands of clients into a set of internal instances that are essentially ZeroMQ sockets (I do this via Mongrel2 and its associated handlers).

These handlers can themselves create 0mq sockets to feed data into a "pipeline" via a 0mq push/pull, pub/sub, or whatever mechanism works best.

One of the pipelines I am evaluating is Spark.

There is plenty of information on Spark, but I find much of it to be very Hadoop-specific; HDFS is mentioned a lot, for example. What if I don't use Hadoop/HDFS?

What do people do when they want to inhale real-time information? Let's say I want to use 0mq. Does Spark allow for that? How would I go about doing it?

What about "dumping" all the data into a persistent store? Can I dump into DynamoDB or Mongo or...? How about Amazon S3? I suppose my 0mq handlers could do that upon receipt of the data, before it "sees" the pipeline, but sometimes storing intermediate results helps too...

Thanks!
OD

Re: Noob Spark questions

Jie Deng
I am using Java, and Spark has APIs for Java as well. There is a saying that Java in Spark is slower than the Scala shell, but that depends on your requirements.
I am not an expert in Spark, but as far as I know, Spark provides different storage levels, including memory and disk. For the disk part, HDFS is just one choice. I am not using HDFS myself, but then you also lose the benefits of HDFS. In other words, it again comes down to your requirements.
MongoDB and S3 are also doable, at least with the Java APIs, I suppose.
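[To make the two points above concrete, here is a rough Scala sketch of choosing a storage level for a cached RDD and of reading from and writing to S3 directly, without HDFS. The bucket and key names are placeholders, and AWS credentials are assumed to be supplied via the Hadoop configuration.]

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object S3Sketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "S3Sketch")

    // S3 is just another Hadoop-supported filesystem URL; no HDFS required.
    // "my-bucket" and the key prefixes are placeholders.
    val events = sc.textFile("s3n://my-bucket/raw/2013-12-23/*")

    // "Different storage levels": keep the working set in memory,
    // spilling partitions to local disk when they do not fit.
    val cached = events.filter(_.nonEmpty).persist(StorageLevel.MEMORY_AND_DISK)

    // Dump results back to a persistent store; here, straight to S3.
    cached.saveAsTextFile("s3n://my-bucket/processed/2013-12-23")

    sc.stop()
  }
}
```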


2013/12/23 Ognen Duzlevski <[hidden email]>


Re: Noob Spark questions

Mark Hamstra
There is a saying that Java in Spark is slower than the Scala shell

That shouldn't be said. The Java API is mostly a thin wrapper around the Scala implementation, and the performance of the Java API is intended to be equivalent to that of the Scala API. If you are finding that not to be true, that is something the Spark developers would like to know.


On Mon, Dec 23, 2013 at 1:23 PM, Jie Deng <[hidden email]> wrote:



Re: Noob Spark questions

od
In reply to this post by Jie Deng
Hello,

On Mon, Dec 23, 2013 at 3:23 PM, Jie Deng <[hidden email]> wrote:


I guess that answers the question of whether it is doable. Where and how do I find out how it is doable? :)

I am guessing every pipeline is a "custom job" of sorts, and hence it is the developer's job to write the "connectors" to 0mq or DynamoDB, for example? Or...? Is there some kind of "plug-in" system for Spark?
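[There is no plug-in registry as such; ingestion "connectors" are written against Spark Streaming's receiver interface. The sketch below uses the Receiver base class as it appears in Spark releases after this thread, with the actual 0mq socket code stubbed out; the class and endpoint names are hypothetical.]

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Skeleton of a custom connector: a receiver that would own a 0mq PULL
// socket and hand each message to Spark via store(). The socket calls
// themselves are left as stubs.
class ZmqPullReceiver(endpoint: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Run the blocking receive loop on its own thread, as the API requires.
    new Thread("zmq-pull-receiver") {
      override def run(): Unit = receiveLoop()
    }.start()
  }

  def onStop(): Unit = {
    // Close the 0mq socket here; receiveLoop() watches isStopped().
  }

  private def receiveLoop(): Unit = {
    while (!isStopped()) {
      // val msg: String = ...blocking read from the 0mq socket...
      // store(msg)
    }
  }
}

// Wiring it into a StreamingContext:
//   val lines = ssc.receiverStream(new ZmqPullReceiver("tcp://host:5555"))
```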

Thanks!

Re: Noob Spark questions

Jie Deng
@Mark Hamstra:
Thanks, good to know. 

@Ognen Duzlevski


2013/12/24 Ognen Duzlevski <[hidden email]>


Re: Noob Spark questions

od
In reply to this post by od
Can anyone provide any code examples of connecting Spark to ZeroMQ data producers for the purposes of simple real-time analytics? Even the most basic example would be nice :)

Thanks!


On Mon, Dec 23, 2013 at 2:42 PM, Ognen Duzlevski <[hidden email]> wrote:


Re: Noob Spark questions

Aaron Davidson


On Mon, Dec 30, 2013 at 9:41 PM, Ognen Duzlevski <[hidden email]> wrote:



Re: Noob Spark questions

od
Yes, this helps! I see it uses Akka.

Thanks!
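[For the archive: the example referred to is along the lines of Spark's bundled ZeroMQ word-count example, which uses the Akka-based ZeroMQ receiver. A minimal sketch follows; newer releases expose this as ZeroMQUtils.createStream in the spark-streaming-zeromq module, and the endpoint, topic name, and message framing below are assumptions to adapt to your own setup.]

```scala
import akka.util.ByteString
import akka.zeromq.Subscribe
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.zeromq.ZeroMQUtils

object ZmqStreamSketch {
  // A ZeroMQ message arrives as a sequence of frames; this assumes frame 0
  // is the topic and the remaining frames are UTF-8 payload lines.
  def bytesToStringIterator(frames: Seq[ByteString]): Iterator[String] =
    frames.tail.map(_.utf8String).iterator

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ZmqStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Endpoint and topic are placeholders for the Mongrel2 handlers'
    // publishing socket.
    val lines = ZeroMQUtils.createStream(
      ssc, "tcp://127.0.0.1:5555", Subscribe("events"),
      bytesToStringIterator _)

    // Simplest possible "analytics": count messages per batch.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```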

