Join streams Apache Spark

tencas
Hi everybody,

I am using Apache Spark Streaming with a TCP connector to receive data.
I have a Python application that connects to a sensor, creates a TCP server that waits for a connection from Apache Spark, and then sends JSON data through this socket.
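
Roughly, the sensor side looks like this (a minimal Python sketch, not my actual application; the port, the sensor-reading function, and the JSON field names are placeholders):

import json
import socket

# Hypothetical sensor-side server: wait for Spark to connect as a
# client, then stream one JSON object per line over the socket.
HOST, PORT = "0.0.0.0", 9999  # example port

def read_sensor():
    # Placeholder for the real sensor read; returns a dict.
    return {"sensor_id": 1, "value": 42.0}

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen(1)
    conn, _ = server.accept()  # Spark's receiver connects here
    with conn:
        while True:
            conn.sendall((json.dumps(read_sensor()) + "\n").encode("utf-8"))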

How can I join many independent sensor sources so that they all send data to the same receiver on Apache Spark?

Thanks.

Re: Join streams Apache Spark

saulshanabrook
I wrote a server in Go that accepts many TCP connections for incoming data on one port and writes each line to the client listening on another port. The CLOJURE_PORT environment variable sets the port that clients (the sensors in your case) should connect to in order to send data to Spark, and SPARK_PORT sets the port that Spark should connect to in order to listen for data.
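
If it helps to see the idea without the Go source, here is a minimal Python sketch of the same fan-in relay (an illustration, not the actual server; error handling is omitted, and lines arriving before Spark connects are dropped):

import os
import socket
import threading

# Fan-in relay: many sensors connect on CLOJURE_PORT, one Spark
# receiver connects on SPARK_PORT; every incoming line is forwarded.
SENSOR_PORT = int(os.environ.get("CLOJURE_PORT", "7777"))
SPARK_PORT = int(os.environ.get("SPARK_PORT", "9999"))

spark_conn = []          # holds the single Spark connection once it arrives
lock = threading.Lock()

def accept_spark():
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", SPARK_PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    with lock:
        spark_conn.append(conn)

def handle_sensor(conn):
    with conn, conn.makefile("rb") as f:
        for line in f:                 # one message per line
            with lock:
                if spark_conn:
                    spark_conn[0].sendall(line)

threading.Thread(target=accept_spark, daemon=True).start()

srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", SENSOR_PORT))
srv.listen(16)
while True:
    client, _ = srv.accept()           # one thread per sensor connection
    threading.Thread(target=handle_sensor, args=(client,), daemon=True).start()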

If anyone knows a simpler way of doing this, by using some existing software, I would love to know about it.

If you are interested in this code, I would be happy to clean it up and release it with some documentation.

Re: Join streams Apache Spark

tencas
Thanks @saulshanabrook, I'll have a look at it.

I think Apache Kafka could be an alternative solution, but I haven't checked it yet.

Re: Join streams Apache Spark

saulshanabrook
Would love to hear if you try it out. I was also considering that. I recently changed to using the file-based streaming input: I made another Go script that lets me connect over TCP and writes each newline it receives to a new file in a folder. Then Spark can read the files from that folder.
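
For comparison, a minimal Python sketch of that file-based approach (the port and output directory are placeholders; note the write-then-rename so Spark only sees completed files):

import os
import socket
import time

# Writes each TCP line to a new file in a directory that Spark
# Streaming can watch with ssc.textFileStream(OUT_DIR).
PORT, OUT_DIR = 9999, "/tmp/spark-input"   # placeholders

os.makedirs(OUT_DIR, exist_ok=True)
srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", PORT))
srv.listen(1)
conn, _ = srv.accept()
with conn, conn.makefile("rb") as f:
    for i, line in enumerate(f):
        # Write to a hidden temp file first, then rename, so Spark
        # picks the file up only once it is complete.
        tmp = os.path.join(OUT_DIR, f".{i}.tmp")
        final = os.path.join(OUT_DIR, f"{time.time():.6f}-{i}.txt")
        with open(tmp, "wb") as out:
            out.write(line)
        os.rename(tmp, final)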

Re: Join streams Apache Spark

tencas
There is a Spark Streaming example of the classic word count that uses the Apache Kafka connector:

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

(maybe you already know it)

The question is: what are the benefits of using Kafka instead of a lighter solution like yours? Maybe somebody can help us. Anyway, when I try it out, I'll give you feedback.
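
For reference, a rough PySpark counterpart of that Java example (a sketch only: it uses the direct stream API rather than the receiver-based one in the Java example, and it assumes the spark-streaming-kafka-0-8 package, a broker on localhost:9092, and a hypothetical "sensors" topic):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Word count over a Kafka topic; topic name and broker address
# are placeholders for your own setup.
sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 2)

stream = KafkaUtils.createDirectStream(
    ssc, ["sensors"], {"metadata.broker.list": "localhost:9092"})

counts = (stream.map(lambda kv: kv[1])        # drop the key, keep the value
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()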

On the other hand, do you by any chance have the same script written in Scala, Python, or Java?

Re: Join streams Apache Spark

saulshanabrook
The script I wrote in Go? No, I don't, but it's very easy to compile it for whatever platform you are running on! It doesn't need to be written in the same language as the rest of your code.

Re: Join streams Apache Spark

tencas
Yep, I mean the first script you posted. So, you can compile it to Java binaries, for example? OK, I have no idea about Go.

Re: Join streams Apache Spark

saulshanabrook
Actually, I just ran it in a Docker image. But the point is, it doesn't need to run on the JVM, because it runs as a separate process. Your Java (or any other client) code sends messages to it over TCP, and it relays them to Spark.

Re: Join streams Apache Spark

scorpio
Any number of independent clients (Python apps in your case) can connect to the same Spark server. You have a listening socket at the Spark level, so that's not an issue at all. If the data coming from each sensor has a unified schema, you can use the same parser; otherwise you will have to parse differently based on sensor type.

An alternative approach could be to run different Spark receivers on different ports, but in that case each client app needs to be configured to connect to a unique port.
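
A rough PySpark sketch of that second approach (the ports are examples; note that each socket receiver occupies one executor core, so allocate enough cores):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# One socket receiver per sensor port, merged into a single
# DStream with union() so downstream code sees one stream.
sc = SparkContext(appName="MultiPortSensors")
ssc = StreamingContext(sc, 2)

ports = [9999, 10000, 10001]   # example: one port per sensor app
streams = [ssc.socketTextStream("localhost", p) for p in ports]
merged = ssc.union(*streams)

merged.pprint()
ssc.start()
ssc.awaitTermination()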

Re: Join streams Apache Spark

tencas
Hi scorpio,

Thanks for your reply.
I don't understand your approach. Is it possible to receive data from different clients through the same port on Spark?

Surely I'm confused and I'd appreciate your opinion.

Regarding the word count example from the Spark Streaming documentation, Spark acts as a client that connects to a remote server in order to receive data:

// Create a DStream that will connect to hostname:port, like localhost:9999
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
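
(The PySpark equivalent would be, as a sketch with the same host and port:

# Python equivalent: Spark still connects out as a TCP client
lines = ssc.socketTextStream("localhost", 9999)
)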


Then, you create a dummy server using nc to accept connection requests from Spark and to send data:

nc -lk 9999

So, in this implementation, since Spark plays the role of TCP client, you'd need to manage the joining of the external sensor streams (all with the same schema, by the way) in your own server.
How would you make Spark act as a "sink" that can receive streams from different sources through the same port?