Spark Streaming - Shared hashmaps


Spark Streaming - Shared hashmaps

Bryan Bryan
Hi there,

I have read about the two fundamental shared features in Spark (broadcast variables and accumulators), but neither is quite what I need.

I'm using Spark Streaming to receive requests from Kafka. These requests may launch long-running tasks, and I need to control them:

1) Keep them in a shared bag, such as a HashMap, so that they can be retrieved by ID, for example.
2) Retrieve an instance of that object/task on demand (on request, in fact).


Any ideas? How can I share objects between slaves? Should I use something outside of Spark (maybe Hazelcast?)


Regards.
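
For concreteness, a cluster-wide map of the sort being asked about could look like this with Hazelcast. A minimal sketch, assuming the Hazelcast jar is on every node's classpath; TaskHandle and the map name are invented for the example:

    import com.hazelcast.core.Hazelcast

    // Placeholder for whatever per-request state needs to be tracked.
    case class TaskHandle(id: String, description: String)

    object SharedTasks {
      // Lazily join (or start) the Hazelcast cluster, once per JVM.
      lazy val instance = Hazelcast.newHazelcastInstance()
      // A distributed map, visible from every node in the cluster.
      lazy val byId = instance.getMap[String, TaskHandle]("running-tasks")
    }

    // On any node:
    //   SharedTasks.byId.put("req-42", TaskHandle("req-42", "long-running import"))
    //   val task = SharedTasks.byId.get("req-42")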

Re: Spark Streaming - Shared hashmaps

Tathagata Das
When you say "launch long-running tasks", do you mean long-running Spark jobs/tasks, or long-running tasks in another system?

If the rate of requests from Kafka is not high (in terms of records per second), you could collect the records in the driver and maintain the "shared bag" there. A separate thread in the driver could pick items from the bag and launch the "tasks". This is a slightly unorthodox use of Spark Streaming, but it should work.
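
A minimal sketch of that driver-side pattern, assuming the 0.8-era Kafka receiver API and a hypothetical handleLongRunningRequest that does the actual work:

    import java.util.concurrent.{ConcurrentHashMap, Executors}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("requests"), Seconds(5))
    // DStream of (key, payload) pairs; here the key serves as a request id.
    val requestStream =
      KafkaUtils.createStream(ssc, "zk-host:2181", "request-group", Map("requests" -> 1))

    val bag = new ConcurrentHashMap[String, String]()   // the "shared bag": id -> payload
    val workers = Executors.newFixedThreadPool(4)       // separate threads for the long-running work

    requestStream.foreachRDD { rdd =>
      // collect() brings the whole (small) batch back to the driver,
      // which is only viable while the request rate stays low.
      rdd.collect().foreach { case (id, payload) =>
        bag.put(id, payload)
        workers.submit(new Runnable {
          def run(): Unit =
            try handleLongRunningRequest(id, payload)  // hypothetical handler
            finally bag.remove(id)                     // done: drop it from the bag
        })
      }
    }

    ssc.start()
    ssc.awaitTermination()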

If the rate of requests from Kafka is high, then I am not sure how you can sustain that many long-running tasks (assuming one task corresponds to each request from Kafka).

TD




Re: Spark Streaming - Shared hashmaps

Bryan Bryan
Thanks for your support. Here's my idea of the project; I'm a newbie, so please forgive my misunderstandings:

Spark Streaming will collect requests such as: create a table, append records to a table, drop a table (just an example).

With Spark Streaming I can filter the messages by key (the kind of request) and send them (via foreachRDD) to a specific function that handles each kind of request, as sketched below.
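
A sketch of that dispatch, assuming requestStream is a DStream of (kind, payload) pairs from Kafka and createTable/dropTable are hypothetical handlers:

    val creates = requestStream.filter { case (kind, _) => kind == "create-table" }
    val drops   = requestStream.filter { case (kind, _) => kind == "drop-table" }

    // Each handler runs on the executors, once per record in the batch.
    creates.foreachRDD(rdd => rdd.foreach { case (_, payload) => createTable(payload) })
    drops.foreachRDD(rdd => rdd.foreach { case (_, payload) => dropTable(payload) })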


This is just fine when the requests are "self-contained", in other words one-step requests, for example create table or drop table.

But it's a bit more complicated if I need a connection that:

1) must outlive the scope of the function, and
2) must be shared across slaves.

For example, a connection to the database.

What do you think is the best approach for this scenario?
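
For reference, a standard pattern for this: the connection itself is never shipped between slaves (it is not serializable), so each executor JVM lazily creates its own through a singleton, and foreachPartition reuses it across the records of a partition. A sketch with plain JDBC; appendStream, the URL, and the table are invented for the example:

    import java.sql.{Connection, DriverManager}

    object Db {
      // A lazy val is initialized once per JVM, i.e. once per executor,
      // so the connection survives across batches on that worker.
      // A production setup would use a connection pool instead, since one
      // JDBC connection must not be used by concurrent tasks.
      lazy val conn: Connection =
        DriverManager.getConnection("jdbc:postgresql://db-host/app")
    }

    appendStream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val stmt = Db.conn.prepareStatement("INSERT INTO events(payload) VALUES (?)")
        records.foreach { case (_, payload) =>
          stmt.setString(1, payload)
          stmt.executeUpdate()
        }
        stmt.close()
      }
    }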
