share datasets across multiple spark-streaming applications for lookup


roshan joe
Hi, 

What is the recommended way to share datasets across multiple spark-streaming applications, so that the incoming data can be looked up against this shared dataset? 

The shared dataset is also incrementally refreshed and stored on S3. Below is the scenario. 

Streaming App-1 consumes data from Source-1 and writes to DS-1 in S3. 
Streaming App-2 consumes data from Source-2 and writes to DS-2 in S3. 


Streaming App-3 consumes data from Source-3, needs to look up against DS-1 and DS-2, and writes to DS-3 in S3. 
Streaming App-4 consumes data from Source-4, needs to look up against DS-1 and DS-2, and writes to DS-3 in S3. 
Streaming App-n consumes data from Source-n, needs to look up against DS-1 and DS-2, and writes to DS-n in S3.

So DS-1 and DS-2 ideally should be shared for lookup across multiple streaming apps. Any input is appreciated. Thank you!
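
A minimal sketch of the lookup side of this scenario, assuming each streaming app keeps its own in-memory copy of DS-1 and DS-2 and reloads it once the copy is older than a chosen interval. `RefreshingLookup`, `enrich`, the `loader` callable, and the `key` field are hypothetical names; in a real app the loader would read the dataset from S3, which is left out here. This is plain Python rather than Spark code, and it shows only one possible pattern, not a recommended answer to the question.

```python
import time


class RefreshingLookup:
    """In-process copy of a shared lookup dataset, reloaded when it gets stale."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader            # hypothetical: returns a dict-like snapshot
        self._max_age = max_age_seconds  # how long a snapshot may be reused
        self._clock = clock              # injectable so the refresh logic can be tested
        self._data = {}
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            # Take a fresh snapshot of the shared dataset.
            self._data = dict(self._loader())
            self._loaded_at = now

    def get(self, key, default=None):
        self._refresh_if_stale()
        return self._data.get(key, default)


def enrich(batch, ds1, ds2):
    """Look up every incoming record against both shared datasets."""
    return [
        {**record, "ds1": ds1.get(record["key"]), "ds2": ds2.get(record["key"])}
        for record in batch
    ]
```

Each of App-3 through App-n would hold one `RefreshingLookup` per shared dataset and call `enrich` on every incoming batch. The trade-off in this sketch is staleness: a record can be matched against a snapshot that is up to `max_age_seconds` old.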

Re: share datasets across multiple spark-streaming applications for lookup

sparkuser101

Any info on the question below would be really appreciated.

I read about Alluxio and Ignite. Has anybody used either of them? Do they work well with multiple apps doing lookups simultaneously? Are there better options? Thank you.

 

From: roshan joe <[hidden email]>
Date: Monday, October 30, 2017 at 7:53 PM
To: "[hidden email]" <[hidden email]>
Subject: share datasets across multiple spark-streaming applications for lookup

 

Re: share datasets across multiple spark-streaming applications for lookup

gene.pang
Hi,

Alluxio enables sharing dataframes across different applications. This blog post talks about dataframes and Alluxio, and this Spark Summit presentation has additional information.

Thanks,
Gene

On Tue, Oct 31, 2017 at 6:04 PM, Revin Chalil <[hidden email]> wrote:


Re: share datasets across multiple spark-streaming applications for lookup

Joseph Pride
Folks:

SnappyData.

I’m fairly new to working with it myself, but it looks pretty promising. It marries Spark with a co-located in-memory GemFire (or something gem-related) database. So you can access the data with SQL, JDBC, ODBC (if you wanna go Enterprise instead of open-source) or natively as mutable RDDs and DataFrames.

You can run it so the storage and Spark compute are co-located in the same JVM on each machine, so you get data locality instead of a bottleneck between load, save, and compute. The data is supposed to persist across applications and cluster restarts, and multiple applications can work on the data at the same time.

I hope it works for what I’m doing and isn’t too buggy. But it looks pretty good.

—Joe Pride

On Oct 31, 2017, at 11:14 AM, Gene Pang <[hidden email]> wrote:


Re: share datasets across multiple spark-streaming applications for lookup

Jean Georges Perrin
Or Databricks Delta (announced at Spark Summit) or IBM Event Store, depending on the use case.

On Oct 31, 2017, at 14:30, Joseph Pride <[hidden email]> wrote:
