Structured Streaming & Enrichment Broadcasts

Structured Streaming & Enrichment Broadcasts

Bryan Jeffrey
Hello.

We're running applications on Spark Streaming and are beginning work to move to Structured Streaming. One of our key scenarios is looking up values from an external data source for each record in an incoming stream. In Spark Streaming we currently read the external data, broadcast it, and then look up values from the broadcast. The broadcast is refreshed periodically, with the need to refresh evaluated on each batch (inside a foreachRDD). The broadcasts are somewhat large (~1M records), and each stream doing the lookups carries ~6M records/second.
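For context, the pattern I'm describing looks roughly like this; the loader function and refresh interval are hypothetical stand-ins:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession

// Sketch of the periodic-refresh broadcast pattern described above.
// `loadReferenceData` and the interval are hypothetical names/values.
object EnrichmentBroadcast {
  private val refreshIntervalMs = 10 * 60 * 1000L // e.g. refresh every 10 minutes
  @volatile private var current: Broadcast[Map[String, String]] = _
  @volatile private var lastRefresh = 0L

  // Called on the driver at the start of each batch (e.g. inside foreachRDD).
  def get(spark: SparkSession,
          loadReferenceData: () => Map[String, String]): Broadcast[Map[String, String]] =
    synchronized {
      val now = System.currentTimeMillis()
      if (current == null || now - lastRefresh > refreshIntervalMs) {
        if (current != null) current.unpersist()
        current = spark.sparkContext.broadcast(loadReferenceData())
        lastRefresh = now
      }
      current
    }
}
```

Executors then do a per-record `current.value.get(key)`, which is what makes the lookup fast relative to a shuffle join.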

While we could conceivably continue this pattern in Structured Streaming on Spark 2.4.x using foreachBatch, based on my reading of the documentation this seems like a bit of an anti-pattern in Structured Streaming.
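Concretely, carrying the pattern over would look something like this sketch; `loadReferenceDf`, the join key, and the sink are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Sketch: the same periodic-refresh check carried into Structured
// Streaming via foreachBatch (Spark 2.4+). Loader, key, and sink are
// hypothetical.
def start(spark: SparkSession, stream: DataFrame,
          loadReferenceDf: SparkSession => DataFrame): Unit = {
  val refreshIntervalMs = 10 * 60 * 1000L
  var refDf: DataFrame = null
  var lastRefresh = 0L

  stream.writeStream
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      val now = System.currentTimeMillis()
      if (refDf == null || now - lastRefresh > refreshIntervalMs) {
        if (refDf != null) refDf.unpersist()
        refDf = loadReferenceDf(spark).cache()
        lastRefresh = now
      }
      // Enrich the micro-batch with a broadcast join against the cached side.
      batch.join(broadcast(refDf), Seq("key"), "left_outer")
        .write.format("parquet").save("/output/enriched") // hypothetical sink
    }
    .start()
}
```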

So I am looking for advice: what mechanism would you suggest for periodically reading an external data source and doing a fast lookup against a streaming input? One option appears to be a broadcast left outer join, but in the past that mechanism has been harder to performance-tune than an explicit broadcast and lookup.
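For reference, the two tuning levers I know of for the join approach are the explicit hint and the planner threshold; `events` and `refData` below are hypothetical DataFrames keyed by a "key" column:

```scala
import org.apache.spark.sql.functions.broadcast

// Explicit hint: forces broadcasting the static side regardless of the
// planner's size estimate.
val enriched = events.join(broadcast(refData), Seq("key"), "left_outer")

// Alternatively, raise the planner's automatic broadcast threshold (bytes)
// so Spark chooses the broadcast join on its own.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 512L * 1024 * 1024)
```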

Regards,

Bryan Jeffrey
Re: Structured Streaming & Enrichment Broadcasts

Burak Yavuz-2
If you store the data you're going to broadcast as a Delta table (see delta.io) and perform a stream-batch join, where your Delta table is the batch side, the join will automatically pick up any updates the table receives.
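A minimal sketch of that stream-batch join; the paths, sources, and join key are hypothetical. Because the batch side is a Delta table, each micro-batch reads its latest snapshot, so updates flow through without restarting the query:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Batch side: Delta re-resolves the table's latest snapshot per micro-batch.
val refTable = spark.read.format("delta").load("/tables/reference")
// Stream side: any streaming source works; Delta is used here for symmetry.
val events = spark.readStream.format("delta").load("/tables/events")

// Stream-static left outer join on a hypothetical "key" column.
val enriched = events.join(refTable, Seq("key"), "left_outer")

enriched.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/enriched")
  .start("/tables/enriched")
```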

Best,
Burak

On Mon, Nov 18, 2019, 6:21 AM Bryan Jeffrey <[hidden email]> wrote:

Re: Structured Streaming & Enrichment Broadcasts

shicheng31604@gmail.com
I have a scenario similar to yours, but we are using a UDF to do exactly that. The difficulty is that you need to get the value of a broadcast variable from inside the UDF, and it's not clear how to achieve that. Does anyone know?
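In case it helps: a common approach is to capture the Broadcast handle in the UDF's closure and call `.value` inside it; `.value` is then resolved on the executors. A minimal sketch with made-up data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Broadcast the lookup map once; the UDF closes over the handle.
// The map contents here are made up for illustration.
val refMap = Map("a" -> "alpha", "b" -> "beta")
val bcast  = spark.sparkContext.broadcast(refMap)

val lookupUdf = udf((key: String) => bcast.value.getOrElse(key, "unknown"))

val df = Seq("a", "b", "c").toDF("key")
df.withColumn("label", lookupUdf($"key")).show()
```

Note the handle is fixed when the UDF is defined; to pick up refreshed data you would need to re-create the broadcast and the UDF (e.g. once per batch inside foreachBatch).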

On Tue, Nov 19, 2019 at 12:23 PM, Burak Yavuz <[hidden email]> wrote: