Spark structured streaming: periodically refresh static data frame

Spark structured streaming: periodically refresh static data frame

Appu K
Hi,

I followed the instructions in the thread https://mail-archives.apache.org/mod_mbox/spark-user/201704.mbox/%3CD1315D33-41CD-4BA3-8B77-0879F3669352@...%3E to periodically reload a static data frame that gets joined to a structured streaming query.

However, the streaming query results do not reflect the data from the refreshed static data frame.


I’m using Spark 2.2.1. Any pointers would be highly helpful.
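For reference, here is roughly the pattern I'm attempting, as a minimal sketch of the gist (the rate source, parquet path, join key, and refresh interval are illustrative placeholders, not my actual code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RefreshStaticDF").getOrCreate()

// streaming side (a toy "rate" source with columns: timestamp, value)
val streamDf = spark.readStream.format("rate").load()

// static side, registered as a temp view so it can be swapped out later
def loadStatic(): Unit =
  spark.read.parquet("/tmp/reference").createOrReplaceTempView("ref")
loadStatic()

// join the stream against the view and start the query
val joined = streamDf.join(spark.table("ref"), Seq("value"))
val query = joined.writeStream.format("console").start()

// a background thread reloads the view periodically
new Thread(new Runnable {
  def run(): Unit = while (true) { Thread.sleep(60000); loadStatic() }
}).start()

query.awaitTermination()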

Thanks a lot

Appu

Re: Spark structured streaming: periodically refresh static data frame

Appu K
More specifically:

Quoting TD from the previous thread:
"Any streaming query that joins a streaming dataframe with the view will automatically start using the most updated data as soon as the view is updated."

I'm wondering if I'm doing something wrong in https://gist.github.com/anonymous/90dac8efadca3a69571e619943ddb2f6

My streaming dataframe is not using the updated data, even though the view is updated!

Thank you




Re: Spark structured streaming: periodically refresh static data frame

Tathagata Das
Let me fix my mistake :)
What I suggested in that earlier thread does not work. A streaming query that joins a streaming dataset with a batch view does not correctly pick up changes when the view is updated. It works only when you restart the query. That is:
- stop the query
- recreate the dataframes
- start the query on the new dataframes, using the same checkpoint location as the previous query

Note that you don't need to restart the whole process/cluster/application; just restart the query in the same process/cluster/application. This should be very fast (within a few seconds). So unless you have latency SLAs of around 1 second, you can periodically restart the query without restarting the process.

Apologies for my misdirection in that earlier thread. Hope this helps.

TD




Re: Spark structured streaming: periodically refresh static data frame

Appu K
TD,

Thanks a lot for the quick reply :) 


Did I understand it right that, to wait for termination of the context in the main thread, I won't be able to use outStream.awaitTermination(), since I'll be stopping the query inside another thread?

What would be a good approach to keep the main app long-running if I have to restart queries?

Should I just wait for 2.3, where I'll be able to join two structured streams (if the release is just a few weeks away)?

Appreciate all the help!

thanks
Appu






Re: Spark structured streaming: periodically refresh static data frame

Tathagata Das
1. Just loop like this:

def startQuery(): StreamingQuery = {
  // (re)define the dataframes and start the query
}

// call this on the main thread
while (notShutdown) {
  val query = startQuery()
  query.awaitTermination(refreshIntervalMs)
  query.stop()
  // refresh the static data
}


2. Yes, stream-stream joins are coming in 2.3.0, which will be released soon. RC3 is available if you want to test it right now: https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-bin/.
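For completeness, a fleshed-out version of this loop might look like the following sketch (the rate source, parquet paths, join key, and checkpoint location are illustrative assumptions, not taken from the thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

val spark = SparkSession.builder.appName("PeriodicRestart").getOrCreate()
val refreshIntervalMs = 15 * 60 * 1000L  // illustrative: restart every 15 minutes

def startQuery(): StreamingQuery = {
  // recreate the static dataframe so the restarted query sees fresh data
  val staticDf = spark.read.parquet("/tmp/reference")
  val streamDf = spark.readStream.format("rate").load()
  streamDf.join(staticDf, Seq("value"))
    .writeStream
    .format("parquet")
    .option("path", "/tmp/out")                // illustrative output path
    .option("checkpointLocation", "/tmp/chk")  // same location across restarts
    .start()
}

// keeps the main thread alive; flip notShutdown to exit cleanly
var notShutdown = true
while (notShutdown) {
  val query = startQuery()
  query.awaitTermination(refreshIntervalMs)  // returns once the timeout elapses
  query.stop()
}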







Re: Spark structured streaming: periodically refresh static data frame

naresh Goud
Appu,

I have also landed on the same problem.

Were you able to solve this issue? Could you please share a snippet of your code if you were?

Thanks,
Naresh






Re: Spark structured streaming: periodically refresh static data frame

Harsh
As per the solution, if we are stopping and restarting the query, then what happens to the state that is maintained in memory? Will that be retained?


