I was just curious if anyone has ever used Spark as an application server cache?
My use case is:
* I have large datasets which need to be updated / inserted (upsert) in the database
* I have actually found it much easier to run a spark-submit job that pulls the existing data from the database, compares it in memory with the incoming new data, and upserts only the rows that actually changed (dropping all duplicates)
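To make the compare-then-upsert step concrete, here is a minimal sketch of that filtering logic in plain Python (a stand-in for the Spark DataFrame comparison; the function and field names are made up for illustration):

```python
def rows_to_upsert(existing, incoming, key="id"):
    """Return only the incoming rows that are new or changed
    relative to what is already in the database.
    Plain-Python stand-in for the Spark DataFrame comparison."""
    # Index existing rows by primary key for O(1) lookups.
    by_key = {row[key]: row for row in existing}
    out = []
    for row in incoming:
        current = by_key.get(row[key])
        if current != row:  # new key, or same key with changed values
            out.append(row)
    return out

existing = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
incoming = [
    {"id": 2, "v": "b"},  # exact duplicate -> dropped
    {"id": 2, "v": "c"},  # changed row     -> upsert
    {"id": 3, "v": "d"},  # new row         -> insert
]
print(rows_to_upsert(existing, incoming))
# → [{'id': 2, 'v': 'c'}, {'id': 3, 'v': 'd'}]
```

In Spark terms this is essentially an anti-join of the incoming batch against the current table contents, keeping only rows with no exact match.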
I was thinking that if I keep the Spark DataFrame in memory in a long-running Spark session, I could speed this up further by removing the database query from each batch run.
I have a data pipeline in which I'm subscribed to what is essentially a firehose of information. I want to save everything, but I don't want to write any duplicate data, and I'd like to eliminate those duplicates in memory before making the database IO call.
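The long-running in-memory dedup layer I have in mind would behave roughly like this (again a plain-Python sketch standing in for the cached DataFrame; the fingerprinting scheme is my own assumption, not anything Spark-specific):

```python
import hashlib
import json


class DedupCache:
    """Keeps a fingerprint of every row already written, so each
    batch only sends previously-unseen rows to the database.
    Plain-Python stand-in for a long-lived cached Spark DataFrame."""

    def __init__(self):
        self._seen = set()

    def _fingerprint(self, row):
        # Stable hash of the full row contents
        # (assumes rows are JSON-serializable dicts).
        return hashlib.sha256(
            json.dumps(row, sort_keys=True).encode()).hexdigest()

    def filter_new(self, batch):
        """Return only rows not seen in any earlier batch."""
        fresh = []
        for row in batch:
            fp = self._fingerprint(row)
            if fp not in self._seen:
                self._seen.add(fp)
                fresh.append(row)
        return fresh


cache = DedupCache()
print(cache.filter_new([{"id": 1}, {"id": 1}]))  # second copy dropped
print(cache.filter_new([{"id": 1}, {"id": 2}]))  # only {"id": 2} is new
```

The catch with doing this inside Spark itself is that the state has to survive between batch runs, which is why I'm asking whether a long-running session is a reasonable way to hold it.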
If anyone has used Spark like this, I'd appreciate your input, or a different solution if Spark is not appropriate here.