[Structured Spark streaming] How does the Cassandra connector readStream deal with deleted records


[Structured Spark streaming] How does the Cassandra connector readStream deal with deleted records

Rahul Kumar
Hello everyone,

I was wondering how the Cassandra Spark connector deals with deleted or updated
records during a readStream operation. If a record has already been fetched into
Spark memory and is then updated or deleted in the database, is that change
reflected in a streaming join?

Thanks,
Rahul





Re: [Structured Spark streaming] How does the Cassandra connector readStream deal with deleted records

Jungtaek Lim
I'm not sure how it is implemented, but in general I wouldn't expect such behavior from connectors that read from non-streaming storage. The query result may depend on "when" the records are fetched.

If you need the changes reflected in your query, you'll probably want to find a way to retrieve "change logs" from your external storage (or have your system/product produce change logs if the storage doesn't support them natively) and adapt your query to consume them. The keyword to search for more details is "Change Data Capture".
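As a rough illustration only (not something the connector gives you out of the box): assuming an external CDC pipeline, e.g. a Debezium-style connector, publishes the table's change events to a Kafka topic, a Structured Streaming query could consume them like this. The topic name and JSON payload layout are made up.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CassandraCdcStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-cdc-stream")
      .getOrCreate()
    import spark.implicits._

    // Change events published by an external CDC pipeline (hypothetical topic).
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_table_changes")
      .load()
      .select(
        get_json_object($"value".cast("string"), "$.key").as("user_id"), // assumed payload layout
        get_json_object($"value".cast("string"), "$.op").as("op"),       // insert / update / delete
        get_json_object($"value".cast("string"), "$.after").as("after"),
        $"timestamp")

    // Folding these events into state (or joining the event stream against them)
    // is what lets deletes and updates show up in the query, unlike a snapshot read.
    changes.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}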

Otherwise, you can apply the traditional approach: run a batch query periodically and replace the entire output.
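A minimal sketch of that periodic refresh in Scala, using the spark-cassandra-connector's DataFrame reader (keyspace, table, and output path are made up; the sleep loop is just for illustration):

import org.apache.spark.sql.SparkSession

object PeriodicCassandraSnapshot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("periodic-cassandra-snapshot")
      .config("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
      .getOrCreate()

    while (true) {
      // Batch read of the table's current state through the connector.
      val snapshot = spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "my_ks", "table" -> "my_table")) // hypothetical names
        .load()

      // Replace the previous output wholesale with the fresh snapshot.
      snapshot.write
        .mode("overwrite")
        .parquet("/tmp/my_table_snapshot") // hypothetical output path

      Thread.sleep(10 * 60 * 1000L) // refresh every 10 minutes
    }
  }
}

Each iteration sees whatever the table holds at that moment, so deletes and updates are picked up on the next refresh, at the cost of re-reading everything.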



Re: [Structured Spark streaming] How does the Cassandra connector readStream deal with deleted records

Russell Spitzer
The connector issues Java driver CQL requests under the hood, which means it responds to a changing database the same way a normal application would. Retries may therefore return a different set of data than the original request if the underlying database has changed in the meantime.
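To make the implication concrete, here is a rough Scala sketch (not from the connector docs; keyspace, table, and topic names are made up) of a stream-static join where the static side is a Cassandra table read through the connector. Whether and when the Cassandra side is re-scanned depends on the micro-batch planning and any caching, so deletes and updates only show up when a fresh CQL read actually happens.

import org.apache.spark.sql.SparkSession

object StreamStaticCassandraJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-static-cassandra-join")
      .config("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
      .getOrCreate()

    // Streaming side: events from Kafka (hypothetical topic).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS user_id", "CAST(value AS STRING) AS payload")

    // "Static" side: a Cassandra table read through the connector.
    // Each read is just CQL under the hood, so it reflects the table at the
    // moment the request runs, not a log of changes.
    val users = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "users")) // hypothetical names
      .load()

    val joined = events.join(users, Seq("user_id"), "left_outer")

    joined.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}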
