bulk upsert data batch from Kafka dstream into Postgres db


bulk upsert data batch from Kafka dstream into Postgres db

salemi
Hi All,

we are consuming messages from Kafka using a Spark DStream. Once the
processing is done, we would like to insert/update the data in bulk into the
database.

I was wondering what the best solution for this might be. Our Postgres
database table is not partitioned.


Thank you,

Ali



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: bulk upsert data batch from Kafka dstream into Postgres db

Cody Koeninger
use foreachPartition(), get a connection from a JDBC connection pool,
and insert the data the same way you would in a non-Spark program.

If you're only doing inserts, postgres COPY will be faster (e.g.
https://discuss.pivotal.io/hc/en-us/articles/204237003), but if you're
doing updates that's not an option.

Depending on how many spark partitions you have, coalesce() to
decrease the number of partitions may help avoid database contention
and speed things up, but you'll need to experiment.

On Wed, Dec 13, 2017 at 11:52 PM, salemi <[hidden email]> wrote:




Re: bulk upsert data batch from Kafka dstream into Postgres db

salemi
Thank you for your response.
That approach only loads the data into the DB. I am looking for an approach
that allows me to update existing entries in the DB or insert a new entry
if it doesn't exist.








Re: bulk upsert data batch from Kafka dstream into Postgres db

Cody Koeninger
Modern versions of Postgres (9.5+) have upsert, i.e. INSERT INTO ... ON
CONFLICT ... DO UPDATE
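A hedged sketch of how that upsert statement could be built and fed into the foreachPartition approach from earlier in the thread. The helper, table, and column names are hypothetical; the ON CONFLICT clause itself is standard PostgreSQL 9.5+ syntax.

```python
def build_upsert(table, key_cols, update_cols):
    """Build an INSERT ... ON CONFLICT ... DO UPDATE statement.

    `key_cols` must be covered by a unique index/constraint on `table`;
    EXCLUDED refers to the row that failed to insert.
    """
    cols = key_cols + update_cols
    placeholders = ", ".join(["%s"] * len(cols))
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in update_cols)
    return (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE SET {updates}"
    )

# e.g. build_upsert("events", ["id"], ["val"]) produces:
# INSERT INTO events (id, val) VALUES (%s, %s)
#   ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val
```

The resulting SQL can be passed to executemany inside the per-partition writer instead of a plain INSERT, so each batch inserts new rows and updates existing ones in one statement.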

On Thu, Dec 14, 2017 at 11:26 AM, salemi <[hidden email]> wrote:




Re: bulk upsert data batch from Kafka dstream into Postgres db

salemi
Thank you for your response. In the update case we sometimes need to just
update a record, and in other cases we need to update the existing record and
also insert a new record. The statement you proposed doesn't handle that.


