Postgres Read JDBC with COPY TO STDOUT

Postgres Read JDBC with COPY TO STDOUT

Nicolas Paris-2
Hi

The Spark Postgres JDBC reader is limited: it relies on plain
SELECT statements with a fetchsize, and it crashes on large tables even
when multiple partitions are set up with lower/upper bounds.

I am about to write a new Postgres JDBC reader based on "COPY TO STDOUT".
It would stream the data and produce CSV on the filesystem (HDFS or
local). The CSV would then be parsed with the Spark CSV reader to
produce a dataframe. It would issue a separate "COPY TO STDOUT" for
each executor.
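
For reference, the core of the idea looks roughly like this with the
stock Postgres JDBC driver's CopyManager (connection string, table name
and output path below are placeholders, not a finished design):

    import java.io.FileOutputStream
    import java.sql.DriverManager
    import org.postgresql.PGConnection

    object CopyOutExample {
      def main(args: Array[String]): Unit = {
        // placeholder connection settings
        val conn = DriverManager.getConnection(
          "jdbc:postgresql://localhost:5432/mydb", "user", "secret")
        val out = new FileOutputStream("/tmp/mytable.csv")
        try {
          // COPY ... TO STDOUT streams rows from the server side: no
          // fetchsize, no cursor paging, so large tables stream through
          val rows = conn.unwrap(classOf[PGConnection]).getCopyAPI.copyOut(
            "COPY (SELECT * FROM mytable) TO STDOUT WITH (FORMAT CSV, HEADER)",
            out)
          println(s"copied $rows rows")
        } finally {
          out.close()
          conn.close()
        }
      }
    }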

Right now I am able to loop over the output stream and write the
strings somewhere.
I am wondering what the best way is to process the resulting string
stream, in particular how to direct it to an HDFS folder, or maybe
parse it on the fly into a dataframe.
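
One option that seems workable is to hand copyOut an HDFS output stream
directly, since Hadoop's FSDataOutputStream is a plain
java.io.OutputStream. A minimal sketch, assuming a reachable namenode
(the URI and paths are made up):

    import java.net.URI
    import java.sql.DriverManager
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.postgresql.PGConnection

    object CopyToHdfs {
      def main(args: Array[String]): Unit = {
        val conn = DriverManager.getConnection(
          "jdbc:postgresql://localhost:5432/mydb", "user", "secret")
        // FSDataOutputStream is a java.io.OutputStream, so copyOut can
        // stream straight into HDFS without an intermediate local file
        val fs = FileSystem.get(new URI("hdfs://namenode:8020"),
          new Configuration())
        val out = fs.create(new Path("/tmp/mytable/part-00000.csv"))
        try {
          conn.unwrap(classOf[PGConnection]).getCopyAPI.copyOut(
            "COPY (SELECT * FROM mytable) TO STDOUT WITH (FORMAT CSV)", out)
        } finally {
          out.close()
          conn.close()
        }
        // afterwards: spark.read.csv("hdfs://namenode:8020/tmp/mytable")
      }
    }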

Thanks,

--
nicolas


Re: Postgres Read JDBC with COPY TO STDOUT

Nicolas Paris-2
The resulting library is on github: https://github.com/EDS-APHP/spark-postgres
While there is room for improvement, it is able to both read and write
postgres data with the COPY statement, allowing reading/writing
**very large** tables without problems.
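
For the curious, the per-executor trick can be sketched as below. This
is not the library's actual API; the table, the key column used for
hash slicing, and the connection details are all made up:

    import java.io.ByteArrayOutputStream
    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession
    import org.postgresql.PGConnection

    object ParallelCopySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("copy-sketch").getOrCreate()
        import spark.implicits._

        val numPartitions = 8
        // one COPY per Spark task, each fetching a hash slice of the table;
        // this buffers one slice in memory, so for huge slices streaming
        // each COPY to HDFS (as in the first message) is preferable
        val csvLines = spark.sparkContext
          .parallelize(0 until numPartitions, numPartitions)
          .flatMap { p =>
            val conn = DriverManager.getConnection(
              "jdbc:postgresql://pghost:5432/mydb", "user", "secret")
            try {
              val buf = new ByteArrayOutputStream()
              conn.unwrap(classOf[PGConnection]).getCopyAPI.copyOut(
                s"COPY (SELECT * FROM mytable WHERE mod(id, $numPartitions) = $p) " +
                  "TO STDOUT WITH (FORMAT CSV)", buf)
              buf.toString("UTF-8").split("\n").filter(_.nonEmpty)
            } finally conn.close()
          }

        // parse the in-memory CSV lines with the Spark CSV reader
        val df = spark.read.csv(csvLines.toDS())
        df.show()
      }
    }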



--
nicolas
