Stream writing parquet files

Christopher Piggott
I am trying to write some Parquet files and I'm running out of memory.  I'm giving each of my workers 16 GB, and the data is 102 columns * 65,536 rows - not really all that much.  The content of each field is a short string.

I am trying to create the file by dynamically building a StructType of StructField objects.  I then tried various ways of building an Array or List with a Row object for each of the 65,536 rows; the last attempt was to preallocate an ArrayBuffer of the correct length.
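Roughly, the construction I mean looks like this (column names and string contents are made up, and spark is the usual SparkSession):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import scala.collection.mutable.ArrayBuffer

// 102 string columns, built dynamically.
val schema = StructType((0 until 102).map(i => StructField(s"col$i", StringType)))

// Materialize every Row up front -- this is where the memory goes.
val rows = new ArrayBuffer[Row](65536)
for (r <- 0 until 65536)
  rows += Row.fromSeq((0 until 102).map(c => s"r${r}c$c"))

spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  .write.parquet("hdfs:///path/to/output")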

In all cases, I run out of memory.

It occurs to me that what I really need is a way to generate the rows and stream them directly to a Parquet file on HDFS.  I have 70,000+ of these inputs, so for starters I'm OK with creating 70,000 Parquet files, as long as there's some way I can merge them later.
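(For the merge step, I assume a plain read-and-rewrite in Spark would do; a minimal sketch, with the paths and partition count made up:)

// Compact many small Parquet files into fewer, larger ones.
val merged = spark.read.parquet("hdfs:///data/small/*")
merged.coalesce(64).write.parquet("hdfs:///data/merged")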

Is there an approach for generating Parquet files from Spark (ultimately to HDFS) that lets me write each row out one at a time, in a streaming fashion?
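One thing I've considered is generating the rows lazily on the executors instead of collecting them first; a rough sketch with synthetic placeholder contents (spark again being the usual SparkSession):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val nCols = 102
val schema = StructType((0 until nCols).map(i => StructField(s"col$i", StringType)))

// Rows are produced lazily inside each partition, so no single JVM
// ever holds all 65,536 Row objects at once.
val rdd = spark.sparkContext
  .parallelize(0 until 65536, numSlices = 8)
  .map(r => Row.fromSeq((0 until nCols).map(c => s"r${r}c$c")))

spark.createDataFrame(rdd, schema)
  .write.mode("overwrite")
  .parquet("hdfs:///path/to/output")

That keeps memory flat, but each task still writes its own part file, so the merge question above still applies.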

BTW, I'm using Spark 2.2.1 and whatever Parquet library is bundled with it.

--Chris

Re: Stream writing parquet files

Christopher Piggott
As a follow-up question, what happened to org.apache.spark.sql.parquet.RowWriteSupport?  It seems like it would help me here.
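In the meantime, the closest thing I've found to row-at-a-time writing is the parquet-avro writer (which may need to be pulled in as a separate dependency); a rough sketch, with the schema, path, and contents as placeholders:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// A two-column Avro schema standing in for the real 102-column one.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Line","fields":[
    |  {"name":"c0","type":"string"},
    |  {"name":"c1","type":"string"}
    |]}""".stripMargin)

val writer = AvroParquetWriter.builder[GenericRecord](new Path("hdfs:///tmp/out.parquet"))
  .withSchema(schema)
  .withCompressionCodec(CompressionCodecName.SNAPPY)
  .build()

try {
  for (i <- 0 until 65536) {
    val rec = new GenericData.Record(schema)
    rec.put("c0", s"a$i")
    rec.put("c1", s"b$i")
    writer.write(rec) // buffered per row group, then flushed to HDFS
  }
} finally {
  writer.close()
}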
