we control spark file names before we write them - should we opensource it?

we control spark file names before we write them - should we opensource it?

ilaimalka
Hi, as part of our work we needed more control over the names of the files
written out by Spark. For example, instead of "part-...csv.gz" we want to get
something like "15988891_1748330679_20200507124153.tsv.gz", where the
first number is hardcoded, the second is the value from partitionBy, and the
third is a timestamp formatted with a provided SimpleDateFormat pattern.
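
The naming scheme described above can be sketched in a few lines of Java.
This is only an illustration and not the poster's actual code: the class name
CustomFileName, the method build, the UTC time zone, and the pattern
"yyyyMMddHHmmss" (inferred from the example name) are all assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper: builds "<prefix>_<partitionValue>_<timestamp><extension>".
final class CustomFileName {
    static String build(String prefix, String partitionValue, Date writeTime,
                        String pattern, String extension) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        // Assumption: timestamps are rendered in UTC so names do not depend on the host.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return prefix + "_" + partitionValue + "_" + format.format(writeTime) + extension;
    }
}
```

Calling build("15988891", "1748330679", writeTime, "yyyyMMddHHmmss", ".tsv.gz")
with a writeTime of 2020-05-07 12:41:53 UTC yields the example name from the post.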

After researching the options at length, we found that the most common
approach is to find those files and rename them *after* the Spark job has
finished. We tried to find a more efficient way.

We decided to implement a new DataSource, which is a wrapper around most of
the standard Spark file formats (csv, json, text, parquet, avro) and allows
us to set the file name before the file is written.

In short, this is how it works:
- Datasource extends FileFormat and implements prepareWrite, which redirects
  to our local FileNameOutputWriterFactory.
- TypeFactory redirects to the original Spark formats.
- FileNameOutputWriterFactory does the actual work and can, via reflection,
  call any implementation to control the file name.
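
The delegation idea can be sketched without Spark on the classpath. The
following is a simplified model, not the poster's implementation and not
Spark's API: FileNameStrategy, PrefixStrategy, and RenamingWriterFactory are
hypothetical names, and the wrapped format's writer factory is reduced to a
function that receives the final path.

```java
import java.util.function.Function;

// Hypothetical plug-in point: maps the proposed output path to the path actually used.
interface FileNameStrategy {
    String rename(String proposedPath);
}

// Hypothetical example strategy: prefixes the file name with a fixed id.
final class PrefixStrategy implements FileNameStrategy {
    public String rename(String proposedPath) {
        int slash = proposedPath.lastIndexOf('/');
        return proposedPath.substring(0, slash + 1) + "15988891_"
                + proposedPath.substring(slash + 1);
    }
}

// Stand-in for a wrapping writer factory: renames the path, then delegates.
final class RenamingWriterFactory {
    private final FileNameStrategy strategy;
    // Stand-in for the wrapped format's writer factory; it receives the final path.
    private final Function<String, String> delegate;

    RenamingWriterFactory(String strategyClassName, Function<String, String> delegate) {
        try {
            // Load the naming strategy by class name, i.e. via reflection.
            this.strategy = (FileNameStrategy) Class.forName(strategyClassName)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load strategy " + strategyClassName, e);
        }
        this.delegate = delegate;
    }

    // Called with the path the engine proposes; the delegate only sees the renamed path.
    String newWriter(String proposedPath) {
        return delegate.apply(strategy.rename(proposedPath));
    }
}
```

Because the strategy is looked up by class name, a job could choose a
different naming implementation through configuration without changing the
wrapper itself.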

The question is - is this interesting/useful enough for the community?
Should we open-source it?
Thanks!

P.S. We posted the same question on the Spark channel of the ASF Slack, if
you want to discuss it there:
https://the-asf.slack.com/archives/CD5UQDNBA/p1589117451069600



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


RE: we control spark file names before we write them - should we opensource it?

Stefan Panayotov
Yes, I think so.

Stefan Panayotov, PhD
[hidden email]



Re: we control spark file names before we write them - should we opensource it?

Panos Bletsos
In reply to this post by ilaimalka
May I ask how you handle multiple partitions? Can't two files end up with
the same name with this approach, or am I missing something?

BR,
Panos




Re: we control spark file names before we write them - should we opensource it?

ilaimalka
Hey Panos,
Our solution lets us analyze the full path and modify the file name, so for
multiple partitions we can extract the partition values and inject them into
the file name.

For example, the following file:
s3://some_bucket/some_folder/partition1=value1/partition2=value2/part-123.c001.csv
would be stored as:
s3://some_bucket/some_folder/value1-value2-part-123.c001.csv
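
A minimal sketch of this path rewrite, assuming Hive-style "key=value"
directory segments and ignoring any escaping of partition values. The class
PartitionPathFlattener and its method flatten are hypothetical names, not
part of the poster's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: moves partition values from the directories into the file name.
final class PartitionPathFlattener {
    static String flatten(String path) {
        String[] segments = path.split("/");
        List<String> kept = new ArrayList<>();      // directory segments to keep
        List<String> nameParts = new ArrayList<>(); // partition values, then the file name
        for (int i = 0; i < segments.length - 1; i++) {
            int eq = segments[i].indexOf('=');
            if (eq > 0) {
                // "partition1=value1" contributes "value1" to the file name.
                nameParts.add(segments[i].substring(eq + 1));
            } else {
                kept.add(segments[i]);
            }
        }
        nameParts.add(segments[segments.length - 1]);
        String name = String.join("-", nameParts);
        return kept.isEmpty() ? name : String.join("/", kept) + "/" + name;
    }
}
```

Applied to the partitioned path above, flatten returns the flattened path
shown in the example.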

Since we can modify the file name however we want, two files can in theory
get the same name. Depending on the configuration, this will either throw an
exception or overwrite one of the files.
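
The two outcomes can be illustrated with a small in-memory stand-in for the
target file system. This is only a model of the behaviour described, under
the assumption that the policy is a single switch; ConflictAwareStore and
OnConflict are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the output location.
final class ConflictAwareStore {
    enum OnConflict { FAIL, OVERWRITE }

    private final Map<String, String> files = new HashMap<>();
    private final OnConflict mode;

    ConflictAwareStore(OnConflict mode) {
        this.mode = mode;
    }

    // Writes content under path; a name collision either fails or replaces the old file.
    void write(String path, String content) {
        if (mode == OnConflict.FAIL && files.containsKey(path)) {
            throw new IllegalStateException("File already exists: " + path);
        }
        files.put(path, content);
    }

    // Returns the stored content, or null if nothing was written under path.
    String read(String path) {
        return files.get(path);
    }
}
```

With OnConflict.FAIL the second write to the same name throws and the first
file is left untouched; with OnConflict.OVERWRITE the second write silently
replaces the first.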




RE: we control spark file names before we write them - should we opensource it?

ilaimalka
In reply to this post by Stefan Panayotov
Hey Stefan,
Thank you for your reply.

May I ask for a use case or an example of how you would use this ability?
I want to make sure our solution would work for you.


