Does the Spark DataFrame API write/create a single file instead of a directory as the result of a write operation?

Does the Spark DataFrame API write/create a single file instead of a directory as the result of a write operation?

Kshitij

Hi,

There is no Spark DataFrame API that writes/creates a single file instead of a directory as the result of a write operation.

Both of the options below create a directory containing a part file with a random name:

df.coalesce(1).write.csv(<path>)

df.write.csv(<path>)
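
For illustration, on a vanilla Spark setup the result typically looks something like this (the part-file name is random and varies per run; _committed/_started files additionally appear with some commit protocols, e.g. on Databricks):

<path>/
    _SUCCESS
    part-00000-<random-uuid>-c000.csv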

Instead of a directory with the standard files (_SUCCESS, _committed, _started), I want a single file with a specified file name.


Thanks


Re: Does the Spark DataFrame API write/create a single file instead of a directory as the result of a write operation?

Kshitij
Is there any way to save it as a raw CSV file, as we do in pandas? I have a script that uses the CSV file for further processing.

On Sat, 22 Feb 2020 at 14:31, rahul c <[hidden email]> wrote:
Hi Kshitij,

There are options to suppress the creation of the metadata files.
Set the properties below and try; a sketch of setting them follows the list.

1) To disable Spark's transaction logs, set "spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol". This helps disable the "committed<TID>" and "started<TID>" files, but the _SUCCESS, _common_metadata, and _metadata files will still be generated.

2) We can disable the _common_metadata and _metadata files using "parquet.enable.summary-metadata=false".

3) We can also disable the _SUCCESS file using "mapreduce.fileoutputcommitter.marksuccessfuljobs=false".
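
For reference, a minimal Scala sketch of setting all three at session creation (untested; the spark.hadoop. prefix is Spark's standard way of forwarding properties to the Hadoop configuration, and the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example") // placeholder name
  // 1) swap in the non-transactional commit protocol
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
  // 2) suppress the Parquet summary metadata files
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  // 3) suppress the _SUCCESS marker
  .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
  .getOrCreate()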



Re: Does the Spark DataFrame API write/create a single file instead of a directory as the result of a write operation?

sebastian.piu
I'm not aware of a way to specify the file name on the writer.
Since you'd need to bring all the data onto a single node and write from there to get a single file out, you could simply move/rename the file that Spark creates, or write the CSV yourself with your library of preference.
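
A sketch of the move/rename approach in Scala, using the Hadoop FileSystem API (the paths here are hypothetical placeholders; on HDFS the rename is a cheap metadata operation):

import org.apache.hadoop.fs.{FileSystem, Path}

// write a single part file into a temporary directory
df.coalesce(1).write.csv("/tmp/out_dir")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// locate the part file Spark produced (random name, hence the glob)
val partFile = fs.globStatus(new Path("/tmp/out_dir/part-*.csv"))(0).getPath
// give it the name you actually want
fs.rename(partFile, new Path("/tmp/report.csv"))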



Re: Does the Spark DataFrame API write/create a single file instead of a directory as the result of a write operation?

Kshitij
That's the alternative, of course. But that is costly when we are dealing with a bunch of files.

Thanks.



Re: Does the Spark DataFrame API write/create a single file instead of a directory as the result of a write operation?

yohann jardin

How costly is it for you to move files after generating them with Spark?
File systems tend to just update some links under the hood.

Yohann Jardin



Re: Does the Spark DataFrame API write/create a single file instead of a directory as the result of a write operation?

Nicolas Paris-2

> Is there any way to save it as a raw CSV file, as we do in pandas? I have a script that uses the CSV file for further processing.

I wrote such a function in Scala. Please take a look at
https://github.com/EDS-APHP/spark-etl/blob/master/spark-csv/src/main/scala/CSVTool.scala
(see writeCsvToLocal).

It first writes the CSV to HDFS, and then fetches every CSV part into one
local CSV with headers.
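
For readers who don't want to open the repo, a rough sketch of that idea (this is not the actual CSVTool code; the function name is illustrative, and header de-duplication is left out):

import java.io.FileOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

def mergeToLocal(hdfsDir: String, localFile: String, conf: Configuration): Unit = {
  val fs = FileSystem.get(conf)
  val out = new FileOutputStream(localFile)
  try {
    // concatenate the part files, in name order, into one local file
    for (status <- fs.globStatus(new Path(hdfsDir + "/part-*")).sortBy(_.getPath.getName)) {
      val in = fs.open(status.getPath)
      try IOUtils.copyBytes(in, out, conf, false) // 'false' keeps 'out' open for the next part
      finally in.close()
    }
  } finally out.close()
}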




--
nicolas paris
