How to make Spark merge the output files?

How to make Spark merge the output files?

Nan Zhu
Hi, all

Maybe a stupid question, but is there any way to make Spark write a single file instead of partitioned files?

Best,

-- 
Nan Zhu


Re: How to make Spark merge the output files?

Matei Zaharia
Unfortunately this is expensive to do on HDFS: you'd need a single writer to write the whole file. If your file is small enough for that, you can use coalesce() on the RDD to bring all the data to one node, and then save it. However, most HDFS applications work with directories containing multiple files instead of single files for this reason.
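As a minimal sketch of that approach (assuming the Scala API; the input and output paths are hypothetical, and the result is still a directory, just with a single part file inside):

import org.apache.spark.SparkContext

// Minimal sketch, assuming a local master and hypothetical HDFS paths.
val sc = new SparkContext("local", "merge-example")
val rdd = sc.textFile("hdfs:///input")

// coalesce(1) funnels all the data into a single partition, so a single
// writer produces one part-00000 file. Only safe if the data fits
// comfortably on one node.
rdd.coalesce(1).saveAsTextFile("hdfs:///output-single")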

Matei



Re: How to make Spark merge the output files?

Aaron Davidson
HDFS has had a concat() method since 0.21 that would do exactly this, but I am not sure of the performance implications. Of course, as Matei pointed out, it's unusual to actually need a single HDFS file.
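A rough sketch of driving it from Scala (assuming the Hadoop FileSystem API; the paths are hypothetical, concat() is only implemented for HDFS, and it comes with restrictions on the source and target files, e.g. the target must already exist):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Rough sketch: merge the part files of a Spark output directory into
// an existing target file using HDFS concat (available since 0.21).
val fs = FileSystem.get(new Configuration())
val parts = fs.globStatus(new Path("/output/part-*")).map(_.getPath)
fs.concat(new Path("/output/merged"), parts)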



Re: How to make Spark merge the output files?

Nan Zhu
Hi, all 

Thanks for the replies

I actually need to provide a single file to an external system for processing… it seems that I will have to make the consumer of the file support multiple inputs

Best,

-- 
Nan Zhu


Re: How to make Spark merge the output files?

Debasish Das
Hi Nan,

A cleaner approach is to expose a RESTful service to the external system.

The external system then calls the service through an appropriate API.

For Scala, Spray can be used to build these services. Twitter's open-source projects also offer many examples of this service design.
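A minimal Spray sketch (assuming spray-routing 1.x on Akka; the route name and payload are hypothetical):

import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

// Minimal sketch: serve the job's results over HTTP instead of handing
// the external system a single file.
object ResultService extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("result-service")

  startServer(interface = "localhost", port = 8080) {
    path("results") {
      get {
        complete("...serialized results from the Spark job...")
      }
    }
  }
}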

Thanks.
Deb


