Processing a splittable file from a single executor

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Processing a splittable file from a single executor

Jeroen Miller
Dear Sparkers,

A while back, I asked how to process non-splittable files in parallel, one file per executor. Vadim's suggested "scheduling within an application" approach worked out beautifully.

I am now facing the 'opposite' problem:

 - I have a bunch of parquet files to process
 - Once processed I need to output a /single/ file for each input file
 - When I read a parquet file, it gets partitioned over several executors
 - If I want a single output file, I would need to coalesce(1) with potential
   performance issues.

Since my files are relatively small, a single file could be handled by a single executor, and several files could be read in parallel, one for each executor.

My question is: how to force my parquet file to be read by a single executor, without repartitioning or coalescing of course.

Regards,

Jeroen


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Processing a splittable file from a single executor

Jeroen Miller
On 16 Nov 2017, at 10:22, Michael Shtelma <[hidden email]> wrote:
> you call repartition(1) before starting processing your files. This
> will ensure that you end up with just one partition.

One question and one remark:

Q) val ds = sqlContext.read.parquet(path).repartition(1)

Am I absolutely sure that my file here is read by a single executor and that no data shuffling takes place afterwards to get that single partition?

R) This approach did not work for me.

    val ds = sqlContext.read.parquet(path).repartition(1)
   
    // ds on a single partition

    ds.createOrReplaceTempView("ds")

    val result = sqlContext.sql("... from ds")

    // result on 166 partitions... How to force the processing on a
    // single executor?

    result.write.csv(...)

    // 166 files :-/

Jeroen


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]