How does Dataframe determines the number and size of the output files?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

How does Dataframe determines the number and size of the output files?

kchew
This post has NOT been accepted by the mailing list yet.
Here is the pseudocodes,

val rawDF = "select * from raw data Hive table"
val newDF = do-something-to-rawDF
newDF.coalesce(80).write.partitionBy("dt").mode(SaveMode.Overwrite).insertInto(Another Hive table)

I organized my inputs in three combinations, each combination for one partition,

      Number of files                     Size of each file
      ------------------                   ------------------
         3600                                    22 MB
         1800                                    44 MB
           900                                    88 MB

Then I ran my test against each combination (Partition), it stores the output to a single partition, too.
If I set the coalesce to 80, the outputs for all combinations consists of 80 files, where each file is 1 GB.
But if I set the coalesce 160, here are the results

      Number of files                     Size of each file         Number of output files      Size of each output file
      ------------------                   ------------------         -------------------------      -------------------------
         3600                                    22 MB                          3600                               23 MB
         1800                                    44 MB                          1800                               47 MB
           900                                    88 MB                            900                               93 MB

So how does the Dataframe determines the number and size of the output files? Is it possible to have the application to output to the desired number of files of desired size? I have googled it a lot and couldn't find too much information about it. Any pointers to documentation or codes will be much appreciated.

TIA.

Kim
Loading...