We have many Spark jobs that each produce lots of small files. To improve read performance for our analysts, I'm trying to work out the optimal Parquet file size.
From what I've read, the optimal file size is around 1GB, and not less than 128MB, depending on the size of the data.
I took one process to experiment with. It uses shuffle partitions = 600, which produces files of about 11MB each. I added a repartition step to produce fewer files: ~12 files of ~600MB each (note: "600gb" in my earlier draft was a typo). After testing it (select * from table where ...), I found that the old version (with more, smaller files) ran faster than the new one. I then increased the number of files to 40 (~130MB each), and it still runs slower.
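For reference, here is the sizing arithmetic I'm using to pick a file count before calling repartition. This is a plain-Python sketch (the target_num_files helper and the size constants are my own, derived from the numbers above: 600 partitions × ~11MB ≈ 6.6GB total), not an official Spark recipe:

```python
def target_num_files(total_bytes: int, target_file_bytes: int) -> int:
    """Ceiling division: how many output files of roughly
    target_file_bytes are needed to hold total_bytes."""
    return max(1, -(-total_bytes // target_file_bytes))

MB = 1024 ** 2
GB = 1024 ** 3

# Estimated total output of this job: 600 shuffle partitions x ~11MB each
total_bytes = 600 * 11 * MB  # ~6.6GB

print(target_num_files(total_bytes, 128 * MB))  # files at the 128MB floor -> 52
print(target_num_files(total_bytes, 1 * GB))    # files at the 1GB ceiling -> 7
```

In the actual job I then pass that count to Spark, e.g. df.repartition(n).write.parquet(path) (repartition triggers a full shuffle; coalesce(n) avoids the shuffle but can produce skewed file sizes).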
I'd appreciate hearing about your experience with file sizes, and how to optimize the number and size of output files.