Re: Parquet read performance for different schemas

Re: Parquet read performance for different schemas

Tomas Bartalos
I forgot to mention an important part: I'm issuing the same query against both parquets, selecting only one column:

df.select(sum('amount))
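
(A minimal self-contained sketch of that query; the SparkSession setup and the dataset paths are assumptions added for illustration, not taken from the actual job:)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("parquet-read-test").getOrCreate()
import spark.implicits._  // enables the 'amount symbol-to-column syntax

// hypothetical paths - substitute the actual wide/narrow datasets
val wide = spark.read.parquet("/data/parquet-wide")
val narrow = spark.read.parquet("/data/parquet-narrow")

// identical aggregation against both datasets; only the amount column is referenced
wide.select(sum('amount)).show()
narrow.select(sum('amount)).show()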

BR,
Tomas

On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos <[hidden email]> wrote:
Hello,

I have 2 Parquet datasets (each containing 1 file):
  • parquet-wide - schema has 25 top-level cols + 1 array
  • parquet-narrow - schema has 3 top-level cols
Both files have the same data for the shared columns.
When I read from parquet-wide, Spark reports 52.6 KB read; from parquet-narrow, only 2.6 KB.
For a bigger dataset the difference is 413 MB vs 961 MB. Needless to say, reading the narrow parquet is much faster.

Since schema pruning is applied, I expected similar results for both scenarios (timing and amount of data read).
What do you think is the reason for such a big difference? Is there any tuning I can do?
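
(One way to double-check whether column pruning is actually applied - a sketch assuming an existing SparkSession named spark and a hypothetical path; the ReadSchema of the FileScan parquet node in the physical plan shows which columns the scan requests:)

import org.apache.spark.sql.functions.sum
import spark.implicits._

val wide = spark.read.parquet("/data/parquet-wide")  // hypothetical path
// ReadSchema should list only the amount column if column pruning kicked in
wide.select(sum('amount)).explain()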

Thank you,
Tomas
Re: Parquet read performance for different schemas

Julien Laurenceau
Hi Tomas,

Parquet tuning time!
I strongly recommend reading the papers by CERN on Spark/Parquet tuning.

You should check the size of the row groups in your Parquet files and maybe tweak it a little.
If I remember correctly, when Parquet detects too much cardinality in a row group, predicate push-down will not be enabled and you'll be forced to read the full row group even if you only need a single row.
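
If the row groups do turn out to be the problem, one knob is the Parquet block size (the target row group size) used at write time. A sketch, assuming an existing SparkSession named spark, that the files are rewritten from Spark, and an illustrative 128 MB value:

// parquet.block.size is the target row group size, in bytes, used by the Parquet writer
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

// rewrite the dataset so the new row group size takes effect (hypothetical paths)
spark.read.parquet("/data/parquet-wide")
  .write.mode("overwrite").parquet("/data/parquet-wide-tuned")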

Check the schema and row group layout of your Parquet files with parquet-tools, e.g. the schema and meta commands (you don't need Spark for this), and do some tuning on the Spark write side.
hadoop jar /.../parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet

You may also have a look at your hadoopConfiguration, in particular:
spark.sql.parquet.mergeSchema
parquet.enable.summary-metadata
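
For reference, a sketch of how those two settings can be changed from a Spark session (assuming an existing SparkSession named spark; the values shown are just illustrative, not a recommendation):

// Spark SQL setting: whether to merge schemas from all Parquet part-files when reading
spark.conf.set("spark.sql.parquet.mergeSchema", "false")

// Hadoop/Parquet setting: whether to write _metadata / _common_metadata summary files
spark.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")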

Regards,
Julien 
 
