Best practices for data lake file storage

Patrick McCarthy-2
Hi List,

I'm looking for resources to learn about how to store data on disk for later access.

For a while my team has been using Spark on top of our existing HDFS/Hive cluster without much say in which format is used to store the data. I'd like to learn how to re-stage my data to speed up my own analyses, and to start building the expertise to define new data stores.

One example of a problem I'm facing is data that is written to Hive using a customized protobuf SerDe. The data contains many very complex types (arrays of structs of arrays of... ) and I often need only a few elements of any particular record, yet the format requires Spark to deserialize the entire object.

The sorts of information I'm looking for:
  • Dos and don'ts of laying out a Parquet schema
  • Measuring / debugging read speed
  • How to bucket, index, etc.