Avro file question

Avro file question

Sam-2
Hi,

How do we choose between a single large Avro file (much larger than the HDFS block size) and multiple smaller Avro files (each close to the HDFS block size)?

Since Avro is splittable, is there even a need to split a very large Avro file into smaller files?

I’m assuming that a single large Avro file can also be split across multiple mappers/reducers/executors during processing.

Thanks. 

Re: Avro file question

Yaniv Harpaz
It depends on your usage (when and how you read the data).
Are the smaller files you have in mind still larger than the HDFS block size?
I would not go for anything smaller than a block.

Usually (if relevant to the way you read the data), the partitioning helps determine that.
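
To make the block-size point concrete, here is a rough back-of-the-envelope sketch, not how any particular engine actually plans its work. It assumes a 128 MiB HDFS block size, that the split size equals the block size, and made-up file sizes (one 10 GiB file versus 640 files of 16 MiB each). The helper name estimate_splits is hypothetical.

```python
import math

# Assumption: 128 MiB HDFS block size, and split size == block size.
BLOCK_SIZE = 128 * 1024 * 1024


def estimate_splits(file_size_bytes: int, split_size_bytes: int = BLOCK_SIZE) -> int:
    """Rough number of input splits for one splittable file.

    Simplified model: one split per full or partial block-sized chunk.
    """
    if file_size_bytes <= 0:
        return 0
    return math.ceil(file_size_bytes / split_size_bytes)


# Hypothetical layout A: a single 10 GiB Avro file.
one_large_file = estimate_splits(10 * 1024**3)

# Hypothetical layout B: the same 10 GiB stored as 640 files of 16 MiB each.
many_small_files = sum(estimate_splits(16 * 1024**2) for _ in range(640))

print("single large file:", one_large_file, "splits")
print("640 small files:  ", many_small_files, "splits")
```

Under these assumptions the single large file still fans out across many tasks, while files smaller than a block each produce their own small split and add one more file for the NameNode to track. Real engines may deviate from this model, for example by merging a small trailing remainder into the last split or by packing several small files into one task.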

Yaniv Harpaz
[ yaniv.harpaz at gmail.com ]



Re: Avro file question

ayan guha
Assuming you always read the data together, one large file is good; that is the basic HDFS use case.

--
Best Regards,
Ayan Guha