Newbie: how to partition data on a filesystem for Spark? What are best practices?
I am working on a deep learning project. Currently we do everything on a single machine, and I am trying to figure out how we might move to a clustered Spark environment.
Clearly it's possible that a machine or a job on the cluster might fail, so I assume the data needs to be replicated to some degree.
Eventually, if our pilot is successful, I expect I will need to process multi-petabyte files and will need to come up with some sort of sharding scheme. Communication costs could be a problem. Does Spark have any knowledge of how the data is distributed and replicated across the machines in my cluster?
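To give a sense of the scale I am worried about, here is a rough back-of-the-envelope calculation. The 128 MB partition size is just an assumption on my part (I have read it is a common default split size for splittable input formats, but the real value depends on the format and Hadoop configuration):

```python
# Rough estimate of how many partitions (and therefore tasks) a
# multi-petabyte dataset implies, assuming ~128 MiB per partition.
# 128 MiB is an assumption, not a guaranteed Spark default.

PETABYTE = 1024 ** 5               # bytes in one pebibyte
PARTITION_SIZE = 128 * 1024 ** 2   # assumed 128 MiB per partition

def estimated_partitions(total_bytes: int,
                         partition_bytes: int = PARTITION_SIZE) -> int:
    """Ceiling division: partitions needed to cover total_bytes."""
    return -(-total_bytes // partition_bytes)

# A 2 PB dataset would split into roughly 16.8 million tasks:
print(estimated_partitions(2 * PETABYTE))  # 16777216
```

At that task count, per-partition overheads and shuffle communication costs clearly start to matter, which is why I am asking about partitioning up front.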
Let's say my data source is S3. Should I copy the data to my EC2 cluster, or try to read directly from S3?
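For reference, this is roughly how I imagine launching the job if we read straight from S3 via the s3a connector. The hadoop-aws version, script name, and bucket path are placeholders/assumptions on my part:

```shell
# Sketch of submitting a Spark job that reads directly from S3 via s3a://.
# The package version and s3a path below are assumptions, not tested values.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  my_job.py s3a://my-bucket/training-data/
```

Is this the right general approach, or is pre-copying onto cluster-local storage (or HDFS) preferred at this scale?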
What are best practices?
P.S. We expect to use AWS or some other cloud provider.