I'm running a Spark job on AWS EMR that reads many LZO files from an S3 bucket partitioned by date.
Sometimes I see errors in the logs similar to:
18/04/13 11:53:52 WARN TaskSetManager: Lost task 151177.0 in stage 43.0 (TID 1516123, ip-10-10-2-6.ec2.internal, executor 57): java.io.IOException: Corrupted uncompressed block
The jobs themselves don't fail, so I assume the task succeeds when it is retried.
If the input file were actually corrupted, wouldn't even the task retries fail, so that the job eventually fails based on the "spark.task.maxFailures" config?
Is there a way to make Spark or the Hadoop LZO library print the full file name when such failures happen, so that I can manually check whether the file is indeed corrupted?
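In the meantime, as a workaround I've been verifying candidate files by hand. This is just a sketch of that manual check (it assumes the suspect partition has been copied locally, e.g. with aws s3 cp --recursive, and that the lzop binary may or may not be on PATH): it does a cheap check of the 9-byte lzop magic header first, then a full integrity test with lzop -t when available.

```python
import subprocess

# lzop-format files begin with this fixed 9-byte magic header.
LZOP_MAGIC = b"\x89\x4c\x5a\x4f\x00\x0d\x0a\x1a\x0a"

def check_lzo_file(path):
    """Return a short status string for one candidate .lzo file."""
    with open(path, "rb") as f:
        header = f.read(len(LZOP_MAGIC))
    if header != LZOP_MAGIC:
        # Not even a valid lzop header -- clearly damaged or mis-named.
        return "bad header"
    try:
        # Full integrity test: lzop -t decompresses and verifies checksums.
        result = subprocess.run(["lzop", "-t", path], capture_output=True)
        return "ok" if result.returncode == 0 else "corrupted"
    except FileNotFoundError:
        # lzop not installed; header looked fine but contents are unverified.
        return "header ok (lzop not available for full test)"
```

Running this over every file in the partition and printing any path whose status isn't "ok" at least narrows down which files to re-upload, but it's obviously much clumsier than having the file name in the exception itself.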