_SUCCESS file validation on read


When writing a DataFrame, Spark creates a _SUCCESS file to mark that the entire DataFrame has been written. However, the existence of this _SUCCESS file does not seem to be validated by default on reads. In some cases this would allow a partially written DataFrame to be read back. Is this behavior configurable? Is the lack of validation intentional?


Here is an example from the Spark 2.1.0 shell. I would expect the read step to fail because I've manually removed the _SUCCESS file:

scala> spark.range(10).write.save("/tmp/test")

$ rm /tmp/test/_SUCCESS

scala> spark.read.parquet("/tmp/test").show()
+---+
| id|
+---+
|  8|
|  9|
|  3|
|  4|
|  5|
|  0|
|  6|
|  7|
|  2|
|  1|
+---+
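
For comparison, here is a minimal sketch of the check I expected the read to perform, done by hand before calling the reader. It does not answer whether Spark has a built-in option for this. readIfComplete is a hypothetical helper name, not a Spark API, and the sketch assumes the writer left the _SUCCESS marker enabled (the Hadoop setting mapreduce.fileoutputcommitter.marksuccessfuljobs, which defaults to true).

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not part of Spark): refuse to read a directory
// that does not contain a _SUCCESS marker.
def readIfComplete(spark: SparkSession, dir: String): DataFrame = {
  val marker = new Path(dir, "_SUCCESS")
  // Resolve the file system (local, HDFS, ...) from the path's scheme.
  val fs = marker.getFileSystem(spark.sparkContext.hadoopConfiguration)
  // Throws IllegalArgumentException when the marker is absent.
  require(fs.exists(marker), s"Missing $marker; the output may be incomplete")
  spark.read.parquet(dir)
}
```

With the marker removed as in the example above, readIfComplete(spark, "/tmp/test") should throw instead of returning rows. Note that this only checks for the marker at read time; it cannot tell whether the data files themselves changed after the marker was written.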