[Spark Structured Streaming]: truncated Parquet after driver crash or kill
We have a Spark application that performs a set of ETL steps: it reads messages from a Kafka topic, categorizes them, and writes the contents out as Parquet files on HDFS. We then query the data on HDFS through Presto's Hive integration. The problem is that the Parquet files are frequently truncated after the Spark driver is killed or crashes.
The meat of the (Scala) Spark job looks like this:
Is it expected that Parquet files could be truncated during crashes?
Sometimes the files are only 4 bytes long; sometimes they are longer but still too short to be valid Parquet files. Presto detects the short files and then refuses to query the table at all. I had hoped that writing the files would be transactional, so that incomplete files would never become visible.
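For anyone trying to reproduce this, here is a minimal sketch of how such short files might be detected before Presto sees them. It relies on the Parquet format's 4-byte magic "PAR1" at both the start and the end of every unencrypted file; a 4-byte file is consistent with only the leading magic having been written. The names ParquetTruncationCheck and looksComplete are made up for illustration, the 12-byte minimum is only a lower bound, and the code reads a local path (checking files on HDFS would need the Hadoop FileSystem API instead). It does not validate the footer itself, so it catches truncation but not every kind of corruption.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets
import java.nio.file.Path

// Hypothetical helper, not part of Spark or Parquet.
object ParquetTruncationCheck {
  // Unencrypted Parquet files begin and end with the 4-byte magic "PAR1".
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  // Lower bound only: leading magic + 4-byte footer length + trailing magic.
  private val MinLength: Long = 12L

  /** Returns true if the file is long enough and has the magic at both ends. */
  def looksComplete(path: Path): Boolean = {
    val raf = new RandomAccessFile(path.toFile, "r")
    try {
      val len = raf.length()
      if (len < MinLength) false
      else {
        val head = new Array[Byte](4)
        val tail = new Array[Byte](4)
        raf.seek(0L)
        raf.readFully(head)
        raf.seek(len - 4L)
        raf.readFully(tail)
        head.sameElements(Magic) && tail.sameElements(Magic)
      }
    } finally raf.close()
  }
}
```

A check like this could run over a directory after a restart to list the files that need to be removed or rewritten, but it is only a workaround and does not answer whether the write itself should be atomic.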
We can fix crashes as they come up, but we will always need to kill the job periodically to deploy new versions of the code. We want to run the application as a long-lived process that continually reads from the Kafka topic and writes out to HDFS for archival purposes.