Spark Structured Streaming | FileStreamSourceLog not deleting list of input files | Spark 2.4.0



Pappu Yadav
Hi Team,

While running Spark Structured Streaming, here are some findings (a config sketch follows the list):
  1. FileStreamSourceLog is responsible for maintaining the list of input source files.
  2. Spark Streaming deletes expired log files based on spark.sql.streaming.fileSource.log.deletion and spark.sql.streaming.minBatchesToRetain.
  3. But while compacting logs, Spark Streaming writes the complete list of files the stream has seen so far into one single .compact file in HDFS.
  4. Over time this compact file has grown to around 2GB-5GB in HDFS, which delays the creation of the compact file after every 10th batch and also increases job restart time.
  5. Why is Spark Streaming logging files that have already been deleted from the system? While creating the compact file, there should be some configurable timeout so that Spark can skip writing the expired list of input files.
Also, kindly let me know if I missed something and there is already a configuration to handle this.
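For reference, a minimal sketch of the settings involved (Scala, Spark 2.4.0; paths and values are placeholders, not our production setup):

// Minimal sketch. These confs control retention/cleanup of the file source
// metadata log, but none of them prunes already-processed (or already-deleted)
// entries out of the .compact file.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("file-stream-source-log-demo")
  // delete expired (non-compact) metadata log files
  .config("spark.sql.streaming.fileSource.log.deletion", "true")
  // how many batches of metadata to retain (default 100)
  .config("spark.sql.streaming.minBatchesToRetain", "100")
  // a .compact file is written every N batches (default 10)
  .config("spark.sql.streaming.fileSource.log.compactInterval", "10")
  .getOrCreate()

val df = spark.readStream
  .format("text")
  .load("hdfs:///data/input")   // placeholder input dir

df.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/output")             // placeholder
  .option("checkpointLocation", "hdfs:///data/ckpt") // placeholder
  .start()

With the default compactInterval of 10, every 10th batch rewrites the full accumulated file list, which matches the growth pattern above.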

Regards
Pappu Yadav

Re: Spark Structured Streaming | FileStreamSourceLog not deleting list of input files | Spark 2.4.0

Jungtaek Lim-2
You're hitting an existing issue: https://issues.apache.org/jira/browse/SPARK-17604. While there's no active PR to address it, I've been planning to take a look sooner rather than later.

Btw, you may also want to take a look at my previous mail - the topic of that thread was the file stream sink metadata growing too large, but it's basically the same issue, so you may find some useful information there. (tl;dr. I have a bunch of PRs addressing multiple issues in the file stream source and sink; they're just lacking some love.)
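In the meantime, if you want to see how large the log has grown, here is a quick sketch (Scala; the checkpoint path is a placeholder) that lists the source metadata log files, including the .compact files, by size:

// The file stream source keeps its metadata log under <checkpoint>/sources/<sourceId>/.
// Listing those files by size makes oversized .compact files easy to spot.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("hdfs:///data/ckpt/sources/0")  // placeholder checkpoint path
val fs = dir.getFileSystem(new Configuration())

fs.listStatus(dir)
  .sortBy(-_.getLen)   // largest first
  .foreach { st =>
    println(f"${st.getLen / 1024.0 / 1024.0}%10.2f MB  ${st.getPath.getName}")
  }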


Thanks,
Jungtaek Lim (HeartSaVioR)
