Structured Streaming using File Source - How to handle live files

Structured Streaming using File Source - How to handle live files

ArtemisDev
We were trying to use Structured Streaming with a file source, but had
problems getting the files read by Spark properly.  We have another
process generating data files in Spark's source directory on a
continuous basis.  What we observed is that the moment a data file is
created, before the producing process has finished writing it, Spark
reads it immediately without waiting for EOF, and it never revisits the
file afterwards, so we only ended up with empty data.  The only way we
could make it work is to produce the data files in a separate directory
(e.g. /tmp) and move them into Spark's file source directory after data
generation completes.
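
A rough sketch of that staging workaround (the paths and file name
below are just illustrative, not our actual setup):

import java.nio.file.{Files, Paths, StandardCopyOption}

// Illustrative paths: a staging dir on the same filesystem as the watched dir.
val stagingDir = Paths.get("/tmp/staging")
val sourceDir  = Paths.get("/data/spark-input")  // directory Spark's file source watches

// 1. Write the file completely in the staging directory.
val staged = stagingDir.resolve("batch-0001.csv")
Files.write(staged, "id,value\n1,foo\n2,bar\n".getBytes("UTF-8"))

// 2. Move it into the watched directory only after the write has finished,
//    so Spark never lists a half-written file.  ATOMIC_MOVE requires that
//    both directories live on the same filesystem.
Files.move(staged, sourceDir.resolve(staged.getFileName), StandardCopyOption.ATOMIC_MOVE)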

My questions:  Is this behavior by design, or is there a way to keep
the Spark streaming process from picking up a file while it is still
being written by another process?  In other words, do we have to stage
data files in a tmp directory and move them over, or can the
data-producing process and Spark share the same directory?

Thanks!

-- Nick



Re: Structured Streaming using File Source - How to handle live files

Jungtaek Lim-2
Hi Nick,

I guess that's by design - Spark assumes an input file will not be modified once it is placed in the input path. This makes it easy for Spark to track which files have been processed and which have not. If input files could be modified, Spark would have to enumerate all of the files and track how many lines/bytes it had read per file, and in the worst case it might read an incomplete line (if the writer doesn't guarantee line atomicity) and crash or produce incorrect results.
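
For reference, a minimal sketch of a file-source query (the schema, paths, and checkpoint location below are illustrative, not from your setup): each micro-batch, Spark lists the input path, reads any files it has not seen before, records them in the checkpoint, and never reads them again, which is why a file has to be complete by the time it first appears there.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().appName("file-source-demo").getOrCreate()

// File sources for CSV/JSON require a user-supplied schema.
val schema = new StructType()
  .add("id", IntegerType)
  .add("value", StringType)

// Each micro-batch, Spark lists the directory, picks up files it has not
// seen before, records them in the checkpoint, and does not re-read them -
// so a file must already be complete when it first shows up here.
val input = spark.readStream
  .schema(schema)
  .csv("/data/spark-input")  // illustrative watched directory

val query = input.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/file-source-demo")
  .start()

query.awaitTermination()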

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: Structured Streaming using File Source - How to handle live files

Gourav Sengupta
Hi,

Yeah, we generally read files from HDFS or object stores like S3, GCS, etc., where files cannot be updated in place.

Regards 
Gourav 
