can HDFS be a streaming source like Kafka in Spark 2.2.0?


can HDFS be a streaming source like Kafka in Spark 2.2.0?

kant kodali
Hi All,

I am wondering if HDFS can be a streaming source like Kafka in Spark 2.2.0. For example, can I have stream1 read from Kafka and write to HDFS, and stream2 read from HDFS and write back to Kafka, so that stream2 pulls the latest updates written by stream1?

Thanks!

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

maasg
Hi,

You can monitor a filesystem directory as a streaming source, as long as the files placed there are atomically copied or moved into the directory.
Updating files that are already in the directory is not supported.
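
To illustrate the "atomically moved" requirement, here is a minimal sketch of the write-then-rename pattern on a local filesystem. It is only an analogue: on HDFS the same idea would be a rename from a staging directory into the monitored one. The function name and the staging/watched directory layout are assumptions made for this illustration, not something from the thread.

```python
import os


def publish_atomically(data: bytes, staging_dir: str, watched_dir: str, name: str) -> str:
    """Write data in a staging directory, then rename it into the watched directory.

    Hypothetical helper: both directories are assumed to be on the same
    filesystem, which is what makes the final rename a single atomic step.
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(watched_dir, exist_ok=True)

    final_path = os.path.join(watched_dir, name)
    # Refuse to replace a file that is already published: the monitored
    # directory should only ever receive new files, never updates.
    if os.path.exists(final_path):
        raise FileExistsError(final_path)

    staged_path = os.path.join(staging_dir, name)
    with open(staged_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before publishing

    # A reader listing watched_dir sees either no file or the complete file.
    os.replace(staged_path, final_path)
    return final_path
```

A writer would call something like publish_atomically(b"line1\nline2\n", "/data/staging", "/data/incoming", "batch-0001.txt"), using a fresh file name for every batch (the paths here are placeholders).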

kr, Gerard.

On Mon, Jan 15, 2018 at 11:41 PM, kant kodali <[hidden email]> wrote:
Hi All,

I am wondering if HDFS can be a streaming source like Kafka in Spark 2.2.0. For example, can I have stream1 read from Kafka and write to HDFS, and stream2 read from HDFS and write back to Kafka, so that stream2 pulls the latest updates written by stream1?

Thanks!


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

kant kodali
Hi,

I am not sure I understand. Do you have any examples?

On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas <[hidden email]> wrote:
Hi,

You can monitor a filesystem directory as a streaming source, as long as the files placed there are atomically copied or moved into the directory.
Updating files that are already in the directory is not supported.

kr, Gerard.


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

Gourav Sengupta
What Gerard means is that adding new files under the same base path (key) is fine, but if you append lines to an existing file, the changes will not be picked up.
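
As an example of what "stream2" from the original question could look like, here is a rough PySpark sketch. It assumes Structured Streaming (the thread does not say which streaming API is meant) and that the Kafka connector package for Structured Streaming is on the classpath. The function name, paths, topic, and broker address are placeholders made up for this illustration.

```python
def start_hdfs_to_kafka(spark, input_dir, bootstrap_servers, topic, checkpoint_dir):
    """Start a streaming query that reads new text files and writes each line to Kafka.

    Hypothetical helper: `spark` is an existing SparkSession, `input_dir` is the
    monitored directory (for example an hdfs:// path that stream1 writes into).
    """
    # The text file source yields one row per line, in a column named "value".
    # Only files that newly appear in input_dir are processed.
    lines = spark.readStream.format("text").load(input_dir)

    # The Kafka sink expects a "value" column (string or binary).
    return (
        lines.selectExpr("CAST(value AS STRING) AS value")
        .writeStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", topic)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )


# Possible usage (placeholder values, needs a running Spark and Kafka setup):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("hdfs-to-kafka").getOrCreate()
#   query = start_hdfs_to_kafka(spark, "hdfs:///data/incoming", "broker:9092",
#                               "updates", "hdfs:///checkpoints/hdfs-to-kafka")
#   query.awaitTermination()
```

With this setup, each new file that lands in the directory is read once; lines appended later to an already processed file would not be sent.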

Regards,
Gourav Sengupta

On Tue, Jan 16, 2018 at 12:20 AM, kant kodali <[hidden email]> wrote:
Hi,

I am not sure I understand. Do you have any examples?


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

kant kodali
Got it! What about overwriting the same file instead of appending? 

On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta <[hidden email]> wrote:
What Gerard means is that adding new files under the same base path (key) is fine, but if you append lines to an existing file, the changes will not be picked up.

Regards,
Gourav Sengupta


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

Gourav Sengupta
Would that not be the same as appending lines to the same file?

On Tue, Jan 16, 2018 at 4:50 AM, kant kodali <[hidden email]> wrote:
Got it! What about overwriting the same file instead of appending? 


Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

ayan guha
On Tue, Jan 16, 2018 at 3:50 PM, kant kodali <[hidden email]> wrote:
Got it! What about overwriting the same file instead of appending? 

--
Best Regards,
Ayan Guha