How does readStream() and writeStream() work?

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

How does readStream() and writeStream() work?

I'm wondering how does readStream() and writeStream() work internally
Lets take a simple example:

df = spark.readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", kafka_brokers) \
                .option("subscribe", kafka_topic) \
                .load() \

res =
                .option("checkpointLocation", hdfs_dir + "checkpoint") \
                .option("path",  hdfs_dir + "data") \
                .trigger(processingTime= ' 60 seconds') \

This code read a kafka topic and then writes it to hdfs every 60 seconds.

My questions are:
1. when running readStream() what happens? does the spark job do something
or is it just like a "transformation" in spark's terminology and nothing
actually happens until an action is called?
2. writeStream() is started and the wrtiing happend every 60 seconds. Can I
intervene somehow in what happens when the actual writing occurs? for
example, can I write a log message "60 seconds passed, writing bulk to hdfs"
each time ?
3. is it possible to write to the same hdfs file each time the actual
writing occurs? for now it creates a new hdfs file each time.


Sent from:

To unsubscribe e-mail: [hidden email]