Understanding Spark behavior when reading from Kafka into a static DataFrame


Arbab Khalil
Hi all,
I have IoT time-series data in Kafka and I am reading it into a static DataFrame as follows:
from pyspark.sql.functions import col, from_json

# schema (defined elsewhere) describes the JSON payload in the Kafka message value
df = (spark.read
      .format("kafka")
      .option("zookeeper.connect", "localhost:2181")  # not used by the Kafka source; bootstrap.servers is enough
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test_topic")
      .option("failOnDataLoss", "false")
      .option("startingOffsets", "earliest")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("stream"))
      .select("stream.*")
      .withColumn("time", col("time").cast("timestamp"))
      .orderBy("dev_id"))
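To see where rows actually land, I have been probing the partitioning like this (a rough check only; partition ids are physical and not tied to dev_id, and df above already includes the orderBy shuffle):

from pyspark.sql.functions import spark_partition_id

# How many Spark partitions the DataFrame currently has
print(df.rdd.getNumPartitions())

# Row count per physical partition, to spot skew
df.groupBy(spark_partition_id().alias("pid")).count().show()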

I want to know how the data is distributed over multiple executors. I want it partitioned by dev_id, so that each executor gets all of the data for a given dev_id. Later I group by dev_id and run a @pandas_udf on each group; a sketch of that step follows below.
Please note that it is a static DataFrame, not a streaming DataFrame, since @pandas_udf does not support streaming DataFrames.
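For context, the grouping step looks roughly like this (process_device is a placeholder for my actual per-device logic, and I am reusing df.schema as the output schema just for illustration):

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Placeholder grouped-map UDF: Spark hands this function all rows for
# one dev_id as a single pandas.DataFrame
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def process_device(pdf):
    # pdf contains every row for exactly one dev_id
    return pdf

# groupBy().apply() shuffles by the grouping key, so each dev_id group
# is processed as a whole on one executor
result = df.groupBy("dev_id").apply(process_device)

My understanding is that this shuffle alone should co-locate each dev_id, so an explicit repartition("dev_id") beforehand may be redundant, but I would like to confirm that.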
--
Regards,
Arbab Khalil
Software Design Engineer