I'm trying to analyze data using Kinesis source in PySpark Structured Streaming on Databricks.
Ceeated a Dataframe as shown below.
kinDF = spark.readStream.format("kinesis").("streamName", "test-stream-1").load()
Converted the data from base64 encoding as below.
df = kinDF.withColumn("xml_data", expr("CAST(data as string)"))
Now, I need to extract few fields from df.xml_data column using xpath.
Can you please suggest any possible solution?
If I create a dataframe directly for these xml files as
xml_df = spark.read.format("xml").options(rowTag='Consumers').load("s3a://bkt/xmldata")
I'm able to query using xpath
xml_df.select("Analytics.Amount1").show()
But, not sure how to do extract elements similarly on a Spark Streaming dataframe where data is in text format.
Are there any xml functions to convert text data using schema? I saw an example for json data using from_json.
Is it possible to use spark.read on a dataframe column?
I need to find aggregated "Amount1" for every 5 minutes window.
Thanks for your help.
Nick