How to modify a field in a nested struct using pyspark

Felix Kizhakkel Jose
Hello All,

I am using PySpark Structured Streaming, and my timestamp fields arrive as plain longs (epoch milliseconds), so I need to convert these fields to the timestamp type.

A sample JSON object:
{
    "id": {
        "value": "f40b2e22-4003-4d90-afd3-557bc013b05e",
        "type": "UUID",
        "system": "Test"
    },
    "status": "Active",
    "timingPeriod": {
        "startDateTime": 1611859271516,
        "endDateTime": null
    },
    "eventDateTime": 1611859272122,
    "isPrimary": true
}
Here I want to convert "eventDateTime", "startDateTime", and "endDateTime" to the timestamp type.
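To make the goal concrete, here is a minimal plain-Python sketch (outside Spark) of the intended result for the sample record. The helper name `millis_to_datetime` is hypothetical, and UTC is assumed only for illustration. The point is that "timingPeriod" stays a nested object after conversion.

```python
from datetime import datetime, timezone


def millis_to_datetime(ms):
    """Convert epoch milliseconds to a UTC datetime; None stays None."""
    if ms is None:
        return None
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


record = {
    "status": "Active",
    "timingPeriod": {"startDateTime": 1611859271516, "endDateTime": None},
    "eventDateTime": 1611859272122,
    "isPrimary": True,
}

# Convert the top-level field and rebuild the nested object in place,
# so the nesting of "timingPeriod" is preserved.
converted = {
    **record,
    "timingPeriod": {
        key: millis_to_datetime(value)
        for key, value in record["timingPeriod"].items()
    },
    "eventDateTime": millis_to_datetime(record["eventDateTime"]),
}
```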

So I have done the following:

    from pyspark.sql import functions as f

    def transform_date_col(date_col):
        # Convert epoch milliseconds to seconds; null values stay null
        return f.when(f.col(date_col).isNotNull(), f.col(date_col) / 1000)

    df = (
        df.withColumn("eventDateTime", transform_date_col("eventDateTime").cast("timestamp"))
        .withColumn("timingPeriod.startDateTime", transform_date_col("timingPeriod.startDateTime").cast("timestamp"))
        .withColumn("timingPeriod.endDateTime", transform_date_col("timingPeriod.endDateTime").cast("timestamp"))
    )

With this, the converted timingPeriod fields do not end up inside the struct; they become two separate top-level columns literally named "timingPeriod.startDateTime" and "timingPeriod.endDateTime".

How can I keep them inside the struct as before?
Is there a generic way to modify one or more fields of a nested struct?

I have hundreds of entities in which longs need to be converted to timestamps, so a generic implementation would help my data ingestion pipeline a lot.

Regards,
Felix K Jose
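
One possible generic direction, sketched under stated assumptions rather than offered as a definitive answer: walk the schema and generate Spark SQL expression strings that rebuild each struct with `named_struct`, casting the selected long fields along the way. The function names (`build_exprs`, `to_select_exprs`, `ends_with_date_time`) are hypothetical, the rule "long fields whose name ends in DateTime are epoch milliseconds" is an assumption taken from the sample, and `SAMPLE_SCHEMA` is a hand-written stand-in for the dictionary that `df.schema.jsonValue()` would return.

```python
def build_exprs(schema_json, is_epoch_millis, prefix=""):
    """Return (field_name, sql_expression) pairs for one struct schema."""
    pairs = []
    for field in schema_json["fields"]:
        name, ftype = field["name"], field["type"]
        path = f"{prefix}`{name}`"
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            # Rebuild the nested struct from its (possibly converted) children.
            children = build_exprs(ftype, is_epoch_millis, prefix=path + ".")
            args = ", ".join(
                f"'{child_name}', {child_expr}"
                for child_name, child_expr in children
            )
            expr = f"named_struct({args})"
        elif ftype == "long" and is_epoch_millis(name):
            # Milliseconds to seconds, then cast, as in the original attempt.
            expr = f"CAST({path} / 1000 AS TIMESTAMP)"
        else:
            # Everything else (including arrays and maps) passes through as-is.
            expr = path
        pairs.append((name, expr))
    return pairs


def to_select_exprs(schema_json, is_epoch_millis):
    """Return strings suitable for df.selectExpr(*...)."""
    return [
        f"{expr} AS `{name}`"
        for name, expr in build_exprs(schema_json, is_epoch_millis)
    ]


def ends_with_date_time(name):
    # Assumption: timestamp-like fields are identified by a naming convention.
    return name.endswith("DateTime")


# Hand-written stand-in mirroring the layout of df.schema.jsonValue().
SAMPLE_SCHEMA = {
    "type": "struct",
    "fields": [
        {"name": "status", "type": "string", "nullable": True, "metadata": {}},
        {
            "name": "timingPeriod",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "startDateTime", "type": "long", "nullable": True, "metadata": {}},
                    {"name": "endDateTime", "type": "long", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
        {"name": "eventDateTime", "type": "long", "nullable": True, "metadata": {}},
    ],
}

select_exprs = to_select_exprs(SAMPLE_SCHEMA, ends_with_date_time)
# With a real DataFrame, the intended usage would be along the lines of:
#   df = df.selectExpr(*to_select_exprs(df.schema.jsonValue(), ends_with_date_time))
```

Two caveats with this sketch: a struct that is null in the input comes back as a non-null struct whose fields are all null, and fields nested inside arrays or maps are left unconverted. On Spark 3.1 or later, `Column.withField` is another option for replacing a single nested field without rebuilding the whole struct.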