How to modify a field in a nested struct using pyspark

Felix Kizhakkel Jose
Hello All,

I am using PySpark Structured Streaming, and I am receiving timestamp fields as plain longs (milliseconds since the epoch), so I have to convert these fields into a timestamp type.

A sample JSON object:
{
  "id": {
    "value": "f40b2e22-4003-4d90-afd3-557bc013b05e",
    "type": "UUID",
    "system": "Test"
  },
  "status": "Active",
  "timingPeriod": {
    "startDateTime": 1611859271516,
    "endDateTime": null
  },
  "eventDateTime": 1611859272122,
  "isPrimary": true
}
Here I want to convert "eventDateTime", "startDateTime", and "endDateTime" to timestamp types.

So I have done the following:

def transform_date_col(date_col):
    # Convert epoch milliseconds to seconds; rows where the column is null stay null
    return f.when(f.col(date_col).isNotNull(), f.col(date_col) / 1000)

df.withColumn(
    "eventDateTime", transform_date_col("eventDateTime").cast("timestamp")
).withColumn(
    "timingPeriod.startDateTime", transform_date_col("timingPeriod.startDateTime").cast("timestamp")
).withColumn(
    "timingPeriod.endDateTime", transform_date_col("timingPeriod.endDateTime").cast("timestamp")
)
But with this, the timingPeriod values are no longer inside a struct; instead I get two separate top-level columns literally named "timingPeriod.startDateTime" and "timingPeriod.endDateTime".
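(For illustration: referencing one of these new columns afterwards needs backticks, since the dot is now part of the literal column name rather than a path into a struct:)

df.select(f.col("`timingPeriod.startDateTime`")).show()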

How can I keep them in a struct, as before?
Is there a generic way I can modify one or more properties of nested structs?

I have hundreds of entities where longs need to be converted to timestamps, so a generic implementation would help my data ingestion pipeline a lot.

Regards,
Felix K Jose


Re: How to modify a field in a nested struct using pyspark

Kimahriman
As of 3.0, the only way to do it is something that recreates the whole struct:

df.withColumn('timingPeriod', f.struct(
    f.col('timingPeriod.startDateTime').cast('timestamp').alias('startDateTime'),
    f.col('timingPeriod.endDateTime').cast('timestamp').alias('endDateTime')))
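A fuller sketch that also folds in the millisecond conversion from the original post (assuming the transform_date_col helper defined there is in scope):

df = df.withColumn('timingPeriod', f.struct(
    # divide by 1000 first so the long is treated as epoch seconds, not milliseconds
    transform_date_col('timingPeriod.startDateTime').cast('timestamp').alias('startDateTime'),
    transform_date_col('timingPeriod.endDateTime').cast('timestamp').alias('endDateTime')))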

There's a new method coming in 3.1 on the Column class called withField, which was designed for this purpose. I backported it to my personal 3.0 build because of how useful it is. It works something like:

df.withColumn('timingPeriod', f.col('timingPeriod')
    .withField('startDateTime', f.col('timingPeriod.startDateTime').cast('timestamp'))
    .withField('endDateTime', f.col('timingPeriod.endDateTime').cast('timestamp')))

And it works on multiple levels of nesting, which is nice.
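For the generic case across hundreds of entities, a sketch along these lines could be a starting point (illustrative only, not a library API; it assumes 3.1's Column.withField and that every epoch-millis field is a LongType whose name ends in "DateTime"):

from pyspark.sql import functions as f
from pyspark.sql.types import LongType, StructType

def convert_millis_fields(df):
    # Cast every LongType field named '*DateTime', at the top level or one
    # struct level down, from epoch milliseconds to a timestamp.
    for field in df.schema.fields:
        if isinstance(field.dataType, LongType) and field.name.endswith("DateTime"):
            df = df.withColumn(field.name, (f.col(field.name) / 1000).cast("timestamp"))
        elif isinstance(field.dataType, StructType):
            struct_col = f.col(field.name)
            for nested in field.dataType.fields:
                if isinstance(nested.dataType, LongType) and nested.name.endswith("DateTime"):
                    struct_col = struct_col.withField(
                        nested.name,
                        (f.col(field.name + "." + nested.name) / 1000).cast("timestamp"))
            df = df.withColumn(field.name, struct_col)
    return df

df = convert_millis_fields(df)

This sketch only descends one struct level; deeper nesting would need a recursive variant, which withField itself can support.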

--
Adam Binford

Re: How to modify a field in a nested struct using pyspark

Felix Kizhakkel Jose
Wow, that's really great to know. Thank you so much, Adam. Do you know when the 3.1 release is scheduled?

Regards,
Felix K Jose


Re: How to modify a field in a nested struct using pyspark

Kimahriman
I think they're voting on the next release candidate starting sometime next week. So hopefully, barring any other major hurdles, within the next few weeks.

--
Adam Binford

Re: How to modify a field in a nested struct using pyspark

Jeff Evans
If you need to do this in 2.x, this library does the trick: https://github.com/fqaiser94/mse


Re: How to modify a field in a nested struct using pyspark

Felix Kizhakkel Jose
Thank you so much for the quick response and great help.

@Jeff, I will use the library if the 3.1 release gets delayed. Thank you so much.
