Flatten log data Using Pyspark


anbutech
Hi,

I have a raw source DataFrame with two columns, as below:

timestamp
2019-11-29 9:30:45

message_log
<123>NOV 29 10:20:35 ips01 sfids: connection: tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1

How do we break each of the key-value pairs above out into separate columns using a UDF in PySpark?

What is the right approach for flattening this type of log data: a regex or plain Python logic?

Could you please help me with the logic for flattening the log data?

The final output DataFrame should have the following columns and values:

timestamp: 2019-11-29 9:30:45
prio: 123
msg_ts: NOV 29 10:20:35
msg_ids: ips01
sfids:
connection: tcp
bytes: 104
user: unknown
url: unknown
host: 127.0.0.1


Thanks
Anbu
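
For reference, a minimal sketch of the UDF route asked about here, assuming the message_log always follows the "<prio>timestamp host tag: key:value,..." layout in the sample (the helper name parse_log and the all-string schema are illustrative choices, not from the thread):

import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# One sample row copied from the post above.
df = spark.createDataFrame(
    [("2019-11-29 9:30:45",
      "<123>NOV 29 10:20:35 ips01 sfids: connection: tcp,bytes:104,"
      "user:unknown,url:unknown,host:127.0.0.1")],
    ["timestamp", "message_log"])

# Target columns, all kept as strings for simplicity.
fields = ["prio", "msg_ts", "msg_ids", "connection", "bytes", "user", "url", "host"]
log_schema = StructType([StructField(f, StringType()) for f in fields])

# <prio>, syslog timestamp, host, program tag, then the key:value body.
header_re = re.compile(r"^<(\d+)>(\w{3} \d{1,2} \d{2}:\d{2}:\d{2}) (\S+) \w+:\s*")

def parse_log(msg):
    m = header_re.match(msg or "")
    if not m:
        return (None,) * len(fields)
    prio, msg_ts, msg_ids = m.groups()
    # Remaining text is comma-separated key:value pairs.
    pairs = (p.split(":", 1) for p in msg[m.end():].split(",") if ":" in p)
    kv = {k.strip(): v.strip() for k, v in pairs}
    return (prio, msg_ts, msg_ids, kv.get("connection"), kv.get("bytes"),
            kv.get("user"), kv.get("url"), kv.get("host"))

parse_log_udf = udf(parse_log, log_schema)

flat = (df.withColumn("parsed", parse_log_udf(col("message_log")))
          .select("timestamp", "parsed.*"))
flat.show(truncate=False)

The struct-returning UDF keeps the parsing in one place; selecting parsed.* then expands it into the individual columns.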






Re: Flatten log data Using Pyspark

Gourav Sengupta
Why do you want to use a UDF?

Regards,
Gourav
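
Gourav's question hints at skipping the UDF entirely; a minimal sketch of that route with only the built-in regexp_extract might look like the following (the regex patterns and column names simply mirror the desired output above and assume the log layout never varies; the sfids tag carries no value in the sample, so it is not extracted):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample row copied from the post above.
df = spark.createDataFrame(
    [("2019-11-29 9:30:45",
      "<123>NOV 29 10:20:35 ips01 sfids: connection: tcp,bytes:104,"
      "user:unknown,url:unknown,host:127.0.0.1")],
    ["timestamp", "message_log"])

msg = "message_log"
flattened = (
    df
    # syslog-style header: <prio>, timestamp, host
    .withColumn("prio",    F.regexp_extract(msg, r"^<(\d+)>", 1))
    .withColumn("msg_ts",  F.regexp_extract(msg, r"^<\d+>(\w{3} \d{1,2} \d{2}:\d{2}:\d{2})", 1))
    .withColumn("msg_ids", F.regexp_extract(msg, r"^<\d+>\w{3} \d{1,2} \d{2}:\d{2}:\d{2} (\S+)", 1))
    # comma-separated key:value pairs in the body
    .withColumn("connection", F.regexp_extract(msg, r"connection:\s*([^,]+)", 1))
    .withColumn("bytes",      F.regexp_extract(msg, r"bytes:([^,]+)", 1))
    .withColumn("user",       F.regexp_extract(msg, r"user:([^,]+)", 1))
    .withColumn("url",        F.regexp_extract(msg, r"url:([^,]+)", 1))
    .withColumn("host",       F.regexp_extract(msg, r"host:([^,]+)", 1))
    .drop("message_log")
)
flattened.show(truncate=False)

Staying with built-in functions keeps the work inside the JVM and avoids the per-row Python serialization cost of a UDF; a regex-based UDF, as sketched earlier in the thread, remains a reasonable fallback if the messages turn out to be too irregular for fixed patterns.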
