Explode/Flatten Map type Data Using Pyspark

4 messages

Explode/Flatten Map type Data Using Pyspark

anbutech
Hello Sir,

I have a scenario where I need to flatten different combinations of map-type
(key/value) data in a column called eve_data, as shown below.

How do we flatten the map type into proper columns using PySpark?


1) Source DataFrame with two columns (eve_id, eve_data)

eve_id,eve_data
001,  "k1":"abc",
      "k2":"xyz",
      "k3":"10091"

002,  "k1":"12",
      "k2":"jack",
      "k3":"0.01",
      "k4":"0998"

003,  "k1":"aaa",
      "k2":"xxxx",
      "k3":"device",
      "k4":"endpoint",
      "k5":"-"
       
Final output:

(Flatten the key/value pairs of each event id.) The number of key/value pairs
differs for each event id, so I want to flatten the records for all of the
map-type (key/value) data as below:

eve_id  k1   k2    k3
001     abc  xyz   10091

eve_id  k1   k2    k3    k4
002     12   jack  0.01  0998

eve_id  k1   k2    k3      k4        k5
003     aaa  xxxx  device  endpoint  -


Thanks
Anbu



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Explode/Flatten Map type Data Using Pyspark

ayan guha
Hi

How do you want your final DF to look? Should it have all five value columns? Do you have a finite set of columns?

On Fri, Nov 15, 2019 at 4:50 AM anbutech <[hidden email]> wrote:



--
Best Regards,
Ayan Guha

Re: Explode/Flatten Map type Data Using Pyspark

anbutech
Hello Guha,

The number of keys differs for each event id. For example, if event id 005
has 7 keys, then I have to flatten all 7 of those keys in the final output.
There is no fixed number of keys per event id.

001 -> 3 keys

002 -> 4 keys

003 -> 5 keys

005 -> 7 keys

Each event id above has a different combination of key/values, so I want to
dynamically flatten the incoming data into the final output S3 CSV file
(i.e. write all the flattened keys to the CSV for each corresponding event id).

Final dataframe:

eve_id  k1   k2    k3
001     abc  x     y

eve_id  k1   k2    k3    k4
002     12   jack  0.01  0998

eve_id  k1   k2    k3      k4        k5
003     aaa  xxxx  device  endpoint  -

eve_id  k1   k2    k3  k4     k5  k6   k7
005     127  0.00  -   error  -   000  login



flatten.csv

eve_id  k1   k2    k3
001     abc  x     y

eve_id  k1   k2    k3    k4
002     12   jack  0.01  0998

eve_id  k1   k2    k3      k4        k5
003     aaa  xxxx  device  endpoint  -

eve_id  k1   k2    k3  k4     k5  k6   k7
005     127  0.00  -   error  -   000  login
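The flatten.csv layout above has a different header per event id, which Spark's CSV writer cannot produce directly. One way to reproduce it (not from the thread; a driver-side sketch that is only sensible for small data collected from the DataFrame) is to write one header line plus one data line per event with the stdlib csv module:

```python
import csv
import io

# Hypothetical sample rows, as (eve_id, key/value map) pairs; in practice
# these would come from df.collect() on the flattened DataFrame.
events = [
    ("001", {"k1": "abc", "k2": "x", "k3": "y"}),
    ("005", {"k1": "127", "k2": "0.00", "k3": "-", "k4": "error",
             "k5": "-", "k6": "000", "k7": "login"}),
]

buf = io.StringIO()
w = csv.writer(buf)
for eve_id, data in events:
    keys = sorted(data)               # k1..kN in order for this event
    w.writerow(["eve_id"] + keys)     # per-event header
    w.writerow([eve_id] + [data[k] for k in keys])
    w.writerow([])                    # blank line between events
print(buf.getvalue())
```

Note this file is no longer a single well-formed CSV table: each event carries its own header, so generic CSV readers will not parse it back cleanly.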






Re: Explode/Flatten Map type Data Using Pyspark

ayan guha
Hi Anbutech, in that case you have a variable number of columns in the output DF and then in the CSV; that will not be the best format to read back.
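One way around the variable-column problem (a sketch, not from the thread): project every event onto the union of all keys, so the file keeps one fixed header and rows simply leave missing keys empty. Shown here with the stdlib csv module on hypothetical sample data to illustrate the layout:

```python
import csv
import io

# Hypothetical sample data with a different key set per event.
rows = [
    ("001", {"k1": "abc", "k2": "xyz", "k3": "10091"}),
    ("002", {"k1": "12", "k2": "jack", "k3": "0.01", "k4": "0998"}),
    ("003", {"k1": "aaa", "k2": "xxxx", "k3": "device", "k4": "endpoint", "k5": "-"}),
]

# Union of all keys across events -> one stable CSV header.
all_keys = sorted({k for _, data in rows for k in data})

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["eve_id"] + all_keys)
for eve_id, data in rows:
    # Missing keys become empty cells, keeping every row the same width.
    writer.writerow([eve_id] + [data.get(k, "") for k in all_keys])
print(buf.getvalue())
```

This trades a few empty cells for a file that any CSV reader (including Spark) can load back with a single, predictable schema.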

On Fri, 15 Nov 2019 at 2:30 pm, anbutech <[hidden email]> wrote:

--
Best Regards,
Ayan Guha