How to flatten a row in PySpark

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

How to flatten a row in PySpark

Debabrata Ghosh
Hi,
        Greetings !

I am having data in the format of the following row:

ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730

I want to convert it into several rows in the format below:

ABZ|ABZ|AF|2|1|730
ABZ|ABZ|AF|3+1|730
.
.
.
ABZ|ABZ|AF|3|1|730
ABZ|ABZ|AF|3|2|730
ABZ|ABZ|AF|3|3|730
.
.
.
ABZ|ABZ|AF|Y|4|730
ABZ|ABZ|AF||Y|5|730

Basically, I want to consider the various combinations of the 4th and 5th columns (where the values are delimited by commas) and accordingly generate the above rows from a single row. Please can you suggest me for a good way of acheiving this. Thanks in advance !

Regards,

Debu
Reply | Threaded
Open this post in threaded view
|

Re: How to flatten a row in PySpark

Nicholas Hakobian
Using explode on the 4th column, followed by an explode on the 5th column would produce what you want (you might need to use split on the columns first if they are not already an array).

Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health


On Thu, Oct 12, 2017 at 9:09 AM, Debabrata Ghosh <[hidden email]> wrote:
Hi,
        Greetings !

I am having data in the format of the following row:

ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730

I want to convert it into several rows in the format below:

ABZ|ABZ|AF|2|1|730
ABZ|ABZ|AF|3+1|730
.
.
.
ABZ|ABZ|AF|3|1|730
ABZ|ABZ|AF|3|2|730
ABZ|ABZ|AF|3|3|730
.
.
.
ABZ|ABZ|AF|Y|4|730
ABZ|ABZ|AF||Y|5|730

Basically, I want to consider the various combinations of the 4th and 5th columns (where the values are delimited by commas) and accordingly generate the above rows from a single row. Please can you suggest me for a good way of acheiving this. Thanks in advance !

Regards,

Debu

Reply | Threaded
Open this post in threaded view
|

Re: How to flatten a row in PySpark

ayan guha
Quick pyspark code: 

>>> s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
>>> base = sc.parallelize([s.split("|")])
>>> base.take(10)
[['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5', '730']]

>>> def pv(t):
...     x = t[3].split(",")
...     y = t[4].split(",")
...     for k in product(x,y):
...         yield (t[0],t[1],k[0],k[1],t[5])
...
>>> res = base.flatMap(pv)
>>> res.take(10)
[('ABZ', 'ABZ', '2', '1', '730'), ('ABZ', 'ABZ', '2', '2', '730'), ('ABZ', 'ABZ', '2', '3', '730'), ('ABZ', 'ABZ', '2', '4', '730'), ('ABZ', 'ABZ', '2', '5', '730'), ('ABZ', 'ABZ', '3', '1', '730'), ('ABZ', 'ABZ', '3', '2', '730'), ('ABZ', 'ABZ', '3', '3', '730'), ('ABZ', 'ABZ', '3', '4', '730'), ('ABZ', 'ABZ', '3', '5', '730')]



On Fri, Oct 13, 2017 at 6:03 AM, Nicholas Hakobian <[hidden email]> wrote:
Using explode on the 4th column, followed by an explode on the 5th column would produce what you want (you might need to use split on the columns first if they are not already an array).

Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health


On Thu, Oct 12, 2017 at 9:09 AM, Debabrata Ghosh <[hidden email]> wrote:
Hi,
        Greetings !

I am having data in the format of the following row:

ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730

I want to convert it into several rows in the format below:

ABZ|ABZ|AF|2|1|730
ABZ|ABZ|AF|3+1|730
.
.
.
ABZ|ABZ|AF|3|1|730
ABZ|ABZ|AF|3|2|730
ABZ|ABZ|AF|3|3|730
.
.
.
ABZ|ABZ|AF|Y|4|730
ABZ|ABZ|AF||Y|5|730

Basically, I want to consider the various combinations of the 4th and 5th columns (where the values are delimited by commas) and accordingly generate the above rows from a single row. Please can you suggest me for a good way of acheiving this. Thanks in advance !

Regards,

Debu




--
Best Regards,
Ayan Guha
Reply | Threaded
Open this post in threaded view
|

Re: How to flatten a row in PySpark

Debabrata Ghosh
Thanks Ayan and NIcholas for your jetfast reply ! Appreciate it a lot.

Cheers,

Debu

On Fri, Oct 13, 2017 at 9:27 AM, ayan guha <[hidden email]> wrote:
Quick pyspark code: 

>>> s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
>>> base = sc.parallelize([s.split("|")])
>>> base.take(10)
[['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5', '730']]

>>> def pv(t):
...     x = t[3].split(",")
...     y = t[4].split(",")
...     for k in product(x,y):
...         yield (t[0],t[1],k[0],k[1],t[5])
...
>>> res = base.flatMap(pv)
>>> res.take(10)
[('ABZ', 'ABZ', '2', '1', '730'), ('ABZ', 'ABZ', '2', '2', '730'), ('ABZ', 'ABZ', '2', '3', '730'), ('ABZ', 'ABZ', '2', '4', '730'), ('ABZ', 'ABZ', '2', '5', '730'), ('ABZ', 'ABZ', '3', '1', '730'), ('ABZ', 'ABZ', '3', '2', '730'), ('ABZ', 'ABZ', '3', '3', '730'), ('ABZ', 'ABZ', '3', '4', '730'), ('ABZ', 'ABZ', '3', '5', '730')]



On Fri, Oct 13, 2017 at 6:03 AM, Nicholas Hakobian <[hidden email]> wrote:
Using explode on the 4th column, followed by an explode on the 5th column would produce what you want (you might need to use split on the columns first if they are not already an array).

Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health


On Thu, Oct 12, 2017 at 9:09 AM, Debabrata Ghosh <[hidden email]> wrote:
Hi,
        Greetings !

I am having data in the format of the following row:

ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730

I want to convert it into several rows in the format below:

ABZ|ABZ|AF|2|1|730
ABZ|ABZ|AF|3+1|730
.
.
.
ABZ|ABZ|AF|3|1|730
ABZ|ABZ|AF|3|2|730
ABZ|ABZ|AF|3|3|730
.
.
.
ABZ|ABZ|AF|Y|4|730
ABZ|ABZ|AF||Y|5|730

Basically, I want to consider the various combinations of the 4th and 5th columns (where the values are delimited by commas) and accordingly generate the above rows from a single row. Please can you suggest me for a good way of acheiving this. Thanks in advance !

Regards,

Debu




--
Best Regards,
Ayan Guha