

I am having data in the format of the following row: ABZABZAF2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y1,2,3,4,5730 I want to convert it into several rows in the format below: ABZABZAF21730 ABZABZAF3+1730 . . . ABZABZAF31730 ABZABZAF32730 ABZABZAF33730 . . . ABZABZAFY4730 ABZABZAFY5730 Basically, I want to consider the various combinations of the 4th and 5th columns (where the values are delimited by commas) and accordingly generate the above rows from a single row. Please can you suggest me for a good way of acheiving this. Thanks in advance ! Regards, Debu


Using explode on the 4th column, followed by an explode on the 5th column would produce what you want (you might need to use split on the columns first if they are not already an array).


Quick pyspark code:
>>> s = "ABZABZAF2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y1,2,3,4,5730" >>> base = sc.parallelize([s.split("")]) >>> base.take(10) [['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5', '730']]
>>> def pv(t): ... x = t[3].split(",") ... y = t[4].split(",") ... for k in product(x,y): ... yield (t[0],t[1],k[0],k[1],t[5]) ... >>> res = base.flatMap(pv) >>> res.take(10) [('ABZ', 'ABZ', '2', '1', '730'), ('ABZ', 'ABZ', '2', '2', '730'), ('ABZ', 'ABZ', '2', '3', '730'), ('ABZ', 'ABZ', '2', '4', '730'), ('ABZ', 'ABZ', '2', '5', '730'), ('ABZ', 'ABZ', '3', '1', '730'), ('ABZ', 'ABZ', '3', '2', '730'), ('ABZ', 'ABZ', '3', '3', '730'), ('ABZ', 'ABZ', '3', '4', '730'), ('ABZ', 'ABZ', '3', '5', '730')]


Thanks Ayan and NIcholas for your jetfast reply ! Appreciate it a lot.
Cheers, Debu

