How to merge multiple rows

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

How to merge multiple rows

msbreuer
A dataframe with following contents is given:

ID PART DETAILS
 1    1 A1
 1    2 A2
 1    3 A3
 2    1 B1
 3    1 C1

Target format should be as following:

ID DETAILS
 1 A1+A2+A3
 2 B1
 3 C1

Note, the order of A1-3 is important.

Currently I am using this alternative:

ID DETAIL_1 DETAIL_2 DETAIL_3
 1 A1       A2       A3
 2 B1
 3 C1

What would be the best method to do such transformation an a large dataset?




---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

jgp
Reply | Threaded
Open this post in threaded view
|

Re: How to merge multiple rows

jgp
How do you do it now?

You could use a withColumn(“newDetails”, <concatenation of details_1, details_2...>)

jg


> On Aug 22, 2018, at 16:04, msbreuer <[hidden email]> wrote:
>
> A dataframe with following contents is given:
>
> ID PART DETAILS
> 1    1 A1
> 1    2 A2
> 1    3 A3
> 2    1 B1
> 3    1 C1
>
> Target format should be as following:
>
> ID DETAILS
> 1 A1+A2+A3
> 2 B1
> 3 C1
>
> Note, the order of A1-3 is important.
>
> Currently I am using this alternative:
>
> ID DETAIL_1 DETAIL_2 DETAIL_3
> 1 A1       A2       A3
> 2 B1
> 3 C1
>
> What would be the best method to do such transformation an a large dataset?
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: How to merge multiple rows

Patrick McCarthy-2
You didn't specify which API, but in pyspark you could do

import pyspark.sql.functions as F

df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()

+---+------------+
| ID|     DETAILS|
+---+------------+
|  1|[A1, A2, A3]|
|  3|        [B2]|
|  2|        [B1]|
+---+------------+

If you want to sort by PART I think you'll need a UDF.

On Wed, Aug 22, 2018 at 4:12 PM, Jean Georges Perrin <[hidden email]> wrote:
How do you do it now?

You could use a withColumn(“newDetails”, <concatenation of details_1, details_2...>)

jg


> On Aug 22, 2018, at 16:04, msbreuer <[hidden email]> wrote:
>
> A dataframe with following contents is given:
>
> ID PART DETAILS
> 1    1 A1
> 1    2 A2
> 1    3 A3
> 2    1 B1
> 3    1 C1
>
> Target format should be as following:
>
> ID DETAILS
> 1 A1+A2+A3
> 2 B1
> 3 C1
>
> Note, the order of A1-3 is important.
>
> Currently I am using this alternative:
>
> ID DETAIL_1 DETAIL_2 DETAIL_3
> 1 A1       A2       A3
> 2 B1
> 3 C1
>
> What would be the best method to do such transformation an a large dataset?
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]