Create all the combinations of a groupBy


Create all the combinations of a groupBy

Pierremalliard
Hi,

I am trying to generate a DataFrame of all pairwise combinations of rows that
share the same key, using PySpark.

example:

(a,1)
(a,2)
(a,3)
(b,1)
(b,2)

should return:

(a, 1, 2)
(a, 1, 3)
(a, 2, 3)
(b, 1, 2)


I want to do something like df.groupBy('key').combinations().apply(...)
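In plain Python, the per-key pairing described above amounts to this runnable sketch with itertools (the data here just mirrors the example):

```python
from itertools import combinations, groupby

rows = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2)]

# Group by key, then emit every unordered pair of values within a group.
result = [
    (key, v1, v2)
    for key, grp in groupby(sorted(rows), key=lambda r: r[0])
    for v1, v2 in combinations((r[1] for r in grp), 2)
]
print(result)
# [('a', 1, 2), ('a', 1, 3), ('a', 2, 3), ('b', 1, 2)]
```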

Any suggestions are welcome!

Thanks



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Create all the combinations of a groupBy

hemant singh
Check the rollup and cube functions in Spark SQL.

On Wed, 23 Jan 2019 at 10:47 PM, Pierremalliard <[hidden email]> wrote:


Re: Create all the combinations of a groupBy

Pierremalliard
Looks like I found the solution, in case anyone ever encounters a similar
challenge...

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType

df = spark.createDataFrame(
    [("a", 1, 0), ("a", 2, 42), ("a", 3, 10),
     ("b", 4, -1), ("b", 5, -2), ("b", 6, 12)],
    ("key", "consumerID", "feature")
)

df.show()

# The IDs and features are integers, so LongType matches the data
# (DoubleType would force a cast on conversion).
schema = StructType([
    StructField("ID_1", LongType()),
    StructField("ID_2", LongType()),
    StructField("feature1", LongType()),
    StructField("feature2", LongType()),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def get_all_combinations(pdf):
    # Pair each row with every later row in the group (i < j),
    # keeping both consumer IDs and both features.
    p = []
    for i in range(len(pdf)):
        for j in range(i + 1, len(pdf)):
            p.append([pdf.consumerID[i], pdf.consumerID[j],
                      pdf.feature[i], pdf.feature[j]])
    return pd.DataFrame(p, columns=["ID_1", "ID_2", "feature1", "feature2"])

df.groupBy('key').apply(get_all_combinations).show()


