Spark column combinations and combining multiple dataframes (pyspark)


Christopher Petrino
Hi all, I'm working on a problem where it is necessary to find all combinations of columns for a dataframe. 

THE PROBLEM:

Let's say there is a dataframe with columns:
[ col_a, col_b, col_c, col_d, col_e, result ]

The number of columns in a combination can vary between 1 and 5, but let's say 3 for this case.

This would result in something like:
[ col_a, col_b, col_c, result ]
[ col_b, col_c, col_d, result ]
[ col_b, col_c, col_e, result ]
... etc

These would have their columns renamed and combined to result in:
[ col_1, col_2, col_3, result ]
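
For example, the first of those would be produced with something like the line below (col_1, col_2, col_3 are just the generic labels described above):

df.select("col_a", "col_b", "col_c", "result").toDF("col_1", "col_2", "col_3", "result")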


THE CURRENT APPROACH: 

The current approach is to use Python's itertools.combinations to produce all of the combinations. Then, for each combination, a new dataframe is created from the original dataframe using:

df.select(cols).toDF(*cols).cache()

(Cache and count are being used to help track progress and resolve previous issues.)
Each generated dataframe is appended to a Python list, like so:

the_dfs.append(df.select(cols).toDF(*cols).cache())
the_dfs[-1].count()
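
Put together, that step looks roughly like the sketch below (feature_cols, new_cols, and combo_df are just illustrative names; the generic schema col_1 through col_3 plus result is assumed):

from itertools import combinations

feature_cols = ["col_a", "col_b", "col_c", "col_d", "col_e"]
new_cols = ["col_1", "col_2", "col_3", "result"]

the_dfs = []
for combo in combinations(feature_cols, 3):
    cols = list(combo) + ["result"]
    # Select this combination and rename it to the common schema.
    combo_df = df.select(cols).toDF(*new_cols).cache()
    # Count to force materialization so progress can be tracked.
    combo_df.count()
    the_dfs.append(combo_df)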

The dataframes are finally combined using:

df_all = reduce(DataFrame.union, the_dfs).cache()
df_all.count()
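
For reference, that step relies on functools.reduce and pyspark.sql.DataFrame, and union matches columns by position, which is why each dataframe is renamed to the same schema first. Roughly:

from functools import reduce
from pyspark.sql import DataFrame

# Union all of the per-combination dataframes (columns are matched by position).
df_all = reduce(DataFrame.union, the_dfs).cache()
df_all.count()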


THE CURRENT STATE:

The proof of concept works on a smaller amount of data, but as the data size increases this approach has proven unreliable. I've had several blocking issues that I've resolved by caching dataframes.

Can anyone offer criticism of the current approach, or advice on a different approach?