[Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

hesouol
I have a PySpark method that applies the explode function to every array column of a DataFrame:

from pyspark.sql.functions import explode_outer
from pyspark.sql.types import ArrayType

def explode_column(df, column):
    # Replace the array column, in place in the select list, with its
    # exploded version (explode_outer keeps rows whose array is null/empty).
    select_cols = list(df.columns)
    col_position = select_cols.index(column)
    select_cols[col_position] = explode_outer(column).alias(column)
    return df.select(select_cols)

def explode_all_arrays(df):
    # Keep exploding until no array columns remain; nested arrays
    # need more than one pass.
    still_has_arrays = True
    exploded_df = df
    while still_has_arrays:
        still_has_arrays = False
        for f in exploded_df.schema.fields:
            if isinstance(f.dataType, ArrayType):
                print(f"Exploding: {f}")
                still_has_arrays = True
                exploded_df = explode_column(exploded_df, f.name)
    return exploded_df

With a small number of columns to explode it works perfectly, but on large DataFrames (~200 columns, ~40 explosions) the resulting DataFrame can't be written as Parquet after the method finishes.
Even a tiny input (~400 KB) fails, not during the method itself but at the write step.
Any tips? I tried writing the DataFrame both as a table and as a Parquet file. Registering it as a temp view works, though.
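For a sense of scale, a plain-Python sketch (no Spark involved, numbers illustrative): exploding several independent array columns multiplies row counts, because each explode crosses the existing rows with that column's elements.

```python
from itertools import product

# Simulate one source row holding three array columns of 5 elements each.
# Each exploded array column multiplies the output row count by its length.
arrays = [list(range(5)) for _ in range(3)]
rows = list(product(*arrays))
print(len(rows))  # 125, i.e. 5 * 5 * 5 rows from a single input row
```

With ~40 exploded columns, even two elements per array means 2**40 (over a trillion) output rows per input row, which alone could explain a write that never finishes on a 400 KB input.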

Thank you,
henrique.
Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

hesouol
I forgot to add one piece of information. By "can't write" I mean the job keeps processing and nothing happens. It runs for hours even with a very small file and I have to kill it.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/


Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

Patrick McCarthy-2
This seems like a very expensive operation. Why do you want to write out all the exploded values? If you just want all combinations of values, could you instead do it at read-time with a UDF or something?




--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016
Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

hesouol
Hi Patrick, thank you for the quick response.
That's exactly what I think. The result of this processing is actually an intermediate table that is then used to generate other views.
Another approach I'm trying now is to move the "explosion" step into that "view generation" step; that way I don't need to explode every column, only the ones the final client actually uses.

P.S. I was avoiding UDFs because I'm still on Spark 2.4 and the Python UDFs I tried performed very badly, but I'll give them a try in this case. It can't be worse.
Thanks again!


Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

Patrick McCarthy-2
If you use pandas_udfs in 2.4 they should be quite performant (or at least won't suffer the per-row serialization overhead of plain Python UDFs); it might be worth looking into.
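A minimal sketch of what that could look like (the column name `value` and the doubling logic are made up for illustration; the Spark registration assumes Spark 2.4 with PyArrow installed, so it is shown only in comments). A scalar pandas_udf just wraps an ordinary pandas Series-to-Series function, which is why it avoids per-row serialization:

```python
import pandas as pd

def double(values: pd.Series) -> pd.Series:
    # Ordinary vectorized pandas logic; the UDF runs this once per Arrow batch,
    # not once per row.
    return values * 2

# In Spark 2.4 this would be registered roughly as follows (not executed here,
# since it needs a running Spark session):
#
#   from pyspark.sql.functions import pandas_udf, PandasUDFType
#   double_udf = pandas_udf(double, "long", PandasUDFType.SCALAR)
#   df.select(double_udf(df["value"]))

print(double(pd.Series([1, 2, 3])).tolist())  # [2, 4, 6]
```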

I didn't run your code, but one consideration is that the while loop might be making the DAG a lot bigger than it has to be. You might see whether defining those columns with a list comprehension and issuing a single select() statement makes for a smaller DAG.
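A toy model (plain Python, no Spark) of why the loop inflates the plan: every chained select wraps the previous DataFrame's plan in one more projection node, so ~40 one-column explosions yield a plan ~40 projections deep, while a single combined select adds only one. The tuple "plan" below is purely illustrative, not Spark's actual plan representation:

```python
from functools import reduce

def select(plan, label):
    # Model a DataFrame.select(): wrap the previous plan in one projection node.
    return ("Project", label, plan)

def depth(plan):
    # Count nested projection nodes above the base scan.
    d = 0
    while isinstance(plan, tuple):
        d += 1
        plan = plan[2]
    return d

base = "Scan"
looped = reduce(select, [f"explode_{i}" for i in range(40)], base)
print(depth(looped))    # 40 nested projections from 40 one-at-a-time explodes

combined = select(base, "all_columns_at_once")
print(depth(combined))  # 1
```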


Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

hesouol
Thank you for both tips, I will definitely try pandas_udfs. As for changing the select operation: it's not possible to have multiple explode functions in the same select, so sadly they must be applied one at a time.
