---- On Mon, 24 Feb 2020 22:06:40 -0500 [hidden email] wrote ----
Thanks Genie. Unfortunately, the joins I'm doing in this case are large, so UDF likely won't work.
From: Liu Genie <[hidden email]>
Sent: Monday, February 24, 2020 6:39 PM
To: [hidden email] <[hidden email]>
Subject: Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegenI have encountered too many joins problem before. Since the joined dataframe is small enough, I convert join to udf operation, which is much faster and didn’t generate out of memory problem.
2020年2月25日 10:15，Jianneng Li <[hidden email]> 写道：
WholeStageCodegen generates code that appends results into a BufferedRowIterator, which keeps the results in an in-memory linked list. Long story short, this is a problem when multiple joins (i.e. BroadcastHashJoin) that can blow up get planned into the same WholeStageCodegen - results keep on accumulating in the linked list, and do not get consumed fast enough, eventually causing the JVM to run out of memory.
Does anyone else have experience with this problem? Some obvious solutions include making BufferedRowIterator spill the linked list, or make it bounded, but I'd imagine that this would have been done a long time ago if it were necessary.