ForEachBatch collecting batch to driver

Ruijing Li
Hi all,

I'm curious about how foreachBatch works in Spark Structured Streaming. Since it takes in a micro-batch DataFrame, does that mean the code in foreachBatch executes on the Spark driver? And for large batches, could you potentially hit OOM issues from collecting each partition onto the driver?
--
Cheers,
Ruijing Li

Re: ForEachBatch collecting batch to driver

Burak Yavuz-2
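foreachBatch gives you the micro-batch as a DataFrame, which is distributed. The function you pass in is invoked on the driver, but the DataFrame operations inside it still run on the executors. As long as you don't call collect (or another action that pulls rows back to the driver) on that DataFrame, it shouldn't have any memory implications on the driver.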
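
To make that concrete, here is a minimal PySpark sketch of a foreachBatch handler that keeps the batch distributed. The names process_batch and start_stream, both paths, and the choice of the built-in rate source are assumptions made for illustration; none of them come from this thread.

```python
# Hypothetical locations, chosen only for this sketch.
OUTPUT_PATH = "/tmp/foreach_batch_sketch/output"
CHECKPOINT_PATH = "/tmp/foreach_batch_sketch/checkpoint"


def process_batch(batch_df, batch_id):
    """Called once per micro-batch; batch_df is a distributed DataFrame."""
    # The write is planned here, but each executor writes its own partitions,
    # so no rows are shipped to the driver.
    batch_df.write.mode("append").parquet(OUTPUT_PATH)
    # Calling batch_df.collect() or batch_df.toPandas() at this point is what
    # would pull the whole micro-batch into driver memory.


def start_stream():
    """Wire the handler into a streaming query (requires pyspark)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("foreach-batch-sketch").getOrCreate()
    # The built-in "rate" source generates test rows; a real job would read
    # from Kafka, files, and so on.
    stream_df = (
        spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    )
    query = (
        stream_df.writeStream.foreachBatch(process_batch)
        .option("checkpointLocation", CHECKPOINT_PATH)
        .start()
    )
    query.awaitTermination()
```

Calling start_stream() on a machine with Spark available would run the query until it is stopped. If the handler needs a small summary on the driver, an aggregate such as batch_df.count() returns a single value rather than the rows themselves.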

On Tue, Mar 10, 2020 at 3:46 PM Ruijing Li <[hidden email]> wrote:
Hi all,

I'm curious about how foreachBatch works in Spark Structured Streaming. Since it takes in a micro-batch DataFrame, does that mean the code in foreachBatch executes on the Spark driver? And for large batches, could you potentially hit OOM issues from collecting each partition onto the driver?
--
Cheers,
Ruijing Li