Iterator of KeyValueGroupedDataset.flatMapGroupsWithState function

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Iterator of KeyValueGroupedDataset.flatMapGroupsWithState function

Antonio Murgia - antonio.murgia2@studio.unibo.it

Hi all,

I'm currently developing a Spark Structured Streaming job and I'm performing flatMapGroupsWithState.

I'm concerned about the laziness of the Iterator[V] that is passed to my custom function (func: (K, Iterator[V], GroupState[S]) => Iterator[U]).

Is it ok to collect that iterator (with a toList, for example)? I have a logic that is practically impossible to perform on a Iterator, but I do not want to break Spark lazy chain, obviously.


Thank you in advance.


#A.M.

Reply | Threaded
Open this post in threaded view
|

Re: Iterator of KeyValueGroupedDataset.flatMapGroupsWithState function

Tathagata Das
It is okay to collect the iterator. That will not break Spark. However, collecting it requires memory in the executor, so you may cause OOMs if a group has a LOT of new data.

On Wed, Oct 31, 2018 at 3:44 AM Antonio Murgia - [hidden email] <[hidden email]> wrote:

Hi all,

I'm currently developing a Spark Structured Streaming job and I'm performing flatMapGroupsWithState.

I'm concerned about the laziness of the Iterator[V] that is passed to my custom function (func: (K, Iterator[V], GroupState[S]) => Iterator[U]).

Is it ok to collect that iterator (with a toList, for example)? I have a logic that is practically impossible to perform on a Iterator, but I do not want to break Spark lazy chain, obviously.


Thank you in advance.


#A.M.