GroupState limits

tleilaxu
Hi,
I am tracking state in my Spark streaming application with MapGroupsWithStateFunction, as described here: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/streaming/GroupState.html
What are the limiting factors on the number of states a job can track at the same time? Is it memory? Could it be a bounded data structure in the internal implementation? Anything else?
Any input you have would be valuable while I set this up and test it.

Thanks,
Arnold

Re: GroupState limits

Srinivas V
If you are asking about the total number of objects the state can hold, that depends on the executor memory available on your cluster, after subtracting the memory needed for the processing itself. With the default state store, the state is held in executor memory and persisted to the HDFS checkpoint location, from which it is reloaded when needed to process later events.
If you maintain a million objects of 20 bytes each, that is about 20 MB of raw data, which is quite reasonable for an executor allocated a few GB of memory. If you need to store heavy objects, you have to do the math. There is also a cost in transferring this data back and forth to the HDFS checkpoint location.
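To make "do the math" concrete, here is a rough back-of-the-envelope sketch in plain Java. It is not Spark code and does not reflect how Spark accounts for memory internally. The class name, the method names, the overhead factor of 4 and the 25% memory share are all hypothetical values chosen for illustration; the real per-object overhead depends on your state class, its encoder and the state store in use, so measure it on your own job.

```java
public class StateSizeEstimate {

    // Raw payload: number of tracked states times the size of one state object.
    static long rawStateBytes(long stateCount, long bytesPerState) {
        return Math.multiplyExact(stateCount, bytesPerState);
    }

    // Scale the raw size by an assumed overhead factor (object headers,
    // map entries, serialized keys). The factor is a guess, not a Spark constant.
    static long estimatedHeapBytes(long rawBytes, double overheadFactor) {
        return (long) Math.ceil(rawBytes * overheadFactor);
    }

    // True if the estimate fits in the share of executor memory
    // you are willing to give to state (stateFraction is an assumption).
    static boolean fits(long estimatedBytes, long executorMemoryBytes, double stateFraction) {
        return estimatedBytes <= (long) (executorMemoryBytes * stateFraction);
    }

    public static void main(String[] args) {
        long raw = rawStateBytes(1_000_000L, 20L);          // the example from this thread
        long estimated = estimatedHeapBytes(raw, 4.0);      // hypothetical 4x overhead
        long executorMemory = 4L * 1024 * 1024 * 1024;      // hypothetical 4 GiB executor

        System.out.println("Raw state bytes:       " + raw);
        System.out.println("Estimated heap bytes:  " + estimated);
        System.out.println("Fits in 25% of memory: " + fits(estimated, executorMemory, 0.25));
    }
}
```

The same helpers can be rerun with heavier objects, for example 10 million states of 2 KB each, to see when the estimate stops fitting. Note that the state is partitioned across executors, so a per-executor figure would divide the total by the number of executors holding state partitions.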

Regards
Srini

In reply to tleilaxu's message of Tue, May 12, 2020 at 2:48 AM.