Spark Structured Streaming resource contention / memory issue

Spark Structured Streaming resource contention / memory issue

patrickmcgloin
Hi all,

I sent this earlier, but the screenshots were not attached. Hopefully they come through this time.

We have a Spark Structured Streaming query that uses mapGroupsWithState. After processing stably for some time, each mini-batch suddenly starts taking 40 seconds. Suspiciously, it looks like exactly 40 seconds each time. Before this, the batches were taking less than a second.


Looking at the details for a particular task, most partitions are processed really quickly, but a few take exactly 40 seconds:




The GC looked OK while the data was being processed quickly, but the full GCs etc. suddenly stop at the same time as the 40-second issue starts:



I took a thread dump from one of the executors while the issue was happening, but I cannot see any resource that the threads are blocked on:




Are we hitting a GC problem, and if so, why is it manifesting in this way? Or is another resource blocking, and if so, what is it?



Thanks,
Patrick



This message has been sent by ABN AMRO Bank N.V., which has its seat at Gustav Mahlerlaan 10 (1082 PP) Amsterdam, the Netherlands, and is registered in the Commercial Register of Amsterdam under number 34334259.

Re: Spark Structured Streaming resource contention / memory issue

Jungtaek Lim
Hi Patrick,

It looks like you might be struggling with state memory. Several related issues are going to be resolved in Spark 2.4:

1. SPARK-24441 [1]: Expose total estimated size of states in HDFSBackedStateStoreProvider
2. SPARK-24637 [2]: Add metrics regarding state and watermark to dropwizard metrics
3. SPARK-24717 [3]: Split out min retain version of state for memory in HDFSBackedStateStoreProvider

There are other patches relevant to the state store as well, but the issues above apply to map/flatMapGroupsWithState.
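As a rough way to check whether state is growing, the per-operator state metrics in each streaming progress report can be summarized. The sketch below is in Python and assumes a progress dictionary shaped like PySpark's `query.lastProgress`, with `stateOperators` entries carrying `numRowsTotal`, `numRowsUpdated` and `memoryUsedBytes` (which fields are present depends on the Spark version). The helper name `summarize_state` and the sample values are hypothetical and only for illustration; they are not measurements from this thread.

```python
from typing import Any, Dict, List


def summarize_state(progress: Dict[str, Any]) -> List[Dict[str, int]]:
    """Extract state row counts and memory usage for each stateful operator.

    `progress` is assumed to look like one streaming progress report
    (for example the dict returned by `query.lastProgress` in PySpark).
    Missing fields default to 0 so older reports do not raise errors.
    """
    summary = []
    for op in progress.get("stateOperators", []):
        summary.append({
            "rows_total": int(op.get("numRowsTotal", 0)),
            "rows_updated": int(op.get("numRowsUpdated", 0)),
            "memory_bytes": int(op.get("memoryUsedBytes", 0)),
        })
    return summary


# Hypothetical progress report; the numbers are made up for illustration.
sample_progress = {
    "batchId": 1234,
    "stateOperators": [
        {"numRowsTotal": 500000, "numRowsUpdated": 1200, "memoryUsedBytes": 734003200},
    ],
}

for entry in summarize_state(sample_progress):
    print(entry)
```

Logging this once per batch and watching whether `rows_total` or `memory_bytes` keeps climbing around the time the 40-second batches start would help confirm or rule out state growth as the cause.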

Since the Spark community is in the process of releasing Spark 2.4.0, could you try experimenting with a Spark 2.4.0 RC, if you don't mind? You could also try applying the individual patches to see whether they help.

Thanks,
Jungtaek Lim (HeartSaVioR)



On Fri, Oct 12, 2018 at 9:31 PM, Patrick McGloin <[hidden email]> wrote:

Re: Spark Structured Streaming resource contention / memory issue

patrickmcgloin
Hi Jungtaek,

Thanks. We thought that might be the issue, but we haven't tested it yet because building against an unreleased version of Spark is tough for us due to network restrictions. We will try, though, and I will report back if we find anything.

Best regards,
Patrick

On Fri, Oct 12, 2018, 2:57 PM Jungtaek Lim <[hidden email]> wrote: