I have a problem with task skew and data skew when processing real-time data (via Kafka) in Spark Streaming.
When one of the executors crashes, task skew and data skew appear in my project, as shown in the figures.
For example, on ubuntu8, because 3 executors have crashed (I am not sure about this), about 50,000 records are placed on the remaining executor, ubuntu8:34168.
Figure 1: executors crash
Most streaming windows are normal (no crashed executors):
Figure 2: normal performance
Figure 3: poor performance
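If I read the numbers correctly, the 50,000 figure is consistent with the crashed executors' partitions being reassigned to the surviving executor on the same node. A minimal back-of-envelope sketch, assuming one Kafka partition per executor and an even spread of the input rate (both assumptions, not confirmed by the UI):

```python
# Back-of-envelope check of the observed skew.
RECORDS_PER_SEC = 100_000   # input rate via Kafka
NUM_PARTITIONS = 8          # partitions of topic kafkasink2

# Assumption: each of the 8 executors normally reads one partition.
per_partition = RECORDS_PER_SEC // NUM_PARTITIONS  # 12,500 records/sec

# ubuntu8 hosts 4 executors; suppose 3 of them crash and their partitions
# are reassigned to the surviving ubuntu8 executor (ubuntu8:34168).
crashed_on_ubuntu8 = 3
partitions_on_survivor = 1 + crashed_on_ubuntu8
records_on_survivor = partitions_on_survivor * per_partition

print(per_partition)        # 12500
print(records_on_survivor)  # 50000 -- matches the ~50,000 seen in Figure 1
```

Under these assumptions, the surviving executor ends up with 4x the per-executor load, which would explain both the data skew and the resulting task skew in the batch.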
The experimental setup of my project is described below.
Real-time data rate (via Kafka): 100,000 records/sec
Topic read: kafkasink2
Kafka broker version: 2.10-0.10.1.1
Broker node: ubuntu7
One topic: kafkasink2 (number of partitions: 8)
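For reference, an 8-partition topic like kafkasink2 on the ubuntu7 broker would be created with something like the following (Kafka 0.10.x uses the `--zookeeper` flag; the ZooKeeper port 2181 and replication factor 1 are assumptions, adjust to your setup):

```shell
# Create the 8-partition topic on the ubuntu7 broker.
# Assumptions: ZooKeeper runs on ubuntu7:2181, replication factor 1.
bin/kafka-topics.sh --create \
  --zookeeper ubuntu7:2181 \
  --replication-factor 1 \
  --partitions 8 \
  --topic kafkasink2
```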
The running environment on my PC:
OS: Ubuntu 14.04.4 LTS
Versions of related tools:
Java version: "1.8.0_151"
Spark version: 2.3.1, Standalone mode
Master/Driver node: ubuntu7
Worker nodes: ubuntu8 (4 Executors); ubuntu9 (4 Executors)
Number of executors: 8
Driver setting (spark-defaults.conf):
We would appreciate any direction that helps us overcome this problem.