I am running the Kafka Direct Stream wordcount example code (pasted below my signature). My topic has 400 partitions, and the Spark job tracker page shows 26 executors processing the corresponding 400 tasks.
When I check the execution timeline for each job (one 2-second micro-batch worth of records), it shows the tasks being executed serially by each executor. I have attached a screenshot for reference (it shows only two of the 26 executors).
I increased total-executor-cores to 200, hoping to see 4 tasks processed in parallel by each executor, but the behavior persists.
I also ran the Scala wordcount example that uses the Direct Kafka Stream (supposedly using kafka010), reading from the same topic. Once again, the tasks are executed serially rather than in parallel within an executor.
Can someone please explain why the executor processes the tasks serially? Is it expected? Does it have something to do with YARN?
from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
# This keeps a running count of the total service.instances