[Spark2.X] SparkStreaming to Cassandra performance problem
I am implementing a use case where I read sensor data from Kafka through the Spark Streaming interface (KafkaUtils.createDirectStream) and, after some
transformations, write the output (RDD) to Cassandra.
Everything works correctly, but I am having trouble with performance. My Kafka topic receives around 2000 messages per second. For a 4 min. test, the Spark Streaming app takes 6~7 min. to process
the data and write it to Cassandra, which is not acceptable for longer runs.
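To put rough numbers on the gap, here is a back-of-the-envelope sketch in plain Python. It assumes the input rate is steady at 2000 messages per second and that the 6~7 min. covers exactly the messages produced during the 4 min. test; the function name and its parameters are only for illustration and are not part of my application.

```python
def throughput_summary(rate_in, test_seconds, processing_seconds):
    """Estimate effective throughput from the figures quoted above.

    rate_in: incoming messages per second (assumed constant)
    test_seconds: how long messages were produced
    processing_seconds: how long the app needed to process and write them
    """
    total_messages = rate_in * test_seconds
    effective_rate = total_messages / processing_seconds
    # If the app kept running at this rate, the unprocessed backlog
    # would grow by roughly this many messages every second.
    backlog_growth = rate_in - effective_rate
    return total_messages, effective_rate, backlog_growth


if __name__ == "__main__":
    for minutes in (6, 7):
        total, rate, lag = throughput_summary(2000, 4 * 60, minutes * 60)
        print(f"{minutes} min: {total} messages, "
              f"{rate:.0f} msg/s processed, backlog grows by {lag:.0f} msg/s")
```

Under these assumptions the app sustains only about 1100 to 1300 messages per second against 2000 incoming, so it falls further behind the longer it runs.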
I am running this application in a "sandbox" with 12GB of RAM, 2 cores and 30GB SSD space. HDP:
I would like to know if you have any suggestions to improve performance (other than getting more resources :) ).
My code (pyspark) is posted at the end of this email so you can take a look.