Spark Kafka API tries to connect to the dead node for every batch, which increases the processing time

supritht
Hi guys,

I have a 3-node cluster and I am running a Spark Streaming job. Consider the
example below:

spark-submit --master yarn-cluster --class
com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint --jars
/opt/client/Spark/spark/lib/streamingClient/kafka-clients-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/kafka_2.10-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar
/opt/SparkStreamingExample-1.0.jar /tmp/test 10 test
189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005

In this case, suppose node 10.1.1.1 is down. Then, for every window batch,
Spark tries to send a request to all the nodes.
This code is in the class org.apache.spark.streaming.kafka.KafkaCluster:

Function: getPartitionMetadata()
Line: val resp: TopicMetadataResponse = consumer.send(req)

getPartitionMetadata() is called from getPartitions() and findLeaders(),
both of which are called for every batch.

Hence, if the node is down, the connection fails and Spark waits for the
timeout to expire before continuing, which adds to the processing time.
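One mitigation while the dead broker remains in the list is to shorten the wait itself: KafkaCluster builds its SimpleConsumer from the kafkaParams map passed to KafkaUtils.createDirectStream, and it honours the standard Kafka 0.8 consumer setting `socket.timeout.ms` (default 30000 ms). A minimal sketch of such a map, using the broker addresses from the command above; the 3000 ms figure is an illustrative choice, not a recommendation:

```scala
// Sketch: kafkaParams as passed to KafkaUtils.createDirectStream.
// Lowering "socket.timeout.ms" from its 30 s default caps how long
// each failed connection to the dead broker can stall a batch.
val kafkaParams: Map[String, String] = Map(
  "metadata.broker.list" ->
    "189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005",
  "socket.timeout.ms" -> "3000" // illustrative value; default is 30000
)
```

This does not stop the connection attempts, it only bounds the per-batch penalty they add.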

Question:
Is there any way to avoid this?
In simple words, I do not want Spark to send a request to the node that is
down for every batch. How can I achieve this?
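As far as I know, the 0.8 direct API has no built-in blacklist: whatever is in metadata.broker.list is contacted as-is whenever metadata is fetched. One workaround is to probe each broker with a short TCP connect before (re)starting the stream and pass only the reachable ones into kafkaParams. A sketch with a hypothetical helper `reachableBrokers` (plain JVM sockets, no Spark dependency; the helper name and 1000 ms default are my own):

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Hypothetical helper: keep only brokers ("host:port") that accept a
// TCP connection within `timeoutMs`, so a dead node never makes it
// into metadata.broker.list in the first place.
def reachableBrokers(brokers: Seq[String], timeoutMs: Int = 1000): Seq[String] =
  brokers.filter { b =>
    val Array(host, port) = b.split(":")
    Try {
      val s = new Socket()
      try s.connect(new InetSocketAddress(host, port.toInt), timeoutMs)
      finally s.close()
      true
    }.getOrElse(false) // connect threw: broker treated as unreachable
  }
```

Note the limitation: the filtered list is fixed when the stream is created, so this only helps if you probe (and restart the job) after a broker dies; it does not react to a broker going down mid-run.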

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]