[Structured Streaming][Kafka] For a Kafka topic with 3 partitions, how does the parallelism work ?


karthikjay
I have the following code to read data from a Kafka topic using Structured
Streaming. The topic has 3 partitions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("TestPartition")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Subscribe to the 3-partition topic and read each record's value as a string.
val dataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "1.2.3.184:9092,1.2.3.185:9092,1.2.3.186:9092")
  .option("subscribe", "partition_test")
  .option("failOnDataLoss", "false")
  .load()
  .selectExpr("CAST(value AS STRING)")

My understanding is that Spark will launch 3 Kafka consumers (one per
partition) and that these 3 consumers will run on the worker nodes. Is my
understanding right?
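
As a minimal sketch of how one could verify the per-batch parallelism
(assuming Spark 2.4+, where foreachBatch is available, and reusing the
dataFrame defined above):

import org.apache.spark.sql.DataFrame

// Print how many Spark partitions each micro-batch has. With the default
// settings this should be 3, one per Kafka topic partition.
val query = dataFrame.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    println(s"batch $batchId: ${batchDF.rdd.getNumPartitions} partitions")
  }
  .start()

query.awaitTermination()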





Re: [Structured Streaming][Kafka] For a Kafka topic with 3 partitions, how does the parallelism work ?

Raghavendra Pandey
Yes, as long as there are at least 3 cores available on your local machine. The Kafka source creates one Spark partition per Kafka topic partition, so the 3 consumers run as 3 parallel tasks. Note that with master("local[*]") those tasks run as threads inside the local JVM, not on separate worker nodes (local mode has none).
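
A short sketch of the two relevant knobs (local[3] is just an illustrative
choice, and the minPartitions option only exists in Spark 2.4+):

import org.apache.spark.sql.SparkSession

// Pin the number of local task slots explicitly instead of local[*],
// so the 3 Kafka partitions can always be consumed in parallel.
val spark = SparkSession.builder
  .appName("TestPartition")
  .master("local[3]") // 3 threads = 3 concurrent tasks, one per Kafka partition
  .getOrCreate()

// Spark 2.4+ only: minPartitions asks the Kafka source to split topic
// partitions into more Spark partitions than there are Kafka partitions.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "1.2.3.184:9092,1.2.3.185:9092,1.2.3.186:9092")
  .option("subscribe", "partition_test")
  .option("minPartitions", "6") // e.g. 2 Spark tasks per Kafka partition
  .load()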
