Structured Streaming on Kubernetes


Structured Streaming on Kubernetes

Krishna Kalyan
Hello All,
We were evaluating Spark Structured Streaming on Kubernetes (running on GCP). It would be awesome if the Spark community could share their experience with this. I would like to know more about your production experience and the monitoring tools you are using.

Since Spark on Kubernetes is a relatively new addition to Spark, I was wondering whether Structured Streaming is stable in production. We were also evaluating Apache Beam with Flink.

Regards,
Krishna


Re: Structured Streaming on Kubernetes

tdas@databricks.com
Structured streaming is stable in production! At Databricks, we and our customers collectively process almost 100s of billions of records per day using SS. However, we are not using kubernetes :)

Though I don't think it will matter too much as long as kubes are correctly provisioned+configured and you are checkpointing to HDFS (for fault-tolerance guarantees).

TD
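
As a minimal sketch of the setup described above, the snippet below assembles a `spark-submit` command line for Kubernetes that sets a default checkpoint location on HDFS through `spark.sql.streaming.checkpointLocation`. The helper name `build_submit_command`, the API server address, image name, namenode URI, class name, and jar path are all hypothetical placeholders, not values from this thread.

```python
from typing import List


def build_submit_command(
    api_server: str,
    image: str,
    checkpoint_dir: str,
    main_class: str,
    app_jar: str,
    executors: int = 2,
) -> List[str]:
    """Assemble spark-submit arguments for a streaming job on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # Default checkpoint root for streaming queries that do not set
        # their own checkpointLocation option.
        "--conf", f"spark.sql.streaming.checkpointLocation={checkpoint_dir}",
        app_jar,
    ]


# Every value below is a placeholder chosen for illustration only.
command = build_submit_command(
    api_server="https://k8s.example.com:6443",
    image="example/spark:2.3.0",
    checkpoint_dir="hdfs://namenode.example.com:8020/checkpoints/my-query",
    main_class="com.example.StreamingJob",
    app_jar="local:///opt/spark/jars/streaming-job.jar",
)
print(" ".join(command))
```

A single query can also set its own location with the `checkpointLocation` option on the stream writer; the cluster-wide setting here is just one way to keep checkpoints off the pods.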

On Fri, Apr 13, 2018, 12:28 AM Krishna Kalyan <[hidden email]> wrote:
Hello All,
We were evaluating Spark Structured Streaming on Kubernetes (running on GCP). It would be awesome if the Spark community could share their experience with this. I would like to know more about your production experience and the monitoring tools you are using.

Since Spark on Kubernetes is a relatively new addition to Spark, I was wondering whether Structured Streaming is stable in production. We were also evaluating Apache Beam with Flink.

Regards,
Krishna


Re: Structured Streaming on Kubernetes

Matt Cheah

We don’t provide any Kubernetes-specific mechanisms for streaming, such as checkpointing to persistent volumes. But as long as streaming doesn’t require persisting to the executor’s local disk, streaming ought to work out of the box. E.g. you can checkpoint to HDFS, but not to the pod’s local directories.

However, I’m unaware of any specific use of streaming with the Spark on Kubernetes integration right now. I would be curious to get feedback on the failover behavior.

-Matt Cheah
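
One way to act on the point about pod-local directories is a small guard that rejects checkpoint locations which look local before a job is submitted. This is only a sketch: the function names are hypothetical, and treating a scheme-less path as local is an assumption of this example (Hadoop would actually resolve such a path against the configured default filesystem).

```python
from urllib.parse import urlparse

# Schemes treated as pod-local in this sketch (an assumption, not a Spark rule).
LOCAL_SCHEMES = {"", "file"}


def is_pod_local(checkpoint_location: str) -> bool:
    """Return True if the location has no scheme or uses file://."""
    return urlparse(checkpoint_location).scheme in LOCAL_SCHEMES


def require_durable_checkpoint(checkpoint_location: str) -> str:
    """Return the location unchanged, or raise if it looks pod-local."""
    if is_pod_local(checkpoint_location):
        raise ValueError(
            f"checkpoint location {checkpoint_location!r} looks pod-local; "
            "use a shared store such as HDFS instead"
        )
    return checkpoint_location


# Placeholder URIs for illustration only.
print(is_pod_local("/tmp/checkpoints"))
print(is_pod_local("hdfs://namenode.example.com:8020/checkpoints"))
```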

 

From: Tathagata Das <[hidden email]>
Date: Friday, April 13, 2018 at 1:27 AM
To: Krishna Kalyan <[hidden email]>
Cc: user <[hidden email]>
Subject: Re: Structured Streaming on Kubernetes

Structured streaming is stable in production! At Databricks, we and our customers collectively process almost 100s of billions of records per day using SS. However, we are not using kubernetes :)

Though I don't think it will matter too much as long as kubes are correctly provisioned+configured and you are checkpointing to HDFS (for fault-tolerance guarantees).

TD

 


Re: Structured Streaming on Kubernetes

Anirudh Ramanathan
+ozzieba, who was experimenting with streaming workloads recently. +1 to what Matt said. Checkpointing and driver recovery are future work.
Structured Streaming is important, and it would be good to get some production experience here and target improving the feature's support on K8s for 2.4/3.0.


On Fri, Apr 13, 2018 at 11:55 AM Matt Cheah <[hidden email]> wrote:

We don’t provide any Kubernetes-specific mechanisms for streaming, such as checkpointing to persistent volumes. But as long as streaming doesn’t require persisting to the executor’s local disk, streaming ought to work out of the box. E.g. you can checkpoint to HDFS, but not to the pod’s local directories.

However, I’m unaware of any specific use of streaming with the Spark on Kubernetes integration right now. I would be curious to get feedback on the failover behavior.

-Matt Cheah

 

--
Anirudh Ramanathan

Re: Structured Streaming on Kubernetes

Krishna Kalyan
Thank you so much TD, Matt, Anirudh and Oz,
Really appreciate this.

On Fri, Apr 13, 2018 at 9:54 PM, Oz Ben-Ami <[hidden email]> wrote:
I can confirm that Structured Streaming works on Kubernetes, though we're not quite in production with it yet. Issues we're looking at are:
- Submission: spark-submit works, but is a bit clunky in a Kubernetes-centered workflow. Spark Operator is promising, but still in alpha (e.g., we ran into this). Even better would be something that runs the driver as a Deployment / StatefulSet, so that long-running streaming jobs can be restarted automatically.
- Dynamic allocation: works with the spark-on-k8s fork, but not with plain Spark 2.3, because it relies on the shuffle service, which hasn't been merged yet. An ideal implementation would be able to connect to a PersistentVolume independently of a node, but that's a bit more complicated.
- Checkpointing: we checkpoint to a separate HDFS (Dataproc) cluster, which works well for us with both the old Spark Streaming and Structured Streaming. We've successfully experimented with HDFS on Kubernetes, but again not in production.
- UI: unfortunately, Structured Streaming does not yet have a comprehensive UI like the old Spark Streaming, but it does show the basic information (jobs, stages, queries, executors), and other information is generally available in the logs and metrics.
- Monitoring / Logging: this is a strength of Kubernetes, in that it's all centralized by the cluster. We use Splunk, but it would also be possible to hook up Spark's Dropwizard Metrics library to Prometheus, and read logs with fluentd or Stackdriver.
- Side note: Kafka support in Spark and Structured Streaming is very good, but as of Spark 2.3 there are still a couple of missing features, notably transparent Avro support (UDFs are needed) and taking advantage of transactional processing (introduced to Kafka last year) for better exactly-once guarantees.
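
On the monitoring point, a rough sketch of what could be pulled from the metrics: in PySpark, `query.lastProgress` returns a dict with fields such as `numInputRows`, `inputRowsPerSecond`, and `processedRowsPerSecond`. The helper below (a hypothetical name, with made-up sample values) reduces such a dict to a few numbers that could be logged or forwarded to whatever monitoring system is in use.

```python
from typing import Any, Dict


def summarize_progress(progress: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a streaming progress dict to a few monitoring-friendly fields."""
    input_rate = float(progress.get("inputRowsPerSecond", 0.0))
    processed_rate = float(progress.get("processedRowsPerSecond", 0.0))
    return {
        "query": progress.get("name"),
        "rows": progress.get("numInputRows", 0),
        "input_rate": input_rate,
        "processed_rate": processed_rate,
        # A sustained processing rate below the input rate suggests the
        # query is not keeping up with its source.
        "falling_behind": processed_rate < input_rate,
    }


# Sample values are invented for illustration; a real dict would come
# from query.lastProgress on a running StreamingQuery.
sample = {
    "name": "events",
    "numInputRows": 1200,
    "inputRowsPerSecond": 400.0,
    "processedRowsPerSecond": 300.0,
}
summary = summarize_progress(sample)
print(summary)
```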

On Fri, Apr 13, 2018 at 3:08 PM, Anirudh Ramanathan <[hidden email]> wrote:
+ozzieba who was experimenting with streaming workloads recently. +1 to what Matt said. Checkpointing and driver recovery is future work.
Structured streaming is important, and it would be good to get some production experiences here and try and target improving the feature's support on K8s for 2.4/3.0.



Re: Structured Streaming on Kubernetes

puneetloya
Thanks for putting together such comprehensive observations about Spark on
Kubernetes. Spark's Mesos deployment has a property called
spark.mesos.extra.cores. Its description reads:

*Set the extra number of cores for an executor to advertise. This does not
result in more cores allocated. It instead means that an executor will
"pretend" it has more cores, so that the driver will send it more tasks. Use
this to increase parallelism. This setting is only used for Mesos
coarse-grained mode.*

Can this be used to increase parallelism? Are there other, better ways to
increase parallelism on Kubernetes?





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
