Measuring cluster utilization of a streaming job


Measuring cluster utilization of a streaming job

Nadeem Lalani
Hi,

I was wondering if anyone has done any work on measuring the cluster resource utilization of a "typical" Spark Streaming job.

We are trying to build a message ingestion system that will read from Kafka and do some processing. Concerns have been raised in the team that a 24/7 streaming job might not be the best use of cluster resources, especially since our use cases process data in a micro-batch fashion and are not truly streaming.

We want to measure how much resource a Spark Streaming process takes. Any pointers on where to start?

We are on YARN and plan to use Spark 2.1.

Thanks in advance,
Nadeem 

Re: Measuring cluster utilization of a streaming job

theikkila
Without knowing anything about your pipeline, the best way to estimate the resources needed is to run the job with the same ingestion rate as the normal production load.
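
For a dry run like that, a minimal Spark 2.1 skeleton using the spark-streaming-kafka-0-10 integration could look like the sketch below. The broker address, group id, topic name, and batch interval are placeholders you would replace with your own values:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object IngestLoadTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ingest-load-test")
    // 10-second micro-batches; tune to your expected batch interval.
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",          // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "load-test",                      // placeholder
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Array("my-topic"), kafkaParams) // placeholder topic
    )

    // Stand-in for your real processing: count per batch so the
    // Streaming tab of the Spark UI shows processing time per batch.
    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

With the topic replayed at production rate, the Streaming tab of the Spark UI shows scheduling delay and processing time per batch, which tells you whether the allocated resources keep up.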

With Kafka you can enable backpressure: under high load your latency will simply increase, but you don't have to provision capacity to handle the spikes. If you want, you can then, e.g., autoscale the cluster to respond to the load.
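
For reference, backpressure is controlled through standard Spark Streaming properties; a minimal sketch (the rate cap is an example value, not a recommendation):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Adapt the ingestion rate to how fast recent batches actually completed.
  .set("spark.streaming.backpressure.enabled", "true")
  // Upper bound per Kafka partition so a large backlog cannot flood
  // the first batches after a restart.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // example value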

If you are using YARN you can isolate and limit resources, so you can also run other workloads on the same cluster if you need a lot of elasticity.
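
The Spark-side settings for that setup are plain configuration; a sketch (the queue name "streaming" is hypothetical, and the sizes are examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Submit into a dedicated YARN queue so the scheduler can cap the job.
  .set("spark.yarn.queue", "streaming") // hypothetical queue name
  // Fixed, modest footprint; the rest of the cluster stays free
  // for other workloads.
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")

The queue capacities themselves are defined on the YARN side, e.g. in capacity-scheduler.xml.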

Usually with streaming jobs the concerns are not about compute capacity but about network bandwidth and memory consumption.
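
On the original measurement question: one place to start is Spark's monitoring REST API, which reports per-executor memory and I/O counters you can poll while the test runs. A rough sketch (host, port, and application id are placeholders; on YARN the driver UI is proxied through the ResourceManager):

import scala.io.Source

// Placeholders: take the real values from the Spark UI or the YARN RM.
val appId = "application_1510000000000_0001"
val url = s"http://driver-host:4040/api/v1/applications/$appId/executors"

// One JSON object per executor, including memoryUsed, maxMemory,
// totalInputBytes, totalShuffleRead and totalShuffleWrite.
println(Source.fromURL(url).mkString)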



