[Spark Streaming] Is spark.streaming.concurrentJobs a per-node or a cluster-global value?


thomas lavocat
Hi everyone,

I'm wondering whether the property spark.streaming.concurrentJobs should
reflect the total number of possible concurrent tasks on the cluster, or
the local number of concurrent tasks on one compute node.

Thanks for your help.

Thomas



Re: [Spark Streaming] Is spark.streaming.concurrentJobs a per-node or a cluster-global value?

Saisai Shao
spark.streaming.concurrentJobs is a driver-side internal configuration; it controls how many streaming jobs can be submitted concurrently in one batch. Usually it should not be set by the user, unless you're familiar with Spark Streaming internals and know the implications of this configuration.
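
For reference, setting it looks roughly like this, a minimal sketch where the app name and batch interval are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative sketch: allow up to 2 streaming jobs to run concurrently.
    // The default is 1, which preserves strict batch ordering.
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-demo")  // placeholder name
      .set("spark.streaming.concurrentJobs", "2")
    val ssc = new StreamingContext(conf, Seconds(1))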

 


Re: [Spark Streaming] Is spark.streaming.concurrentJobs a per-node or a cluster-global value?

thomas lavocat

Hello,

Thanks for your answer.


On 05/06/2018 11:24, Saisai Shao wrote:
> spark.streaming.concurrentJobs is a driver-side internal configuration; it controls how many streaming jobs can be submitted concurrently in one batch. Usually it should not be set by the user, unless you're familiar with Spark Streaming internals and know the implications of this configuration.

How can I find documentation about those implications?

I've experimented with this parameter and found that my overall throughput increases in correlation with its value.
But I'm experiencing scalability issues: with more than 16 receivers spread over 8 executors, my executors no longer receive work from the driver and fall idle.
Is there an explanation?
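
A rough sketch of the kind of multi-receiver topology described above (the host, ports, and sources are illustrative, not the actual setup):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative multi-receiver topology: each socketTextStream call creates
    // one receiver, and each running receiver permanently occupies one executor
    // core, leaving fewer cores for the batch processing itself.
    val conf = new SparkConf().setAppName("multi-receiver-demo")  // placeholder
    val ssc = new StreamingContext(conf, Seconds(1))

    val streams = (1 to 16).map(i => ssc.socketTextStream("source-host", 9000 + i))
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()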

Thanks,
Thomas


Re: [Spark Streaming] Is spark.streaming.concurrentJobs a per-node or a cluster-global value?

Saisai Shao
You need to read the code; this is an undocumented configuration.

Basically this will break the ordering of streaming jobs; AFAIK you may get unexpected results if your streaming jobs are not independent.
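
To my knowledge the property simply sizes the driver's job-executor thread pool (see org.apache.spark.streaming.scheduler.JobScheduler), so the ordering break is easy to reproduce outside Spark. A self-contained sketch of the same idea, not Spark's actual code:

    import java.util.concurrent.Executors

    // A pool of size 1 runs jobs strictly in submission order; a larger pool
    // lets a later "batch" finish before an earlier one, which is the
    // ordering break described above.
    object ConcurrentJobsSketch extends App {
      val concurrentJobs = 2  // analogous to spark.streaming.concurrentJobs
      val jobExecutor = Executors.newFixedThreadPool(concurrentJobs)
      for (batch <- 1 to 4) {
        jobExecutor.submit(new Runnable {
          def run(): Unit = {
            Thread.sleep(scala.util.Random.nextInt(100))  // uneven batch durations
            println(s"finished batch $batch")             // may print out of order
          }
        })
      }
      jobExecutor.shutdown()
    }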


Re: [Spark Streaming] Is spark.streaming.concurrentJobs a per-node or a cluster-global value?

thomas lavocat

On 05/06/2018 13:44, Saisai Shao wrote:
> You need to read the code; this is an undocumented configuration.
I'm on it right now, but Spark is a big piece of software.
> Basically this will break the ordering of streaming jobs; AFAIK you may get unexpected results if your streaming jobs are not independent.
What do you mean exactly by "not independent"?
Are several sources joined together dependent?

Thanks,
Thomas


Re: [Spark Streaming] Is spark.streaming.concurrentJobs a per-node or a cluster-global value?

Saisai Shao
"dependent" I mean this batch's job relies on the previous batch's result. So this batch should wait for the finish of previous batch, if you set "spark.streaming.concurrentJobs" larger than 1, then the current batch could start without waiting for the previous batch (if it is delayed), which will lead to unexpected results. 



Re: [Spark Streaming] Is spark.streaming.concurrentJobs a per-node or a cluster-global value?

thomas lavocat

Thank you very much for your answer.

Since I don't have dependent jobs, I will continue to use this functionality.

