Delay starting jobs

Chris Thomas
Hi, 

The attached event timeline shows a large gap between two groups of jobs from a single application running on a cluster. Any tips for finding the root cause of this delay, or suggestions of likely reasons for it, would be greatly appreciated. I have trawled the logs from the History Server, but nothing jumps out.

The behaviour seems to be consistent. 

Kind regards,

Chris
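
One way to put numbers on the gap in the timeline is to pull the job list from the History Server's REST API (`/api/v1/applications/<app-id>/jobs`) and look for idle periods between one job finishing and the next being submitted. The following is only a rough sketch: the host, port, and application ID in the usage comment are placeholders, the 60-second threshold is arbitrary, and on YARN in cluster mode the URL may also need an attempt ID segment.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_jobs(history_server, app_id):
    """Fetch the job list for one application from the Spark REST API."""
    url = f"{history_server}/api/v1/applications/{app_id}/jobs"
    with urlopen(url) as resp:
        return json.load(resp)


def parse_spark_time(ts):
    """Parse timestamps of the form '2020-08-21T16:32:45.123GMT'."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fGMT")
    return parsed.replace(tzinfo=timezone.utc)


def job_gaps(jobs, min_gap_seconds=60.0):
    """Return (previous_job_id, next_job_id, gap_seconds) for each idle period.

    A gap is measured from the latest completion time seen so far to the
    submission time of the next job, so overlapping jobs do not count as idle.
    Jobs without a completion time (still running) are skipped.
    """
    finished = [j for j in jobs if "submissionTime" in j and "completionTime" in j]
    finished.sort(key=lambda j: parse_spark_time(j["submissionTime"]))

    gaps = []
    latest_end = None
    latest_id = None
    for job in finished:
        start = parse_spark_time(job["submissionTime"])
        end = parse_spark_time(job["completionTime"])
        if latest_end is not None:
            gap = (start - latest_end).total_seconds()
            if gap >= min_gap_seconds:
                gaps.append((latest_id, job["jobId"], gap))
        if latest_end is None or end > latest_end:
            latest_end, latest_id = end, job["jobId"]
    return gaps


# Hypothetical usage (placeholder host and application ID):
# jobs = fetch_jobs("http://history-server:18080", "application_0000000000000_0001")
# for prev_id, next_id, seconds in job_gaps(jobs):
#     print(f"{seconds:.0f}s idle between job {prev_id} and job {next_id}")
```

Knowing which job IDs sit either side of the gap makes it easier to line the delay up with driver logs or with whatever the application does between those two actions.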
Re: Delay starting jobs

Ido Friedman
Check the availability of resources on the cluster.


Re: Delay starting jobs

Chris Thomas
Apart from a spike, memory and disk usage seem OK:



On 24 Aug 2020, at 14:17, Chris Thomas <[hidden email]> wrote:

<Screenshot 2020-08-21 at 17.32.45.png>

Re: Delay starting jobs

Ido Friedman
Look at YARN rather than the physical resources.

The YARN ResourceManager UI should be on port 8088 on EMR.
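
If the UI is awkward to reach, the same port also serves the ResourceManager REST API, and `/ws/v1/cluster/metrics` gives a snapshot of what YARN has allocated and what is still waiting. Here is a minimal sketch of reading it; the host name in the usage comment is a placeholder, and the "waiting" rule is just an illustrative guess at what contention looks like, not an official threshold.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """Fetch the clusterMetrics object from the YARN ResourceManager REST API."""
    with urlopen(f"{rm_url}/ws/v1/cluster/metrics") as resp:
        return json.load(resp)["clusterMetrics"]


def summarise_pressure(metrics):
    """Reduce clusterMetrics to a few numbers that hint at resource contention."""
    total_mb = metrics.get("totalMB", 0)
    allocated_mb = metrics.get("allocatedMB", 0)
    apps_pending = metrics.get("appsPending", 0)
    containers_pending = metrics.get("containersPending", 0)
    return {
        # Fraction of YARN-managed memory currently handed out to containers.
        "memory_used_fraction": allocated_mb / total_mb if total_mb else None,
        "available_mb": metrics.get("availableMB", 0),
        "available_vcores": metrics.get("availableVirtualCores", 0),
        "apps_pending": apps_pending,
        "containers_pending": containers_pending,
        # Illustrative rule only: something is queued waiting for resources.
        "waiting": apps_pending > 0 or containers_pending > 0,
    }


# Hypothetical usage (placeholder host name for the EMR master node):
# metrics = fetch_cluster_metrics("http://emr-master:8088")
# print(summarise_pressure(metrics))
```

This is only a point-in-time view, so it would need to be polled while the gap is actually happening to show whether containers are pending during that window.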


