Spark on EMR suddenly stalling

Spark on EMR suddenly stalling

Jeroen Miller
Dear Sparkers,

Once again in times of desperation, I leave what remains of my mental sanity to this wise and knowledgeable community.

I have a Spark job (on EMR 5.8.0) which had been running daily for months, if not the whole year, with absolutely no supervision. This changed all of a sudden, for reasons I do not understand.

The volume of data processed daily has been slowly increasing over the past year but has been stable over the last couple of months. Since I'm only processing the past 8 days' worth of data, I do not think that increased data volume is to blame here. Yes, I did check the volume of data for the past few days.

Here is a short description of the issue.

- The Spark job starts normally and proceeds successfully with the first few stages.
- Once we reach the dreaded stage, all tasks are performed successfully (they typically take no more than 1 minute each), except for the /very/ first one (task 0.0), which never finishes.

Here is what the log looks like (simplified for readability):

----------------------------------------
INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412 ms on ... (executor 12) (254/256)
INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394 ms on ... (executor 7) (255/256)
INFO ExecutorAllocationManager: Request to remove executorIds: 14
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 14
INFO YarnAllocator: Driver requested a total number of 0 executor(s).
----------------------------------------

Why is that? There is still a task waiting to be completed, right? Isn't an executor needed for that?

Afterwards, all executors are getting killed (dynamic allocation is turned on):

----------------------------------------
INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
INFO ExecutorAllocationManager: Removing executor 14 because it has been idle for 60 seconds (new desired total will be 5)
    .
    .
    .
INFO ExecutorAllocationManager: Request to remove executorIds: 7
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 7
INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 1)
INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
INFO DAGScheduler: Executor lost: 7 (epoch 4)
INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster.
INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, ..., 44289, None)
INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
INFO ExecutorAllocationManager: Existing executor 7 has been removed (new total is 1)
----------------------------------------

Then, there's nothing more in the driver's log. Nothing. The cluster then runs for hours, with no progress being made and no executors allocated.

Here is what I tried:

    - More memory per executor: from 13 GB to 24 GB by increments.
    - Explicit repartition() on the RDD: from 128 to 256 partitions.

The offending stage used to be a rather innocent-looking keyBy(). After adding some repartition() calls, the offending stage became a mapToPair(). In my latest experiments, it turned out that the repartition(256) itself is now the culprit.
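
To make the sequence of operations concrete, here is a toy pipeline with the same shape; the types and functions are illustrative only, not the actual job:

----------------------------------------
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PipelineShape {
    public static void main(String[] args) {
        // Configuration comes from spark-submit; toy data stands in for the real input.
        try (JavaSparkContext sc = new JavaSparkContext()) {
            JavaPairRDD<String, String> result = sc
                .parallelize(Arrays.asList("u1:a", "u2:b", "u1:c"))
                .keyBy(r -> r.split(":")[0])   // originally the offending stage
                .repartition(256)              // now the culprit itself
                .mapToPair(t -> new Tuple2<>(t._1(), t._2().toUpperCase()));
            result.count();
        }
    }
}
----------------------------------------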

I like Spark, but its mysteries will send me to a mental hospital one of these days.

Can anyone shed light on what is going on here, or maybe offer some suggestions or pointers to relevant sources of information?

I am completely clueless.

Season's greetings,

Jeroen



Re: Spark on EMR suddenly stalling

Jeroen Miller
On 28 Dec 2017, at 17:41, Richard Qiao <[hidden email]> wrote:
> Are you able to specify which path of data filled up?

I can narrow it down to a bunch of files but it's not so straightforward.

> Any logs not rolled over?

I have to terminate the cluster manually, but there is nothing more in the driver's log when I check it from the AWS console while the cluster is still running.

JM



Re: Spark on EMR suddenly stalling

Patrick Alwell
Jeroen,

Anytime there is a shuffle over the network, Spark moves to a new stage. It seems like you are having issues either pre- or post-shuffle. Have you looked at a resource monitoring tool like Ganglia to determine whether this is a memory- or thread-related issue? The Spark UI?

You are using groupByKey(); have you thought of an alternative like aggregateByKey() or combineByKey() to reduce shuffling?
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoid_groupbykey_when_performing_an_associative_re/avoid-groupbykey-when-performing-a-group-of-multiple-items-by-key.html
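
For instance, a minimal sketch (toy data, not your job) of computing per-key sums with aggregateByKey() instead of groupByKey():

----------------------------------------
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AggregateByKeySketch {
    public static void main(String[] args) {
        try (JavaSparkContext sc = new JavaSparkContext()) {
            JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

            // groupByKey() ships every value across the network before combining;
            // aggregateByKey() combines map-side first, so far less data is shuffled.
            JavaPairRDD<String, Integer> sums = pairs.aggregateByKey(
                0,                    // zero value for each key
                (acc, v) -> acc + v,  // fold a value into the accumulator within a partition
                (a, b) -> a + b);     // merge accumulators across partitions

            sums.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}
----------------------------------------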

Dynamic allocation is great, but sometimes I've found explicitly setting the number of executors, cores per executor, and memory per executor to be a better alternative.

Take a look at the YARN logs as well for the particular executor in question. Executors can run multiple tasks, and will often fail if they have more tasks than available threads.

As for partitioning the data: you could also look into your level of parallelism, which is correlated with the splittability (blocks) of your data. This will be based on your largest RDD.
https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
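
Concretely, a sketch of pinning those settings yourself; the numbers are placeholders to size against your own nodes, not recommendations:

----------------------------------------
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExplicitResourcesSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("explicit-resources")
            .set("spark.dynamicAllocation.enabled", "false")
            .set("spark.executor.instances", "10")    // number of executors
            .set("spark.executor.cores", "4")         // cores per executor
            .set("spark.executor.memory", "13g")      // memory per executor
            .set("spark.default.parallelism", "256"); // often 2-3 tasks per CPU core
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // job body goes here
        }
    }
}
----------------------------------------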

Spark is like C/C++: you need to manage the memory buffer or the compiler will throw you out ;)
https://spark.apache.org/docs/latest/hardware-provisioning.html

Hang in there; this is the more complicated stage of putting a Spark application into production. The YARN logs should point you in the right direction.

It’s tough to debug over email, so hopefully this information is helpful.

-Pat




Re: Spark on EMR suddenly stalling

Maximiliano Felice
In reply to this post by Jeroen Miller
Hi Jeroen,

I experienced a similar issue a few weeks ago. The situation was a result of a mix of speculative execution and OOM issues in the container.

First of all, when a task takes too much time in Spark, speculative execution kicks in and launches a duplicate attempt in a new container. In our case, some tasks were throwing OOM exceptions while executing, not in the executor itself but in the YARN container.

It turns out that YARN will try several times to run an application when something fails in it. Specifically, it will try yarn.resourcemanager.am.max-attempts times to run the application before failing; this has a default value of 2 and is not modified in EMR's default configuration.
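
If you want to experiment with that limit on EMR, you should be able to override it at cluster creation with a configuration along these lines (the value 4 is just an example):

----------------------------------------
    "Configurations": [
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.resourcemanager.am.max-attempts": "4"
            }
        }
    ]
----------------------------------------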

We verified that when speculative execution is enabled and YARN containers running speculative tasks die, each death consumes one of the max-attempts. This wouldn't be an issue in normal operation, but it seems that if all the retries are consumed by a task that has started speculative execution, the application itself doesn't fail; instead it hangs the task, expecting to reschedule it at some point. As no attempts remain, the task is never rescheduled and the application never finishes.

I tested this theory repeatedly, always getting the expected results: each time I changed the aforementioned YARN setting, the job started speculative retries on this task and hung once the max-attempts number of YARN containers had broken.

I personally think it should be possible to reproduce this issue even without speculative execution configured.

So, what would I do if I were you?

1. Check the number of tasks scheduled. If you see one (or more) tasks missing when you do the final sum, then you might be encountering this issue.
2. Check the container logs to see if anything broke. OOM is what failed for me.
3. Contact AWS EMR support, although in my experience they were of no help at all.
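
Also, it is cheap to rule speculation in or out explicitly; a sketch (upstream Spark defaults spark.speculation to false, so check what your job actually sets):

----------------------------------------
import org.apache.spark.SparkConf;

// Set to "true" to reproduce our scenario, "false" to take speculation
// out of the picture entirely.
SparkConf conf = new SparkConf().set("spark.speculation", "false");
----------------------------------------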


Hope this helps you a bit!






Re: Spark on EMR suddenly stalling

Gourav Sengupta
In reply to this post by Jeroen Miller
Hi Jeroen,

Can I get a few pieces of additional information please?

In the EMR cluster, what other applications have you enabled (like HIVE, FLUME, Livy, etc.)?
Are you using SPARK Session? If yes, is your application using cluster mode or client mode?
Have you read the EC2 service level agreement?
Is your cluster on auto scaling group?
Are you scheduling your job by adding a new step to the EMR cluster each time? Or is it the same job, always triggered by some background process?
Since EMR clusters are supposed to be ephemeral, have you tried creating a new cluster and trying your job on that?


Regards,
Gourav Sengupta


Re: Spark on EMR suddenly stalling

Jeroen Miller
In reply to this post by Maximiliano Felice
On 28 Dec 2017, at 19:40, Maximiliano Felice <[hidden email]> wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of a mix of speculative execution and OOM issues in the container.

Interesting! However, I don't have any OOM exceptions in the logs. Does that rule out your hypothesis?

> We verified that when speculative execution is enabled and YARN containers running speculative tasks die, each death consumes one of the max-attempts. This wouldn't be an issue in normal operation, but it seems that if all the retries are consumed by a task that has started speculative execution, the application itself doesn't fail; instead it hangs the task, expecting to reschedule it at some point. As no attempts remain, the task is never rescheduled and the application never finishes.

Hmm, this sounds like a huge design fail to me, but I'm sure there are very complicated issues that go way over my head.

> 1. Check the number of tasks scheduled. If you see one (or more) tasks missing when you do the final sum, then you might be encountering this issue.
> 2. Check the container logs to see if anything broke. OOM is what failed to me.

I can't find anything in the logs from EMR. Should I expect to find explicit OOM exception messages?

JM



Re: Spark on EMR suddenly stalling

Jeroen Miller
In reply to this post by Gourav Sengupta
On 28 Dec 2017, at 19:42, Gourav Sengupta <[hidden email]> wrote:
> In the EMR cluster, what other applications have you enabled (like HIVE, FLUME, Livy, etc.)?

Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff behind my back).

> Are you using SPARK Session?

Yes.

> If yes, is your application using cluster mode or client mode?

Cluster mode.

> Have you read the EC2 service level agreement?

I did not -- I doubt it has the answer to my problem though! :-)

> Is your cluster on auto scaling group?

Nope.

> Are you scheduling your job by adding a new step to the EMR cluster each time? Or is it the same job, always triggered by some background process?
> Since EMR clusters are supposed to be ephemeral, have you tried creating a new cluster and trying your job on that?

I'm creating a new cluster on demand, specifically for that job. No other application runs on it.

JM



Re: Spark on EMR suddenly stalling

Gourav Sengupta
Hi Jeroen,

Can you then try EMR version 5.10 or 5.11 instead?
Can you please try selecting a subnet in a different availability zone?
If possible, try increasing the number of task instances and see if it makes a difference.
Also, in case you are using caching, check the total amount of space being used. In the worst case, you may want to persist intermediate data to S3 in the default Parquet format and then work through the steps you think are failing using a Jupyter or Spark notebook.
Also, can you please report the number of containers that your job is creating by looking at the metrics in the EMR console?

Also, in the Spark UI you can easily see which particular step is taking the longest; you just have to drill in a bit. Generally, if shuffling is an issue, it definitely shows up in the Spark UI as I drill into the steps and see which particular one is taking the longest.


Since you do not have a long-running cluster (I mistakenly inferred one from your statement about a long-running job), things should be fine.


Regards,
Gourav Sengupta



Re: Spark on EMR suddenly stalling

Jeroen Miller
In reply to this post by Patrick Alwell
On 28 Dec 2017, at 19:25, Patrick Alwell <[hidden email]> wrote:
> Dynamic allocation is great, but sometimes I've found explicitly setting the number of executors, cores per executor, and memory per executor to be a better alternative.

No difference with spark.dynamicAllocation.enabled set to false.

JM



Re: Spark on EMR suddenly stalling

Shushant Arora
You may have to recreate your cluster with the configuration below at EMR creation:
    "Configurations": [
            {
                "Properties": {
                    "maximizeResourceAllocation": "false"
                },
                "Classification": "spark"
            }
        ]
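
(For context: on EMR, maximizeResourceAllocation computes executor memory and cores from the cluster's instance types, so disabling it rules out an interaction between those computed values and any settings you pass yourself. Treat this as a hypothesis to test, not a guaranteed fix.)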


Re: Spark on EMR suddenly stalling

Gourav Sengupta
In reply to this post by Gourav Sengupta
Hi,

Please try to use the Spark UI the way AWS EMR recommends; it should be available from the resource manager. I have never had any problem working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF DEBUGGING.

Sadly, I cannot be of much help unless we do a screen-share session over Google Chat or Skype.

Also, I ALWAYS prefer the maximizeResourceAllocation setting in EMR to be set to true.

Besides that, there is a metric in the EMR console whose graphs show the number of containers generated by your job.



Regards,
Gourav Sengupta

On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller <[hidden email]> wrote:
Hello,

Just a quick update, as I have not made much progress yet.

On 28 Dec 2017, at 21:09, Gourav Sengupta <[hidden email]> wrote:
> Can you then try EMR version 5.10 or 5.11 instead?

Same issue with EMR 5.11.0. Task 0 in one stage never finishes.

> Can you please try selecting a subnet in a different availability zone?

I did not try this yet. But why should that make a difference?

> If possible, try increasing the number of task instances and see if it makes a difference.

I tried with 512 partitions -- no difference.

> Also, in case you are using caching,

No caching used.

> Also, can you please report the number of containers that your job is creating by looking at the metrics in the EMR console?

8 containers if I trust the directories in j-xxx/containers/application_xxx/.

> Also, in the Spark UI you can easily see which particular step is taking the longest; you just have to drill in a bit. Generally, if shuffling is an issue, it definitely shows up in the Spark UI as I drill into the steps and see which particular one is taking the longest.

I always have issues with the Spark UI on EC2 -- it never seems to be up to date.

JM



Re: Spark on EMR suddenly stalling

Rohit Karlupia
Here is the list that I would probably work through:
  1. Check GC on the offending executor while the task is running; maybe you need even more memory (see the sketch after this list for enabling GC logging).
  2. Go back to a previous successful run of the job and, in the Spark UI for the offending stage, check the max task time / max input / max shuffle in/out of the largest task. This will help you understand the degree of skew in this stage.
  3. Take a thread dump of the executor from the Spark UI and verify whether the task is really doing any work or is stuck in some deadlock. Some of the Hive SerDes are not really usable from multi-threaded/multi-use Spark executors.
  4. Take a thread dump of the executor from the Spark UI and verify whether the task is spilling to disk. Playing with the storage and memory fractions, or generally increasing the memory, will help.
  5. Check the disk utilisation on the machine running the executor.
  6. Look for event-loss messages in the logs caused by a full event queue. Losing events can send some Spark components into really bad states.
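
For point 1, the standard JVM flags from the Spark tuning guide are an easy way to get GC visibility; the output lands in each executor's stdout, which you can read from the Executors tab of the Spark UI. A minimal sketch:

----------------------------------------
import org.apache.spark.SparkConf;

// GC logging flags recommended in the Spark tuning guide.
SparkConf conf = new SparkConf().set(
    "spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
----------------------------------------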

thanks,
rohitk




Re: Spark on EMR suddenly stalling

M Singh
Hi Jeroen:

I am not sure if I missed it, but can you let us know what your input source and output sink are?

In some cases, I found that saving to S3 was a problem. In our case I started saving the output to the EMR HDFS and later copied it to S3 using s3-dist-cp, which solved our issue.
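
For example, something along these lines as a final step after the Spark job (the paths are illustrative):

----------------------------------------
s3-dist-cp --src hdfs:///job/output --dest s3://my-bucket/job/output
----------------------------------------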


Mans



Re: Spark on EMR suddenly stalling

Jeroen Miller
In reply to this post by Gourav Sengupta
Hello Gourav,

On 30 Dec 2017, at 20:20, Gourav Sengupta <[hidden email]> wrote:
> Please try to use the Spark UI the way AWS EMR recommends; it should be available from the resource manager. I have never had any problem working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF DEBUGGING.

For some reason, sometimes there is absolutely nothing showing up in the Spark UI, or the UI is not refreshed, e.g. the UI shows the current stage as #x while the logs show that stage #y (with y > x) is under way.

It may very well be that the source of this problem lies between the keyboard and the chair, but if this is the case, I do not know how to solve this.

> Also, I ALWAYS prefer the maximizeResourceAllocation setting in EMR to be set to true.

Thanks for the tip -- will try this setting in my next batch of experiments!

JM



Re: Spark on EMR suddenly stalling

Jeroen Miller
In reply to this post by M Singh
Hello Mans,

On 1 Jan 2018, at 17:12, M Singh <[hidden email]> wrote:
> I am not sure if I missed it, but can you let us know what your input source and output sink are?

Reading from S3 and writing to S3.

However, the never-ending task 0.0 happens in a stage well before anything is written to S3.

Regards,

Jeroen



Re: Spark on EMR suddenly stalling

Gourav Sengupta
In reply to this post by Jeroen Miller
Hi Jeroen,

In case you are using HIVE partitions, how many partitions do you have?

Also, is there any chance that you might post the code?

Regards,
Gourav Sengupta
