Re: [DISCUSS] Spark cannot identify the problem executor


Re: [DISCUSS] Spark cannot identify the problem executor

srowen
-dev, +user
Executors do not communicate directly, so I don't think that's quite
what you are seeing. You'd have to clarify.

On Fri, Sep 11, 2020 at 12:08 AM 陈晓宇 <[hidden email]> wrote:
>
> Hello all,
>
> We've been using Spark 2.3 with the blacklist enabled and often hit a problem where, when executor A has some issue (like a connection problem), tasks on executors B and C fail saying they cannot read from executor A. Eventually the job fails because a task on executor B has failed 4 times.
>
> I wonder whether there is any existing fix, or discussion of how to identify executor A as the problem node.
>
> Thanks

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: [DISCUSS] Spark cannot identify the problem executor

wuyi
What do you mean by "read from executor A"? I can think of several paths for an executor to read something from another remote executor: 

1. shuffle data
If the executor fails to fetch shuffle data, I think it will result in a FetchFailed for the task. For this case, the blacklist can identify the problematic executor A if spark.blacklist.application.fetchFailure.enabled=true.

2. RDD block
If the executor fails to fetch RDD blocks, I think the task would just do the computation by itself instead of failing.

3. Broadcast block
If the executor fails to fetch the broadcast block, the task seems to fail in this case and blacklist doesn't handle it well.
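As a concrete illustration of case 1, fetch-failure blacklisting can be switched on via spark-defaults.conf (or equivalent --conf flags to spark-submit). The property names below are the Spark 2.x blacklist settings; this is just a sketch of the relevant knobs, not a tuning recommendation:

```
# spark-defaults.conf -- illustrative Spark 2.3-era settings
# Enable the task/executor blacklist mechanism at all:
spark.blacklist.enabled                           true
# Blacklist an executor for the whole application as soon as
# a shuffle fetch from it fails:
spark.blacklist.application.fetchFailure.enabled  true
```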

Thanks,
Yi



Re: [DISCUSS] Spark cannot identify the problem executor

wuyi

The FetchFailed error from task B is forwarded to the DAGScheduler as well. A FetchFailed already means the stage's output is missing, so the DAGScheduler will resubmit the upstream stage, which in turn reschedules the upstream task of task B.
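As a toy sketch of the behavior described above (this is an illustration, not Spark's actual DAGScheduler code): on a FetchFailed pointing at the lost executor, the scheduler unregisters that executor's map outputs, and exactly those partitions are recomputed when the upstream stage is resubmitted. All names here are made up for the example:

```python
# Toy model of FetchFailed handling; not Spark source code.
def handle_fetch_failed(lost_executor, map_outputs):
    """map_outputs: dict mapping map-partition id -> executor holding its output.

    Returns the partitions whose output is now considered missing; the
    upstream stage is resubmitted to recompute exactly these partitions,
    so the driver does not need to wait for a heartbeat timeout.
    """
    missing = [p for p, ex in map_outputs.items() if ex == lost_executor]
    for p in missing:
        del map_outputs[p]  # unregister the lost shuffle outputs
    return sorted(missing)

# Example: executor A held map partitions 0 and 2; task B's FetchFailed
# marks those two partitions missing, triggering the upstream resubmit.
outputs = {0: "A", 1: "B", 2: "A", 3: "C"}
print(handle_fetch_failed("A", outputs))
```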

On Mon, Sep 14, 2020 at 10:39 AM 陈晓宇 <[hidden email]> wrote:
Thanks Yi Wu and Sean. Here I mean shuffle data, without the external shuffle service.

spark.blacklist.application.fetchFailure.enabled=true seems to be the answer; I was not aware of it, thanks for pointing it out. I will give it a try.

However, I wonder how it would work: when task B reports FetchFailed, this blacklist flag can be used to identify executor A, and no more tasks will be scheduled on executor A. But would the upstream task of task B (which previously ran on executor A) be rescheduled by the DAG scheduler? The DAG scheduler only reschedules a task when it thinks the task's output is missing (please correct me if I am wrong), and unless executor A fails to report heartbeats for the timeout period, the driver still believes the output is available on executor A.

Thanks again.



Re: [DISCUSS] Spark cannot identify the problem executor

roseyrathod456
In reply to this post by srowen
In Spark 2.3 with the blacklist enabled this is a common problem: when executor A has some issue, for instance a connection problem, tasks on executors B and C will fail saying they cannot read from executor A, and the job eventually fails because a task on executor B has failed 4 times.



