Driver aborts on Mesos when unable to connect to one of external shuffle services


igor.berman
Hi,
Any input on whether the following is expected?
The driver starts and is unable to connect to the external shuffle service on one of
the nodes (whatever the reason).
This makes the framework go into Inactive mode in the Mesos UI.
However, it seems that the driver doesn't exit and continues to execute tasks (or
tries to). The stack trace below shows a few lines around the
connection error and the abort message.

The question is: is this expected behaviour?

Here is the stack trace:

I0412 07:31:25.827283   274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/x.x.x.x:7337
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
        at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused:my-company.com/x.x.x.x:7337
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
I0412 07:35:12.032925   277 sched.cpp:2055] Asked to abort the driver
I0412 07:35:12.033035   277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Driver aborts on Mesos when unable to connect to one of external shuffle services

Szuromi Tamás
Hi Igor,

Have you started the external shuffle service manually?

Cheers

2018-04-12 10:48 GMT+02:00 igor.berman <[hidden email]>:


Re: Driver aborts on Mesos when unable to connect to one of external shuffle services

igor.berman
Hi Szuromi,
We manage the external shuffle service with Marathon, not manually.
Sometimes, though, e.g. when adding a new node to the cluster, there is some delay
between Mesos scheduling tasks on a slave and Marathon scheduling the external
shuffle service task on that node.
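
To illustrate the timing gap described above, here is a minimal sketch of a readiness probe that waits until a TCP port on a node accepts connections. It assumes the shuffle service listens on port 7337, as in the stack trace. The function name wait_for_port and its parameters are hypothetical and not part of Spark, Mesos, or Marathon.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before timeout_s elapsed.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A refused or timed-out connection raises OSError.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, wait_for_port("my-company.com", 7337) could be run against a newly added node before it is opened up to Spark tasks. This only detects whether the port is reachable; it does not change how the driver reacts to a failed registration.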


