Error During ReceivingConnection

Error During ReceivingConnection

Surendranauth Hiraman
I have a somewhat large job (10 GB of input data, but it generates about 500 GB of data after many stages).

Most tasks have completed, but a few stragglers on the same node/executor are still active (but doing nothing) after about 16 hours.

At about 3 to 4 hours in, the tasks that are hanging have the following in the work logs.

Any idea what config to tweak for this?


14/06/10 18:51:10 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.108,37693)
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.spark.network.ReceivingConnection.read(Connection.scala:534)
    at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
14/06/10 18:51:10 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
14/06/10 18:51:10 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:14 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.97,54918)
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.spark.network.ReceivingConnection.read(Connection.scala:534)
    at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
14/06/10 18:51:14 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
14/06/10 18:51:14 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
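
To see whether the resets cluster on particular peers, the WARN lines above can be tallied per remote ConnectionManagerId. This is only a rough diagnostic sketch: it assumes the log format matches the excerpt exactly, and the name count_resets and the sample list are illustrative, not part of Spark.

```python
import re
from collections import Counter

# Matches the WARN line that precedes each "Connection reset by peer" trace.
# The pattern is an assumption based on the log excerpt above.
RESET_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"WARN network\.ReceivingConnection: Error reading from connection to "
    r"ConnectionManagerId\((?P<host>[^,]+),(?P<port>\d+)\)"
)

def count_resets(lines):
    """Return a Counter mapping (host, port) to the number of read errors logged."""
    counts = Counter()
    for line in lines:
        m = RESET_RE.match(line.strip())
        if m:
            counts[(m.group("host"), int(m.group("port")))] += 1
    return counts

# Sample input copied from the excerpt; in practice, pass an open log file.
sample = [
    "14/06/10 18:51:10 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.108,37693)",
    "java.io.IOException: Connection reset by peer",
    "14/06/10 18:51:14 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.97,54918)",
]

for peer, n in count_resets(sample).most_common():
    print(peer, n)
```

If one peer dominates the counts, that node's executor logs are the next place to look.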

--
                                                            
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: [hidden email]
W: www.velos.io

Re: Error During ReceivingConnection

Surendranauth Hiraman
It looks like this was due to another executor on a different node closing the connection on its side. I found the entries below in the remote side's logs.

Can anyone comment on why one ConnectionManager would close its connection to another node and what could be tuned to avoid this? It did not have any errors on its side.


This is from the ConnectionManager on the side that shut down the connection, not the ConnectionManager that logged "Connection reset by peer".

14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.125,45610)

14/06/10 18:51:14 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.125,45610)
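
One way to check this pairing across more of the logs is to look for seconds at which one node logged a reset and the other logged a connection removal. The sketch below assumes both nodes' clocks are roughly in sync and that the log lines follow the format shown in this thread; shared_seconds and the sample lists are hypothetical names for illustration.

```python
import re

TS = r"(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})"
# Patterns are assumptions based on the log lines quoted in this thread.
RESET_RE = re.compile(
    TS + r" WARN network\.ReceivingConnection: Error reading from connection"
)
REMOVE_RE = re.compile(
    TS + r" INFO network\.ConnectionManager: Removing (?:Receiving|Sending)Connection"
)

def timestamps(lines, pattern):
    """Collect the timestamps of all lines that match the given pattern."""
    found = set()
    for line in lines:
        m = pattern.match(line.strip())
        if m:
            found.add(m.group("ts"))
    return found

def shared_seconds(reset_side_lines, closing_side_lines):
    """Timestamps where one side saw a reset and the other removed a connection."""
    resets = timestamps(reset_side_lines, RESET_RE)
    removals = timestamps(closing_side_lines, REMOVE_RE)
    return sorted(resets & removals)

reset_side = [
    "14/06/10 18:51:10 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.108,37693)",
    "14/06/10 18:51:14 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.97,54918)",
]
closing_side = [
    "14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.125,45610)",
    "14/06/10 18:51:14 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.125,45610)",
]

print(shared_seconds(reset_side, closing_side))
```

A matching second is only circumstantial evidence that the two events belong to the same connection; the host and port in each line still need to be checked by hand.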




On Wed, Jun 11, 2014 at 8:38 AM, Surendranauth Hiraman <[hidden email]> wrote:
I have a somewhat large job (10 GB input data but generates about 500 GB of data after many stages).

Most tasks completed but a few stragglers on the same node/executor are still active (but doing nothing) after about 16 hours.

