Job hangs on spark-0.8.1

Roshan Nair
Hi,

I have a 10-node Spark (0.8.1 with CDH) cluster. My job hangs at some point: the driver UI is still available on port 4040, but the master UI says the job failed.

The job always seems to hang in a reduceByKey (or reduceByKeyLocally).

Some of my configuration parameters are:
-Dspark.akka.frameSize=160 -Dspark.akka.timeout=100 -Dspark.akka.askTimeout=30
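
These are plain JVM system properties; on 0.8.x one common way to set them is via SPARK_JAVA_OPTS in conf/spark-env.sh. A minimal sketch, using the values above:

# conf/spark-env.sh (sketch): passes the -D system properties to the driver and executors
SPARK_JAVA_OPTS="-Dspark.akka.frameSize=160 -Dspark.akka.timeout=100 -Dspark.akka.askTimeout=30"
export SPARK_JAVA_OPTS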

The job hangs with the driver reporting that an executor was lost; within a few seconds, it loses all executors.
Some of the executors report that a SendingConnectionManagerId was not found. I've noticed that in all cases the executors log nothing for about 15 seconds, then log a few more lines and exit.

Driver log: 
14/02/04 14:12:46 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 3)
14/02/04 14:12:46 INFO storage.BlockManagerMasterActor: Trying to remove executor 7 from BlockManagerMaster.
14/02/04 14:12:46 INFO storage.BlockManagerMaster: Removed 7 successfully in removeExecutor
14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Serialized task 10.0:49 as 6065714 bytes in 9 ms
14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Starting task 10.0:54 as TID 881 on executor 9: node1.staging (PROCESS_LOCAL)
14/02/04 14:12:46 INFO scheduler.Stage: Stage 3 is now unavailable on executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 4
14/02/04 14:12:46 INFO scheduler.Stage: Stage 6 is now unavailable on executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 5
14/02/04 14:12:46 INFO client.Client$ClientActor: Executor updated: app-20140204140954-0076/7 is now FAILED (Command exited with code 137)
14/02/04 14:12:46 INFO scheduler.Stage: Stage 1 is now unavailable on executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 6

Last log lines from first executor that was lost:
14/02/04 14:12:28 DEBUG BlockManager: Getting local block rdd_4_22
14/02/04 14:12:28 DEBUG BlockManager: Level for block rdd_4_22 is StorageLevel(false, true, true, 1)
14/02/04 14:12:28 DEBUG BlockManager: Getting block rdd_4_22 from memory
14/02/04 14:12:28 INFO BlockManager: Found block rdd_4_22 locally
14/02/04 14:12:29 DEBUG Executor: Task 876's epoch is 3
14/02/04 14:12:29 DEBUG CacheManager: Looking for partition rdd_4_23
14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_23
14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_23 is StorageLevel(false, true, true, 1)
14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_23 from memory
14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_23 locally
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@469c17ba
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(59b987.staging,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@469c17ba
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@4ed29790
14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@4ed29790
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys

Final few logs from another executor:
14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_79
14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_79 is StorageLevel(false, true, true, 1)
14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_79 from memory
14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_79 locally
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(59b987.staging,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/02/04 14:12:46 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(10.0.15.122,58862)
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:236)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:279)
at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
14/02/04 14:12:46 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys

Thanks in advance!

Roshan

Re: Job hangs on spark-0.8.1

Roshan Nair
Has anyone had this happen before?

Roshan




Re: Job hangs on spark-0.8.1

Guillaume Pitel
Yes, it happens quite often. I would add this:

-Dspark.worker.timeout=200

It depends a lot on your network/disk performance.

When I did my first tries on EC2, all my workers were dropping like flies. Just raising the timeout should be enough.
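
Note that spark.worker.timeout is read by the standalone master (the number of seconds without a heartbeat before it marks a worker lost, default 60), so it has to be set on the master and worker daemons rather than in the application. A minimal sketch for conf/spark-env.sh, assuming the SPARK_DAEMON_JAVA_OPTS mechanism from the 0.8.x standalone docs:

# conf/spark-env.sh on the master and each worker (sketch)
# gives the master 200s of missed heartbeats before it drops a worker
SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=200"
export SPARK_DAEMON_JAVA_OPTS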

Guillaume


--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05

Re: Job hangs on spark-0.8.1

Roshan Nair

@Guillaume, thanks for the tip! It was useful, though it isn't what was happening in our case.

It turns out that other non-Spark tasks run intermittently on these nodes, which we had failed to notice. Because those tasks take a lot of memory, Spark executors failed to launch on some of the nodes. Master.scala has a hard, non-configurable limit on the number of executor failures allowed before it removes an application (in the 0.8/0.9 sources this appears to be ApplicationState.MAX_NUM_RETRY). This shows up in the master's logs; since it was the executors that failed and not the tasks, it did not appear in the driver logs.
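
The "Command exited with code 137" in the driver log fits this: 137 is 128 + 9, i.e. the executor JVM was killed with SIGKILL, which on Linux usually means the kernel OOM killer. One way to confirm on a worker node that lost an executor (a sketch; the exact kernel message wording varies by version):

# run on the worker node; the OOM killer logs to the kernel ring buffer
dmesg | grep -iE "out of memory|killed process"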

The thing is, even though the master has failed the app (it says so on the master UI as well), the driver continues to run. Its web UI is still available, even though it is no longer logging anything.

This seems to happen on Spark 0.9.0 as well: the master has removed the application, but the driver process is still running and its web UI is still available. On Spark 0.8.1 we lost slaves as well when this happened; that has not happened so far on 0.9.0.

Roshan


