Problem with running Spark over Mesos in fine-grained mode

4 messages
Problem with running Spark over Mesos in fine-grained mode

Marek Wiewiorka
Hi All,
I'm trying to run my app over Mesos in fine-grained mode and I'm getting the errors below (from both the Spark driver and the Mesos slave).
The same app runs fine in coarse-grained mode. The only thing I'm changing is the spark.mesos.coarse property:
System.setProperty("spark.mesos.coarse", "false")
With it set to true everything works; with it set to false (the default), the errors below occur.
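For reference, the toggle amounts to the sketch below. The default shown reflects Spark 0.9, where fine-grained mode is the default; the property has to be set before the SparkContext is constructed, since the scheduler backend is chosen at context creation time:

```scala
// Toggle between coarse- and fine-grained scheduling on Mesos.
// Must happen before the SparkContext is created.
System.setProperty("spark.mesos.coarse", "false")

// Reading it back the way the scheduler would (fine-grained, i.e.
// "false", is the default in Spark 0.9):
val coarse = System.getProperty("spark.mesos.coarse", "false").toBoolean
println(coarse) // false -> fine-grained MesosSchedulerBackend
```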

I'm using:
Mesos 0.16-rc3
Spark 0.9.0-rc5

I would appreciate any hints.

Regards,
Marek



Spark driver
=============
14/01/28 14:51:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/01/28 14:51:08 INFO Remoting: Starting remoting
14/01/28 14:51:08 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO spark.SparkEnv: Registering BlockManagerMaster
14/01/28 14:51:08 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140128145108-e1d9
14/01/28 14:51:08 INFO storage.MemoryStore: MemoryStore started with capacity 589.2 MB.
14/01/28 14:51:09 INFO network.ConnectionManager: Bound socket to port 34442 with id = ConnectionManagerId(XXXXXX,34442)
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/01/28 14:51:09 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager XXXXXX:34442 with 589.2 MB RAM
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Registered BlockManager
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:41485
14/01/28 14:51:09 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.0.0.8:41485
14/01/28 14:51:09 INFO spark.SparkEnv: Registering MapOutputTracker
14/01/28 14:51:09 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fcd06bbb-225d-4de4-b5c7-75b054c6a67d
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:47676
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
14/01/28 14:51:09 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/01/28 14:51:09 INFO ui.SparkUI: Started Spark Web UI at http://XXXXXX:4040
14/01/28 14:51:11 INFO storage.MemoryStore: ensureFreeSpace(92955) called with curMem=0, maxMem=617820979
14/01/28 14:51:11 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 90.8 KB, free 589.1 MB)
14/01/28 14:51:11 INFO input.FileInputFormat: Total input paths to process : 1
14/01/28 14:51:12 INFO spark.SparkContext: Starting job: count at SparkSeqSimpleJob.scala:48
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Registering RDD 4 (reduceByKey at SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Got job 0 (count at SparkSeqSimpleJob.scala:48) with 25 output partitions (allowLocal=false)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at SparkSeqSimpleJob.scala:48)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355), which has no missing parents
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting 24 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 24 tasks
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 0 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 2500 bytes in 13 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID 1 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as 2500 bytes in 0 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:2 as TID 2 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:2 as 2500 bytes in 1 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:3 as TID 3 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:3 as 2500 bytes in 1 ms
14/01/28 14:51:20 INFO scheduler.TaskSetManager: Re-queueing tasks for 201401281448-134217738-5050-17953-1 from TaskSet 1.0
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 2 (task 1.0:2)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 1 (task 1.0:1)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 3 (task 1.0:3)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 0 (task 1.0:0)
14/01/28 14:51:20 INFO scheduler.DAGScheduler: Executor lost: 201401281448-134217738-5050-17953-1 (epoch 0)
14/01/28 14:51:20 INFO storage.BlockManagerMasterActor: Trying to remove executor 201401281448-134217738-5050-17953-1 from BlockManagerMaster.
14/01/28 14:51:20 INFO storage.BlockManagerMaster: Removed 201401281448-134217738-5050-17953-1 successfully in removeExecutor

Mesos SLAVE LOG
============
I0128 14:51:12.996467 17996 slave.cpp:727] Got assigned task 0 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.996961 17996 slave.cpp:727] Got assigned task 1 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997133 17996 slave.cpp:727] Got assigned task 2 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997287 17996 slave.cpp:727] Got assigned task 3 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997429 17996 slave.cpp:836] Launching task 0 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998803 17996 slave.cpp:946] Queuing task '0' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.998819 17993 process_isolator.cpp:100] Launching 201401281448-134217738-5050-17953-1 (cd spark-0*; ./sbin/spark-executor) in /tmp/mesos/slaves/201401281448-134217738-5050-17953-1/frameworks/201401281448-134217738-5050-17953-0000/executors/201401281448-134217738-5050-17953-1/runs/a4c7bbb6-2ba2-4b03-9cd7-5bc22e12c1c9 with resources mem(*):10240' for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998883 17996 slave.cpp:836] Launching task 1 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998914 17996 slave.cpp:946] Queuing task '1' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.998947 17996 slave.cpp:836] Launching task 2 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998971 17996 slave.cpp:946] Queuing task '2' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.999003 17996 slave.cpp:836] Launching task 3 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.999943 17996 slave.cpp:946] Queuing task '3' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:13.000185 17993 process_isolator.cpp:163] Forked executor at 18386
I0128 14:51:13.001010 17993 slave.cpp:2089] Monitoring executor 201401281448-134217738-5050-17953-1 of framework 201401281448-134217738-5050-17953-0000 forked at pid 18386
I0128 14:51:17.422936 17991 slave.cpp:1422] Got registration for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.423828 17991 slave.cpp:1543] Flushing queued task 0 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.424185 17991 slave.cpp:1543] Flushing queued task 1 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.424612 17991 slave.cpp:1543] Flushing queued task 2 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.425472 17991 slave.cpp:1543] Flushing queued task 3 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.427978 17991 process_isolator.cpp:482] Telling slave of terminated executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.428513 17992 slave.cpp:2145] Executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000 has exited with status 0
I0128 14:51:20.433177 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.433763 17993 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.434661 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.435967 17993 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.436715 17993 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.439827 17993 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
W0128 14:51:20.440202 17991 process_isolator.cpp:268] Failed to kill the process tree rooted at pid 18386: Failed to find process 18386
I0128 14:51:20.440649 17991 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.451300 17991 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.436820 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.452231 17995 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.452239 17994 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.452981 17994 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.453176 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.453801 17998 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.453943 17992 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.455003 17992 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.489073 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.489478 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490196 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490628 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490890 17997 slave.cpp:2280] Cleaning up executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.491407 17997 slave.cpp:2351] Cleaning up framework 201401281448-134217738-5050-17953-0000

Re: Problem with running Spark over Mesos in fine-grained mode

Benjamin Hindman
Hey Marek,

Can you include the stdout and stderr of the executors that are running? You should be able to get them from the Mesos master web UI, or on the slave directly. If they are empty, you'll need to set up the log4j properties.
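A minimal log4j.properties along the lines Ben suggests (a sketch; levels, target, and pattern are illustrative) would route executor logging to the console so it shows up in the sandbox's stdout/stderr files:

```
# Route all logging to the console, which Mesos captures in the
# executor sandbox's stderr file.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

This file needs to be on the executor's classpath (e.g. in the Spark conf/ directory inside the distribution the slaves fetch) for the executor JVM to pick it up.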

Ben.


On Tue, Jan 28, 2014 at 7:07 AM, Marek Wiewiorka <[hidden email]> wrote:


Re: Problem with running Spark over Mesos in fine-grained mode

Marek Wiewiorka
Hi Benjamin,
thanks for the response!
These are the logs you requested:
stdout:
========
Fetching resources into '/tmp/mesos/slaves/201401291036-134217738-5050-27878-1/frameworks/201401291036-134217738-5050-27878-0003/executors/201401291036-134217738-5050-27878-1/runs/9dc28fa1-78ac-44de-b73d-48bf624aadc2'
Fetching resource 'hdfs://XXXXX:9000/frameworks/spark/0.9.0/spark-0.9.0-incubating-hadoop_1.2.1-bin.tar.gz'
Extracting resource: tar xzf './spark-0.9.0-incubating-hadoop_1.2.1-bin.tar.gz'
Running spark-executor with framework dir = /tmp/mesos/slaves/201401291036-134217738-5050-27878-1/frameworks/201401291036-134217738-5050-27878-0003/executors/201401291036-134217738-5050-27878-1/runs/9dc28fa1-78ac-44de-b73d-48bf624aadc2/spark-0.9.0-incubating
2014-01-29 11:41:46,626 [Thread-0] INFO  org.apache.spark.executor.MesosExecutorBackend - Registered with Mesos as executor ID 201401291036-134217738-5050-27878-1
2014-01-29 11:41:47,898 [spark-akka.actor.default-dispatcher-4] INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2014-01-29 11:41:47,980 [spark-akka.actor.default-dispatcher-4] INFO  Remoting - Starting remoting
2014-01-29 11:41:48,303 [spark-akka.actor.default-dispatcher-4] INFO  Remoting - Remoting started; listening on addresses :[akka.tcp://spark@XXXXX:51350]
2014-01-29 11:41:48,306 [spark-akka.actor.default-dispatcher-2] INFO  Remoting - Remoting now listens on addresses: [akka.tcp://spark@XXXXX:51350]
2014-01-29 11:41:48,335 [Thread-0] INFO  org.apache.spark.SparkEnv - Connecting to BlockManagerMaster: akka.tcp://spark@XXXXX:7077/user/BlockManagerMaster

stderr:
======
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://spark@XXXXX:7077/]/user/BlockManagerMaster]
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
        at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
        at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
        at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
        at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
        at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
        at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
        at akka.actor.ActorRef.tell(ActorRef.scala:125)
        at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)
        at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
        at akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
        at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
        at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
        at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
        at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:203)
        at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
        at akka.actor.ActorCell.terminate(ActorCell.scala:338)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
        at akka.dispatch.Mailbox.run(Mailbox.scala:218)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Exception in thread "Thread-0"

Marek



2014-01-29 Benjamin Hindman <[hidden email]>
Hey Marek,

Can you include the stdout and stderr of the executors that are running? You should be able to get them from the Mesos master web UI or on the slave directly. If they are empty then you'll need to set up the log4j properties.

Ben.
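As context for Ben's suggestion: on a slave, each executor's stdout and stderr sit in its sandbox run directory under the slave's work directory, following the same path layout that appears in the slave log above. A minimal sketch for enumerating them (the work-directory path is an assumption — `/tmp/mesos` matches the paths in this thread, but it depends on how the slave was started):

```scala
import java.io.File

// Sketch only: walk a Mesos slave's work directory and collect every
// executor stdout/stderr file. Layout (as seen in the slave log above):
//   <work_dir>/slaves/<slave-id>/frameworks/<framework-id>/
//     executors/<executor-id>/runs/<run-uuid>/{stdout,stderr}
def findExecutorLogs(dir: File): Seq[File] =
  Option(dir.listFiles).getOrElse(Array.empty[File]).toSeq.flatMap { f =>
    if (f.isDirectory) findExecutorLogs(f)
    else if (f.getName == "stdout" || f.getName == "stderr") Seq(f)
    else Seq.empty
  }

// "/tmp/mesos" is an assumption taken from the log paths in this thread.
findExecutorLogs(new File("/tmp/mesos/slaves")).foreach(println)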


On Tue, Jan 28, 2014 at 7:07 AM, Marek Wiewiorka <[hidden email]> wrote:
Hi All,
I'm trying to run my app over Mesos in fine-grained mode and I'm getting the attached errors (from both the Spark driver and the Mesos slave).
The same app runs fine in coarse-grained mode. The only thing I change is the spark.mesos.coarse parameter:
System.setProperty("spark.mesos.coarse", "false")
If it is set to true everything works; if set to false (the default), the errors below occur.
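The property toggle quoted above can equivalently be set through SparkConf, which Spark 0.9 introduced (a minimal sketch; the app name and the Mesos master URL are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: "MyApp" and the master URL below are placeholders.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("mesos://XXXXX:5050")
  .set("spark.mesos.coarse", "false") // false = fine-grained (the default)

val sc = new SparkContext(conf)
```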

I'm using:
Mesos 0.16-rc3
Spark 0.9.0-rc5

I would appreciate any hints.

Regards,
Marek



Spark driver
=============
14/01/28 14:51:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/01/28 14:51:08 INFO Remoting: Starting remoting
14/01/28 14:51:08 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO spark.SparkEnv: Registering BlockManagerMaster
14/01/28 14:51:08 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140128145108-e1d9
14/01/28 14:51:08 INFO storage.MemoryStore: MemoryStore started with capacity 589.2 MB.
14/01/28 14:51:09 INFO network.ConnectionManager: Bound socket to port 34442 with id = ConnectionManagerId(XXXXXX,34442)
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/01/28 14:51:09 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager XXXXXX:34442 with 589.2 MB RAM
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Registered BlockManager
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:41485
14/01/28 14:51:09 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.0.0.8:41485
14/01/28 14:51:09 INFO spark.SparkEnv: Registering MapOutputTracker
14/01/28 14:51:09 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fcd06bbb-225d-4de4-b5c7-75b054c6a67d
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:47676
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
14/01/28 14:51:09 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/01/28 14:51:09 INFO ui.SparkUI: Started Spark Web UI at http://XXXXXX:4040
14/01/28 14:51:11 INFO storage.MemoryStore: ensureFreeSpace(92955) called with curMem=0, maxMem=617820979
14/01/28 14:51:11 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 90.8 KB, free 589.1 MB)
14/01/28 14:51:11 INFO input.FileInputFormat: Total input paths to process : 1
14/01/28 14:51:12 INFO spark.SparkContext: Starting job: count at SparkSeqSimpleJob.scala:48
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Registering RDD 4 (reduceByKey at SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Got job 0 (count at SparkSeqSimpleJob.scala:48) with 25 output partitions (allowLocal=false)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at SparkSeqSimpleJob.scala:48)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355), which has no missing parents
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting 24 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 24 tasks
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 0 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 2500 bytes in 13 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID 1 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as 2500 bytes in 0 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:2 as TID 2 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:2 as 2500 bytes in 1 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:3 as TID 3 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:3 as 2500 bytes in 1 ms
14/01/28 14:51:20 INFO scheduler.TaskSetManager: Re-queueing tasks for 201401281448-134217738-5050-17953-1 from TaskSet 1.0
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 2 (task 1.0:2)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 1 (task 1.0:1)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 3 (task 1.0:3)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 0 (task 1.0:0)
14/01/28 14:51:20 INFO scheduler.DAGScheduler: Executor lost: 201401281448-134217738-5050-17953-1 (epoch 0)
14/01/28 14:51:20 INFO storage.BlockManagerMasterActor: Trying to remove executor 201401281448-134217738-5050-17953-1 from BlockManagerMaster.
14/01/28 14:51:20 INFO storage.BlockManagerMaster: Removed 201401281448-134217738-5050-17953-1 successfully in removeExecutor

Mesos SLAVE LOG
============
I0128 14:51:12.996467 17996 slave.cpp:727] Got assigned task 0 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.996961 17996 slave.cpp:727] Got assigned task 1 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997133 17996 slave.cpp:727] Got assigned task 2 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997287 17996 slave.cpp:727] Got assigned task 3 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997429 17996 slave.cpp:836] Launching task 0 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998803 17996 slave.cpp:946] Queuing task '0' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.998819 17993 process_isolator.cpp:100] Launching 201401281448-134217738-5050-17953-1 (cd spark-0*; ./sbin/spark-executor) in /tmp/mesos/slaves/201401281448-134217738-5050-17953-1/frameworks/201401281448-134217738-5050-17953-0000/executors/201401281448-134217738-5050-17953-1/runs/a4c7bbb6-2ba2-4b03-9cd7-5bc22e12c1c9 with resources mem(*):10240' for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998883 17996 slave.cpp:836] Launching task 1 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998914 17996 slave.cpp:946] Queuing task '1' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.998947 17996 slave.cpp:836] Launching task 2 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998971 17996 slave.cpp:946] Queuing task '2' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.999003 17996 slave.cpp:836] Launching task 3 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.999943 17996 slave.cpp:946] Queuing task '3' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:13.000185 17993 process_isolator.cpp:163] Forked executor at 18386
I0128 14:51:13.001010 17993 slave.cpp:2089] Monitoring executor 201401281448-134217738-5050-17953-1 of framework 201401281448-134217738-5050-17953-0000 forked at pid 18386
I0128 14:51:17.422936 17991 slave.cpp:1422] Got registration for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.423828 17991 slave.cpp:1543] Flushing queued task 0 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.424185 17991 slave.cpp:1543] Flushing queued task 1 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.424612 17991 slave.cpp:1543] Flushing queued task 2 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.425472 17991 slave.cpp:1543] Flushing queued task 3 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.427978 17991 process_isolator.cpp:482] Telling slave of terminated executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.428513 17992 slave.cpp:2145] Executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000 has exited with status 0
I0128 14:51:20.433177 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.433763 17993 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.434661 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.435967 17993 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.436715 17993 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.439827 17993 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
W0128 14:51:20.440202 17991 process_isolator.cpp:268] Failed to kill the process tree rooted at pid 18386: Failed to find process 18386
I0128 14:51:20.440649 17991 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.451300 17991 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.436820 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.452231 17995 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.452239 17994 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.452981 17994 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.453176 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.453801 17998 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.453943 17992 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.455003 17992 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.489073 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.489478 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490196 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490628 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490890 17997 slave.cpp:2280] Cleaning up executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.491407 17997 slave.cpp:2351] Cleaning up framework 201401281448-134217738-5050-17953-0000




Re: Problem with running Spark over Mesos in fine-grained mode

Alberto Miorin
Hi all!

I have the same problem, with the same exception. I've opened a ticket:

I didn't try coarse-grained mode.

Alberto


On Wed, Jan 29, 2014 at 12:56 PM, Marek Wiewiorka <[hidden email]> wrote:
Hi Benjamin,
thanks for the response!
These are the logs you requested:
stdout:
========
Fetching resources into '/tmp/mesos/slaves/201401291036-134217738-5050-27878-1/frameworks/201401291036-134217738-5050-27878-0003/executors/201401291036-134217738-5050-27878-1/runs/9dc28fa1-7
8ac-44de-b73d-48bf624aadc2'
Fetching resource 'hdfs://XXXXX:9000/frameworks/spark/0.9.0/spark-0.9.0-incubating-hadoop_1.2.1-bin.tar.gz'
Extracting resource: tar xzf './spark-0.9.0-incubating-hadoop_1.2.1-bin.tar.gz'
Running spark-executor with framework dir = /tmp/mesos/slaves/201401291036-134217738-5050-27878-1/frameworks/201401291036-134217738-5050-27878-0003/executors/201401291036-134217738-5050-2787
8-1/runs/9dc28fa1-78ac-44de-b73d-48bf624aadc2/spark-0.9.0-incubating
2014-01-29 11:41:46,626 [Thread-0] INFO  org.apache.spark.executor.MesosExecutorBackend - Registered with Mesos as executor ID 201401291036-134217738-5050-27878-1
2014-01-29 11:41:47,898 [spark-akka.actor.default-dispatcher-4] INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2014-01-29 11:41:47,980 [spark-akka.actor.default-dispatcher-4] INFO  Remoting - Starting remoting
2014-01-29 11:41:48,303 [spark-akka.actor.default-dispatcher-4] INFO  Remoting - Remoting started; listening on addresses :[akka.tcp://spark@XXXXX:51350]
2014-01-29 11:41:48,306 [spark-akka.actor.default-dispatcher-2] INFO  Remoting - Remoting now listens on addresses: [akka.tcp://spark@XXXXX:51350]
2014-01-29 11:41:48,335 [Thread-0] INFO  org.apache.spark.SparkEnv - Connecting to BlockManagerMaster: akka.tcp://spark@XXXXX:7077/user/BlockManagerMaster

stderr
=====
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://spark@XXXXX:7077/]/user/BlockManagerMaster]
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
        at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
        at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
        at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
        at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
        at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
        at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
        at akka.actor.ActorRef.tell(ActorRef.scala:125)
        at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)
        at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
        at akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
        at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
        at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
        at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
        at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:203)
        at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
        at akka.actor.ActorCell.terminate(ActorCell.scala:338)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
        at akka.dispatch.Mailbox.run(Mailbox.scala:218)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Exception in thread "Thread-0"

Marek



2014-01-29 Benjamin Hindman <[hidden email]>

Hey Marek,

Can you include the stdout and stderr of the executors that are running? You should be able to get them from the Mesos master web UI or on the slave directly. If they are empty then you'll need to set up the log4j properties.

Ben.


On Tue, Jan 28, 2014 at 7:07 AM, Marek Wiewiorka <[hidden email]> wrote:
Hi All,
I trying to run my app over Mesos in in fine-grained mode and I'm getting the attached errors(both Spark driver and Mesos slave).
The same app in coarse-grained mode runs fine. The only thing I'm changing is setting the spark.mesos.coarse param to true/false.
System.setProperty("spark.mesos.coarse", "false")
If set to true everything is fine, if set to false(default)  errors below occur.

I'm using:
Mesos 0.16-rc3
Spark 0.9.0-rc5

I would appreciate any hints.

Regards,
Marek



Spark driver
=============
14/01/28 14:51:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/01/28 14:51:08 INFO Remoting: Starting remoting
14/01/28 14:51:08 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@XXXXXX:49899]
14/01/28 14:51:08 INFO spark.SparkEnv: Registering BlockManagerMaster
14/01/28 14:51:08 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140128145108-e1d9
14/01/28 14:51:08 INFO storage.MemoryStore: MemoryStore started with capacity 589.2 MB.
14/01/28 14:51:09 INFO network.ConnectionManager: Bound socket to port 34442 with id = ConnectionManagerId(XXXXXX,34442)
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/01/28 14:51:09 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager XXXXXX:34442 with 589.2 MB RAM
14/01/28 14:51:09 INFO storage.BlockManagerMaster: Registered BlockManager
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:41485
14/01/28 14:51:09 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.0.0.8:41485
14/01/28 14:51:09 INFO spark.SparkEnv: Registering MapOutputTracker
14/01/28 14:51:09 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fcd06bbb-225d-4de4-b5c7-75b054c6a67d
14/01/28 14:51:09 INFO spark.HttpServer: Starting HTTP Server
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:47676
14/01/28 14:51:09 INFO server.Server: jetty-7.6.8.v20121106
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/01/28 14:51:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
14/01/28 14:51:09 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/01/28 14:51:09 INFO ui.SparkUI: Started Spark Web UI at http://XXXXXX:4040
14/01/28 14:51:11 INFO storage.MemoryStore: ensureFreeSpace(92955) called with curMem=0, maxMem=<a href="tel:617820979" value="+48617820979" target="_blank">617820979
14/01/28 14:51:11 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 90.8 KB, free 589.1 MB)
14/01/28 14:51:11 INFO input.FileInputFormat: Total input paths to process : 1
14/01/28 14:51:12 INFO spark.SparkContext: Starting job: count at SparkSeqSimpleJob.scala:48
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Registering RDD 4 (reduceByKey at SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Got job 0 (count at SparkSeqSimpleJob.scala:48) with 25 output partitions (allowLocal=false)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at SparkSeqSimpleJob.scala:48)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355), which has no missing parents
14/01/28 14:51:12 INFO scheduler.DAGScheduler: Submitting 24 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at SparkSeqAnalysis.scala:355)
14/01/28 14:51:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 24 tasks
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 0 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 2500 bytes in 13 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID 1 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as 2500 bytes in 0 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:2 as TID 2 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:2 as 2500 bytes in 1 ms
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Starting task 1.0:3 as TID 3 on executor 201401281448-134217738-5050-17953-1: XXXXXX (PROCESS_LOCAL)
14/01/28 14:51:12 INFO scheduler.TaskSetManager: Serialized task 1.0:3 as 2500 bytes in 1 ms
14/01/28 14:51:20 INFO scheduler.TaskSetManager: Re-queueing tasks for 201401281448-134217738-5050-17953-1 from TaskSet 1.0
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 2 (task 1.0:2)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 1 (task 1.0:1)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 3 (task 1.0:3)
14/01/28 14:51:20 WARN scheduler.TaskSetManager: Lost TID 0 (task 1.0:0)
14/01/28 14:51:20 INFO scheduler.DAGScheduler: Executor lost: 201401281448-134217738-5050-17953-1 (epoch 0)
14/01/28 14:51:20 INFO storage.BlockManagerMasterActor: Trying to remove executor 201401281448-134217738-5050-17953-1 from BlockManagerMaster.
14/01/28 14:51:20 INFO storage.BlockManagerMaster: Removed 201401281448-134217738-5050-17953-1 successfully in removeExecutor
1

Mesos SLAVE LOG
============
I0128 14:51:12.996467 17996 slave.cpp:727] Got assigned task 0 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.996961 17996 slave.cpp:727] Got assigned task 1 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997133 17996 slave.cpp:727] Got assigned task 2 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997287 17996 slave.cpp:727] Got assigned task 3 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.997429 17996 slave.cpp:836] Launching task 0 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998803 17996 slave.cpp:946] Queuing task '0' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.998819 17993 process_isolator.cpp:100] Launching 201401281448-134217738-5050-17953-1 (cd spark-0*; ./sbin/spark-executor) in /tmp/mesos/slaves/201401281448-134217738-5050-17953-1/frameworks/201401281448-134217738-5050-17953-0000/executors/201401281448-134217738-5050-17953-1/runs/a4c7bbb6-2ba2-4b03-9cd7-5bc22e12c1c9 with resources mem(*):10240' for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998883 17996 slave.cpp:836] Launching task 1 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998914 17996 slave.cpp:946] Queuing task '1' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.998947 17996 slave.cpp:836] Launching task 2 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.998971 17996 slave.cpp:946] Queuing task '2' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:12.999003 17996 slave.cpp:836] Launching task 3 for framework 201401281448-134217738-5050-17953-0000
I0128 14:51:12.999943 17996 slave.cpp:946] Queuing task '3' for executor 201401281448-134217738-5050-17953-1 of framework '201401281448-134217738-5050-17953-0000
I0128 14:51:13.000185 17993 process_isolator.cpp:163] Forked executor at 18386
I0128 14:51:13.001010 17993 slave.cpp:2089] Monitoring executor 201401281448-134217738-5050-17953-1 of framework 201401281448-134217738-5050-17953-0000 forked at pid 18386
I0128 14:51:17.422936 17991 slave.cpp:1422] Got registration for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.423828 17991 slave.cpp:1543] Flushing queued task 0 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.424185 17991 slave.cpp:1543] Flushing queued task 1 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.424612 17991 slave.cpp:1543] Flushing queued task 2 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:17.425472 17991 slave.cpp:1543] Flushing queued task 3 for executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.427978 17991 process_isolator.cpp:482] Telling slave of terminated executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.428513 17992 slave.cpp:2145] Executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000 has exited with status 0
I0128 14:51:20.433177 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.433763 17993 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.434661 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.435967 17993 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.436715 17993 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.439827 17993 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
W0128 14:51:20.440202 17991 process_isolator.cpp:268] Failed to kill the process tree rooted at pid 18386: Failed to find process 18386
I0128 14:51:20.440649 17991 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.451300 17991 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.436820 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.452231 17995 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.452239 17994 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.452981 17994 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.453176 17992 slave.cpp:1756] Handling status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000 from @0.0.0.0:0
I0128 14:51:20.453801 17998 process_isolator.cpp:301] Asked to update resources for an unknown/killed executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.453943 17992 status_update_manager.cpp:312] Received status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.455003 17992 status_update_manager.cpp:365] Forwarding status update TASK_LOST (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000 to master@10.0.0.8:5050
I0128 14:51:20.489073 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 67c3ef56-79b7-4fe3-9244-36250d725475) for task 0 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.489478 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: a55053bb-358c-48ff-9360-52e750dd5b2b) for task 1 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490196 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 2b34aa78-5d0e-4704-8a5b-a1c69b0041ab) for task 2 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490628 17997 status_update_manager.cpp:390] Received status update acknowledgement (UUID: 98add22a-d891-4615-a03a-98ae0a48254b) for task 3 of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.490890 17997 slave.cpp:2280] Cleaning up executor '201401281448-134217738-5050-17953-1' of framework 201401281448-134217738-5050-17953-0000
I0128 14:51:20.491407 17997 slave.cpp:2351] Cleaning up framework 201401281448-134217738-5050-17953-0000