endless job and skewed tasks

leosandylh@gmail.com
Hi all,
I ran an example job three times. It just reads data from HDFS, does a map and a reduce, then writes the results back to HDFS.
The first two runs worked fine: each read almost 7 GB of data and finished in about 15 minutes, but there was a problem the third time I ran it.
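For context, the job is roughly equivalent to the minimal sketch below (the actual code isn't shown in this thread, so the master URL, HDFS paths, and the map/reduce logic are placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object ReadMapReduceWrite {
      def main(args: Array[String]) {
        // Placeholder master URL and application name
        val sc = new SparkContext("spark://master:7077", "ReadMapReduceWrite")
        val lines = sc.textFile("hdfs:///path/to/input")        // read ~7 GB from HDFS
        val counts = lines.flatMap(_.split("\\s+"))             // map side
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)                    // reduce side (shuffle)
        counts.saveAsTextFile("hdfs:///path/to/output")          // write results back to HDFS
        sc.stop()
      }
    }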
One machine in my cluster is short on disk space.
The job began at 17:11:15 but never finished. I waited an hour and then killed it. Here are the logs:
 
LogA (from the sick machine):
.....
13/12/25 17:13:56 INFO Executor: Its epoch is 0
13/12/25 17:13:56 ERROR Executor: Exception in task ID 26
java.lang.NullPointerException
        at org.apache.spark.storage.DiskStore$DiskBlockObjectWriter.revertPartialWrites(DiskStore.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$run$2.apply(ShuffleMapTask.scala:175)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$run$2.apply(ShuffleMapTask.scala:175)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
        at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:175)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
END
 
LogB (a healthy machine):
......
13/12/25 17:12:35 INFO Executor: Serialized size of result for 54 is 817
13/12/25 17:12:35 INFO Executor: Finished task ID 54
[GC 4007686K->7782K(15073280K), 0.0287070 secs]
END
 
LogC (the master and a worker):
.......
[GC 4907439K->110393K(15457280K), 0.0533800 secs]
13/12/25 17:13:23 INFO Executor: Serialized size of result for 203 is 817
13/12/25 17:13:23 INFO Executor: Finished task ID 203
13/12/25 17:13:24 INFO Executor: Serialized size of result for 202 is 817
13/12/25 17:13:24 INFO Executor: Finished task ID 202
END
 
I don't understand why the job never shuts down; no new log messages were written after the job had been running for about two minutes.
Why was one machine assigned so many more tasks than the others?
How can I check the job and task status while a big job is running? Right now it looks like a black box...
 
 


Re: endless job and skewed tasks

Azuryy Yu
Hi Leo,
Did you run Spark on YARN or Mesos?





Re: endless job and skewed tasks

leosandylh@gmail.com

No, just a standalone cluster.
 

 

Re: endless job and skewed tasks

Matei Zaharia
Administrator
Does that machine maybe have a full disk drive, or no space in /tmp (where Spark stores local files by default)?
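If so, one possible workaround (just a sketch, and the path below is a placeholder) is to point spark.local.dir at a directory on a disk with free space before the SparkContext is created:

    // Sketch only: "/data/spark-tmp" is a placeholder path on a disk with free space;
    // the directory has to exist on every worker. Set the property before creating the context.
    System.setProperty("spark.local.dir", "/data/spark-tmp")
    val sc = new org.apache.spark.SparkContext("spark://master:7077", "ReadMapReduceWrite")

Setting -Dspark.local.dir=... through SPARK_JAVA_OPTS in conf/spark-env.sh on each worker should have the same effect.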


Re: endless job and skewed tasks

leosandylh@gmail.com

Yes, the disk is completely full on that machine.
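(As a side note, a quick way to check free space on the directories Spark writes to, e.g. /tmp or wherever spark.local.dir points, is a small check like the sketch below; the paths are placeholders.)

    import java.io.File

    // Placeholder paths: wherever spark.local.dir / shuffle files live on each worker
    val dirs = Seq("/tmp", "/data/spark-tmp")
    for (d <- dirs) {
      val freeGb = new File(d).getUsableSpace / (1024.0 * 1024 * 1024)
      println(d + ": " + "%.1f".format(freeGb) + " GB free")
    }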
 

 