Fail to run LBFGS on 5G KDD data in Spark 1.0.1?


bing

1. I don't use spark-submit to run my program; I create the SparkContext directly:

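import org.apache.spark.{SparkConf, SparkContext}

// Connect to the standalone master directly instead of going through spark-submit.
val conf = new SparkConf()
             .setMaster("spark://123d101suse11sp3:7077")
             .setAppName("LBFGS")
             .set("spark.executor.memory", "30g")   // heap per executor
             .set("spark.akka.frameSize", "20")     // maximum Akka message size, in MB
val sc = new SparkContext(conf)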

 

2. I use the KDD data set; its size is about 5 GB.

3. After I call LBFGS.runLBFGS, the job fails at stage 7:

14/08/06 16:44:45 INFO DAGScheduler: Failed to run aggregate at LBFGS.scala:201
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7.0:12 failed 4 times, most recent failure: TID 304 on host 123d103suse11sp3 failed for unknown reason
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)


Re: Fail to run LBFGS on 5G KDD data in Spark 1.0.1?

Xiangrui Meng
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver? I think the problem is with the feature dimension. KDD data has more than 20M features and in v1.0.1, the driver collects the partial gradients one by one, sums them up, does the update, and then sends the new weights back to executors one by one. In 1.1-SNAPSHOT, we switched to multi-level tree aggregation and torrent broadcasting.
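To make the difference concrete, here is a rough sketch in plain Scala collections, not Spark's actual implementation. The names `TreeAggregationSketch`, `flatAggregate`, `treeAggregate`, and `fanIn` are made up for illustration. It only counts how many partial gradient vectors end up at the final merge point (the driver) under each scheme, and shows the size of one dense gradient: with 20M features stored as Doubles, a single vector is already about 160 MB.

```scala
object TreeAggregationSketch {
  type Gradient = Array[Double]

  // Element-wise sum of two dense vectors of equal length.
  def add(a: Gradient, b: Gradient): Gradient = {
    require(a.length == b.length, "dimension mismatch")
    val out = new Array[Double](a.length)
    var i = 0
    while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
    out
  }

  // Flat aggregation: the final merge point sums every partial result itself.
  // Returns the sum and the number of vectors it had to receive.
  def flatAggregate(partials: Seq[Gradient]): (Gradient, Int) = {
    require(partials.nonEmpty, "need at least one partial result")
    (partials.reduce((a, b) => add(a, b)), partials.length)
  }

  // Tree aggregation: partial results are merged in groups of `fanIn`,
  // level by level, until at most `fanIn` vectors remain for the final merge.
  def treeAggregate(partials: Seq[Gradient], fanIn: Int): (Gradient, Int) = {
    require(fanIn >= 2, "fanIn must be at least 2")
    require(partials.nonEmpty, "need at least one partial result")
    var level: Seq[Gradient] = partials
    while (level.length > fanIn) {
      level = level.grouped(fanIn).map(_.reduce((a, b) => add(a, b))).toVector
    }
    (level.reduce((a, b) => add(a, b)), level.length)
  }

  // Payload size of one dense Double vector: 8 bytes per entry,
  // ignoring JVM object headers.
  def denseVectorBytes(numFeatures: Long): Long = numFeatures * 8L
}
```

With 16 partial results and a fan-in of 4, the flat version delivers all 16 vectors to the final merge, while the tree version delivers only 4; both produce the same sum.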

For the driver memory, you can set it with spark-submit using `--driver-memory 30g`. You can confirm the setting on the storage tab of the web UI.
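Another way to check, as a small sketch: print the maximum heap of the driver JVM from inside the driver program. `DriverHeapCheck` and `maxHeapMiB` are hypothetical names; `Runtime.getRuntime.maxMemory` is the standard JVM call. If the driver is started directly rather than through spark-submit, the heap is presumably whatever was given to the JVM at launch (for example via `-Xmx`), so this value shows what the driver actually has.

```scala
object DriverHeapCheck {
  // Maximum heap the current JVM will attempt to use, in MiB.
  def maxHeapMiB: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"Driver JVM max heap: $maxHeapMiB MiB")
}
```

Calling `DriverHeapCheck.maxHeapMiB` right before creating the SparkContext is enough; the reported figure is usually somewhat below the configured heap size.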

-Xiangrui


On Wed, Aug 6, 2014 at 1:58 AM, Lizhengbing (bing, BIPA) <[hidden email]> wrote:



Re: Fail to run LBFGS on 5G KDD data in Spark 1.0.1?

bing

I have tested it with spark-1.1.0-SNAPSHOT.

It works now.

 

From: Xiangrui Meng [mailto:[hidden email]]
Sent: August 6, 2014, 23:12
To: Lizhengbing (bing, BIPA)
Cc: [hidden email]
Subject: Re: fail to run LBFS in 5G KDD data in spark 1.0.1?
