Spark job got stuck and no active tasks

Spark job got stuck and no active tasks

Pola Yao
I was trying to train an ML pipeline (including an XGBoost classifier) on a very small dataset (~10 MB). During training, some MissingOutputLocation exceptions occurred; however, the Spark job got stuck with no active tasks and never quit.
(Screenshots attached: 1.png, 2.png, 3.png)

Spark env:
Version 2.3.0, YARN cluster, #executors: 4, #executor-cores: 2, executor-memory: 4g, executor-memory-overhead: 2g
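
Roughly, that corresponds to properties like these when building the session (just a sketch of the equivalent configuration; the app name is a placeholder and the property names are the Spark 2.3 ones):

'''
import org.apache.spark.sql.SparkSession

// Sketch of the configuration above expressed as Spark properties.
// "xgboost-pipeline" is a placeholder app name.
val spark = SparkSession.builder()
  .appName("xgboost-pipeline")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "4g")
  .config("spark.executor.memoryOverhead", "2g")
  .getOrCreate()
'''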

Spark code:
'''
def train(...): PipelineModel = {
  val data = readData(...)
  val pipeline = new Pipeline().setStages(...) // the last stage is an XGBoostClassifier
  val model = pipeline.fit(data)
  val result = model.transform(data)
  model
}

def main(args: Array[String]): Unit = {
  val spark = ... // create a SparkSession

  ...

  val a = Try(train(...))
  a match {
    case Success(model) => println("Successfully trained")
    case Failure(ex) => println("Failed to train " + ex)
  }

  spark.stop()
}
'''

Does anybody have a clue about this scenario? Is it due to the YARN cluster? Or did some exceptions/errors occur that my code failed to catch?

Thanks!


Re: Spark job got stuck and no active tasks

Alessandro Solimando
Hello Pola,
do you mind providing us with the full XGBoost parameter map?

Did you try activating XGBoost's verbose mode to see if something is going wrong in the native code?

I usually increase XGBoost's verbosity by passing "silent = 0" in the param map and calling Logger.getLogger("ml.dmlc.xgboost4j").setLevel(Level.INFO) (not sure if that's the right way to do it, though).
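
A minimal sketch of what I mean, assuming xgboost4j-spark 0.8x and log4j 1.x on the driver (the parameter values other than "silent" are just placeholders):

'''
import org.apache.log4j.{Level, Logger}
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Raise the log level for the xgboost4j packages on the driver.
Logger.getLogger("ml.dmlc.xgboost4j").setLevel(Level.INFO)

// "silent" -> 0 asks XGBoost not to suppress its own messages;
// the other entries are placeholders, not a recommendation.
val verboseParams: Map[String, Any] = Map(
  "silent" -> 0,
  "eta" -> 0.1,
  "objective" -> "binary:logistic"
)
val classifier = new XGBoostClassifier(verboseParams)
'''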

Best regards,
Alessandro



Re: Spark job got stuck and no active tasks

Pola Yao
Hi Alessandro,

Thanks for your advice.

I have pasted the parameters used for XGBoost:
'''
val xgbParam = Map(
  "eta" -> 0.1f,
  "max_depth" -> 5,
  "objective" -> "binary:logistic",
  "num_round" -> 200,
  //"tracker_conf" -> TrackerConf(30000L, "scala"),
  "num_workers" -> 2,
  "nthread" -> 1,
  "verbosity" -> 3
)
'''

Logs:
[10:13:15] Tree method is automatically selected to be 'approx' for distributed training.
[10:13:15] Tree method is automatically selected to be 'approx' for distributed training.

terminate called after throwing an instance of 'dmlc::Error'
  what():  [10:13:16] /xgboost/include/xgboost/../../src/common/span.h:402: Check failed: _count >= 0 

Stack trace returned 10 entries:
[bt] (0) /tmp/libxgboost4j7047870347452838018.so(dmlc::StackTrace()+0x19d) [0x7ff3349a5d6d]
[bt] (1) /tmp/libxgboost4j7047870347452838018.so(xgboost::common::Span<xgboost::Entry const, -1l>::Span(xgboost::Entry const*, long)+0x5f0) [0x7ff3349eef00]
[bt] (2) /tmp/libxgboost4j7047870347452838018.so(+0x1e9793) [0x7ff334aec793]
[bt] (3) /tmp/libxgboost4j7047870347452838018.so(xgboost::tree::GlobalProposalHistMaker<xgboost::tree::GradStats>::CreateHist(std::vector<xgboost::detail::GradientPairInternal<float>, std::allocator<xgboost::detail::GradientPairInternal<float> > > const&, xgboost::DMatrix*, std::vector<unsigned int, std::allocator<unsigned int> > const&, xgboost::RegTree const&)+0x995) [0x7ff334afa925]
[bt] (4) /tmp/libxgboost4j7047870347452838018.so(xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(std::vector<xgboost::detail::GradientPairInternal<float>, std::allocator<xgboost::detail::GradientPairInternal<float> > > const&, xgboost::DMatrix*, xgboost::RegTree*)+0x2d9) [0x7ff334afce79]
[bt] (5) /tmp/libxgboost4j7047870347452838018.so(xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0xa3) [0x7ff334aee933]
[bt] (6) /tmp/libxgboost4j7047870347452838018.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x809) [0x7ff334a295a9]
[bt] (7) /tmp/libxgboost4j7047870347452838018.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*)+0x885) [0x7ff334a2a035]
[bt] (8) /tmp/libxgboost4j7047870347452838018.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x3e0) [0x7ff334a37560]
[bt] (9) /tmp/libxgboost4j7047870347452838018.so(XGBoosterUpdateOneIter+0x35) [0x7ff3349ac185]




Re: Spark job got stuck and no active tasks

Alessandro Solimando
Hello Pola,
sorry for the late reply. 

I haven't checked the failing method in depth, but from the stack trace it looks like there is a problem with the cardinality of one of the columns in your dataset. Did you try different datasets?
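
If it helps, here is a quick sketch of how one might inspect per-column cardinality in Spark (the helper name is just for illustration):

'''
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, countDistinct}

// Count distinct values per column in one pass, to spot a column whose
// cardinality looks suspicious (e.g. zero or unexpectedly high).
def columnCardinalities(df: DataFrame): Map[String, Long] = {
  val row = df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*).head()
  df.columns.map(c => c -> row.getAs[Long](c)).toMap
}
'''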

Does the Spark job also get stuck when no errors are raised from XGBoost?

In my experience, when an error happens at the XGBoost level, the job catches it and fails rather than getting stuck; that's why I am asking.

I have been experimenting with XGBoost 0.72, 0.8, and 0.81. Which version are you using?

Best regards,
Alessandro
