Issue with using Generalized Linear Regression for Logistic Regression modeling
The Logistic Regression (LR) offered by Spark has rather limited model
statistics output. I would like to have access to the p-value, AIC, standard
error, etc. Generalized Linear Regression (GLR) does offer these statistics
in the model output, and can be used as LR if one specifies
family="binomial", link="logit" in the GLR. The issue I ran into is that
some models converge nicely using Logistic Regression, but not using
Generalized Linear Regression. For other models, I do see they converge to
the same result using either LR or GLR.
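For readers following along, here is a minimal PySpark sketch of the setup described above: GLR configured as a logistic regression, with the extra statistics read from the model summary. The helper name `fit_binomial_glr` and the column names are placeholders I made up for illustration, not the code from my job, and the pyspark import sits inside the function only so the helper can be defined on a machine without Spark installed.

```python
def fit_binomial_glr(train_df, features_col="features", label_col="label",
                     weight_col=None, max_iter=25):
    """Fit a binomial/logit GLR and return the summary statistics LR lacks.

    `train_df` is assumed to be a Spark DataFrame with a vector features
    column and a 0/1 label column (hypothetical column names).
    """
    # Imported lazily so this helper can be defined without Spark present.
    from pyspark.ml.regression import GeneralizedLinearRegression

    params = dict(
        family="binomial",   # together with link="logit" this is logistic regression
        link="logit",
        featuresCol=features_col,
        labelCol=label_col,
        maxIter=max_iter,
    )
    if weight_col is not None:
        params["weightCol"] = weight_col

    model = GeneralizedLinearRegression(**params).fit(train_df)
    summary = model.summary

    # When an intercept is fitted, its standard error and p-value
    # are the last entries of the respective lists.
    return {
        "coefficients": model.coefficients.toArray().tolist(),
        "intercept": model.intercept,
        "std_errors": summary.coefficientStandardErrors,
        "p_values": summary.pValues,
        "aic": summary.aic,
    }
```

Calling it as `fit_binomial_glr(df, weight_col="weight")` would correspond to the weighted case discussed in this thread.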
I played around with the solver options in GLR, but that didn't help. The option
that does make a difference is weightCol. Without it, both LR and GLR
converge to the same result, whether or not that result makes sense. With weightCol
included, LR converges in about 10 iterations to the same result as I
got using SAS; GLR just won't converge (I tried as many as 10000 iterations), and the
model coefficients at the end of the run, when the maximum number of
iterations was hit, are in the 10^12 range, which is way off.
I am using Spark 2.2.0 currently. The relevant part of the code is pasted
Re: Issue with using Generalized Linear Regression for Logistic Regression modeling
It turns out that the weights were too large (with a mean around 5000 and a
standard deviation around 8000) and caused overflow. After scaling the
weights down to, for example, numbers between 0 and 1, the code converged nicely.
Spark did not report the overflow issue. We actually found it by running
the data set through R.
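To illustrate why rescaling the weights is a safe fix for the coefficients, here is a small plain-Python sketch (not Spark, and not a reproduction of the overflow itself). It fits a one-feature weighted logistic regression with Newton's method, once with large raw weights and once with the same weights divided by their maximum. The data, the weights, and the function name are all made up for illustration.

```python
import math


def fit_weighted_logistic(xs, ys, ws, max_iter=50, tol=1e-10):
    """Newton's method for a weighted logistic regression with one feature.

    Returns (intercept, slope).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of the weighted log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the (negated) Hessian
        for x, y, w in zip(xs, ys, ws):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            r = w * (y - p)
            v = w * p * (1.0 - p)
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        # Solve the 2x2 system H d = g for the Newton step d.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Made-up, non-separable toy data with large weights.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
raw_weights = [5000.0, 12000.0, 8000.0, 3000.0, 9500.0, 4000.0]

# Rescale so that every weight lies in (0, 1].
max_w = max(raw_weights)
scaled_weights = [w / max_w for w in raw_weights]

raw_fit = fit_weighted_logistic(xs, ys, raw_weights)
scaled_fit = fit_weighted_logistic(xs, ys, scaled_weights)

print("raw weights:   ", raw_fit)
print("scaled weights:", scaled_fit)
```

Multiplying all weights by a constant scales the gradient and the Hessian by the same constant, so the Newton step, and hence the fitted coefficients, stay the same. Quantities that depend on the total weight, such as standard errors, may not be invariant in the same way, so it is worth checking them against a reference tool after rescaling.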