Issue with using Generalized Linear Regression for Logistic Regression modeling

The Logistic Regression (LR) offered by Spark outputs rather limited model
statistics. I would like to have access to the q-value, AIC, standard
error, etc. Generalized Linear Regression (GLR) does offer these statistics
in the model output, and can be used as LR if one specifies
family="binomial", link="logit" in the GLR. The issue I ran into is that
some models converge nicely using Logistic Regression, but not using
Generalized Linear Regression. For other models, LR and GLR converge to
the same result.

I played around with the solver options in GLR, but that didn't help. The
option that does make a difference is weightCol. Without it, LR and GLR
converge to the same result, whether or not that result makes sense. With
weightCol included, LR converges in about 10 iterations to the same result
I got using SAS; GLR just won't converge (I tried as many as 10000
iterations), and the model coefficients at the end of the run, when the
maximum number of iterations is hit, are in the 10^12 range, which is way off.
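
For a Spark-independent cross-check of what a converging weighted fit looks like, here is a minimal sketch of IRLS (iteratively reweighted least squares, the method behind GLR's "irls" solver) for a binomial/logit model with one feature and an intercept. The data, the function names, and the tolerance are made up for illustration and are not taken from my actual model; this is only a sanity check under those assumptions, not Spark's implementation.

```python
import math


def fit_weighted_logit_irls(x, y, w, max_iter=50, tol=1e-10):
    """Fit P(y=1) = sigmoid(b0 + b1*x) with observation weights w via IRLS.

    Returns (b0, b1, iterations_used, converged).
    """
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        # Accumulate the weighted least-squares sums for this iteration.
        s = sx = sxx = sz = sxz = 0.0
        for xi, yi, wi in zip(x, y, w):
            eta = b0 + b1 * xi                 # linear predictor
            mu = 1.0 / (1.0 + math.exp(-eta))  # fitted probability
            var = mu * (1.0 - mu)              # binomial variance function
            z = eta + (yi - mu) / var          # working response
            ww = wi * var                      # working weight
            s += ww
            sx += ww * xi
            sxx += ww * xi * xi
            sz += ww * z
            sxz += ww * xi * z
        # Solve the 2x2 weighted normal equations in closed form.
        det = s * sxx - sx * sx
        new_b0 = (sxx * sz - sx * sxz) / det
        new_b1 = (s * sxz - sx * sz) / det
        change = max(abs(new_b0 - b0), abs(new_b1 - b1))
        b0, b1 = new_b0, new_b1
        if change < tol:
            return b0, b1, iteration, True
    return b0, b1, max_iter, False


def weighted_score(x, y, w, b0, b1):
    """Gradient of the weighted log-likelihood; about zero at the optimum."""
    g0 = g1 = 0.0
    for xi, yi, wi in zip(x, y, w):
        mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        g0 += wi * (yi - mu)
        g1 += wi * xi * (yi - mu)
    return g0, g1


# Hypothetical, non-separable toy data with non-integer weights.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
w = [1.0, 2.0, 1.5, 0.5, 2.0, 1.0]

b0, b1, n_iter, converged = fit_weighted_logit_irls(x, y, w)
g0, g1 = weighted_score(x, y, w, b0, b1)
print("converged:", converged, "iterations:", n_iter)
print("coefficients:", b0, b1)
print("score at solution:", g0, g1)
```

If a fit like this converges on a small extract of the data while the full GLR run still ends with coefficients around 10^12, that would point at the data or at how the weights are being handled rather than at the binomial/logit setup itself.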

I am using Spark 2.2.0 currently. The relevant part of the code is pasted below:

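    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
    from pyspark.ml.regression import GeneralizedLinearRegression

    trainingData =
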
    catColClass=[colName + colNameModStr for colName in catCol]

    stages = []
    for col in catCol:
        stringIndexer = StringIndexer(inputCol=col, outputCol=col+"Index")
        encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(), outputCol=col+colNameModStr)
        stages += [stringIndexer, encoder]

    assembler = VectorAssembler(inputCols=catColClass + numCol, outputCol='features')

    glr=GeneralizedLinearRegression(family="binomial", link="logit", solver="SGD", weightCol = "wt", labelCol="bad", maxIter=20, tol=1.0E-12, regParam=0)

    pipeline = Pipeline(stages=stages + [assembler, glr])

    modelDF =

    # --- Output some modeling results
    print("Model Betas is {}".format(modelDF.stages[-1].coefficients))

I would appreciate any help in resolving this.
