Issue with using Generalized Linear Regression for Logistic Regression modeling

FireFly
The Logistic Regression (LR) implementation offered by Spark produces rather
limited model statistics. I would like access to q-values, AIC, standard
errors, etc. Generalized Linear Regression (GLR) does offer these statistics
in its model output, and it can be used for logistic regression by specifying
family="binomial", link="logit". The issue I ran into is that some models
converge nicely using Logistic Regression, but not using Generalized Linear
Regression. For other models, I do see them converge to the same result
using either LR or GLR.

I played around with the solver options in GLR, but that didn't help. The
option that does make a difference is weightCol. Without it, both LR and GLR
converge to the same result (whether that result makes sense is a separate
question). With weightCol included, LR converges in about 10 iterations to
the same result I got using SAS; GLR simply won't converge (I tried as many
as 10,000 iterations), and the model coefficients at the end of the run,
when the maximum number of iterations was hit, are on the order of 10^12,
which is way off.

I am using Spark 2.2.0 currently. The relevant part of the code is pasted
below.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.regression import GeneralizedLinearRegression

    trainingData = sqlContext.read.load(args.input_df_name).repartition(args.repartition)

    catCol = ['rwdproduct2', 'state_final', 'mixed', 'ocup', 'SECURED', 'o_channel',
              'season', 'sa_C_ten_buck', 'sa_C_fico_buck', 'sa_C_otb_buck']
    numCol = ['PRIME_ma_6L36']

    colNameModStr = "_class"
    catColClass = [colName + colNameModStr for colName in catCol]

    # Index each categorical column, then one-hot encode it
    stages = []
    for col in catCol:
        stringIndexer = StringIndexer(inputCol=col, outputCol=col + "Index")
        encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                                outputCol=col + colNameModStr)
        stages += [stringIndexer, encoder]

    assembler = VectorAssembler(inputCols=catColClass + numCol, outputCol='features')

    glr = GeneralizedLinearRegression(family="binomial", link="logit", solver="SGD",
                                      weightCol="wt", labelCol="bad",
                                      maxIter=20, tol=1.0E-12, regParam=0)

    pipeline = Pipeline(stages=stages + [assembler, glr])

    modelDF = pipeline.fit(trainingData)

    # --- Output some modeling results
    print("Model betas: {}".format(modelDF.stages[-1].coefficients))

Appreciate any help you would offer to resolve this.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Issue with using Generalized Linear Regression for Logistic Regression modeling

FireFly
It turns out that the weights were too large (mean around 5,000, standard
deviation around 8,000) and caused an overflow. After scaling the weights
down to, for example, numbers between 0 and 1, the code converged nicely.

Spark did not report the overflow. We actually found it by running the same
data set through R.
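[Editor's note, not part of the original post: a minimal sketch of the fix described above, rescaling the raw case weights into (0, 1] before passing them as weightCol. Plain Python for illustration; the weight values are made up:]

```python
# Scale raw case weights into (0, 1] by dividing by the maximum weight.
# Relative weights are preserved, so the fitted coefficients are unchanged
# in exact arithmetic -- only the overflow-prone magnitudes go away.
def scale_weights(weights):
    w_max = max(weights)
    return [w / w_max for w in weights]

raw_weights = [5000.0, 13000.0, 200.0, 8000.0]  # made-up example values
scaled = scale_weights(raw_weights)
print(scaled)
```

In the actual pipeline, the equivalent division would be applied to the DataFrame's weight column (e.g. dividing "wt" by its maximum) before fitting.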
