Inaccurate Estimates from LinearRegressionWithSGD

herbps10
Hello,

I just finished setting up a standalone Spark cluster and have moved on to exploring MLlib.

I'm trying to perform Linear Regression on a very simple, contrived dataset. I have data.txt, which contains:
1 1
2 2
3 3
...
10 10

I then ran the following code through the Spark shell (modified very slightly from http://spark.incubator.apache.org/docs/latest/mllib-guide.html):

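// LinearRegressionWithSGD lives in the regression package, not classification
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Each line is "label feature", separated by a single space
val data = sc.textFile("/root/data2.txt")
val parsedData = data.map { line =>
  val parts = line.split(' ')
  // First column is the label; the remaining columns are the features
  LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble))
}

// Train for 20 iterations with the default step size
val model = LinearRegressionWithSGD.train(parsedData, 20)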

The problem is that the weights and intercept are extremely off:
scala> model.weights
res28: Array[Double] = Array(1.3423470408513303E21)

scala> model.intercept
res29: Double = 1.9281546837832014E20

It gets a little better if I adjust the step size:
scala> val model = LinearRegressionWithSGD.train(parsedData, 20, 0.1)
...
scala> model.weights
res30: Array[Double] = Array(0.8801059307627607)

scala> model.intercept
res31: Double = 0.8346812131298854

But it still doesn't converge on the correct estimates (I would of course expect intercept=0, slope=1). Any idea what I'm doing wrong? I feel like I must be missing something obvious.
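
The expected values can be checked independently of SGD with the closed-form least-squares formulas. The following is a minimal sketch in plain Scala with no Spark dependency; `OlsCheck`, `olsFit`, `xs`, and `ys` are illustrative names that do not appear in the original post, and the data is assumed to be the ten points 1 1 through 10 10:

```scala
object OlsCheck {
  /** Closed-form simple linear regression; returns (slope, intercept). */
  def olsFit(xs: Array[Double], ys: Array[Double]): (Double, Double) = {
    require(xs.length == ys.length && xs.nonEmpty, "need matching, non-empty inputs")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    // Sum of cross-deviations and sum of squared x-deviations
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX
    (slope, intercept)
  }

  def main(args: Array[String]): Unit = {
    // Assumed contents of the data file: label == feature for 1..10
    val xs = (1 to 10).map(_.toDouble).toArray
    val ys = xs.clone()
    println(olsFit(xs, ys))
  }
}
```

Because this computes the exact least-squares solution, any gap between its result and the SGD weights reflects the optimizer's settings rather than the data.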

Thanks!
Herb Susmann
SUNY Geneseo
hps1@geneseo.edu

Re: Inaccurate Estimates from LinearRegressionWithSGD

sowen
This fix from 8 days ago might be related:

If you are not building from HEAD, I would try again with that, or wait for the 0.9 release, which will contain it. It may not be the cause, though.
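
Separately from that fix, the step-size sensitivity described in the question can be reproduced without Spark. The sketch below is plain full-batch gradient descent with a constant step on the same ten points; it is not MLlib's implementation, which may differ in details such as its step-size schedule and mini-batching, so the numbers will not match those in the thread. `StepSizeSketch`, `fit`, and the step sizes 0.01 and 5000 iterations are assumptions chosen for illustration. For this data the mean of x squared is 38.5, so a constant step must stay below roughly 0.05 or each update overshoots and the error grows.

```scala
object StepSizeSketch {
  // Assumed data: label == feature for 1..10, as in the question
  val xs: Array[Double] = (1 to 10).map(_.toDouble).toArray
  val ys: Array[Double] = xs.clone()

  /** Runs `iters` full-batch gradient steps on mean squared error.
    * Returns (weight, intercept), both starting from 0.0.
    */
  def fit(stepSize: Double, iters: Int): (Double, Double) = {
    var w = 0.0
    var b = 0.0
    val n = xs.length
    for (_ <- 1 to iters) {
      var gradW = 0.0
      var gradB = 0.0
      for (i <- 0 until n) {
        val err = w * xs(i) + b - ys(i) // prediction minus label
        gradW += err * xs(i)
        gradB += err
      }
      // Average the gradient over the batch, then take one step
      w -= stepSize * gradW / n
      b -= stepSize * gradB / n
    }
    (w, b)
  }

  def main(args: Array[String]): Unit = {
    println(fit(1.0, 20))    // step too large: the estimates blow up
    println(fit(0.01, 20))   // stable step, but too few iterations for the intercept to settle
    println(fit(0.01, 5000)) // stable step with many iterations: approaches slope 1, intercept 0
  }
}
```

The intercept converges much more slowly than the slope here because the unscaled feature makes the loss surface badly conditioned; rescaling the feature or running more iterations with a smaller step are the usual remedies.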


On Mon, Jan 27, 2014 at 1:35 AM, herbps10 <[hidden email]> wrote:

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Inaccurate-Estimates-from-LinearRegressionWithSGD-tp942.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.