Cross validation is missing in machine learning examples

Cross validation is missing in machine learning examples

Aureliano Buendia
Hi,

I noticed that the Spark machine learning examples use the training data to validate regression models. For instance, in the linear regression example:

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
...

Here the training data was used to validate a model that was created from the very same training data. This only estimates the in-sample error (bias), and cross-validation is missing. To cross-validate, we need to partition the data into an in-sample set for training and an out-of-sample set for validation.

Please correct me if this does not apply to the ML algorithms implemented in Spark.
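
To make the partitioning idea concrete, here is a minimal sketch in plain Scala, with no Spark dependency. `Point`, `Line`, `split`, `fit`, and `mse` are hypothetical names introduced only for illustration, the data is synthetic, and the one-feature least-squares fit is a stand-in for whatever model is being trained. On an RDD the same kind of split could presumably be done with `randomSplit`, where that method is available.

```scala
import scala.util.Random

object HoldOutSketch {
  // (label, feature) pair; a hypothetical one-feature stand-in for a labeled point.
  final case class Point(label: Double, feature: Double)

  // A fitted line: prediction = intercept + slope * feature.
  final case class Line(intercept: Double, slope: Double) {
    def predict(x: Double): Double = intercept + slope * x
  }

  // Shuffle with a fixed seed, then cut into (training, validation).
  def split(data: Seq[Point], trainFraction: Double, seed: Long): (Seq[Point], Seq[Point]) = {
    val shuffled = new Random(seed).shuffle(data)
    shuffled.splitAt((data.size * trainFraction).toInt)
  }

  // Closed-form ordinary least squares for a single feature.
  // Assumes the training features are not all identical.
  def fit(train: Seq[Point]): Line = {
    val n = train.size.toDouble
    val meanX = train.map(_.feature).sum / n
    val meanY = train.map(_.label).sum / n
    val sxy = train.map(p => (p.feature - meanX) * (p.label - meanY)).sum
    val sxx = train.map(p => math.pow(p.feature - meanX, 2)).sum
    val slope = sxy / sxx
    Line(meanY - slope * meanX, slope)
  }

  // Mean squared error of the model on any data set.
  def mse(model: Line, data: Seq[Point]): Double =
    data.map(p => math.pow(p.label - model.predict(p.feature), 2)).sum / data.size

  def main(args: Array[String]): Unit = {
    val rng = new Random(0L)
    // Synthetic data: y = 2x + 1 plus Gaussian noise.
    val data = (1 to 200).map { i =>
      val x = i / 20.0
      Point(2.0 * x + 1.0 + rng.nextGaussian(), x)
    }
    val (train, validation) = split(data, trainFraction = 0.8, seed = 42L)
    val model = fit(train) // the model never sees the validation points
    println(f"training MSE:   ${mse(model, train)}%.4f")
    println(f"validation MSE: ${mse(model, validation)}%.4f")
  }
}
```

The point of the sketch is only that `fit` sees `train` and nothing else, so the second number is an out-of-sample estimate while the first is not.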

Re: Cross validation is missing in machine learning examples

Christopher Nguyen
Aureliano, you're correct that this is not "validation error", which is computed from the residuals on out-of-training-sample data and helps control overfitting (variance).

However, in this example, the errors are correctly referred to as "training error", which is what you might compute on a per-iteration basis in a gradient-descent optimizer, in order to see how you're doing with respect to minimizing the in-sample residuals.
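
As an illustration of per-iteration training error, the sketch below runs batch gradient descent on a one-feature linear model in plain Scala and records the in-sample mean squared error before each update. It is not the optimizer used by Spark; `TrainingErrorSketch`, `train`, and the step size are assumptions chosen for the example.

```scala
object TrainingErrorSketch {
  // Batch gradient descent for y ≈ w * x + b.
  // Returns the final (w, b) and the training MSE measured before each update.
  def train(xs: Array[Double], ys: Array[Double],
            stepSize: Double, iterations: Int): (Double, Double, Vector[Double]) = {
    require(xs.length == ys.length && xs.nonEmpty)
    val n = xs.length.toDouble
    var w = 0.0
    var b = 0.0
    val history = Vector.newBuilder[Double]
    for (_ <- 0 until iterations) {
      // In-sample residuals under the current parameters.
      val residuals = xs.indices.map(i => w * xs(i) + b - ys(i))
      history += residuals.map(r => r * r).sum / n
      // Gradient of the mean squared error with respect to w and b.
      val gradW = 2.0 * xs.indices.map(i => residuals(i) * xs(i)).sum / n
      val gradB = 2.0 * residuals.sum / n
      w -= stepSize * gradW
      b -= stepSize * gradB
    }
    (w, b, history.result())
  }

  def main(args: Array[String]): Unit = {
    val xs = Array.tabulate(11)(i => i / 10.0)
    val ys = xs.map(x => 3.0 * x + 0.5) // noise-free synthetic data
    val (w, b, history) = train(xs, ys, stepSize = 0.1, iterations = 200)
    history.zipWithIndex.filter(_._2 % 50 == 0).foreach { case (err, i) =>
      println(f"iteration $i%3d  training MSE = $err%.6f")
    }
    println(f"w = $w%.3f, b = $b%.3f")
  }
}
```

A falling training error here says the optimizer is making progress on the in-sample residuals; it says nothing about how the model does on data it has not seen.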

There's nothing special about Spark ML algorithms that lets them escape these bias-variance considerations.
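
For completeness, one common way to estimate out-of-sample error is k-fold cross-validation. The sketch below is plain Scala on local collections, not Spark code; `KFoldSketch`, `crossValidate`, and the one-feature least-squares model are hypothetical and used only to show the mechanics.

```scala
import scala.util.Random

object KFoldSketch {
  type Point = (Double, Double) // (feature, label)

  // Closed-form least squares for one feature; returns (intercept, slope).
  // Assumes the training features are not all identical.
  def fit(train: Seq[Point]): (Double, Double) = {
    val n = train.size.toDouble
    val meanX = train.map(_._1).sum / n
    val meanY = train.map(_._2).sum / n
    val sxy = train.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = train.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val slope = sxy / sxx
    (meanY - slope * meanX, slope)
  }

  // Mean squared error of an (intercept, slope) model on a data set.
  def mse(model: (Double, Double), data: Seq[Point]): Double = {
    val (intercept, slope) = model
    data.map { case (x, y) => math.pow(y - (intercept + slope * x), 2) }.sum / data.size
  }

  // Returns one validation MSE per fold; each point is held out exactly once.
  def crossValidate(data: Seq[Point], k: Int, seed: Long): Seq[Double] = {
    require(k >= 2 && k <= data.size)
    val indexed = new Random(seed).shuffle(data).zipWithIndex
    (0 until k).map { fold =>
      val (held, rest) = indexed.partition { case (_, i) => i % k == fold }
      mse(fit(rest.map(_._1)), held.map(_._1))
    }
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(1L)
    // Synthetic data: y = 2x + 1 plus Gaussian noise.
    val data = (1 to 100).map { i =>
      val x = i / 10.0
      (x, 2.0 * x + 1.0 + rng.nextGaussian())
    }
    val scores = crossValidate(data, k = 5, seed = 7L)
    println(f"mean validation MSE over 5 folds: ${scores.sum / scores.size}%.4f")
  }
}
```

Averaging the per-fold errors gives a less noisy estimate than a single hold-out split, at the cost of fitting the model k times.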

--
Christopher T. Nguyen
Co-founder & CEO, Adatao


On Sat, Mar 29, 2014 at 10:25 PM, Aureliano Buendia <[hidden email]> wrote: