Spark ML with null labels

Patrick McCarthy-2
I'm trying to implement an algorithm on the MNIST digits that runs like so:

  • for every pair of digits (0,1), (0,2), (0,3), ..., assign a 0/1 label to the two digits and build a LogisticRegression classifier -- 45 in total
  • Score the test set with every classifier separately
  • Aggregate the results per record of the test set and compute a prediction from the 45 predictions
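
The steps above can be sketched in plain Python, independent of Spark. This is only an illustration: the names `PAIRS`, `pair_label`, and `vote` are hypothetical, and majority voting is an assumed aggregation rule, since the post does not say how the 45 predictions are combined.

```python
from collections import Counter
from itertools import combinations

# All unordered digit pairs (lower, upper): 10 choose 2 = 45 classifiers.
PAIRS = list(combinations(range(10), 2))


def pair_label(digit, lower, upper):
    """Return 0 for the lower digit, 1 for the upper digit, None otherwise."""
    if digit == lower:
        return 0
    if digit == upper:
        return 1
    return None  # this is the "null label" case discussed in the thread


def vote(pair_predictions):
    """Aggregate {(lower, upper): 0 or 1} into one digit by majority vote.

    Assumption: each pairwise model casts one vote for the digit it predicts.
    """
    votes = Counter(
        upper if pred == 1 else lower
        for (lower, upper), pred in pair_predictions.items()
    )
    best = max(votes.values())
    # Ties are broken in favour of the smallest digit (an arbitrary choice).
    return min(d for d, n in votes.items() if n == best)
```

Note that `pair_label` returns `None` for 8 of the 10 digits under any given pair, so most rows carry a null label for each individual classifier.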
I tried implementing this with a Pipeline, composed of
  • StringIndexer
  • a custom transformer which accepts a lower-digit and upper-digit argument, producing the 0/1 label
  • a custom transformer to assemble the indexed strings to VectorUDT
  • LogisticRegression
fed by a list of paramMaps. It failed because the fit() method of LogisticRegression couldn't handle null labels, i.e. rows where my 0/1 transformer found neither the lower nor the upper digit. I fixed this by extending the LogisticRegression class and overriding the fit() method to filter for labels in (0,1) -- I didn't want to alter the transform method.
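
The workaround described here, filtering at fit time only, can be illustrated without Spark. The class below is a hypothetical toy: a one-dimensional nearest-mean classifier stands in for LogisticRegression, and none of the names come from the original code. It only shows the shape of the idea, that `fit` drops rows whose label is not 0 or 1 while `predict` scores every row.

```python
class FitFilteredClassifier:
    """Toy 1-D stand-in for the subclassed LogisticRegression (hypothetical)."""

    def fit(self, rows):
        # rows: list of (feature, label) pairs; label is 0, 1, or None.
        # Only rows labelled 0 or 1 take part in training.
        labeled = [(x, y) for x, y in rows if y in (0, 1)]
        xs0 = [x for x, y in labeled if y == 0]
        xs1 = [x for x, y in labeled if y == 1]
        if not xs0 or not xs1:
            raise ValueError("need at least one example of each class")
        self.mean0 = sum(xs0) / len(xs0)
        self.mean1 = sum(xs1) / len(xs1)
        return self

    def predict(self, rows):
        # No filtering here: every row gets a prediction, whatever its label.
        return [
            1 if abs(x - self.mean1) < abs(x - self.mean0) else 0
            for x, _ in rows
        ]


train = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1), (5.0, None)]
model = FitFilteredClassifier().fit(train)
scores = model.predict([(0.2, None), (8.0, None), (9.9, 1)])
```

All three rows passed to `predict` are scored, including the two with null labels.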

Now, I'd like to tune these models using CrossValidator with the Pipeline as its estimator, but whether I run fitMultiple on my paramMaps or loop over the paramMaps myself, I get arcane Scala errors.


Is there a better way to build this procedure? Thanks!
Re: Spark ML with null labels

Xiangrui Meng
In your custom transformer that produces labels, can you filter null labels? A transformer doesn't always need to do 1:1 mapping.

On Thu, Jan 10, 2019, 7:53 AM Patrick McCarthy <[hidden email]> wrote:
I'm trying to implement an algorithm on the MNIST digits that runs like so: [...]
Re: Spark ML with null labels

Patrick McCarthy-2
I actually tried that first. I moved away from it because the algorithm needs to evaluate all records with all models; for instance, a model trained on (2,4) needs to be evaluated on a record whose true label is 8. I found that if I apply the filter in the label-creation transformer, a record whose label is neither 2 nor 4 will not be scored. I'd be curious to discover whether there's a way to make that approach work, however.
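
A small plain-Python sketch of the problem described above. The function `label_stage` and its `drop_unlabeled` flag are hypothetical; they mimic a labelling transformer that runs both when the pipeline is fitted and when the fitted pipeline scores new data, which is why a filter placed there also removes rows at scoring time.

```python
def label_stage(digits, lower, upper, drop_unlabeled):
    """Hypothetical labelling transformer: pairs each digit with 0/1/None."""
    out = []
    for digit in digits:
        label = 0 if digit == lower else 1 if digit == upper else None
        if drop_unlabeled and label is None:
            continue  # the row disappears from every downstream stage
        out.append((digit, label))
    return out


test_digits = [2, 4, 8]

# Filtering inside the stage: the record with true label 8 is never scored.
filtered = label_stage(test_digits, 2, 4, drop_unlabeled=True)

# Keeping null labels: all three records reach the (2,4) classifier.
kept = label_stage(test_digits, 2, 4, drop_unlabeled=False)
```

With the filter on, only the records labelled 2 and 4 survive; with it off, the record labelled 8 is still available for scoring, at the cost of carrying a null label into the classifier.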

On Thu, Jan 10, 2019 at 12:20 PM Xiangrui Meng <[hidden email]> wrote:
In your custom transformer that produces labels, can you filter null labels? A transformer doesn't always need to do 1:1 mapping.