[Spark ML] Compatibility between features and models
Hi, does anyone know whether Spark's model serialization format can support feature evolution with backward compatibility? Specifically, when new data that includes both the old features and newly added features is fed into an old model
trained on the old feature set, the old model should be able to ignore the newly added features and use only the old features to generate predictions.
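To make the desired behavior concrete, here is a minimal sketch in plain Python (not the Spark API; the feature names and values are hypothetical): an old model that knows which feature names it was trained on could simply project new rows onto that feature set.

```python
# Sketch of the desired behavior: project a row containing both old and
# newly added features onto the feature set the old model was trained on.
# Feature names and values are made up for illustration.

def select_old_features(row, old_feature_names):
    """Build the input vector for the old model, dropping unknown features."""
    return [row[name] for name in old_feature_names]

old_feature_names = ["f1", "f2"]                    # features the old model saw
new_row = {"f1": 0.5, "f2": 1.5, "f3_new": 9.0}     # new data adds f3_new

vector_for_old_model = select_old_features(new_row, old_feature_names)
# The old model receives a vector of its expected size; f3_new is ignored.
```

With name information available, this projection could happen inside `transform` instead of failing on a size mismatch.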
However, it seems the Spark model serialization format only includes the weights, with no information about how the weights are aligned with feature ids / names. As a result, adding new features causes prediction to throw the following exception:
sameModel = LogisticRegressionModel.load("sample-model-old")  # old model trained with old features
predictions = sameModel.transform(newData)  # new data with both old and new features
Caused by: java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y: Vector) was given Vectors with non-matching sizes: x.size = 100, y.size = 49
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
    at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$27.apply(LogisticRegression.scala:753)
It seems the serialized model format should include the feature id / name for each weight. That way, the model could align features and weights properly, and newly added features would implicitly get weight zero. The old model still wouldn't benefit from the new features
until it is retrained, but at least prediction wouldn't fail.
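The name-based alignment proposed above can be sketched in a few lines of plain Python (again, hypothetical names, not the Spark serialization format): if weights were stored keyed by feature name, the dot product could look up each incoming feature's weight and fall back to zero for features the model has never seen.

```python
# Sketch of the proposal: weights keyed by feature name, so prediction
# aligns by name and assigns weight 0.0 to unseen features.
# All names and values here are hypothetical.

def dot_by_name(row, weights_by_name):
    """Dot product aligned by feature name; unknown features get weight 0."""
    return sum(value * weights_by_name.get(name, 0.0)
               for name, value in row.items())

old_weights = {"f1": 2.0, "f2": -1.0}                # old model, two features
new_row = {"f1": 1.0, "f2": 3.0, "f3_new": 100.0}    # f3_new was added later

margin = dot_by_name(new_row, old_weights)           # f3_new contributes 0.0
```

Under this scheme the size mismatch above cannot occur, because alignment no longer depends on vector positions matching between training and serving.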
This functionality is important when we need to evolve models in production. We often discover new features that improve a model's accuracy, and we need a reliable mechanism to upgrade the data pipeline and the models.