You could fit the k-means pipeline, get the cluster centers, create a Transformer using that info, then create a new PipelineModel including all the original elements and the new Transformer. Does that work? It's not out of the question to expose a new parameter in KMeansModel that lets you also add a column with the cost; I'd review that kind of PR.
First, some background:
- We want to use the k-means model for anomaly detection
against a multi-dimensional dataset. The current k-means
implementation in Spark is designed for clustering, not
for anomaly detection. Once a model is trained and the
pipeline is instantiated, the prediction data frame
generated by the transform function only associates each
data point with a cluster. To enable anomaly detection, we
would need to compute the distance from each data point to
its assigned (nearest) cluster centroid and compare it with
a predefined threshold to flag anomalies (e.g. normal =
distance <= threshold, anomaly = distance > threshold).
- The anomaly detection procedure (i.e. calculating the
distances and comparing them with the threshold) occurs outside
the ML pipeline (i.e. after invoking the transform method).
This causes problems when we try to persist the pipeline model
and later load and use it in production.
We would really like one Estimator to do this whole process,
from ingesting data to anomaly detection, in a single pipeline,
without the extra code at the end (i.e. after
pipeline.transform() is called).
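
For reference, the thresholding rule described above can be sketched in plain Python, independent of Spark. The function names, the sample centroids, and the threshold value here are illustrative assumptions, not anything from our actual pipeline.

```python
import math


def nearest_centroid_distance(point, centers):
    """Euclidean distance from a point to its nearest centroid."""
    return min(math.dist(point, c) for c in centers)


def is_anomaly(point, centers, threshold):
    """normal = distance <= threshold, anomaly = distance > threshold."""
    return nearest_centroid_distance(point, centers) > threshold


# Illustrative values only: two centroids and a threshold of 1.0.
centers = [(0.0, 0.0), (5.0, 5.0)]
threshold = 1.0
points = [(0.2, 0.1), (5.0, 5.5), (2.5, 2.5)]
labels = [is_anomaly(p, centers, threshold) for p in points]
```

This is the step we currently have to run by hand after transform(), which is what we would like to fold into the pipeline itself.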
Questions:
- We wanted to write a custom Transformer and append it to the
end of the Pipeline to enable anomaly detection for the
test dataset, BUT it requires the clusterCenters from the
KMeansModel stage. We can’t figure out how to pass this data,
which comes from a fitted stage, to a later stage at
runtime. Any ideas?
- Is there a way to add a callback to the KMeansModel to persist
the clusterCenters in the dataframe or in a file? Or to add a
ParamMap to set this parameter dynamically at runtime?
Thanks a lot in advance!
-- ND