Customizing K-Means for Anomaly Detection

Artemis User
First, some background:
  • We want to use the k-means model for anomaly detection on a multi-dimensional dataset.  The current k-means implementation in Spark is designed for clustering, not specifically for anomaly detection.  Once a model is trained and the pipeline is instantiated, the prediction data frame produced by the transform function only assigns each data point to a cluster.  To enable anomaly detection, we would need to compute the distance from each data point to its nearest cluster centroid and compare it with a predefined threshold to flag anomalies (e.g. normal = distance <= threshold, anomaly = distance > threshold).
  • The anomaly detection procedure (i.e. calculating the distances and comparing them with the threshold) occurs outside the ML pipeline (i.e. after invoking the transform method).  This causes problems when we try to persist the pipeline model and later load and use it in production.  We would really like one Estimator to handle the whole process, from ingesting data to anomaly detection, in a single pipeline, without the extra code at the end (i.e. after pipeline.transform() is called).
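
To make the rule in the background above concrete, here is a minimal plain-Python sketch of "distance to the nearest centroid, compared with a threshold". It does not use Spark; the function names, the centroids, and the threshold value are all made up for illustration, and Euclidean distance is an assumption (the post does not say which distance measure is used).

```python
import math


def nearest_center_distance(point, centers):
    """Return the Euclidean distance from `point` to its nearest centroid."""
    return min(math.dist(point, center) for center in centers)


def is_anomaly(point, centers, threshold):
    """Apply the rule from the post: anomaly means distance > threshold."""
    return nearest_center_distance(point, centers) > threshold


# Hypothetical centroids and threshold, chosen only for this example.
centers = [(0.0, 0.0), (10.0, 10.0)]
threshold = 2.0

points = [(0.5, 0.5), (9.0, 10.0), (5.0, 5.0)]
labels = [is_anomaly(p, centers, threshold) for p in points]
# The first two points lie close to a centroid; the third is far from both.
```

In a Spark job the same comparison would be applied per row of the prediction data frame, which is the step the post wants to move inside the pipeline.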
Questions:
  • We wanted to write a custom Transformer and append it to the end of the Pipeline to enable anomaly detection for the test dataset, BUT it requires the clusterCenters from the KMeansModel stage.  We can’t figure out how to pass this data, which comes from a fitted stage, to a later stage at runtime.  Any ideas?
  • Is there a way to add a callback to the KMeansModel to persist the clusterCenters in the dataframe or in a file?  Or to add a ParamMap to set this parameter dynamically at runtime?

Thanks a lot in advance!

-- ND

Re: Customizing K-Means for Anomaly Detection

srowen
You could fit the k-means pipeline, get the cluster centers, create a Transformer using that info, then create a new PipelineModel including all the original elements and the new Transformer. Does that work?
It's not out of the question to expose a new parameter in KMeansModel that lets you also add a column with the cost; I'd review that kind of PR.
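
The sequence suggested above (fit, read the centers, build a new stage from them, assemble a new pipeline model) can be sketched in plain Python. Everything here is a hypothetical stand-in: ToyKMeansModel, DistanceThreshold, and ToyPipelineModel are not Spark classes, rows are plain dicts, and the "fitted" centers are hard-coded instead of trained. The sketch only shows how data from a fitted stage can be handed to a later stage once fitting is done.

```python
import math


class ToyKMeansModel:
    """Hypothetical stand-in for a fitted k-means stage."""

    def __init__(self, centers):
        self._centers = [tuple(c) for c in centers]

    def cluster_centers(self):
        return list(self._centers)

    def transform(self, rows):
        # Add the index of the nearest centroid to each row.
        out = []
        for row in rows:
            dists = [math.dist(row["features"], c) for c in self._centers]
            out.append({**row, "prediction": dists.index(min(dists))})
        return out


class DistanceThreshold:
    """Hypothetical extra stage, constructed from already-fitted centers."""

    def __init__(self, centers, threshold):
        self.centers = centers
        self.threshold = threshold

    def transform(self, rows):
        # Add the distance to the assigned centroid and the anomaly flag.
        out = []
        for row in rows:
            d = math.dist(row["features"], self.centers[row["prediction"]])
            out.append({**row, "distance": d, "anomaly": d > self.threshold})
        return out


class ToyPipelineModel:
    """Hypothetical stand-in for a fitted pipeline: applies stages in order."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# 1. Pretend this pipeline was already fitted (centers are made up).
fitted = ToyPipelineModel([ToyKMeansModel([(0.0, 0.0), (10.0, 10.0)])])
# 2. Read the cluster centers from the fitted k-means stage.
centers = fitted.stages[-1].cluster_centers()
# 3. Build the extra stage from those centers and an assumed threshold.
scorer = DistanceThreshold(centers, threshold=2.0)
# 4. Assemble a new pipeline model: original stages plus the new one.
extended = ToyPipelineModel(fitted.stages + [scorer])

result = extended.transform([{"features": (1.0, 0.0)}, {"features": (6.0, 6.0)}])
```

In Spark the corresponding pieces would be the fitted PipelineModel's stages, the clusterCenters of the KMeansModel stage, and a custom Transformer. For the combined model to be saved and reloaded, that custom Transformer would also need to support Spark ML persistence, which this sketch does not cover.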

On Tue, Jan 12, 2021 at 12:59 PM Artemis User <[hidden email]> wrote: