Alex Tech Topic — Issue #4
Demystifying Anomaly Detection | Understand the Norm and Detect Abnormalities in Your Metrics
In the last newsletter, I briefly covered alert thresholds, based on what I learnt from writing Static vs Dynamic thresholds.
I realise that the crux of the issue with using dynamic thresholds was a lack of expertise in anomaly detection. Anomaly detection is not limited to alerts; it asks a broader question:
Given a set of data, how do I know if something is abnormal?
Anomaly detection is the process of identifying metrics, events or observations that do not conform to an expected pattern or are unexpected in some way. In other words, it is the identification of irregular behaviour in a dataset.
Simply put, anomaly detection is a classic data science problem. My point is that infrastructure engineers and SREs alike should be more aware of these data science problem-solving techniques.
There are two popular ways to detect anomalies:
- Statistical Methods: Use historical data to identify normal patterns of behaviour.
- Machine Learning Methods: Use algorithms to learn what constitutes normal behaviour for a given dataset.
The nuance between the two is very thin, but the implementations are radically different. Statistics, however, are within reach of any engineer: apply the right formulas to your monitoring data.
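To make the statistical route concrete, here is a minimal sketch of one such formula, the z-score: z = (x - mean) / standard deviation, which measures how many standard deviations a sample sits away from the average. The function names, the sample latencies and the threshold of 3 are assumptions chosen for illustration; they are not taken from any of the articles.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score of each value relative to the whole series."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A perfectly flat series has no spread, so nothing can stand out.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def find_anomalies(values, threshold=3.0):
    """Return (index, value) pairs whose |z-score| exceeds the threshold.

    The default threshold of 3 is a common rule of thumb, not a universal
    constant; tune it to your own metric.
    """
    scores = z_scores(values)
    return [(i, v) for i, (v, z) in enumerate(zip(values, scores)) if abs(z) > threshold]


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies_ms = [102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 100, 250]
print(find_anomalies(latencies_ms))
```

With this sample data only the final 250 ms spike is flagged. Note that the spike itself inflates the mean and the standard deviation, so on very short series a single outlier can struggle to reach a strict threshold; a longer history window makes the baseline more stable.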
Below I have gathered a selection of five articles that I read and that constitute a good base for putting this idea into practice. Here is a small summary of each:
1) Is an overview of anomaly detection and the key concepts surrounding it
2) Is a small reminder of the basic math required, including Z-score calculation
3) Is a step-by-step guide to implementing anomaly detection with Prometheus using the Z-score approach. A Gem 💎.
4) Is a useful reminder about how *_over_time metrics are calculated in Prometheus. Very useful for understanding the previous article.