Anomaly Detection and AIOps - Intelligent Alerting and Root Cause Analysis
You have just settled down with a plate full of freshly grilled food and a cold beer at a fun summer barbecue. You are about to enjoy it all. Instead, much to your disappointment, an alert on your pager goes off and you remind yourself: “OK, this is part of the job - I signed up for it and I get compensated for it.” Even though you’re hungry and that barbecued food beckons you, you instead open your laptop and try to find out what’s going on.
You soon discover that a new service was deployed before the weekend. The team set up some alerts on the metrics it exposes, using thresholds that - everyone agreed - seemed reasonable. You reach out to the on-call developer on the team, who also received the alert. Together, you take a look at the monitoring dashboard for the performance metric. It turns out that the service was slower than expected because it depends on another service whose latency increases due to a nightly batch job. You also soon find out that this is not really a problem: daytime latency is not affected. In other words, it was a false positive. Finally, you suppress the alert for the night; your colleague signs off and you go back to that delicious - but now cold - plate of food.
While this is not exactly a nightmare scenario, if it happens too often, pager duty turns from a source of pride into a burden. Ideally, you want to eliminate these false positives as much as possible; otherwise it's hard to know which alerts really matter and when to take them seriously. But with constant change in the system - new services, processes, usage patterns and more, added and updated continuously - how can you avoid this?
In this blog post, I’ll explain how smart anomaly detection can improve alerting and root cause analysis. Smart anomaly detection is one of the AIOps capabilities you gain by using an observability solution that incorporates artificial intelligence (AI) and machine learning (ML).
Microservices come online and get redeployed frequently. This is a good thing, of course, as new capabilities become available to end-users and bugs get fixed. You need to assess the health of these services and use their metrics to define alerting rules.
Updates due to new business insights and the resulting software changes will affect metrics over time. But metrics may also change as the result of external events. These non-technical interventions could be, for example, marketing campaigns, soccer matches, or a change of the weather, any of which could place additional and unexpected demand on a service. Higher than expected demand can easily cross conservative thresholds, generating false alerts and leading to alert fatigue. Therefore, your alerting rules should adapt to a changing world: slow changes to the patterns in metrics are inevitable and often result from external events, while sudden changes require attention because they indicate something might really be wrong.
When a business metric experiences a sudden change, or an SLA is in danger of being violated, you need to find out why - quickly. Inevitably, a number of services will be involved. Each of them has its golden signals: throughput, latency and error rate. Specialized metrics carry additional valuable information, over and above the golden signals, based on your deeper understanding of the service.
Efficient root cause analysis
When an issue pops up, you need to dive into a problem and the clock starts ticking. As time elapses, the pressure mounts. Higher levels of management get involved and everyone wants to know when it will be resolved.
Browsing the dashboards, you inspect metrics that might exhibit suspicious behavior. Service dependencies guide your tour. Zooming in and out helps you get a feel for “normal” behavior - and what’s now changed. At the highest levels in the stack, you can make progress quickly, as these are dashboards and metrics you’ve seen before. Lower levels need more time: you need to discover the patterns, determine reasonable ranges and identify anomalies.
Each metric stream is unique, but many of them are similar. And you need to decide, based on knowledge of the service, which anomalies should be pursued. This can be a daunting task. Wouldn’t it be nice if this process could be sped up, by having a machine learning model prepare a list of anomalies - highlighting the most suspicious ones?
What exactly is “normal” behavior? A metric stream can exhibit all kinds of patterns. A business metric, like the number of transactions, will exhibit a daily pattern. No transactions during the night or the weekend, a bump around 10AM, a dip around lunch and a slow decay into the evening.
A service metric, like latency, will have a few modes: it hugs zero when only the health endpoint is invoked, a few endpoints respond in 100 ms, and there’s the occasional heavy operation that takes a second. The JVM heap size has its iconic sawtooth - fine as long as the teeth are big, worrying if they are small and the heap nearly reaches its maximum size. Besides these archetypal patterns, each service will have its own unique metrics that measure how it is performing its business functions.
You can’t know in advance which metric streams are critical to resolve a problem. When searching for the root cause of a P1 issue, there certainly is not enough time to let a data scientist model a stream and find anomalies. So, if machine learning is going to help, it had better identify the anomalies in advance.
To do anomaly detection on a stream, a model is needed that can evaluate if data points are normal. Or not. Such a model needs to be developed - a data scientist would explore the data and create a model based on the observed characteristics. While this procedure will give the best possible model (and anomalies), it takes time to develop and hence can only be justified for a handful of metrics. For other metrics, feature selection and model selection need to be autonomous - not dependent on the valuable time of expensive data scientists.
A model needs to be trained. Moreover, since metric behavior will change over time, it periodically needs to be retrained. In order to be useful, anomaly detection should run in (near) real-time. There’s no point in having the anomalies identified an hour after the alert comes in. It should also be available quickly, not only after the new service has been running for a week and all the teething problems have been ironed out. And finally, it should adapt to changing usage patterns in real time, rather than reporting hours and hours of anomalous behavior just because it needs to be retrained.
We need models that give accurate results after training on a small amount of data, can be trained incrementally and can be compared amongst each other so that the best model is used.
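To make "trained incrementally" concrete, here is a minimal sketch of a detector that updates its model one data point at a time using Welford's online algorithm for the running mean and variance. The class name and thresholds are illustrative assumptions, not StackState's implementation; a real system would also model seasonality and compare several candidate models.

```python
class OnlineAnomalyDetector:
    """Incrementally trained detector: flags points far from the running mean.

    Illustrative sketch only - threshold and warm-up values are assumptions.
    """

    def __init__(self, threshold=4.0, min_points=30):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's algorithm)
        self.threshold = threshold
        self.min_points = min_points

    def update(self, x):
        """Score a new point against the current model, then fold it in.

        Returns True if the point looks anomalous."""
        anomalous = False
        if self.n >= self.min_points:
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        # Incremental update: no retraining pass over the full history needed.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous
```

Because every update is O(1) and needs no stored history, such a model is cheap enough to run on many streams concurrently and starts producing useful scores after a short warm-up rather than after a week of data.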
Fast adaptive models
Modern machine learning techniques, such as deep learning, are making headlines. They are very flexible and, hence, very powerful. If you have large amounts of unstructured data, such as images or text, they are your best bet for constructing a suitable model. Training is done offline and requires significant resources: large amounts of data - a very long history - and significant compute power, such as GPUs or TPUs. Real-time anomaly detection on a large number of metric streams, however, should be quick - both in the CPU time used and in the amount of historic data needed.
When faced with challenging requirements, it’s wise to go back to first principles. The best model is the simplest one that describes the normal data accurately. A model that fits the training data perfectly has overfit, using a lot of parameters to do so; it will fail on data it has not seen before. Conversely, a model with too few parameters is unable to describe significant features in the data. What’s needed is Occam’s Razor, also known as the principle of parsimony.
A natural implementation of Occam’s Razor can be found in statistical models. Each data point gets a likelihood associated with it. Fitting the model consists of optimizing the likelihood over a set of data points. Compare models by comparing their likelihoods, with complex models getting a penalty for using more parameters.
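To make this concrete, here is a minimal sketch (not StackState's implementation) of likelihood-based model comparison, using the Bayesian Information Criterion as the complexity penalty. The function names and the toy daily-pattern data are assumptions for illustration.

```python
import math

def gaussian_log_likelihood(data, mean_fn, sigma):
    """Log-likelihood of the data under N(mean_fn(t), sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mean_fn(t)) ** 2 / (2 * sigma ** 2)
        for t, x in enumerate(data)
    )

def bic(log_likelihood, num_params, num_points):
    """Bayesian Information Criterion: lower is better.

    The num_params * log(n) term is the complexity penalty -
    Occam's Razor in formula form."""
    return num_params * math.log(num_points) - 2 * log_likelihood

# Hourly data with a clear daily cycle (4 days of toy data).
data = [10 + 5 * math.sin(2 * math.pi * t / 24) for t in range(96)]
flat_mean = sum(data) / len(data)

# A 1-parameter flat model vs. a 3-parameter daily-cycle model.
flat_score = bic(
    gaussian_log_likelihood(data, lambda t: flat_mean, sigma=1.0), 1, len(data))
daily_score = bic(
    gaussian_log_likelihood(
        data, lambda t: 10 + 5 * math.sin(2 * math.pi * t / 24), sigma=1.0),
    3, len(data))
```

The daily model pays a larger penalty for its extra parameters, but its far higher likelihood wins out, so the penalized comparison picks it automatically - no data scientist required.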
The use of statistical models in real-time decision making has a long history. The Kalman Filter was used in the Apollo guidance computer to turn noisy data into actionable information. Given the increase in computing power since the 60s, it is now possible to run this model on many metric streams concurrently and further optimize the parameters in real-time.
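As a toy illustration of the idea (with made-up parameter values, not StackState's actual model), a one-dimensional local-level Kalman filter can track a noisy stream and score each point by how surprising it is:

```python
def kalman_anomaly_scores(stream, process_var=1e-3, measurement_var=1.0):
    """Track the level of a noisy stream with a 1-D local-level Kalman filter
    and score each point by its standardized innovation (its 'surprise').

    Illustrative sketch - variance parameters are assumptions."""
    level, level_var = stream[0], 1.0
    scores = [0.0]
    for x in stream[1:]:
        # Predict: the level persists, but our uncertainty about it grows.
        level_var += process_var
        # Innovation: how far the new observation is from the prediction.
        innovation = x - level
        innovation_var = level_var + measurement_var
        scores.append(abs(innovation) / innovation_var ** 0.5)
        # Update: blend prediction and observation via the Kalman gain.
        gain = level_var / innovation_var
        level += gain * innovation
        level_var *= (1 - gain)
    return scores
```

Each step is a handful of arithmetic operations, which is why this kind of model can be fitted and evaluated in real time across thousands of streams on ordinary hardware.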
Dynamic stream processing
The most important applications in a business are those at the highest level. For example, applications that directly interface with customers or suppliers - those handling sales, fulfillment, inventory. Their metrics are closely related to business metrics like revenue and delivery times. In an emergency, you start with those applications and their metrics and you work your way down the stack to find the faulty component. You navigate the topology, drilling deeper when you observe an anomaly. The most important streams for anomaly detection are therefore those you encounter along the way.
As you’re using anomaly detection to help reduce operational overhead, it should not pose operational problems by itself! You should be able to rely on the most important streams to always be up to date. Other streams can be checked when sufficient resources are available.
In traditional real-time stream processing, implemented by the likes of Spark, Flink and Kafka Streams, you start out by defining the streams to be checked. As the pipelines are deployed, you make sure the capacity is such that they are all handled without lag. This is neither flexible (with streams defined up-front) nor the most efficient use of computing power (being over-provisioned).
We need a design that is quite different. A prioritized list of streams determines which stream needs to be checked. Priority is based on the topology - the highest-level application has the highest priority stream. The highest priority stream that hasn’t been checked in the last five minutes is the first one to be picked up. A singleton “manager” service maintains the list while a multitude of workers pull tasks. Workers are stateless, they can be scaled up or down on a whim.
The highest priority stream is always up-to-date and as many streams are checked as possible, given the available resources. As you are on the hunt for the root cause, you can even temporarily bump the number of “assistants” by raising the number of workers!
StackState’s Autonomous Anomaly Detector
StackState is on a mission to ease the complexities of running and evolving the digital enterprise. We are building the best observability product experience for site reliability engineers (SREs) and DevOps teams. We believe combining traces and telemetry data with time-travelling topology makes it possible to handle ever-larger systems.
Anomaly detection guides your attention to the places and times where remarkable things happened. It reduces information overload, speeding up any investigation and allowing you to scale up and manage more components. We have implemented the design in this post in a (Kubernetes) microservice that runs alongside StackState.
The Autonomous Anomaly Detector add-on builds on the 4T Data Model of StackState for up-to-date topology information and a wide range of integrations for metric data. Found anomalies can be inspected in the StackState UI. The most suspicious anomalies are available as first-class events, ready to trigger alerts. Component health states are updated based on these events, enriching the topology - priming it for root cause analysis. For Kubernetes services, this functionality is provided out-of-the-box, with health states being updated based on their Golden Signals.
In this blog post, we examined how anomaly detection helps with setting up healthy alerts and efficient root cause analysis. The constraints coming from today’s dynamic environments lead to a design that is adaptive at different levels and allocates the available resources efficiently.
After implementing StackState, you can rest assured during your next summer barbecue that your pager will only alert you for urgent, drastic events. You’ll be able to enjoy that scrumptious food, knowing that the alerts you do get are real and that anomaly-annotated metric streams will help you identify the root cause quickly. As the old Dutch tax collection slogan goes: “Leuker kunnen we het niet maken, wel makkelijker” (“We can’t make it more fun, but we can make it easier”).