Changes are Observability’s Biggest Blind Spot
Lodewijk Bogaards · 12 min read
Classically, observability lives in layers of information on a dashboard. It uses the fundamental trio of data (metrics, logs and traces) from each layer of the environment to assess the health of an IT infrastructure. A time component is also critical: the stack should be observable at any point in time. Gathering reliable data and insights into your IT infrastructure remains the primary role of observability tools and services. However, most observability tools lack a fundamental capability: the ability to track changes over time, to inspect the state of the environment at any point in time, and then to correlate those changes with incidents when they occur.
Metrics, logs and traces do indeed form the backbone of good observability. They provide much of the data produced by a system and, in turn, provide insight into the state of the system. However, when a problem occurs, the first question an engineer asks themselves is “What changed?”
Generally, changes somewhere in the IT environment are the overwhelming cause of issues. Analysts peg the percentage of incidents caused by change at between 75% and 80%. Root cause analysis - getting to the root of the problem quickly and resolving it - is paramount to maintaining a healthy IT environment.
But finding the root cause of a problem can be expensive and time-consuming. What’s more, simply browsing the metrics, logs and traces is often fruitless, especially when dealing with reams of data. Without the right tools, the change that caused a problem remains a blind spot for engineers, and the problem cannot be resolved quickly.
Change Explained: Beyond Metrics, Logs, Traces
So what is change exactly? Changes in a system may stem from a variety of causes, such as upgrading, adding or removing a component in your IT environment. The consequences can be undesirable, e.g., something stops working. Within IT systems, changes affect components such as a database, a load balancer or a microservice.
To break it down further, these changes are related to the development lifecycle and occur when something is created, updated or deleted. In complex IT systems, pinpointing precisely where and why a resulting failure happened, and which change caused it, is often like trying to find a needle in a haystack.
Broadly speaking, changes occur in the following four scenarios:
Deployments: the steps and activities taken to deliver an update or new version of a service or microservice.
Configuration changes: an infrastructure or application-level service is modified.
Automated changes: any action taken automatically. These actions are typically programmed to make services more efficient and reliable and to reduce latency. Examples include Kubernetes deploying more pods to match demand or restarting an unhealthy pod.
Behavior/tangential changes: sudden changes in traffic, latency, error rates or saturation. These changes are related to human behavior and originate outside the organization. Think of a sudden uptick in traffic after a commercial airs.
The goal of pinpointing change is to discover what changed, why it occurred, how it occurred and the impact it had on the IT environment. Detecting a change simply from analysis of metrics, logs and traces provides a number, an inference, but not enough context - assuming that you are even looking at the correct data. Analysis becomes particularly challenging when an incident crosses information silos, that is, when interdependent applications or parts of the IT infrastructure are monitored by different tools. In that case, you have to manually correlate data from different systems to try to piece together what happened.
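To make the four categories and the what/why/how questions concrete, here is a minimal sketch of a change record in Python. The names `ChangeKind`, `ChangeEvent` and `changes_in_window` are hypothetical and chosen only for illustration; they are not taken from any particular observability product.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ChangeKind(Enum):
    """The four change scenarios listed above."""
    DEPLOYMENT = "deployment"
    CONFIGURATION = "configuration"
    AUTOMATED = "automated"
    BEHAVIOR = "behavior"


@dataclass(frozen=True)
class ChangeEvent:
    component: str    # what changed
    kind: ChangeKind  # how it changed
    reason: str       # why it changed
    at: datetime      # when it changed


def changes_in_window(events, start, end):
    """Return the events with start <= at <= end, oldest first."""
    return sorted((e for e in events if start <= e.at <= end), key=lambda e: e.at)


# Illustrative data only.
EVENTS = [
    ChangeEvent("checkout-service", ChangeKind.DEPLOYMENT,
                "release of a new version", datetime(2024, 1, 10, 11, 30)),
    ChangeEvent("checkout-service", ChangeKind.AUTOMATED,
                "autoscaler added pods to match demand", datetime(2024, 1, 10, 11, 45)),
    ChangeEvent("web-shop", ChangeKind.BEHAVIOR,
                "traffic spike after a commercial aired", datetime(2024, 1, 9, 20, 0)),
]

recent = changes_in_window(EVENTS, datetime(2024, 1, 10, 11, 0), datetime(2024, 1, 10, 12, 0))
for event in recent:
    print(event.at.isoformat(), event.component, event.kind.value, "-", event.reason)
```

Recording changes as first-class events, rather than inferring them from a jump in a metric, is what keeps the context (component, kind, reason, time) available when an incident is investigated.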
A Topology of Change
Remember, the average observability or monitoring tool simply tracks data and measurements of data without registering changes. Being able to register changes – and therefore detect and analyze significant events – requires a topology of your overall IT environment, including the ability to show interdependencies between components. A topology is essentially a map of the IT environment. It establishes the relationships and dependencies between components within a system. These components can include everything from operating systems, databases and protocols to software and runtime environments.
All these components can make an IT system fairly complicated, especially as cloud and other IT environments are constantly in flux. So when an issue occurs within the IT stack, finding the specific change which triggered an event can be difficult. With an established topology, it’s easier to understand changes to components within a system — because topology provides a comprehensive overview of the system, in other words, a bird’s-eye view with the ability to drill down more deeply into areas of interest.
Furthermore, when components are connected, they form a chain of dependency. Taking care of the stack requires a tool that lets the user see what is happening, where and when: a topology view that tracks changes over time, as they occur. It is then much easier to gather the information needed to determine the root cause, scale a response and involve the right people from the start. This visibility not only provides the context to solve the problem quickly, but also yields data that helps prevent such incidents from recurring.
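The idea of combining a dependency chain with tracked changes can be sketched as follows. This is a simplified illustration under stated assumptions, not how any specific platform implements it: the topology is a plain dictionary, and the component names, the `upstream_of` and `candidate_causes` helpers and the sample changes are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "web-shop": ["checkout-service"],
    "checkout-service": ["payments-db", "load-balancer"],
    "payments-db": [],
    "load-balancer": [],
    "reporting": ["payments-db"],
}

# Hypothetical change log: (component, time of change, description).
CHANGES = [
    ("payments-db", datetime(2024, 1, 10, 11, 40), "configuration: connection limit lowered"),
    ("reporting", datetime(2024, 1, 10, 11, 50), "deployment: new version"),
    ("load-balancer", datetime(2024, 1, 9, 8, 0), "deployment: new version"),
]


def upstream_of(component, depends_on):
    """Return the component plus everything it depends on, directly or transitively."""
    seen = [component]
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dependency in depends_on.get(current, []):
            if dependency not in seen:
                seen.append(dependency)
                queue.append(dependency)
    return seen


def candidate_causes(component, depends_on, changes, incident_at, lookback):
    """Changes on the component or its dependencies shortly before the incident, newest first."""
    scope = set(upstream_of(component, depends_on))
    window_start = incident_at - lookback
    hits = [c for c in changes if c[0] in scope and window_start <= c[1] <= incident_at]
    return sorted(hits, key=lambda c: c[1], reverse=True)


incident_at = datetime(2024, 1, 10, 12, 0)
causes = candidate_causes("web-shop", DEPENDS_ON, CHANGES, incident_at, timedelta(hours=2))
for name, at, description in causes:
    print(at.isoformat(), name, "-", description)
```

With this sample data, only the `payments-db` change qualifies for an incident on `web-shop`: the `reporting` deployment is recent but not in the dependency chain, and the `load-balancer` deployment is in the chain but outside the two-hour window. The topology narrows the search in space and the change log narrows it in time.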
Causal Observability Provides Answers
In the Observability Maturity Model, the ability to observe changes over time and link them to their effects on the IT environment is called causal observability - a higher level of observability maturity. It shows the cause and effects of a particular change, where in the stack it occurred and when it occurred. For example, a developer adds a new script, resulting in numerous changes that cause an immediate breakdown. At the maturity level of causal observability, it’s easy to pinpoint the addition of the script as the cause and to see the effect that propagated across the stack between all interrelated components. However, without the visibility provided by a topology map tracked over time, it isn’t such a simple journey. Today’s IT infrastructures are complex and ever-changing, so a bug in the system could be brand new, hours old or days old.
Ultimately, causal observability provides the answers to many questions such as:
What effect did this change cause?
What is the root cause of this performance problem or P1 outage?
What changes happened to the infrastructure that is running this application since midnight, when we received an alert?
Which alerts have the same root cause vs. which alerts are unrelated?
What was the business impact of a change?
Which team should be involved in finding the solution for a problem?
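As an illustration of one of these questions, namely which alerts share a root cause, the sketch below groups alerting components by the recently changed components found in their dependency chains. It is a deliberately simple heuristic under stated assumptions: the topology, the set of recently changed components and the helper names are all hypothetical, and a real system would also weigh timing and telemetry.

```python
from collections import defaultdict

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "web-shop": ["checkout-service"],
    "checkout-service": ["payments-db"],
    "reporting": ["payments-db"],
    "payments-db": [],
    "mail-service": [],
}

# Hypothetical: components with a recorded change shortly before the alerts fired.
RECENTLY_CHANGED = {"payments-db"}


def upstream_of(component, depends_on):
    """Return the component plus everything it depends on, directly or transitively."""
    seen, stack = set(), [component]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def group_alerts_by_cause(alerting, depends_on, changed):
    """Group alerting components by the changed components in their dependency chain."""
    groups = defaultdict(list)
    for component in alerting:
        causes = frozenset(upstream_of(component, depends_on) & changed)
        groups[causes].append(component)
    return dict(groups)


groups = group_alerts_by_cause(
    ["web-shop", "reporting", "mail-service"], DEPENDS_ON, RECENTLY_CHANGED
)
for causes, components in groups.items():
    label = ", ".join(sorted(causes)) or "no recent change found"
    print(label, "->", components)
```

Here the alerts on `web-shop` and `reporting` end up in the same group because both depend on the changed `payments-db`, while the `mail-service` alert is treated as unrelated.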
Proactive Observability With AIOps
Once you have matured to causal observability, you can then advance your observability maturity to Proactive Observability With AIOps. At this stage, artificial intelligence (AI) for IT operations (AIOps) is added to the mix. AIOps, in the context of monitoring and observability, is about applying AI and machine learning (ML) to sort through mountains of data. AIOps can help you find patterns that drive better responses, at the soonest opportunity, by both human and automated systems.
AIOps builds on core capabilities from the solid foundation you have laid in implementing observability. It adds in pattern recognition and assistance with probable remediation. Causal Observability maturity is a necessary part of Proactive Observability With AIOps. When you have confidence in cause and effect, you can start to automate responses and become more proactive in recognizing incidents that are brewing. You can even prevent them from occurring.
A Better Way to Solve IT Problems
Traditional monitoring methods are becoming relics. It’s no longer enough to simply know if a component is up or down. Ensuring the performance of your services is critical, and monitoring infrastructure plays a large part. When something goes wrong, you want to know why - and you want to know fast. And while the ability to decouple systems is ideal, it is nearly impossible. The sheer complexity of today’s hybrid and cloud native environments and the amount of data they produce require better observability tools. When there are multiple changes or combinations of changes, using metrics, logs and traces alone simply won’t cut it. Complex environments require a tool that can provide the insights to move quickly when an incident occurs, and even proactively prevent problems.
A topology-powered observability tool such as StackState’s is a good example of the kind of observability platform needed to address this issue. StackState’s observability platform not only has the capability to register changes, but it can also enable a team to immediately dive in to see and understand what happened and when. AIOps capabilities in the StackState platform can precisely find that change needle (or needles) in the haystack. Our 4T Data Model - with its ability to track Topology, Telemetry and Traces from all data sources and correlate the data over Time - provides deep insights to quickly get to the root cause. All changes in your IT environment are recorded in StackState through topology and are easily visualized, saving time and money - and putting you on the path to being a zero-downtime enterprise.
Below you can see how StackState records a change. The example shows a change in Kubernetes with kubectl. It then shows how the change appears in StackState. Finally, it shows how the change is related to metrics, logs and traces.