How to Achieve Observability in Chaos Engineering

Most Series A and B companies are born in the cloud. Instead of a traditional mainframe architecture, they run their production environments on AWS, Kubernetes and the like. While striving to do things faster and better, we must address the other side of the coin: How do you support the constant shifts inherent in these environments?

Chaos engineering helps answer that question, but only if you can observe your environment continuously and reliably. Let's look at the four main ingredients for building the observability you need to support an effective chaos engineering program.

First Things First... What Is Chaos Engineering?

Chaos engineering is the process of testing and experimenting on a system to ensure it has the capability to withstand unexpected conditions.

Site reliability engineers and architects have a pretty good grasp of the known knowns and the known unknowns. But we don't know what we don't know. These unknown unknowns can trip us up, and that's where chaos engineering can help us get a handle on the things that are most likely to bring us into the war room.

4 Key Ingredients for Effective Chaos Engineering

Observability is the foundation of chaos engineering. Here are the essential components you need to build this capability.

1. Telemetry Data

Telemetry data such as throughput, concurrency, threads and CPU utilization describes the symptoms you can observe in real time to identify and understand an issue. The data also serves as a historical reference for how your system functions day to day, giving you a bird's-eye view of the items you should address to improve site reliability.

Think beyond the current peak load to optimize your chaos engineering strategy. Don't assume that you'll have the same level of activity or throughput a year or two from now. Use telemetry data as a starting point to build for growth.
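To make this concrete, here is a minimal Python sketch of using historical telemetry as a baseline and projecting today's peak forward. The function names, the 3-sigma threshold and the compound growth model are illustrative assumptions, not something prescribed by a particular tool.

```python
from statistics import mean, pstdev


def baseline(samples):
    """Return (mean, population standard deviation) of historical samples."""
    return mean(samples), pstdev(samples)


def is_anomalous(value, samples, threshold=3.0):
    """Flag a value that deviates more than `threshold` sigmas from the baseline.

    The 3-sigma default is an assumption for illustration; tune it per metric.
    """
    mu, sigma = baseline(samples)
    if sigma == 0:
        # A perfectly flat history: any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


def projected_peak(current_peak, annual_growth, years):
    """Project peak load forward, assuming compound annual growth."""
    return current_peak * (1 + annual_growth) ** years
```

For example, with a current peak of 1,000 requests per second and an assumed 50% annual growth, `projected_peak(1000, 0.5, 2)` suggests planning experiments around 2,250 requests per second rather than today's peak.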

2. The Rate of Change

Changes impact availability and are a fact of life in the continuous integration and continuous deployment (CI/CD) environment. Use a mechanism to keep track of these changes for you automatically—even during a load test or vulnerability assessment.

Keep an eye on all changes and events, such as code updates, shifts in TCP traffic, DDoS attacks and extended loads, to expedite root cause analysis. The more you can tie changes to telemetry abnormalities or issues, the faster you can respond to problems as they arise.
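One simple way to tie changes to telemetry abnormalities is to look for changes that landed shortly before an anomaly. The sketch below assumes a hypothetical `ChangeEvent` record and a 15-minute lookback window; both are illustrative choices rather than a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """A hypothetical change record: a deploy, config update, scaling event, etc."""
    timestamp: datetime
    component: str
    description: str


def changes_before(anomaly_time, changes, window=timedelta(minutes=15)):
    """Return changes within `window` before the anomaly, most recent first."""
    candidates = [
        c for c in changes
        if timedelta(0) <= anomaly_time - c.timestamp <= window
    ]
    return sorted(candidates, key=lambda c: c.timestamp, reverse=True)
```

The most recent change is listed first because it is often the first suspect, though proximity in time is only a hint and not proof of cause.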

3. Topology

Most organizations use multiple monitoring tools, and the disparate data sources make it hard to gain a complete picture of the infrastructure. You must consolidate the information and keep track of the topology of your environment so you can align it with telemetry and change data.

Such topology-powered observability gives you real-time visibility into dependencies, which is key to understanding the relationships among various components across an entire application and establishing a baseline for predictive monitoring.
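As a rough illustration of why dependency visibility matters, the Python sketch below models topology as a plain dictionary that maps each component to the components it depends on, then walks the graph to find everything that could be affected when one component fails. The dictionary format and the `impacted_by` name are assumptions made for this example.

```python
from collections import deque


def impacted_by(failed, depends_on):
    """Return every component that depends, directly or transitively, on `failed`.

    `depends_on` maps a component name to the names it depends on.
    """
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    # In a cyclic topology the failed component can reach itself; exclude it.
    seen.discard(failed)
    return seen
```

Given `{"web": ["api"], "api": ["db", "cache"], "batch": ["db"]}`, a failure injected into `db` would be expected to show up in `api`, `web` and `batch`, which tells you where to watch telemetry during the experiment.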

Topology changes all the time, especially in serverless and cloud-native environments. You can no longer manually go through logs and monitoring graphs to gain a real-time picture. This is where StackState comes in to help you track all the changes, map the dependencies and pull together telemetry and change data to generate real-time insights.

4. Timeline

Let's say you're working with Kubernetes in production. It changes minute to minute and you must keep track of the cause and effect of these changes to create an accurate representation of the environment. The information can support chaos engineering and save you a lot of time in troubleshooting.

Most importantly, the insights can help you convert unknown unknowns into known knowns. StackState is built to address the dynamic nature of changes in dependencies over time. It essentially gives you the ability to "travel back in time" and see how components interacted when an issue arose, so you can get to the root cause.
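The general idea of looking up the past state of a topology can be sketched in a few lines of Python. This is not how StackState implements it; `TopologyTimeline` is a hypothetical class that stores timestamped snapshots and returns the one in effect at a given moment.

```python
import bisect


class TopologyTimeline:
    """Keep timestamped topology snapshots and look up the state at any time."""

    def __init__(self):
        self._times = []
        self._snapshots = []

    def record(self, timestamp, topology):
        """Store a snapshot; snapshots must arrive in chronological order."""
        if self._times and timestamp < self._times[-1]:
            raise ValueError("snapshots must be recorded in chronological order")
        self._times.append(timestamp)
        # Copy into tuples so later edits to the caller's lists don't rewrite history.
        self._snapshots.append({k: tuple(v) for k, v in topology.items()})

    def at(self, timestamp):
        """Return the snapshot in effect at `timestamp`, or None if none existed yet."""
        index = bisect.bisect_right(self._times, timestamp)
        if index == 0:
            return None
        return self._snapshots[index - 1]
```

Comparing `timeline.at(t)` just before and just after an incident shows which dependencies changed in between, which narrows down where to look.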

Better Data: The Foundation of AIOps

These four ingredients give you the topology-powered observability you need to conduct predictive monitoring and become proactive in preventing issues. You can also shrink the pool of unknown unknowns and automate root cause analysis to shorten your mean time to resolution (MTTR).

StackState helps you consolidate all the point solutions and provides high-quality data on your environment to support anomaly detection, reduce false positives and a whole lot more. The result is a reliable baseline for running AIOps tools and maximizing their value.

Want to learn more about achieving observability in chaos engineering? Watch this full-length video in which I present you with the nuts and bolts of observing chaos.

Observing Chaos: Is It Possible?
