The Power of Data Correlation: Troubleshooting Made Easy

Mark Bakker, Product Owner & Co-Founder

As software engineers, we all know that troubleshooting often involves sifting through heaps of data points — scanning metrics, reading logs, checking resource status and analyzing events. We manually connect the dots, and if we're experienced enough, we might spot an issue that's about to become a problem.

At StackState, we've faced these same challenges. And our interactions with customers have given us a clear understanding of the bigger picture: we needed to make it easier to correlate metrics, events, logs and resource health in one simplified and actionable solution. 

Why bother with data correlation?

Imagine you're a software developer on a bug hunt: what's in your toolkit? Metrics, logs, resource status and config info are all valuable assets. Add Kubernetes events and details from the latest deployment, and you're ready to roll.

The trick is weaving a coherent story from different data types for different components coming from different sources. Trickier still is getting them all in the right chronological order. That ordering is the key to analyzing problem timelines effectively, which in turn leads to a faster resolution of the issue at hand. The focus on data correlation boils down to two key reasons:

  1. Efficient Troubleshooting: When data is correlated, it's easier to spot patterns or cause-and-effect relationships. This speeds up pinpointing the root cause.

  2. Improved System Operations: Understanding how different parts of the system interact provides valuable insights into system behavior and potential bottlenecks.
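
To make the chronological-ordering point concrete, here is a minimal Python sketch that merges records from separate sources into one timeline. The Record type, the merge_timelines helper, and all sample records are hypothetical and made up for illustration; the sketch assumes each source already delivers its records sorted by time.

```python
import heapq
from datetime import datetime
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    timestamp: datetime
    source: str   # e.g. "log", "event", "metric"
    message: str


def merge_timelines(*streams: Iterable[Record]) -> list[Record]:
    """Merge per-source streams (each already sorted by time) into one timeline."""
    return list(heapq.merge(*streams, key=lambda r: r.timestamp))


def ts(hms: str) -> datetime:
    """Build a UTC timestamp on an arbitrary sample day."""
    return datetime.fromisoformat(f"2024-01-01T{hms}+00:00")


# Made-up sample data from three different sources.
logs = [Record(ts("10:00:05"), "log", "cannot allocate memory")]
events = [
    Record(ts("09:58:00"), "event", "new deployment rolled out"),
    Record(ts("10:00:07"), "event", "container restarted"),
]
metrics = [Record(ts("10:00:03"), "metric", "memory usage close to limit")]

timeline = merge_timelines(logs, events, metrics)
for record in timeline:
    print(f"{record.timestamp:%H:%M:%S} [{record.source}] {record.message}")
```

Once everything sits on one timeline, the sequence "deployment, memory climbs, allocation error, restart" is much easier to read than four separate tool views.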

Does data correlation help engineers in their day-to-day activities?

A tool that combines all these elements effortlessly can dramatically streamline the troubleshooting and monitoring process. This unified approach minimizes the need to juggle different tools and waste time manually consolidating information. 

The result? Increased productivity, quicker issue resolution, and a holistic system view that allows for an easier understanding of complex interactions and dependencies. 

Putting ideas into action with open-source tools

Here's a practical approach:

  • For logs, consider tools like Grafana Loki or Logstash to gather logs from various parts of your system.

  • To manage metrics, you can pair Prometheus (for metrics collection) with Grafana (for visualizing resource usage).

  • For container orchestration, use Kubernetes and the kubectl command-line tool for collecting events, status updates, and configuration info. 

Unfortunately, using separate tools often means bouncing around a lot, and the correlation process relies heavily on human effort. Not only is it time-consuming, but something is bound to slip through the cracks, slowing the process down further. That's where transitioning from mental gymnastics to computational power comes into play.

Walking through a troubleshooting scenario

Picture a Kubernetes pod hosting a microservice that's crashing and restarting frequently. Here's how an integrated solution, like StackState's offering, simplifies the process:

  1. Analyzing Logs: You spot the recurring restarts and dig into the pod's logs, looking for error messages or warnings. You discover the app logs "cannot allocate memory" errors before each crash.

  2. Checking Events: You investigate related events and the pod's status and discover that the pod's container is being killed by Kubernetes with the reason 'OOMKilled' (Out of Memory Killed), aligning with the log findings.

  3. Examining Metrics: Digging into resource usage metrics reveals memory spikes before each crash, confirming the 'Out of Memory' errors and 'OOMKilled' event.

  4. Reviewing Configuration: Lastly, a look at the pod's configuration indicates that the memory request and limit settings are set relatively low compared to the memory usage you observed in the metrics — a misconfiguration that means the pod is not being allocated enough memory, causing it to be killed by the system.

By weaving together logs, events, metrics, and config data, you uncover the core issue: a memory-starved pod due to improper settings.
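
The manual version of this walkthrough can be sketched in a few lines of Python: take the moment of the crash and pull in every signal from the minute leading up to it. The signals_near helper, the one-minute window, and all sample values are assumptions made for illustration, not output from a real cluster.

```python
from datetime import datetime, timedelta

# (timestamp, source, detail) tuples; all values below are made-up sample data.
Signal = tuple[datetime, str, str]


def signals_near(crash: datetime, signals: list[Signal],
                 window: timedelta = timedelta(minutes=1)) -> list[Signal]:
    """Return signals from the `window` leading up to `crash`, oldest first."""
    nearby = [s for s in signals if crash - window <= s[0] <= crash]
    return sorted(nearby, key=lambda s: s[0])


crash = datetime(2024, 1, 1, 10, 0, 10)
signals = [
    (datetime(2024, 1, 1, 9, 50, 0), "metric", "memory at 40% of limit"),
    (datetime(2024, 1, 1, 10, 0, 2), "metric", "memory at 99% of limit"),
    (datetime(2024, 1, 1, 10, 0, 8), "log", "cannot allocate memory"),
    (datetime(2024, 1, 1, 10, 0, 10), "event", "container terminated: OOMKilled"),
]

related = signals_near(crash, signals)
for when, source, detail in related:
    print(f"{when:%H:%M:%S} [{source}] {detail}")
```

The ten-minute-old metric sample falls outside the window and drops out, leaving only the memory spike, the log error, and the kill itself.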

The solution? Adjust the memory request and limit settings for the pod to a higher value that better aligns with the pod's actual memory usage, observed from the metrics. 
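
As a rough sketch of that fix, the Python snippet below derives new memory settings from an observed peak and builds a patch for a Deployment's pod template. It assumes the pod is managed by a Deployment; the container name "web", the 400 MiB peak, and the 30% headroom are hypothetical values, so pick numbers that match your own metrics.

```python
import json
import math


def memory_patch(container: str, peak_mib: int, headroom_percent: int = 30) -> dict:
    """Build a Deployment patch that sets the memory request to the observed
    peak and the limit to the peak plus some headroom (an assumed policy)."""
    limit_mib = math.ceil(peak_mib * (100 + headroom_percent) / 100)
    resources = {
        "requests": {"memory": f"{peak_mib}Mi"},
        "limits": {"memory": f"{limit_mib}Mi"},
    }
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [{"name": container, "resources": resources}]
                }
            }
        }
    }


# Hypothetical container name and observed peak usage.
patch = memory_patch("web", peak_mib=400)
print(json.dumps(patch, indent=2))
```

The resulting JSON could then be applied with something like kubectl patch deployment <name> --patch '<json>', or the same values can be copied into the manifest you keep in version control.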

Sidenote: Want to see a full workflow, including the correct commands? Take a peek at "A visual guide on troubleshooting Kubernetes deployments" from Daniele Polencic.

Making the switch from brainpower to compute power 

What if you could shift from mentally juggling correlations and tool switches to allowing an automated system to take the reins? That's what StackState does in a nutshell. 

But StackState isn't just about centralizing data; we offer the ability to mark crucial moments across various components and data types — the real key to taking Kubernetes troubleshooting to the next level (and then some).

Under the hood, StackState seamlessly orchestrates this correlation for you, leveraging the following capabilities:

  1. A Unified Data Hub: All data, from Kubernetes events and status updates to configurations, container logs, and component metrics, converges in one solution. The need for multiple tools and constant switching becomes a thing of the past.

  2. Interconnected Components: Components are linked via an extensive dependency map. This foundation — also called topology — facilitates a quick understanding of component relationships, enabling seamless navigation.

  3. Data-Component Correlation: Data isn't just stored; it's intricately linked to components and the structured dependency map. This forms the cornerstone for effortlessly — and quickly — correlating data between components.

  4. Intuitive Dashboards: Data is sliced into user-friendly, out-of-the-box dashboards. These dashboards serve as StackState's visible face, allowing for easy navigation across diverse resource types and putting data at your fingertips.

With these four pillars in place, achieving correlation is as simple as a single "Shift + Click." Whether it's a timestamp, timeline, event, metric chart, or health visualization, connecting the dots becomes second nature.
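
To illustrate the general idea of linking data to a dependency map, here is a toy Python sketch. It is not StackState's implementation; the component names, the depends_on map, and the attached signals are all hypothetical.

```python
from collections import deque

# Hypothetical topology: component -> components it depends on.
depends_on = {
    "frontend-pod": ["checkout-pod"],
    "checkout-pod": ["checkout-config", "node-1"],
    "checkout-config": [],
    "node-1": [],
}

# Hypothetical data, keyed by the component it belongs to.
signals = {
    "checkout-pod": ["log: cannot allocate memory"],
    "node-1": ["metric: memory pressure"],
}


def related_components(start: str, graph: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk over dependencies, starting with `start` itself."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dependency in graph.get(current, []):
            if dependency not in seen:
                seen.append(dependency)
                queue.append(dependency)
    return seen


def correlated_signals(start: str) -> dict[str, list[str]]:
    """Collect the signals of a component and everything it depends on."""
    return {c: signals.get(c, []) for c in related_components(start, depends_on)}


for component, found in correlated_signals("frontend-pod").items():
    print(component, found)
```

Because every signal is keyed by a component, asking "what else happened around this pod?" becomes a graph walk instead of a search across several tools.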

Introducing StackState's new capabilities for enhanced data correlation

At StackState, we're all about pushing the boundaries of innovation and enhancing our platform to meet the changing needs of software engineers and DevOps professionals. With this in mind, we've introduced three new capabilities tailored specifically for addressing Kubernetes troubleshooting:

  1. Comprehensive Event Timeline: Our new timeline offers a panoramic view of events as they unfold. Rather than merely focusing on Kubernetes events, it also integrates insights from all deployments, changes, and alerts. This ensures that users never miss out on any crucial data point. Think of it as a bridge between seemingly disparate pieces of data, ensuring that the bigger picture always remains in focus. In addition, the new timeline includes the events of all resources related to the pod you are investigating.

  2. Change Insight Feature: As Kubernetes environments evolve, staying in sync with resource evolution becomes paramount. Enter our groundbreaking "Change Insight" feature. Now, users can monitor the progression and evolution of their resources over time. One significant challenge in troubleshooting Kubernetes is detecting and rectifying configuration drifts. Armed with the power to trace changes, this feature becomes invaluable in understanding such discrepancies, ensuring the system's configuration remains consistent and optimal.

  3. Timeline Marking Mechanism: Often, when troubleshooting, being able to identify crucial moments can be a game-changer. Our latest feature takes this to a new level — allowing users to mark any point in time, across any timeline or timestamp, on any screen. This isn't just a bookmark; it's a new way of navigating that alleviates the need to juggle multiple data points and paves the way for a more user-centered troubleshooting experience.

These enhancements are more than just extra tools; they bring a new dimension to troubleshooting. They empower engineers to dig deeper, reduce manual work, and refine the correlation process. This not only streamlines troubleshooting but also reveals insights that have the potential to reshape how we manage and monitor systems.
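
For a feel of what tracing configuration changes involves, the Python sketch below compares two snapshots of a resource's configuration and lists what changed. It is an illustration of the general technique only, not the Change Insight feature itself; the diff_config helper and both snapshots are hypothetical.

```python
def diff_config(old: dict, new: dict, prefix: str = "") -> list[str]:
    """List added, removed, and changed keys between two nested config dicts."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        path = f"{prefix}.{key}" if prefix else key
        if key not in old:
            changes.append(f"added {path} = {new[key]!r}")
        elif key not in new:
            changes.append(f"removed {path} (was {old[key]!r})")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Recurse into nested sections so paths like a.b.c are reported.
            changes.extend(diff_config(old[key], new[key], path))
        elif old[key] != new[key]:
            changes.append(f"changed {path}: {old[key]!r} -> {new[key]!r}")
    return changes


# Made-up snapshots of the same resource before and after a deployment.
before = {"image": "shop:1.4", "resources": {"limits": {"memory": "512Mi"}}}
after = {"image": "shop:1.5", "resources": {"limits": {"memory": "256Mi"}}}

changes = diff_config(before, after)
for change in changes:
    print(change)
```

In this made-up example the diff shows that the new image version shipped together with a halved memory limit, which is exactly the kind of drift that explains a sudden run of out-of-memory kills.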

Unlock the future of troubleshooting

Navigating from scattered data to actionable insights doesn't have to be a headache. StackState's innovative approach simplifies complex scenarios, guiding engineers with precision and efficiency. 

Ready to dive in? Experience these capabilities in our Playground, a simulated environment with real data and scenarios.

And when it's time to set your focus on troubleshooting, sign up for a free trial to see how simple it can be.