Guided Kubernetes Troubleshooting: How To Reduce Toil for Dev Teams

Andreas Prins, CEO

Let’s be brutally honest: your customers do not care about a specific Kubernetes service. They care about the overall service you provide. The quality and reliability of that service is what matters most, not the underlying technical application services. Remediating issues fast is essential to an exceptional customer experience. But how do you provide smooth operations without downtime? How do you keep the performance of your services high enough that every customer interaction feels fast?

StackState supports engineers who run applications on Kubernetes and need to troubleshoot issues quickly when they happen, or to optimize application performance and reliability. Our SaaS observability solution helps engineers accurately detect issues and determine their cause.

Kubernetes engineering challenges

Troubleshooting can be tedious and slow, and keeping systems in a reliable state is hard. There are various challenges that prevent your engineers from being truly effective. Here are six of the most common ones:

1. Kubernetes environment complexity

On-call engineers are often not Kubernetes experts, yet they have to deal with a lot of Kubernetes complexity when troubleshooting issues. Just look at the number of different resources in a single Kubernetes application and multiply this by the number of tools and frameworks in the Cloud Native Computing Foundation (CNCF) landscape. It’s simply mind-blowing.

2. Too much change

In highly dynamic environments, changes come and go. They affect not just your own application components and the underlying infrastructure: other teams, or even a central platform team, make changes too. It can be a struggle to understand the system’s current state and know which change caused an issue.

3. Lack of knowledge

Knowing how to collect and aggregate the required observability data for monitoring and troubleshooting purposes is a challenge. Even if you have the knowledge, you still need to go to many tools to gather the data you need. If you don’t have deep Kubernetes knowledge, you barely know where to look.

4. Too much context switching

The information you need to solve an issue is spread across many different tools. Engineers frequently need a monitoring tool with multiple dashboards, a log solution and the command line to fix a single problem. Continuously switching between these tools takes time and is mentally taxing, because in each tool you have to navigate back to the place where you left off.

5. Lack of historical data (events, metrics, logs)

Kubernetes does not retain historical data such as events, metrics or logs for very long. Events, for example, are kept for only one hour by default. If this data is not captured in an observability solution, you likely will not have the history you need to get a complete picture of what happened and solve the problem.

6. Cost of engineering time and toil

Open source observability may be free, but it is not cheap: it takes a lot of time to configure, maintain and run open source observability solutions. This manual work adds to the toil.

The end result is time lost struggling to troubleshoot, which can have a direct customer impact. If engineers can't troubleshoot fast enough, the application may become unreliable, and customers may experience increased latency or even downtime.

In addition, engineers who are troubleshooting but lack the underlying knowledge often disrupt other engineers to get the information they need. It’s a lose-lose situation and it needs to change!

Types of Kubernetes troubleshooting tools

Right now, engineers take various routes to troubleshoot and optimize applications running on Kubernetes. All these approaches have the same goal of improving customer experience, although the insights each approach provides can be limited and results can differ a lot. Let’s put these solutions into several categories based on the type of data they capture:

The metrics trend spotters

This is probably the most common type of tool. Engineers often use Prometheus to collect and store metrics and then put Grafana on top to visualize them. This combination is a strong way to track changes in performance patterns over time and detect when things go wrong. Teams often build several dashboards to reflect their specific needs. PromQL and OpenMetrics have become de facto standards in this area. (Commercial solutions in this area include Chronosphere and ContainIQ.)
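To make the PromQL part concrete, here is a minimal sketch of how a per-pod CPU trend query could be built and turned into a request against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The namespace `shop` and the host `prometheus.example` are hypothetical, and the metric `container_cpu_usage_seconds_total` is assumed to be scraped from cAdvisor in your cluster.

```python
from urllib.parse import urlencode


def cpu_rate_query(namespace: str, window: str = "5m") -> str:
    """Build a PromQL expression: per-pod CPU usage rate over a time window."""
    return (
        "sum by (pod) (rate(container_cpu_usage_seconds_total"
        f'{{namespace="{namespace}"}}[{window}]))'
    )


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical namespace and Prometheus address, for illustration only.
query = cpu_rate_query("shop")
url = instant_query_url("http://prometheus.example:9090/", query)
```

The resulting URL can be fetched with any HTTP client, and the same expression can be pasted into a Grafana panel.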

The log aggregators and treasure hunters

A second approach is to collect, aggregate, analyze and visualize logs. Logs are a valuable source of information, including warnings, errors and other behavioral data about the resources running in your cluster. The first step is to get logs from individual resources such as pods, nodes or applications. Each log provides a piece of the puzzle; the next step is to put the puzzle together to figure out what happened. Using logs for troubleshooting requires deep knowledge of how resources and services are related and how to find your way around. (Common solutions in this area are Grafana Loki and Logz.io.)
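Putting the puzzle together usually starts with a single timeline. The sketch below merges log lines from several pods into time order and keeps only warnings and errors. It assumes each line starts with an RFC 3339 UTC timestamp (as produced by `kubectl logs --timestamps`) followed by a space; the pod names and messages are made up.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str, str]  # (timestamp, pod, message)


def timestamp_key(ts: str) -> Tuple[str, str]:
    """Sort key for UTC timestamps whose fractional seconds vary in length."""
    base, _, fraction = ts.rstrip("Z").partition(".")
    return base, fraction.ljust(9, "0")


def merge_pod_logs(logs_by_pod: Dict[str, List[str]]) -> List[LogEntry]:
    """Flatten per-pod log lines into one list ordered by timestamp."""
    merged: List[LogEntry] = []
    for pod, lines in logs_by_pod.items():
        for line in lines:
            ts, _, message = line.partition(" ")
            merged.append((ts, pod, message))
    merged.sort(key=lambda entry: timestamp_key(entry[0]))
    return merged


def only_problems(entries: List[LogEntry],
                  keywords: Tuple[str, ...] = ("WARN", "ERROR")) -> List[LogEntry]:
    """Keep entries whose message contains one of the given keywords."""
    return [e for e in entries if any(k in e[2] for k in keywords)]


# Hypothetical sample data, for illustration only.
sample_logs = {
    "checkout-7d9f": [
        "2024-05-01T10:00:02Z INFO order received",
        "2024-05-01T10:00:05.5Z ERROR payment call timed out",
    ],
    "payment-5c6b": [
        "2024-05-01T10:00:03Z WARN connection pool exhausted",
        "2024-05-01T10:00:05.25Z ERROR upstream unavailable",
    ],
}
timeline = only_problems(merge_pod_logs(sample_logs))
```

In this sample the merged timeline shows the payment pod's warning before either error, which is the kind of ordering that is hard to see when reading each pod's log separately.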

The tracers

If teams are looking for an end-to-end picture of what is happening, trace information can certainly help. Currently, the best-known open source framework for capturing and processing trace information is Jaeger, a monitoring and troubleshooting solution focused on complex, distributed systems. If a customer’s transaction touches multiple services, seeing and troubleshooting the bigger picture can be hard, and traces are particularly helpful in these situations. OpenTelemetry is rapidly becoming the standard for capturing trace information. (Another solution is Epsagon.)
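As a rough illustration of what a trace adds, the sketch below takes the spans of one transaction and computes each service's "self time": the span duration minus the time spent in its direct child spans. The `Span` structure is a simplified, hypothetical one (not the Jaeger or OpenTelemetry data model), and the calculation assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    duration_ms: float


def self_time_by_service(spans: List[Span]) -> Dict[str, float]:
    """Sum, per service, the time each span spent outside its child spans."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    totals: Dict[str, float] = {}
    for span in spans:
        own = max(0.0, span.duration_ms - child_total.get(span.span_id, 0.0))
        totals[span.service] = totals.get(span.service, 0.0) + own
    return totals


# Hypothetical trace of one customer transaction, for illustration only.
trace = [
    Span("a", None, "frontend", 500.0),
    Span("b", "a", "checkout", 450.0),
    Span("c", "b", "payment", 400.0),
    Span("d", "b", "inventory-db", 30.0),
]
self_times = self_time_by_service(trace)
slowest_service = max(self_times, key=self_times.get)
```

Looking only at total durations, the frontend appears slowest; the self-time view points at the payment service instead.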

The event watchers

There is another approach to troubleshooting: Build a deep understanding of all the changes and events that are going on in your Kubernetes landscape, such as images pulled, containers started or pods killed because they ran out of memory. All these Kubernetes events, and many more, give an understanding of what is happening. Bringing these events from all resources together into a single tool is helpful to quickly go through them, rather than using a command line to query them time after time. (A solution in this space is Komodor, which tracks events and changes in configuration.)
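A small sketch of what "bringing events together" can look like: it reads an event list in the JSON shape returned by `kubectl get events -o json` and counts Warning events per resource and reason. The sample events are invented; the field names (`type`, `reason`, `count`, `involvedObject`) follow the core Kubernetes Event object.

```python
import json
from collections import Counter


def warning_summary(events_json: str) -> Counter:
    """Count Warning events per (kind/name, reason), honoring the count field."""
    summary: Counter = Counter()
    for event in json.loads(events_json).get("items", []):
        if event.get("type") != "Warning":
            continue
        obj = event.get("involvedObject", {})
        key = (f'{obj.get("kind")}/{obj.get("name")}', event.get("reason"))
        summary[key] += event.get("count") or 1
    return summary


# Hypothetical sample in the shape of `kubectl get events -o json`.
sample_events = json.dumps({
    "items": [
        {"type": "Normal", "reason": "Pulled", "count": 1,
         "involvedObject": {"kind": "Pod", "name": "checkout-7d9f"}},
        {"type": "Warning", "reason": "BackOff", "count": 7,
         "involvedObject": {"kind": "Pod", "name": "checkout-7d9f"}},
        {"type": "Warning", "reason": "FailedScheduling", "count": 2,
         "involvedObject": {"kind": "Pod", "name": "payment-5c6b"}},
        {"type": "Warning", "reason": "BackOff", "count": 3,
         "involvedObject": {"kind": "Pod", "name": "checkout-7d9f"}},
    ]
})
noisiest = warning_summary(sample_events).most_common(1)
```

Because Kubernetes keeps events only briefly, a summary like this is useful only if the events are exported and stored somewhere first.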

Our approach to troubleshooting and why it yields better results

At StackState, we believe that all four troubleshooting approaches need to come together in a single tool to deepen human understanding and drive accurate, fast remediation. Out of the box, StackState collects and correlates all four essential data types: metrics, logs, events and traces. StackState shows the impact of issues on the business, identifies the cause of issues accurately and automatically applies expert practices to help teams detect and remediate issues.

With StackState, any engineer can ensure smooth operation of all Kubernetes-based applications and services, even if they lack specific knowledge of the application, service architecture or Kubernetes itself.

Do you have applications running on Kubernetes? StackState is the most efficient tool for troubleshooting Kubernetes applications, due to our approach in four areas:

1. Automatically applies Kubernetes expert practices out of the box by providing pre-configured monitors to look for common problems.

You want to enable all your engineers to be immediately effective in troubleshooting Kubernetes, so you need a tool that automatically applies expert practices out of the box, as well as one that offers smart assistance to find the cause of issues. With StackState’s guidance, common mistakes can be avoided. These built-in practices are a good foundation for your SREs and other experts to build on, to meet your company and team needs. For example, StackState includes expert practices for troubleshooting unhealthy pods, containers that get killed or services with very high latency.

2. Collects all essential metric, log, event and trace data including golden signals (latency, saturation, throughput, error rate), then correlates all relevant data for a service or resource and shows everything in context.

As described above, the four troubleshooting approaches each provide a different path to success. Bringing metric, log, trace and event data together gives every engineer the right information at the right time, without the hassle of opening each tool separately and frequently switching context. StackState’s metrics are based on Prometheus, so you can still add the specific Grafana dashboards you need on top.
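For readers less familiar with the golden signals, the sketch below derives all four from raw request records for one service over a time window. This is an illustrative calculation, not StackState's implementation: the `Request` record, the use of a nearest-rank 95th percentile for latency, counting HTTP 5xx responses as errors, and measuring saturation as CPU used divided by the CPU limit are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests: Sequence[Request], window_seconds: float,
                   cpu_used_cores: float, cpu_limit_cores: float) -> Dict[str, float]:
    """Compute latency, throughput, error rate and saturation for one window."""
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid rounding surprises.
    rank = max(1, (95 * len(durations) + 99) // 100)
    p95 = durations[rank - 1] if durations else 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": cpu_used_cores / cpu_limit_cores,
    }


# Hypothetical 10-second window: 18 fast requests, one slow, one failed.
window = (
    [Request(100.0, 200)] * 18
    + [Request(400.0, 200), Request(900.0, 503)]
)
signals = golden_signals(window, window_seconds=10.0,
                         cpu_used_cores=0.75, cpu_limit_cores=1.0)
```

Seeing the four numbers side by side for the same service and window is the point: a latency spike with high saturation suggests a different cause than a latency spike with a rising error rate.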

3. Automatically discovers and visualizes all Kubernetes service and resource dependencies to help you keep track of all changes in your dynamic environment.

Unlike the majority of other tools, StackState provides a connected picture that helps you understand how resources relate to and depend on each other. An accurate understanding of relationships is required if you want to easily browse the web of resources that comprise your customer offering. StackState has a unique topology mapping and change tracking capability that tracks all changes in relationships and configurations over time, providing a clear foundation to understand dependencies.

4. Guides remediation with hints and visual assistance using smart problem clustering and probable cause recommendation, to help everyone fix issues as quickly as possible.

Data is key, but you can easily drown in too much of it. The main question is how to use all that data to your benefit. The expert practices described in item 1 of this list (pre-configured monitors that look at the right things and issue alerts at the right time) are enriched with clear hints that help engineers remediate issues. This guidance helps every engineer immediately understand what needs to happen to fix the problem. In addition, after the issue is solved, this information supports a blameless post mortem to determine what needs to be improved.

A standalone, complete SaaS solution to observe your entire Kubernetes cluster

If you want to empower all of your engineers with the knowledge they need to troubleshoot Kubernetes applications, try the first deep observability tool that aggregates metrics, logs, events and traces, shows connections and dependencies across services and takes engineers straight to the change that caused an issue. With StackState, you will make remediating issues a breeze.

Get our free trial, become a design partner for some amazing new features, or do both.