Kubernetes Observability Tool - Support SRE Best Practices

Kubernetes issues can be tough to troubleshoot and remediate quickly, especially when you have many interdependent services. This blog, part 3 of 3 in the “8 SRE Best Practices to Help You Troubleshoot Kubernetes” series, describes the Kubernetes observability foundation StackState has built to support SRE best practices and enable rapid remediation of issues.

Read Part 1 of the series to understand the unique challenges often faced when troubleshooting Kubernetes and read Part 2 to learn the 8 SRE best practices.

Efficient Kubernetes troubleshooting is an important part of providing a reliable service to your users. Although developers are often called upon to support the services they build, SREs are in a unique position to provide leadership, build processes, share knowledge and put tools in place that can significantly improve troubleshooting outcomes for engineers who may not otherwise have access to the data they need. By setting up easy-to-replicate and automated processes, building guard rails, collecting the necessary data and implementing the right set of tools, SREs can make troubleshooting faster and easier for everyone.

StackState has a Kubernetes observability tool that is specifically designed to help SREs follow best practices and build a solid foundation for effective Kubernetes troubleshooting. It works out of the box to collect, store, manage, correlate, visualize and analyze all types of data and scale expert knowledge across your organization. With StackState, you get comprehensive observability data, the most complete dependency map available, expert knowledge baked into monitors and processes, and guided remediation assistance. Together, these give you a foundation for the 8 SRE best practices for helping developers troubleshoot Kubernetes, which are outlined in Part 2 of this series.

Our unique approach to observability helps you accurately detect issues and quickly remediate them, regardless of your Kubernetes knowledge and skill level. Here’s what you can do with StackState to simplify Kubernetes troubleshooting:

- Collect, aggregate and correlate Kubernetes observability data – Use StackState to automatically collect all the data you need using open technologies and standards like eBPF, OpenMetrics and OpenTelemetry. Bring in data from your other tools and view everything together in one central location, correlated and in the right context.
  - Make the most of your metrics – Automatically collect all the metrics you need, store them for months instead of hours or days, and use PromQL to easily write metrics queries. Correlate metrics with events and logs across clusters and services to track changes in performance patterns and detect when things go wrong. Use StackState as a scalable replacement for Prometheus.
  - Bring together your events – Build a deep understanding of all the changes and events that are going on in your Kubernetes landscape, such as images pulled, containers started or pods killed because they ran out of memory. Bring events from all resources together in one place rather than issuing a long series of command line requests.
  - Make log information easily accessible – Automatically collect, aggregate, analyze and visualize all logs to see valuable information about the resources running in your cluster, including warnings, errors and other behavioral data. No need for command line queries to retrieve logs.
  - Use traces to create detailed insight – Use automatically derived golden signals – error rate, throughput and latency – to observe applications running on Kubernetes clusters. Get detailed information about service performance as well as dependencies between different components.
- Auto-discover your environment – Automatically discover and visualize all Kubernetes service and resource dependencies so you can build a holistic understanding of all your clusters. Now every troubleshooting engineer can fully understand complex relationships and easily see resource changes, even for services they don’t own.
- Apply best practices using monitors – Automatically apply Kubernetes expert practices in the form of pre-configured monitors that look for common issues and also apply compliance, security and other policies. StackState monitors work out of the box and are written in an easy-to-read YAML format so SREs can further extend them.
- Automate your runbooks and get guided remediation assistance – Any engineer can easily follow StackState’s step-by-step guided remediation to fix issues as quickly as possible. StackState provides troubleshooting hints and visual assistance using smart problem clustering to simplify and accelerate the remediation process.
- Combine data in comprehensive dashboards – Get dashboards that aggregate and correlate all relevant metrics, logs, traces and events. No need to context-switch between tools or write numerous tedious queries to get the data you need. StackState also provides a solid foundation to power Grafana dashboards that show your business metrics.
- Track changes and see their effects – Keep track of all changes in all resources in your dynamic environment over time, from services to pods to clusters to configurations. Know what your environment looked like and what resources were running when an issue started, and see how issues in one service may impact others. In many cases, change data is essential to determining the cause of an issue. StackState uses change information to more accurately prescribe what needs to happen to solve an issue.
- Set it up and see results in 5 minutes – Install StackState in minutes and immediately see observability data, a map of your services, and recommended troubleshooting activities.
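
To make the golden signals mentioned above (error rate, throughput and latency) more concrete, here is a minimal sketch of how they could be computed by hand from a batch of trace spans. It is not StackState's implementation, which derives these signals automatically. The `Span` record, its fields and the `golden_signals` function are hypothetical names chosen for illustration, and the nearest-rank 95th percentile is only one common way to summarize latency.

```python
import math
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified trace span for one handled request."""
    service: str
    duration_ms: float
    is_error: bool


def golden_signals(spans, window_seconds):
    """Return throughput, error rate and p95 latency per service.

    `spans` is assumed to hold every span observed during a window
    that is `window_seconds` long.
    """
    # Group spans by the service that handled them.
    by_service = {}
    for span in spans:
        by_service.setdefault(span.service, []).append(span)

    signals = {}
    for service, group in by_service.items():
        durations = sorted(s.duration_ms for s in group)
        errors = sum(1 for s in group if s.is_error)
        # Nearest-rank percentile: the smallest value such that at
        # least 95% of the observed durations are less than or equal to it.
        rank = max(1, math.ceil(0.95 * len(durations)))
        signals[service] = {
            "throughput_rps": len(group) / window_seconds,
            "error_rate": errors / len(group),
            "p95_latency_ms": durations[rank - 1],
        }
    return signals


if __name__ == "__main__":
    sample = [
        Span("checkout", 10.0, False),
        Span("checkout", 20.0, False),
        Span("checkout", 30.0, True),
        Span("checkout", 40.0, False),
    ]
    print(golden_signals(sample, window_seconds=2))
```

Grouping by service first keeps the three signals side by side for each service, which is the shape you would want when correlating them with events or changes on the same component.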

StackState helps all team members effectively troubleshoot Kubernetes-based applications and services, providing the necessary observability data and giving everyone insight into service and infrastructure dependencies:

- Get instant visibility in production
- Improve reliability and save time
- Reduce troubleshooting toil
- Do more with fewer experts

With StackState, you can automate observability processes, encode expert knowledge into repeatable practices, follow guided remediation tips and lower the barrier for every engineer to troubleshoot Kubernetes. And we are SOC 2 compliant, so you can rest assured we have the right protections in place to safeguard your data.

Explore the first tool that automatically collects all observability data, shows dependencies across services and leads engineers on the fastest path to remediation: get our free trial.

What else can you do to make things easier for everyone…?

Read our white paper, 8 SRE Best Practices to Help Developers Troubleshoot Kubernetes, to find out!
Read Part 1 and Part 2 of this blog series
