8 SRE Best Practices to Help You Troubleshoot Kubernetes

Lisa Wells

Maintaining reliable Kubernetes systems is not easy, especially for people who are not Kubernetes experts. This blog, part 2 of 3 in the “8 SRE Best Practices to Help You Troubleshoot Kubernetes” series, explains 8 simple best practices SREs can follow to help developers and other SREs build knowledge and effectively troubleshoot issues in applications running on Kubernetes.  

Read Part 1 of the series to understand the unique challenges often faced when troubleshooting Kubernetes and read Part 3 to learn how to build a solid foundation to support these 8 SRE best practices.

Fast remediation of customer-facing issues is of the greatest importance in providing reliable services. But many companies struggle to find enough people with the right skills to both code and support cloud native applications, so it can be challenging to move quickly and still keep everything running smoothly. 

SREs can help compensate for this skill shortage in many ways – such as helping development teams expand their abilities to quickly troubleshoot issues in applications running on Kubernetes. This blog shares 8 SRE best practices you can follow to support effective troubleshooting and remediation for everyone.   

As an SRE, you are in a unique position to mentor, coach, set an example for and guide the company in many ways. If you can guide others, automate processes and provide good toolsets, you will give developers – and other SREs – a clear map with directions to troubleshoot applications efficiently. Armed with a holistic picture, Kubernetes background knowledge and the right guidance, everyone can remediate issues with a minimum of toil and without calling in others to help.

8 SRE best practices to help developers troubleshoot Kubernetes. 

1. Take a Mentorship Role for All Teams and Lead the Way  🕺 

When it comes to quality, take a leadership role and help advise the rest of the organization. After all, your job is to ensure performance and reliability, and one of the best ways to do that is to establish yourself as an engineering leader in your company and then mentor others.  

Set a goal to keep learning as much as you can about Kubernetes, the other technologies in your environment and how they all work together. In your own work, keep an eye on the big picture and take a proactive, holistic approach to identify the weak spots in your landscape. Then pioneer the way for others: set an example and be a role model through your actions, share best practices and create solid processes others can easily follow. Finally, create a culture of continuous improvement: challenge everyone to keep thinking outside the box and find better ways to do things.  

2. Facilitate Knowledge Sharing  🎓

Help your team help themselves! If you can increase the expertise and effectiveness of your engineers, you are making an amazing investment in reliability. Added bonus: they will need to bother you less for help.  As the resident expert in the overall environment, you have much to teach. This SRE best practice is related to #1, Lead the Way, but there’s so much to say about it that it stands on its own. Here are six steps you can take to share detailed knowledge and help even novices become effective troubleshooters: 

  • Build a culture of blameless post-mortems. What series of events ultimately caused the issue you just remediated? When something goes wrong, everyone can learn from it and avoid that issue in the future. But it needs to be a safe and shame-free conversation that motivates improvement rather than casting blame.

  • Encode expert knowledge into repeatable practices that anyone can execute. Monitors, runbooks and automated processes can help here; we’ll talk more about those in a minute.

  • Create good onboarding training for new team members. If people have a solid understanding from the beginning, they will be able to build their knowledge foundation much more quickly… not to mention be effective much sooner.

  • Make everyone an expert by providing regular trainings such as team pizza lunches. Invite others to do trainings in their areas of expertise. Use these trainings to help everyone build a better understanding of Kubernetes itself and of the production environment. You want all engineers to have at least basic knowledge of the big picture, what behavior is normal and why things are done the way they are.

  • Organize hackathons and tech days in your company to work on tricky problems, make fast progress and bring people together. Create cross-functional teams that work to solve problems and teach each other about their areas of specialty.

  • Make it fun! Make learning opportunities a positive experience and find innovative ways to make them enjoyable and entertaining.

3. Implement Tools That Bring all Observability Data Together  🛠

As an SRE, your goal is to equip your teams with powerful tools and establish the necessary infrastructure to maintain reliability. To overcome the limitations of Kubernetes' built-in troubleshooting capabilities and manage the large volume of data from multiple sources, it's important to use tools that can bring all data together. Having all your Kubernetes troubleshooting data in one place, together with the ability to store it over time and view everything in context, will give everyone the insight they need to quickly identify and address issues.  

You also want to implement a toolset that makes it easy to correlate all your data. By correlating and analyzing metrics, events, logs, traces and changes (such as configuration changes and new deployments) together, you create the most comprehensive understanding of the behavior of clusters, individual resources and your application as a whole.
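To make the idea of correlation concrete, here is a minimal sketch in Python. It assumes every record has already been normalized into one shape; the Signal class, its fields and the sample data are hypothetical and are not taken from any particular observability tool. Given the time of an alert, it pulls together everything recorded for the same resource within a time window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    """One observability record. The fields are illustrative, not a real schema."""
    kind: str            # "metric", "event", "log", "trace" or "change"
    resource: str        # for example a service or pod name
    timestamp: datetime
    summary: str


def correlate(signals, resource, around, window=timedelta(minutes=10)):
    """Return the signals for one resource within `window` of `around`, oldest first."""
    nearby = [
        s for s in signals
        if s.resource == resource and abs(s.timestamp - around) <= window
    ]
    return sorted(nearby, key=lambda s: s.timestamp)


# Hypothetical data: a deployment shortly before an error spike.
t0 = datetime(2024, 1, 1, 12, 0)
signals = [
    Signal("metric", "checkout", t0, "error rate above threshold"),
    Signal("change", "checkout", t0 - timedelta(minutes=4), "new deployment rolled out"),
    Signal("log", "checkout", t0 + timedelta(minutes=1), "connection refused"),
    Signal("event", "payments", t0, "pod rescheduled"),
    Signal("change", "checkout", t0 - timedelta(hours=3), "config map updated"),
]

timeline = correlate(signals, "checkout", around=t0)
for s in timeline:
    print(s.timestamp.isoformat(), s.kind, s.summary)
```

In this made-up timeline the deployment appears directly before the error spike, which is exactly the kind of connection that is hard to spot when changes, metrics and logs live in separate tools.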

4. Create and Distribute Best-Practice Configurations for Dashboards and Monitors  🎛

The more time you spend setting up optimized troubleshooting processes for your team, the less time they will spend troubleshooting – and the more reliable your environment will be. Use your expertise to build reusable tools for your team, such as dashboards, monitors and optimal default configurations. 

The right dashboards will immediately point troubleshooters at the most likely areas to investigate. With the relevant queries already set up in dashboards displaying centralized, correlated data, your teams won’t waste time switching between tools, dashboards and views to find what they need. 

In addition to configuring good dashboards, it also takes expertise to know exactly what to monitor and time to set it up. Make it a priority to encode your knowledge and turn it into processes and policies that your monitoring tools automatically apply. Define monitors that proactively look for common problems and generate immediate alerts when behavior deviates from expectations.  
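As a rough illustration of a monitor that alerts when behavior deviates from expectations, the Python sketch below compares a new measurement with recent history. The three-sigma rule, the function name and the latency numbers are assumptions chosen for the example; real monitoring tools offer their own, usually richer, detection methods.

```python
from statistics import mean, stdev


def deviates(history, current, max_sigma=3.0):
    """Return True when `current` lies more than `max_sigma` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to define "normal" yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat history: any change is a deviation
    return abs(current - mu) / sigma > max_sigma


# Hypothetical request latencies in milliseconds for one service.
latency_history = [120, 118, 125, 122, 119, 121]

for value in (123, 400):
    status = "ALERT" if deviates(latency_history, value) else "ok"
    print(f"latency={value}ms -> {status}")
```

Encoding a rule like this once and applying it to every service is what turns one person's sense of "that looks wrong" into a check that everyone benefits from.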

5. Help Team Members Visualize Service Maps and Understand Dependencies  🗺

In today’s fast-changing Kubernetes environments, applications depend on a multitude of different services, and many services depend on other services. Kubernetes resources such as clusters and pods come and go rapidly, and deployments happen continuously. It’s hard for SREs – and even harder for developers building individual services – to see how everything is connected and to track down what ultimately caused an issue.  

Anyone who troubleshoots Kubernetes services needs a holistic picture of how everything fits together to really understand what’s happening and remediate effectively. You can support this process by providing a dynamic service map that connects the dots and helps teams visualize dependencies across the environment. Use trace information to help you build this map, combined with metrics, logs and events that tell you about the issue that needs to be resolved.   

Put infrastructure in place that can discover and update relationships between all services and resources, combine dependency information with observability data and store it all over time. Then any engineer can see how everything fits together and understand how services they don’t own may be affecting their service (or vice versa). They can also easily track how a change in one area (configuration change, new deployment, etc.) affects the whole environment. 
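The sketch below shows, in Python, one simple way to answer "what could be affected if this service changes?" once dependency data is available. The service names and the shape of the map (each service mapped to the services it calls) are hypothetical; in practice the map would be discovered from traces rather than written by hand.

```python
from collections import defaultdict, deque


def impacted_by(depends_on, changed):
    """Return every service that directly or transitively depends on `changed`.

    `depends_on` maps a service name to the list of services it calls.
    """
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in seen:
                seen.add(service)
                queue.append(service)
    seen.discard(changed)  # only relevant if the map contains a cycle
    return seen


# Hypothetical service map.
service_map = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}

print(sorted(impacted_by(service_map, "inventory")))
```

A developer who owns only the inventory service can see from this that a change there may surface as a problem in catalog, checkout or frontend.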

6. Automate Observability and Software Delivery Processes  ⚙

Once you find a process that works well to optimize Kubernetes observability, find a way to execute it repeatably, then automate its execution. To be most effective, automate all the steps that lead to production, not just the production steps themselves. 

Automation helps both share knowledge and alleviate the need for deep knowledge. Create processes other teams can use out of the box or that happen automatically. We’ve already talked about automating collection, aggregation and correlation of metrics, logs, events and traces. As part of this automation, you will also likely be automating alerts and notifications. For example, when you set up monitors that are automatically applied for all services, they will run in the background to keep an eye on the health of Kubernetes resources, including outright failures, performance degradations, behavior deviations and other anomalies.  

Although it’s not exactly observability, if you haven’t already, set up continuous integration and deployment (CI/CD) pipelines to automate deployment and management of your Kubernetes services. CI/CD reduces the risk of human error and improves the reliability and consistency of your deployments, so your teams will have less to troubleshoot. 

By automating observability and delivery processes, you can reduce the time and effort required for all engineers to monitor and manage services running on Kubernetes clusters, and you will ensure that issues are detected and resolved quickly and efficiently.

7. Build Guard Rails That Automatically Implement and Check for Control Policies  🛤

Guard rails are an effective form of both control and automation. As an SRE, part of your job is to articulate policies on how to set up and use infrastructure, as well as to create processes that support compliance and security. Some examples of compliance and security policy checks for Kubernetes include:  

  • Check if a publicly accessible, business-critical service is configured in a highly available fashion (e.g., at least three Kubernetes pods at all times) and with redundancy (e.g., pods are replicated in multiple data centers).  

  • Report a warning state if a Docker container uses an image that doesn’t come from a trusted source. 

  • Check for the required audits of all pipelines that push deployments to production. 

  • Make sure resource requests and limits are set for all pods, since pods without them are the first to be evicted when a node runs short of resources. 

  • Ensure Ingress resources are not exposed without TLS, which would allow unencrypted, unsecured traffic to the application. 

Make it easy for teams to follow policies by building guard rails that validate what goes into production. You can build security and compliance policy checks into your monitoring to help ensure policies are followed correctly from the start. Once these policies are articulated in monitors, they will be automatically applied to everything that's pushed into production.  

Whenever a deviation occurs, such as a team member’s failure to follow an established guideline or policy, it will be immediately flagged. The team will see what’s happening and can quickly address it before it grows into a bigger issue that is much harder to find. If teams are just looking at dashboards rather than using active monitors, it’s much harder to apply policies and create immediate alerts. 
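A few of the example checks above can be sketched in Python as functions that inspect manifests loaded into plain dictionaries. The field names follow the standard Kubernetes Deployment and Ingress layout, but the trusted registry, the minimum replica count and the sample manifests are assumptions made up for this illustration, and a real guard rail would typically run inside a policy or monitoring tool rather than a standalone script.

```python
TRUSTED_REGISTRIES = ("registry.example.com/",)  # hypothetical trusted source
MIN_REPLICAS = 3                                 # assumed availability policy


def check_deployment(manifest):
    """Return a list of policy violations for a Deployment manifest dict."""
    violations = []
    spec = manifest.get("spec", {})
    # Kubernetes defaults to one replica when the field is omitted.
    if spec.get("replicas", 1) < MIN_REPLICAS:
        violations.append(f"fewer than {MIN_REPLICAS} replicas")
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        if not container.get("image", "").startswith(TRUSTED_REGISTRIES):
            violations.append(f"{name}: image is not from a trusted registry")
        if not container.get("resources", {}).get("limits"):
            violations.append(f"{name}: no resource limits set")
    return violations


def check_ingress(manifest):
    """Return a list of policy violations for an Ingress manifest dict."""
    if not manifest.get("spec", {}).get("tls"):
        return ["ingress has no TLS configuration"]
    return []


# Hypothetical manifests, as they might look after parsing YAML.
deployment = {
    "kind": "Deployment",
    "spec": {
        "replicas": 2,
        "template": {"spec": {"containers": [
            {"name": "web", "image": "docker.io/someone/web:latest"},
        ]}},
    },
}
ingress = {"kind": "Ingress", "spec": {"rules": []}}

for problem in check_deployment(deployment) + check_ingress(ingress):
    print("WARNING:", problem)
```

Because each check returns a plain list of messages, the same functions could feed a warning state in a monitor or block a pipeline step, depending on how strictly a policy should be enforced.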

8. Provide Step-by-Step Guided Remediation Assistance  👣

One of the best ways to make every engineer effective in troubleshooting is to give them guidance on what to do in a particular situation so they can remediate the issue quickly. 

You can’t be everywhere, but you can enable automatic troubleshooting guidance based on common issues and runbooks. To do this, you need to combine expert practices, observability data, dependency data and smart detection of the probable cause of an issue into a system that creates step-by-step instructions on what to do. Guided remediation can point developers to relevant data, help them understand the issue, assist them in making a judgement call or even tell them what command to execute to solve the issue. 
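In its simplest form, guided remediation is a lookup from a detected symptom to the steps an expert would take. The Python sketch below assumes the probable cause has already been detected and is expressed as a pod status reason; the runbook wording, the function name and the pod and namespace names are hypothetical, while the kubectl commands shown are standard ones.

```python
# Runbooks keyed by pod status reason. The steps are illustrative only.
RUNBOOKS = {
    "CrashLoopBackOff": [
        "Read the logs of the crashed container: kubectl logs {pod} -n {namespace} --previous",
        "Check whether a recent deployment or configuration change preceded the crashes.",
    ],
    "ImagePullBackOff": [
        "Look at the pull error in the pod events: kubectl describe pod {pod} -n {namespace}",
        "Verify the image name, tag and registry credentials.",
    ],
    "OOMKilled": [
        "Compare memory usage with the configured limit: kubectl top pod {pod} -n {namespace}",
        "Raise the memory limit or investigate a possible memory leak.",
    ],
}

DEFAULT_STEPS = [
    "Inspect the pod: kubectl describe pod {pod} -n {namespace}",
    "Escalate to the on-call SRE if the cause is still unclear.",
]


def remediation_steps(reason, pod, namespace):
    """Return numbered, ready-to-follow steps for a detected pod problem."""
    steps = RUNBOOKS.get(reason, DEFAULT_STEPS)
    return [
        f"{number}. {step.format(pod=pod, namespace=namespace)}"
        for number, step in enumerate(steps, start=1)
    ]


for line in remediation_steps("CrashLoopBackOff", "checkout-abc", "shop"):
    print(line)
```

A fuller system would choose the runbook from correlated observability and dependency data instead of a single status reason, but the principle of turning expert knowledge into explicit, ordered steps stays the same.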

By implementing these 8 SRE best practices, you can significantly improve Kubernetes troubleshooting efficiency and effectiveness for all teams in your organization, both SREs and developers. 

What else can you do to make things easier for everyone…?