Troubleshooting and Fixing Kubernetes CrashLoopBackOff
Everyone working with Kubernetes will sooner or later see the infamous CrashLoopBackOff in their clusters. No matter how basic or advanced your deployments are and whether you have a tiny dev cluster or an enterprise multi-cloud cluster, it will happen anyway. So, let’s dive into what CrashLoopBackOff actually is and the quickest way to fix it. Fasten your seat belts and get ready to ride.
What Is a CrashLoopBackOff?
Let’s get straight to the point. CrashLoopBackOff is how Kubernetes deals with broken containers. If your pod is in a CrashLoopBackOff state, at least one container inside it is stuck in a loop of crashing and restarting. That’s the raw definition, but to understand what it actually means, you may need a quick reminder about how Kubernetes works in general.
Kubernetes manages your containers. That’s one of its primary purposes. When you create a Kubernetes deployment, you define that you want to have a certain number of containers running from a specific Docker image. Kubernetes will then download that Docker image and run the specified number of containers from it. But that’s not where its job ends. Kubernetes will constantly monitor all deployments and make sure they still match the specification.
Imagine that the application running in one of the containers crashes. This will most likely result in a container dying. Kubernetes will soon realize that instead of one instance of the application running, it has zero instances.
Therefore, it will quickly start a new container to make sure that the state of the cluster matches the desired configuration. But there’s a catch. What if that error in your application was not a random, one-off glitch but something that will come up again after a restart? Let’s say, for example, that a file that your application needs is not available. In this case, after Kubernetes restarts the container, your application will crash again if the file is still not available. And if it crashes again, Kubernetes will again realize that it’s missing one container and, therefore, will quickly start yet another one. The cycle will repeat until the file becomes available or until you spot the problem and do something about it. That’s the `CrashLoop` part of CrashLoopBackOff.
As you can imagine, in a situation like that Kubernetes would waste quite a lot of its brain power constantly restarting the container. If your containers are lightweight, this could mean restarting them multiple times per second. To avoid that, Kubernetes doesn’t just blindly restart a container all the time. After each crash it waits longer before trying again: by default the delay starts at 10 seconds and doubles with every crash (10s, 20s, 40s and so on) up to a cap of five minutes, and it resets once the container has run for 10 minutes without problems. That gives your application some time to rethink its life decisions, get some fresh air and cool down. That’s the `BackOff` piece.
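The back-off schedule can be sketched in a few lines of Python. This is only an illustration, assuming the kubelet defaults of a 10-second initial delay that doubles after each crash and is capped at five minutes. The function name and its parameters are made up for this example and are not part of any Kubernetes API.

```python
def backoff_delays(restarts, base=10, cap=300):
    """Return the assumed wait in seconds before each of the first `restarts` restarts."""
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(delay)
        # Double the wait after every crash, but never exceed the cap.
        delay = min(delay * 2, cap)
    return delays


print(backoff_delays(7))
```

Under these assumptions, a container that keeps crashing reaches the five-minute ceiling after only a handful of restarts. That is why a pod in CrashLoopBackOff can appear to sit idle for minutes at a time.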
So, to be precise, when you see that your pod is in CrashLoopBackOff, it means that it’s in that period of time between restarts. And here’s something important that you need to understand: CrashLoopBackOff is not the root cause of the problem. It’s a state of the pod that indicates that there’s an issue with one of the containers. This is important to remember because it means that in order to fix CrashLoopBackOff, you need to first find the underlying error that’s causing it. And yes, there can be plenty of different reasons for it.
What Can Cause CrashLoopBackOff?
Now you know that CrashLoopBackOff itself is only an indication that Kubernetes wasn't able to fix your pod by simply restarting it. But because it can't do much more than that, it tries over and over again, hoping for external help in the meantime (such as the user or CI/CD process correcting the pod configuration). CrashLoopBackOff is basically a big red warning light, and in order to fix it, you need to find out what caused it.
What can cause CrashLoopBackOff? I can give you a few of the most common examples so you’ll have an idea of what we are talking about, but more importantly, later in this post, I’ll show you a universal method for debugging and finding the root causes.
One of the most common reasons for CrashLoopBackOff is human error. This means that you’ll most likely see CrashLoopBackOff after a deployment. There could be, for example, a simple typo in a deployment definition or a missing environment variable, volume or config file. A wrong networking configuration could be another reason. If your application needs to connect somewhere but can’t, it may crash (depending on how you handle errors).
Here’s a real-life example that you can test yourself. Below is a simple pod definition that will result in CrashLoopBackOff:
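apiVersion: v1
kind: Pod
metadata:
  name: broken-pod
spec:
  containers:
  - image: busybox
    name: broken-pod
    command:
    - wrongcommand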
Why will it result in CrashLoopBackOff? Because we specify that we want to run a command called `wrongcommand`, which, as you can probably guess, does not exist in our container. Kubernetes doesn’t know what’s inside the image, so it can’t validate whether that command exists, and it will try to run the container anyway. The container will fail to start, Kubernetes will try again, it will fail again, and the pod will end up in a CrashLoopBackOff state.
$ kubectl get pods
NAME         READY   STATUS              RESTARTS   AGE
broken-pod   0/1     ContainerCreating   0          3s
$ kubectl get pods
NAME         READY   STATUS             RESTARTS     AGE
broken-pod   0/1     CrashLoopBackOff   1 (2s ago)   12s
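If you have many pods, scanning the STATUS column by eye gets tedious. Here is a minimal Python sketch that picks out crash-looping containers from the JSON that `kubectl get pods -o json` returns. It relies on the `status.containerStatuses[].state.waiting.reason` field of the pod status. The function name and the sample data are hypothetical and written by hand for illustration; they are not real cluster output.

```python
def crashlooping_containers(pod_list):
    """Return (pod, container, restart_count) tuples for containers waiting in CrashLoopBackOff."""
    found = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # A container is in exactly one state: waiting, running or terminated.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append((pod_name, status["name"], status.get("restartCount", 0)))
    return found


# Hand-written sample shaped like `kubectl get pods -o json` (hypothetical data).
sample = {
    "items": [
        {
            "metadata": {"name": "broken-pod"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "broken-pod",
                        "restartCount": 1,
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "healthy-pod"},
            "status": {
                "containerStatuses": [
                    {"name": "app", "restartCount": 0, "state": {"running": {}}}
                ]
            },
        },
    ]
}

print(crashlooping_containers(sample))
```

Against a real cluster, you would pass in the parsed output of `kubectl get pods -o json` instead of the sample dictionary.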
These are just a few common examples of what could cause CrashLoopBackOff, but don’t forget that CrashLoopBackOff is a state of the pod that resulted from an underlying error. It’s best to learn how to debug and find the root cause of CrashLoopBackOff, and that’s exactly what we’re going to do now.
How Do I Fix CrashLoopBackOff?
Now that you know what CrashLoopBackOff is and what it isn’t, let’s practice debugging and fixing it. As we already established, the first thing you need to do is find the actual underlying error. The best place to start is the `kubectl describe pod [pod_name]` command, which in many cases can lead you right to the source of the problem. Let’s see if it can help with the broken-pod example above.
$ kubectl describe pod broken-pod
Last State: Terminated
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "wrongcommand": executable file not found in $PATH: unknown
Exit Code: 128
Somewhere in the container details (you can also check the Events section at the bottom of the output), you should see a message that tells you directly that the container failed because the `wrongcommand` executable was not found. This is most likely due to a typo in the command or because you’re trying to run a binary that’s not available in the container. To fix CrashLoopBackOff in this case, you need to fix the typo in the command or rebuild your Docker image to include the missing binary.
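The same termination details are available in machine-readable form under `status.containerStatuses[].lastState.terminated` in the output of `kubectl get pod [pod_name] -o json`. The sketch below pulls them out and attaches a hint for a few widely used exit-code conventions. The hint table is an illustrative assumption based on common shell and signal conventions, not something Kubernetes itself reports, and the function name is hypothetical.

```python
# Common exit-code conventions (illustrative and not exhaustive).
EXIT_CODE_HINTS = {
    1: "generic application error",
    126: "command found but not executable",
    127: "command not found",
    137: "killed by SIGKILL (128 + 9), often an out-of-memory kill",
    143: "terminated by SIGTERM (128 + 15)",
}


def last_terminations(pod):
    """Summarize the last terminated state of every container in a pod object."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if not terminated:
            continue  # this container has not crashed yet
        code = terminated.get("exitCode")
        results.append({
            "container": status["name"],
            "exit_code": code,
            "message": terminated.get("message"),
            "hint": EXIT_CODE_HINTS.get(code, "no common hint, read the message and logs"),
        })
    return results


# Hand-written sample based on the describe output above (hypothetical JSON shape).
sample_pod = {
    "metadata": {"name": "broken-pod"},
    "status": {
        "containerStatuses": [
            {
                "name": "broken-pod",
                "lastState": {
                    "terminated": {
                        "exitCode": 128,
                        "message": 'exec: "wrongcommand": executable file not found in $PATH',
                    }
                },
            }
        ]
    },
}

for entry in last_terminations(sample_pod):
    print(entry["container"], entry["exit_code"], entry["hint"])
```

The exit code alone rarely tells the whole story, so treat the hint as a starting point and always read the message next to it.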
Unfortunately, sometimes `kubectl describe` won’t give you an answer straight away. Sometimes you’ll only see something like this:
Normal Created 17s (x2 over 18s) kubelet Created container broken-pod2
Normal Started 17s (x2 over 18s) kubelet Started container broken-pod2
Warning BackOff 15s (x2 over 16s) kubelet Back-off restarting failed container
This usually happens when the issue is not in the container configuration (as in the previous example) but in the application running inside the container. What do you do then? Well, if the issue is with the application, the best place to find answers is the application logs. You can read them by executing the `kubectl logs [pod_name]` command.
$ kubectl logs broken-pod2
Loading cache data...
Loading shop configuration...
ERROR: Wrong shop configuration detected...
You can see the error now. In this case, the application is reporting that the wrong configuration was loaded.
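One practical detail: while a pod is in its back-off period, the current container may not have produced any output yet, and the logs you want belong to the previous, crashed instance. `kubectl logs` has a `--previous` flag for exactly that. Below is a small Python sketch that builds such a command. The helper names are hypothetical, and `fetch_logs` assumes `kubectl` is installed and configured for your cluster.

```python
import subprocess


def logs_command(pod, container=None, namespace=None, previous=True):
    """Build the argument list for a `kubectl logs` call."""
    cmd = ["kubectl", "logs", pod]
    if container:
        cmd += ["-c", container]  # needed when the pod has several containers
    if namespace:
        cmd += ["-n", namespace]
    if previous:
        cmd.append("--previous")  # logs of the last terminated instance
    return cmd


def fetch_logs(pod, **kwargs):
    """Run kubectl and return whatever it printed (requires cluster access)."""
    result = subprocess.run(logs_command(pod, **kwargs), capture_output=True, text=True)
    return result.stdout + result.stderr


print(" ".join(logs_command("broken-pod2")))
```

Building the argument list separately from running it keeps the command easy to inspect and reuse, for example to print it in a runbook before executing it.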
So far, we’ve focused on pods and containers because CrashLoopBackOff directly relates to container issues. But if you don’t find anything useful in a pod description or container logs, you can also take a look one layer higher. Normally, pods are created by some other resource, such as a Deployment, DaemonSet or StatefulSet. You can, therefore, also inspect those to find potential issues. To do that, simply execute `kubectl describe [resource_type] [resource_name]`, for example `kubectl describe deployment broken-deployment`, and, as with pods, look for any error indications, especially in the Events section.
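If you are not sure which resource created a pod, its `metadata.ownerReferences` field tells you. Here is a minimal Python sketch that turns those references into `kubectl describe` commands. The function name and the sample object, including the ReplicaSet name, are hypothetical. Note that a pod created by a Deployment is usually owned by a ReplicaSet, which in turn is owned by the Deployment, so you may need to repeat the step on the ReplicaSet.

```python
def describe_owner_commands(resource):
    """Return one `kubectl describe` argument list per owner of the given object."""
    metadata = resource["metadata"]
    namespace = metadata.get("namespace", "default")
    commands = []
    for owner in metadata.get("ownerReferences", []):
        # kubectl accepts the lowercased kind as a resource type, e.g. "replicaset".
        commands.append(
            ["kubectl", "describe", owner["kind"].lower(), owner["name"], "-n", namespace]
        )
    return commands


# Hand-written sample pod metadata (hypothetical names).
sample_pod = {
    "metadata": {
        "name": "broken-deployment-abc12-xyz34",
        "namespace": "default",
        "ownerReferences": [
            {"kind": "ReplicaSet", "name": "broken-deployment-abc12"}
        ],
    }
}

for cmd in describe_owner_commands(sample_pod):
    print(" ".join(cmd))
```

A standalone pod, such as the broken-pod example earlier, has no owner references, so the function returns an empty list for it.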
If there’s one thing that you should remember from this post, it’s the fact that CrashLoopBackOff is not your main problem. This is important because once you see CrashLoopBackOff, you need to focus directly on finding the actual, underlying error. And if you remember that CrashLoopBackOff means an issue with a container, then you’ll know exactly where to start your debugging.