MTTD: An In-Depth Overview About What It Is and How to Improve It
What Is MTTD?
Mean time to detect, or simply MTTD, is the average time it takes to discover an incident that led to a failure. In other words, it's the average amount of time that passes between a failure happening and its discovery.
Software engineers make use of metrics to measure how well they're performing. There are many metrics used to track success, like how long it takes to deploy a new feature or how many deployments you can make in a day. Additionally, some metrics track failure or, more specifically, incidents that cause failure. Properly identifying and managing incidents is a key aspect of maintaining a healthy system and guaranteeing customer satisfaction.
Why MTTD Matters
MTTD is a key performance indicator that gives insights into the effectiveness of your incident management process. Failures can lead to downtime in which your system is out of service. The sooner you can detect failures, the faster you can restore service and maintain productivity.
A lingering indicator light on your car's dashboard means something is wrong, and you need to find that problem as soon as possible or else it can lead to further damage. It's the same with ignoring anomalies that seem minor—allowing your system to fester will risk major failures and increase the amount of remediation the system needs before it can return to an optimal condition.
How to Measure MTTD
To get your MTTD, you'll have to divide the total time to detection by the number of incidents within a particular period.
Let's look at this scenario. Your system broke down four times in the past week. Each time it took five, seven, four and eight minutes, respectively, for the system to detect the failures and send an alert to the engineers. The total time to detect was 24 minutes, putting the MTTD at six minutes.
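The arithmetic in this scenario can be sketched in a few lines of Python. This is only an illustration: the function name `mean_time_to_detect` is an assumption made for the example, not part of any particular tool.

```python
def mean_time_to_detect(detection_minutes):
    """Return the average detection time, in the same unit as the inputs."""
    if not detection_minutes:
        raise ValueError("need at least one incident to compute MTTD")
    return sum(detection_minutes) / len(detection_minutes)


# Detection times, in minutes, for the four incidents in the scenario.
detection_minutes = [5, 7, 4, 8]

mttd = mean_time_to_detect(detection_minutes)
print(f"Total time to detect: {sum(detection_minutes)} minutes")  # 24 minutes
print(f"MTTD: {mttd} minutes")  # 6.0 minutes
```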
Tracking this metric over time will give you the ability to predict how long, on average, it takes your system to detect failures when they happen. Tracking MTTD will also identify areas of the system that may be problematic and need attention.
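One way to track the metric over time is to record when each failure started and when it was detected, then average per week. The sketch below assumes incidents are available as pairs of timestamps; the function name `mttd_by_week` and the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import datetime


def mttd_by_week(incidents):
    """Map (ISO year, ISO week) to the mean minutes to detect in that week."""
    minutes_per_week = defaultdict(list)
    for failed_at, detected_at in incidents:
        iso = failed_at.isocalendar()
        week = (iso[0], iso[1])  # group by the week the failure started in
        minutes = (detected_at - failed_at).total_seconds() / 60
        minutes_per_week[week].append(minutes)
    return {
        week: sum(values) / len(values)
        for week, values in sorted(minutes_per_week.items())
    }


# Hypothetical incident log: (failure started, failure detected).
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 7)),
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 14, 8)),
    (datetime(2024, 1, 9, 11, 0), datetime(2024, 1, 9, 11, 5)),
    (datetime(2024, 1, 11, 16, 0), datetime(2024, 1, 11, 16, 4)),
]

for week, mttd in mttd_by_week(incidents).items():
    print(week, mttd)
```

With this sample data, the MTTD drops from 7.5 minutes in the first week to 4.5 minutes in the second, which is the kind of trend worth watching.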
What Is a Good MTTD?
There isn't a specific industry-standard range for MTTD, but it should be kept as low as possible. It should concern you when the MTTD starts reaching several hours or days. Longer detection times usually mean longer downtime, which affects productivity and leads to unhappy customers.
An observability system should be able to detect and report anomalies within minutes of their occurrence. The team can then decide whether they are anomalies that matter, that is, anomalies that are likely to cause issues and that should be proactively addressed before they do. When you resolve an anomaly before it turns into a failure, the MTTD for that issue is effectively zero, because there was never a failure to detect. Obviously, the more issues you can prevent, the better.
How to Lower MTTD
Lowering MTTD is dependent on having solid processes in place to quickly identify failures when they happen. Failures are part and parcel of managing an IT environment. It's therefore important to establish and maintain good practices to reduce the overall risk.
Here are some tips on how to keep your IT environment running reliably and lower your MTTD.
Automate your incident management process. Have a system in place to handle detection, testing, monitoring and deployment. Have multiple checks in place to detect anomalies and alert the engineers as soon as an anomaly occurs, before it becomes an incident. Preventing problems doesn't specifically lower MTTD, but it does lower downtime, the number of incidents and the failure rate.
Reduce the size of your deployments. The main goal of practicing DevOps is to deploy high-quality code as frequently as possible. By having smaller deployment sizes, you're reducing the risk of shipping broken code or bugs to production. Blue/green deployment makes it easier to reverse a deployment, too, should it not meet quality expectations.
Maintain good communication with developers and customers. Constant feedback is another important aspect of good practices. An error, bug or report can come from a complaint made by a user and this can kickstart the incident recovery process.
Properly log incidents when they happen. This will be useful as historical data to predict when failures are likely to occur and what causes them. It can also help in identifying the root cause of recurring failures. Finally, a strong process will help you to identify the bottlenecks in your incident management process.
Have your developer environment ready to receive work at all times. The right tools and permissions for detecting and reporting bugs must be available. Again, a festering system will cause more damage. Having these tools ready will reduce the amount of time the bug sits in your system waiting to be picked up.
Have an observability solution in place to help you manage your IT environment, ensure maximum reliability, proactively alert on anomalies and enable deep visibility into the health state of all of the components within your environment.
Detecting failure is a critical part of the incident recovery process. There are several metrics you can use to track incidents. Measuring these other incident metrics alongside MTTD will help you assess the reliability and efficiency of your incident management system.
Mean Time to Recovery
After detection, the next step, naturally, is recovery. Mean time to recovery (MTTR) is the average time it takes to recover from an incident. MTTR is how long it takes to return to normal service after discovering a failure. (The "R" in MTTR can also stand for repair, remediate or restore.)
To get the MTTR, divide the total amount of downtime by the number of incidents within a particular period. Suppose it took the engineering team one, four, five and 10 hours, respectively, to fix four issues that occurred in one week. The total downtime would be 20 hours, which puts the MTTR at five hours for that week.
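The same calculation as a short Python sketch, using the repair times from the example. The function name `mean_time_to_recovery` is an assumption made for illustration.

```python
def mean_time_to_recovery(repair_hours):
    """Return total downtime divided by the number of incidents."""
    if not repair_hours:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(repair_hours) / len(repair_hours)


# Hours needed to fix each of the four issues in the example week.
repair_hours = [1, 4, 5, 10]

total_downtime = sum(repair_hours)
mttr = mean_time_to_recovery(repair_hours)
print(f"Total downtime: {total_downtime} hours")  # 20 hours
print(f"MTTR: {mttr} hours")  # 5.0 hours
```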
A low MTTD will directly improve MTTR. The sooner you find a problem, the faster you can recover from it. MTTR is a very important metric for keeping track of the stability of your entire system, that is, how quickly it returns to normal when things go wrong.
Your MTTR shouldn't exceed one day. Elite performers have their MTTR at less than an hour. It should be concerning when it stretches past one week and toward one month. (For further discussion about MTTR and MTTD, see our related blog post, "MTTD Versus MTTR: What Is the Difference?")
Mean Time Between Failures
Before detecting and recovering from failure, there must have been an incident in the first place. The mean time between failures (MTBF) is the average time between one incident and the next. It's a measure of how often failures occur. The MTBF should be as long as possible.
MTBF is determined by dividing the total uptime—the period when the system was operational—by the number of failures.
Still using the same example, a week has 168 hours. Let's say the total downtime was 20 hours. This means the uptime was 148 hours. Dividing the uptime by the number of incidents (four) puts the MTBF at 37 hours. That is, the system works an average of 37 hours before breaking down.
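The MTBF arithmetic for this example can be sketched as follows. The function and constant names are assumptions made for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def mean_time_between_failures(period_hours, downtime_hours, failure_count):
    """Return uptime divided by the number of failures in the period."""
    if failure_count <= 0:
        raise ValueError("need at least one failure to compute MTBF")
    uptime_hours = period_hours - downtime_hours
    return uptime_hours / failure_count


# One week, 20 hours of total downtime, four incidents.
mtbf = mean_time_between_failures(HOURS_PER_WEEK, downtime_hours=20, failure_count=4)
print(f"MTBF: {mtbf} hours")  # 37.0 hours
```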
Mean Time to Failure
Mean time to failure (MTTF) is the average time a system or component runs before it fails for good. How is it different from MTBF, you might ask? MTTF measures the period leading up to failures you can't recover from, while MTBF is the time that passes between incidents that cause repairable failure. Failures you can't recover from are catastrophic failures and can involve hardware components, like a hard drive crash or a server failure due to overheating. They can also be software issues, like a virus or malware.
You get the MTTF by dividing the total operating time by the number of items or products. Under a common convention, if you add up MTTF and MTTR (that is, the time to failure plus the time to recovery), you'll get the MTBF. (Hopefully, you won't have a failure you can't recover from and won't be measuring MTTF!)
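As a rough sketch of the MTTF calculation, assume five identical hard drives that each ran until they failed. The operating hours below are made-up numbers for illustration only, and the function name `mean_time_to_failure` is hypothetical.

```python
def mean_time_to_failure(operating_hours_per_item):
    """Return total operating time divided by the number of items."""
    if not operating_hours_per_item:
        raise ValueError("need at least one item to compute MTTF")
    return sum(operating_hours_per_item) / len(operating_hours_per_item)


# Hypothetical hours each of five drives ran before failing.
operating_hours = [900, 1100, 1000, 950, 1050]

mttf = mean_time_to_failure(operating_hours)
print(f"MTTF: {mttf} hours")  # 1000.0 hours

# The relationship described above: time to failure plus time to recovery.
mttr = 5.0  # hours, assumed here purely for illustration
mtbf = mttf + mttr
print(f"MTBF: {mtbf} hours")  # 1005.0 hours
```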
Conclusion: Tracking MTTD Leads to a Healthier IT Stack
You can't escape bugs and failures, but what you can do is manage them properly.
MTTD, MTTR and the rest of the incident metrics will help you in this regard by tracking the effectiveness of your incident recovery process.
Analyzing issues contributes greatly to the overall health and performance of your entire system. Understanding why incidents happen is important for preventing similar incidents from occurring in the future, thus ensuring you deliver good, quality code. However, without a strong process in place for handling incidents and resolving them, your ability to continuously improve the health of your IT environment will be low.
This is why you should have a clear process for recognizing, logging and resolving the anomalies that matter: those that really can cause issues. Developers can learn a lot from anomalies and actual incidents that are consistently logged and triaged, with the result that, over time, your system becomes even healthier and more reliable.
About StackState: How we help
At StackState, our topology-powered observability solution can help you to shorten MTTD and MTTR, lengthen MTBF and avoid unrecoverable failures altogether to ensure maximum uptime. The StackState platform can ingest data from other IT tools, including data lakes, monitoring solutions and other sources of data, correlate that data and provide a unified topology of your entire IT environment.
This enables you to quickly pinpoint errors when they occur and even proactively prevent errors from occurring when the health of a component starts to change. When an incident does occur, our time-traveling capabilities enable you to go back in time to the state of your environment right before the incident. You can see the change that triggered the incident, thus quickly pinpointing root cause and resolving the issue.