Major challenges IT Operations have to deal with

1. The high cost of application downtime

Let’s face it: application downtime is something your enterprise cannot afford. As IT infrastructures grow more complex, fixing outages quickly becomes a huge challenge for organizations. On average, it can take IT operations more than half an hour to resolve an IT failure. Half an hour doesn’t sound like much at first, but when your business is software-defined (and most enterprises are these days), it can add up to financial disaster. International Data Corporation (IDC) has researched the implications of IT system downtime and found these sobering facts:

For Fortune 1000 companies, the average total cost of unplanned application downtime per year is $1.25 billion to $2.5 billion.
The average cost of an infrastructure failure is $100,000 per hour.
The average cost of a critical application failure is $500,000 to $1 million per hour.

The facts speak for themselves: application downtime drives up business costs drastically. Resolving IT failures is, and always will be, one of the biggest challenges for IT operations, and it remains a pressing one in 2016.
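To put the half-hour figure in perspective, here is a rough back-of-the-envelope calculation. It uses the IDC hourly figures quoted above and makes two assumptions of our own: the outage lasts exactly 30 minutes, and cost scales linearly with duration. The function and constant names are purely illustrative.

```python
# Hourly cost figures quoted from IDC above (USD per hour).
INFRA_FAILURE_PER_HOUR = 100_000
CRITICAL_APP_FAILURE_PER_HOUR = (500_000, 1_000_000)  # low and high estimate

# Assumption: a single outage of exactly half an hour.
ASSUMED_OUTAGE_MINUTES = 30


def outage_cost(hourly_cost: float, minutes: float) -> float:
    """Cost of one outage, assuming cost scales linearly with duration."""
    return hourly_cost * minutes / 60


infra_cost = outage_cost(INFRA_FAILURE_PER_HOUR, ASSUMED_OUTAGE_MINUTES)
app_cost_low, app_cost_high = (
    outage_cost(hourly, ASSUMED_OUTAGE_MINUTES)
    for hourly in CRITICAL_APP_FAILURE_PER_HOUR
)

print(f"Infrastructure failure: ${infra_cost:,.0f}")
print(f"Critical application failure: ${app_cost_low:,.0f} to ${app_cost_high:,.0f}")
```

Under these assumptions, a single half-hour incident costs roughly $50,000 for an infrastructure failure and between $250,000 and $500,000 for a critical application failure.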

2. Too many teams in the problem-solving kitchen

Figuring out what caused a problem is complicated. Every Dev/Ops team has its own part to play in controlling and maintaining the total stack. But when problems occur, this division of responsibility often makes it difficult to determine where they originated. A simple scenario, which you’ll probably recognize, demonstrates this familiar challenge:

In the evening, the infra team upgrades some middleware with the help of their provisioning tool Chef. After a functional test everything seems OK.
The next day the finance Dev/Ops team detects a higher than normal error rate for the sales service. This is detected with the help of their Splunk dashboard.
The finance team contacts the sales Dev/Ops team. This team uses AppDynamics and sees that there is a time-out, but they can’t figure out what caused it.
Next, a crisis team is formed, with people from different teams, including a member of the infra team. They collectively figure out that the middleware update most likely caused the problem.
Finally, the infra team rolls back the middleware upgrade. The problem is solved, but valuable time is wasted.

This problem-solving scenario involves too many teams. To speed up the process, you could create a downtime action plan and have every team record their daily changes and upgrades, but that’s not really the best way to move forward. At StackState, we think the better way to deal with this challenge is to fully automate the problem-finding process across teams, reducing the time-to-repair to a minimum.
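To make the idea concrete, here is a minimal sketch of what automated problem-finding across teams could look like for the scenario above. It is not StackState’s implementation. It assumes that every team’s changes end up in one shared change log and that a dependency map between components exists. The component names, the timestamps, the 24-hour lookback window and all function names are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    team: str
    component: str
    description: str
    at: datetime


# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "sales-service": ["middleware"],
    "middleware": [],
}


def upstream(component, depends_on):
    """Return the component plus everything it transitively depends on."""
    seen, stack = set(), [component]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def suspect_changes(component, incident_at, changes, depends_on,
                    lookback=timedelta(hours=24)):
    """Changes to the component or its dependencies that fall inside the
    lookback window before the incident, most recent first."""
    scope = upstream(component, depends_on)
    hits = [
        change for change in changes
        if change.component in scope
        and incident_at - lookback <= change.at <= incident_at
    ]
    return sorted(hits, key=lambda change: change.at, reverse=True)


# Invented data mirroring the scenario: an evening upgrade, an error the next day.
changes = [
    Change("infra", "middleware", "middleware upgrade via Chef",
           datetime(2016, 1, 11, 20, 0)),
    Change("finance", "billing-ui", "unrelated front-end fix",
           datetime(2016, 1, 11, 21, 0)),
]
incident_at = datetime(2016, 1, 12, 9, 30)

suspects = suspect_changes("sales-service", incident_at, changes, DEPENDS_ON)
for change in suspects:
    print(f"{change.at:%Y-%m-%d %H:%M} {change.team}: {change.description}")
```

With this data, the only suspect is the infra team’s middleware upgrade, which is the conclusion the crisis team reached by hand. The unrelated front-end fix is filtered out because the sales service does not depend on that component.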

3. Adaptation

Newer agile technologies and processes, such as Dev/Ops, continuous deployment, containerization, microservices and private, public or hybrid cloud computing, keep arriving and changing rapidly. Changes come at a higher frequency, are more granular and make the environment more complex. As application updates and changes in the IT landscape grow exponentially, adapting becomes a complex challenge with a tremendous impact on IT operations. Yes, new Dev/Ops solutions will pop up for each technology stack, and this is a good thing. But it’s time to tackle the adaptation challenge head-on, by implementing an automated and integrated approach.

4. The Dev/Ops “Freedom of Choice” conundrum

We have written about freedom of choice in an earlier blog post. It seems like a good thing at first, because it is important for Dev/Ops teams to choose their own tools. But the problem presents itself when too many different tools are used within teams. This leads to multiple dashboards and data streams that require continuous reconciliation to understand the overall health of the team’s stack. This manual process is time-consuming and error-prone. Since most teams use different tool sets while also depending on services from other teams, the lack of unified health data between them is a real deal breaker. The ability to remove waste and find problems efficiently across the whole stack is the key to dealing with this challenge.
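As a rough illustration of what unified health data means in practice, the sketch below maps the status vocabularies of two tools onto one shared scale and rolls them up so that the worst status wins. The tool names (`tool_a`, `tool_b`), their status strings and the worst-status-wins rule are assumptions made for this example. They do not describe any real product’s API.

```python
from enum import IntEnum


class Health(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


# Hypothetical per-tool status vocabularies mapped to one shared scale.
NORMALIZE = {
    "tool_a": {"green": Health.OK, "yellow": Health.WARNING, "red": Health.CRITICAL},
    "tool_b": {"normal": Health.OK, "degraded": Health.WARNING, "down": Health.CRITICAL},
}


def normalize(tool, raw_status):
    """Translate one tool's raw status string into the shared Health scale."""
    return NORMALIZE[tool][raw_status.lower()]


def overall_health(reports):
    """Roll up (tool, raw_status) pairs; the worst status wins.
    An empty list of reports is treated as OK (an assumption)."""
    return max((normalize(tool, status) for tool, status in reports),
               default=Health.OK)


reports = [("tool_a", "green"), ("tool_b", "Degraded")]
stack_health = overall_health(reports)
print(stack_health.name)
```

Once every tool reports on the same scale, the reconciliation that teams do by eye across several dashboards becomes a single function call.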

5. Too much data, no information

Organizations are using a wide variety of tools and systems for monitoring, deployment and incident management, producing a deluge of different types of data. A large volume of data isn’t a problem if it’s turned into useful information; the challenge for IT operations is translating it all into something meaningful to the business.

IT operations teams store information in different silos or systems. Some organizations have started to apply big data analytics to a single type of operations data, like huge sets of metric streams, and this helps a bit. But without context, it doesn’t always show how a problem relates to critical business services. Analysis that draws on only one kind of data source produces a poorer result. Data just for data’s sake doesn’t make sense.

Dealing with these challenges in 2016 is hard, but doable. The future of IT operations is automation. At StackState, we’re building an advanced IT operations platform that will make it easier for Dev/Ops teams to overcome these obstacles, ultimately making 2016’s challenges a thing of the past.

