Getting Actionable Insights from Massive Amounts of Observability Data

In today’s world, there’s no lack of monitoring data. In fact, the amount of data DevOps and SRE teams are dealing with on a daily basis can be confusing and overwhelming.

Often, the problem is threefold:

  1. Different teams are responsible for different parts of the environment,

  2. Each team uses their own observability solution, and

  3. Each solution comes with different data types.

Teams, tools and data types are the new silos

When you’re dealing with these different silos (teams, tools and data types), it is often impossible to find out how different services in an environment affect one another. Team A might get a notification that service X in their environment is experiencing latency issues, but it’s very difficult for them to determine that this issue was caused by a change that occurred in an environment team B is responsible for.

The mass move to the cloud added even more complexity to the problem. With the rise of continuous integration and continuous delivery, cloud environments are affected by continuous, incremental changes. Moreover, containers spin up and down. A container that existed a few seconds ago might have drastically impacted your customer’s checkout time and is now… gone. How are IT teams supposed to find the root cause of an issue if the container that caused the issue is not there anymore? And how are they supposed to find the root cause if it lies in a component that is neither part of ‘their’ environment nor visible to the tool they are using?

At the same time, customer experience depends on the reliability and stability of a company’s entire IT stack, not just the separate silos that were created by various teams taking ownership of different parts of the stack.

The more data, the better?

Tools like Splunk are widely installed to address some of these issues. First: they can bring monitoring data from different tools together into a central data lake. Second: they enable teams to sift through and query monitoring data more easily. 

However, even though many IT teams are using Splunk to collect observability data in one place, they still end up dealing with massive amounts of data that they need to make sense of - especially when there is an issue. Whereas previously, IT teams would say “if there’s data, let’s use it,” today, more and more DevOps and SRE teams are starting to realize that more data is not always a good thing. Too much data can be overwhelming and confusing, and it takes attention away from the things that actually matter.

Moreover, these massive amounts of data often lack context: it requires a lot of manual work for IT teams to determine how the services are related to each other and how a change in service A can cause an issue in service B. If observability data is not contextualized, it often triggers a lot of unnecessary alerts, resulting in stressful alert storms. 

Needless to say, the need for holistic observability - a clear view of a company’s entire environment, including how all of the services are related to each other - is more pressing than ever.

Webinar: Driving Business Performance With Observability in Financial Services

In our upcoming webinar, our customer Sander shares how he and his team faced similar challenges at Nationale-Nederlanden Bank, a large financial organization in the Netherlands. After using a Splunk data lake to bring all observability data together, they realized they needed help to add a contextualized layer to the massive amounts of data they were collecting.

Sander and his team implemented StackState to add a topology layer on top of their data. This layer correlates observability data across several monitoring solutions and across all the different components and services in their highly complex IT landscape - whether they are in the cloud or on-premises.
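To make the idea of a topology layer a bit more concrete, here is a minimal sketch in Python. It is not StackState’s implementation or API. The service names, the dependency map, the change events and both helper functions are hypothetical and only illustrate the general principle: once you know which services depend on which, an alert on one service can be linked to a recent change in another.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["fraud-check"],
    "cart": [],
    "fraud-check": [],
}

# Hypothetical change events collected from different teams' tools:
# (timestamp in seconds, service, description)
CHANGES = [
    (100, "cart", "config update"),
    (940, "fraud-check", "new deployment"),
]


def upstream_services(service, depends_on):
    """Return the service plus everything it depends on, directly or
    indirectly, in breadth-first order."""
    seen = [service]
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen


def candidate_causes(service, alert_time, window, depends_on, changes):
    """List changes on the alerting service or any of its dependencies
    that happened within `window` seconds before the alert."""
    related = set(upstream_services(service, depends_on))
    return [
        change
        for change in changes
        if change[1] in related and alert_time - window <= change[0] <= alert_time
    ]


# An alert fires on "checkout" at t=1000; look back 300 seconds.
print(candidate_causes("checkout", 1000, 300, DEPENDS_ON, CHANGES))
```

In this toy example, the team that owns "checkout" is pointed at a deployment of "fraud-check", a service two hops away that another team may own. That is the kind of cross-silo link that is hard to spot in raw, uncontextualized data.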

Join the webinar to learn why and how NN Bank uses StackState’s topology-powered observability solution to: 

  • Get better visibility into their entire stack,

  • Improve Mean Time To Repair, and

  • Significantly improve their Net Promoter Score.

Want to join? You are very welcome. Sign up (for free) here. We look forward to seeing you on Thursday, March 3rd!

