An answer to the alert storm: introducing Team View Alerts
4 min read
As a Dev or Ops professional, it’s hard to focus on the things that really matter. Applications, systems, tools and other environments generate notifications faster and in greater volume than you can deal with. It's a problem for every Dev and Ops professional.
Alerts are used to identify trends, spikes or dips in your metrics and events, for example to detect low free memory, high page-fault errors or unavailable database servers. With the right alerts in place, you are notified of problems before they escalate, or you can respond quickly before a problem takes down a business service and affects your customers.
But most companies don’t have the right alerts.
When problems occur, they have to manually correlate all alerts, metrics, events and log files from different tools to get contextual information and to understand the problem they are dealing with. How do you know which alerts to focus on and which can wait?
The first step is knowing the purpose of each system, its dependencies and the business services (that rely upon those systems) for which you're responsible. This will help guide you to what is important enough to get up for in the middle of the night, and what can wait until the next business day. This is really the most important part. I think you've all made assumptions in the past about alert X, Y or Z and have at least once been burned by a bad assumption.
We all know that being a Dev or Ops means you use a lot of different tools to keep your business running. It’s something that we talk about all the time here at StackState, and we’re always thinking about how we can make it easier for Dev and Ops to understand the alerts they receive from all the different tools: where to focus, what happened in the stack and how to solve (or even prevent) an outage.
That’s why today, I want to tell you more about the alerting capabilities of StackState.
Cut the noise
StackState gives you the ability to run checks over not only one type of data or tool, but any kind of monitoring data and DevOps tool in any combination, taking the entire stack into account.
Checks are functions, either pre-supplied or user-defined, that receive one or more monitoring streams, anomaly models and parameters as input and produce states and events as output. StackState’s alerts notify you when metrics violate the check or anomaly model you set. Just add a check to a component and you're good to go. Of course, you can reuse this action as a template for all components found by our agent or by your own discovery solution, like Zipkin or HP uCMDB.
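To make the idea of a check as a function concrete, here is a minimal sketch in Python. It is not StackState's actual API: the `State` values, the `free_memory_check` name and the threshold parameters are all hypothetical, chosen only to illustrate a function that takes a metric stream plus parameters and returns a state.

```python
from enum import Enum
from statistics import mean
from typing import Sequence


class State(Enum):
    """Hypothetical health states; the names are illustrative only."""
    UNKNOWN = "unknown"
    CLEAR = "clear"
    DEVIATING = "deviating"
    CRITICAL = "critical"


def free_memory_check(
    samples_mb: Sequence[float],
    deviating_below: float,
    critical_below: float,
) -> State:
    """Map a window of free-memory samples (in MB) to a state.

    The average of the window is compared against two thresholds,
    which play the role of the check's parameters.
    """
    if not samples_mb:
        # No data in the window: we cannot judge the component.
        return State.UNKNOWN
    average = mean(samples_mb)
    if average < critical_below:
        return State.CRITICAL
    if average < deviating_below:
        return State.DEVIATING
    return State.CLEAR
```

Called as `free_memory_check([300, 280, 260], deviating_below=512, critical_below=128)`, this sketch would report a deviating state, because the average of the window sits between the two thresholds.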
Now you’re able to cut the noise that’s coming from all the different tools that you’re using.
But there is more.
You can add checks on single components, but ALSO on a specific team view or part of the IT stack that you're monitoring.
With StackState, it’s easy to create a specific team view of the stack. This view consists of multiple components, dependencies and other relevant information fed by our own agent and the different DevOps tools you use. Each team view has an overall consolidated state that's based on this information. This view state is clearly visible and will only trigger an alert notification when the TOTAL state of the stack that your team is monitoring changes.
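The effect of alerting on a consolidated view state can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how StackState computes view states: the `TeamView` class, the `State` levels and the "worst component wins" rule are all hypothetical.

```python
from enum import IntEnum
from typing import Callable, Dict


class State(IntEnum):
    """Hypothetical states, ordered from healthy to worst."""
    CLEAR = 0
    DEVIATING = 1
    CRITICAL = 2


class TeamView:
    """A named group of components with one consolidated state."""

    def __init__(self, name: str, notify: Callable[[str], None]) -> None:
        self.name = name
        self._notify = notify
        self._components: Dict[str, State] = {}
        self._view_state = State.CLEAR

    def update(self, component: str, state: State) -> State:
        """Record a component state and notify only on a view-level change."""
        self._components[component] = state
        # Assumption: the view is as unhealthy as its worst component.
        new_state = max(self._components.values())
        if new_state != self._view_state:
            old_state = self._view_state
            self._view_state = new_state
            self._notify(f"{self.name}: {old_state.name} -> {new_state.name}")
        return self._view_state
```

In this sketch, a second component going to DEVIATING while the view is already DEVIATING sends nothing. Only the transition of the whole view produces a message, which is what keeps the notification count low.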
StackState's alerting capabilities make it easy to send alerts directly to Slack or PagerDuty, or to notify you via your own favorite notification tool. Whether you’re using 1 tool or 50, when you receive an alert, it’s not one to ignore.
Our goal is to create the 100% uptime enterprise and to ensure our customers can guarantee optimal performance to their customers. StackState aggregates information from a multitude of sources and existing tools to create a visual model in which your key business processes are reflected along with their dependencies and states. It provides Dev, Ops, Architects and Managers with a real-time blueprint of the entire IT stack.
On top of this full-stack Insight we allow for Investigation with automated root cause analysis, event correlation and anomaly detection. After Insight and Investigation come Remediation and Prevention. With techniques from the worlds of Artificial Intelligence and Machine Learning we are currently building out these elements of our vision. They will make it possible to resolve issues automatically before they become a real problem.
If you’re interested in learning how StackState can provide you with the right alerts, just request a demo and we’ll give you a personal tour of our solution.