Why Observability Is the Way to Go | StackPod E2

Georg Höllebauer - enterprise metrics architect at APA-Tech

Georg has nearly 30 years of experience in monitoring and observability under his belt, so we are excited to learn more from him. We talk about his early days as a support engineer and why and how he became an enterprise metrics architect. Also, since Georg is our customer, we're delighted that he wants to share why he uses StackState to make his demanding job a little easier (for himself as well as his team).

Enjoy this episode. Also, if you prefer to read: you can find a written transcript of the episode below.

Episode transcript

Georg: [00:00] Every time we saw the same movie. The monitoring gets red, red, red, thousands of alarms, nobody knows what's the root cause. Everybody's panicked, and real total chaos.

Annerieke: [00:17] Hey there, and welcome to the StackPod. This is the podcast where we talk about all things related to observability because that’s what we do and that’s what we’re passionate about, but also what it’s like to work in a tech company. So if you are interested in that, you are definitely in the right place.

Annerieke: [00:34] I am very excited to announce the next guest of the show: Anthony will be talking to our customer Georg Höllebauer and Georg is an Enterprise Metrics Architect at APA IT. APA IT is responsible for all IT services for the Austrian Press Agency, which is the largest press agency in Austria, and they also have other customers they work with. In this podcast, Georg shares about his career, how he became an enterprise metrics architect, what it means, what he loves about his job and – because he’s our customer – also how he uses StackState to make his demanding job a little easier. Enjoy the podcast.

Anthony: [01:24] Hey Georg.

Georg: [01:15] Hello, hi Anthony.

Anthony: [01:16] How you doing today?

Georg: [01:18] Fine. Thanks. Everything's okay. And you?

Anthony: [01:22] Not too bad. So Georg, why don't you just introduce yourself, what you do, a little bit about APA, and why its full name, the Press Association, is a little bit misleading in terms of what you guys do.

Georg: [01:41] Yes. Because the Austrian Press Agency is our parent company.

Anthony: [01:47] Yeah.

Georg: [01:47] Yeah. And, this APA IT, where I work, we do all the IT stuff for them. And not only for them, but also for other customers. So let's start, as I already mentioned, I work at the Austrian company APA IT, information technology is the full name. My actual job title is Enterprise Metrics Architect. A little bit about the history, my history there, I started 25 years ago as a normal help desk support engineer with 24 hour, seven days a week shifts. So we also worked at night. Every time there are real people who you can call.

Georg: [02:41] And after a few years, I became the chief of the help desk and operations team. At that time, the administration of nearly all metrics and monitoring tools was transferred to my team. So all these monitoring and metrics tools, they were separated, administered by different teams. And I took everything to my team because we are in the center of everything. So the administration went to my team, and in 2015, I left this position and I started to work only for metrics, monitoring, and observability in our company.

Anthony: [03:27] Okay.

Georg: [03:29] And at the moment, I'm responsible for all projects relating to metrics, monitoring, and observability.

Anthony: [03:38] Okay. I used to, my first job in IT was working, I was 18 years old and I got a job at T-Systems down in Toulouse, France. And they do, well at the time, they did all the IT for Airbus. So I was literally helping people with level one IT questions in the UK, from France, having to work, whenever they were open in the UK. So it could be nighttime, it could be daytime, it could be on the weekends. And I find there are two ways out of that, because you can't do that forever. For your mental health, you've got to think about it, from the fundamental standpoint, is that you're dealing with nothing but problems all day, every day.

Georg: [04:31] Yes. That's the point.

Anthony: [04:32] So yeah, I found that there were two ways out of it. You can either become really technical. And so you become, more important, for the more important things, or you make your way into management. The early version of myself, my 14 years ago self, would have said, "oh, I want to be a manager." But, what ended up happening is that, there's only one manager position, right? And it's not like people churn that role all the time. And then there's other people, that may have worked there longer. I find in support, it's usually like, whoever's got the most seniority gets the next promotion. That's generally how it goes. Right?

Georg: [05:16] Yes.

Anthony: [05:16] So I focused on going more on the technical route and the way I ended up getting out of the 24/7 thing, was by, getting up to third-level support, which is like, there's no emergencies, it's more root cause analysis at that point, is it a bug with the product? What happened at that time? And that was far better, even though you're still dealing with problems, you don't have people screaming down the phone at you, that there are issues. It's like, "okay, how do we do that?" So that's how I got out of it, anyway. But it sounds like to me that you went down more of the management path and then you eventually got out of it because you're like, "okay, let's just focus on the metrics and the observability component."

Georg: [06:03] Not really. Maybe it's a little bit misleading. Our chief back then had the bright idea that my job would be a blame captain.

Anthony: [06:18] Okay.

Georg: [06:20] So I was there at that time, I was already the best support engineer there. Yeah. So, and my former chief had some, serious illness and I had to take his job, it was some overnight thing. And my chief said, "okay, you can have his management job, but you're also, you stay a support engineer." So you are managing the whole team, but you are working like an engineer. So it was this blame captain. This was very hard. So I was to blame for everything. Not only for the problems that people had, but also for the problems my people had, for the problems the whole company had. Yeah. So, and that's why I had to get out there.

Anthony: [07:16] Yeah. So you still had all the problems, and the responsibility is partitioning the responsibility and the problems. So, yeah, that sounds like hell.

Georg: [08:11] It's something like, hell, yeah.

Anthony: [07:33] Yeah. And there is something to be said right? About the fact that, if you are in that position, and you kind of feel like you're the only person who can help, there is an element of importance to that. Do you know what I mean? The fact that people are calling you, the fact that people are demanding things from you, although it's stressful at the time.

Georg: [07:55] Yeah, no. It was definitely not all bad. The responsibility and the standing you have in the company, this was real good. Yeah. But it was not forever. It's not possible.

Anthony: [08:10] Yeah. So, we talked a little bit about, your career, how you've gotten here, a little bit about APA IT. Let's talk a little bit more about, what you're doing currently with StackState, because you are a customer. So, the whole point of this podcast is really to introduce people to the culture of StackState, and by and large, our customers allow us to exist first of all. And then, also if they're successful, then we're also successful. So, let's talk a little bit about your interactions with StackState, some of the history you've had with StackState, some of the people you've worked with, and some of their stories in terms of how they've helped you, but then also, how is our software helping you personally?

Georg: [09:00] Sure. When I got the chief of the network operation center, I built the complete monitoring stack new. And after that, like in many companies, you have one big monitoring tool, and then you have some special tools around. They pop up Prometheus, and all these kinds of stuff. They have the right to be there, everything is okay.

Georg: [09:31] But then, we realized we have too many monitors to look at. We got too many alerts from too many different tools, and we had no big picture. At the same time, we first heard about the term observability, and I read a few things about it and I was like, "okay. I think this is the way we have to go. We have to get a big picture, but without killing all these other monitoring tools." Yeah. They have the right to exist. They are special tools for cloud things, for Kubernetes, OpenShift, and all this kind of stuff. They have the right, because they do the things special for these kinds of services. And we tried to find the tool, like an umbrella, which can integrate all these different tools into one pane of glass, or something like that, this is a term you hear often.

Georg: [10:41] And we looked at some other observability tools, but they're very special. The most other observability tools, they try to kill all your monitoring tools. They try to push them aside and they tried to get the only one. And the cool thing about StackState, they can do that, but it's not their intention.

Anthony: [11:08] Yeah. It's like a byproduct of their capabilities. Right? It's like, "Hey, stick all your data into us." And then they're like, "oh, you get observability." But it's not really observability, you just see all your data in one place, like Splunk, right? They're very good at that.

Georg: [11:27] Yeah. This was the first thing we needed and the second one, observability comes then with tracing and instrumenting code, and these kinds of stuff. And these features, StackState is also integrating. So after a very short time, it was relatively clear that StackState is the perfect product for us. And the benefits we want to get out of this combination, this umbrella system, with combined observability, we want, that everyone in my company, the IT, maybe, even not in IT, maybe even in the parent company, some managers, they have the same view on the status and performance of any services we deliver.

Georg: [12:19] We want to become more proactive. And if something breaks, we want to troubleshoot more quickly. In the history, when something big breaks in the infrastructure, storage, or networking equipment or something like that... Every time we saw the same movie. The monitoring gets red, red, red, thousands of alarms, nobody knows what's the root cause. Everybody's panicked, and real total chaos. And the first few hours we try to identify the root cause. And this time has to be much shorter. Yeah. And that’s the time when my chief said, "Hey, Georg, we have to find something to be better there."

Anthony: [13:17] Yeah. There's the KPI, right? MTTR, right? Most people will be like, "oh, that's mean time to resolution." And I'm like, "well, actually, most tools out there today deal with mean time to react." Because it's like, "Hey, it's red. We did our job. That's great." You have one incident in ServiceNow that you need to react to. Great. It's like, "okay, fantastic. You've given me my PagerDuty alert that something's wrong." You've kind of used some fun AI algorithms to figure it out, but can you point out to me, what is the actual root cause? Because at the end of the day, that's the bottom line, right? If you want to reduce your mean time to resolution, you need to get to the root cause as quickly as possible.
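As a rough illustration of the two readings of MTTR that Anthony contrasts, the following Python sketch computes both from a couple of made-up incident records. The field names (`started`, `acknowledged`, `resolved`) and all timestamps are assumptions for the example, not data from APA IT or output from any tool mentioned in the episode.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the problem began, when someone
# reacted to the alert, and when the service was actually restored.
incidents = [
    {
        "started": datetime(2022, 1, 10, 9, 0),
        "acknowledged": datetime(2022, 1, 10, 9, 5),
        "resolved": datetime(2022, 1, 10, 12, 0),
    },
    {
        "started": datetime(2022, 2, 3, 14, 0),
        "acknowledged": datetime(2022, 2, 3, 14, 15),
        "resolved": datetime(2022, 2, 3, 15, 0),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average duration in minutes between two timestamps across records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# "Mean time to react": how quickly an alert is picked up.
mean_time_to_react = mean_minutes(incidents, "started", "acknowledged")
# "Mean time to resolution": how long until the root cause is fixed.
mean_time_to_resolution = mean_minutes(incidents, "started", "resolved")

print(f"react: {mean_time_to_react:.0f} min, "
      f"resolve: {mean_time_to_resolution:.0f} min")
```

In this toy data the team reacts within minutes, yet resolution still takes hours; the gap between the two numbers is the time spent hunting for the root cause.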

Anthony: [14:08] And sometimes, when everything's red, that becomes such a big distraction, right? Because you're going down so many different rabbit holes to kind of figure out, "okay, what was the first alert? Okay. Well that alert is a by-product of this going bad over here. Oh, wait. Prometheus is saying something else." And so a lot of tools that are in the AIOps space, they focused on, "Okay. Let's bring it all together into that single pane of glass that you were talking about, and then let's just use AI so that we can tell you when something is actually wrong." Which is fine. And that's great. It's progress in some way, shape or form. But it doesn't really have true observability, right? It can't give you a visual representation of your stack, if you will, or the application and, the infrastructure. Especially now, when we get into containers and Kubernetes and all this kind of other stuff, and serverless, where we've got Lambda scripts, running all over the place that becomes a nightmare to manage from an SRE standpoint.

Anthony: [15:16] Can you tell me a little bit about, what was the aha moment for you in terms of StackState, that made you say, "This is the solution for me."

Georg: [15:26] There are multiple moments. The first one was, when I saw how easy it was to integrate, even my Checkmk monitoring system, which it has no integration. So we built a custom integration. It was only a few lines of Python code, and we were ready. It's not, it was not so high, sophisticated, but it worked.
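To give a sense of how small such a glue script can be, here is a minimal Python sketch of the general idea: translating monitoring results into generic health events for an umbrella tool. It is not the integration Georg's team wrote. The numeric codes 0 to 3 are the usual Nagios-style service states that Checkmk also uses; the target health names, the event layout, and the function name `to_health_events` are assumptions made up for illustration.

```python
# Nagios-style service state codes, as also used by Checkmk.
CHECKMK_STATES = {0: "OK", 1: "WARN", 2: "CRIT", 3: "UNKNOWN"}

# Hypothetical health vocabulary of the receiving tool (an assumption).
HEALTH = {
    "OK": "CLEAR",
    "WARN": "DEVIATING",
    "CRIT": "CRITICAL",
    "UNKNOWN": "UNKNOWN",
}


def to_health_events(services):
    """Convert (host, service, state_code) rows into generic health events."""
    events = []
    for host, service, state_code in services:
        # Treat any unexpected code as UNKNOWN rather than failing.
        state = CHECKMK_STATES.get(state_code, "UNKNOWN")
        events.append(
            {
                "component": f"{host}/{service}",
                "health": HEALTH[state],
                "source": "checkmk",
            }
        )
    return events


# Made-up sample rows standing in for data read from the monitoring system.
rows = [("web01", "HTTP", 0), ("db01", "Disk /var", 2)]
events = to_health_events(rows)
for event in events:
    print(event)
```

A real integration would additionally have to fetch the rows from the monitoring system and deliver the events to the receiving tool; both steps are left out here because they depend on the specific products and versions.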

Anthony: [15:51] Yeah.

Georg: [15:52] So this was one. The second one was the magic of the timeline, where you can go back in time and see, or maybe play like a movie, see the problem actually happening, in the past.

Anthony: [16:09] Yeah.

Georg: [16:10] So you can go back and you don't have to read logs and all these kinds of, you just go back in time, and then you find the first thing, when that happened. You can really easily find what was the first problem, and what did change at that time, or shortly before the time, because this is the next one. It's very, very easy to integrate annotations into the timeline of StackState.

Georg: [16:48] So if somebody is doing a change on the switch, or on the server, they work in their change management system and they press the play button when they start to work. And there is a webhook, which creates this annotation in the timeline of StackState. So if I don't know about this change, then I go back in the timeline, see, "ah, my colleague made this change," and can jump right into the change record and see what he did.
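The workflow Georg describes (go back to the first problem, then look at what changed shortly before it) can be sketched in a few lines of Python. This is a simplified in-memory model, not StackState's timeline or API: the entry layout, the `kind` values, the 30-minute window, and the function name are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical timeline: alerts plus change annotations, in arbitrary order.
timeline = [
    {"time": datetime(2022, 3, 1, 10, 0), "kind": "change",
     "text": "Firmware update on switch sw-07"},
    {"time": datetime(2022, 3, 1, 10, 4), "kind": "alert",
     "text": "sw-07 port errors"},
    {"time": datetime(2022, 3, 1, 10, 6), "kind": "alert",
     "text": "db01 unreachable"},
    {"time": datetime(2022, 3, 1, 7, 30), "kind": "change",
     "text": "Certificate renewal on web01"},
]


def first_alert_and_recent_changes(entries, window=timedelta(minutes=30)):
    """Return the earliest alert and the changes made within `window` before it."""
    alerts = [e for e in entries if e["kind"] == "alert"]
    if not alerts:
        return None, []
    first = min(alerts, key=lambda e: e["time"])
    changes = [
        e for e in entries
        if e["kind"] == "change"
        and first["time"] - window <= e["time"] <= first["time"]
    ]
    return first, sorted(changes, key=lambda e: e["time"])


first, changes = first_alert_and_recent_changes(timeline)
print("First alert:", first["text"])
for change in changes:
    print("Changed shortly before:", change["text"])
```

With the sample data, the firmware update four minutes before the first alert is surfaced, while the unrelated certificate renewal from earlier that morning is not.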

Anthony: [17:16] Yeah. Yeah, no, I used to work for ServiceNow, I worked there for seven years almost. But, the CMDB, was always static, right. So, first of all, it was always incomplete. That, you either had a really good discovery for one thing, but then it was missing-

Georg: [17:42] It was never complete.

Anthony: [17:42] Yeah. And even if you do get a hundred percent, it's still static. So, you're only seeing the latest version of that, and there's no time travel capability. And again, as we move into a more serverless and containerized environment, you're going to need time travel capabilities, right? In order to be able to do that.

Anthony: [18:05] But then also, to be able to do point-in-time references, like you're doing with the webhooks, that's gold, right? You don't want to know how many changes have impacted a single component. You want to know, at the time the change was done, what did the environment look like? And what were some of the events and the alerts that you were getting at that time as well? So that then you can intelligently get better over time by saying, "okay, if a change happens in this part of the environment, we get this noise over here. So let's set a threshold and get iteratively better ourselves, so that we won't run around next time." And a lot of tools don't really give you that capability, because they're a black box. You don't have that ability to just get better at your own job in a way, it's all about the tools telling you, what they think is a problem, which, it doesn't always turn out to be true. Right?
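The difference between a static record and a point-in-time view can be shown with a small Python sketch: instead of overwriting a component's state, every state is kept with its timestamp, so the state "as of" any moment can be looked up. The class name `ComponentHistory`, the integer timestamps, and the state fields are assumptions for illustration and do not describe how any particular product stores its data.

```python
from bisect import bisect_right


class ComponentHistory:
    """Keeps every recorded state of one component, ordered by time."""

    def __init__(self):
        self._times = []   # timestamps in ascending order (e.g. epoch seconds)
        self._states = []  # state recorded at the matching timestamp

    def record(self, timestamp, state):
        # This sketch assumes records arrive in chronological order.
        self._times.append(timestamp)
        self._states.append(state)

    def as_of(self, timestamp):
        """Return the state that was current at `timestamp`, or None."""
        index = bisect_right(self._times, timestamp)
        return self._states[index - 1] if index else None


history = ComponentHistory()
history.record(100, {"version": "1.0", "health": "CLEAR"})
history.record(200, {"version": "1.1", "health": "CLEAR"})
history.record(260, {"version": "1.1", "health": "CRITICAL"})

# What did the component look like when the change at t=200 was made,
# and what does the latest record say?
print(history.as_of(200))
print(history.as_of(300))
```

A static inventory would only ever return the last of these records; keeping the history is what makes it possible to ask what the environment looked like at the time of a change.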

Georg: [19:06] That's right. Yeah. Yeah, so these are the three main things, we went for StackState.

Anthony: [19:13] Awesome. So, we're almost out of time now, is there anything you want to kind of leave with? Some kind of final comments and something you want to kind of lay out there in terms of maybe people who helped, things that we've helped you with? In terms of StackState?

Georg: [19:31] Yes. I think I have two examples where the StackState really, really, will help our operating team. The first one is also related to this timeline, nowadays it's a 24/7 shift, like I said before. So there is one operator sitting for eight hours, and then it comes the next operator. And now they talk to each other and saying, "Hey, this and this happened, this happened. And look there and look there." And it's just, the short story, what happened in the last eight hours, or what's happened in the time when the next operator who comes to the business was not in the company. I think when we have all this information on the timeline with all the changes, the problems and all this thing, he can just tell the next operator, "Hey, here, make a replay."

Georg: [20:44] Maybe it's a little bit science-fiction, but I think this will really help. And the next thing is, we talked about when something big breaks, in the classical monitoring systems, you have thousands of red alarms, and you want to find the root cause. So myself, I have nearly 30 years experience in monitoring, and not only monitoring, in our data center. So if I look at this big red mess, I may be able to tell, "okay, this must be a networking problem. Or this must be a storage problem. Or this must be an application problem." Only because of my experience.

Anthony: [21:23] Yeah.

Georg: [21:46] But, all my colleagues in the network operations center, they don't have this experience. And I think with StackState, I want to give them the ability to find the same things like myself, when I look at it. But without thousands of red alarms. This is the next one, we want the 24/7 operators to really quickly decide which special team they have to call.

Georg: [21:55] Do they have to call the networking guys? Do they have to call the storage guys? Or the Linux server or the Windows server guys? This is the first and most important decision.

Anthony: [22:05] Yeah. And they need to have the technical backup. Because if you're talking to a storage expert, and you say, "I think the issue is with you." They're going to find a million excuses to turn around and say, why it's not them. And it's the NOC, that you need to get. So, being able to turn around and say, "okay, I believe this is for you, because I have this, this, and this, that points to you guys as being the root cause of my issue. Can you prove me wrong?"
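The decision Georg describes, working out which team to call when many things are red at once, can be sketched in Python as a walk over a dependency map: a red component whose own dependencies are all healthy is a probable root cause, and its owner is the team to call. The topology, the ownership table, and both function names are made up for illustration; this is a simplified heuristic, not how StackState or any other product determines root cause.

```python
# Hypothetical topology: each component lists what it depends on.
DEPENDS_ON = {
    "webshop": ["app01"],
    "app01": ["db01"],
    "db01": ["san01"],
    "san01": [],
}

# Hypothetical ownership: which team to call for each component.
OWNER = {
    "webshop": "application",
    "app01": "linux",
    "db01": "linux",
    "san01": "storage",
}


def probable_root_causes(unhealthy):
    """Unhealthy components none of whose dependencies are unhealthy."""
    return sorted(
        component
        for component in unhealthy
        if not any(dep in unhealthy for dep in DEPENDS_ON.get(component, []))
    )


def teams_to_call(unhealthy):
    """Teams owning the probable root causes, without duplicates."""
    return sorted({OWNER[c] for c in probable_root_causes(unhealthy)})


# Everything is red, but only the storage array has no red dependency.
red = {"webshop", "app01", "db01", "san01"}
print(probable_root_causes(red))
print(teams_to_call(red))
```

Because the result names the component at the bottom of the red chain, the operator also has something concrete to point to when the called team asks why the problem should be theirs.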

Georg: [22:38] Exactly. That's, that's the point.

Anthony: [22:40] Yeah. No, that makes sense. That makes sense. Well, Georg we have to drop now, but thank you so much for your time. And for doing this. We're always interested in hearing how our customers are advancing with their uses of StackState.

Georg: [22:56] Okay. Thank you. And have a nice day, early day.

Anthony: [22:59] Thanks, man. Take it easy. Bye.

Annerieke: [23:03] Thank you so much for listening, we hope you enjoyed it. If you’d like more information about StackState you can visit stackstate.com, and you can also find a written transcript of this episode on our website. So if you prefer to read through what they’ve said, definitely head over there and also, make sure to subscribe so that you will receive a notification whenever we launch a new episode. Until next time.

Subscribe to the StackPod on Spotify or Apple Podcasts.

About StackState

StackState’s observability platform is built for the fast-changing container-based world. It is built on top of a one-of-a-kind “time-traveling topology” capability that tracks all dependencies, component lifecycles, and configuration changes in your environments over time. Our powerful 4T data model connects Topology with Telemetry and Traces across Time. If something happens, you can “rewind the movie” of your environment to see exactly what changed in your stack and what effects it has on downstream components.

Curious to learn more? Play in our sandbox environment or sign up for a free trial to try out StackState with your own data.
