Observability Is a Data Analytics Problem - StackPod Episode 16
Annerieke: Hey there, and welcome to the StackPod. This is the podcast where we talk about all things related to observability because that's what we do and that's what we're passionate about, but also what it's like to work in the ever-changing, dynamic, tech industry. So if you are interested in that, you are definitely in the right place.
Annerieke: In today’s episode, we are talking to fellow podcaster Dotan Horovits. Dotan is an open source and technology enthusiast: he is the host of the OpenObservability podcast, he writes articles about open source and observability, he is an avid speaker at events like KubeCon and Conf42, he’s a co-organizer of Cloud Native Computing Foundation’s local chapter in Tel Aviv, Israel and in his day job, he is a developer advocate at Logz.io.
Annerieke: One of Dotan’s recent blog posts is titled: “Observability is a Data Analytics Problem,” and in this blog post he explains that observability is not just the simple sum of logs, metrics and traces, but for effective observability you need to fix the data analytics problem. We wanted to invite Dotan to dive into that a bit more. So, listen to this episode, to hear Dotan and Anthony discuss why logs, metrics and traces are not enough anymore, what it takes to start solving the underlying data analytics problem, and what responsibility observability vendors - like Logz.io and StackState, for example - have in solving this problem. Let’s get started and enjoy the podcast.
Anthony: Hello, everybody. Welcome back to another episode of the StackPod, with your host here, Anthony Evans. Today, we're going to be talking to a good friend of ours, Dotan Horovits. Dotan, would you mind giving yourself a little bit of an intro, because I'm very excited to talk about observability today with you.
Dotan: Hi, Anthony. Glad to be here and thank you very much for inviting me. It's always good to meet fellow podcasters. I also have my own podcast, OpenObservability Talks, so glad to also meet other communities. So as you said, my name is Dotan Horovits. In my day job, I'm the principal developer advocate at Logz.io. At Logz.io we provide a cloud native observability platform that's, essentially, based on popular open source tools, such as ElasticSearch, OpenSearch, Prometheus, Jaeger, OpenTelemetry and so on, which is a perfect fit for me because my passion is open source, in addition to software technology and others. So, that's a passion of mine and I'm also a co-organizer of the CNCF's local chapter in Tel Aviv. I'm based in Israel. So, CNCF, the Cloud Native Computing Foundation, of course. And I co-organize DevOpsDays, and am involved in many other communities, and also run the podcast (if you're a podcast listener), the OpenObservability Talks podcast, that's about open source, DevOps and observability.
Anthony: Ah, that's really cool. Have you been based out of Israel, for your professional career, or have you relocated in...
Dotan: Yeah. For the large part, I've always been part of multinational organizations, but oftentimes based from here, from the Startup Nation.
Anthony: Yeah. Yeah. It's amazing how many start-ups are in Israel and the AI space is actually really hot there. And, actually, synthetic data is a really big market right now in Israel. There are a lot of synthetic data companies. For AI you need, basically, data models and data sets to run algorithms on, and so they create synthetic data. So, it could be a million different humans that go into an aircraft with a million different variables, from weight to whatever. Those are all examples of synthetic data that they would generate to then run AI on a data set. And Israel seems to be at the core of where all those companies are coming from. I don't know why, but it's a really interesting market and it's actually really useful, because the reason AI fails a lot of the time is that you just don't have enough good or reliable data to actually do anything on. But yeah, that's cool. Any local events coming up soon? How are things going after the pandemic and in-person stuff, how is it over there?
Dotan: So, actually it's really nice to see things waking up. Specifically in Israel, all the restrictions after the pandemic have been lifted. So, meetups are back and kicking. I organize the CNCF Tel Aviv chapter and I'm involved in several others, and we're now working on DevOpsDays Tel Aviv, which is going to happen at the end of the year. I also see that globally: I was just at KubeCon Europe in Valencia, Spain the other day, three weeks ago, talking about OpenTelemetry there on stage. It was a fascinating experience, seeing over 7,000 people in person and more than 10,000 on the virtual platform. That really feels like it's getting back. Not quite the 100% of what I remember of KubeCon, but nearly there. So, really, really exciting, and I'm invited to speak in Germany next month at another event, also in person. So it sounds like in many places around Europe and elsewhere, it's recovering. Great to see that.
Anthony: But anyway, let's talk about observability and where we're at in the current state of things. You've written some really interesting content and blog posts. And that was part of the reason why we wanted to have you on, and talk to you as somebody who is involved in the cloud native world as somebody who is excited about the new technologies. But you also appreciate one of the things that I did like about when we were talking before, is the fact that you don't have an idealistic view of the world. You do understand that technology is.
How to keep on track with the observability subject when technology stacks are evolving from older technologies to - for example - Kubernetes
Anthony: There are old technologies where we can't necessarily collect and observe them as well as we would like, compared to, let's say, an AWS Kubernetes application that has a lot more APIs and data points for us to observe. Tell me a little bit about that. What has really helped you keep on track with this observability subject as we've evolved with our technology stacks over time?
Dotan: It's a funny story. In my background, I come from software engineering and then I evolved into system architecture, system engineering and then solutions engineering and consulting other companies. So, in a way, I broadened my scope throughout my career to see the end-to-end systems: on the one end, understanding the needs and the pains of the users, and then matching them with what can be achieved with the current stack, with the alternative stacks, with the correct architecture to serve the purposes and, most importantly, with a realistic view. Because architecture could be a delight, could be an art, but then you can find yourself sinking into things that are much bigger and more bombastic than what you actually need to do. And there is a cost in maintaining such a system.
Dotan: So, I think this is a perspective that I've brought with me to the current roles. And in the past years, I've spent more time, specifically, in the DevOps and observability space: monitoring and what evolved into observability, and reliability and more terms that keep on coming. But, ultimately, it boils down to understanding the state of our system, understanding what goes on in the systems and being able to proactively engage, debug, optimize. Ultimately, that's been the need all along. The systems and the methodologies change. So, obviously the move from monolith to microservices and the introduction of cloud native architectures, in a way, you could say, complicated things, at least on the monitoring side. It simplified the ability to scale out the development, to segregate between different teams, all the benefits that we all know and love. But from a monitoring perspective, suddenly you have a myriad of pieces out there that you need to monitor, each one with its own endpoint and APIs, and lots of interactions and gateways and sidecars.
Dotan: And from the monitoring perspective, it definitely shook up the industry. People used to put in StatsD and Graphite and everything worked perfectly. These formulas no longer work. People realize that, and it's not only these specific tools. I love these open source tools, don't misunderstand me, but it's the very essence of them, for example, the hierarchical model for metric naming. It doesn't work when you work in so many dimensions. So, things like that shook up the industry and required a different approach. Let's put it this way.
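A minimal sketch of the contrast drawn here, assuming made-up metric names, label keys and values. It compares a Graphite-style dotted path, where each dimension has a fixed position in the name, with a label-based model, where any dimension can be used for grouping. The helpers `parse_path` and `sum_by` are hypothetical and not part of StatsD, Graphite or any other tool.

```python
from collections import defaultdict

# Hypothetical Graphite-style series: every dimension is baked into one dotted
# path, so its meaning depends on its position in the name.
hierarchical = {
    "prod.us-east.checkout.pod-1.requests": 120,
    "prod.us-east.checkout.pod-2.requests": 80,
    "prod.eu-west.checkout.pod-1.requests": 50,
    "prod.eu-west.cart.pod-1.requests": 30,
}

# Assumed positional schema for the paths above. Adding a new dimension
# (say, a version) would shift positions and break every existing query.
SCHEMA = ("env", "region", "service", "pod", "metric")


def parse_path(path):
    """Turn a dotted path into a dict of labels using the positional schema."""
    return dict(zip(SCHEMA, path.split(".")))


def sum_by(samples, label):
    """Aggregate (labels, value) samples along any single label."""
    totals = defaultdict(int)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


# The dimensional view: the same values, each tagged with key/value labels.
dimensional = [(parse_path(path), value) for path, value in hierarchical.items()]

by_service = sum_by(dimensional, "service")
by_region = sum_by(dimensional, "region")
print(by_service)
print(by_region)
```

With labels, grouping by service or by region is the same one-line query. With dotted paths, each new way of slicing the data needs its own wildcard pattern tied to the position of each segment.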
Anthony: Well, I think it sounds to me you have an approach, which is almost like a questioning approach, which is like, "Okay, with everything that we build, with everything that we deploy, why are we doing it?" And when I ask why is, what's it doing for the end user? Whoever is interfacing with my technology to get the outcome that it's been built and designed and functioning to provide, what are they seeing?
Scaling out the customer-first mindset to developers
Anthony: Having that mindset allows you to empathize a lot more with the customer and then if you have an empathetic mindset, it then drives everything that you're doing, because you're really just working around that transaction and the things that are servicing that transaction. Would you say that, that's a hard mindset to scale out with developers?
Dotan: I think for me it comes naturally, maybe because one of the episodes in my past career was as a product manager for developer platforms, for cloud orchestration, cloud management and things like that. So, I was facing the engineering personas, but as a product manager, the first thing you need to have is empathy. You need to understand the pains. You need to understand the challenges. You need to understand the experience of the end user and then work your way back to what I need to give, what I could give and how I can help make it as smooth as possible. So for me, it's natural, but I definitely get what you're saying, that many engineers don't come with this built-in mindset. And it's something that I've needed to advocate for, internally, to get the engineers to take the outside-in view, let's say, of what they're doing rather than the inside-out one.
Dotan: This is what we do rather than this is what you need. And this is how we can help you. This is very important, especially again, in the modern workloads that we can see today that are far more complex and they have so many third parties. Every typical system these days has the SQL database and the NoSQL database and the columnar and the graph database and then they have the API gateway and the proxy and whatever. So many third parties and open sources and cloud services.
Dotan: And this is part of the game. You don't control the third parties, but you still own them. You need to monitor them. You need to remediate if there is an issue, because your end-to-end service to your customers, that's what they see. So, this is something that we need to adapt and I think a good starting point for that is, SLOs. You need to focus on what the Service Level Objectives are and work your way back from, this is the SLO. And in order to meet that, that's what I need to do for management, for monitoring, for automation, for release pipeline, CI/CD, you name it.
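As a rough illustration of working backwards from an SLO, the sketch below turns an availability target into an error budget and reports how much of it has been consumed. The function name, the 99.9% target and the request counts are all assumptions made for the example, not figures from the conversation.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize how much of an availability error budget has been used.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    availability = 1.0 - failed_requests / total_requests
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": availability >= slo_target,
    }


# Hypothetical window: one million requests, 250 of which failed.
report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(report)
```

A report like this gives monitoring, automation and release decisions a shared number to work from: for instance, a team might slow down releases once most of the budget is consumed.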
Anthony: Well, the challenge you have is... talking about CNCF, I was at KubeCon back in October in LA, it was actually a really good technology event. Lots of really interesting people there to talk to. And I had a good time, but when I was walking around the booth, I could see Splunk, I could see Dynatrace, all these different sponsors, but everybody had one word in common on every single booth. And that word was, observability. And I was looking at it and I'm like, "Well, hang on a minute. If I'm Splunk, I provide observability, yes. But, do I provide end-to-end technology observability? No." Not unless I deploy Splunk everywhere and have Splunk ingest everything and pay billions of dollars to Splunk to have everything in a logging solution.
How to break down the silos in observability
Anthony: And even then, it would still be hard. You'd have to build out the dashboards. You'd have to fit it to the technology use case of observability. It is what it is. But then, thinking about that, and the fact that developers who aren't necessarily in the mindset of thinking the way you think, which is the bigger picture, let's just put it in that context, of what happens outside of my tool, what happens outside of my traces, what happens to those things? That's a different tool. And then that's why we end up with war rooms, because as soon as something breaks, you've got your app guy, you've got your network guy, you've got your cloud guy. They all come to the table, they all load up their own individual dashboards, which is their best-of-breed collection tools.
Anthony: And they're like, "Well, my Splunk doesn't tell me that there's an issue. My Dynatrace says that there is a problem." Yeah. And that's what a war room is. You basically build a topology on the fly where you're like, "Okay, what the hell changed?" And then you figure it out. But that's observability at the end of the day. And what do you think it's going to take for us to get out of that mindset? Because we've got these collectors everywhere, we've got this data. You're talking about this from a logging perspective. What, to you, is one of the answers here and where do we go?
Dotan: Well, I'm not speaking only from the logging perspective. I think that logging is a very limited, narrow mindset. Even my company, by the way, although named Logz.io because of its origins, provides observability across many other signals, metrics, traces, so we need to change our name. That's for sure. But observability is definitely more than just logging. It's more than the mere sum of the raw data. And one of the things that I find myself talking about often is fighting the misconception that, "Okay. You have the three pillars, the so-called holy three pillars of observability: logs, metrics and traces. You need to collect them and you have observability." And my answer is, "No." You have this on my LinkedIn post and it's actually from an article I had in insideBIGDATA magazine: logs, plus metrics, plus traces does not equal observability.
Dotan: Logs, metrics, traces are nice. They're very important, but they're ultimately raw data. And we need to get out of this mindset of raw data, the three signals, the three pillars of observability. First, because three is not a holy number. I have discussions ongoing now in OpenTelemetry and the TAG Observability in the CNCF about continuous profiling. I had, by the way, an episode already half a year ago or more about the topic of continuous profiling as another important signal. And people are talking about events and many other pieces of data that can augment your observability. So, first of all, raw data is not limited to these three pillars. Let's throw these three pillars out the window. It served us well up until this point, but now it's actually holding us back.
Anthony: Those three pillars were probably built by a monitoring company.
Dotan: That's true. It's true. I work for a company that grew out of logging and you mentioned Splunk and New Relic and others. Each one grew out of another very specialized signal, but now all of them are converging. This is, by the way, why you saw that at KubeCon. Everyone is putting observability on their booth, because no one stayed in their niche. People understood that if they stay in that niche, they become irrelevant. Everyone is now trying to provide a more holistic approach, because that's what we need as practitioners. That's what we need to observe our systems. So the first thing is, let's move away from the three pillars and open our minds: any raw data that can serve us to get better observability into our systems is more than welcome. That's one. The second is going beyond the raw data. The raw data is nice, but we need to look at observability as a data analytics problem, like any other.
Dotan: We have lots of raw data of many types, many signals and also many sources. Some of it comes from my front-end applications. Some of it comes from the back-end application. Some of it comes from infrastructure like AWS services, or Azure, or Google, or maybe open source that I run: my Kafka, my Redis, my Postgres, whatever. And the challenge now is to fuse all that data together and be able to draw insights to ultimately understand the state of my system. This is, by the way, the definition of observability that I'm trying to promote, moving away from the definition that everyone has been using from control theory to a definition that I think is more practical: observability is the means to be able to ask and answer questions about my system, as simple as that.
Dotan: And when you frame it as such, suddenly you see that the focus goes somewhere else. Suddenly my question is, okay, like in any data analytics problem, how do I enrich the data in a way that will help me correlate it? How do I correlate the data between, I don't know, traces and logs, just for an example, to know that they refer to the same thing? Or between metrics and traces? How do I add metadata from Kubernetes, from my own application context, business logic, account ID, customer ID, whatever, to be able to map things all the way to the database transaction with the business context that I need? And suddenly it opens up observability to many other use cases, like FinOps and BizOps use cases. So, that's just to start teasing the ideas.
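The enrichment and correlation idea can be sketched in a few lines. Everything below is an assumption made for illustration: the record shapes, the field names (`trace_id`, `pod`, `customer_id`) and the lookup tables standing in for Kubernetes metadata and business context. It is not the data model of any particular tracer, logger or vendor.

```python
from collections import defaultdict

# All records below are made-up examples, not the output of a real tracer or logger.
spans = [
    {"trace_id": "t1", "operation": "POST /checkout", "duration_ms": 950, "pod": "checkout-a"},
    {"trace_id": "t2", "operation": "GET /cart", "duration_ms": 40, "pod": "cart-a"},
]
logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "db timeout"},
    {"trace_id": "t2", "level": "INFO", "message": "cart loaded"},
    {"trace_id": None, "level": "WARN", "message": "no trace context"},
]
# Hypothetical lookup tables standing in for Kubernetes and business metadata.
pod_metadata = {
    "checkout-a": {"namespace": "shop", "node": "node-1"},
    "cart-a": {"namespace": "shop", "node": "node-2"},
}
customer_by_trace = {"t1": "customer-42", "t2": "customer-7"}


def enrich(span):
    """Return a copy of the span with infrastructure and business context added."""
    enriched = dict(span)
    enriched.update(pod_metadata.get(span["pod"], {}))
    enriched["customer_id"] = customer_by_trace.get(span["trace_id"])
    return enriched


def correlate(spans, logs):
    """Attach to each enriched span the log lines that share its trace ID."""
    logs_by_trace = defaultdict(list)
    for record in logs:
        if record["trace_id"] is not None:
            logs_by_trace[record["trace_id"]].append(record)
    return [
        {**enrich(span), "logs": logs_by_trace.get(span["trace_id"], [])}
        for span in spans
    ]


def customers_with_slow_errors(events, threshold_ms=500):
    """Answer a question about the system: who hit an error on a slow request?"""
    return [
        event["customer_id"]
        for event in events
        if event["duration_ms"] > threshold_ms
        and any(line["level"] == "ERROR" for line in event["logs"])
    ]


events = correlate(spans, logs)
print(customers_with_slow_errors(events))
```

Once the signals share a key and carry context, the question shifts from "which log line matters?" to one phrased in business terms. Note that the log line without a trace ID cannot be correlated at all, which is why consistent context propagation matters.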
Anthony: It was funny. I was working on a project and it was like, "Okay, we have AppDynamics deployed to do the tracing and the monitoring. We have our logging somewhere else. And then we have our ticketing somewhere else and then it's all on AWS, so all the configuration is in AWS." And so, I'm like, "Okay, what was the issue?" They're like, "Well, AppD told us there was a problem and then we found out it was a configuration issue four hours later." Okay. I'm like, "Okay, fine." So, in order to remediate that going forward, what you needed to do was take the AppD monitoring alert and align it to that change and get rid of the four hours. They were like, "Yeah. Why don't you just look at our logging?" And I'm like, "Well, what's the root cause in the logging or was that just noise?" And they're like, "Yeah, but we need the data."
Anthony: I'm like, "Why do you need the data in that context, if all it did was create noise and distraction that added four hours to your delay? Why are we bothering going through that?" You're just looking at it, you're unraveling their mind and they're like, "No. Well, that's where all the information is." I'm like, "Why? There. Change. Caused problem."
Anthony: So, that's my point though. And to your point, a lot of this mess actually contributes more to problems than it does to solutions. Because people are like, "Well, if I mine as much data as possible, that means that the answer is more than likely going to be on my screen when I have an issue." That's it. That's observability for most people. If something goes bang in the middle of the night, I've got an alert or a piece of information somewhere that I can then just take as proof. And a lot of people don't have that. That's why we have war rooms. That's why we have the dashboards. That's why we have everything, because it's so hard to get to that outcome.
The responsibility of observability vendors to reduce data and reduce noise
Dotan: And by the way, I think that as vendors, we are somewhat to blame. Again, I admit I also wear a vendor hat, although I'm not a salesperson or a marketing person. But vendors have traditionally been billing by volume and, therefore, there's a conflict of interest when it comes to actually encouraging their users to reduce the amount of data and reduce the noise. And even in my company, when I say, "Let's help our users reduce the noise and filter out things and drop the things that are irrelevant, or at least send them to an S3 bucket or whatever, and not put them in expensive memory if they're not needed, except maybe for an audit once in a blue moon," sometimes people maybe frown at me. But this is what I say; maybe it's my origins as a solutions architect consultant.
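One way to picture the filtering described here is a small routing step in front of the expensive store. The policy below is purely an assumption for illustration: the tier names, the `audit` flag and the level-based rules are hypothetical, and a real pipeline would define its own.

```python
from collections import Counter

# Hypothetical destinations: an expensive searchable index, cheap object
# storage (for example an S3 bucket), or nowhere at all.
HOT, ARCHIVE, DROP = "hot", "archive", "drop"


def route(record):
    """Pick a destination for one log record under an assumed policy."""
    level = record.get("level", "INFO")
    if record.get("audit", False):
        return ARCHIVE  # rarely read, but must be kept
    if level == "DEBUG":
        return DROP  # treated as noise in this sketch
    if level in ("WARN", "ERROR"):
        return HOT  # what an on-call engineer is likely to search
    return ARCHIVE  # everything else goes to cheap storage


records = [
    {"level": "DEBUG", "message": "cache probe"},
    {"level": "INFO", "message": "user login", "audit": True},
    {"level": "ERROR", "message": "payment failed"},
    {"level": "INFO", "message": "health check ok"},
    {"level": "WARN", "message": "retrying upstream call"},
]

destinations = Counter(route(record) for record in records)
print(destinations)
```

In this toy data set only two of the five records reach the expensive tier, which is the value-to-noise trade-off being argued for.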
Dotan: I want to have the best experience for the users. And the users are people like you and me, SREs, DevOps. People want to make sure that the system works as they need it to. And they need the insights and not the raw data, and certainly not the noise. And this is something that we need to adopt as an industry: the ability to actually help reduce the noise and increase the value. Increase the value-to-noise ratio, if you'd like. This is what we need to aspire to, because there are massive amounts of data: high cardinality metrics and the logs and the tracing. Even tracing now uses sampling strategies, because they can't handle 100% sampling and things like that. So, we need to actually come up with the right solutions to help our users with that.
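For readers unfamiliar with trace sampling, here is a minimal sketch of one common strategy: a deterministic, ratio-based decision made from the trace ID, so that every service seeing the same trace reaches the same verdict. The function name and the trace ID format are assumptions; this is not the API of OpenTelemetry, Jaeger or any other tracer, which ship their own samplers.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Decide deterministically whether to keep a trace.

    The trace ID is hashed into a 64-bit bucket, and the trace is kept when
    the bucket falls below sample_rate's share of the 64-bit range.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)


# Hypothetical trace IDs, for illustration only.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [trace_id for trace_id in trace_ids if keep_trace(trace_id, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, no coordination between services is needed. The trade-off is that rare but interesting traces, such as errors, are dropped at the same rate as everything else unless a separate rule keeps them.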
Anthony: Well, we are nearly out of time, and I hate to leave you, or leave everybody else, with a fairly negative thought. But, if your monitoring tool is, basically, being paid because of the amount of data it collects and ingests, then that's all they're going to care about making you do more of. They're not going to care about the outcomes. In actual fact, they're going to want you to just have more problems so you can monitor more stuff, so you can ingest more data, because if they change that model, their stock price is going to tumble, because that's how they get their revenue. You sign a three-year deal with somebody to ingest data. They're going to turn around and try to get you to ingest as much as possible, if not more. It is what it is. Don't think for a second your sales rep, at whatever company it is, wants you to stop consuming and use it more efficiently. Those are like fighting words for them.
Dotan: That's your role and my role, Anthony. We're advocating for the right thing regardless of the salespeople, so that's what we're here for.
Anthony: Yeah, exactly. The best projects that I've always been a part of, have been the ones where people have gotten promotions after we've deployed the solution. You make friends for life, when you can architect a solution, you work with them, you go through the good, bad and the ugly, technology breaks, delays happen, bugs occur, whatever. But as long as you're approaching the problem from an outcome perspective like you're saying, look at the end user, look at what they're doing. Don't just focus on doing technology for technology's sake. Focus on the outcome and then go from there. But, yeah-
Dotan: I can, by the way, mention, I know we're running out of time, but maybe I'll put the link in the show notes, if you're okay with that. I had an article published very recently in The New Stack, a joint article with Jujhar Singh, a very senior person, maybe you know him from the UK sphere, very well known there. And the topic was, how much observability is enough? That's the title, and the whole point of the article was to take a pragmatic approach to observability: not to just try to copy-paste from the Netflix blog posts out there, or Google's, or whoever's, but to actually understand what you need, what fits your organization, your ability and capacity to consume. Because it's not just technology, it's processes, it's people, it's culture, even. So I think this is, again, one thing that I found useful, and I can use the stage to help people understand that they should take a pragmatic approach. So, happy to share this article, if this is of use to the listeners.
Anthony: Yeah. No and I'd like to work together as well, if there's any opportunities for us to do so, obviously, we'll have you back to talk about observability. We're thinking about doing some panels and things where we can bring people from different perspectives and talk about stuff, but we'll see where it goes. This is part of the fun of doing what we do. We're just trying to get the message out there. If people disagree, they disagree, it's fine. Just as long as you take the time to think through, maybe reassess things and just think with an open mind, that's what technology is all about at the end of the day. It's not about setting your flag to a banner and saying, "I'm going to go with these people." You change your technology habits and you change what you use and what you consume to work best for you, and businesses should be no different.
Anthony: Cool. Well, yeah. Again, thank you for your time. Before we go, is there anything you want to recommend that people read, any recommendation around something that's keeping you up at night, or something that's inspiring you recently that you want to share with everybody?
Dotan: So, first of all, I want to invite again, all your listeners. If they listen to you and enjoy, I'm sure that they're interested in this domain, they may find interesting the podcast, OpenObservability Talks. And tomorrow, actually, I'm going to live stream the episode about OpenTelemetry with Alolita Sharma, a very prominent figure in the open source community, the CNCF and the industry. So, I hope that you find the podcast interesting.
Dotan: And one thing that I found interesting as part of the recent article that I wrote is that I went back to Google's SRE book. The idea was to reflect on that book, which has already been out there for a good few years, and check its relevance for today. Many things, by the way, stand very true today, but some things actually got me thinking about what we need to adjust in our way of SRE operations in light of changes in the technology, like microservices and things like that. I even wrote something about the changes we need to make in our approach to SLOs in light of the microservice environment. So, I found it an interesting experience to go back to the classics and revisit them, and maybe it's an invitation to others to check it out themselves.
Anthony: I do think it's interesting to note that Google site reliability engineers have the ability to choose the technologies that they use to do their own day job. A lot of site reliability engineers, actually, probably 95% of site reliability engineers, don't have the luxury of even choosing what is being given to them as their observability solution. It's like, "Oh, the app guys like to use New Relic, so we can get a bit of New Relic data." And the SREs are basically piecing it all together and going, "Oh, crap. How am I going to do the SLO?" But it's about time that they had their own solution to bring this all together. They're the most underserved role in the enterprise as far as I'm concerned, when it comes to money.
Dotan: And very loaded. Very loaded. I hosted a staff SRE from Google on one of my episodes and he was in charge of all the logging, the identity management service. Imagine everything on his shoulders, and the load and the scale. It was a fascinating discussion to hear how they do that internally.
Anthony: Yeah. Yeah. It is funny when, especially, in CNCF, there's a lot of Google people that hang out in that community because of the Kubernetes stuff. And it's always interesting to see how technologists who work on technology solve their own problems. But again, Dotan, thank you so much for your time today. I really appreciate it. And I hope to see you again soon.
Dotan: Thank you very much, Anthony, for having me. It's been a pleasure.
Annerieke: Thank you so much for listening. We hope you enjoyed it. If you'd like more information about StackState, you can visit stackstate.com and you can also find a written transcript of this episode on our website. So if you prefer to read through what they've said, definitely head over there and also make sure to subscribe if you'd like to receive a notification whenever we launch a new episode. So, until next time...
Subscribe to the StackPod on Spotify or Apple Podcasts.
Dotan's podcast: Open Observability Talks