EP #13: Open Source Observability With Michael Hausenblas of AWS

34 min listen

Annerieke: In this episode, we’re talking to Michael Hausenblas. Michael is an open source observability enthusiast and is currently part of the open source observability service team at AWS. This means that he keeps a close eye on new open source platforms and technologies like Prometheus, Grafana and - of course - OpenTelemetry to see how AWS should integrate with these new technologies to provide the most value for the customer. Michael also wrote a couple of books, with the most recent on observability, he has a weekly observability newsletter that you can sign up for and he’s very active on Twitter, so we’ll make sure to add links to these in the transcript of this episode on our website, stackstate.com. Needless to say, we’re eager to share this conversation with you, so without further ado, enjoy the episode.

Anthony: Hello, and welcome to the StackPod, the best biweekly tech podcast there is. As you all know, this week I'm joined in the hot seat by a hot company right now, AWS. We've spoken to a few of his colleagues in the past. Michael is here to talk to us a little bit about OpenTelemetry today and some of the interesting goings on with AWS and how they're interacting from a product standpoint, but we're just going to keep it open and see where the conversation takes us. Michael, do you want to introduce yourself, where you're from and what do you do over at AWS?

Michael: Sure. Thanks for having me, Anthony. Yeah, my name is Michael. I'm based out of Ireland, but, as you probably can tell from my last name, I'm not Irish. I'm originally from Austria. Moved there more than 10 years ago. So I'm part of the AWS open source observability service team and that contains or owns three things. And that is managed Prometheus, managed Grafana and our AWS distribution of OpenTelemetry. And the interesting part is that we have all these wonderful open source projects, specifically OpenTelemetry being a Cloud Native Computing Foundation (CNCF) project. That means upstream first. That means a lot of things going on in the community. Think of the upcoming KubeCon in a couple of weeks' time, where the maintainers meet, and meet with users, trying to figure out what we are going to do next. Some of the high-level things are clear, then there's always the question of the exact order of things: the programming languages, the specific parts of OpenTelemetry. Yeah, and I have one leg in product, one leg in engineering. Yeah, I'm super excited and super happy to be part of that service team.

Anthony: I think it's really interesting because we were talking a little bit about this prior to us talking in this interview right now. And I think it's a really cool area where you're currently working because you've got AWS, which was always traditionally viewed as one of the first cloud providers. And a lot of people, when they initially thought of cloud, they thought of data center replacement. And now we've moved past that. We've got more companies that are born in the cloud and that means that their notion of deploying EC2s even is unheard of. A lot of people are doing just straight serverless infrastructure and infrastructure as code. And if you've got a VPC in AWS, you've basically started software defined networking and you never have to touch a network cable in your life, but you straddle that innovative line where we're moving to that and now we've adopted that.

Anthony: But then how do we keep on top of or how do we work with the innovations that are occurring so that they can be made available for customers as quickly as possible as part of the core AWS services that are provided? And you have that developer role whereby you're actually helping deliver code and prioritizing the delivery of code, if you will, for your customers, if you don't feel the innovations are going to keep up. And then you also need to figure out are the innovations going to keep up for our customers? How do you go about dealing with that? Just give me a day in the life kind of thing?

Michael: Yeah, so a big part really is talking a lot with customers. That can be more like one-on-one settings, like a video call or whatever. Now, increasingly, thankfully, it's also more in person. That can be something like KubeCon or actually going onsite, but it's mostly about learning and understanding what the issues are and what the potential solutions are. It's not always directly clear that for a given problem this is the only solution. So there might be two or three or four different ways. And given that we are in this open source space, it's about upstream first. It's not AWS or our service teams saying, "Okay, we're just going to implement that," but it is about: we have identified the need. Other providers might have identified the same need, but then we need to work together in the community to figure it out. Let's say, a specific example in OpenTelemetry.

Michael: There's this component that's the collector, that's the agent that gets the signals from where they're produced to the destination where you want to consume them. And there are certain parts in it. You need to have a receiver that talks with a certain data source that emits logs, for example. So you need to figure out which one you build first. Who does it? If we want something in AWS for our customers, then we can't just go there and say we'd like to have that and someone else please implement that, but there are also the established processes. You start out with an issue. There might be an enhancement proposal where you say here is something that we think is needed in the future. And so you have a high-level design, you collect these requirements, you collect feedback in the open. Again, think of GitHub issues, think of Slack, think of the community meetings. There are a lot of different meetings going on in OpenTelemetry. And then you decide how you go about it.

Michael: It might be that we directly implement it, it might be that we're working with partners to implement it, but the prioritization is always influenced by what our customers tell us, what kind of problems they need to be solved.

Anthony: So let me ask you a different type of question. So let's say, for example, we've got OpenTelemetry and support for logging is coming out. We know it's coming, it's being worked on. We can see the progress being made at the various check-ins and whatnot and you can see people at work on it. What I would probably be most, let's say, frustrated by if I was in your position at AWS would be I've got a customer who needs logging today. So I've got a choice, so I can either work and build something myself with my own engineers that meets the requirement of that customer and gets quicker delivery versus waiting for the feature. Now, the thing that I would get frustrated by is if I said let's wait for the open source version and then implement that. How do you make sure that the requirements for your customer are going to be met when that initial version comes out when you don't have a full stakeholder seat at the table in terms of that final outcome? Does that make sense, where I'm going with that?

Michael: Absolutely. And that is where we always need to decide if there is some feature, whatever it is, that is very much AWS specific. It's like it's only beneficial if you're running a workload in AWS. Then clearly, or very likely, we will end up owning that piece of code, from the issue to the actual implementation to testing to shipping it. If it's something that ...

Anthony: It's technical debt at the end of the day because then you've got to support that.

Michael: Of course. And we never deprecate or switch something off, so it's a very serious, long term commitment. If you think about Syslog or whatever kind of log format support, that is generally usable and useful, no matter what cloud provider, no matter what workload, no matter what environment, like Linux or Windows. Then very likely many vendors, if not all vendors, will say, "Yes, we want that," and the design, implementation and everything will be shared between multiple stakeholders. Stepping back a little bit, I think it's important to understand where OpenTelemetry is coming from and where the impedance mismatch is between, at least for the time being, what we have available and what people effectively need. So OpenTelemetry is a CNCF project, which originates, or comes initially, from a merger of two distributed tracing projects, OpenCensus and OpenTracing.

Michael: So the initial scope was distributed traces. That was the initial focus in terms of signal type. We have logs, metrics, traces and profiles currently, and traces were where the focus of OpenTelemetry was. Now, initially that was clear and that is also the reason why traces are already in GA. Which means both the OpenTelemetry collector, which is the one part, and the SDKs, meaning the 11 different programming languages that OpenTelemetry supports, have fully implemented it. Meaning you can use distributed tracing in OpenTelemetry in GA. And over time the project grew and decided the scope of the signal types that we want to support should also be widened. It's not just distributed tracing. It is metrics and it's logs. So in a sense it's weird because out there in reality, if you think about it, everyone is using logs. I would bet there is no one who is not using logs in some form. It could be system level logs for the operating system, it could be infrastructure logs, like from an AWS service, it could be application logs, but at the end of the day everyone is using logs.

Anthony: Right. There was a joke about Splunk talking about logs with Cisco. Somebody tweeted when they tried to acquire Splunk for $2 billion or something and somebody made a joke around whether that was them acquiring them or just paying their Splunk bill for the year.

Michael: It sounds like Corey Quinn would've said that. And next to logs, metrics are quite widely adopted. Maybe not as widely as logs, but also very widely adopted, whereas distributed traces are not that widely used yet. And in OpenTelemetry we have it exactly the other way around. We have traces that are already GA. Metrics are in the process of becoming GA as we talk, in the next couple of weeks. And logs hopefully will become GA later this year, beginning of 2023, so it's exactly inverted. That's the impedance mismatch that we currently have to deal with. The good news is that, if you look at it as a customer and ask yourself should I be betting on OpenTelemetry, the answer is yes and the reason is simple.

Michael: If there is code that you own, which means your application code, it doesn't matter if you put something into a container, into Lambda functions, whatever. If that is the code that you own, which means you are responsible for instrumenting it, you decide I'm going to emit a log line here, and a trace there, and this metric here and so on, then there are existing SDKs for traces and for metrics, where, for example, if you're using the Prometheus standard for metrics, you don't need to re-instrument. If you're talking about logs, then there is essentially, in the logs design (if you look through the OpenTelemetry logs overview), the guarantee that no matter what kind of logs you're producing we will be able to map that.

Michael: And the mapping's already there. You have a long list of mappings of existing log formats and standards to the very generic OpenTelemetry way to represent logs, which means the basic insight there is you don't need to worry about it. You don't need to re-instrument things, you can just bet on OpenTelemetry. The code that you own, you can use and don't need to re-instrument. For code that you don't own, take any kind of database that you might be using or any kind of service or whatever, there it's a matter of your provider, the person that provides you with the database or service, like an AWS service, essentially exposing the signals in a form that is compatible with OpenTelemetry. That depends a little bit on the signal type.

Michael: As I said, for metrics, for example, we have Prometheus. There has been quite some work over the last year in this working group to make sure that OpenTelemetry is compatible with OpenMetrics. And for logs, as I said, there is this mapping idea there. And traces are, as I said, already GA, so you can already use the OpenTelemetry SDK without any extra work. So in a sense, while in 2022 we're still facing this issue, logs are not there yet and so on, it is a safe bet and is something that I would encourage everyone to do because it also gives you this portability. If you're betting on open source and open standards, you have a very nice path for migrating stuff that you currently, for example, are running on premises and moving that to a cloud provider because you are using these open standards.
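To make the mapping idea concrete, here is a minimal sketch of turning a syslog line into a record shaped like the OpenTelemetry log data model. The field names (`SeverityText`, `SeverityNumber`, `Body`, `Resource`, `Attributes`) come from that data model; the severity numbers follow one reading of its example syslog mapping and should be checked against the specification. The function name, the `syslog.facility` attribute key and the sample line are assumptions for illustration only, not part of any SDK.

```python
import re

# Syslog severity code (0-7) -> (SeverityText, SeverityNumber).
# Illustrative values; verify against the OpenTelemetry log data model.
SYSLOG_TO_OTEL = {
    0: ("FATAL", 21),   # Emergency
    1: ("ERROR3", 19),  # Alert
    2: ("ERROR2", 18),  # Critical
    3: ("ERROR", 17),   # Error
    4: ("WARN", 13),    # Warning
    5: ("INFO2", 10),   # Notice
    6: ("INFO", 9),     # Informational
    7: ("DEBUG", 5),    # Debug
}

# A syslog message starts with a priority header such as "<34>".
PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def syslog_to_log_record(line, resource):
    """Map one syslog line to a dict shaped like an OpenTelemetry log record.

    Hypothetical helper: only the <PRI> header is parsed; the rest of the
    line is kept verbatim as the Body.
    """
    match = PRI_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not start with a syslog <PRI> header")
    # PRI = facility * 8 + severity
    facility, severity = divmod(int(match.group(1)), 8)
    text, number = SYSLOG_TO_OTEL[severity]
    return {
        "SeverityText": text,
        "SeverityNumber": number,
        "Body": match.group(2),
        "Resource": dict(resource),
        # Attribute key chosen for illustration.
        "Attributes": {"syslog.facility": facility},
    }


record = syslog_to_log_record(
    "<34>Oct 11 22:14:15 mymachine su: 'su root' failed",
    {"service.name": "auth"},
)
print(record["SeverityText"], record["SeverityNumber"])
```

The application that wrote the syslog line is untouched; only the mapping layer knows about both formats, which is the "no need to re-instrument" point made above.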

Anthony: Yeah, and it's getting rid of that technical debt as well. And then when also people turn around to me and say, "Well, we're a financial services company. For security we'll never do open source." And I'm like, "Well, actually, closed source is one of the most insecure things you can do because the changes aren't public." Anybody could literally change anything and then that becomes the status quo. At least with open source there's a communal sense of direction. There's a communal sense of understanding that some baseline technology is needed in order for us to get more effective, more efficient, better, whatever the outcome is supposed to be. And I think with OpenTelemetry it's trying to solve this issue of the fact that still even today logs and tracing, that they're usually in two separate tools anyway. The fact that a lot of the time people still go to Prometheus. People still go to Splunk for the logs and then maybe they'll go to Dynatrace or AppD for the tracing data, if they're not using any kind of open source today, right?

Michael: Right. Yeah, so just to comment on your previous statement, the fact there is really that with open source we have essentially expanded the supply chain options. If you think about 30 years ago, whatever, or even when I started around 2000, there was not a lot of open source available in commercial settings. So you only essentially had two options. You either own that bit yourself or you trust your vendor. These are the two options. Either you write that code yourself, so you know what it is in your team, your company, or you trust your vendor. That's it. And now we have three options. You own it yourself, you trust your vendor or you can trust or use the community as a whole. I mean, you have many eyes looking at that.

Michael: You have many folks that test in various environments, et cetera, et cetera. So it's just expanded our options there, but at the end of the day you still need to make a decision. You cannot ignore the supply chain issue. Think of Log4j or whatever: you cannot say, "I don't know where that comes from, whatever, I don't care." At the end of the day, your customers, they will ask you is my data safe, has something leaked or whatever. So you own that part and you can decide with whom you share that. You're like, "Okay, I'm totally all open source." Yeah, then that's great and you probably also need to invest in being part of that community. You can't just say, "Okay, I want to benefit from that open source," but not be part of even the meetings and the heavy lifting and testing and whatnot, essentially.

Anthony: Well, I think the unique thing with something like OpenTelemetry is let's say, for example, you completely developed a custom product. Let's say I am a bank and I've got, let's say, an AI thing that can basically forecast the markets for 50 years. Hypothetical, right? You wouldn't want to use the code that runs that from an open source project because then that's your intellectual property at the end of the day. If you're bringing it from open source, everybody can forecast the future. You want to be able to make it so that you can forecast. However, with OpenTelemetry, is it such a big deal if the monitoring technology is open source? That's where it comes in. Do you want to have your engineers building a monitoring component for that forecasting when it's not really part of your business model? The output is the forecast. You get where I'm going with that? Do you want to have technical debt in that area when it's not really pertinent to the outcome?

Michael: That's fair, but also OpenTelemetry, at least at the current point in time (things change; there are OTEPs, these OpenTelemetry enhancement proposals, that might broaden this scope), is very much focused on the telemetry part, meaning getting the signals from where they're produced, from your container, your environment, EC2, Kubernetes, whatever, to the back ends. So to CloudWatch, to managed Prometheus, to Splunk, to whatever. So this bit in between is how do you expose them? How do you get them on the wire? How do you collect them, route them, filter them, get them into the back ends? Meaning the focus is on making this telemetry bit table stakes, right?

Anthony: Yeah.

Michael: It should not matter, where 20 years, 15 years, 10 years ago you had a specific agent for a specific signal type, like logs, that was able to collect and route these logs.

Anthony: Hang on, except for today. For the most part you're still running agents.

Michael: Okay, then I stand corrected. Today, mostly you still have specific agents that are able to collect and route specific signal types, like logs and metrics and traces, to specific back ends. The vision of OpenTelemetry, and I would argue we are slowly moving in that direction as all the signal types become GA, is that it doesn't matter where it comes from and doesn't matter where it goes to. It is standardized because of the protocols, because of the collector. And if you decide I'm sending my traces today to this one provider and tomorrow to another provider back end, I want to consume it in a different environment, then that means only a configuration change, for example, in the collector, and not that you need to re-instrument your code so that you can send it somewhere else. That's what the real focus or scope currently of OpenTelemetry is. I'm not saying that in two or three years' time OpenTelemetry won't try to do even more and try to do all kinds of things, but currently this is the original scope, the vision of OpenTelemetry.
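The "configuration change, not re-instrumentation" point can be illustrated with a toy sketch. This is not the OpenTelemetry Collector or its configuration format; every name here (`backend_a`, `backend_b`, `run_traces_pipeline`, the config keys) is hypothetical and only shows the shape of the idea: application code emits spans once, and a separate piece of configuration decides where they go.

```python
from typing import Callable, Dict, List

Span = Dict[str, str]
Exporter = Callable[[List[Span]], str]


def export_to_backend_a(spans: List[Span]) -> str:
    # Stand-in for sending spans to one provider's back end.
    return f"backend-a received {len(spans)} span(s)"


def export_to_backend_b(spans: List[Span]) -> str:
    # Stand-in for sending spans to a different provider's back end.
    return f"backend-b received {len(spans)} span(s)"


# Registry of available exporters, keyed by the name used in the config.
EXPORTERS: Dict[str, Exporter] = {
    "backend_a": export_to_backend_a,
    "backend_b": export_to_backend_b,
}


def instrumented_checkout() -> List[Span]:
    # Application code: emits a span and knows nothing about back ends.
    return [{"name": "checkout", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}]


def run_traces_pipeline(config: Dict[str, Dict[str, str]], spans: List[Span]) -> str:
    # The collector-like part: looks up the exporter named in the config.
    exporter_name = config["traces"]["exporter"]
    return EXPORTERS[exporter_name](spans)


config_today = {"traces": {"exporter": "backend_a"}}
config_tomorrow = {"traces": {"exporter": "backend_b"}}

spans = instrumented_checkout()
print(run_traces_pipeline(config_today, spans))
print(run_traces_pipeline(config_tomorrow, spans))
```

Switching providers touches only the config dictionary; `instrumented_checkout` stays exactly as it was.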

Anthony: Yeah, that makes sense. Well, we've been talking for a while around OpenTelemetry. We've been talking specifically about that around the different approaches and where AWS sits. What are your thoughts around where things are going, given that you are continuously working in that observability space? What are some of the major innovations you're looking forward to over the next six months to a year?

Michael: I'm super excited about where we are. I can imagine looking back in 10 years' time or five years' time at 2021, 2022, 2023. Hopefully we will not so much think about COVID anymore, but really about where OpenTelemetry was at that point in time. And although, as I said, at the current point in time, if you look from a customer perspective in terms of adoption, it is challenging, and we're doing our best to help over this transition phase, this year and next year are this wonderful time where there are so many options and so many different routes to take from a product perspective, from how to enable our customers. And this is already a little bit scary for me because the focus is still on logs, and the OpenTelemetry community decided to essentially do one signal type after the other. Until traces were GA, there was work on metrics, but the main focus was traces.

Michael: Currently the main focus is metrics and only once metrics are GA the main focus will shift to logs. But looking beyond that, there is already an OTEP, an OpenTelemetry enhancement proposal. I think 139 is the number, which is about continuous profiling or profile support, so extending the signal types beyond logs, metrics and traces to continuous profiles. And I'm a big fan of and believer in continuous profiling. Think of open source projects like Parca or Pyroscope, or CNCF Pixie, which also supports continuous profiling. And I think OpenTelemetry is in a wonderful place to play the standard for that as well. So that's the 2023 plus plus scope that I see there, but for now it means we have to focus on logs and need to deliver that to the community and products. That's the most important thing for this year and beginning of next year.

Anthony: So there's a lot of talk around the collecting of the data and then traditionally people put it in Prometheus, Splunk and all these different areas. One of the things, if we can talk about StackState just a little bit, obviously the sponsors of this, one of the things that we do right is that we believe that that collection of the data should be in a single place. It should be in a topology. All that logging data should either be used to identify a metric or identify some kind of anomaly or tell us where in the topology is that component talking to and from something. And then storing that over time allows you to create some outcomes because effectively that's what OpenTelemetry is doing. It's going to help you find the symptoms of your issues so that you can still go and identify the root cause.

Anthony: And so it's still going to be on some other tool that you're going to have to go to at the end of the day to both consolidate all the data together, visualize it, and then go off and do something. Are you seeing any innovations outside of something like StackState in that space or have you used StackState? I don't know. I'm just asking out of interest.

Michael: Right. I would say that the keyword there is correlation and there is indeed a lot of work going into correlation in OpenTelemetry, really. Again, if you look at the logs specification or overview design document, it is very, very explicit about how to enrich and how to go about that ... I think there are three different types. The one is resource, the other one is time and the third one is execution context: how you can ingest the necessary bits of information so that you can actually correlate logs, metrics and traces. In addition, of course, there are also things like exemplars from Prometheus, where you have a specific way to embed a trace ID in metrics so that you can correlate metrics and traces, but I think there is a lot of room, specifically around how you use it.

Michael: It's one way or one piece of work to say this is how we represent the extra correlation bits in a log format, in metrics or whatever, but then how do you actually make use of it? How do you make it easy for customers to actually click on the dashboard and jump from logs to traces, from metrics to traces, et cetera, et cetera? And I believe that's what part of the value prop of StackState also is, to make it as easy as possible for customers or users to have that correlation and automate the correlation bit so that you don't need to manually ... I remember the ID here or time code here or whatever and then jump into another tool or another environment to look something up, but correct me if I'm wrong.
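The three correlation dimensions mentioned above (resource, time, execution context) can be sketched in a few lines. This is a simplified illustration, not OpenTelemetry SDK code: the `Span` and `LogRecord` classes, their fields and the `logs_for_span` helper are hypothetical. It prefers the execution context (a shared trace ID) and falls back to matching on resource (here, the service name) plus time window.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    trace_id: str
    name: str
    service: str   # stands in for the resource that produced the span
    start: float   # seconds
    end: float     # seconds


@dataclass(frozen=True)
class LogRecord:
    timestamp: float
    service: str
    body: str
    trace_id: str = ""  # empty when the log carries no execution context


def logs_for_span(span: Span, logs: List[LogRecord]) -> List[LogRecord]:
    """Return the log records that belong to a span.

    First try execution context (same trace ID). If no log carries the
    trace ID, fall back to resource plus time: same service, timestamp
    inside the span's start/end window.
    """
    by_context = [log for log in logs if log.trace_id and log.trace_id == span.trace_id]
    if by_context:
        return by_context
    return [
        log
        for log in logs
        if log.service == span.service and span.start <= log.timestamp <= span.end
    ]


span = Span(trace_id="abc", name="checkout", service="shop", start=10.0, end=12.0)
logs = [
    LogRecord(timestamp=10.5, service="shop", body="payment started", trace_id="abc"),
    LogRecord(timestamp=11.0, service="shop", body="cache miss"),
    LogRecord(timestamp=11.5, service="billing", body="invoice created", trace_id="abc"),
    LogRecord(timestamp=30.0, service="shop", body="unrelated"),
]

for log in logs_for_span(span, logs):
    print(log.service, log.body)
```

With a trace ID on the log record the jump from a trace to its logs is an exact lookup, even across services; without it, the tool has to guess from resource and time, which is the manual "remember the ID or time code" work described above.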

Anthony: Yeah. Well, no, so the idea being because, as you've highlighted, it's an adoption roadmap. That some people will be hybrids, some people will be all in on the cloud, some people will have adopted one language and one set of standards for one app and then another for another app even within AWS. It can be like that. So the idea behind StackState is that we can build out that topology from multiple sources. So whether you've got Splunk, whether you've got OpenTelemetry, bringing it all together and then building out that view of a single view as opposed to just pulling traces and just saying this looks good. Putting everything into context, that's what we do and that's why we focus on the database.

Anthony: It's not really about the collectors. And that's why we support OpenTelemetry, for example, because we don't care about the collecting of the data. We just want mechanisms and tools that do that for us so that we don't have to do it ourselves. And that's really it. And if you've got Dynatrace or AppD, that, for us, is another collector on our behalf. We still need to consolidate everything into a single view. You get the app stuff in one app and then the network stuff in another app, like Splunk. You need to unionize the two and that's what StackState does. Yeah.

Michael: Cool.

Anthony: So we've run out of time, really, and that's really everything that I wanted to dig into. Is there anything you want to recommend that people read, like any books or anything you'd recommend or maybe an anecdote that somebody should mention at a party or something? Meeting somebody for the first time.

Michael: There is a community driven event that's next week, I believe. At least at the time we're recording. That's o11yfest, where we will have a book club as well. Yeah, we go through a number of up to date, recent, relevant observability books at this o11yfest. I don't know the exact name, but I think it's called Book Club or whatever. Happy to share the resource there. Yeah, I think that's maybe a good way to learn about books. I'm writing one for Manning, Cloud Observability in Action, but there are many recent books from the last year, half a year. So, again, this is a super exciting time and obviously with that come all kinds of resources, books being one example. Yeah, there are many wonderful community sites where you can learn about good practices in this space. Yeah, I also have a newsletter if people are interested in getting updates once a week. Very easy through email, or type it in on the website. Yeah, there are a number of resources out there where you can learn and expand your knowledge on observability.

Anthony: We'll include the links in the show notes. So whether people are on Spotify, Apple, whatever, they can access all those resources. I'm not sure if we're going to come out in time for the Book Club, but we'll post a link regardless and then people can catch up.

Michael: Yeah, record it. Yeah, sure.

Anthony: Yeah, figure it out. Yeah. No, we should be good. Again, thank you so much, Michael, for doing this and taking the time out of your busy day. I know that you're a busy guy and you've got a lot to do, but I really appreciate you getting on this podcast and giving us an interesting perspective on tech. It's a new role. Your role wouldn't have existed 10 years ago or it would've been some weird EMC, Dell type thing where you are navigating around with USB keys. I'm just envisioning, but it's very interesting. Yeah, thanks, everybody, for listening and I'll talk to you soon. Thanks, bye.

Michael: Thank you.
