Time-Traveling Topology and Observability | StackPod E1

Mark Bakker and Lodewijk Bogaards - co-founders of StackState

For the first episode, we have two special guests: we invited our co-founders Lodewijk Bogaards and Mark Bakker. Mark and Lodewijk came up with the idea for what is now StackState about 6 years ago because they wanted to solve a big problem in the monitoring market: there was so much data, but so little insight. In this podcast, Anthony interviews them about StackState's product perspective now that we're moving towards cloud-native, why a time-traveling topology is crucial for that, and how we will be thinking about cloud costs in the future.

You can find a written transcript of the episode below. Enjoy the recording!

Episode transcript

Lodewijk: [00:00] It doesn't make sense to know only the topology at a given point in time or at the current time, because once there is a failure, that likely will also cause more remediation events, and that means that the landscape changes from moment to moment.

Annerieke: [00:18] Hi there, and welcome to the StackPod. This is the podcast where we talk about all things related to observability because that's what we do, and that's what we're passionate about, but also what it's like to work in a tech company. So, if you are interested in that, you are definitely in the right place.

So, in today's episode, we actually have two guests: Mark Bakker and Lodewijk Bogaards. They came up with the idea for what is now StackState about 6 years ago, and Anthony has a nice chat with them about starting and building a tech startup, and now scale-up, and some other topics that are related to, of course, observability. Enjoy the recording.

Anthony: [00:58] So welcome. As co-founders, right, of the company, you guys should be thinking more existentially than just the company, right? In terms of product direction, in terms of where we want to go and how we're going to achieve that. That's what true leaders think about, right? They're not sitting on their islands, building a fortified fence, so that nobody can see what they do. It's better to be out there.

I think one of the cool things about working in software is that there are outcomes, right? You're either building something, you're selling something, or the customer’s consuming something. Right? It's not like we're working at a grocery store and we just put a bunch of things and people just pick them up and go kind of thing. You have your grocery store and you just need to have more stuff than the other grocery store. I think when you work at a software startup, there are so many elements in terms of growing, cultivating, and basically owning that entire process. If we were to put it into a grocery store perspective, right, we're the farmers, we're the sellers and then we're also sitting at the dinner table with the consumers, right? That's one of the unique things about working in software, right?

Mark: [02:11] The most important part is the last part sitting at the table during the dinner to see how people react to it, how they like it, and really optimize basically your product to that. And the nice thing about it, is that for cost risk, you need to wait another six or nine months. I don't know the exact timing for that, but yeah, now we are very agile. And if we hear that multiple customers want the same thing, we can change it quickly and in our next release, we have it done. And then we can optimize and in the end, get better results for our customers.

Anthony: [02:43] What would you say Mark is the biggest focus from a product perspective of, as we look to the end of 2021?

Mark: [02:50] At this moment, our biggest focus is really to nail down Kubernetes and AWS observability. Make sure that we give all the context to our users, and don't only rely on telemetry signals, but really know how everything is connected to each other. What is changing in your environment and what's the real root cause of a problem? Because, what we see is that most products basically give you information about what is going on. So, the problem at hand, but they don't show you what has changed and what's the cause of your problem. And that's really what we are focusing on at this moment. And also our main focus for that is listening to lots of people, getting their input, talking to SREs. Yeah, that's basically it.

Anthony: [03:33] We've got several customers that are in the Hybrid IT space, where we've focused on, on-premise, going to the cloud, observability across different environments being a huge issue. And I think it is still an issue that needs addressing for the most part, but it's a very noisy marketplace, right? That everybody says they do observability these days. How did we get from Hybrid IT to, to Cloud Native?

Mark: [04:05] Yeah. What you see in the market is that basically, in the beginning when we started almost five years ago, what you saw was there were products for log aggregation, there were products for storing your metrics. There were products to do health monitoring and so on... Bringing all that data together was something that was a need in the market. And this is basically still a need in the market. But, what you see is that the market is changing and basically due to COVID, the market is also changing more rapidly than it was changing before. So, what you saw was that many companies were already moving to the Cloud, but the pace at which that is happening increased tremendously in the last, I would say, 12 months. So we are basically accelerating in that direction and putting more focus on it than before.

Lodewijk: [05:08] And it's also a sweet spot for our technology. So, what is at the core of StackState is a time-traveling topology, which we built on our own database, a versioned graph database. And that allows us to understand how all the different components are correlated with each other in the form of a graph, or what we call a topology. But one of the core ideas that we had was: well, these graphs, they are very important and they've been important for decades. So, there are CMDBs and there are discovery tools that give you topology. Even in the 90's, this was already happening, probably even before that, but these graphs, they were never really up to date. So, the first component of that, that we needed to tackle was: well, we need to make it real time, especially because we see a lot more change. But also, because we see a lot more change, it needs to be time traveling.

Lodewijk: [06:20] And it doesn't make sense to know only the topology at a given point in time or at the current time, because once there is a failure, that likely will also cause more remediation events, and that means that the landscape changes from moment to moment, and you want to really make sure that you can rewind and capture all of these moments so you can easily get down to root cause. And when we started, Kubernetes wasn't really on our radar. I guess it existed as a technology, but it became popular later on. We also went with Mesos for a while. I mean, remember we went actually quite deep with Mesos, but then Mesos actually didn't become the de facto standard. And we had only one Mesos customer, but we thought "Okay, this is a really good direction." But then, when Kubernetes came around, and then the whole cloud native movement, that is such a good fit for our technology, that it's a no brainer. So, we've been focusing on that for a while now, but as Mark said, with COVID indeed, we were accelerating that.
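
Editor's note: the idea of rewinding a topology can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how StackGraph is implemented: the class name, the append-only change log, and the component names (checkout, payments-v1, payments-v2) are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class TimeTravelingTopology:
    """Toy append-only log of edge changes (hypothetical, not StackGraph)."""

    # Each entry is (timestamp, action, source, target), kept in time order.
    _log: list = field(default_factory=list)

    def record(self, timestamp, action, source, target):
        """Append an 'add' or 'remove' of the edge source -> target."""
        if action not in ("add", "remove"):
            raise ValueError(f"unknown action: {action!r}")
        if self._log and timestamp < self._log[-1][0]:
            raise ValueError("events must be recorded in time order")
        self._log.append((timestamp, action, source, target))

    def snapshot(self, timestamp):
        """Replay the log and return the edges that existed at `timestamp`."""
        times = [entry[0] for entry in self._log]
        upto = bisect_right(times, timestamp)  # include events at `timestamp`
        edges = set()
        for _, action, source, target in self._log[:upto]:
            if action == "add":
                edges.add((source, target))
            else:
                edges.discard((source, target))
        return edges

    def changes_between(self, start, end):
        """Return log entries with start < timestamp <= end."""
        return [entry for entry in self._log if start < entry[0] <= end]


topology = TimeTravelingTopology()
topology.record(100, "add", "checkout", "payments-v1")
topology.record(200, "remove", "checkout", "payments-v1")
topology.record(200, "add", "checkout", "payments-v2")

before_failure = topology.snapshot(150)
after_remediation = topology.snapshot(250)
what_changed = topology.changes_between(150, 250)
```

Replaying a log on every query is the simplest possible design; a real versioned graph database would need indexing and compaction to stay fast at scale.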

Anthony: [07:32] Yeah, I worked for ServiceNow, right? So, we would always sell the CMDB, and in theory, in the old world, when we were living in a virtualized world as opposed to a cloud native containerized world, which we're moving into, it made sense, right? Do IP-based discovery, just give us everything, run it once a day and then you can assume, for the next 24 hours, that everything is going to be the same. You'll be able to run change management. You'll be able to correlate priority 1 incidents to where they're coming from. But the theory was not the reality, because even customers who bought that, they realized, oh wait, there are a whole array of people that manage different elements of that infrastructure. And, if you want read-only admin credentials, you're not going to get them.

Anthony: [08:44] So, even customers who bought the products, they were still siloed in the same issue. People were basically using spreadsheets to update the CMDB and then you're running change control on that. And so, I only see that getting worse actually as we move into a more cloud native environment, because we live in a world of very possessive people and people want to possess their own data. So one of the things I do like about the Kubernetes integration that we have in our agent, is that it sits at the cluster level. We don't need the developers to change their processes in order for us to be useful. We don't need a whole, because I've been on tons of these DevOps calls with people that run these cloud native applications and their response to this issue, right, of CI/CD impact is process coupled with Jenkins or whatever you're using to operationalize your environment.

Anthony: [09:57] I just don't see that working if you want to innovate effectively, right? You kind of need to break things in order to make things and if you're controlled by these processes and these pipelines, you're not able to really truly innovate. You're being held back by somebody's rules that they had back in the day kind of thing. And so, I think one of the impacts of our product and the way it's deployed, is that it's going to remove a lot of red tape because one of the reasons why there is all of that going on is because people are afraid of breaking things, right. And having outages. But if we're able to quickly align a change to the impact and go back and just say, "Hey, quickly change this, redeploy", that's going to save like a ton of red tape. Why have the red tape? That's the way I'm thinking about it. What do you guys think?

Lodewijk: [11:04] Well, for sure what you're saying is right, but I just want to put one nuance there, or at least not give people the idea that there is no place anymore for IT Service Management or the CMDB. I think all of these ideas were right. Only we have to see which role they play in the modern world. And we really shouldn't want to throw out the baby with the bath water. I mean, every next generation that comes along and says, "You're still using web services with XML and with SOAP and, oh, that's so old school, let's do REST APIs with JSON". And then later on, somebody comes along and says like, "Hey, what's your schema?" and say, "Well, which schema are we in, schema?" And so all of these ideas, they have a way of, even if you throw them out, they have a way of coming back.

Lodewijk: [12:06] And so, Mark and I, we've spent some time in the Enterprise Hybrid IT World, and we're still working with a bunch of customers who are very big users of ServiceNow, and I think it's a great product. And even the CMDB still has a very good role to play, but I don't think the role of the CMDB is to really keep up with, you know, every little container that exists at any given point in time. So, if there is an outage or there's a SEV1 outage, you do want to have that in your IT Service Management system, and you do also want to relate that to something in your CMDB. But, if you're going to put it on, let's say the most fine grained level, which is, let's say the process ID in the Docker container that is running in the Kubernetes cluster, on a certain node, that's such volatile data, that really doesn't fit with the CMDB.

Lodewijk: [13:13] And yet I think it should be registered somewhere such that it is relatable back to that type of information and StackState is really made for that type of volatile information. We know each process running in each docker container, running in each Node, running in each cluster and so forth, but we also want to make sure that is relatable back to IT Service Management, because those processes still are very valid and valuable in today's world.

Mark: [13:53] Yeah. That also brings us back to the other foundational technology which we have, because one of the parts is indeed StackGraph, where we really made a versioned graph database and made it scale. It did cost us a ton of man hours, I can tell you. So it's the first database we have ever built. And I think it's the last one.

Lodewijk: I'm not too sure about that one.

Mark: I'm also not sure yet, but let's see. More databases to come, stay tuned. No, but the other foundational technology that we really have is our eBPF-based agent. The core foundation of Linux and Kubernetes really helped us because basically Linux changed the way everything was containerized. So previously we had VMs running on Linux and at some moment Linux really made a change to have namespaces for everything. So namespaces for your networking, namespaces for your file systems and so on... And basically that's what Docker does. It puts all your processes in certain namespaces. And what we really do is, at the kernel level, see how all those processes are connected to each other, how they are communicating with each other, what they are communicating, how fast they are, and when something breaks, really know and understand how that affects other parts of the system. And with that, we can really give you the right alerts and send out the right message to the right people. But of course the change management systems then come into play because that's something we don't do and don't even want to do.
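
Editor's note: the namespaces Mark mentions are visible from user space on Linux, where each entry under /proc/<pid>/ns is a symbolic link whose target looks like net:[4026531992]. Two processes are in the same namespace when the kind and the number in brackets match. The Python sketch below only illustrates that concept; it is not eBPF and says nothing about how the StackState agent works. The function names are hypothetical, and read_namespaces runs on Linux only.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "net:[4026531992]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(link):
    """Split a namespace link target into (kind, inode number)."""
    match = _NS_LINK.match(link)
    if match is None:
        raise ValueError(f"not a namespace link: {link!r}")
    return match.group("kind"), int(match.group("inode"))


def share_namespace(link_a, link_b):
    """True when both link targets name the same namespace."""
    return parse_ns_link(link_a) == parse_ns_link(link_b)


def read_namespaces(pid="self"):
    """Linux only: map each entry in /proc/<pid>/ns to its inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


# Made-up link targets: a host process and a containerized process.
host_net = "net:[4026531992]"
container_net = "net:[4026532617]"
same_network = share_namespace(host_net, container_net)
```

Reading another process's namespaces this way usually needs sufficient privileges, which is one reason an agent that observes from the kernel is attractive.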

Lodewijk: [15:33] Yeah, so at one of our customers, for example, we're also integrating ServiceNow. So, we have the ServiceNow CMDB read into StackState. Then we have also the Kubernetes landscape and a bunch of other foundational technologies that they're using in that company. Mapping all that together, such that when there is an issue, you know exactly where you should be, but at the same time, through our time-traveling topology, it becomes very easy to find the change request that happened in ServiceNow when people actually started working on the change. So you actually see the changes in the Kubernetes deployments, but you also see the ServiceNow changes. Because for us, that's all one big connected topology from StackState’s perspective.
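
Editor's note: as a rough illustration of relating changes from different systems to an incident, the Python sketch below collects every change made to a failing component or to anything it depends on within a look-back window. Everything here is assumed for the example: the Change record, the dependency map, the component names, and the change descriptions. It also uses a static dependency map, whereas the point made above is that the topology itself changes over time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    timestamp: int
    source: str        # where the change was recorded, e.g. "servicenow"
    component: str
    description: str


def upstream_of(component, depends_on):
    """Return the component plus everything it transitively depends on."""
    seen = set()
    stack = [component]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(depends_on.get(current, ()))
    return seen


def candidate_causes(component, incident_time, changes, depends_on, lookback):
    """Changes on the component or its dependencies, most recent first."""
    relevant = upstream_of(component, depends_on)
    earliest = incident_time - lookback
    hits = [
        change for change in changes
        if change.component in relevant
        and earliest <= change.timestamp <= incident_time
    ]
    return sorted(hits, key=lambda change: change.timestamp, reverse=True)


depends_on = {"webshop": ["checkout"], "checkout": ["payments"]}
changes = [
    Change(100, "kubernetes", "checkout", "old rollout"),
    Change(900, "servicenow", "payments", "change request opened"),
    Change(940, "kubernetes", "search", "unrelated rollout"),
    Change(950, "kubernetes", "payments", "deployment image updated"),
]
causes = candidate_causes("webshop", 1000, changes, depends_on, lookback=200)
```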

Mark: [16:24] It's just, the one is about the very fine-grained detail that some port changed or some communication was started between two processes on two containers on two hosts. And the other is really about, we made a big change and it's related to this application, but we also know about that application and how that interacts with the different services and containers then.

Anthony: [16:46] Yeah, because one of the things with the ServiceNow environment is that there are so many different modules that different customers have adopted at different levels, right? You may run Change Management through ServiceNow. Some customers even have their software development life cycle through ServiceNow, but that's not the biggest use case for ServiceNow. They usually use maybe JIRA or something else on the side because it's just more agile and less workflow-based than ServiceNow is. It's a very rigid platform, but it serves a purpose when it comes to ITSM and even HR issues, right, that need to follow a process. ServiceNow is a great thing for that and the CMDB doesn't need to. And I like what you said Lodewijk in terms of, you don't want to keep track of containers and all this kind of stuff and what was spun up at one point and what was not because in the Kubernetes environment, it's there to be orchestrated and to have that elasticity, if you will.

Anthony: [18:02] That means that, when there is demand, resources are available and when there isn't demand resources are taken down because in the cloud, that costs a ton of money to have resources up. But having said that, where do you think people are going to go in the future for tracking spend when it comes to, I know we personally have a huge cloud spend, but where do you think people are going to go in the future to, especially if they're multi-cloud, right. If they use Azure, AWS and Google? The way I see it, right, is that people eventually are going to keep track of spend across all these different clouds. And they're going to compete with these, not compete with the clouds, but make the clouds compete with each other more in terms of cost per container, cost per whatever in terms of dollars and cents. Where do you think people are going to go to keep track of that type of stuff? When we emerge into this new world of services, containers and compute versus storage and all this kind of stuff?

Lodewijk: [19:18] I don't know, to be honest. I'm not too opinionated on that front, also because StackState is squarely in the observability space. If you ask me a question on that, I'll probably have a ready-made answer for you, but on this, I can only philosophize a little bit together with you.

Anthony Evans: Well, that's the point!

Lodewijk: [19:43] Yeah. Okay. Oh, just to put it up front. Well, for sure what we see is, that the world's like, I think like one of the reasons why cloud native is so popular is because it is, is cloud independent also. Well, at least in theory, because in practice, of course there are still all these different couplings and so forth, but this kind of cloud independence is becoming more and more important. And we've actually seen a couple of initiatives of big vendors, like IBM and Red Hat. And then, there's probably a couple more, who are actually working on multi-cloud management platforms, that make it kind of abstract away the cloud.

Lodewijk: [20:36] I'm not sure whether it's really going to happen that way, I mean not the way they envisioned it. For sure, there needs to be some type of abstraction, but unfortunately, as it goes with any abstraction that I've seen, at least in IT, is these abstractions are, well, leaky and useful at best and useless most of the time. Especially, when it comes to something as complicated as the cloud, which has so many different types of components. And if you look at also, the different types of services and how to optimize these different services, that requires a lot of in-depth knowledge. So for sure there are cloud cost platforms, which are also cloud independent, but I don't see any kind of super useful abstraction appearing anytime soon. What I would like to see though, and that's something that Mark and I have been thinking about pretty much from the beginning, is that the model that we have in StackState also lends itself super well for the different types of states.

Lodewijk: [22:00] So when we actually came up with the name StackState, we thought about not just the state in terms of the type of state that you use for observability, so uptime, availability, stability issues, performance, etcetera. But also for costs. Because the abstraction that we have, which is the time-traveling topology, lends itself super well also for tracking costs. On the other hand, as I said about leaky abstractions, cost is a very complicated and very cloud-specific type of deal. Because you have to really look at the type of service and how you're using it and what the pricing model is, etcetera. So for sure, we will forever be moving into the direction of, or we keep the abstraction in such a way that we can at some point start to, well, not pivot, but add a module on top of StackState, where we would also do some type of cost analysis. But we're super careful not to touch upon that because we really recognize all the difficulties that exist in that space and how much effort it takes, especially when you go multi-cloud.

Mark: [23:32] And also what I find really important is keeping focus and doing one thing at a time and doing it well, nailing it, and then thinking about the next thing. So, that's the way we worked for StackState for a long time and that's really important. So, cost management is something we will not do in the short term for sure. But if you talk about costs, there's one other aspect to that. And I do see that happening and it's especially related also to the cloud and container world, because basically one of the reasons people choose to run all their applications and servers on Kubernetes, is that at least they have a way of migrating it easily from one cloud to another cloud.

Lodewijk: Yeah.

Mark: [24:11] So I see, two types of customers basically, we have customers that are really going deep into one of the cloud providers using all their services and go full in basically. And there's also no way out then because you use very specific services so getting to another cloud will basically almost be impossible. But I also see quite some companies that really focus on ‘let's use it in a foundational layer which is everywhere available’ and then build on top of that. And Kubernetes is really one of those foundational layers.

Lodewijk: [24:40] Yeah. And then what you see also with software architecture, it's super important that you keep track of all your non-functionals. Anthony, at some point you mentioned all the different aspects that go into building a tech company. And if you just take one step back and just look at the software engineering part, you always have to keep track of a bunch of non-functionals. And the way, well, I remember being a software architect 10 years ago, cost wasn't really one of those aspects. You looked at security, you looked at like high availability. You tried to prevent bottlenecks. You thought about test coverage and quality assurance and all those kinds of things, but cost really wasn't on the horizon. Nowadays, cost is a big driver for how you architect your applications. And so indeed, having that multi-cloud possibility is super important.

Lodewijk: [25:48] And another aspect of that is also what we see with serverless, that cost model is super relevant. Even my dad actually is also a software developer and he develops an app for doing bridge tournaments because he really likes to play bridge. And he's now also considering re-architecting the backend of his app for serverless, because it's such an attractive model because these tournaments, they happen very sporadically. When they happen, there are few requests and in the meantime, you don't want to pay for anything. So, but then of course we talked about that at the coffee table last week, but I told him, yeah, of course the idea is that it's cheap, but then if your app explodes and you're serving customers 24 hours a day, then at some point it becomes expensive.

Lodewijk: [26:56] And that's exactly the cost model that cloud providers also kind of hope that you fall into because that's going to provide them a lot of money. So, what we see there, for example, is that OpenFaaS, functions as a service on top of Kubernetes, would then be another logical progression. So everything that's happening in cloud is also happening in the cloud native landscape. And so, yeah, it's very interesting to see that cost has become such an important driver for modern app development.
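
Editor's note: the trade-off Lodewijk describes comes down to simple break-even arithmetic. The Python sketch below uses a deliberately simplified model in which serverless cost scales only with request count, and the prices are made-up placeholders, not any provider's actual rates. Real pricing usually also depends on execution time and memory.

```python
def serverless_monthly_cost(requests_per_month, price_per_million_requests):
    """Pay-per-use cost under a simplified, request-count-only model."""
    return requests_per_month / 1_000_000 * price_per_million_requests


def break_even_requests(fixed_monthly_cost, price_per_million_requests):
    """Requests per month at which pay-per-use equals an always-on server."""
    return fixed_monthly_cost / price_per_million_requests * 1_000_000


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
FIXED_SERVER_COST = 50.0          # always-on server, per month
PRICE_PER_MILLION_REQUESTS = 2.0  # serverless, per million requests

sporadic = serverless_monthly_cost(100_000, PRICE_PER_MILLION_REQUESTS)
busy = serverless_monthly_cost(100_000_000, PRICE_PER_MILLION_REQUESTS)
threshold = break_even_requests(FIXED_SERVER_COST, PRICE_PER_MILLION_REQUESTS)
```

Under these assumed numbers, an app with sporadic traffic pays a small fraction of the fixed server cost, while an app serving requests around the clock ends up paying several times more than the always-on server.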

Anthony: [27:34] Cool. Well, we've run out of time. Is there anything you guys want to end out with given our audience, it may be at first Xebia and StackState people, as well as some prospects and clients, I've got a bunch of clients that are going to do these kind of interviews. Anything you want to say at this point in time and then we can wrap up?

Lodewijk: [28:02] Yeah, I really liked how the conversation went. I listened to a lot of podcasts myself, actually. I typically fall asleep with podcasts as well as I, it's a nice method for me to fall asleep with. What I do like is these podcasts where they take a few topics and they tangentially meander through a bunch of different topics that are related. And I, I really felt that was exactly what happened. So I think this would be an interesting podcast for myself to listen to, and so, job well done.

Anthony: Good! Mark?

Mark: [28:44] Yeah, it was a very nice interview I would say. And a nice podcast. And, if there's only one thing I will say to people: always listen very well to the problems you see. If you think you have a nice solution for that problem, discuss it with more people to see if they also find it a very nice solution, and then follow your dreams and try to extend it.

Anthony: Nice. Well again, thank you guys for taking the time. Thanks guys. Enjoy your weekends.

Annerieke: [29:13] Thank you so much for listening. We hope you enjoyed it. If you'd like more information about StackState, you can visit www.stackstate.com and you can also find a written transcript of this episode on our website. So, if you prefer to read through what they've said, definitely head over there and also make sure to subscribe if you'd like to receive a notification when we launch a new episode. So, until next time.

Subscribe to the StackPod on Spotify or Apple Podcasts.

About StackState

StackState’s observability platform is built for the fast-changing container-based world. It is built on top of a one-of-a-kind “time-traveling topology” capability that tracks all dependencies, component lifecycles, and configuration changes in your environments over time. Our powerful 4T data model connects Topology with Telemetry and Traces across Time. If something happens, you can “rewind the movie” of your environment to see exactly what changed in your stack and what effects it has on downstream components.

Curious to learn more? Play in our sandbox environment or sign up for a free trial to try out StackState with your own data.
