Implementing SRE at the Largest Online Retailer in NL and BE | E5

Bart Enkelaar - lead SRE at bol.com

With more than 13 years of experience in backend engineering, Bart is now a lead SRE at the largest online retailer in the Netherlands and Belgium: bol.com. His challenge of the past 2 years is rolling out site reliability engineering to over a hundred DevOps teams, and that's why we're thrilled he said yes when we invited him to the StackPod!

In this episode, Bart and Anthony talk about:

Why Bart and his team decided to set up Google's SRE framework within their organization
How easy it was to put Google's theory - which can be seen as nirvana - into practice
How bringing about that culture change was especially challenging and how Bart and his team are continuously improving that
Why music and SRE is a great (but unexpected) combination

You can find a written transcript of the episode below. Enjoy the recording!

Episode transcript

Bart: [00:00] And that resulted in, after we had done this whole transformation later, we found that especially the consistency in our operational stability was just, there was no consistency. Essentially, each team did it in their own ways, and some teams were doing amazing, others not so amazing because they were focused on other things. So that's why we took this framework that Google developed in site reliability engineering and is now being developed everywhere.

Annerieke: [00:28] Hey there, and welcome to the StackPod. This is a podcast where we talk about all things related to observability, because that's what we do and that's what we're passionate about, but also what it's like to work at a tech company. So if you are interested in that, you are definitely in the right place.

Annerieke: [00:44] Today's guest is a lead SRE at the biggest e-commerce platform in the Netherlands, which is bol.com. His name is Bart Enkelaar, and Bart is not only crazy about site reliability engineering, but also about music and how to combine SRE and music, which is quite the unusual combination. So let's dive into it and enjoy the podcast.

Anthony: [01:08] First and foremost, Bart, do you want to just introduce yourself and what you do and if possible where you work and that kind of stuff?

Bart: [01:16] Yes. Sounds great. So, hi, my name is Bart Enkelaar. I'm a lead site reliability engineer at bol.com, which is the largest online retailing platform in the Netherlands and Belgium. I've been a backend engineer for about 13 years now. I've been at bol.com for seven of them, so I'm quite happy there and it keeps challenging me every day. And my latest challenge of the last two years is taking the next step on our DevOps journey, which is rolling out site reliability engineering to over a hundred DevOps teams.

Bart: [01:57] That's where my main focus is and the reason that we felt the need for that next step on our DevOps journey is that we did our like DevOps transformation in around 2016. And then two years later, we were done and we built pretty cool tooling that enabled our developers to do their own deployments and get alerts themselves. But we didn't spend enough time on what does it actually mean to actually combine development with operations. And that resulted in, after we had done this whole transformation later, we found that especially the consistency in our operational stability was just, there was no consistency. Essentially, each team did it in their own ways, and some teams were doing amazing, others not so amazing because they were focused on other things. So that's why we took this framework that Google developed in site reliability engineering and is now being developed everywhere. And why we are taking that as our next step.

Anthony: [03:18] How easy was it to put the theory into practice? Because Google's thing primarily it's a view of Nirvana and going back to the early days of Google, it was like a dream to go and work at Google and I think a lot of people still have this utopian view of the Google campus and they just do things differently and get more done. And for people who aren't aware, bol.com is an eCommerce site.

Bart: [03:49] Yes.

Anthony: [03:50] Just so that people aren't aware of who bol.com is, but an e-commerce site primarily based in the Netherlands. Is that correct?

Bart: [03:57] The Netherlands and Belgium. Yeah.

Anthony: [03:59] But how easy was it putting the theory into practice, if it was easy at all?

Bart: [04:06] Yeah, it's not easy. It's definitely a journey that... The first books came out in 2016, right. And back then immediately some of our operations engineers and some software developers at bol.com took notice and they took these theories to heart. And those are particularly the ones that were already very operations aware and were applying it to their own team, to their own little corner of bol.com. And we found that more and more people got interested in these ideas and that matched up with the problems that we were seeing.

Bart: [04:52] So what that means is that this has been a journey maybe already five years in the making, but we didn't really start to take it seriously until maybe the end of 2019, which is when we did the first pilot. We thought, well, maybe we can have a part-time team. And that was a wonderful failure that we learned a lot from. That was lots of awesome people together who did not achieve the things that they wanted to achieve, because bringing about that culture change, and essentially it is a culture change, that is really hard work. We're a company of 2,500 people, with something like 700 developers, and getting them all to think alike on this particular topic is essentially a culture change. And so when that failed, we used that as momentum to start with a full-time team early 2020. So now we've been doing it for a year and a half, and it's a journey every day. We take steps and we've identified a maturity model for our team, but as we were identifying that, we were also seeing that there are key pieces of enablement that we're still missing. Key tools that we need to build or apply.

Bart: [06:30] So, we've been doing this for a year and a half and we're only now, I think, getting the traction that we really need, but there have been lots of learnings and things that made bol.com a better place to engineer and to work and to buy along the way. But it's definitely not easy, but I guess that's what makes it fun.

Anthony: [06:53] Since you've instituted some of these changes and you've implemented the team, have you seen any practical or even business objectives improve? Like SLOs and all this kind of fun stuff?

Bart: [07:07] The interesting thing is that we were coming from a situation where we were measuring things quite traditionally, so like golden signals, but then pre-generated from a standard team to all the teams. And that was actually for only applications that run in our own data center, our cloud applications didn't by default measure things. And then many teams had to figure this out for themselves, where for most of our applications, we are still in this area of, at least now we know how badly we're doing, and we know this in a more relevant way, but we have of course seen relevant improvements. Because one of the first things we did from SRE out was build a way to handle out-of-office support in a decentralized way that did not require all software engineers to be on call. Because, both financially and gate-keeping wise, we didn't want to ask all engineers to have to be on call, but we did want to have all engineers be responsible for their own software, for the running of their software. So we set up virtual teams of software engineers arranged per product domain, so per value chain essentially. And they take the out-of-office-hours responsibility for responding to alerts on cloud services.

Bart: [08:53] And where I'm going for benefits is that at one point there was this issue with a health check on Pub/Sub. Pub/Sub is our Google Cloud-native messaging system. We use mostly Spring Boot-based applications and they come with default configured health checks, and one of them was failing with a new version of the Pub/Sub driver. And in this application, where we had the software engineers on duty arranged, that was discovered and fixed within 45 minutes. And then in 20 other applications that had the same issue, it was discovered eight hours later, and it was resolved only the next day after they came to us to ask for a solution. So that was a clear point where I was like, 'Oh yeah, this works.'

Anthony: [09:58] Yeah. It's the critical event that distinguishes between, 'Let's keep things as is, or let's move forward.' I think one of the biggest challenges you have in technology, especially at the most senior levels, is that there's always an appetite to keep things as is because, at least if I can keep the train going and it's still going at the same speed and it's achieving what it's meant to achieve, then that's it. It really usually takes a critical event to say, 'Okay, well, this train is busted. We need to either improve the tracks, we need to improve the train, it needs to go faster, it needs to go more reliably.' It takes a long time to usually ... especially with something as new as DevOps, because it's more of a buzzword, people always talk about DevOps. And especially when you're talking to C-level executives, they just drop the word container or Kubernetes.

Bart: [11:03] Yeah, yeah.

Anthony: [11:06] 'Okay, when are we moving everything to the cloud? Come on, where is the cloud? Let's get it in there.'

Bart: [11:11] Yeah, exactly. 'It's just other people's computers.'

Anthony: [11:16] We moved our solution because obviously, StackState, we're a four-and-a-half-year-old company, right? So our software was originally built on-prem, it was an on-prem piece of software. We migrated it to the cloud. Even our AWS solutions architect looked at it originally and goes, 'Well, you know, you may just want to look to see if you can take advantage of some services and go a little bit more serverless with this infrastructure because it's going to be expensive otherwise.'

Bart: [11:48] Nice.

Anthony: [11:51] Do you ever have to deal with anything like cost or anything like that? I know a lot of people are always dealing with availability and objectives, but do you ever have to help developers manage the cost of their solution because they're using a stupid architecture or they're looping in a way that just uses too much compute or something like that? Do you ever have to deal with those kinds of situations?

Bart: [12:19] Yeah, so especially also with our move to the cloud cost became a very relevant issue because basically the cloud is just more expensive. You get more, but...

Anthony: [12:33] You don't have to pay for the work up front, but you're going to pay for the compute it's going to use at the end of the month.

Bart: [12:40] Exactly, exactly. So that means now you have the capability to scale up, but you also have the possibility to scale up. And if that's unintentionally used by a bad algorithm, then you're wasting all that money. So with our cloud adoption, Peter Young did this mostly. So he was originally one guy and I think then he got a team who really started focusing on this cost-control angle. And this is also where our extra layer that we have around Kubernetes, around the cloud comes into play. We have an in-house-developed tool that is a thin layer over Kubernetes. And it mostly accepts Kubernetes like YAML, I think they try to make it as transparent as possible. So what you read in the Kubernetes documentation, you can just put in those YAMLs as well, but in there we can do overlays and safeguards for maintaining costs.
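The idea of a thin layer that accepts Kubernetes-style YAML and applies cost safeguards can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the cap values, and the choice to clamp replicas and CPU limits are hypothetical and say nothing about how bol.com's in-house tool actually works.

```python
from copy import deepcopy

# Hypothetical guardrail values, chosen only for this example.
MAX_REPLICAS = 10
MAX_CPU_MILLICORES = 2000


def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def apply_cost_safeguards(manifest: dict) -> dict:
    """Return a copy of a Deployment-like manifest with cost caps applied.

    The input dict is left untouched. Containers without a CPU limit
    receive the maximum allowed limit as a default.
    """
    capped = deepcopy(manifest)
    spec = capped.setdefault("spec", {})
    spec["replicas"] = min(spec.get("replicas", 1), MAX_REPLICAS)

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        limits = container.setdefault("resources", {}).setdefault("limits", {})
        requested = parse_cpu(str(limits.get("cpu", f"{MAX_CPU_MILLICORES}m")))
        limits["cpu"] = f"{min(requested, MAX_CPU_MILLICORES)}m"
    return capped


# A manifest as it might look after loading a team's YAML into a dict.
deployment = {
    "kind": "Deployment",
    "spec": {
        "replicas": 50,
        "template": {
            "spec": {
                "containers": [
                    {"name": "checkout", "resources": {"limits": {"cpu": "8"}}}
                ]
            }
        },
    },
}
safe_deployment = apply_cost_safeguards(deployment)
```

With these assumed caps, a request for 50 replicas and 8 CPUs comes back as 10 replicas and a `2000m` CPU limit, while the team's original manifest stays unchanged.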

Bart: [14:02] But, to be honest, I wanted to continue with something you said earlier where in innovation there's always this balance between keeping the train running and getting better tracks and getting a better train, essentially. I think that is the heart of what we're trying to achieve with SRE, because it's all about this balance between reliability and innovation. And I've been working on a talk that's basically ‘all software engineering is doomed’, because the more value a system delivers, the more complexity is in that system and the more effort you're going to need to put in keeping that system operational. So, if the total amount of effort you can spend remains the same, that means that the more you innovate, which should be adding more value and thus more complexity to the system, you will, at some point, reach a point where the total amount of effort it takes to keep the system operational is larger than the amount of effort you have to spend.

Bart: [15:22] So this tragical law of software innovation, I think this is where the difficulty comes in in that balance between innovation and reliability and it's all about what do your customers need to have optimized? And it is never one or the other. It's always, you need both reliability and innovation and you need them in a controlled manner.
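Bart's argument can be made concrete with a toy model. The sketch below assumes, purely for illustration, that every shipped feature adds a fixed amount of ongoing upkeep and that the total effort available never changes. Neither the linear relationship nor the numbers come from the conversation.

```python
def releases_until_overload(capacity: float,
                            upkeep_per_feature: float,
                            features_per_release: int) -> int:
    """Count the releases it takes until upkeep alone uses the whole budget.

    capacity: total effort available per period (assumed constant).
    upkeep_per_feature: ongoing effort each shipped feature costs (assumed linear).
    features_per_release: features added with every release.
    """
    if upkeep_per_feature <= 0 or features_per_release <= 0:
        raise ValueError("upkeep and features per release must be positive")
    features = 0
    releases = 0
    # Keep innovating while upkeep still leaves some effort to spare.
    while features * upkeep_per_feature < capacity:
        features += features_per_release
        releases += 1
    return releases


# 100 units of effort, 2 units of upkeep per feature, 5 features per release.
print(releases_until_overload(100, 2, 5))  # prints 10
```

In this model the only ways to push the limit out are to lower the upkeep per feature or to raise capacity, which is one way to read the balance between reliability work and innovation described here.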

Anthony: [15:55] I think you're absolutely right. But I don't think that what you're explaining is necessarily a new problem.

Bart: [16:06] No, it's not.

Anthony: [16:07] I think it's a problem that's been around since software began. Every time there's an innovation, there's something in the backlog that buckled under the pressure, couldn't scale, didn't do whatever. And sometimes these innovations may go in, but it's not until let's say you're an e-commerce site, so let's say I do an enhancement it's June, then all of a sudden, the Christmas holidays come around. So late November, all of a sudden, there's an environmental change in that I've got half a million more people logging into my website and buying PlayStations, then all of a sudden, the whole thing goes down because of a bug that was introduced back in June.

Bart: [16:45] Did you hear about this bug?

Anthony: [16:50] I was just guessing.

Bart: [16:55] Haha, literally this happened, PlayStation 5.

Anthony: [16:57] I think this generation of consoles is the bot, this is the dawn of the bot. I didn't get an Xbox Series X the first time, but I was then part of a Telegram group, which would then send out a link as soon as one came out on one of the digital things. I managed to go to the microsoft.com store, I added it to my shopping cart, it was in my shopping cart. I then go to click checkout. Errors, always errors, errors, constantly errors. And so it turns out that because there was so much traffic, I couldn't execute the checkout service, so I ended up just pulling up a bit of JavaScript, running it in my Google Chrome console, it took six hours of looping the confirmation before I finally, at 10 o'clock, got a text message saying, 'Oh, you've just paid a thousand bucks to Microsoft.'

Anthony: [17:54] So that was a real thing and that's an environmental change, but that's not necessarily a code change, but an environmental change affected the availability of that checkout service.

Bart: [18:09] Exactly.

Anthony: [18:11] And so I think one of the big things that you are now going to see, and which goes back to your point, is that because we are now living in this Kubernetes containerized environment, the rate of change is not just code, boom. There's so much now going into all of the nuances, right? So when you deploy your code base with that YAML profile, you're basically saying, 'Okay, Kubernetes, you can only use so much resource with this particular function over here.' And that can introduce bottlenecks that you didn't know about, but then being able to understand that, at that point in time, there was so much traffic that these containers were only able to execute 20 threads at a time to get the execution of the checkout is part of this new layer, if you will, of changes, equaling availability issues.
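The bottleneck Anthony describes, containers that can only run 20 threads at a time, can be estimated with Little's law: requests in flight equal arrival rate times time per request. The sketch below is a back-of-the-envelope calculation. The half-second checkout latency and the 1,000 requests per second target are made-up numbers, and the function names are hypothetical.

```python
import math


def max_throughput(concurrent_threads: int, seconds_per_request: float) -> float:
    """Upper bound on requests per second for one container (Little's law)."""
    if seconds_per_request <= 0:
        raise ValueError("seconds_per_request must be positive")
    return concurrent_threads / seconds_per_request


def containers_needed(target_rps: float,
                      threads_per_container: int,
                      seconds_per_request: float) -> int:
    """Smallest number of identical containers that can absorb target_rps."""
    per_container = max_throughput(threads_per_container, seconds_per_request)
    return math.ceil(target_rps / per_container)


# 20 threads per container, and an assumed 0.5 s per checkout request.
print(max_throughput(20, 0.5))           # prints 40.0
print(containers_needed(1000, 20, 0.5))  # prints 25
```

If latency doubles under load, the same containers handle half the traffic, which is how an environmental change can exhaust a limit that looked generous when the YAML was written.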

Bart: [19:08] Yeah, yeah. With the software verification of the infrastructure layer, that becomes part of your design. And that becomes part of what innovates and what moves. And it's been quite clear to me ever since I started working with Kubernetes, actually, that Kubernetes is not going to be the end step of our cloud infrastructure organization tooling. And, as you do as a software engineer, I've been considering it would be fun to write a new programming language one of these days. But you want to write something that makes sense. And these two thoughts have recently combined, and wouldn't it make sense to have a programming language that understands both the infrastructure layer and the business domain layer? So that you can actually basically type-check your infrastructure against the needs of your algorithms. I don't know.

Anthony: [20:30] I think something that can invoke elasticity across multiple services is definitely something that is needed. At the minute that is something that is generally missing, right? So if you've got Kubernetes, it'll spin up as many containers as you tell it to in order to fulfill the workload that it's been provided. But, as far as I'm concerned, Kubernetes is just the next step in virtualization, right? We went from a physical to VMware virtualized to now Kubernetes. But all Kubernetes is doing is running these containers, which are effectively limited operating systems to run specific purposes and fulfill specific things. You're still limited by the traditional virtualized world in that you can only give so much resource per container, the physical machine that they sit on can only do so much. You can tell this because AWS and Google Cloud, they charge you based on how many minutes you keep the cluster active, not how many containers you're running and all this kind of fun stuff. It's like, 'Okay, here's the cluster, it's using X amount of compute all the time, unless you give it more, then that's it.'

Bart: [21:49] Yeah.

Anthony: [21:50] But I think the next thing is where people are going to evolve to be serverless. And I think there's going to be a version of Kubernetes, it won't be Kubernetes, there'll be something else in the future, which will effectively be able to not only manage maybe some traditional things like containers and maybe virtual machines and maybe traditional databases. Because that's another thing that Kubernetes isn't really good at managing right now. Because you've still got a need for a database a lot of the time, even if it's NoSQL, MySQL, Graph, whatever.

Bart: [22:24] Yeah.

Anthony: [22:25] You're still going to need data stored somewhere. That's also going to get more complicated. Because I think more people with all these denial of service attacks and whatnot that they're going to go to a more blockchain type thing. I think a lot of people don't understand what a blockchain is for the most part, but once they realize that they can distribute data across multiple servers, areas, geographical locations, and you need so much access in order to put together that data, it's just going to be so much more secure, especially for customer data and GDPR and stuff like that. I think they're going to start making companies distribute data as part of GDPR policy, as opposed to just keeping your data for a specific amount of time. I know I've kind of gone off on a tangent there, but...

Bart: [23:15] No, but it's an interesting angle. So, so you use the naturally distributed security essentially that a blockchain gives you as a defense mechanism against things like data hijacking?

Anthony: [23:33] Correct. Yeah. So people stealing data, doing a ransomware attack, because at the minute what's happening is, is that somebody opens an email, they lose their computer, or they may not even know that they've lost their computer, but all of a sudden then all the services that they're connected to through VPNs through all this other stuff, all the passwords that they're using, just get piled back in. Whoever that hacker is, or that malicious entity they're going in then and using whatever thing it is and nobody is really distributing their data. Everything at the end of the day sits in either one area or one subset of areas. And that's how people lose their data, because it's not secure enough basically. That's it.

Bart: [24:21] So how do you rhyme that with, because you're saying GDPR is going to demand this, but how, how do you rhyme that with the right to be forgotten?

Anthony: [24:32] So it's the same thing, right? So, with a blockchain, there's a ledger, and that ledger contains not just the transaction approvals and what's going on, but then also the actual physical data that's being transferred. You could easily do something where it's like, if you've got a piece of data and it's not been verified in over seven years, you could assume that that verified data is, or that data that was verified, because you could verify it by somebody logging into their account, somebody accessing a piece of information. Boom. As part of the ledger, yes, they can do that, bring together the data. Let's present the data to that person, that adds a check into the ledger. You can say, 'Okay, as long as you haven't accessed your data within seven years, it's going to go. It's no longer part of our ledger.' Does that make sense?
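The retention rule Anthony sketches here, where data that has not been verified by an access in seven years is dropped, can be written down in a few lines of Python. This is only an illustration of the rule itself, using an ordinary in-memory list. It is not a blockchain or a distributed ledger, and the `Record` type and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# "Seven years" as a fixed window, ignoring leap days for simplicity.
RETENTION = timedelta(days=7 * 365)


@dataclass
class Record:
    owner: str
    payload: str
    last_verified: datetime


def verify_access(record: Record, now: datetime) -> None:
    """An access by the owner counts as a fresh verification."""
    record.last_verified = now


def purge_expired(records: list[Record], now: datetime) -> list[Record]:
    """Keep only the records verified within the retention window."""
    return [r for r in records if now - r.last_verified < RETENTION]


now = datetime(2021, 6, 1)
records = [
    Record("alice", "order history", datetime(2010, 1, 1)),
    Record("bob", "order history", datetime(2020, 1, 1)),
]
kept = purge_expired(records, now)
print([r.owner for r in kept])  # prints ['bob']
```

Filtering a local list is the easy part. Guaranteeing the same deletion on every copy in a distributed network is the harder problem.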

Bart: [25:26] I think so, but you'd have to encrypt it in some way, right? Because you can't guarantee that all parts of the distributed network will delete that data.

Anthony: [25:42] Yeah. You would have to do it as part of your... in cryptocurrency form, they have this idea of the mint at the very beginning. So the mint defines the behavior and the availability of the cryptocurrency. That would be when you'd have to define these GDPR policies so that there would have to be almost like a self-destruct on the data, in a way.

Bart: [26:04] Yes, exactly, exactly.

Anthony: [26:05] Do you know what I mean?

Bart: [26:06] Interesting idea.

Anthony: [26:08] See? We went completely off topic there.

Bart: [26:09] Yes. And it was fun.

Anthony: [26:15] No, there we go. So what's your next product idea?

Bart: [26:20] Yeah. Well, if you build a database that is natively GDPR compliant, I think you have a killer feature for the future. I think that's something, and if that's also natively resilient against data hijacking, man, you can sell that. Especially if it's also performant.

Anthony: [26:50] Well, I just need some smart people to build it. But yeah, no, it's interesting. Getting back onto some of the things that you're doing, I know you have a podcast, you've got some very interesting back-stories. Your father-in-law wrote a book as well.

Bart: [27:16] Yeah, exactly.

Anthony: [27:17] Let's get to know Bart a little bit more and his interesting back-story as opposed to talking about just technology.

Bart: [27:24] Okay. Okay. But I like technology and talking about it. Yeah. So I've been doing some talks at conferences over the last year and a half and that's been lots of fun. And through that, there was a colleague that I respect a lot and we always had really interesting differences of opinion while we were both working at bol.com and I was like, 'Yeah, but I think it's this.' And he was like, 'Yeah, but you think it's this?' And we both really respected the way we disagreed with each other.

Anthony: [28:05] Okay.

Bart: [28:05] So then, when he left the company, it was like, 'Hey, I'm going to miss our disagreements.' And he was like, 'Yeah. Okay. So shall we just continue them?' And that's why we started Friendly Tech Chats, which is on YouTube. And my wife and I, we just had our second child, so yeah.

Anthony: [28:26] Oh congratulations.

Bart: [28:19] Thanks. He's eight weeks old now and he's awesome. But so we had a little bit of a hiatus while we were getting to know our new family. But before that, every week we'd released a - so this is me again, not my family - we'd release a new 10-minute episode where we talk about software engineering stuff. And to be honest, mostly we've been talking about software engineering quality, because we both care a lot about building the right thing right. And that's quite hard. So mostly over the first half-year, we've been discussing different ways to approaching test strategies, test automation strategies, basically, as a software engineer.

Bart: [29:26] And then recently we've moved more into other aspects like SRE and we just taped an episode on hexagonal architecture yesterday. So we talk about all kinds of stuff and we try to find the disagreements. But to be honest, the more we talk, the less we disagree.

Anthony: [29:47] Yeah. That makes sense. I think when you're working on something side by side you've got the red lights going and one of you is like, 'No, you're looking at the wrong thing.' And you're like, 'No, I'm pretty sure it's coming from here.' And you put together all the alerts and the logs. Yeah. I used to do that, but mind you it's good that you guys got along. I found that when I was in the ball pit, trying to figure out root cause, I was dealing with a bunch of people that always wanted to know the answer first and were very, very opinionated around whether they were right or wrong. And then it was so satisfying when I'd, like, press enter on my solution and all of a sudden, the application works. I would be like, 'There, you see?'

Bart: [30:39] Yeah, I get that.

Anthony: [30:40] It's not as satisfying after the fact because you don't get that ability to be right.

Bart: [30:47] True. What I appreciate in being wrong is that afterwards you're better.

Anthony: [30:56] Yeah. Because you certainly don't forget when you're wrong.

Bart: [31:01] Exactly.

Anthony: [30:51] That's actually when you learn. Yeah.

Bart: [31:08] I also sing in an Irish folk band! I would say I have an amazing wife and two gorgeous sons. They're two years old and eight weeks old. But I also sing in an Irish folk band, which is the Greenhill Travelers, and COVID is finally loosening up, so we're getting ready to play again. So if anyone's looking for an Irish folk band, the Greenhill Travelers are great.

Anthony: [31:37] Do you have a website?

Bart: [31:42] Yeah. It's greenhilltravelers.nl. There's also actually an album on Spotify, but that's actually not me singing, that's the previous singer, but he stopped showing up at some point, so then they got me.

Anthony: [32:00] That's a great plug. It's like, you can listen to us at Spotify, but don't because it's the old guy.

Bart: [32:15] Yeah. At least those are the songs we play, and it's pretty decent. To be honest, these musicians are amazing. They're the best musicians I ever worked with and I really love Irish folk music because that has so much heart. So it's great fun to do. And that's something I also do. And maybe bridging it back to technology, over the last years I was saying I've been doing some talks and in some of those talks, I'm also incorporating music. I wrote a site reliability engineering musical for SLOconf which is a three-part musical. And the first scene is set to the theme song of The Big Bang Theory.

Bart: [33:04] Big nerd alert. But yeah, I've got all the geek creds, but so it's basically a history of site reliability engineering set to the Big Bang Theory tune. And then the second scene is a meeting like an all hands meeting where everyone is complaining about things that are going wrong, that they don't understand. And then the site reliability engineer is saying, 'Oh yeah, you can fix this with an SLO. Oh, you can fix this with an...' Basically, all the problems you can fix with an SLO. And that is set to the theme song of Game of Thrones. It's like, 'A SLO man, build a SLO man, build a SLO.' Then it ends with, There's ‘No Business Like a SLO Business’ with a big... So yeah, that was the site reliability engineering musical made for SLOconf.

Bart: [34:04] But it's also on YouTube. Maybe you can link it or something. It was a lot of fun to make.

Anthony: [34:10] I was going to attend SLOconf, I'm part of a Slack group for SREs. But they, some of those guys...

Bart: [34:19] The Seattle SRE thing.

Anthony: [34:21] Yeah. The Seattle SRE.

Bart: [34:22] Yeah, exactly.

Anthony: [34:25] Are you in there as well?

Bart: [34:27] Yeah.

Anthony: [34:18] So I'm in there. I attended their coffee hours a few weeks back.

Bart: [34:25] Nice. Nice. Yeah. I have an invitation somewhere, but it's next to all the other invitations. Can't do all the things.

Anthony: [34:51] Awesome. Well thanks again Bart, for joining us today. It's been a real pleasure.

Bart: [34:42] Yeah. Thanks for having me.

Anthony: [34:22] And I want to know more about this musical. You've got to send me the screenplay or something, some content, whether it's recorded or something somewhere, that would be really fun to listen to.

Bart: [34:54] Yeah. Thanks and good luck with StackPod. And if you have a panel on SRE or rolling out SRE to people or maybe music and engineering, I'd love to be in that.

Anthony: [35:25] Cool. Yeah, no, definitely. We will do. And yeah, thanks again for doing this, man. It's really appreciated and hopefully we'll do this again soon. Okay.

Annerieke: [35:24] Thank you so much for listening. We hope you enjoyed it. If you'd like more information about StackState, you can visit stackstate.com and you can also find a written transcript of this episode on our website. So if you prefer to read through what they've said, definitely head over there and also make sure to subscribe if you'd like to receive a notification whenever we launch a new episode. So, until next time...

Subscribe to the StackPod on Spotify or Apple Podcasts.

About StackState

StackState’s observability platform is built for the fast changing container-based world. It is built on top of a one-of-a-kind “time-traveling topology” capability that tracks all dependencies, component lifecycles, and configuration changes in your environments over time. Our powerful 4T data model connects Topology with Telemetry and Traces across Time. If something happens, you can "rewind the movie” of your environment to see exactly what changed in your stack and what effects it has on downstream components.

Curious to learn more? Play in our sandbox environment or sign up for a free, 14-day trial to try out StackState with your own data.
