EP #14: Moving From Network Engineering to Site Reliability Engineering With Murali Suriar of Snowflake (Former Google)
Murali: A thing that a lot of people get hung up on is "if I'm not doing all of these things that" pick your company, right? Facebook, Google, Amazon, whatever. "If I'm not doing all these things, then I'm not doing SRE." These are all just tools, right? They're tools in the toolbox and you pick the tools you need for where you are right now, right? Like solve the problems you have. Don't go and do a thing just because you feel like it's the thing you should do.
Annerieke: Hey there, and welcome to the StackPod. This is the podcast where we talk about all things related to observability because that's what we do and that's what we're passionate about, but also what it's like to work in the ever-changing, dynamic, tech industry. So if you are interested in that, you are definitely in the right place.
Annerieke: So in this episode, Anthony talks to Murali Suriar. Up until a few months ago, Murali worked as a Site Reliability Engineer at Google, where he also edited the groundbreaking book that is the foundation of the SRE practice: The SRE playbook by Google. So, to give you a quick overview of his career: after finishing his computer science studies, Murali first started working as a support and then network engineer at Goldman Sachs. After Goldman Sachs, he started as a network engineer at Google, then dove into the SRE roles and ultimately became an SRE tech lead for one of Google’s services. After about 11 years at Google, he very recently switched roles: Murali now works as a senior site reliability engineer at Snowflake - a cloud data as a service provider.
Annerieke: We were very happy that Murali said yes when we invited him to the StackPod, because he has a lot of experience in different roles at these very different types of companies: from a very large, perhaps more traditional bank, to a tech giant and the founder of the SRE practice to a - still big but much smaller than his previous companies - software scale-up. So, in this episode, Murali and Anthony discuss how and why Murali got into SRE from network engineering, what are some of the differences and perhaps similarities when implementing the SRE practice at different types of companies and how you can implement the SRE practice if you are interested in it, but - for example - you work at a smaller company and perhaps you don’t have that many resources. Murali has some great tips on that too. So, I will not let you wait any longer. Let’s dive into it and enjoy this episode.
Anthony: Hey everybody welcome back to the StackPod. My name is Anthony Evans, your host every week providing you... Well, every other week I should say, I've already messed it up and we're only 10 seconds in.
Anthony: Today I'm here with Murali, who goes by Mux to make it easier on at least myself. We're joined here today to talk about site reliability engineering, but then also what site reliability engineering means when you're in the different verticals. So we often think of site reliability engineering as being the guys who react to alerts, but more often than not the context of those alerts, the things that need to happen, the amount of approvals, what it means, the teams that they need to involve, that changes drastically based on not just the product and the services that they're supporting but the company that they work for and the priorities of the company, right?
Anthony: And so I'll let Mux introduce himself and let him walk you through where he comes from and his background. There you go.
Murali: Thanks Anthony, and thank you for inviting me on. Yeah, so my name is Murali Suriar. I go by Mux just because my name is hard to say. I have done many things. I did a computer science degree a long time ago and then, basically through chance, my first job was with a big financial, working more in kind of an infrastructure role. And this is actual like physical infrastructure, not the cloud, on-prem servers, on-prem networking, physical connections to other financial institutions, that kind of stuff. And then from there, I decided I quite liked infrastructure and so bounced from there into kind of network engineering, still at the bank, so focusing very much on routers and switching, your Cisco/Juniper that kind of stuff, your big name networking vendors and just building your network.
Murali: And then from there that spring boarded into kind of network operations at Google. And I did that for a year. So that was very much the sort of thing you mentioned earlier, the responding to alerts, responding to fiber cuts, responding to whatever stuff breaks Google's network - both internal and external I should say. From there bounced into network engineering, so doing more kind of design and planning and deployment, and then eventually into an SRE role at Google. We can talk a bit more about the nuances of the SRE role because you said some things that I would take gentle issue with.
Murali: So spent, God so when was that, 2014? Yeah so spent seven nearly eight years doing SRE at Google in various guises which we talk about in more depth, and then most recently moved on to a company called Snowflake who are a database as a - cloud data as a service provider and doing what they're calling production engineering, SRE production engineering. There are lots of terms in this space as well, which have more or less meaning depending on who you talk to. And I've been doing that for about three months. So I'm still very much new to the smaller-if-not-that-small company space.
Anthony: Drinking the Koolaid still.
Murali: Yeah, very much so.
Anthony: Getting into the thick of it and I do apologize. I really botched up that intro and that initial piece. I'm talking to one of the people that reviewed the bloody SRE handbook and I'm just saying, oh SREs, they do this, that and the other, and of course you would take gentle issue with that very high level analogy of what a site reliability engineer does. But actually speaking of that, just to give people some context around your credentials, you've worked for some really big companies, you mentioned investment banking, you mentioned several roles at Google, right? So it's not just like you went in there, spent 18 months and then you moved on kind of thing as a notch in your belt or your resume, but you actually made a career out of it, right?
Anthony: And so you went from being in network operations to then the engineering side. So they let you out of the cage if you will, so that you were no longer figuring out the connections and doing things, but then designing things and actually implementing those visions, to then moving into that overall picture of site reliability engineering. And Google is the Mecca of site reliability engineering, right? In terms of that ability to bring out a practice in that context, you may or may not agree with that, but everybody who you speak to references Google as like, they must know what they're doing, and it's also the most consumed service, in my memory anyway, that I never experienced downtime with. Facebook, I remember an outage. Google, I've got a Google Home as well. Never really have downtime with it, it's a very reliable service.
Anthony: Would you mind giving people a little bit more context around your input to the SRE community, but then also how did you make that switch from being more on the back end and the doing side to being more of the designer? And then now obviously leading initiatives and leading programs at Snowflake, where you're currently at.
Murali: Yeah. So I guess in order, moving in from network operations, network engineering, that was opportunities given to me by my managers at the time. Right. It was like "Hey, you look enthusiastic. You look like you want to learn. We have jobs you can do." So I was doing, at the time what was 20% time at Google, working on various things actually, which we'll come back to in a second, just getting experience and exposure to things not directly related to my day job, but kind of adjacent and parallel. Then the... I think it's one of those things I struggle with because at the time, and this-- I should highlight, this is not true now, the way that Google operates its network today has evolved significantly to be much more in line with how it runs its software services. In terms of you don't have an operations arm and an engineering arm and an architecture arm, you have mostly-- in most places, you have a site reliability arm and a bunch of engineering teams, but it's a much more collaborative and interactive setup.
Murali: Whereas when I started at least things resembled the up till then conventional way that you would run a network, right? You would have... Wasn't actually a NOC like a network operation center, but you would have very much first line and then engineering and architecture. As for the move to SRE. I think again, it was due to people who gave me opportunities. So my first 20% project actually was with the team who I would later join as my first SRE team. This was four years before the fact, and it was fairly, what's the best way to put this. There's a term haunted graveyard, are you familiar with it? So this is the thing that's been there, sat running forever, which everyone's afraid to touch because they're not entirely sure what will fall over if they break it kind of thing. So like-
Anthony: Oh, I've never heard of that expression, but yeah it does make sense.
Murali: Yeah. There's a couple of SREcon talks, no haunted graveyards. I think maybe lightning talks, which you can search for and find videos on. It was one of those things, right? It's like it's this Python script that runs on what was at the time, a magic machine that had a lot of privileges and it just didn't have enough telemetry and monitoring. And it also didn't have funding to really maintain it and so, "it would be good if we could get this monitoring thing working cause it would be good". And we're like, okay, he wants to do it, let's try and so, okay. I had, it was in Python. I had some Python experience, but no Google software experience and, it was a show of trust. "We're willing to let you work on this thing."
Murali: Like yes, guide rails and yes code review, but this is how this part of the world works right? Without going into details of what the thing was. And that was like, OK, that's cool. And then the other thing that was interesting was, again, mostly coincidence, serendipity, whatever you want to call it. Physically where we were sat, the team I was in at the time when I first joined were opposite the traffic SRE team. So we were kind of within shouting/pager sound/Nerf gun distance of each other and networking. And when I, and sorry, when I say traffic, at least in Google parlance is let's say everything through getting user requests in the front door. Right? So above your networking hardware layer, but think things like DNS, think like your reverse proxies, your NGINX or equivalents, your network load balancers, what would be, I guess, ELBs today in Amazon, whatever.
Murali: And so there was obviously a lot of overlap and shared understanding between those two groups and we ended up chatting to each other, "Hey, what's going on over here? Did you break this?", "No. Oh no. Yeah. That was me," etc. So there was just a lot of interaction there. And a couple of my mentors to this day actually came from both of those experiences. And then when I decided I wanted to do the SRE thing and we can talk about why that was, if you would like, that seemed like an obvious route to go, because I already knew the people. I knew the environment fairly well. They had quite a lot of overlap with my previous role and it worked out and they had happened to relocate to London, which is where I'm from. And so I was like, okay, I can move to London. I can work with this team of people that I know. And so that's kind of how that happened. I feel like you asked another thing, but I've forgotten so we can.
Anthony: Well so, one of the things actually on a kind of separate topic though, this isn't related to my previous question, just something new. One of the things you mentioned, and one of the things that I've always liked about the Google kind of SRE handbook approach was the fact that the SRE should have more of a hands on role within the development world, right? And so you mentioned that you had some Python knowledge, right? And that's how you're able to kind of get in when it comes to that ethos of SRE, how many barriers or obstacles did you have to overcome to, or did you have to overcome?
Murali: And that was it right? Anything serious got written in one of those four languages. And so it meant that the barrier to entry was quite low, right? Like investing in learning how Python at Google worked was serving you quite well. Back then as well, I think the orthodoxy's maybe changed over time, but if you would listen to SRE management in Google at the time they had this philosophy of, okay, you have systems people and you have software people, right? So SRE is fundamentally applying or sorry, Ben Treynor's definition of SRE (Ben Treynor, who is the VP who founded Google SRE however many years ago) is like, take software engineers and point them at operations problems and you kind of get what looks like SRE, because software engineers get annoyed and bored, just like system administration and operations people do, they just solve the problem in a different way and maybe change things so that you don't need as much human interaction. But equally there are domains where you need deep systems knowledge right?
Murali: So networking, storage, there are places where you actually need people who have at least as good an understanding of foundations of internet protocols and stuff, which a lot of people wouldn't get in a typical software engineering or computer science education. So I felt very much in that latter camp of- I had reasonably good networking knowledge when I joined. It was mostly coloured from a kind of enterprise background, like I hadn't done much internet facing stuff when I was working in banking. I was doing kind of business to business and internal datacenter and-
Anthony: But still, if you're in investment banking, there's a... It's a very high demand thing. That pipeline business to business, milliseconds of latency can cost billions of dollars in the right area, right? So I would say that having seen that firsthand, that was also, that can be very stressful to work in, but it's a different type of pressure, right, when you're working in tech.
Murali: Yeah. I think we can talk about pressure and stress as well. But so yeah, to run that in. When I was even still in the networking organization, I worked with traffic team on some, it was like trivial automation, right? It was like, we needed to upgrade some software on these edge switches for some reason, right? I forget exactly what, and it was like, okay, behind them, you had a bunch of proxies that were serving user traffic, and so you had to stop sending users to this location, confirm that was done, do the switch upgrade, reboot the thing, and then bring traffic back and make sure nothing broke, right? Like nothing super complicated. You could run it in a script on your workstation if you really wanted to. But there were several hundred of these things and no one felt like doing that.
Murali: And so I chatted to one of my, at the time future colleagues and said, Hey, you have this thing. Google has an internal automation framework called Sisyphus, which is, it's like a workflow execution framework that gives you a nice UI, right? You say "given these parameters, do this step followed by this step, followed by this step, here are your health checks". And it gives you a nice visualization of here's where I am. You can click pause, you can click retry, all that kind of stuff. And so they had a Sisyphus to do some stuff and I'm like, "Could I teach your Sisyphus how to do switch code upgrades?" And it's like, "yeah. Seems like a thing. Why not?" And so we did that and we saved ourselves, God knows, hundreds of hours of human work messing around with that kind of stuff.
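The upgrade procedure Murali describes (stop sending users to the location, confirm, upgrade, reboot, bring traffic back, confirm nothing broke) can be sketched as a small step runner with a health check after each step. This is only an illustration of the idea, not Sisyphus itself, whose API is not described in the episode. Every name below (`Step`, `Workflow`, `build_switch_upgrade`, the in-memory `site` dictionary) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # performs the work for this step
    health_check: Callable[[], bool]  # confirms the step took effect
    max_attempts: int = 3             # retries before the workflow halts


@dataclass
class Workflow:
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        """Run steps in order; halt at the first step that never passes its check."""
        for step in self.steps:
            for attempt in range(1, step.max_attempts + 1):
                step.action()
                if step.health_check():
                    self.log.append(f"{step.name}: ok (attempt {attempt})")
                    break
                self.log.append(f"{step.name}: check failed (attempt {attempt})")
            else:
                # No attempt passed: stop here so a human can take a look.
                self.log.append(f"{step.name}: giving up, workflow halted")
                return False
        return True


def build_switch_upgrade(site: dict) -> Workflow:
    """Hypothetical upgrade flow for one edge switch, modelled in memory."""
    def drain() -> None:
        site["serving"] = False

    def upgrade() -> None:
        site["version"] = site["target_version"]

    def reboot() -> None:
        site["rebooted"] = True

    def restore() -> None:
        site["serving"] = True

    return Workflow(steps=[
        Step("drain traffic", drain, lambda: not site["serving"]),
        Step("upgrade switch", upgrade,
             lambda: site["version"] == site["target_version"]),
        Step("reboot", reboot, lambda: site["rebooted"]),
        Step("restore traffic", restore, lambda: site["serving"]),
    ])


site = {"serving": True, "version": "1.0",
        "target_version": "1.1", "rebooted": False}
workflow = build_switch_upgrade(site)
succeeded = workflow.run()
```

In this sketch a halted workflow leaves the location drained rather than guessing at a rollback, which is one possible design choice; a real framework would also need pause and resume, which are left out here.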
Murali: So yeah, just the, I don't know that it was- there was certainly some barriers, right? Like it, I, but equally, I think I probably had a less difficult time of it than someone who say, had been nothing but an infrastructure person the whole time. Right? I had written code, like I did four years of a computer science degree and I did do some writing code in my teams at Goldman. It wasn't like I was coming from nowhere. But equally, the nice thing about code is that it is approachable, right? You can fundamentally, if you have a computer, you can download and run and write code and see what happens so.
Anthony: Yeah. I usually find that the barrier to entry is usually around experience, right? It's like, Hey, we brought you on to be this person. I'm quite comfortable with you being very intelligent and doing that job very well but, because that's where I need you kind of thing. But if you were able to use that time, that 20% to grow and they let you do that and they allowed you to pursue that. That's great on them. I think the, sorry, I think the best leaders create independent leaders, right? They don't create followers because then you're still dependent on the head kind of thing. It's one of those things and so then obviously for you, that was great because it allowed you to pursue your creative liberty.
Anthony: There are a lot of people who don't survive in that environment, right? When you give them that flexibility to do what they want, they can get insecure about it, right?
Anthony: So we've kind of discussed a little bit around working and making your transition over to a site reliability engineer that you were given that flexibility, that there was still that environment where you were maybe a little bit fortunate in the, there weren't as many diverse languages at Google at the time and as big as a language barrier, let's put it like that in terms of getting into the development side of the house from more of the infrastructure and the enterprise side of the house.
Anthony: Now that you are in the business to business tech space, right? Moving on from Google to Snowflake, how are you finding the mentality and the differences. Snowflake is a public company, right? So there's a lot of intellectual property that needs to be guarded. It's not like early Google days, right? Where you would have source code on laptops all over the place kind of thing. There's a lot more restrictions in place when the company is at the size that Snowflake is. So how is it though transitioning to something that's strictly in the B2B space as a SRE or as somebody who manages and owns that SRE space?
Murali: Yeah. One caveat, Google, when I joined was still, I think 10 times or 20 times larger than Snowflake is now. It's easy to forget that even in 2010 Google was a behemoth. It was just a slightly smaller behemoth. But, yeah, the transition, in some ways, it's kind of a bit of a return to my roots because I did a lot of business to business stuff when I was working for the bank. That was a lot of kind of electronic trading type stuff where the bank would allow access to its trading systems from third party customers who would then use arcane protocols to place orders on markets and stuff. Differences. So, yeah not being internet, public facing is interesting. It's just a very different thing.
Murali: So at Google I was working on stuff that was all about public internet facing stuff until relatively recently and then I moved into the storage world and that was all kind of internal infrastructure. Stuff would break and it would have downstream implications potentially on external users but very rarely in my most recent three or four years would stuff that I was responsible for be kind of implicated in externally visible outages, the odd occasion, but not much. Snowflake yeah, it's business to business. It does run over the internet, which is a big change. So when I first did this sort of 10, 15 years ago, a lot of business to business stuff was like private lines or private networking providers, that kind of stuff, and everyone's like, you know what, just encrypted comms over the public internet is fine for most things now.
Murali: So, that's kind of changed. The other big change, I think is just, there's a big difference from being a... "We own all of our physical data centers and tin and we run the operating system all the way up to the user land, to the applications that run on top of it" to "we pay people to solve problems for us." I've had to have a crash course in the major cloud providers, because until recently I knew a little bit about Google's cloud offering, but only because I ran the infrastructure that operated under it kind of thing. And suddenly I now need to understand how Amazon and Azure and Google Cloud Platform work.
Murali: And yeah, the other thing is obviously the product alignment, right? The conversations here are much more around commitments to customers, right? What do customers need? What features do they need? What features are they expecting when? That kind of stuff. And I think particularly from a SRE from an infrastructure point of view, there is... Don't get me wrong. It's hard and it can be daunting to be running like a massive internet facing service, but it also gives you a lot of data, right? If you are running web search or Gmail or whatever, there are enough users that just looking at samples of what's going on can give you interesting, meaningful graphs. Whereas if you imagine a B2B company where each customer gets their own instance of your product and they might have five or 10 users using the thing at any given time, you have to actually do a lot more digging to investigate kind of subtle issues.
Murali: At least that's my impression, right? There's the "we can't connect to this thing at all", right? It's DNS or it's networking or it's a firewall or whatever. But then it's like, "occasionally this particular little thing is slow". Often you don't have enough data to be able to see in graphs and stuff like, okay, "when did this happen and why? And what's correlated here?" So it's very challenging and it's a different way of, I think you have to build things differently to be able to kind of investigate and debug things versus when you're running a high volume kind of public internet facing thing.
Anthony: Yeah and well, you've also got unique challenges, right? With Snowflake being a data lake effectively, right? So you've got data going in, data being indexed, data being encrypted for multiple sources, data being queried, data, there is so much layer of obscure... So many layers of obscurity is what I should say when it comes to performance issues to your point. Why does this thing act slowly on this day versus on this day, odds are you're probably going to find out that at that exact same time there's a huge data load of something or a huge report that gets run or a compliance dump that gets pulled that is going to trigger a CPU spike somewhere, that's then going to trigger a VPC limit of a gig or something and then you'll find out that you need to lift that to 10.
Anthony: And then all of a sudden you'll have enough bandwidth for that time of day whatever. It's an example, right, of something very real that you're probably having to investigate that has that nuance. And if you don't have that skillset of understanding, not just infrastructure, but how the code is actually an end user going through and getting something from that database and then coming all the way back and being presented. You're not going to make a very good SRE quite frankly, because that's where you build a lot of the nuance, right, around well, if this little thing flicks over here, okay it doesn't give me a Datadog alert, but will it produce a slower than normal report within the app? Yeah potentially.
Anthony: That's the type of stuff that as a really good SRE, you need to be able to at least investigate, you don't need to necessarily fix it. But being able to then understand where the issue is and then have the opportunity to fix it potentially. That's where I think there's a good opportunity when it comes to Snowflake, right? To embed that kind of practice of continual improvement so that you don't interrupt that development cycle.
Murali: Yeah. I mean, this comes, this whole, "what makes a good SRE, what is SRE the job", is a thing that is a perennial topic of debate. I think your point about lots of layers of abstraction and indirection and not direct access to what's running. Yeah. That's definitely a challenge, right? You have to be able to, you have what your vendors tell you of what things are and how they work. And you kind of guess how they've implemented things underneath that. And then it's just a lot more kind of spooky action at a distance compared to, "I own this machine and I can do strace, I can do a tcpdump. I can do all of these things and see everything going in and out of this physical piece of hardware."
Murali: So yeah, there's definitely an element of that. In terms of systems knowledge. Yeah. This is again a thing that I chat with a lot of my colleagues in the kind of wider SRE sense about. The experience of running your own tin is going to become a minority one if it's not already in the industry, right? If you think about people who have been leaving [university] in the last five, probably 10 years, actually, right? Many of them have probably never run software on prem anything. They've only ever run software in third party clouds. And there's a different set of kind of debugging skill, not just debugging, but also engineering architecture skills, right? A lot more of it is, go to your, I don't know, customer success, whatever the terminology is and say, "Hey, this is the kind of workload we want to run.
Murali: And this is the kind of amount of traffic we're going to see. Do you anticipate that causing a problem?" And then they go away and read their tea leaves and they come back and say, "yeah, probably fine". But in terms of what SRE the skill is, SRE the skill, I think, is really about quantifying risk and making people be explicit about the risk trade-offs they want to take, right? It's not, you don't need to be a super in-depth systems person, right? Be that file systems or networking or storage or databases or tape libraries or whatever it is, right? To me, the important thing is RE, right? Reliability engineering. Reliability is a feature, right? It is a dimension that you, ideally, you should be explicitly planning for, right? You should be saying, and we don't have time at least today to get into SLOs and the various approaches to measuring reliability and availability and the shortfalls of all of them.
Murali: But you need to fundamentally get an agreement from your business. "We want to be at least this good", right? Where good can be defined by yourself, by your customers, by some contract or whatever and you want to be able to measure, are you doing that? And are you getting better or worse? And what changes should we make to move the needle on that in a kind of macro level? I don't think it's true that every SRE needs to be able to do this kind of deep debugging. Does it help? Yes. But does it also help to have really good software engineers who can go and then dig into your code and hand optimize a tight loop that suddenly means that you do an order of magnitude less garbage collection inside the JVM, for example, right? That is also reliability work because that will massively improve your latency.
Murali: And particularly at the tail, right? I think it is what you make of it to a large extent and I think the other thing that is becoming clear for- if you are not a company that can afford to have hundreds or thousands of people dedicated to reliability, operations, production, whatever, which is most companies, right? Most companies are not Microsoft, Google, Amazon, Facebook. Then it is not practical to have, I think, a dedicated team of people who do just this, right? It is much more, to your point, about spreading that knowledge and that approach more widely in your engineering and product organisations, possibly supported by tooling, right?
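The episode does not go into SLOs, but the "we want to be at least this good, are we doing that, and what about the tail" idea can be sketched in a few lines. The following is a minimal illustration under assumed inputs: the request counts, the 99.9% target and the latency samples are made up, and the function names are hypothetical rather than taken from any tool mentioned in the conversation.

```python
import math
from typing import Sequence


def availability(good: int, total: int) -> float:
    """Fraction of requests that succeeded; an empty window counts as fully good."""
    if total == 0:
        return 1.0
    return good / total


def error_budget_remaining(good: int, total: int, target: float) -> float:
    """Share of the allowed failures not yet used (negative means overspent)."""
    allowed_failures = (1.0 - target) * total
    failed = total - good
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no room for any failure.
        return 0.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=99 for the tail latency."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up numbers: 100,000 requests, 50 failures, agreed target of 99.9%.
measured = availability(good=99_950, total=100_000)
budget_left = error_budget_remaining(good=99_950, total=100_000, target=0.999)

# Made-up latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [10.0] * 98 + [200.0, 900.0]
median_ms = percentile(latencies_ms, 50)
tail_ms = percentile(latencies_ms, 99)
```

With these numbers the median looks healthy while the 99th percentile does not, which is the sort of gap that tail-focused work such as the garbage collection example is aimed at.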
Murali: I'm not going to expect everyone to learn how to debug the innards of DNS or TLS or whatever. But maybe I can provide good reference material and a bunch of self diagnostic tools where the diagnostic tools will say, you've probably fallen into one of these three cases, check these things. And if you fall off the end of that, then you need to go and contact a specialist. And just try and reduce the amount of people like me that the business as a whole needs to rely upon on a day to day basis.
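A self-diagnostic tool of the kind described here ("you've probably fallen into one of these three cases, and if you fall off the end, contact a specialist") could look roughly like the sketch below. The three cases, the observation keys and the advice strings are assumptions chosen for illustration. A real tool would gather the observations itself instead of taking them as a dictionary.

```python
from typing import Callable, Dict, List, Tuple

# (case name, predicate that is True when this case matches, advice to show)
Check = Tuple[str, Callable[[Dict[str, bool]], bool], str]

CHECKS: List[Check] = [
    ("DNS",
     lambda obs: not obs["name_resolves"],
     "The hostname does not resolve. Check the record and your resolver settings."),
    ("network",
     lambda obs: not obs["tcp_connects"],
     "The name resolves but the port is unreachable. Check firewalls and routing."),
    ("TLS",
     lambda obs: not obs["tls_handshake_ok"],
     "The connection opens but the handshake fails. Check certificates and expiry."),
]

ESCALATE = "No known case matched. Contact a specialist."


def triage(observations: Dict[str, bool]) -> str:
    """Walk the known cases in order and return advice for the first match."""
    for name, matches, advice in CHECKS:
        if matches(observations):
            return f"{name}: {advice}"
    # Falling off the end of the list is the signal to escalate to a human.
    return ESCALATE


firewall_case = triage(
    {"name_resolves": True, "tcp_connects": False, "tls_handshake_ok": False}
)
unknown_case = triage(
    {"name_resolves": True, "tcp_connects": True, "tls_handshake_ok": True}
)
```

The ordering matters in this sketch: checks run from the lowest layer upwards, so a DNS failure is reported before a TLS failure that it would also cause.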
Anthony: Yeah. You've really made me think of something there actually, because you're bringing out a really good point around reliability engineering, right? It's a practice in and of itself and a feature, that's the way you worded it. And that's a really good way of thinking about it, right? Because, maybe my approach to it seemed as though I've always, I've never been an SRE. For me an SRE has always told me that there's a problem and then I've always got a... I, as the third level guy, would have to put two and two together. So my view of an SRE is warped compared to other people's views of what and how an SRE should be and whatnot. But what you are saying is more of a proactive ability to ensure the reliability of the services and infrastructure that's running, as opposed to just being the person that turns around and says, there's an issue, right?
Murali: I hate getting paged. I will do everything I can to avoid getting paged in the future, right? And the best way to do that is to plan ahead.
Anthony: That's really... That's a really good thought. I actually, I would really like to dig into this a lot more, but we have actually run out of time and maybe we can have another session at some point, but I really do appreciate your time today and for coming on and for providing us some of the insight, I really like the conversational manner and your ability to articulate some critical points and to pay attention to the detail around that is extremely critical, right? And it's a good skill set to have just in general and also makes for a better podcast in my mind as well.
Murali: Hopefully. Yeah. No, thank you very much for the opportunity. It's nice to chat to people about this sort of stuff. It's one of the things I find working remotely is engineering opportunities to actually talk to people is a good thing.
Anthony: Yeah. So any last minute recommendations, any books you're reading, anything that you can give to our audience as a parting favor, as a gift, a fact, if they were at a party, is there something that they can tell to people about SREs?
Murali: Oh, you've put me on the spot now. I do not have any time to read, because I have a two year old, what am I doing? About SRE, I think SRE is not... A thing that a lot of people get hung up on is "if I'm not doing all of these things that" pick your company, right? Facebook, Google, Amazon, whatever. "If I'm not doing all these things, then I'm not doing SRE." These are all just tools, right? They're tools in the toolbox and you pick the tools you need for where you are right now, right? Like solve the problems you have. Don't go and do a thing just because you feel like it's the thing you should do, right? And I think that generalises to a lot of tech advice, I'm just not qualified to comment on maybe development practices in the same way as I am on SRE stuff. Do the stuff that helps you and leave the rest.
Anthony: I do think sometimes yeah, sometimes the theory isn't always the best in practice right? It's like a religious text, you know what I mean? There're the things you focus on, but don't try and push an organization, the round peg through the square hole type of thing, focus on the immediate issues that can be resolved as opposed to the theoretical issues that may need to be resolved.
Murali: Yeah. And I think the other thing to bear in mind is a lot of the literature comes out of tech companies and big tech companies, right? That's not to say you can't use some of these tools and approaches in a small healthcare provider or an accountancy firm or whatever, right? And you don't have to use all of them. You just pick what you need. You're not going- you don't have to be Google and not everyone wants to be Google and no one needs to be Google right? Just take the stuff that you need and let go of the stuff that requires a thousand person organization and a hundred people working on your tooling, right?
Anthony: Awesome. Yeah. No, that's a great thought to end on and thank you again for your time. Thank you for listening and we'll see you again soon.
Annerieke: Thank you so much for listening. We hope you enjoyed it. If you'd like more information about StackState, you can visit stackstate.com and you can also find a written transcript of this episode on our website. So if you prefer to read through what they've said, definitely head over there and also make sure to subscribe if you'd like to receive a notification whenever we launch a new episode. So, until next time...
Visit the Snowflake company website
Read the Site Reliability Workbook by Google