AWS Observability: Best Practices for SaaS Solutions on AWS | E10

In this episode, we’re talking to Russell Foster. Russell is a SaaS DevOps engineer at StackState and he is responsible for making sure our SaaS product runs smoothly on AWS. Russell has a lot of experience in monitoring and observability: he's worked at both startups and more mature companies. His responsibilities have ranged from keeping things up and running in cloud environments to making sure hybrid and on-prem environments remain stable and reliable.

As you can imagine, Russell is the perfect person to talk to about some of the observability challenges that pop up when you’re taking on-premise software products to the cloud, something StackState did. Russell answers questions such as, Why did StackState choose AWS to run our SaaS solution on? What does observability for SaaS solutions on AWS look like? How can you make sure you’re scaling your SaaS product in the most effective and cost-efficient way? So, without further ado, let’s get into topics like these. Enjoy the podcast!

Anthony (00:09): Hey, my name is Anthony. I'm back again with the StackPod, and welcome back everybody. Thank you for listening. Today, I'm going to be talking with Russell Foster. He is a StackState employee, primarily responsible for our Software as a Service initiative, as well as keeping our AWS bill under wraps, and making sure we utilize the services most efficiently, and effectively and securely.

Anthony (00:48): Russell, you want to introduce yourself and kind of give us a bit of background on who you are and what you do?

Russell (00:52): Yeah, sure. Thanks Anthony. I've now been with StackState for about a year. Historically, I've worked for startups, very much in the cloud space. I've done my own SaaS company in the past. I've worked for large enterprises. I've worked for small startups.

Russell (01:11): I'm what some people would consider a bit of a dinosaur, but I think what I lack in youth, I make up for in experience. Got myself a master's in business from The Open University.

Russell (01:25): I've been playing with AWS for many years. And I basically joined StackState because it offered an exciting opportunity to take our on-prem service to the cloud, for which we went with AWS, being the de facto standard. So I asked for a job and I'm here. There you go. And that's why I'm here now.

Anthony (01:49): We needed you. And we still need you.

Russell (01:52): I know, I'm just like the savior.

Russell (03:03): I mean, obviously, we're going down this route because StackState historically has been an on-prem company, which, when you're monitoring your own stuff, is great. But given that people have been moving to the cloud for the last five, 10 years, monitoring the cloud actually becomes a much more difficult scenario to deal with, because you've got so many moving parts.

Russell (03:31): You've got so much dependency, but you've also got a much more collaborative approach. Historically, everyone would be in the same building. Now, with remote working, and especially in these interesting times that we live in, you've got multiple teams in multiple countries doing multiple deliverables onto the same platform.

Russell (03:49): And one person makes a change and something breaks. What the hell happened? And it's worth knowing that whilst fundamentally, cloud is someone else's server, you need to know what's running on your part of that server.

Anthony (04:06): That makes sense. Also, I found with the Software as a Service delivery model, you can do things like patching, you can upgrade, you can just manage the life cycle of a customer, and the features, and the issues that they run into, far more effectively than in the old days, when it would be literally mailing a USB stick with a software update to somebody to deploy it on-prem. We've managed to overcome that with Helm charts for some of our current on-prem customers, but there's only so much you can do if you don't have the expertise of running a product.

Russell (04:48): I mean, you say that, I remember back a few companies ago, where I worked, I think it was a Friday afternoon. I have a rule of Read-Only Fridays, but this company didn't, and it was literally burn something to a DVD, jump on a train, jump in a taxi and get into Paddington Central, London for 4:55 to get a five o'clock deadline for installing this patch, because mission critical and SLAs.

Russell (05:19): And you just don't have to deal with that with SaaS software. Now, there is the other side of this: in the traditional model, you can schedule downtime a lot easier, in terms of, "We're taking portals offline." SaaS customers expect, and rightly so, their application to be up. So you get a much more nuanced approach to upgrades.

Russell (05:44): Before you'd go, "Oh, we're having a maintenance window." When you have some observability platform like StackState, you don't want to be going, "Oh, we're turning off for a few hours." So there's an entire thing around managing those upgrades for SaaS, for customer-facing software.

Russell (06:04): So historically, you would do a big upgrade from version A to B, whereas with SaaS, you do multiple deployments per day whilst keeping the customer service up, but also, everyone's on the same version.

Russell (06:20): It makes for a very different delivery model, but ultimately, I think the news is when we, as an organization, because part of this is also that going backwards is a lot easier, if need be.

Anthony (06:34): Yeah. One of the previous companies I worked for, the CEO did say something very interesting to prospects and customers. He would basically say, "Okay, if you're a bank, your business is finance and making money. So everything that you do in terms of your real estate and everything that you manage, needs to be around managing customer accounts, performing stock transactions, whatever, you should not be investing individual's time and effort to build solutions that effectively monitor that infrastructure."

Anthony (07:14): Because that doesn't make you any money. Okay, you can argue that the availability of the software and whatever, and it needs to be running, but that's not your business. So having a SaaS subscription, allows you to easily offload all the work to the experts, even the hosting of the software, the maintenance of the software and whatever.

Anthony (07:35): And it becomes a fixed cost at the end of the day. It doesn't matter if you cause an outage on that software, it's not on you. You paid your money. You've got your subscription. Just because you threw a ton of data at it, and it broke, that's not your fault. You can get service credits instead, and it becomes a far more lucrative proposition.

Anthony (07:55): Because I mean, you know me, I worked at ServiceNow for years, and back in 2009, 2010, Software as a Service was like brand new. There was only Salesforce and ServiceNow, and it went like hotcakes because people didn't want to spend money on their ticketing systems. They didn't want to keep having to spend money on upgrading Remedy, and the different components, and managing that stack.

Anthony (08:22): So it was like really, really a great proposition to have a highly customizable platform that doesn't sit within your infrastructure, and all you have to do is consume it, and that's it. Everything else is taken care of. I don't think people truly appreciate that proposition, because there's just so much out there now.

Russell (08:43): Yeah. It's very rare that a business has a case for writing a very bespoke solution. Some industries, absolutely. Some have requirements that demand it, but fundamentally, a whole bunch of issues can be fixed with the same software.

Russell (09:05): I mean, I don't know about you, but I personally subscribe to Spotify. I subscribe to Google. I subscribe to Netflix. These are all SaaS software offerings, where I'm not worried about how to produce TV shows, I just consume them, and the software is very much the same. A bank doesn't need to know how to make hamburgers. You outsource that to a patron.

Russell (09:34): I think it's very much the same with looking at the infrastructure for SaaS offerings. So obviously, we are our own customer. We have AWS, we run our SaaS platform. We use StackState to monitor our AWS infrastructure. And ours is not a particularly unique case. Fundamentally, our SaaS offering is a bunch of EC2 instances. It's a bunch of EKS, it's a bunch of load balancers and it's a bunch of S3 buckets.

Russell (10:10): We have a few Lambda functions. There's nothing intrinsically unique about what we do underneath; our uniqueness is in the software that we've written. But we still need to make sure our Lambda functions are firing, our S3 buckets aren't filling up, and we need to understand how stuff relates to each other. We're running all this in meetings as well.

Anthony (10:35): I was going to ask about that actually, because I know recently we moved away from kOps for a lot of our hosting stuff. So what exactly do we use these days to host StackState?

Russell (10:48): I'll correct you there. We moved away from EKS.

Anthony (10:52): Okay, we moved from EKS, okay.

Russell (10:55): And this is where I come back to one of the things you said: you need a highly customizable offering. For what we were doing, and some of our requirements, EKS didn't quite fit the bill. We needed a little bit more access for some of our monitoring stuff, like eBPF, than EKS offered.

Russell (11:17): We also hit some limitations in the platform. So for example, EKS is not always bang up to date with the latest version of Kubernetes. We needed the latest version of Kubernetes for testing.

Russell (11:32): So we made actually the choice of rather than consuming a managed solution, we rolled our own. Now, we know that there's a price to pay for that in terms of complexity, in terms of what we're responsible for. But for us, the benefits have outweighed the risks in that particular scenario.

Russell (11:53): But on the flip side, we still use S3. There are open source solutions like MinIO for S3, where we could just consume that. But for S3, Amazon offers a 99.99999% availability. And there is no way we will be able to engineer that level when we are not a storage company.

Russell (12:21): We're an observability company. So why should we reinvent the wheel? But for the SaaS offering, we were like, "We need something a bit more," but we weren't doing something super, super bespoke. You can go out there and you can get qualifications on Kubernetes. It is de rigueur for containerized services.

Russell (12:44): And we built on top of that, in terms of the availability, in terms of the scaling, to deliver our SaaS platform. But it also comes back down to: we need to monitor our own SaaS platform, which we use StackState for, to a significant degree. Not for everything, because StackState is a focused product in what it does, and there are some things it isn't.

Russell (13:10): It isn't something that is going to be a monitoring alerting platform. It has alerting in, it has monitoring in, but there are products out there like OpsGenie that we actually use for flagging that something needs response. We don't want to be managing that part of the production life cycle when, for little to no cost, we can go to a subject matter expert or a company that specializes in that.

Anthony (13:37): That makes sense. I had a ton of issues with EKS when I was installing just the StackState agent. By default, even if you have privileged access to a pod, that doesn't mean that it can communicate out to the internet. So I was getting the initial topology, and some telemetry data from the nodes, because that's all part of EC2, and all that level of stuff.

Anthony (14:06): But I was getting absolutely no updated topology information because my pod, which was running the privileged agent, if you will, the cluster agent, just couldn't communicate to the internet. So I had to go through, I had to add security groups. I had to mess around with the VPC. I had to add subnets, just to kind of route this thing out.

Anthony (14:26): And all of that is really just imposed by EKS. It's not a Kubernetes issue. That's an EKS policy issue, which is an AWS thing, which has been added on top of Kubernetes. So I definitely empathize with that need to kind of run into more of a Kubernetes native approach. Let's call it that.

Russell (14:49): Yeah, I think that's fair. You've got to understand where EKS is positioned in the market. It's a case of, it's good for most people starting up, but like anything they have... it's a classic 80/20 rule; 80% of the users are going to consume 20% of the features.

Russell (15:12): EKS fits that bill perfectly for that 80%, fantastic. Don't knock it for what it is, but we're in that other 20%, where what we need isn't there. So we have to go make that cost-value judgment in terms of what is best for us.

Anthony (15:32): Yeah, that makes sense. That makes a ton of sense. I want to go back a second because you mentioned that you got your master's from The Open University. So for those people who aren't aware, with The Open University, it's in the name.

Anthony (15:48): It was one of the first remote learning universities out of the UK. It's often an option that people take when they have to have a day job, and then they need to work on the side in order to pursue their personal and educational efforts. Did you have a similar kind of challenge? Did you have to kind of work and then you kind of did it part time? Did you leave work and focus on it? Tell me a little bit more about that. I think that's interesting.

Russell (16:14): For me, it was a matter of, life at the time didn't really allow me to go to university. And it got to a point where I was about five, 10 years into my career. And I spoke to my CFO at the time, a guy called Adam Kiglour, and it was basically a case of, "What do you want to do?" And I said, "Well, I want to go into management," and other such things.

Russell (16:43): It's actually now transpired I don't particularly enjoy it, but it was a case of, you need something more than what you are. Because one, I was called a one-trick pony. And I'll tell you, that stung. Because I was really good at what I did, but it was the same thing over and over again. And it was a case of, at the time, with a young kid, going back to university full-time just wasn't economically feasible.

Russell (17:12): So it was a case of, "What are my options in terms of getting a recognized qualification, whilst balancing being a dad, living a life," and such like. So The Open University offers a really good balance of cost, flexible learning, and remote learning, at a time when distance learning was... the internet was starting to become a thing people used.

Russell (17:41): The first course I did, I got a whole bunch of books delivered to the house. For further ones, it was all online. It was also, say, cost-effective; you've got to think now that someone going to university in the UK, you're looking at £9,000 a year. So you do a four-year course, you're £36,000+ in debt.

Russell (18:11): The Open University offers something where you can pay per month, and you can also access grants for it, if you're in the UK. But also convenient. I've moved to the Netherlands now. So if I want to do continuing studies, I can still pick up with The Open University to carry on a new course, get a second degree.

Russell (18:33): But like all things, it might not be the best fit for all people. I personally, I'm quite happy to be on my own, sat down, and focus, studying, but you've also got to balance that with life, because realistically it's a part-time job. If you're already fully busy, something's got to give.

Anthony (18:52): Yeah, I know, but it's good that you kind of took that step. I mean, when I was in my early 20s, and I was in tech, but I went through the support kind of path. So I was an application support developer type thing. But I was kind of like a full stack support person. So one minute I could be... at ServiceNow, my job was just basically to keep the lights on.

Anthony (19:19): I was in a basement in London, as they were expanding into the UK. My job, one minute, would be tailing Java logs within Tomcat to figure out why we're running out of memory on a node. And then the next minute, I'd be figuring out, "Okay, why is my MySQL 5.2 no longer taking more than eight threads in parallel?" Adding indexes and all that kind of fun stuff.

Anthony (19:44): I thought the only way out of it was kind of just like, "Okay, you become a manager." And that's often a sense that people have. It's like the next step is to go from being the doer, to the person who manages the doers, and that's your career path. What life has kind of taught me, inadvertently, is that that's not the right path.

Anthony (20:11): And actually, that's really the wrong way of looking at hiring and promoting people, because most people are not meant to be managers. Quite frankly, most people are not.

Russell (20:23): I'll actually come back to this. I'm very aware I'm an okay manager. I can do it. I will listen to people. And I have a strong belief that people don't quit jobs, they quit managers. And I've always viewed myself as a manager, people don't quit, but I would also tell you that I've had managers that I found inspirational.

Russell (20:43): I have found I want to aspire to. My personality is not that. What I love is being in front of a bunch of code digging through stuff. Now, I think we'll come back to what you said, that my career path was tech, manager, and I'm back to tech.

Russell (21:01): Now, obviously coupled with my background, I've got a bit more business acumen. I can sit down and go, "Oh, look at this EBITDA number and valuations and M&A," and all this other cool stuff. But fundamentally, I've learned enough skills where I can speak to the CFO, I can speak to CEOs. I can speak to front line techs.

Russell (21:19): I can speak in their language, but I also know what I'm good at, and what I'm not good at. I mean, one of my favorite sayings is, "If you're the smartest/best person in the room, you're in the wrong room." So for me, I think stuff's gone full circle. I know that I'm a good tech and that my future career is maybe a team lead, maybe a CTO, but fundamentally, I want to keep myself hands-on because I've been hands-off for a bit and... it pays the bills and keeps the lights on, but I don't enjoy it. I enjoy coming to work now.

Anthony (22:04): Yeah. I found that if you enjoy what you're doing as well, you're more likely to just be happy. And if you're unhappy with what you're doing, thinking about managing the people, doing the same thing, is not going to be the best thing for you, because guess what? You now have to manage X amount of people, all doing something that you don't like.

Russell (22:30): I think that's a really fair way of looking at it. It's like I don't like being punched, so I'm going to punch other people so they can punch me back.

Anthony (22:36): Yeah, exactly.

Russell (22:37): It's still being punched.

Anthony (22:38): It's ridiculous. But yeah, no, thanks for going into that. I know this is primarily a tech focus thing, but I think it's interesting to understand people's backstories, understand the lessons learned and give people a platform to just share their story and whatever and just get to know people and get to know our culture in terms of how we communicate, the type of people that work here and all that kind of fun stuff.

Anthony (23:07): But let's jump back into AWS and what it actually means to manage a Software as a Service environment. We talked very high level around some of the things that we do, the fact that we moved away from EKS, but let's talk about the logistics of managing a Software as a Service platform in AWS, as well as the developer pipeline.

Anthony (23:35): You having to not be a bottleneck, but still having that platform availability as your primary target or KPI, if you will, for whatever you're bonused on or whatever.

Russell (23:45): And I think it's worth just picking up a point there: you said, developer pipeline. It's easy to sit here and go, “Well, a SaaS service is what people log into and consume.” It's also worth looking at it as an iceberg model. You've got the SaaS at the top, but you've then got our software and how we look after and manage it, and then we've got our internal tooling.

Russell (24:11): It's not just one account on AWS. We've got segmentation. So we've got production stuff, we've got pre-prod, we've got testing, we've got our playgrounds, we've got a whole bunch of different things, all of which have different monitoring/observability requirements.

Russell (24:33): So our production stuff, that's the stuff that generates our cash. That's the stuff that has to be secured. So there's only a few people who can access that. There's only a few people that can set up the customers. Because you don't want everyone in the company to be able to log in and spin up freebies for their friends and such like. But also conversely, that could potentially become a bottleneck, if there's only a few people who can do it.

Russell (24:57): So you don't want your developers having to wait on someone to be able to do that testing. So you want to be able to just let developers not run free, but with minimal hindrance. But what I care about is sometimes the money at the AWS bill at the end of the month.

Russell (25:16): Developers don't think about that, which is fine because it's not their job. But it's got to be a position of where we can see what's happening, we understand what's going on in our infrastructure, but also making sure that we're secure, but not where the security is getting in the way of everyone else.

Russell (25:35): And as we are becoming a more mature organization, you get to that point where there's some stuff that people would come to someone directly and go, “Hey, can you just do this thing?” which is great until you're on holiday and so then you spread out to the team. And there's some stuff we do that we do once a month.

Russell (25:56): It's not worth automating or such like, but as you get bigger that once a month becomes once a week, becomes once a day, and then you start having to automate that. And this is where stuff like StackState comes in and other monitoring tools to go, “Okay, actually, what's the information you need to know?” Because what you've got to remember is we've got a relatively complicated application.

Russell (26:21): You then spread that across multiple customers, across multiple locations, across multiple clusters and you've got all this magnitude of information coming out, and what you need to do is be able to consolidate that and actually get a good view of what's going on, which is where something like StackState comes in.

Russell (26:39): So StackState might be pulling in tens of thousands of bits of information, but what you fundamentally want to be able to ask is, what infrastructure's green, what's red and why is it red? And obviously one of the great things about StackState is the 4T model. So it can be green, someone does a change, something goes red, you time travel back, and you go, “Ah, there's this thing here.”

Anthony (27:04): Sorry to interrupt you there but yeah we were just talking to a customer or a prospect I should say yesterday, and they're a Fintech company out of the UK heavily on AWS and they, for some reason, routing stopped working within one of the VPCs for their environment. They don't know what happened, they don't know why it happened.

Anthony (27:29): They opened a ticket with AWS. They don't know why it happened, and so they were literally telling us, “Man, it would've been nice if we had StackState, because we'd probably see what happened.”

Russell (27:43): Yeah. I mean, this is our USP, our core feature, our X factor: within observability, we can tell what's happened. Because there's plenty of stuff out there that goes, “It's green, it's red, it's green, it's red. Oh, it's a little bit of greeny-red.” But StackState is unique in going, “This is why.” We use StackState in conjunction with other tools to monitor our AWS environment. And one customer is about 50 moving parts on its own.

Russell (28:25): And that's not even digging down into some of the internals of our own stuff. This is just stuff running on the surface. You then add the next level. You've got to remember cloud is abstractions. It's a physical server with virtual machines running highly available Kubernetes, running their applications, which has data replication, but then has networking and monitoring.

Russell (28:48): And that's a single client and you've got hundreds of moving parts, which for one person is impossible. I mean, I don't know our application inside out. I'm sure you don't know our application inside out, but if we monitor ourselves, we at least know it's this part and we can go speak to the SME. But then you get these hundreds of moving parts and it's over tens of thousands of customers.

Russell (29:14): So you think of the large SaaS installation of something like Spotify with 14, 15 million consumers in the UK alone. How much data do they have coming in? Too much for one person to digest. And you look at like our AAD, autonomous, our AI special sauce that tells you what's wrong because I cannot say the damn thing.

Anthony (29:40): Autonomous Anomaly Detector.

Russell (29:41): That thing, yeah. And it will tell you when something's about to go wrong, which is great because if I'm being told something's about to go wrong, I can go fix it rather than having to go and work out, “We're running out of disk space,” or something else, like the classic running out of memory or CPU.
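
The "about to go wrong" warning Russell describes for disk space can be illustrated with a very small forecast. This is only a sketch under stated assumptions, not how StackState's Autonomous Anomaly Detector works: the function name `hours_until_full`, the hourly sampling, and the straight-line growth model are all hypothetical choices made for the example.

```python
from typing import Optional, Sequence


def hours_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate the hours left before a disk fills up.

    Assumes `used_gb` holds evenly spaced hourly samples (an assumption of
    this sketch). A least-squares line gives the growth rate, and the
    remaining space is divided by that rate, starting from the latest sample.
    Returns None when usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb) / n
    # Least-squares slope: growth in GB per hour.
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    if slope <= 0:
        return None
    remaining_gb = capacity_gb - used_gb[-1]
    return max(0.0, remaining_gb / slope)


if __name__ == "__main__":
    # Hypothetical samples: 10 GB of growth per hour on a 100 GB volume.
    print(hours_until_full([10, 20, 30, 40], capacity_gb=100))
```

An alert could fire when the estimate drops below some lead time the on-call team needs, which is the difference between fixing a problem and discovering it.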

Anthony (30:05): One of the use cases actually is, what if something is unusually low or unusually high, like with streaming and stuff like that? It could be an indication that there's an outage somewhere on the internet. Maybe a backbone goes down on the East Coast of the States, as often happens these days, apparently.

Anthony (30:28): All of a sudden the East Coast can't access whatever streaming it is, Netflix, Amazon Prime, whatever. Being able to detect that anomaly allows them to be proactive, because customers are not going to turn around and be like, “Oh, the backbone of the East Coast is down.”

Anthony (30:43): They're going to go on Twitter and they're going to say, “Amazon Prime is down,” or, “Netflix is down. They suck. I'm going to cancel my subscription,” because people, they don't put two and two together in the fact that, “Hey, there's a ton of stuff in between even getting to Netflix in the first place.” And then there's another ton of stuff there to get the stuff back to your TV, and anomaly detection is a way of doing that. Sorry, I didn't mean to interrupt.

Russell (31:10): No, it's fine. I mean, this is the interesting thing about it. So recently there was a large social media website called Facebook that had an outage. Now, I will tell you now, I was on my phone at the time trying to send a WhatsApp. Okay? And what happened was the WhatsApp went to just a little constant message.

Russell (31:29): Now, I am so used to this service being available, my default reaction was to restart my phone. At no point did I consider it's an infrastructure issue, because we've become so used to it being… The internet becomes like water. You expect it to be there. And Facebook doesn't go down. And your thoughts are, “Oh, the internet's broken,” rather than, “This service I consume is broken.”

Russell (32:03): Because fundamentally Facebook is a SaaS service. It just happens that the price you pay is your privacy. Yeah. And stuff like observability software actually will help you go, “Oh, right. It's not me. It's them.” Because yeah, I mean, you don't expect large websites to go offline.

Anthony (32:28): Yeah. Your instinct is not going to be, “Oh, Facebook must be down.” I actually consumed Salesforce quite a few times and it is infamous for going down. So, because it's infamous for going down or for just not being available for whatever reason you don't assume it's the internet or your VPN or whatever, because you're used to it.

Anthony (32:52): But if it's something like Facebook… I did think it was ironic because in The Social Network, the film that was about Facebook, one of the things he says in there is that the reason why people use Facebook is because we don't go down, and the minute we go down, people aren't going to use Facebook anymore.

Anthony (33:11): And so I was kind of thinking, “Oh, I wonder if people are thinking about that,” but nobody seemed to reference that in the news or anything, but yeah, no, it was, it was interesting. But to your point, with WhatsApp you're kind of like, “Oh, it must be my phone network,” because we're more used to switching airplane mode on and off again to fix those issues.

Russell (33:33): My thought was, “Mobile data is turned off. It's all my problem, not someone else's.” And you know, this comes back to what you were saying: traditionally, if you're doing monitoring, you're monitoring for limits, for usage being high. Looking for usage being low, or usage being abnormal, is something that's fairly new in terms of understanding what's going on.

Russell (33:59): And StackState is great because you can at least go, “Historically, it's here. It's almost down here. There's something up.”
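
The contrast drawn here, between a fixed "usage is high" limit and a check against the historical baseline in both directions, can be shown in a few lines. This is a simplified sketch using a plain z-score, not a description of StackState's own detection. The function name `classify_usage`, the sample numbers, and the threshold of three standard deviations are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def classify_usage(history: Sequence[float], current: float, threshold: float = 3.0) -> str:
    """Label `current` as "low", "high", or "normal" relative to `history`.

    Unlike a fixed upper limit, this flags values far from the historical
    mean in either direction, so a sudden drop in traffic is caught too.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        if current == baseline:
            return "normal"
        return "high" if current > baseline else "low"
    z_score = (current - baseline) / spread
    if z_score <= -threshold:
        return "low"
    if z_score >= threshold:
        return "high"
    return "normal"


if __name__ == "__main__":
    # Hypothetical requests-per-minute samples for one service.
    recent = [100, 102, 98, 101, 99]
    print(classify_usage(recent, 40))   # far below the usual level
    print(classify_usage(recent, 101))  # within the usual range
```

A real detector would also need to account for daily and weekly seasonality, which a single mean and standard deviation cannot capture.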

Anthony (34:05): Yeah. And I think there's going to be more issues. I mean, you're already seeing some of the streaming folks struggle to keep up with the live sporting events. I think it was CBS in the States. I forget. I think it was last year. Yeah. This time last year, basically the Super Bowl was on and they were streaming it for the first time.

Anthony (34:29): So not only could you get it on your TV… Actually, two years in a row now I've had issues. So first of all was with Verizon two years ago. I had the 4K Verizon Fios cable box and it was this big thing where it's like: the Super Bowl is going to be in 4K.

Anthony (34:48): And guess what? I try to log in, there's an outage of the service because too many people were trying to view a 4K stream of the Super Bowl on their 4K cable box. And then last year it was done by CBS, I believe, and they just recently had released their streaming service, which is now Paramount Plus here in the States.

Anthony (35:11): And again, I went on the iPad to begin with. It was there. I then went to my TV to just stream it from the TV in the app there and it worked. I turned off the iPad, two minutes later, the TV stopped streaming. It was just buffering, continuously over and over and over again.

Anthony (35:29): I went back to the iPad. I was still able to stream from the iPad, but obviously the software stack on the TV service was overloaded with people streaming from their TVs. You could just see this behavior happen, and then all of a sudden my iPad started struggling, because obviously everybody's going to the iPad.

Anthony (35:49): And these are really big issues. How do you prepare? How do you know whether something's abnormal? How do you take into consideration these environmental events, like the Super Bowl, where you're only going to have that level of activity once per year? How do you prepare for that?

Russell (36:11): The answer is it's difficult. One of the first things is tons of capacity planning. So in the UK, there's a popular program called Coronation Street. At one point, it was being watched by a third of the population, 20 million people. And energy in the UK is supplied by a company called the National Grid.

Russell (36:31): And they knew that about 07:15 every night, everyone during the commercial break, they go make a cup of tea. You can't get any more British than this. But they know that come 07:15, there's going to be a power spike. So they make sure power stations are up and running.

Russell (36:51): They come in, everyone makes their cup of tea, power usage spikes, everyone goes to watch the next half of it, power usage goes down. Now that's a really regular thing, but what do you do when a power station goes offline or there's a storm? They have contingencies because they are a utility. They will buy power in from other countries, they're buying from France, they can spin up backup generators. There's a whole bunch of stuff they can do.

Russell (37:22): But also, there is a finite top level. When you're talking about capacity planning for SaaS solutions, no, we're lucky our customer base is steady. Little spikes and jumps here and there, but when you're talking about things like sporting events, they know they're going to get a spike and I know they will have guessed, “We need this much, but it's this much.”

Russell (37:49): But this is where you get challenged, because if we actually can't have this much, and it's only this much, well, you're just going to spend a bunch of money you didn't need to spend. You can have the argument that your experience is one where you're not going to go back to CBS next year.

Russell (38:06): So the opportunity cost might have outweighed the actual real cost, but this is the problem. I mean, obviously all the platforms out there, like Kubernetes, are designed to scale, but there is still an inherent limit to how quickly they can scale.

Russell (38:25): And something with the Super Bowl is going to be a very high spike very quickly, rather than a gradual increase. A breaking news story, you'll be able to scale with it. You have a big instant there, you're going to have problems. And this is a fundamental thing that I've not seen anyone solve reliably.
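
To illustrate the point about scaling speed, here is a minimal sketch, not anything StackState or the broadcasters actually run. It assumes a hypothetical autoscaler that can add at most `max_step` units of capacity per tick, and compares a gradual ramp with a sudden spike that reaches the same peak. The function name, the numbers, and the one-tick granularity are all made up for illustration, and scale-down is not modeled.

```python
def simulate(demand, start_capacity, max_step):
    """Track demand with a hypothetical autoscaler limited to max_step extra units per tick.

    Returns two lists: capacity per tick and unmet demand per tick.
    """
    capacity = start_capacity
    capacities, unmet = [], []
    for d in demand:
        # Scale up toward demand, but never by more than max_step in one tick.
        if d > capacity:
            capacity = min(d, capacity + max_step)
        capacities.append(capacity)
        # Whatever demand exceeds current capacity goes unserved this tick.
        unmet.append(max(0, d - capacity))
    return capacities, unmet


# Same peak (400 units), different shapes. Illustrative numbers only.
gradual = [100, 150, 200, 250, 300, 350, 400]   # breaking-news style ramp
spike = [100, 100, 100, 400, 400, 400, 400]     # kickoff-style jump

_, gradual_unmet = simulate(gradual, start_capacity=100, max_step=50)
_, spike_unmet = simulate(spike, start_capacity=100, max_step=50)

print("gradual unmet per tick:", gradual_unmet)
print("spike unmet per tick:  ", spike_unmet)
```

Under these assumptions the ramp is served in full, while the spike leaves demand unserved for several ticks until capacity catches up. The only way around that in this toy model is to raise `start_capacity` ahead of time, which is the pre-provisioning cost Russell describes.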

Russell (38:46): Because if you have something that happens every week, you can plan. You have something that happens every year, you can't plan. You can make an educated guess, but…

Russell (38:58): And I think it's also probably unfair to pick on them a little bit because the ones that have scaled, you won't know. You only know when something's gone wrong.

Anthony (39:13): Yeah. But it's this discrepancy between the business that will more than anticipate the numbers, they've made a bid to host the Super Bowl on their network, they spend billions of dollars, they're selling the ad space so they know the numbers of people that will be watching the Super Bowl and they're more than happy to accept the cheques from the advertisers and whatnot to promote their advertisements during their Super Bowl.

Anthony (39:48): But then it's the lack of ability to take that data or that business momentum, and then translate that into technology capacity and engaging with the technology component of the business and be like, “Listen, have we run a load test to show, if I take a live video of a Panda in a zoo and we put it out on our service and we simulate, I don't know however many people, 50 million or a hundred million people watching that, streaming, consuming data in real time, have we done that?”

Anthony (40:28): The answer would probably have been no, but we've got scalability and we've got whatever is contingency and that's a very asinine way of looking at it. Do a load test-

Russell (40:43): I think this comes back to the point I made previously about, is that what they should actually be focusing on?

Anthony (40:45): Yeah.

Russell (40:47): You go look at YouTube, the de facto website for video streaming. I've never had an issue streaming data off YouTube when it wasn't my problem. I've got a dodgy connection. But then again, you don't want to be necessarily partnering with YouTube if you're another large multi-national.

Russell (41:08): So is it that they invested in a platform? Did they outsource it? I'm sure we've all dealt with vendors that have promised the world, but when it actually comes to delivery, it's been a little bit crappy, but it's very easy for us to sit here and go, “Oh, well, you should've done better.”

Russell (41:32): I'll be honest, I've been on the other end and hit my own scaling issues, more from a personal view of you've got an outage, you got a bunch of support tickets, you only got a couple of people on staff who can reply to it and you're swamped. I mean, it's very easy to sit on the outside, looking in, I think is a fair point to say, but also you'd like to think next year, they won't make the same mistakes. As it says, “Once is forgivable, two is a happenstance, three: there's something wrong.”

Anthony (42:07): Or as a famous president once said, “Fool me once, shame on you. Fool me twice, well, I don't get fooled again.”

Russell (42:16): Is this the same one who had a shoe thrown at him?

Anthony (42:18): Yeah, I think so. We've actually run out of time. I would love to continue this conversation and obviously there's more stuff that we can talk to. We didn't even get into things like compliance and SOC 2 and that kind of fun stuff, but ultimately we could always bring you back if we want to go further.

Russell (42:44): Yeah. You know where I work.

Anthony (42:45): I really appreciate you taking the time to jump on this. I hope you had a decent time.

Russell (42:51): Yeah, it's been good.

Anthony (43:32): Awesome man. Well, I'll tell you, enjoy the rest of your day and I wish you the best of luck. Anything you want to share with the listeners? Any kind of thoughts or tidbits that you want to end with?

Russell (43:46): Thoughts or tidbits. Oh, you are putting me on the spot. So don't eat Stinging Nettles is always a good tip, but I would say actually, if I was to give you… My core mantras for life are always have a backup plan and look after yourself because no one else is going to.

Anthony (44:07): Yeah. My wife would disagree that she looks after me, but outside of that, yeah, no, that's a good note to end on and thanks again.

Russell (44:20): Okay, no, thanks. Been great, Anthony.

Annerieke: Thank you so much for listening. We hope you enjoyed it. If you'd like more information about StackState, you can visit stackstate.com and you can also find a written transcript of this episode on our website. So if you prefer to read through what they've said, definitely head over there and also make sure to subscribe if you'd like to receive a notification whenever we launch a new episode. So, until next time...

Subscribe to the StackPod on Spotify or Apple Podcasts.

Useful links:

Find Russell on LinkedIn
Visit the playground to see what StackState’s SaaS solution ‘Cloud Observability for Cloud Native Environments’ looks like

