Description
When trying to iterate quickly on our code, reliability tends to be overlooked and given lower priority than getting the latest features out, until a large incident comes and knocks our applications out of service.
This talk will give a quick introduction to the Site Reliability principles, and look into how they can be applied to cloud applications, regardless of the size of the organization.
All right, so let's get started. I'm Marga. As Lexi said, I'm a staff software engineer at Kinvolk. I work on the Flatcar Container Linux product, which is one of our open source products. Before working at Kinvolk, I was a site reliability engineer at Google for almost eight years, so reliability is a very important theme for me and I've learned to make it part of my life. And I moved from working at a big tech company, where there were whole teams of engineers working on reliability, to working at a small company.
Everyone is wearing many different hats: we have very few people and we need to get things done with a small team. So some days I'm a software engineer and I'm developing software, other days I'm a test engineer and I'm writing tests, and other days I'm a reliability engineer and I'm focusing on reliability. This talk is for people that are in that position.
So it's not for Google employees who have huge teams, but rather for people that are trying to balance a lot of different balls at the same time and still care about reliability.
So let's start by explaining what reliability is, because this talk is about reliability and we need to be in agreement on what it means. If you look this up on the web, you will see that it's the quality of being trustworthy or of performing consistently well. Okay, so that's what the dictionary says, but what does it mean when we move this into the context of cloud applications?
We will care about some specific properties of our application, like availability, which is whether our users can reach the site or not. If we have a super cool website or app and half of the time our users can't reach it, they will move on to a different application, one that is actually reachable, even if it's less fancy or less cool. And similarly with the other measurements: we want our site to be fast.
If we have a great application that does everything perfectly but it's super slow, users will go away. Or if it's super fast but it gives you the wrong answer, it's not useful. And reliability is also about data safety, so we also care about doing the right things with our users' data, having backups, etc. All of that is included. These are the generic themes, and depending on what your application or service does,
you will maybe care about other things as well, not just about these four, and that's all right, because there's a wide range of things that you may care about; these are just some basic examples. But there's a theme in all of this, which is that if our application is not reliable, users will go away. This is why reliability matters: if an application is not reliable, users move on to a different one.
So maybe it's not so much fun to build reliability into our applications, but it's something that lets us keep our users, and an application without users is not relevant, so we need to make it reliable. All of this seems pretty obvious, so why are we even talking about it? Isn't it super clear that everything needs to be reliable?
The problem is that there's a conflict of incentives. When I'm writing software, when I'm developing features, I get this rush of developing something and then seeing it in action, and it works, and it's great. If you're a programmer, you've experienced this: you have an idea, you write the code, and then it works, and it's so much fun. And of course, when you're a software developer, you want to do this all the time.
You want to create new features, launch a new version, make your application do some new cool thing, and do this as fast as possible. Maybe you have a weekly sprint or a two-week sprint, and you want to keep releasing new features. This is all nice, but the other side of the coin is the person who's maintaining the software and keeping it running, who can be you, right?
It can be you at a different time of the day. And when you're trying to keep the system running, you don't want to have any outages, you don't want any headaches, you don't want to do any firefighting, and the more changes you introduce, the more chances there are of an outage. So here's where the conflict of incentives comes in: if you want to release features as fast as possible, you will cause outages.
So we need to balance releasing features with keeping the service reliable. How do we do that? The first step is going to be measuring how our application is doing, because if we don't have data, if we don't know how our application is behaving, whether well or badly, then we can't do anything. So how do we get the data? We get it through monitoring.
In the cloud context, if I say monitoring you probably hear Prometheus, and that's fine, that's a tool and it's a valid tool. But in this talk I'm not talking about Prometheus in particular, but rather about the concept of monitoring, and the tool that you use depends on your infrastructure and your other needs. You could just as well use Google Analytics for getting the information that you need. As long as you get the information that you need, it's fine; there's no single best tool that everyone has to use.
The important thing is the metrics. The important thing is getting the metrics that matter for your service or your application, and you really want them to be the ones that matter for your application. Whatever system you're using might come with a set of default metrics, and those default metrics might or might not serve a good purpose for you. If you're maintaining a website, you probably do care about the error 500s that your website is serving; you probably do care that those are low.
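As a rough sketch of what tracking that kind of metric can look like, here is a minimal example using Python's prometheus_client package; the metric name and the record_response helper are made up for illustration, not something from the talk.

```python
# Minimal sketch: counting HTTP 500s so a monitoring system can scrape
# them. Metric name and helper are hypothetical; adapt to your service.
import time

from prometheus_client import Counter, start_http_server

HTTP_500S = Counter("http_responses_500_total",
                    "HTTP responses served with status 500")

def record_response(status: int) -> None:
    if status == 500:
        HTTP_500S.inc()

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for scraping
    while True:
        record_response(200)  # stand-in for real request handling
        time.sleep(1)
```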
You also want to measure your application from the outside, and that's what probers are for, which you can run yourself or get a third party to run for you. They are a way of measuring how your website is responding from the outside. And if your application is a global application that is reached by people from all over the world, you want to have probers that are also all over the world, not just in your country, because you need to know: does it work for someone connecting from South Africa?
Does it work for someone connecting from Brazil? If you're trying to reach a global audience, you need to check it globally.
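A minimal sketch of what such a prober could look like, assuming a hypothetical health-check URL; real setups would typically use a managed probing service or something like Prometheus's blackbox exporter, run from several regions.

```python
# Minimal external prober sketch; TARGET is a hypothetical endpoint.
import time
import urllib.request

TARGET = "https://example.com/healthz"

def probe(url: str, timeout: float = 5.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except Exception:
        ok = False
    return {"ok": ok, "latency_s": time.monotonic() - start}

if __name__ == "__main__":
    print(probe(TARGET))  # run this from multiple regions, not just one
```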
All right, so I've spoken a bit about monitoring. Let's assume now that we've deployed our monitoring infrastructure and we have a super cool dashboard with the right metrics.
Are we going to get someone to look at the dashboard all day? No, of course not. Looking at dashboards is very boring and it's not a good use of anybody's time. So what we're going to do is set up alerts.
If a problem just goes away on its own and you don't do anything, alerting on it is not useful either. We need to trigger alerts for events that require a human to intervene. Alerts can suffer from false positives and false negatives.
A false positive is when there's an alert and there's nothing to be done, like what I was just saying: either it went away on its own or it's actually not a problem, it's just a fact of life that that thing is happening. And the problem with those types of alerts is that they create alert fatigue, so people get used to ignoring them.
If every day you get an alert saying that you are getting too many error 500s, and then when you go to look at the system it just went away and you don't know why, you start ignoring the alert. And then one day there's actually a problem, but because you were ignoring this alert all the time, you just think: oh okay, it's the same error as always, it will just go away on its own. And it doesn't. So that's really bad.
That's why, when we have an alert that's triggering but is not useful, we need to fix it so that it gives us an actual signal: either disable the alert or make it less sensitive, whatever makes it useful.
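To make that concrete, here is one way an actionable alert condition could be expressed, as a sketch; the threshold and the minimum-traffic guard are illustrative values, not recommendations from the talk.

```python
# Sketch of an alert condition tuned to avoid false positives:
# only page a human when the signal is sustained and meaningful.
def should_page(errors_5xx: int, total_requests: int,
                max_error_ratio: float = 0.01,
                min_requests: int = 100) -> bool:
    if total_requests < min_requests:
        return False  # too little traffic to be a trustworthy signal
    return errors_5xx / total_requests > max_error_ratio

# e.g. 3 errors in 50 requests: no page; 300 in 10000: page a human.
```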
And then false negatives are alerts that should have triggered but didn't. They didn't trigger, we had an outage, everything broke, whatever our application was doing, users were unhappy, and we didn't catch it in time because the alerts didn't trigger. Usually we realize that we have these false negatives
after the outage, and it's important that we follow up and fix it: we add the alert that's missing, we add the metric that's missing, so that we don't have another outage for the same reason, which would be very embarrassing. All right, so we covered monitoring and alerting.
But how does this help us solve the incentive conflict that I mentioned earlier? To help us solve that conflict, we need to introduce one other concept, which is service level objectives. Service level objectives (SLOs) are metrics that help us assess how our service is behaving.
So it's important that they are metrics, so they are numbers, and we can say, for example, that our service is available 99% of the time; that's a typical SLO. Or we could focus on latency: the latency of each request, or the 99th percentile of our requests, things like that. So we want to measure how our service is behaving, we measure it with these metrics, and then we set goals for how we want our service to behave.
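As a small illustration of the percentile idea, here is a sketch of computing a p99 latency from raw samples with the nearest-rank method; in practice the monitoring system computes this for you, often from histograms.

```python
# Nearest-rank p99 over raw latency samples (illustrative only).
import math

def p99(latencies_ms: list) -> float:
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

samples = [12, 15, 11, 14, 250, 13, 16, 12, 11, 13]
print(p99(samples))  # -> 250, the slowest of these ten samples
```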
So let's use availability because it's easy, and we can say we want our service to be available 99% of the time. And so we can measure whether our service is available or not, and we can say whether we are meeting that target.
Okay, so I was talking about SLOs and metrics, and I was saying that the SLOs need to capture the users' expectations and the developers' expectations. If our users expect our site to be up, we need to fulfill that expectation, but we also need to let developers release features. So that's how they help us find this balance. And SLOs need to be achievable: there's no use aiming so high that we can't achieve it, because then they are not really providing any help.
Yes, okay, all right, so it's working. So this table helps us understand what the availability number that I was talking about means in terms of days or hours that our service can be down. If we say that we want to have an SLO of 99% availability, that means we can have 3.65 days a year that our service is down, over the course of a whole year. If we look at it over the course of a month, it's 7.2 hours, and over the course of a day it's 14.4 minutes.
Usually services don't go down per day; usually you don't go down 14 minutes every day, but sometimes you have an outage. And so that's why you pick the window that makes the most sense for you. Say you picked per month: you have 7.2 hours per month that your service can be down, and that's for an availability of 99%. 7.2 hours can be a lot or can be very little; it depends on what your service does.
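The table's numbers fall out of simple arithmetic; a quick sketch of the conversion, assuming a 30-day month:

```python
# Allowed downtime for an availability SLO over different windows.
def allowed_downtime_hours(availability: float, window_hours: float) -> float:
    return (1.0 - availability) * window_hours

for window, hours in [("year", 365 * 24), ("30-day month", 30 * 24), ("day", 24)]:
    print(f"99% over a {window}: {allowed_downtime_hours(0.99, hours):.2f} hours")
# -> 87.60 hours (3.65 days), 7.20 hours, 0.24 hours (14.4 minutes)
```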
But if you are developing a banking application, and in those 7.2 hours your users were not able to reach their bank accounts, they are probably not very happy. So that's why we need to figure out the right targets for our application. And the perfectionists like me may say: why don't we just aim for 100% reliability? Why 99%, 99.9%, 99.99%?
Is it really worth it? For some applications it might be, but for most applications it's not. So unless you are developing medical devices or aviation devices, which really should be as reliable as possible, you probably want to aim a little bit lower: something that is high but achievable. And then, how do you use this number? Here comes one of the most interesting concepts, which is error budgets.
So this is how we solve the conflict of incentives. Say we have an error budget for a month of our service being down for seven hours; we said we would aim for 99% availability, about seven hours of downtime. We can reach this target by never being down, of course, or maybe by having one hour of outage one day, another hour of outage another day, and maybe three hours on a day where the outage was really bad. That's still under seven hours, so we are inside this error budget.
But if we go over that, that's when we say we've reached our error budget: we don't have any budget left, so we can't keep releasing new features, and we need to spend time on reliability instead. And this, for me, is the key concept that helps us fix this problem of incentives: when we have so many problems that our application is no longer as reliable as we decided it should be, that's when we need to stop and say, okay, no more features, we work on reliability until everything is working again.
So what do we do? How do we fix that? Here is how we start mitigating the risks of putting out features, so that we reduce the chances that we will have an outage. We can never reduce them to zero, but we can reduce the chances so that we stay within our error budget.
The first thing is the testing infrastructure, and when I talk about testing, I'm not just thinking about unit tests or integration tests, which is what people think of first when they talk about testing, but about a bigger set of tools that are related to testing.
Of course you want good test coverage, but after that come the more interesting things. Continuous integration is something that probably you've all heard about, and it's very nice. And then pushing on green, which means releasing only the things that pass the tests, is also very nice, but a lot of people don't actually apply it, or they give themselves a pass, and then things start to turn out badly.
So if you have continuous integration and your tests always pass and are always green, everything's fine. But if you have flaky tests that sometimes pass and sometimes fail, or worse, tests that always fail, and you've taught yourself to ignore these tests, whether they're flaky or always failing, but they are still part of the continuous integration, then you're no longer pushing on green. You are just clicking some override button or whatever to push a release that didn't pass
all the tests. Then you're basically ignoring all the testing infrastructure that you have, and it's very likely that you will make mistakes and release stuff that isn't good, because you are ignoring the tests. So having a good testing infrastructure implies not having flaky tests and not having tests that you just ignore, because a failing test needs to be actionable.
Otherwise, your testing infrastructure is just wasting resources and it's not helping you. But on top of unit testing and integration testing, we also need to do other kinds of testing, like load testing, to check that our servers will be able to handle the load that we expect, and even more than the load we expect. It would be great if our application is super successful and then needs to handle more load.
Of course you want to do this not in production but on your test servers, or at least don't start doing it in production; do it on your test servers first. You don't want all your production servers to suddenly crash. So release canaries are a strategy to test in production, but without actually breaking production.
The origin of the name "canaries" is a bit morbid, but it's useful to understand it to know what we are talking about. It comes from the canaries that were used by coal mine workers to know whether there was enough oxygen: they brought canaries with them, and if the canaries died, they knew that they had to get out of there because there wasn't enough oxygen. So yeah, it's a little bit sad for those old canaries.
But we kept this name for sacrificing some of our servers, or our instances, with the new versions of our software. So we check whether the new version is working correctly or not by running it on a subset of the servers, and if we see that it's not working successfully, we roll back to the previous version, and only a few users were affected.
This idea of having a subset also allows us to use less of our error budget. Say you deployed a new version to 10% of your instances and this version was not working, and it took you one hour to realize that it wasn't working and roll back to the previous version. So you had a one-hour outage, but it only affected ten percent. So instead of 60 minutes of your error budget, it's actually six minutes of your error budget, because only 10% of your users saw the problem.
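The arithmetic behind that, as a quick sketch; the assumption, which the talk makes implicitly, is that budget impact scales linearly with the fraction of users affected:

```python
# Budget cost of a partial outage, scaled by the affected fraction.
def budget_cost_minutes(outage_minutes: float, fraction_affected: float) -> float:
    return outage_minutes * fraction_affected

print(budget_cost_minutes(60, 0.10))  # -> 6.0 minutes, not 60
```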
So it wasn't a 100% outage. But it's important that we use this correctly, by checking: we deploy the new version to the new instances, we check that it's working correctly, and if it's not, we roll back. Rolling back is what saves us and what allows us to go back to the previous working condition.
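Putting the canary idea together, here is a sketch of a staged rollout with roll-back-first behavior; the stage fractions and the deploy, check_healthy, and rollback hooks are hypothetical placeholders for your release tooling and monitoring.

```python
# Staged (canary) rollout sketch: widen only while healthy, roll back
# immediately on trouble. All hooks are hypothetical placeholders.
STAGES = [0.01, 0.10, 0.50, 1.00]  # fraction of instances per stage

def rollout(version, deploy, check_healthy, rollback) -> bool:
    for fraction in STAGES:
        deploy(version, fraction)
        if not check_healthy():
            rollback()       # roll back first ...
            return False     # ... investigate and fix later
        # in practice, leave bake time (often a day) before widening
    return True
```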
(What happened? No, we rolled back the slide.) And then, if it's working correctly after some time (it depends on what it is that you are deploying, but usually you wait for one day), you deploy it to more instances.
It depends on how big or small the application or service is. All right, next slide. And then another source of problems is the humans. If we are doing this kind of canarying, we may need a human that is pushing it to one percent, two percent, 25 percent, and then the human may make mistakes.
So it's important to try to remove humans from the loop as much as possible, in other words, to automate as much as possible. And this includes release automation, like the canary process that I was talking about, but it also includes things like automatic rollbacks. This may sound like black magic, but it might be possible to have your monitoring infrastructure detect that your service is now suddenly responding with a lot of 500s, decide that this doesn't seem right, and roll back automatically.
So how can we prepare ourselves for outages? The first step, step zero, is to accept that outages will happen. Even if we feel like they shouldn't, they will happen, and once we accept that, we can prepare for them. So have playbooks for the most common problems. Playbooks are basically documentation of what to do when a problem occurs: if we get an alert, what do we do with this alert?
And you might think that it's very obvious what to do with the alert, but when you're under pressure, when the system is down and the phone is ringing and people ask why you haven't fixed it yet, having a clear step-by-step process of what you need to do, even if it's obvious, is really helpful.
Also, I mentioned this already, but it's very important, so I'm repeating it: have this "roll back first, fix later" philosophy engraved in your mind, because it's very common for software developers to want to fix it in production and do a hotfix. You see the code and you think: oh yeah, there's just a plus one missing here,
I will just add the plus one and push it. And then it turns out that the plus one you added had unintended consequences, and you went from some users seeing a problem to all users seeing the problem, because you didn't realize.
So: roll back first, fix later, even if it's tempting to hotfix. And then also have a process, like a meta-process, for handling incidents: who's going to communicate with the customers, who's going to escalate, who's going to write the postmortem, whatever. Have this meta-process ready, because when you are in a very stressful situation, it's really hard to think on your feet, so the more things you can just follow from a checklist, the better. All right, and I mentioned postmortems, and that's our next slide.
Postmortems, also called root cause analyses by some people, are documents that explain what happened: what happened before the outage, what happened during the outage, how it got fixed, who did what and why, etc. But it's important that none of this is about blaming people, so postmortems should be blameless. Even if you say "Marga ran this command", it shouldn't be "Marga is stupid and ran this command"; it should just be "Marga ran this command". And it's important that we remember that whatever mistakes were made,
even if a human made a mistake, the problem lies in the system, the system that allowed the human to make that mistake. Because we are all trying to do the best we can, and if the system allows us to do things that are wrong, the system is at fault, not the humans, because humans will make mistakes.
We need to engineer a system that will not let those mistakes cause outages. And so the goal of the postmortem is to learn from all of this, not to blame anybody, and to list the action items that will help us prevent these outages from happening in the future. And it's important to follow up on these action items, because it's no use listing all the things you need to do if you then never do them.
All right, so I've given similar versions of this talk before, and usually when I get to this point people are very anxious, because they feel like there's a lot of information and they don't know where to start. So let's look into how you can get started. Next slide, yeah, all right.
So the first step, and it has to be the first step, is to monitor your service. If you don't have monitoring, you need to deploy monitoring, and as I said, it doesn't need to be Prometheus; it can be something simple.
And then, as you start getting this information from your monitoring, your SLOs, and your alerts, you can decide to invest time in testing and automation as needed. And as I said, it doesn't make sense to aim for 100% reliability; invest as much as makes sense for your service. And finally, have a plan for outages and learn from the mistakes that you make.
Mistakes are really valuable in helping us deploy better services, so make sure that you don't just paper over them, but that you spend the time to learn from them. All right, that's it. We have a questions slide, and hopefully there are questions now.
B
Yeah, cool, thanks a lot for the talk, I found it very interesting. I would just like to ask something; maybe you can share some ideas on this side of the problem. In my case, I come from an operations team point of view. We have a lot of the things that you have mentioned, like metrics, alerting, SLOs; we do postmortems, and many teams from the development side also support us. Of course we cannot do this on our own, so many of our applications already support this, but I still find it sometimes not so easy to convince developers why this is important. So, any ideas on this side? To be fair, we don't do the error budgeting yet; we are thinking about it, and this might maybe be something.
A
Maybe have something simple where developers can see the results of their work. If the service is down, if there are too many errors, whatever the problems are that arise from bad features, if it's visible it's easier to communicate and to send the message, and in the end it helps. I understand the struggle.
So maybe try to send the message from that point of view: users are not happy with the service being down all the time, or with it crashing, or with half of the requests returning errors, or whatever issues your application is having.