Cloud Native Computing Foundation Online Programs, 30 Mar 2023

From YouTube: Cloud Native Live: Designing and Operating Reliable Cloud Services

Description

A View from the Trenches

A

Hello everyone, and welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Annie Talvasto, I'm a CNCF Ambassador, and I will be your host tonight. Every week we bring a new set of presenters to showcase how to work with cloud native technologies and the stories behind them.

A

They will build things, they will break things, and they will answer all of your questions. You can join us every Wednesday or, like this time, sometimes on other days as well. This week we have an amazing set of presenters here to talk with us about designing and operating reliable cloud services: a view from the trenches. So it's very exciting for this special program today. As always, this is an official live stream of the CNCF and, as such, it is subject to the CNCF Code of Conduct.

A

So please do not add anything to the chat or questions that would be in violation of that code of conduct. Basically, please be respectful of all of your fellow participants as well as presenters. With that done, I'll hand it over to our amazing presenters to introduce themselves. Who wants to go first?

B

I'll go first. My name's Anurag Gupta. I'm the founder and CEO of Shoreline.io and one of the founders of reliability.org. Both correlate to my lifelong interest in building highly reliable systems.

C

My name is Niall Murphy. I am CEO and founder of a teeny, tiny startup in the SRE/ML space called Stanza Systems. If you know my name, though, it is probably because of the Site Reliability Engineering book, the SRE book, and/or possibly the MLOps book, which are both explorations of what it means to build reliable systems. Stephen?

D

I am Stephen Townsend. I am currently a developer advocate for a company called SquaredUp, who do unified dashboarding and visibility. I did performance engineering for many, many years, moved into SRE more recently, and I have a podcast called Slight Reliability, where I share my learning journey in the reliability space.

A

Perfect, an amazing group of people here with us today. Today we have a great panel format where I will ask questions and then we're going to get amazing answers. But, as always, you in the audience can still ask your questions as we go along, and we'll probably have some time for Q&A at the end as well. So, first question to everyone here: what does good or great look like when discussing reliability?

C

So we only have an hour, right? Yes. So, just off the top of my head, I don't think there's a context-insensitive answer to this. I don't think there is one answer which just fits everything, and I expect most people in the industry wouldn't be surprised to hear that. I will say there's a couple of fundamental questions you have to be addressing, or addressing to a sufficient level, in order for this question to mean anything to you, really. The first one, of course, being: how much does the org care about reliability? If it doesn't care at all, for whatever reason, and there might be arbitrary reasons, then good or great doesn't really matter.

C

I would say, though, that making sure the basics are done well is always going to be a foundational thing you have to have before figuring out what good or great looks like for you even makes sense. That's stuff like: are you measuring what it is that you're doing? Have you decided organizationally how reliable you should be? If you haven't decided that, and you're just getting, I suppose, whatever the native platforms will deliver to you, then you have a huge decision to make about precisely what to improve and where. And are you resourcing it sufficiently? Do you have the right number of people, the right kind of people, the right resources, et cetera?

C

There's a lot of folks who are struggling with the basics before we even come to the question of good or great. I will say, though, that if you have the right measurement infrastructure in place, if you have defined what your levels of reliability should be, and if you are sufficiently resourced to meet those levels, well, that looks pretty great from my point of view.

B

I'd say great, to me, means that your system is running at a level basically indistinguishable from a hundred percent, so it's a receding horizon. You know, you can always get better. Now, here I'm talking about reliability, not availability. Sometimes people confuse the two.

B

They talk about their fleet-wide availability, but that ignores things like: am I getting errors back from the system? Is the performance out of whack? I often find that people take a fleet-wide availability goal and then claim they did really well even if, say, one region was down for an hour. That has happened to me with some of the services I used to work on at AWS, with the services I depended on. And your customers just don't care about your fleet-wide availability. That's an internal goal.

B

What they care about is their particular experience: were they able to check out, were they able to perform their task in their expected time, without any drama and, ideally, without having to code against a bunch of error cases due to inefficiencies in your software.
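
The gap between a fleet-wide number and a customer's actual experience can be shown with a little arithmetic. The sketch below is illustrative only: the region names and request counts are invented, and availability is taken to mean the fraction of successful requests, which is one of several possible definitions.

```python
def availability(good, total):
    """Fraction of successful requests; an empty window counts as fully available."""
    return 1.0 if total == 0 else good / total


# Hypothetical one-hour request counts per region: (successful, total).
regions = {
    "region-a": (1_000_000, 1_000_000),
    "region-b": (1_000_000, 1_000_000),
    "region-c": (1_000_000, 1_000_000),
    "region-d": (0, 50_000),  # fully down; traffic collapsed as clients gave up
}

# The fleet-wide figure pools every request together.
fleet_good = sum(good for good, _ in regions.values())
fleet_total = sum(total for _, total in regions.values())
fleet_availability = availability(fleet_good, fleet_total)

# The per-region figures are what customers in each region actually saw.
per_region = {name: availability(good, total) for name, (good, total) in regions.items()}
worst_region = min(per_region, key=per_region.get)

print(f"fleet-wide: {fleet_availability:.4f}")
print(f"worst region: {worst_region} at {per_region[worst_region]:.4f}")
```

With these made-up numbers the pooled figure stays above 98 percent while every customer in the failed region saw nothing but errors, which is the point being made about internal goals versus individual experience.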

D

I think the ultimate goal of reliability is to support an organization to meet its objectives, whatever they might be, because reliability just for the sake of being reliable, I don't think, is necessarily enough. It needs to have a purpose behind it. So I think maybe good reliability is when the org is achieving against its business goals.

D

The customers are able to achieve the outcomes they set out to achieve. The ops team enjoy their work, hopefully, or are not totally stressed out by constant incidents and fires burning all the time. The technology is easy to operate, incidents are manageable and, of course, the technology is also reliable and available and performs well enough to do what it needs to do. It's maybe a bit fluffy, but that's my take.

A

Yeah, it's a good fluffy take, no worries there. Perfect. So next up we have the next question, which is: tell us a story about a particular or unusual reliability failure that you have seen. Was it a failure of design, implementation, or operations?

D

Oh, I'm going to follow on, so I have a story. It's actually from my performance testing days. There was a system, a system of record, that used to be a mainframe. Over the years it got upgraded, and it got put into a mainframe emulator running on a Windows server. At one point we needed to take some of the data that system used and put it in a big cloud repository.

D

So this mainframe system, which was emulated, would call out to the cloud to retrieve, you know, customers or something like that. It turned out, and we didn't know this, that the mainframe emulator's own database, which was a SQL Server, was considered an external component. It would also make calls externally to the cloud, and there was a performance issue: it would take something like 15 seconds to retrieve a customer.

D

The thing is that those calls were single threaded, so when it was making a call to the cloud to retrieve a customer, the entire system would freeze for every single user for 15 seconds. It couldn't even read its own database to do anything. It was the worst thing I've ever seen. When I think back on it, yes, it was a failure of design, but probably more a failure of too much technical debt.

D

The system hadn't been cared for for maybe a decade or more, and it just hadn't been thought about, so no one knew about it anymore. And then we were introducing new cloud stuff to the old world without thinking about and understanding how these things were going to integrate.

B

So I spent about eight years at AWS, so I saw a lot of failures there. I'd say the biggest ones were almost a failure of imagination, where the system had an error path to deal with a resource becoming problematic, but it wasn't able to deal with the at-scale case where the failure cascaded. For example, the week before I joined AWS, there was a huge outage in Dublin due to a lightning strike, and because so many different machines went out at exactly the same time, we had this re-mirroring storm, where every machine was trying to re-mirror to some other machine that was itself trying to re-mirror, and there just wasn't enough capacity and nothing was going through.

B

The reason I call it a failure of imagination is because we all can see and code against the everyday things that happen all the time. The question is really thinking about what's the largest scale that you're going to go and deal with. You know, instead [crosstalk] in an availability zone there's 100 hosts. You know, but designing [crosstalk] is really, I think, incredibly important when you're designing reliable systems. Niall, you must have seen a bunch.

C

Again, the margin of this call is too small to contain all the things. Actually, in a rare example of trumpet blowing, I run a podcast called Getting There with Nora Jones, the CEO and founder of Jeli.io, which goes into very much detail about incidents and tries to examine them under a framework: the LFI framework, or socio-technical systems framework, or Safety-II lens.

C

There's an emerging movement about how folks actually analyze and respond to incidents, which is making quite a lot of inroads into the SRE community in particular, but also in hospitals and chemical engineering and aviation and a whole bunch of other fields where the standard reaction to incidents can be possibly a little bit more blameful than we might like, and the organization doesn't learn as much as it would otherwise. But in terms of bite-sized stories, yeah, I've loads. I suppose there's the time that somebody dereferenced a pointer incorrectly and everyone saw the data for customer zero.

C

That was pretty cool. Well, that was not so good, because obviously customer zero did not want to show their data to customers one through umpty million. But it was relatively simple from a contributing-factor point of view: the code got changed and pushed. The difficult thing was figuring out what the correct privacy and legal response was, now that you have this data which shouldn't have been accessed by other folks. Then there's the ones where the technical thing is maybe clear, but you're working, internally or externally, with folks who might have some trust issues with the framework of authority that you're dealing with, and so you have to establish trust in the middle of an active incident.

C

I had one with a public sector organization in Ireland, where the technical cause was actually relatively clear from day zero and was something to do with traffic routing, but folks on the call weren't in a position to believe that a large organization could make guarantees about how their systems were performing that the smaller organization was ready to really absorb.

C

So that was primarily a social conversation rather than a technical conversation. And then, a little bit like Anurag's story, I suppose: early in the days of S3, the AWS system used a gossip protocol to tell other machines where the data lived. So if, for example, a data center lost power and all of the machines came back up at one time, they would go: "Hi, I'm machine X, I have chunks 1 through 70 million," and "Hi, I'm machine X plus one," and they would all tell each other and completely flood the network.

C

And they would flood the network even worse, because there was an assumption that this network, which was actually split between two data centers, had exactly the same bandwidth characteristics on every point-to-point link, which of course is not the case if you're going between two data centers. So that was fun.
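
One common way to soften this kind of synchronized start-up storm is to add random jitter, so that machines that all recover at the same instant do not all announce themselves at the same instant. The sketch below shows that general technique only; it is not a description of how S3 addressed the problem, and the machine count, window length, and function names are assumptions made for illustration.

```python
import random


def jittered_start_times(machine_ids, window_seconds, rng=None):
    """Give each machine a random announcement time within the window."""
    if rng is None:
        rng = random.Random()
    return {machine: rng.uniform(0.0, window_seconds) for machine in machine_ids}


def peak_announcements_per_second(start_times):
    """Count how many machines announce within the busiest one-second bucket."""
    buckets = {}
    for start in start_times.values():
        second = int(start)
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values())


machines = [f"machine-{i}" for i in range(10_000)]

# Without jitter: power returns and every machine announces immediately.
synchronized = {machine: 0.0 for machine in machines}

# With jitter: announcements are spread over a hypothetical five-minute window.
spread = jittered_start_times(machines, window_seconds=300.0, rng=random.Random(42))

print("peak without jitter:", peak_announcements_per_second(synchronized))
print("peak with jitter:   ", peak_announcements_per_second(spread))
```

Jitter trades a slower recovery for a lower peak load. It does nothing about the second assumption mentioned above, that every link has the same bandwidth, which would need topology-aware limits instead.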

A

Perfect, a good variety of fun challenges and failures there, great to hear. With these experiences in mind, what design recommendations can you share for delivering services with maximum availability?

B

Let me start. My first take is: you should automate everything that you can. We've done a decent job over the last 20 or 30 years in improving quality, and so I think it's now time to apply that same sort of engineering rigor to reliability as we did for quality. Have things go through pipelines, inject failures, see how they're handled, simulate large-scale events to see if the system can heal.

B

I feel like the more this is software, the more it scales, and the more you can manage it like it's software, as opposed to people and processes, which are intrinsically hard. People change, and processes are sometimes followed and sometimes not.

D

I think maybe the biggest challenge right now is just how distributed our systems are and how many components there are that need to talk to each other. We talk about zero trust for security; I think, in a way, we need to start thinking about reliability in a similar way. So maybe being careful about the vendors that we choose to connect to and making sure that they're reliable, because at the end of the day you are only as reliable as the dependencies that you depend on.

D

That's a whole other way of thinking, I think, and also realizing that there are going to be maybe dozens of components out of your control that you depend on, and building degraded service around that. So if this thing goes down, have a backup plan or provide a degraded service. I think that's one thing.

D

I guess the other thing is, of course, having really effective observability. It's easy to say that about observability, but being able to pinpoint, to understand and see clearly, what's happening in a really complex distributed system is hard to get right. But when you get it right, I think it makes a big difference in improving reliability.

C

There's an absolutely gigantic amount of stuff we could say here; it's even a struggle to keep it to 56 minutes. So I think I'd start off saying there's a bunch of things you could do at different levels, and some of the levels are maybe relatively tactical or code-focused or whatever. One thing that, believe it or not, contributes a lot to reliability problems is: have you checked the return code from your system call or your HTTP request or whatever? What happens if that does not complete, and the information that you were hoping to get does not in fact arrive in the buffer, and you just pretend the buffer is there and has something valid in it, and you proceed to crash merrily? There's all kinds of stuff at that level.

C

Then there's the level up, where you're going: okay, boxes and lines, how do they communicate with each other? Is there some kind of resiliency you can think about there on a design level? There's a, it's not quite a programming language, more a kind of modeling language I'm quite fond of, called TLA+. Hillel Wayne and a bunch of other folks do popularizations of it. Of course, it's a Leslie Lamport production, from the same person who gave you Paxos and other leader-election, master-election kinds of primitives. TLA+ allows you to model state transitions in a distributed system and say: okay, if this goes to here, and we message this thing back with this other information, can we prove that this is correct? Or can we fuzz it enough so that we can look at some cases where reliability might potentially be under threat?

C

Amazon actually used TLA+ in a bunch of stuff and discovered, like, some 39-level-deep stack of stuff that actually ends up being problematic. A little bit like, I think, somebody proved BGP doesn't actually ever converge a couple of years ago, or something like that. Anyway, lots of things you can do.
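
The first, code-level point, checking whether a read actually delivered what was asked for rather than assuming the buffer is valid, can be sketched as follows. This is a minimal illustration; `read_exact` and `ShortReadError` are hypothetical names chosen here, not part of any library.

```python
import io


class ShortReadError(IOError):
    """Raised when a stream ends before the requested number of bytes arrives."""


def read_exact(stream, size):
    """Read exactly `size` bytes or raise, instead of returning a partial buffer."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            # The read "succeeded" but delivered nothing: the data is not coming.
            received = size - remaining
            raise ShortReadError(f"wanted {size} bytes, got only {received}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# A complete record reads cleanly.
header = read_exact(io.BytesIO(b"HDR1payload"), 4)

# A truncated record is reported instead of being treated as valid.
try:
    read_exact(io.BytesIO(b"HD"), 4)
    truncated_detected = False
except ShortReadError:
    truncated_detected = True
```

The same habit applies to HTTP status codes and process exit codes: decide explicitly what happens on the failure path instead of letting a half-filled result flow onward.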

A

Perfect. How are you feeling, do we want to take an audience question here in the mix? Oh, perfect enthusiasm, that's nice! So we have Lauren George asking: how do small and medium-sized organizations protect themselves from outages at the big public cloud providers without breaking the bank? Is it even possible today?

C

The answer I'll initially give: it really depends on what your app is trying to do, and it really depends on what your dependencies are. There's actually a fascinating piece of work from, I think, Walmart, who are in no sense a small or medium-sized company, who do multi-cloud spot pricing for instances, and that's almost the definition of not breaking the bank, I suppose, or breaking other banks. But the basic deal is: figure out what you're depending on, and figure out if, in your application, you can use specific cached results or algorithmic fallbacks or hard-coded fallbacks of various kinds. Note that all of these may potentially have reliability outcomes that you were not expecting at some future point in time. But if us-east-1 disappears, and you are using some subset of stuff, but you can limp along with cached results for a while, actually that's kind of a win.

C

A lot of people are pushing multi-cloud at the moment, which there's been a lot of discussion about in the industry, and basically the feeling is that it doesn't make a whole load of sense to go copy-paste your infrastructure from one provider to another. The providers work hard at making sure that you can't actually duplicate that kind of thing, a little bit like mobile telco billing models, I suppose. The real deal is: if you are using a specialized thing that's only available in one provider, like a BigQuery equivalent, make sure it's something you can do without for a couple of hours if you need to. But I'll stop talking now.
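
The "limp along with cached results" idea can be sketched as a small wrapper that serves the last good answer when a dependency fails, but only up to a staleness limit. Everything here is an assumption made for illustration: the class name, the `fetch` callable, and the 60-second limit are hypothetical, and a real service would also need to decide which results are safe to serve stale at all.

```python
import time


class StaleCacheFallback:
    """Serve the last good result when the dependency fails, within a staleness limit."""

    def __init__(self, fetch, max_stale_seconds, clock=time.monotonic):
        self._fetch = fetch              # callable that talks to the real dependency
        self._max_stale = max_stale_seconds
        self._clock = clock              # injectable so the behaviour can be tested
        self._cache = {}                 # key -> (value, time stored)

    def get(self, key):
        """Return (value, is_stale); re-raise if no acceptable cached copy exists."""
        now = self._clock()
        try:
            value = self._fetch(key)
        except Exception:
            if key not in self._cache:
                raise
            cached_value, stored_at = self._cache[key]
            if now - stored_at > self._max_stale:
                raise
            return cached_value, True
        self._cache[key] = (value, now)
        return value, False


# Hypothetical dependency that can be switched off to simulate a provider outage.
state = {"up": True, "now": 0.0}


def fetch_price(sku):
    if not state["up"]:
        raise ConnectionError("dependency unavailable")
    return f"price-for-{sku}"


prices = StaleCacheFallback(fetch_price, max_stale_seconds=60, clock=lambda: state["now"])
fresh = prices.get("sku-1")      # dependency is up: fresh result, cached
state["up"] = False
state["now"] = 30.0
stale = prices.get("sku-1")      # dependency is down: served from cache
```

Returning the `is_stale` flag lets the caller decide how to present degraded data, which matters given the warning above that fallbacks can have reliability outcomes of their own.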

B

Yeah, I think there's a lot to unpack in Niall's answer. Let me add on to it a little bit. I think the question really becomes to think about reliability the same way that we used to think about security, in terms of threats, except now the threats are when you lose availability of your dependencies. And so it's: for how long can you tolerate that? How do you degrade when that happens, rather than fail entirely? If you're using Lambda and Lambda goes down entirely, you're kind of out of luck. If you are using VMs and a VM goes down, you can probably spin up another VM, particularly if you've got a warm pool already out there.

B

So, you know, it's really a question of tolerating failures, and you can't just duplicate everything, right? Yeah, Stephen, what's your view?

D

Yeah, I don't know. I guess my two cents is that if you're a small to medium organization and you're thinking about going multi-cloud, have a serious think about the risk versus the cost of what you're trying to implement, and the complexity of it. I think it would be a hard case to make, to say that's actually worth the effort and the cost. And yeah, there's other things you can do, like Niall was saying.

A

Great. And then we had a few more questions, but I think we're going to go through some of the pre-decided questions first and get back to a few of those. So Andrew, I saw your question about chaos engineering; we're going to get to it eventually. Lauren also asked about books or articles, and I think that's a great question to maybe wrap up with towards the end, if any of our panelists have great resources for everyone to jump into next. I also saw the podcast question, and I'm going to send the links to everyone in the chat, so no worries there. And then to the next question, which is: how do you identify and resolve potential reliability issues before they become your customers' concern?

B

You know, at AWS one thing that we all did, even Andy Jassy did, was monitor Twitter, because, at least at the time, people would go on to Twitter and ask "Is S3 down?" or something like that. Your whole goal was to make sure that by the time someone was asking that question, you already knew: you already had the event started, you were already working on it. Maybe you hadn't identified the root cause, but you were working on it.

B

So I think it's an important question, and it's a really important goal we should all have, because what you're doing as a cloud provider is taking on the responsibility that your customer would otherwise take on themselves, and so they need to trust you. They need to trust you to care about them in a way beyond what they would do on their own. So you just need to be really good at this stuff.

D

Well, obviously, I think great observability is important, and I think one of the keys there is focus. In the beginning, let's say you're starting from scratch: start with making sure the customer can use the service, rather than getting lost in all the myriad technical metrics and events that you could be tracking and looking at. Because if you can't answer that question, can customers consume the service, then nothing else really matters, in my opinion. I think that SLOs are a good way to potentially do that, but I don't think you have to do SLOs either. That's contentious, but this is my particular opinion.

D

I think the other thing, coming from a performance testing background, is to be testing reliability during delivery, especially for a new product, which maybe you can't just go live with immediately and have load on it, because there are no customers at first. Testing it is a great way to shake out, not everything, because real customers in the real world do unexpected and wonderful things, but you can shake out a lot of issues and understand your solutions and systems and services better.

D

So those are, I think, ways you can identify issues before customers do. But beyond that, there's all the other things we can do to mitigate the impact: the way that we deploy, doing deployments like blue-green or canary deployments, or rapid rollback, things that can reduce risk there. And learning from incidents, not just having incidents occur without gaining something from them, because incidents are fantastic in terms of learning and growing as an organization.

D

And I guess the last thing is really prioritizing the services that matter the most to your organization and your customers, rather than trying to treat everything equally, when you might have some administration API that no one uses or that's not important.
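
For readers who have not worked with SLOs, the arithmetic behind one is small. The sketch below assumes a request-based availability SLO and invented numbers; the function name and the 99.9 percent target are illustrative, not something the panel prescribed.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize how much of a request-based error budget has been used."""
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures > 0 else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
    }


# Hypothetical month: 2,000,000 checkout requests, 500 of them failed, target 99.9%.
report = error_budget(slo_target=0.999, total_requests=2_000_000, failed_requests=500)

print(f"allowed failures:   {report['allowed_failures']:.0f}")
print(f"remaining failures: {report['remaining_failures']:.0f}")
print(f"budget consumed:    {report['budget_consumed']:.0%}")
```

Measuring this against a customer-facing action such as checkout, rather than against host-level metrics, matches the advice above to start from whether customers can use the service.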

C

Okay, so I'm not saying you should do this, but there is an improvement loop available to you, a little bit like Anurag was saying, where the answer to what to do for monitoring is: you do nothing, and you wait for the phone to ring. And the phone goes: "Hello, the site's down." "Oh, thank you very much." Then you figure out what went wrong, and you turn that into an observability rule, and you just do that a billion times, and eventually everything is covered. Hooray! Except, of course, everything is covered, and so you have a billion things to monitor, and it's not necessarily clear which seven of the billion things actually matter.

C

That's a question you can also resolve with this other magic trick. We're pretty big back-end people, right? We think about things in terms of microservices, communicating via RPCs, setting SLOs, and so forth. That is back-end language. It's not front-end language. And the interesting thing about the 2023 Catchpoint and Blameless SRE Report is that it says only 35.8 percent of respondents said that they had client-side monitoring that fed into their observability that they could make decisions about.

C

That's a huge gap, and plugging it can help a lot with observability problems, particularly if you haven't done what's called a CUJ analysis, or critical user journey analysis, for your site or your service, in some sense to figure out what the actual people actually want to do, and instrument that. So, lots to do there.

A

Great, a good suggestion there. So then, getting to the next topic, which is reliability.org: why do you see the need for a new reliability.org community?

B

The reason we started reliability.org, which is a non-profit, nothing to do with selling any of our own goods, is that building highly reliable systems is something of a black art, and it's mostly informed by just bitter experience. That's okay at a hyperscaler, because they've got lots of people with lots of bitter experience and they get better.

B

But it's a problem for the rest of us, right? There's just no good place that you can go to offer your thoughts or to get advice. Twitter kind of used to be that, but it's less so now. So you want a safe place you can go, without a lot of noise from a lot of vendors, where you can do this. So I asked Niall, and I asked other people like Stephen, to join. It's early days, but I'm actually [crosstalk] the conversation in there.

A

Great. Anyone else want to add their viewpoint?

D

Yes, I do I just love to make the Nights okay.

D

Where are we? Come on. I actually wasn't aware of any other reliability communities out there that were not based around a particular technology, maybe, or an open source project. I hadn't found one before, so for me it was a new thing, but maybe that's just because I wasn't aware of what else was going on.

D

I also think that there's generally been a split between open source communities, who are very active from what I see, and these commercial communities of people who are built around a technology like AWS or whatever. So bringing those together, I think, is quite exciting, and getting those different perspectives. I think cross-company collaboration is important as well.

D

I think we could talk more about that, but I know someone in New Zealand who's an SRE in quite a large, important organization, and he's the only one; he's a one-man SRE. So he has no internal community whatsoever, and his only chance to get ideas and bounce ideas is to go out to the industry.

C

On the other hand, he's living in New Zealand; I mean, it could be worse. So I hang around a lot in either role-focused or conference-focused or sometimes technology-focused Slacks, which is what it tends to be these days rather than mailing lists or whatever. So I like the idea that there would be a cross-role, cross-company, cross-industry conversation about this that isn't tied to any one particular thing, and that seems to me to be a coin.

A

Great. So now that this great new community has started, how do you see the reliability.org community growing and evolving in the future?

B

Let me start. You know, communities are hard. They take nurturing, they take constantly adding useful or thought-provoking content, and it takes creating a safe place where you can offer your opinion, even if there's an expert like Niall hanging out there who might just say, "Well, in my experience, the answer is 12." And then what do you do? No, it's not like that; it shouldn't work like that. You want to provide your response.

C

The answer is 13. Let's just get that right. Yeah, I mean, I don't know. Communities are hard. Things are hard. Maybe it'll be wonderful, maybe it won't be, I don't know. But I will say that I'm increasingly anxious about the future of the SRE profession in a world which is, I suppose, growing increasingly unreliable. Is that a good word for it? I suppose. And in a sense, I have this whole thing; I've talked about this at SREcon a number of times, about weaknesses in the intellectual foundation for justifying the value of reliability, and so on and so forth. So I think those questions are unanswered.

C

I think that part of what's happening right now is this idea that actually things can be worse and it's totally fine, or maybe it's not fine, but we'll just do it and move on. And I think that's a terrible user experience and a terrible abandonment of contracts with the user, right? If nothing else, emotional contracts, even if they're not SLAs and so on. I think that kind of stuff needs attention, and generally one of the ways it makes progress is via cross-community conversations.

D

I also see the potential in the community for mentor-mentee relationships, potentially something we could expand upon. And, as has already been said, just the idea of having a place where you can put out an idea and bounce it off people with a whole wide range of experiences is fantastic, and I can't overstate how important that is.

A

Great. And then, Stephen, as you were one of the first people to join the reliability.org community, what made you jump in right away?

D

Yeah, so I work for a smallish company, around 100 staff. We're building a new product, and we haven't hit that inflection point yet where reliability is the key thing; it's more about building the right product at this particular point in time. So I was really excited to have a place where I could go and keep my finger on the pulse and hear what's happening in the reliability world, because I'm not getting the chance to do the work every single day.

D

So that's a great thing, and I just really like the vendor-neutral nature of the community. As I said before, most other communities that I've been part of or have heard about have been around a project or a technology, and this is great. I'm just excited. I don't know what's going to happen, but, you know, it's great.

A

Amazing. So, to everyone here: how can people get involved with the reliability.org community?

B

You go to reliability.org, the website, you join the Slack, you introduce yourself, and then you start contributing. I mean, it's pretty simple, and the more people we've got, the more activity there is. Finally, you know, communities are all about participation.

A

Great, good answer. Since no one wants to add anything, I guess, perfect. Now I think we're going to grab one audience question before we go on to the next question here. Andrew asked before: how trendy is chaos engineering, the practice Netflix pioneered a few years back? And he adds that, of course, Kubernetes wasn't as popular then as it is today. So, your takes on chaos engineering?

C

So, if you'll forgive me for putting on my copy editor hat, or parser hat, and going, "How trendy is chaos engineering?" Do you want a scalar answer, like 13? Or are you asking how important it is that I should know about chaos engineering, or how relevant it is in the industry? I have only my opinion here; I don't have strong survey data or anything like that. I think one credible group of people who are doing this are Verica, verica.io, if you've come across those.

C

They also run the VOID database; that's Courtney Nash, from the VOID, a database of incident data, out of Verica as well.

C

But the main thing is that chaos engineering is really useful; it's a really fundamental technique. Instead of just waiting for things to arbitrarily break in your production and tidying up afterwards, you go, "Okay, I'll break a tiny bit of it all of the time, and if I break the right tiny amount of it in the right place, I'll hopefully learn something that I can make progress on with respect to improving reliability in my production, without actually having a complete wipeout event."

C

So if you think of outages as a tree of possibilities flowing from some kind of single node, then chaos engineering helps you to explore, depth first, a bunch of potential failure modes that you might otherwise only really encounter after they've set something serious off.

C

It has one particular downside, as I understand it (I have no direct evidence for this), which is that people go, "Okay, so you're going to break my production," and they don't like that bit. They go, "I would much rather just wait for it to break completely than break it a little bit all of the time, because then the moral failure is somehow not directly connected with my actions," which is not true at all, of course. But there is an issue with having the case for it resonate with certain kinds of audiences. That's all I can tell you.

D

Would it be fair to say that if your engineers are already fighting fires constantly and there's a lot of technical debt, chaos engineering is probably not a good action to take at that time?

C

I suppose you could make that argument. It depends, because often chaos engineering is quite complex to set up. You can set up a simple bot that goes around zapping arbitrary VMs every so often, but if the subset of VMs that get zapped aren't performing different functions, you only ever learn the same thing that you were going to learn anyway. So it's not adding to your stock of knowledge.

C

So in order to be really useful for the organization, the chaos engineering has to be doing something nasty and new to you every time. But if you're getting nasty and new things happening to you all the time anyway, that's not much additional value. So why don't we just deal with the thing that's happening right now, which is nasty and new, until it stops being so new, at which point we can introduce chaos engineering again.
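A minimal Python sketch of that point, assuming a hypothetical inventory of (VM name, role) pairs: rather than zapping arbitrary VMs, pick targets only from roles that earlier experiments have not covered, so each run has a chance of teaching something new. The function and field names are illustrative and do not come from any particular chaos tool.

```python
import random
from collections import defaultdict


def pick_chaos_targets(vms, already_tested_roles, rng):
    """Pick one VM per role that has not been disrupted yet.

    vms: list of (vm_name, role) tuples (hypothetical inventory format).
    already_tested_roles: set of roles that earlier experiments covered.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    by_role = defaultdict(list)
    for name, role in vms:
        by_role[role].append(name)

    targets = {}
    for role in sorted(by_role):
        if role in already_tested_roles:
            continue  # zapping this role again would teach us nothing new
        targets[role] = rng.choice(by_role[role])
    return targets


inventory = [
    ("vm-1", "web"), ("vm-2", "web"), ("vm-3", "web"),
    ("vm-4", "queue"), ("vm-5", "cache"),
]
targets = pick_chaos_targets(inventory, already_tested_roles={"web"},
                             rng=random.Random(0))
print(targets)
```

When every role has already been tested the function returns an empty dict, which matches the remark above: at that point another random zap adds little.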

B

So my quick thought is that chaos engineering can be done chaotically, right? Where you're just doing random things, the Chaos Monkey kind of case, and I don't personally find that terribly useful. It's kind of like fuzzing, compared to thinking very carefully about your testing framework or security framework.

B

What I do think is incredibly useful is fault injection, where you really think deliberately about what percentage of things you treat as a fault as you call a subsystem, and how well you deal with those things in terms of retries, in terms of redirects, whatever it is. And that, I think, can be done in a very careful, methodical way that you can actually get some use out of, because it's very hard to get useful knowledge out of randomness.
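One way to picture deliberate fault injection is the Python sketch below. It assumes a hypothetical `lookup_user` subsystem call, a chosen fault rate of 20%, and a simple retry loop; none of these names or numbers come from the discussion, they only illustrate the idea of treating a fixed percentage of calls as faults and checking how well retries cope.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real subsystem failure."""


def with_fault_injection(call, fault_rate, rng):
    """Wrap `call` so that roughly `fault_rate` of invocations fail on purpose."""
    def wrapped(*args, **kwargs):
        if rng.random() < fault_rate:
            raise InjectedFault("injected fault")
        return call(*args, **kwargs)
    return wrapped


def call_with_retries(call, max_attempts):
    """Retry `call` up to `max_attempts` times; return (result, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call(), attempt
        except InjectedFault:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the failure to the caller


def lookup_user():
    # Stand-in for a real subsystem call (hypothetical).
    return {"id": 42}


rng = random.Random(7)
flaky_lookup = with_fault_injection(lookup_user, fault_rate=0.2, rng=rng)

successes = 0
failures = 0
for _ in range(1000):
    try:
        call_with_retries(flaky_lookup, max_attempts=3)
        successes += 1
    except InjectedFault:
        failures += 1
print(successes, failures)
```

Because the fault rate is an explicit parameter, the experiment can be repeated at 1%, 5%, or 20% and the results compared, which is harder to do with purely random disruption.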

A

Great. If there are no additional comments, let's hop on to the next topic, which is: what are the top causes of major site outages, and how can people avoid them?

C

Yeah, so there's some old data from the second SRE book suggesting that around 70% of outages flow from change of some kind, like a config change or a binary change or whatever. So stop changing and everything will be fine. Oh, hang on, actually, I'm very sorry, it turns out you can't stop changing. Okay, so what we have to do instead is to change in a particular disciplined fashion, so we don't trigger the unexpected interactions between attribute sets A, B, and C on this thing and attribute sets C, D, and E on this other thing.

C

And that looks like a bunch of relatively well-known best practices, which are still not universally done today: it's CI/CD, testing in production, canaries, fast rollback. I once worked in an organization that wasn't able to roll back and sometimes wasn't able to roll forward, which was also interesting.

C

So organizing your production such that those things can be done is more or less the first step towards tackling that root cause, for want of a better term, for most of those outages.
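As a rough illustration of the canary-plus-fast-rollback practice mentioned above, the Python sketch below compares a canary's error rate to the baseline and decides whether to wait, roll back, or promote. The thresholds (`max_ratio`, `min_requests`, and the absolute floor) are assumptions chosen for the example, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline, canary, max_ratio=2.0, min_requests=100):
    """Return "wait", "rollback", or "promote" for a canary release.

    max_ratio and min_requests are illustrative assumptions.
    """
    if canary.requests < min_requests:
        return "wait"  # not enough traffic on the canary to judge yet
    # An absolute floor keeps a zero-error baseline from making any
    # single canary error an automatic rollback.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return "rollback" if canary.error_rate > allowed else "promote"


baseline = Metrics(requests=10_000, errors=10)
print(canary_decision(baseline, Metrics(requests=50, errors=0)))
print(canary_decision(baseline, Metrics(requests=1_000, errors=5)))
print(canary_decision(baseline, Metrics(requests=1_000, errors=1)))
```

The decision is only useful if the rollback it triggers is itself automated and fast, which is the organizational prerequisite described above.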

D

This is purely speculation, but I feel like an increasing number of outages are going to come from the dependencies that we have, because we've got so many dependencies and they're growing all the time. So I feel like that's going to be an increasing source of incidents and outages.

C

You can feel, Stephen. You're a human being; you're allowed to feel.

D

I used to work with a guy, and he used to tell me that SRE is DevOps without empathy.

C

Bitter experience, I see.

B

We used to have a two-hour meeting every week where we'd go through the prior week's outages in some important services. And so, if I think about the collection of, shall we say, greatest hits that were on replay across those weeks, there were always things like database issues, bad deployments, expired certificates, and misconfigured network settings. It's very similar to what Niall was saying. And what's common across that set is that there's widespread, severe impact, because they have a large blast radius, and that they take time to resolve.

B

So how do you deal with that? Well, one thing is that you plan the deployment rollout so that you control the blast radius. You automate the rollback of changes so that you can minimize the time of failure, because it's basically an integral of the severity, the duration, and the breadth of impact, right? And so if you can reduce any one of those dimensions, I think you make progress.
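The "integral" framing can be approximated in a few lines of Python: treat an outage as a series of phases and sum severity times breadth times duration over them. All of the numbers below are made up for illustration, and the `outage_impact` helper is hypothetical.

```python
def outage_impact(phases):
    """Approximate impact as the sum of duration * breadth * severity.

    phases: list of (minutes, fraction_of_users_affected, severity) tuples,
    where severity runs from 0.0 (no effect) to 1.0 (fully down).
    The result is in "full-outage minutes" across the whole user base.
    """
    return sum(minutes * breadth * severity
               for minutes, breadth, severity in phases)


# Big-bang deploy with manual rollback: everyone is down for 60 minutes.
big_bang = outage_impact([(60, 1.0, 1.0)])

# Staged deploy with automated rollback: 5% of users are down for 10 minutes.
staged = outage_impact([(10, 0.05, 1.0)])

print(big_bang, staged)
```

Shrinking any one factor shrinks the product, which is why controlling the blast radius and automating rollback both help even when the underlying bug is the same.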

B

I mean, we pretty much stopped using databases internally, because, as Charlie Bell used to say, it's like putting a switchblade in your baby's crib. Just don't do that. It's complex software that's easy to use, and you should stop using it. Opinions may vary; I've written a lot of databases over the years, so I'm something of a fan.

C

"Stop using databases" would definitely be a message I would not expect to issue from one Anurag Gupta, Esquire.

B

I use SQLite at Shoreline, for all of the bitter experience.

C

SQLite is awesome, actually, and the unit test framework for SQLite is really awesome. But "type safety is for wusses" is, I think, the general idea. Yes, sorry, I'll stop there.

A

No, no, it's good! It's great to have some discussion. So, how do the best people out there manage their cloud environments?

C

When I meet them, I'll let you know.

C

I think "the best people" isn't necessarily the right framing for this, because the environment and the resources also matter a lot, and if, quote, the best people don't have the right environment and resources to do the work, they'll go somewhere that does.

C

So there's a lot of nuance behind that question. But I would say that a lot of the things we talked about earlier in this session, with respect to understanding your dependencies, figuring out observability, figuring out critical user journeys, and making sure you can roll back, are best practices that, quote, the best people, unquote, are either doing or have a good excuse for not doing. And sometimes it's a question of picking your excuse.

B

I'll add in the idea of designing for reliability. For example, injecting faults in production can sound scary, but resilient architectures can handle it.

B

So, for example, I built Amazon Aurora, and it effectively injects a large-scale event every week, because it does a deployment which takes one out of six elements of the quorum out while it does the deployment. But it handles it without any drama, and that's just because it's designed to deal with that failure. And dealing with that failure means you can also deal with AZ failures or disk failures or network failures, and so on.

B

So I think that sort of basic design methodology matters. I'm not saying you should all go do six-way quorums, but I think that kind of designing for the fragility of the environment in which we operate is important.
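To make the quorum point concrete, here is a small Python sketch. The discussion only mentions six quorum elements; the 4-of-6 write quorum, 3-of-6 read quorum, and two-copies-per-AZ layout used below follow the publicly described Aurora storage design and should be read as assumptions of this example, as should the `quorum_status` helper itself.

```python
def quorum_status(total, write_quorum, read_quorum, failed):
    """Report whether reads and writes can proceed after `failed` copies are lost."""
    healthy = total - failed
    return {
        "can_write": healthy >= write_quorum,
        "can_read": healthy >= read_quorum,
    }


# Six copies, two in each of three availability zones (assumed layout).
TOTAL, WRITE_Q, READ_Q = 6, 4, 3

scenarios = [
    ("deployment takes out one copy", 1),
    ("a whole AZ is lost", 2),
    ("an AZ plus one more copy", 3),
]
for label, failed in scenarios:
    print(label, quorum_status(TOTAL, WRITE_Q, READ_Q, failed))
```

Under these assumptions the weekly deployment (one copy out) and the loss of a whole AZ both leave reads and writes available, which is the sense in which designing for the routine failure also covers the rarer ones.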

A

Great. So now let's grab another question from the audience. We had Luther asking: "My company embeds an SRE team in rotation with different teams, in the hope of working closely with them to improve reliability and monitoring, with occasional in-house workshops. Any downsides to this approach?"

A

Any opinions? Okay.

B

I'll take a crack at it, and then Niall can correct me, because at least there'll be something to respond to. So I think the key question in here is about ownership. Who owns the issues, the failures?

B

How do they escalate? I've seen orgs that have put in SRE functions, but everything still flows to engineering to fix, so it's just a bump in the wire, and that's useless, right? And having someone look over your shoulder and just tell you, "Okay, you're not doing processes well enough, you're slouching, you're not typing correctly," that's not helping me make things better.

B

What does help me is if they're actually there, shoulder to shoulder, fixing things, which I think Luther's point kind of touches upon: the notion of embedding together, rather than creating it as a cascade or treating it as a separate retrospective function. Niall, you know all about this.

C

I mean, there's a huge amount of nuance to this, depending on what the definition of team, rotation, different team, et cetera, is. All of those could have a huge impact on what the end result ends up being. I will say I am most familiar with the single-individual embedded model, rather than team-to-team embedding. Team-to-team embedding seems weird; that's a partnership model, not an embedding model.

C

I think the weakness is with the single-individual kind of model, where you would have an SRE go and sit with a team that has a particular reliability challenge or some kind of knowledge deficit, for six months, say. That's a pattern that was very common in Facebook Production Engineering, last I was aware. The thing that model is good at is responding to particular emergencies or lacks quickly. Yes, these 17 teams have some problem; we will send staff out there, and they will fix stuff, and so on.

C

Yes, often it does help.

C

But what tends to happen is, if you do a lot of these kinds of rotations, you have no real team identity, at least not one that lasts longer than the period of the rotation. And even though you might not think it's that important, that actually turns out to be really important with respect to giving people the idea that they can build a career and make a long-lived contribution to long-running services, all of which are necessary sub-components of promotion, amongst other things.

C

So those are the upsides and downsides of embedding. The other question, I suppose, that hides behind some of that is: why does the team in question feel they can't do this themselves? Are they looking for warm bodies? Because if they're looking for warm bodies, that's, generally speaking, not a good sign. It's like, "I don't care who they are, just get them cranking at the code now."

C

Well, actually, that's maybe not good. Or are they looking for specific, guided expertise on something, in which case that can sometimes be a bit better? It depends, he said, shockingly.

D

In my last role, we had a different kind of embedding, I guess. We were an enabling team, and so we would spend time with one team at a time, helping them uplift because they asked us to or because there was a particular need. And I guess it worked quite well, but the challenge was that if the team had competing priorities from an organizational perspective, like "you must deliver this bunch of features by this deadline," that would just completely blow away anything we were trying to do, because they just couldn't listen.

D

They were too busy worrying about all this other stuff they had to do. So if the priority isn't there, then embedding, or that sort of enablement, is pretty hard.

B

I found it worked really well at AWS to inject a highly skilled person right when the service was launching, because, typically the service team members didn't really have any understanding of how to operate in production or deal with all of the various processes you know and so having someone sort of help them train those muscles helped.

A

Right. So now we can enter the audience Q&A. Not that we haven't taken audience questions already, but if you have any questions in mind, now is the perfect time to ask them, because we have a bit less than 10 minutes left, so type them away. And while we wait for people to type their questions, if they are now frantically typing away, let's ask the last question from my side. So tell us about the best... Oh, we immediately have a question.

A

So let's jump into that and leave my question for a bit later. Have you found that those with certs, such as Azure, AWS, and so forth, are better at thinking through designing and operating reliable services?

B

What's meant by certs here?

C

I think it's cloud certification programs or applications, yep. What's your view? Yeah, so I have a definite view on this, which may not be shared by other people, he wanted to warn everyone. Like you say, Lauren, I have no certs. In 11 years in Google, for example, I never met a single person that had a vendor-related cert. I think that's true.

C

And in general, I'm not saying you should be suspicious of them. Certificate-having people are people who said, "This thing is important, and I should work towards getting knowledge about this thing," which is a hugely positive signal. But that's an abstract way of representing the situation with respect to certs. The concrete way of representing the situation with respect to certs is that often there's a very fixed set of knowledge you have to have, and that fixed set of knowledge might not map onto your problem domain, and it changes quite quickly as well.

C

I know that back in the very old days, with the networking certifications that Cisco used to do, CCNA and all that kind of stuff, you would spend, you know, 36% of your life learning about the difference between Type 2 and Type 3 LSAs in OSPF. And how relevant is that, really? Probably not that relevant. I suppose it helps to distinguish you, formally in some sense, from the larger mass of people who have no experience with these ideas, but it isn't in any way a guarantee that they will successfully wrestle with your particular problems.

B

Yeah, so I kind of agree with you. I feel like you can have three tiers of people. There's the tier who actually don't know what they're doing. Then there's a tier where they've shown, through some sort of certification, that they kind of know what they're doing. And then you've got the tier that you really want, where they really know what they're doing, because they're working at it and they're way too busy to get certifications. And so the problem is:

B

How do you distinguish between the top tier and the bottom tier? Assuming you can do that, I don't think certs matter. If you can't do it, they matter a lot, right? I mean, suddenly everyone's been an AI engineer for the past 30 years, and I kind of doubt it, but that's how they describe themselves.

A

Okay. Anything to add, Stephen? And then we have one question from the audience. I also want to address that you're asking about the profile as a community member. I think there are going to be good resources online about how to change your profile, and I don't think we have the time to handle that question here. But I do want to get back to Lauren's earlier question about any great books or articles that can help guide monitoring and observability of automated services, so that alerts are meaningful and actionable for teams.

C

Yes, so there is another book, yet another book, called the service level objectives book, primarily written by Alex Hidalgo: the SLO book. It has a chapter in it about SLO monitoring, which happens to be written by somebody called Niall Murphy, who I've never met, and this Niall Murphy wrote about how to do pretty concrete steps with respect to SLO-based monitoring and observability. I've been told it's a really good introduction, so that might be a place to start.
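As a rough idea of what SLO-based alerting can look like, the Python sketch below computes an error-budget burn rate and pages only when both a long and a short window are burning fast. This is one commonly described approach, not necessarily what the chapter mentioned above covers; the function names, the window shapes, and the 14.4 threshold (a frequently cited value for a one-hour window against a 30-day budget) are assumptions of this example.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn the budget faster than `threshold`.

    Each window is an (errors, requests) tuple. Requiring the short window
    as well avoids paging for a problem that has already stopped.
    """
    return (burn_rate(*long_window, slo_target) > threshold
            and burn_rate(*short_window, slo_target) > threshold)


SLO = 0.999
# Ongoing incident: 2% errors in both the last hour and the last 5 minutes.
print(should_page((2_000, 100_000), (200, 10_000), SLO))
# Recovered incident: the last hour looks bad, the last 5 minutes are clean.
print(should_page((2_000, 100_000), (0, 10_000), SLO))
```

Alerting on budget burn rather than on raw error counts is what ties the page to user-visible impact, which is the "meaningful and actionable" property the question asks about.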

C

There are a lot of other monitoring resources out there. Honeycomb has an observability book, which I think is very good, actually. Yes, excellent, Stephen's holding it up now. It's written by many of my favorite people. There are also some sections in the SRE books about monitoring, and something from James Turnbull, I think a couple of years ago, called The Art of Monitoring, which I also think is freely available online.

D

Yeah, loads of places. There's an annual conference called o11yfest. You can still go to o11yfest.org, "o11y" with the ones, and I think you can still watch them all for free, and there's tons of really good content there if you want to watch those. And Monitorama, yeah.

B

The other thing I'd say is that pretty much everyone who's a luminary in the SRE community is sitting on Twitter, and you could just reach out to them with your questions. I've followed a lot of people there, and they're pretty generous with their time. So, yeah, you can do better than reading some dusty book.

A

Great. Niall, anything to add?

C

Yeah, I was going to say that ChatGPT might also help you to write your code, and it might be more available than the time of some of the people on Twitter. But, yes, folks are generally available.

A

Yeah, perfect. But I think that's all that we have time for today. It was great to see so many amazing questions and answers from everyone here, and thank you, as always, to everyone for joining the latest episode of Cloud Native Live. It was great to have a session about reliability here today, and I also love the interaction and the questions from the audience. You can always tune in in the coming weeks; we have more great sessions coming up. Thanks for joining us today, and see you all in the coming weeks.