Red Hat OpenShift OpenShift Commons Briefings, 10 Jul 2020

From YouTube: Learning from Incidents John Allspaw (Adaptive Capacity Labs) 2020 07 10 OpenShift Commons Briefing

Description

Learning from Incidents
John Allspaw (Adaptive Capacity Labs)
Andrew Clay Shafer and Diane Mueller (Red Hat)
OpenShift Commons Briefing
July 10 2020

A

All right, everybody, welcome back to another OpenShift Commons briefing with the good folks from the GTO office. Andrew Clay Shafer is here with us, and John Allspaw from Adaptive Capacity Labs and other incarnations of himself, and today we're going to talk about learning from incident, or rather, incidents. Okay.

B

You know, I wish we could have only one.

A

One incident, yes. Well, zero accidents this week in my household; how about yours? So I'm gonna let Andrew and John introduce themselves, and we're gonna have a rolling conversation here today, so no slides. Go for it, Andrew. Take it away!

B

Yeah. So I've talked to you before, and I'm not sure I want to talk too much about myself, but I will talk about myself a little bit to introduce John. The thoughts that I have around some of the things regarding DevOps and operations were definitely influenced by this man, John Allspaw, and the way that he got to be part of some very, we'll call them generative, projects, and gave a talk that I would say essentially gave DevOps, the movement, its name. There's this famous talk from the Velocity conference where John Allspaw and Paul Hammond talk about dev and ops cooperation at Flickr, and that chained into a bunch of other things and led to a bunch of conversation about DevOps. He is a big part of the Velocity conference.

B

He also wrote some books, and now he's really focused on, and very passionate about, this notion of learning from incidents and human factors. I'll let him introduce himself a bit more, and then we'll chat about that.

C

Thank you, sir Andrew.

C

Yeah, that's a great intro. I would just say that I've learned as much from you, Andrew, as you might have learned from me. Yeah, that's about right. And really, at the highest level, what's on my mind and my colleagues' minds is introducing new ways of looking at how work gets done, and one of the most effective ways of looking at how work gets done is by looking closely at incidents.

B

Not as a pitch, but explain what you do at Adaptive Capacity Labs now, in a kind of way.

C

Sure, yeah. So what we do at Adaptive Capacity Labs (it's a small consulting group) is help organizations understand how they learn from incidents currently, who learns, where that learning travels or dissipates, and how to glean more and richer understandings of their incidents, to help them do what they're already doing but which tends to go sort of unnoticed, and that is preventing incidents. A great deal of doing this work means bringing a host of particular techniques from research into human factors and cognitive work in other domains.

C

But none of those techniques and methods are new; they're just new to being applied in the software domain in the way that we do that.

C

Worth mentioning: from time to time, organizations will experience a significant event, something that's really visible, so it's sometimes advantageous for them to hire us to do the analysis ourselves. A really bad oversimplification, but we're gonna say it anyway:

C

You could think of one of the things that we do like this: you've heard of the NTSB in the US, whose role is accident investigation in aviation and other transport fields? You can think of Adaptive Capacity Labs as helping build a mini NTSB, a cadre of people inside your organization who have NTSB-like skills and expertise that they currently don't have.

B

To reify this a bit: when we say incident, what we're talking about is the website is down, or something?

C

I don't know, actually. As it turns out, and maybe this is what you've teed up for me here in an incredibly veiled way, the definition of incident is not as crisp and standard and clear as we might think.

C

As we know from looking at real incidents, incidents don't always show up with a big label on their forehead that says "I'm an incident." Even working out whether a thing is an incident, it can be...

A

A few weeks ago, Cat Swetel was on, talking with me on another one of these sessions, and she brought up a slide that had a picture of, like, a sign on a factory floor: zero incidents in the past 365 days, or whatever. And basically the anecdote she was telling was that whenever she saw something like that, it panicked her a bit, because that meant they weren't watching for something, or they were missing something. There's just really no way that there wasn't something that they could learn from these incidents. And so I think the definition of incident has lots of different semantic meanings in different ways. I think that's a key piece of the conversation.

B

Indeed. I don't know where you want to go with that, John, but keying off that, this notion of what's considered an incident is also, in some cases, a question of blame, right? Or attribution, causation. I know you have lots of thoughts on this, so maybe you could give us a little monologue about some of these things.

C

Yeah, well, first, I actually like what Diane had brought up, and so I'll riff from that vantage point. So, like I mentioned, a lot of what we do is bringing new perspectives to understanding what makes work hard, what makes people good at it, and what could potentially either support or hinder their ability to do work.

C

The majority of the techniques and perspectives come from what are sometimes called safety-critical domains, like power plants and medicine and military and transportation, all that sort. So we have to remember that even declaring a thing that happened, an event, as an incident, labeling it as an incident, is itself a categorization.

C

The notion that there are really only two categories sounds cartoonish, or at least I hope it sounds cartoonish to a lot of the viewers here, but quite often, in the wake of an accident, you'll hear: okay, well, is this a result of human error or technical failure? For whatever reason, journalists want just one of those two categories.

C

That frame, what makes an incident and what makes not an incident, is sort of beyond the scope of this, but as a bit of trivia: Heinrich, in the early 1900s, put together this notion that you could categorize, and by that I mean even declaring a thing an incident or not, or human error versus technical failure, and that was his sort of contribution. What's not often brought up is that he worked in an insurance company, and so having a perspective on a categorization is political as much as it is, you know, genuine curiosity.

C

What stems from this is exactly what you just mentioned, Andrew, which is blame. Blame certainly gets a lot of attention because it's sort of palpable. It's telling the story of human error, or making it about the individual attributions of a particular person. This is the, you know, "the root cause was Stephen," right, or something along those lines. It is really just a special version.

C

Yes, again, exactly. And, to belabor it once more, there's the notion that we have uncertainty, and for that matter a sort of discomfort, in the wake of an accident (an accident meaning a thing that has some form of surprise and adverse effects or whatever). So think: one, that came out of nowhere. Otherwise it wouldn't be a surprise; in some way it came out of nowhere, right?

C

But to admit that those things are possible in the future, and the sort of ever-present dread that they can't all be anticipated, means that we have to put this fear somewhere. This general, like, "oh my god." I want to feel good about the future; even if it's a lie, I'd rather feel good, right? And so where can I place this, this emotion?

B

In the case of blame.

C

I can put it...

B

We're uncomfortable with uncertainty.

C

In particular, we want a place, some form of, just, a place to put it. You know, scapegoating: pile the sins of the village on the back of the goat and send the goat out of town, right?

C

It is the way. So we need to put it in a box (notice I didn't say container). Put it in a box, and sometimes that box is embodied in a person. Sometimes it's in a really big, vague "it's in the system, man," or sometimes it's in our cloud vendor, as long as there's a place to put the uncertainty. The underpinning of this is developing an understanding of the incident.

A

So you can be... yeah. Aren't you missing just a little piece, though? Or, I'm sure you're not missing it, but there's always that phrase that failure is where the innovation comes from. But when we put things in a box or, you know, a container...

B

So, where you stop looking. I want to make one quick comment that I think might help the listeners, which is that John and I have spent hours and hours talking about some of these things over the last 10 years, and I think it's in our best interest to articulate that these quote-unquote systems are neither human nor technical. They're sociotechnical: both of those things together. And I'd also add, and I think this is relevant to the OpenShift community:

B

There is no organization on the planet running any of these systems that thinks that the systems themselves are fully autonomous, and that their reliability is not dependent on the actions of those human entities and agents to keep the reliability.

C

Yes, yes, well said. I think we've probably spoken on the order of days, Andrew, on these topics over the years, in total.

B

We're...

C

Yes, we are. And, you know, a big part of taking some of these perspectives can be somewhat mind-flipping, actually. So Diane, you said something about failure being sort of where innovation happens, which is undeniable.

C

One of the things that I have come to understand in a really deep way is something that's quite unintuitive, which is that success, understanding how people are just plain doing their work, can also be a significant source of innovation. Right? You can think of many products in the world, very successful businesses, that turned what was otherwise a workaround in a previous product into a significant and really groundbreaking service. I think CDNs are a great example of that.

C

But the difficulty, and this is the difficulty with the field of resilience engineering, is that I can't just say, "All right, everybody, at the end of the day, let's get together and talk about all of the ways that the site could have gone down but didn't," right? Because there's not enough time. We'd be there until the next day. And this is reflected in the same thing in safety, right?

C

The denominator in Cat's slide is a great example in the world of safety, where you see those signs. Those signs, actually, to your point, are lacking in a number of places. Notice the denominator's missing, which is one. It takes for granted that all incidents are the same. It also doesn't count how many incidents were prevented.

C

It only shows the ones that were there. And Erik Hollnagel (Hollnagel being sort of a pioneer of resilience engineering) has said that when you start measuring things by what is not there, you run into some difficulties. You can certainly prevent a lot of shots on goal, but if you're not scoring, then it might not end the way you think. But again...

B

For resilience engineering?

C

Do I have a pithy...

C

Okay, yeah, improvised, I would say that resilience engineering is, currently, the study of adaptive capacity, that is to say, investments in adaptive capacity playing out in real-world situations, standing on grounded and concrete empirical evidence. Resilience engineering is the engineering of resilience, but it stands on understanding what resilience looks like to begin with.

C

It's a field, it's a domain, it's a community, and it's twenty years old at least, but it's only maybe five years old with software and technology starting to bridge and understand those. Not very pithy.

B

Not very. I think, I mean, one of the interesting side notes is that there were things emerging in practice that were gravitating towards what you just described as resilience engineering that definitely predate the five years that you're giving it, the label.

C

You're absolutely right, and that is the thing that was fascinating. When I first became interested in this, did my master's degree, and started and continued reading, when I contacted the heavies in that field, you know, it was Richard Cook and Dave Woods and Sidney Dekker and Steve Shorrock and others, I...

B

So this should probably have come out in the introduction, but just for the listeners, walk through how you got there, right? You ran these websites; give us a little bit of that arc to where you've gravitated towards that field.

C

Sure, sure. So I worked at a photo-sharing website called Flickr, as you mentioned. We were acquired by Yahoo, but for the most part we were sort of our own standalone entity, and we grew in ridiculous ways. I mean, in cartoonishly stratospheric ways. We went from being like the 25th most trafficked property at Yahoo to the fifth most trafficked, behind, like, the front page and Mail, right, in like 18 months.

C

The complexity of the back end of the website, all of the things that made things work, kind of exploded. I had a team of six infrastructure engineers, and at some point we had some big outages, some pretty significant outages, but I couldn't get over the fact that on paper we should have had way more, and I couldn't understand what that was about.

C

And actually, having been responding to some of the incidents, after you work out the incident (and all incidents can be really harrowing), in the aftermath we'd just kind of go: okay, wow, that was bananas, that was crazy. It's kind of crazy that we even worked out what was happening. And so as a manager I was thinking to myself:

C

Okay, what's going on here? Either I'm incredibly good at hiring, and being able to do this work is sort of innate, you're born with it or something, and I just happened to strike gold with the people on my team; or they were certainly pretty good and I'm an amazing manager. Both of those are completely unbelievable.

C

Certainly the latter one would have been existentially difficult to accept, because I would have no idea what I did to make that so. So I started to dig into what underpins people's ability to solve a problem, and not just solve a problem, but solve a problem under time pressure, where any of the actions you're taking could very well make things worse, right, and represent, in some definitely cautionary tales, an existential business situation.

C

And that's what led me to human factors. What I understood about human factors is this: most of us understand ergonomics. Ergonomics is quite often seen to be a sort of specialized subset of the field (or, if you're in Britain, you'd say that was the field and human factors is a subset), but the fact of the matter is that where technology, work, and people happen is this field. What I realized was:

C

Something happened in about the 70s, and a part of traditional human factors started to undergo a sort of, again, existential "wait a minute, maybe we actually don't understand this stuff." I think Three Mile Island was the point where the whole planet that was doing human factors work was like, holy crap, no, actually, you can't design an operations room without taking into account the cognitive work, not just plain old "can you see the dials" and all that sort of stuff.

C

And cognitive systems engineering sort of was born. I wouldn't call it a splinter, but certainly it's a field in and of itself. Don Norman, Dave Woods: these are folks that almost entirely came from research in nuclear power plants, but then went on from there. And to this day, resilience engineering is a pretty broad field. There are sociologists, there are operations researchers, there are decision scientists, there's lots of people.

C

A core part at this juncture is cognitive systems engineering. It's not all of what represents resilience engineering, but certainly a core part of it, much like statistics is a bit of a part of computer science or mathematics. So these things sort of interrelate. That's a little bit of my background, of how I got there, and I'm still learning. So that's the gist.

C

The final thing I'll say is that it is much more rewarding, and the thing that I am excited about is that, much like continuous delivery, continuous deployment, all of the things that we associate with that, the things that enable it, the rationale for even thinking about it in that sense, there was nothing special about that 2008-2009 timeframe. All of those ingredients had been set up.

C

You could argue that Extreme Programming was pretty much the thing that tipped people down that road.

C

It's like one of those things when you look back, like, oh yeah, it seems so obvious in hindsight, and it was pretty straightforward: small and frequent changes, for this reasoning, and you need these to do this sort of work. But it is a perspective shift. My guess is that both of you were there to see this perspective shift, the light bulbs going on.

B

The perspective shift is not evenly distributed.

C

You're absolutely right, you're absolutely right. And so, yeah.

A

So when you talk about resilience engineering and cognitive systems engineering, can we talk a little bit about how you applied that, maybe not at Yahoo but afterwards, and tease that out a little bit? Because the thing that actually sprung into my mind was how we tried almost to automate that in software with things like chaos engineering, Chaos Monkey, and things like that, which doesn't take into consideration the human factor at all. Well, it tries to simulate it, but there's no human.

A

It's like running test after test after test, and doing this to your website and stuff. Can you tease that out a little bit more?

C

Yeah, so actually, I'll comment a little bit on what you just mentioned with respect to chaos engineering, as kind of an example of what the application could look like.

C

So myself, Nora Jones, Casey Rosenthal, and others (as a matter of fact, there's a new book from O'Reilly on chaos engineering) have pointed out, actually, that certainly one perspective is the one that you described.

C

Another perspective is that the creation of a chaos experiment, the process and practice, the dialogue it generates about where, how, and when an experiment ought to be performed, can be as valuable as, and sometimes even more valuable than, actually running the experiment, in which case this is a capture of cognitive work. And so what I would say is...

C

As a matter of fact, because I was just reading this, let me just read this here.

C

This is an interview where Nora Jones also states that the before and after of running a chaos experiment is as important as running the experiment itself. And so, how does the application of cognitive systems engineering look?

C

Well, I'd say the first real application was in my master's thesis, which was just to understand what rules of thumb or heuristics engineers use when trying to resolve and understand and respond to outages, especially when signals, as we know, can be disparate, sometimes contradicting, sometimes not making much sense, and you're faced with an entire sea, an almost infinite number, of places to look. You have to start looking somewhere. What leads you to look in some places rather than other places?

C

This is the study of cognitive work. My thesis, which you're welcome to download in case you're having difficulty sleeping, will go into detail there. If I were to give you a couple of threads to pull on, if you were to look into methods, techniques, approaches, it's an entire family of things that make up what's known as cognitive task analysis.

C

Cognitive task analysis is more or less the formalized method, with the related cognitive work analysis (CTA and CWA), and all of the tips and tricks that go into that are the application of cognitive systems engineering. You can think of those as the tools to understand how people understand and how people wrestle, both cooperatively in teams and also individually, with problems that they're facing, problems that they're anticipating, and what those problems, in anticipation or in responding to them, mean.

C

And what comes out of that is this closer look. What we always like to say is: look, the expertise is coming from inside the house. There's much more to understand about how people do their work than is represented in JIRA.

B

I would add there's a tendency in all of these practices, especially when you're kind of outside of the core conversations, to focus on the tools, because you see them as a concrete representation of what's happening. But in my mental model, and in the conversations I've had with some of the people you just mentioned, I feel like the core chaos engineering community and the stuff we're talking about with cognitive engineering and resilience engineering are essentially inseparable in my head.

C

Absolutely, and what's exciting about chaos engineering is that a lot of the proponents, even the earliest proponents, of chaos engineering are seeing this connection, and they're seeing it in ways that are, for me, really satisfying. They're making new connections between resilience engineering and chaos engineering that I wouldn't have even seen, and so that's really satisfying. Super happy about that.

B

Someone just dropped something in the chat that reminded me of some of the stuff I've seen you talk about before that might be fun to articulate here, which is this notion of the lines of our models, and how the process of analyzing incidents helps us build clearer models.

C

Yeah, this notion of the line of representation. It's a bit of a mind-blower, right? This is entirely from the work done in the SNAFUcatchers consortium, and it's described in a lot more detail there. I'm not going to be as eloquent here, but the STELLA report describes this sort of frame, and the frame goes like this. We have all of the stuff, the technical stuff. We've got the databases, we've got the thing that we build.

C

Here's the thing where users generate revenue. Here's the stuff that we build and maintain to help us build that thing, right? And all of the things that sort of intertwine with that, including dependencies. So we've got all this stuff that sort of fits together: databases and code repositories and networks and firewalls and all of this stuff.

C

That's that stuff. We manipulate that stuff, we do things with that stuff, via a representation of that stuff. It's not with the stuff, right? When you go to make a schema change, you don't go to the data center and do a thing physically to the database, right? And so everything we know about that world is via these representations. They're not the things; they're representations of those things. A distributed tracing app is a representation. To the extent that it's useful, it is a representation. It's not the thing.

C

It's not the thing that you hope you can look at. So what that means is that people's ability to not only make changes but also anticipate what the system might do in the future, based on where it came from, comes from nowhere except their mental model.

B

The work we do is both facilitated and limited by these mental models that we've built up about what we're working on.

C

Yes, exactly. And in addition to that, what close study of incidents shows is that no one has the identical mental model of the same below-the-line stuff as others. They may have some that are close. They may have more detail in some areas than others.

C

And what's happening is that teams are continually recalibrating these mental models through discussions with others, through looking at dashboards, looking through code, writing new code, seeing how that behaves. It's this constant recalibration. So we have overlapping mental models, and what's surprising is that they're never complete, they're always faulty in some way, and stuff works almost all the time despite that. And the reason why it does is because only people can adapt and recalibrate that mental model.

C

It's not the stuff below the line. There's no intelligence down there other than what has come from us, and it's not the below-the-line stuff that's doing it, right? It's our ability to make sense of what's happening, what's happened in the past, what's happening right now, what makes that matter, and what might matter in the future that's important to pay attention to. And so that's the notion of above the line and below the line.

A

So, that, and to go back to something very early in this conversation, the blame game: I come from a perspective of open source community development and trying to, you know, shed sunlight. When there is an incident, one team has their mental model of how things are working. One of the things that I try really hard to do, and it's very hard to get people to do, is to share their model.

A

It's almost a cultural shift, because often it's something that went wrong with a product or a service, or something like that. Flickr went down, or somebody went down, and they're very reluctant to have an open dialogue with the user community about what went wrong, because then maybe the users will shift to another service provider, or something like that. So I'm just wondering, maybe from both of your perspectives, how you help companies and organizations understand putting some sunlight on their mental model, or their models, exposing them, sharing them with people, and how to do that effectively and allow other opinions. Because, going back to where the innovation happens, those aha moments of "oh yeah" often come from an outside perspective.

B

Oh, I want to add, before John goes back to answering the question for real, that this occurs at several layers and levels as well.

B

So internally, people talk about job security, where people will protect the mental model that they've constructed and not share it internally, and that also happens between teams, between departments, and then, as you mentioned, externally. But at the same time, I feel the Velocity conference and the community around DevOpsDays and the rest of that, over the last decade or so, has essentially started making public post-mortems or incident analysis into an art form.

B

So with that, I'll let John make his comments. But this isn't just between the organization and the outside. We protect our mental models at every scale.

C

Yes, and to your observation, Diane, that there can sometimes be reluctance to (I wouldn't say sharing mental models, because I can make a point about that) really even just relating any sort of information about what was happening...

A

For them, doing a public post-mortem on something that's a very big public service outage or something like that.

C

Sure.

A

There's reluctance from engineering teams to do that.

C

I'm sure. Well, I mean, if they believed that they would get something from it, they would do it. And this is internal and, just like Andrew said, external. There are some peculiarities about write-ups about incidents for the public, but remember, the purpose of those, the audience for those, is very different than for an internal one, right? And it's a mistake to treat the two as being similar. It's different. The point that you've brought up is reluctance.

C

There's a reason why people are reluctant, right? If they think there's something positive they can get, and they feel supported in giving a story, then great. If there's something that is potentially threatening for them or others, then they won't, right? And the somewhat, potentially, nitpicky point on mental models is that I can't ask you for your mental model.

C

You can't give it to me. You can tell me a story. From a cognitive technique standpoint, when you ask somebody something about how they rationalize, you'll find something called reflexivity: you will give the answer that you think the asker, the requester, wants.

C

You have to build a constellation of data that supports this mental model, calibration recalibration and you have to and that's about a mixture of records, of what people do, what people say and what people do and say about what they do and said, including others. This is called process tracing, but it's the way that you can do way. You can make ever and inferences about cognitive processes, sorry to get really nerdy there for a second, but this is the reason why you know this is this is what makes doing this work.

C

Difficult. People won't share a thing that they think everybody knows, or that they aren't even aware of themselves. A famous researcher in the late 60s said it best about tacit knowledge: we can know more than we can tell. And a significant part of studying cognitive work is exploring tacit knowledge, and there are some ways that you simply cannot do it, and you have to learn and practice how to do that.

C

Otherwise, the results aren't valid. And there's only one thing worse than a really poorly captured incident write-up, and that is an incident write-up that everyone, despite its contents, finds to be non-credible, because the authors and the methods by which it was formed are seen to have an agenda. Effective incident analysis requires an analyst to be a non-stakeholder. Full stop, period. There is no other alternative.

C

You need to have no stake, no dog in that fight, no horse in that race, about what the analysis does, other than provide others a boundary.

B

C

A source of dialogue.

B

Isn't that exceedingly difficult, to have no agenda?

C

This is why Adaptive Capacity Labs is an expensive professional service.

C

Maybe it is. If it was easy, we'd already be doing it. You know, what's the difference between.

C

I'm gonna be super blunt here. In the world of human factors, it's been said, many colleagues of mine have said this: the origins of cognitive systems engineering, human factors, all the stuff that we're talking about, definitely cognitive task analysis.

C

The reason why a lot of this comes out of research in the military, DoD- and DoE-funded projects in the US and in other parts of the world, is because of consequences and time pressure. And it's, you know, jokingly said that you're doing this work either because somebody who was supposed to get killed didn't, or somebody got killed who shouldn't have. And consequence and time pressure wipe away anything else that is immaterial. That's what makes incidents the Trojan horse.

C

We think, you know, it's a myth. Using these techniques, looking into incident analysis, the focus isn't necessarily to find what broke. It's not some sort of socialized debugging. It is to find out how stuff works at all. The incident is just a director of attention. The incident is just, you know, the filter. And you can think of an incident as your system saying: hey, everybody.

C

It's incredibly efficient in that way.

A

It's the opportunity and the opportunity exactly.

C

Exactly. I would say

C

How to do it they're only gonna get so much out of spending an hour in a conference room filling out a template. Sorry.

A

Well, I was gonna say, the other thing is that a good incident report or good post-mortem doesn't necessarily tell you what caused the incident. It just gives other people information that they can help you sift through, and maybe sparks a conversation that gets you to that opportunity.

C

Yes, and in order to do that, it needs to be compelling for the broadest audience in the deepest ways possible.

C

Engineers, and this is something we know about software engineers, don't read anything they don't think they need to read. And when they think they need to read something, when they have an expectation they're gonna get something out of it, you're damn right they're gonna read it. And so doing that: capturing what makes incidents hard, capturing

C

What makes, you know, red herrings and wild goose chases happen, because following those has worked in the past, right? But you very rarely see the details of red herrings, and what made red herrings so attractive to follow, in incident write-ups. Very, very rarely do you see that. That's an example of the messy details that are really important.

A

The other outcome of doing post-mortems and incident reports is also building trust. When you share that information, you're building trust with the other folks across silos internally, or your end user community, that you're sharing this information as opposed to withholding it and not exposing, you know, the things that might have led up to it. So I think, you know, the hardest thing is to do it well.

C

You're right about trust, but that trust is proportional to the quality and to what others find of interest in the report, right? Which is why I'd say a very strong signal, not the signal, but a very strong signal, is how many people read it. I'm gonna go out on a limb: I know that counting and doing statistics on how often somebody has visited a web page,

C

That's a solved problem. I know of a company who's built their entire business on that.

C

But yet, you know, the way to break trust is to make available all your incident write-ups that are terrible.

A

And it's a skill to do good ones. It's a skill, and it shouldn't be done lightly.

B

So I think in a lot of organizations it's a mandated, perfunctory action, and that's again the problem that John's trying to expose. But we're kind of coming towards the top of the hour, and given the fact that not everyone has John Allspaw on retainer at this point, what kind of practical advice would you give to someone listening about where to start, where to explore?

B

You know, what can they do that would make some maybe meaningful changes to their own mental model, not just about their systems, but about this type of work?

C

Yeah, that's great. For the record, if everyone was interested in having me on retainer, certainly please reach out. So that's a great question. There are two things that I would suggest. The first is to understand that there's a growing community. It's not just about Adaptive Capacity Labs, right?

C

There's a website called Learning from Incidents. You will see reflected in a lot of blog posts more and more people talking about these topics. Happy to tweet much more. I would say that the Learning from Incidents page and, in particular, Lorin Hochstein on GitHub has written an absolutely stunning set of resources about resilience engineering and understanding cognitive work that you can look at. Pragmatically, practically, a couple of suggestions. The first is to make an effort to capture, from as many people as you can,

C

What was difficult. Ask them, and put a new section in your post-mortem template, or wherever you want, and get people to write what was hard, what was surprising, what was difficult. The more people, the better, and not what they thought the team thought was difficult, not in an abstract way: individual perspectives, individual perceptions. What was hard, what was difficult, the more that you can. And lots of things are difficult. Sometimes even understanding

C

That the thing that you're seeing is bad can be difficult. So, gathering those sorts of reflections, right? Every engineer has this feeling. When we've talked with organizations, we've asked: have you ever, you know, you're responding to an incident, about to run a command, and everybody thinks you should do this, right? Your colleagues are like, you should do this, this looks like our best shot. Okay, all right, I'm going to go do it.

C

Have you ever had that feeling that, right before you hit enter, there's an equal chance that this might make things worse?

A

C

A palpable, extremely important experience that almost never finds its way into these narratives. Capturing what makes work hard, what makes work harrowing, you know, absolutely astonishing. You know, there are surprises that are absolutely fundamental, right? There's this notion of a situational surprise: that's when you buy a lottery ticket and you win the lottery, right? And then there's fundamental surprise, and that's when you don't buy a lottery ticket and you win the lottery. Okay? Fundamental surprises are what make Chernobyl.

C

They make the BATS IPO, they make a Knight Capital, they make a Three Mile Island.

C

They make accidentally sending a ballistic missile alarm to the entire state of Hawaii. And so capture that. That's my pragmatic advice: capture that stuff, put it down. People will read it because they've been in that situation.

C

What do you think about that?

B

I think I could sit and talk to you all day.

A

Well, we'll definitely have you back. I think there's a piece that I'd also like to tease out, because again, Andrew's maybe focused on organizational change and transformation and DevOps, and I'm trying to figure out how to apply this to some of the open source communities that we're helping support. Because doing this in open, transparent processes, as opposed to maybe in an enterprise process, is very, very important. I'm sitting inside of, you know, good old Red Hat, and this stuff happens all the time. And I mean, we do have a great engineering team, and they have read all the books, and they actually apply a lot of this stuff.

A

So it's great, it's been wonderful. But then, when we take it and we have to do it in the open, and when I talk about sharing, you know, how do we do this in an open, positive way, and learn the practices in open source communities? Now that I've read the books, now that I've heard you speak and I've heard Andrew speak, everybody is trying to figure out how to take this to the open source community work that we're doing.

C

A

C

Yeah, that sounds like an amazing, an excellent challenge. Excellent topic.

A

Cool. And we are at the top of the hour, and we're gonna hit the button soon and end this conversation, and that's gonna make, you know, a fundamental issue for all of us, because we'd love to

B

Doesn't have to end the conversation. You can reach out to us on LinkedIn or Twitter, or what have you. Yeah.

A

B

Understand it for the day.

A

Yeah, and I'll try and find many of the references that the both of you spoke of and add them to a resources page for this conversation when we post it up. We'll definitely have you back again, and boy, lots of things to think about now over the weekend and ongoing. So thank you very much for joining us today.

C

I'm very happy to talk with you, Diane, and I always love talking with you. Later.