From YouTube: Cloud Native Social Hour - April 26, 2019
Description
Join @apinick & friends at 2:00pm PST as we talk about what it means to introduce chaos to your environment and how you can use it to improve your performance.
A: I know it's sad. That being said, but yeah, the Seattle office is going away very shortly, and so this is the last time we're doing it here, and I know it's a nice space. We've got a pile of junk over here that we never really ended up using, but I think that'll be joining us in Bellevue somewhere, stuffed in a corner, and yeah, we'll be doing it out of various conference rooms in Bellevue from now on.
A: We're going to be moving around then, but also I'll be going on vacation sometime around when we're moving, and so logically this is the last time. Yeah, too bad. It'll be good to kind of have a new home when we get settled in there. All right.
A: So this time on the Cloud Native Social Hour, we are talking about chaos. As I said in an internal email, we're eschewing the gaudy trappings of stability and introducing a world of randomness to our infrastructure.
A: I actually haven't had too much of a chance to look into a new project or anything that really intrigued me. Actually, the last time we kind of went over this segment of things that interested us, it was around chaos. All right, what's up, Mark? It was around chaos, and actually, Duffy, you're going to be doing a demo on that very topic. Oh yeah.
C: Hot topics for me lately: I've been having a bunch of weird network timeouts. So of course, you know, networking in cloud native is always good, but also we're talking about storage lately, and persistence.
C: And then also there's a thing coming up through the CNCF; they're doing some sort of webinar about persistent services. So that's me kind of getting into that a little bit. Oh, I know there's Rook out there, which is, I mean, it's on top of Ceph, but I was wondering if there was anything else out there.
A: That is actually good; I have a good answer to that. I've actually never gotten Rook to work. I know that the team has a lot of really smart people and I think that it's a great tool, though personally I've never gotten it to work. So one that I have turned to if I need something like Ceph is OpenEBS. I keep talking about it, and I'll continue to keep talking about it: OpenEBS is a really cool storage tool.
A: Basically, the way it works is it uses the ephemeral storage of each pod to create a storage cluster, and what happens is, when a pod dies, a new one comes up and the storage cluster migrates the data and makes sure your volumes stay consistent. So you have this kind of cloud native storage solution that can grow and expand with the number of pods in your cluster. That being said, it also takes compute and other resources that your workloads might be competing with, but I think it's a really handy tool, particularly for prototyping.
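[Editor's note: for context, here is a minimal sketch of what consuming OpenEBS looks like from an application's side, a PersistentVolumeClaim against an OpenEBS storage class. The class name and size are assumptions and depend on how OpenEBS was installed.]

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  storageClassName: openebs-jiva-default   # assumed OpenEBS class; verify in your install
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```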
D: There are others out there. I'll second the vote for OpenEBS, and for Ceph; they're both pretty decent. Really good if you're looking to pay somebody for storage, which is probably not a terrible idea, because it's storage, so it's obviously going to be pretty darn important to you in the long run, or in the short run. Either way, there are some other ones that I've seen customers have pretty reasonable success with, things like Portworx.
D: You know, yes, actually, so another one: they probably talked about it a little bit earlier this week, but there was a blog post on the Kubernetes blog that we found.
D: I'm super impressed by this idea. I haven't played with it yet; it's definitely on my list of things to do, but I love the idea of it, because effectively what kube-iptables-tailer tries to do is: if you get a log entry or log line for failed packets at the underlying node, they've written into this daemon set a mechanism that would allow you to see the event of a dropped packet for any given pod.
D: So if you're seeing packets being dropped, maybe on purpose because of network policy, or maybe you're seeing packets get dropped because of, you know, any number of different things, this seems like a pretty interesting way of exposing useful information up as the event stream. So I was kind of impressed by that, and they're going to have it exported too. I was really kind of impressed not only by this idea, but also just by the design pattern.
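[Editor's note: since kube-iptables-tailer surfaces drops as Kubernetes events on the affected pod, you would read them with ordinary event tooling. A sketch, with a hypothetical pod name:]

```sh
# Packet-drop events raised by kube-iptables-tailer show up alongside the
# pod's other events; "my-pod" is a placeholder name.
kubectl describe pod my-pod
kubectl get events --field-selector involvedObject.name=my-pod
```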
B: If you look at that same page, there's another blog post that was written about what Box is trying to do for the control plane. That's a pretty nice function; they were looking at doing things like reporting errors from the control plane to the applications using it. It's kind of up there, yeah, this one.
B: ...if your application is using this much CPU and this much RAM, then put a dollar value on it, so that it can be charged back to whichever of your lines of business is deploying that particular application. So that's another avenue, exploring this showback. I also came across a small company doing Kubecost; Kubecost is another way of doing showback in Kubernetes. And if you look at the in-house solutions, we have something, what is that called...
A: And I'm glad to see that there's building action in the community on that, because this has been an open question. When I said this is a classic question, it's something that really hasn't been solved yet. A lot of people handle it through Prometheus or Grafana metrics: you know, you can take the Prometheus metrics and multiply them by some amount, and that's the amount attributed to the resource.
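[Editor's note: the "multiply the metrics by some amount" approach might look like the following PromQL; a sketch only. The metric comes from cAdvisor via Prometheus, and the $0.05-per-core-hour rate is made up.]

```promql
# Rough per-namespace CPU cost: average cores consumed over the last hour,
# multiplied by an assumed dollar rate per core-hour.
sum by (namespace) (rate(container_cpu_usage_seconds_total[1h])) * 0.05
```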
A: So there are issues with that, because, remember, we were talking about this: Prometheus loses this data. If your step of data is big enough, it will just drop data, and it's not built for that purpose; it's more about floating-point time series. So I'm glad to see there are some other companies that are working on this. And Thanos, speaking of, Thanos is another player in this arena, right? It's another offering for Prometheus, it's like long-term Prometheus, yes.
B: Yeah, it also does the showback aspect of it, which is great, and that's also another solution. We were writing some custom scripts to pull data from long-term storage, like, if Thanos actually works, yeah, we were looking to use Thanos and then use custom scripts to pull out data. That would also be another solution to look at. But honestly, Operator Metering is in alpha, yeah.
A: There was a song last week called "Drunk Drivers/Killer Whales." The killer whales, it doesn't have much to do with the song; they just, like, scream it at one point. It's by a band called Car Seat Headrest, and it's just this really fun, kind of cool song. It is fun, I find it fun, but it's actually kind of about a person who is going through what sounds like a breakup, and they're convincing themselves not to drive drunk, and they're like, you know, screwed up and terrible.
D: Yeah, listen, Justin, I like blues stuff lately, but I wouldn't say I have any particular ones that jumped out at me. Yeah, I've been on kind of a Ray LaMontagne kick, which is nice; he's got some really good, really good stuff. I remember when I first heard Ray LaMontagne sing, I had a very different, I was trying to picture this dude in my mind, just by listening to his voice.
D: We try, we have, we try to categorize the things that we see here in terms that are reasonable for us, right? And I know they might be totally off. It's always kind of amazing when you think that you have something figured out and you're like, no, I wasn't even close.
A: I think that's a good point, Duffy, and it's going to drive us directly into the topic of the discussion, which is chaos engineering. A part of chaos, and a very important part of chaos engineering, is data collection and monitoring, and I think that's something we're going to talk about in just a second. If you have the items available, can you add them to the Zoom chat, so they can go into the part of the HackMD called "Twitter handles for gratuitous follow requests."
A: Right, so this week we're talking about chaos and chaos engineering. I first became familiar with chaos engineering when I heard about the way that Netflix handles their infrastructure. They have this kind of interesting idea, and I originally heard about this when I was working mostly with VMs, back when I was at Red Hat doing...
A: ...platforms and RHEV-M, the Red Hat Enterprise Virtualization Manager. One of the things that Netflix was telling people was the idea that no VM should last longer than a day, and they handled that by inducing chaos. They have a mechanism that would periodically just go and destroy random VMs and see what happens to the applications. Ever since then, that's been something that's been kind of on my mind to check out, and now, recently, I'm really starting to dive into it.
A: Most of us are familiar, but it's worth going over it again. With chaos engineering, the idea is that you introduce logical or predictable chaos, instability, into your infrastructure and into your applications so that you can see what happens. Basically, the idea is: how do you measure the stability of your application or your infrastructure if it isn't tested at all, right? So if you have infrastructure that never goes down, what happens when it goes down? Are you able to recover? What's the mean time to recovery?
A: All of these things are important to test out, and if you don't do it, if you're precious with your infrastructure or your applications, something catastrophic will happen, because we live in a wild world (by the way, that was a cat), and you will get the rug pulled out from underneath you. So this is something that I think is very important, and we don't see it a lot in practice.
A: I totally agree. There's something called day-two operations, and it seems to me a lot of people never get past day one, day one being setting up your infrastructure, setting up your clusters, deploying your applications. It seems like people don't really get past that; they never get into day two, day two being monitoring, logging, and so on. And it's something I want people to be braver about: jump into instability, have it be part of your day-one or day-zero operations.
D: I agree with most of what you said. I think another way that I kind of internalize what chaos engineering is trying to do is that all of us are in the field of computer science, and we should be applying science to the work that we're doing. We shouldn't just throw it out there and see what happens, see what sticks.
D: That's why I feel like the chaos engineering piece really kind of comes to the fore, because you might develop a theory, a pretty obvious theory, that if you have two instances and one of them goes away, you're going to lose half your traffic. And when you get into the science part of that, it's probably true that you're going to lose about half of the availability of that service, but for how long, and which servers?
D: And what does it look like empirically when the deployment spins up that new instance? Understanding the characteristics of these applications, especially at scale, is critical to your success in this containerized world. And that's where, I think, the chaos engineering really comes into play. One of the things that I'll point out, though, it's interesting: the engineering part of chaos...
D: ...engineering, I think, is also pretty important, right? Anybody can write something that will introduce chaos, or trouble, into a system, whether the thing you've introduced is that you're just going to randomly kill pods, or that you're going to increase latency, whatever those things are. That's all pretty easy to actually generate, to, you know, insert errors into a system.
A: Otherwise it's kind of silly, right? How do you know that the test occurred, or whether it was something that's just kind of outrageous, right? It's like, well, what happens if, you know, suddenly, how does our application handle it if we introduced a trinary counting system instead of a binary counting system? Well...
A: ...a counting system, yeah. If there's no reason for that piece of chaos to occur, or there's no way to know the chaos occurred, it's kind of useless, actually. It's just kind of acting like a child with a magnifying glass at an anthill, just killing things at random.
A: Everyone knows, and they're tracking everything: rebooting hosts, isolating networks, introducing latency, etc., and then they judge how those specific services react, and that team can then take that back and fix that stuff, right? Their first posts on it on their blog were around 2013, so they've been doing this for a while. It is a pretty cool process in that, like Duffy was saying, it's not random; it's a process that they do every Friday with specific services and teams.
A: Most teams introduce chaos by trying to deploy to production on Friday, and that's how they introduce it. But no, I like the idea that instead of deploying anything on Friday, they're going to try and break something. That's a lot of fun, I think, yeah. But it's not just breaking random things; a specific team's service is targeted, and they're running, you know, tests that are published in advance, kind of thing, right? Like, we're going to do these things to your service.
B: I think Rich Lander had done it where, if a pod doesn't get killed in that first time frame where there was a 25% chance, the next time there would be like a 50% chance that you would kill a pod. So they would introduce chaos into the chaos itself. Oh, that's interesting.
A: Yeah, I dig that, so you never know exactly when it's going to happen; that's pretty cool. Something going along the lines of what we're talking about is the need to measure the effects. Chaos for chaos's sake, as we said, is kind of useless; having some control around your chaos, like knowing the time frame in which it's going to occur, that's valuable.
A: It depends on the philosophy you're going to define: do you want to live in an unstable environment all the time, but make sure that these things are measured, or do you want to do it only during a set time frame, so you can get all that data very quickly? I personally kind of enjoy the idea of living in an unstable environment. At the same time, I also apparently don't like having free time, or, you know, a weekend or anything like that.
A: So I like the idea that if you are constantly living in that environment, the application that you're developing, if you're doing iterative processes properly, should be more and more bulletproof as you go on. I'm kind of adding to the agile methodology a little bit: if it doesn't survive this test this time, how do you get it to the point where it survives?
A: A lot of people use monitoring tools, like Grafana, or Datadog, or Prometheus under Grafana, or any of these other monitoring tools, and keep a track record: okay, during this time frame, when we knew that chaos would be happening, what happened to our application? Did we lose X amount of traffic?
A: Did we lose X amount of product served, those types of things? This is something I actually don't have a lot of experience in, I'll be honest. I'm only recently getting into chaos engineering, and also more into monitoring development. I've only really started playing around with Grafana. I love Grafana, I actually very much enjoy it, but I don't know the mechanism well enough to create, like, a good dashboard, and I'm kind of curious.
C: Errors, latency, throughput; there's one more I'm not remembering. But, like, you know, if you're monitoring for those things, the impact of the chaos will become pretty evident: errors start to go up, okay, so when we did this thing, you saw spikes, or, like, my 95th-percentile latencies have gone from, you know, 700 milliseconds to, like, five whole seconds.
A: There's the idea of, like, errors, these things you need to track, these kind of well-known paths: errors, latency, pod health, or just application health, what processes are running. These are the things that are kind of key triggers. But for each application you can get more specific around your chaos engineering, like: these are the metrics this application cares about, right? Did you introduce chaos into your database? How many rows changed over what time frame, right?
A: That sort of thing, so we're going to add more complexity to the monitoring. One of our colleagues, Bryan Liles, recently tweeted about how many people think monitoring, or observability, is just a product to buy, and how many people think that it's a mathematical function, kind of, you know, bringing the math back into observability. I saw that and I was like, hey, shut up, I don't do that.
A: How dare you call me out so directly, Bryan, damn. You know, but it's interesting to me, because up to this point observability has mostly been tools, but I like the idea that it's actually the math. So you see, like, aggregate errors over time, or even compute an expectation over time, doing this rigorous mathematics around your system to see truly what's occurring. I find that endlessly fascinating, really, really cool.
D: It's come to mean a few different things, and it kind of depends on, I guess, the consumer of what that term might be, right? So from the perspective of somebody like Jessie, or, you know, somebody who's doing, sorry, site reliability engineering, that sort of thing, observability means being able to trace some percentage of transactions that move through the system in a way that gives you the capability of understanding how each of the components that are part of that system is operating, so that you can actually find a needle in a haystack.
D: You can actually find where things are broken. It isn't just metrics, it isn't just monitoring; it's more than that. It's, you know, being able to actually instrument your code in such a way that the places where you make calls to systems that are external to your own are measured, right, and in a reasonable way you understand: okay, well, I know that there's an error in the system, my availability is down.
D: There's a ton of stuff in this space right now, and it's all pretty interesting. There's some interesting work being done by the OpenTracing folks, and there's Zipkin and Jaeger, and all of these things kind of fit into this space, trying to wire all that up. But I would say, yeah, for me it's definitely a loaded term. I think it's definitely more than monitoring, it's more than metrics, it's more than logging.
C: It's a startup still, but Charity Majors, who's talked a lot about monitoring and observability and stuff like that, you know, she's the CTO, and it's really the...
C: So where it came from: she was at Parse, which got bought by Facebook, and then once she moved to Facebook, she saw that they had this tool, I think it was called Scuba. Basically, once they had instrumented their application, they were able to diagnose problems within minutes, versus with the limited tooling they had at Parse it took them hours or days. And so she wanted to take that to the larger...
C: You know, she wanted to kind of make it available to everybody, and so she and a team got together and wrote a specific data back-end to do high-cardinality searches and queries and stuff like that, which is really what they're building their whole message on: you know, we're here to let you search high-cardinality data.
D: I know that I have a bunch of 200-level responses, but I don't know if it was a 201 or a 202 or a 301 or a 204; there's a lot of detail that gets missed because it's missing there. And so the value of a high-cardinality database is that you can do things like say: you know what, I want to gather all of the information at great depth and be able to...
D: ...query a specific metric for that application, for this brief time, and get into that detail. And this is where I think, you know, I think Honeycomb is on the right track, actually. Another good friend of mine, Ben Hartshorne, is there, and he's been there since the beginning, and he came from Linden Lab, kind of solving that problem there as well. So yeah, they're on the right track. They're looking at the problem, in my opinion, correctly, but it's tough.
A: Have we defined what the goal of chaos and monitoring and all these things is? It's confidence. Confidence is what we're trying to test for, right? Do you have the confidence to do your job the way that you want to do it, right? Do you have the confidence to be able to recover from an outage, or to withstand an outage, right? All of these things give you confidence, and a lot of times, I think, when we deploy applications, we don't have that confidence.
A: I woke up this morning and two of my nodes had restarted, and they were two of the three of my control plane nodes. And so it was just like, crap, I didn't have any time to go back and figure out why they restarted. Once they restarted, I also hadn't set up the mechanism to rejoin them to the control plane.
A: What if two of those nodes were my etcd nodes, right, and then suddenly my etcd is degraded? You have no confidence in that cluster being able to continue to run, because I have no monitoring around these events that occurred; why they occurred, no idea. So what got introduced to me was pointless chaos. Yes, yeah, no function behind it, and now I'm just sitting in the lurch. If I were going to production in this environment, I would be tearing my hair out right now.
A: This sucks, this feeling sucks, and it's just some dummy cluster I set up earlier this week. Imagine if this was around my entire business; an awful, awful feeling, to have no confidence in this. And so these tools and this practice help drive confidence and resiliency. As was just being said, chaos engineering is resiliency engineering, right? It's: how do you stay up?
A: That's the other aspect. There's a final aspect to chaos engineering that doesn't get executed on a lot, but I think is the most crucial part: improvement. It's constant improvement. So chaos engineering gives you confidence; for the example of my cluster that just went down, what do I do now? What do I do now, now that I've seen that this failure occurred? I want to resolve it, or I want to develop automation, to make it so that it doesn't happen again.
A: So if these nodes go down, my cluster stays running the entire time. Or if you introduce chaos and there are these threshold spikes, do your applications survive those threshold spikes, right, this infinitely growing use of network, or infinitely growing use of memory? How do you kill the pods that you don't want? How do you mitigate your traffic such that it avoids that erroneous endpoint? Chaos engineering should lead to improvement of your infrastructure and your applications through these tests.
A: Right, if your hypothesis is, like, "I believe that my cluster is stable," and then I test it through chaos and find that it's unstable, or it doesn't lead to resiliency, that means that my test failed and I need to introduce steps to fix it. Really, that's the part that I really, really like. I love the iterative growth that chaos can quickly give you, right: these are outages that you control yourself, versus outages that the world controls for you, right?
A: Let's say you have an application that's running in US East, it's 2013, and US East went down for three hours because somebody fat-fingered something. That is chaos that got introduced for you, and have you resolved that, right? That instance doesn't happen that often, but if it were me, I'd go: okay, cool, now I see that this application, or all of my applications, need to run in many regions, and then how do I facilitate that through automation, through infrastructure, right?
C: There was actually a lot to look at, especially around chaos engineering, with regards to that particular S3 outage you mentioned. Mm-hmm. Because, you know, if you think about it, it was done actually during the day, as part of a normal operation. But, like you said, someone gave it an input that caused the system to kind of go down. And people were like, oh, you know, you should validate input, you should be...
C: ...looking at the system. You know, when you start to introduce this chaos, especially if you've given yourself the allowance to maybe introduce some of this lack of reliability, right, I'm not going to say error budgets, but that's kind of what I'm talking about. Because what they also learned is, when they tried to get the underlying systems that supported S3 back in line...
C: ...they had to work under that type of situation. So this is kind of something else that comes out of chaos engineering: you're doing it during the day, you're doing it when everyone's there, versus with actual chaos, whenever it's going to be, at two o'clock in the morning, when you can't find the person who's on call for the thing that's not responding. Yeah.
C: The way that you get better at incident response is practice, right? That's why anyone who, like myself, has dealt with incidents in the past doesn't worry, and why we're, like, sort of capable of jumping into an incident, jumping into, like, an emergency, and being able to have kind of a calm approach to it. It's like: I've been here before, versus the frantic, oh my god, the world is on fire, in the middle of the night.
D: To brush on two of them real quick. One is: it's true, regardless of what you're doing, that you're going to be good at what you practice, right? This is how we become good piano players, good bicycle riders, good speakers: we practice it. You're absolutely dead on target for that one. The other piece that I would like to highlight is that I have found myself in the position over the years where I'm trying to teach someone how to troubleshoot, and I've never really thought about it from this perspective.
A: This is a moment where I don't necessarily intend to be a gatekeeper, but I kind of will be, a little bit. I feel that you should be hard-pressed to describe yourself as an SRE if you don't practice any of these things. If you don't embrace chaos, if you don't use the things that you need to maintain site reliability, I feel you'd be hard-pressed to actually describe yourself as an SRE, regardless of what your job title says.
C: It's not our job to be gatekeepers; it's our job to be enablers, right? You know, we're taking two different perspectives here. There's the perspective of, okay, we've got an engineering team that wants to do this, and then we've got the other perspective of...
C: ...if we can compromise a little bit, we could have it done in two, you know, and stuff like that. And also, you know, kind of relaying: we do it in this way because it provides a more reliable system. The thing that came out of this blog post is great for a dev environment, but won't work in a real, you know, production system; kind of advocating for the, you know, different qualities of a production system, and making sure that that's understood by everybody involved.
A: I haven't gotten through all of it, but every part that I've read is solid; it's kind of a page-turner for being a technical book. It's actually pretty interesting to read; it's written in an interesting way, I should say, as a lot of technical books can be dry and boring. One thing you kind of touched on briefly is the idea of error budgeting. This is something I want to touch on as well before coming out of my gatekeeping corner.
A: If you have an SLA or SLO of, like, X number of nines, and you don't measure any of those nines, where did that come from? What does that define? And if you say this thing is five nines, really what you're saying is that you want 100% uptime, but you don't want to say it, because a lot of times that's not cool anymore.
C: Yeah, I mean, I think everyone understands that stuff can't be available 100% of the time, that, you know, stuff sometimes just doesn't work. That's the way life is, right? Mm-hmm. But I think what is subtly understated, and just not understood, is best condensed into the phrase: if you want to add another nine, add another dollar sign. Because the more uptime and reliability that you're trying to get, the more it's going to cost, because, you know, that means now...
C: ...you have to run an additional replica of your database in a different region or a different availability zone, you know, and it starts becoming more and more costly. But with regards to that, I wanted to break down real quick SLAs, SLOs, and SLIs. So the SLA, that's the service level agreement; people in product and legal and all these people, who are mostly not engineers, come up with that number. They kind of come up with, you know...
C: ...when do we have to start giving people money back for being unavailable? That's usually the way that I think of an SLA. Engineers typically should not be thinking of SLAs, because their number is the SLO, the service level objective, and those are what you define as, like, you know, give yourself some padding between the two. You know, if you're at, say, 99.9%...
C: ...you figure out how to get the project back into budget. This is like a budget that resets itself every month, yeah. And so those SLOs, now, those have to come from somewhere, and that's where your SLIs come in, your service level indicators. These are the metrics, or, you know, whatever you're going to use to determine the number that the SLOs are based off of. And that's, I think, you know, where the error budget is, right? It's the area between where your SLO is, like...
C: ...you know, our requests in the 95th percentile can't take longer than, like, two seconds to render a page in a web browser. You know, something like that is what defines your SLO, which, you know, should be supporting the SLA, which is like: if no one can log into the site for more than 45 minutes a month, then we have to start giving people money back. You know, yeah.
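[Editor's note: the arithmetic behind those numbers: allowed downtime is $(1 - \text{SLO}) \times \text{period}$. For a three-nines monthly target, $(1 - 0.999) \times 30 \times 24 \times 60 \approx 43.2$ minutes per month, which is roughly the "45 minutes a month" figure above; five nines over a year gives $(1 - 0.99999) \times 365 \times 24 \times 60 \approx 5.3$ minutes per year.]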
A
Which
is
I
thank
you
for
breaking
that
down,
because
that
was
something
I
actually
forgot
about
this
Oh
eyes,
and
you
know
that
those
are
the
metrics
that
you
care
about.
So
when,
in
the
process
of
like
deploying
these
things,
requests
Masai
people
masa,
it's
like,
oh,
what
are
the
metrics
we
care
about,
like
I,
can't
define
that
for
you.
A: Well, I don't know what metrics your application cares about, or how to define these things. And so it's good to have that understanding, like, the breakdown of an SLA, SLO, and SLI, so thanks for running through that. That was something that I forgot about, and this is why I don't call myself an SRE, because I would totally be a bad one.
A: The kube-monkey tool is a take on Netflix's Chaos Monkey. Like we were talking about at the beginning of the show, it's the idea Netflix introduced where they restart their VMs every so often, and that's controlled by a tool called Chaos Monkey; basically, it's a monkey throwing a wrench into the operation. So kube-monkey...
A: Basically, what it does is, every so often, depending on how often you set it for, it will destroy X number of pods, or all pods, for an application that has opted in this way, basically every hour or however long you've set it. The reason why I think this might not be a particularly valuable demo, or, like, it might not have all the information I wanted to show, is that you can only set it up in increments of an hour, and I couldn't have it start running right when I wanted. I finally got...
A: ...my system set up, because I've been a busy boy, right around the time that we started the Cloud Native Social Hour, and you can't set the time for kube-monkey to start operating at the same time that it starts. And so I had to start it and have it run at some point within this, like, hour timeframe. So I don't know; hopefully it'll have killed a pod by the time we get to my demo, we'll see. But there are some pretty graphs to show you.
A: Alright, let's see how this works. Cool, alright. So here on the screen we have this Grafana page. This is something I set up using kube-prometheus. This isn't a demo of Prometheus, although it's pretty awesome; if you have a chance, check out kube-prometheus, it's in the CoreOS repo, and this is just using the example manifests that are there. It provides you with Grafana and Grafana dashboards that are fairly useful, kind of giving you the building blocks to set up your own monitoring system.
A: What that was, was me setting up the kube-monkey labeling, and I'll get into that in just a second. So I created the deployment, and then I modified the deployment to enable testing on these pods, and so the old pods had to die, and then the other ones came up again. Monitoring is really awesome: you can see just the kind of flow of your infrastructure with this very simple tool, and I find it endlessly fascinating. Anyway.
A: All right, there we go. Can everyone see that, can you read that pretty well? You can see what I'm running here. So, jumping into the kube config: this is the ConfigMap for kube-monkey, and this is the way you define how it works, essentially. It takes a TOML-type config file, and it starts running the destructive tests based on the run hours.
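[Editor's note: a sketch of what that TOML config looks like, with field names following the kube-monkey README; the values here are illustrative.]

```toml
[kubemonkey]
dry_run = true                  # log intended kills without deleting anything
run_hour = 8                    # build the day's kill schedule at 8am
start_hour = 10                 # earliest hour a kill may happen
end_hour = 16                   # latest hour a kill may happen
time_zone = "America/Los_Angeles"
blacklisted_namespaces = ["kube-system"]   # never target these namespaces
```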
A: So, when you deploy kube-monkey, there's a repo for it; it's pretty easy to find, and I'll add the link to the HackMD. The image is under the same repo, and it is just a simple deployment; they have the manifests there. So you deploy it to kube-system, and it'll run and, you know, output this log here. And so that's...
A: So this is memory usage. You'd see that the memory usage for one pod went down and the other came up, and regrettably, that didn't occur. But that's kind of the nature of randomness and chaos: it is unpredictable, and it's only been running for a very short time as a testing mechanism, so it didn't quite pan out. But the reason why I want to show you this is that this is actually kind of the beginning of chaos engineering and chaos tooling.
A: That kind of didn't really kick off, but a good first step, if you want to start introducing chaos into your infrastructure, is a tool like kube-monkey, where, in a controlled fashion, between these hours, on these days, in these namespaces, you can say which pods you want to destroy, and how you want to destroy them, or, like, how many you want to destroy.
A: That's right there, there we go. So here we go: for every deployment, and for every pod, that you want to be targeted by kube-monkey, you need to add these labels: kube-monkey/enabled, and the identifier, here monkey-victim. The kill-mode is fixed; the kill mode can also be things like a percentage, or random, there are a bunch of these configurations. Fixed basically means it'll kill only the kill-value number of pods, so it'll only kill one pod. And then mtbf is the mean time before failure, or between failures.
A: So that's a value in number of days that it'll wait before trying again to destroy pods in this application. You need to add them here in the labels for the deployment, and then down here in the template labels for the pod as well: kube-monkey/enabled, identifier as monkey-victim. The identifier can be any string, by the way; it just needs to be unique. And then, however you want it to be destroyed. Oops, and then my screen, what happened, oh no, Zoom suddenly...
A: As for my demo, it was just a very simple tool to use, but it's a very good first step to get into chaos engineering in your environment: very simple to use, very easy to control. But as we saw, there's an element of chaos involved in the use of it, and I wasn't able to demonstrate anything particularly...
A: I wasn't able to get that set up with kube-monkey, but what I wanted to show with the destruction, what you would have seen if it had shown off, is that the application itself, the website, should have been resilient through the destruction that I enabled; I just didn't get there.
B: Have you read about, you know, those iptables rules, and destroying those iptables rules? There is a value inside the manifest files, a default of about 30 seconds, before which a pod can be rescheduled, and not everybody, well, everybody just takes it as a default, right? You just don't go around messing with these manifest files. These chaos testing tools will tell you exactly what might happen to your production systems. If you kill kube-proxy, you will immediately know that for a pod coming up...
B: ...it's going to take like 30 seconds, the default, before you can have it up and running. So those are some of the things that you would uncover when you do this sort of chaos testing. There's a pretty nice blog post about Kubernetes chaos engineering lessons learned; I can probably link it. It essentially talks about the documentation around the various defaults, like, I think before 1.9 it's iptables, and now we have IPVS; what values can be changed, or can be defined, and what is your tolerance level, right?
B: The way it does rewrite stuff, the rules are refreshed in between 10 to 30 seconds. Sometimes the rules refresh in like 27 seconds, sometimes 25, whatever. But what is your tolerance level for that refresh rate? It will happen, it will refresh, but it's about your tolerance level. So you can go ahead and define: okay, I just don't want to wait 27 seconds; I want it to be refreshed at a much faster rate.
A: This is a perfect example of what not to do, because my monitoring had nothing to do with the chaos that I was involved with. I didn't show off resiliency in any capacity; it was just showing the effect, what hopefully would have happened. It was showing the effect of the chaos, but not the effect on the application, and so it was essentially useless. Well...
D: It can be deployed with Helm. Before I get too far into how to deploy it, I would like to share that in the HackMD I put a link to this repository, which is on GitHub, under github.com, slash my name, mauilion, under kind-chaoskube. So if you want to, you know, follow along at home, the documentation to do that should be here.
D: Okay, chaoskube: this is the repository, their GitHub link, linki/chaoskube, and inside of this they've got some pretty reasonable documentation about how to get started. They have a Helm chart that they maintain. They provide a number of interesting filters: you can filter by namespace, by label set, by annotation, by age. And you can have things like exclusions, right, so certain days, like, don't do it on Saturday or Sunday, just reserve your chaos for business hours, please. You can avoid times of day and days of the week here.
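[Editor's note: those filters translate into flags on the chaoskube binary. A sketch drawn from the linki/chaoskube README; exact flag names and defaults may differ by version.]

```sh
# attempt a kill every 10 minutes, business days and hours only,
# only opted-in pods, and leave pods younger than an hour alone
chaoskube \
  --interval=10m \
  --namespaces='default,!kube-system' \
  --labels='app=kuard' \
  --annotations='chaos=true' \
  --excluded-weekdays='Sat,Sun' \
  --excluded-times-of-day='18:00-08:00' \
  --minimum-age=1h \
  --no-dry-run
```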
D: It's got a pretty reasonable implementation. So that's their website. Before I get into the demo part of this, I want to show this page, which I thought was also pretty good, and it kind of speaks to what we were talking about, and I'm going to put this in the HackMD.
D: ...and it can actually also help you kind of instrument this thing in such a way that you can understand whether the application is responding in the way that you might expect. This blog post kind of gets into a lot of the topics that we talked about, which is, you know, like, what is chaos engineering, and how do you go about launching this. And the neat thing about this post is that it gets into, like, you know, the experiment idea behind it, right?
D: We're going to create an experiment, we're going to deploy our application, we're going to simulate steady state, specifically the load on the host, and then we're going to introduce chaos; in this case I think they actually ended up using one of the chaos tools we were showing earlier. And then they go back to Loadmill, and because it's actually the thing measuring the steady state for the application, it can provide this really pretty graph.
D: ...that shows what happened when interruptions happened to the application, which I thought was actually a nice little summary of chaos engineering the way that we've been talking about it in this episode: kind of, like, you know, generate a theory, test it, and if it didn't work out, make sure that your inputs make sense, dig into why it didn't work out, or why it did work, etc., etc. Good stuff. Now, a little bit about my setup: I am running kind.
D: The reason I'm running MetalLB in this case is because I needed a load balancer that would be in front of my service, right? I needed some mechanism like that so that I could show kind of the resiliency or failure model here. If I just went directly to the application, I would only be showing access to that one pod.
D: So in the repository I have a directory for MetalLB and how I applied it; that's all in here. And then also inside of this directory I have a directory called kuard, where I'm actually using kuard, the application from Kubernetes: Up and Running, with a health check and with a readiness check and all that stuff, kind of exposing it via the load balancer, and I'm annotating the pods.
D: ...with chaos equals true. And so in this case, this is actually how I configure chaoskube to kill these things, right? So I've deployed chaoskube and set it up in a way that it can find all pods that are annotated with chaos=true, and it will kill them. Now, side note: I actually confused the heck out of myself for a little while, as I imagine everybody does at some point, because I was placing this annotation at the deployment level.
D: ...which obviously doesn't do anything I wanted it to do, because, from the perspective of chaoskube, if it's not on a pod, it doesn't exist. Yeah, so I had to actually apply this down at the template level, and that took me a second to figure out. You know, it's a good flexible world we live in. So that's how that was.
D: The deployment kuard is actually the one that I'm running the chaos test against, and then kuard-safe is the one I'm not actually running any against, and you can kind of see that by the age, right? If you look at the pod age over here, for these four pods, I've got two that have been up for 37 minutes.
D: So what I was trying to prove, well, these error rates, and this is kind of getting back to sort of that histogram idea we talked about, and some of that stuff. Now, in my tests, in my environment, everything is local to me, so I'm kind of just beating up on my own little laptop here; you know, it's handling the load just fine. But there are a couple of things happening on the screen, on the left, that I wanted to kind of walk through. Okay, so...
D: You thought it was Vegeta, that's awesome, yeah. And then I actually got the jaggr thing from this page as well, because they were getting down into, like, how to produce graphs and stuff. I was actually looking at the real-time analysis piece here, and that's how I found jaggr, which is a way of taking the output and getting just the data you care about.
D: Let's kick those guys off and, you know, kind of look at what the output is, what we're looking at. And I have a couple of different interesting experiments to kind of walk through here, which I think will be fun for all of us. So on the left side, these top two, which are the kuard pods ending in -f4 and -d5, right, these two are running; you can see that up top, over time, and I can see the majority of my error codes, or my return...
D: I see a pod die, I see my 500 error rate go up. And so what this is actually testing is: what happens to the traffic that was being terminated on that pod when the pod suddenly goes away, when the pod just dies out of hand? What's going to happen from the perspective of, like, kube-proxy, or, I guess it would be kube-proxy...
D: What's going to happen when I have traffic that got routed to that pod, and that pod goes away during the middle of that call, right? So I'm issuing a thousand requests per second across a load-balanced IP address. That means, presumably, that when one goes away, my availability drops in half, and I see an error rate increase, because the traffic that was in flight for that endpoint is now erroring, and that's what I'm seeing in that error rate here: whenever we see a pod get killed, we see those 500s.
D: That is really awesome. So that was one experiment. The other thing that's neat, another great example of this, which I thought was actually kind of interesting, is that when I was setting this up, obviously, none of my seven nodes actually had that pod image, except, like, the two of them where these pods were initially deployed.
D: Yeah, and then down here on the bottom of my screen, which I should probably make a little bigger, what I'm doing here is I'm actually looking at the endpoints that are behind that VIP, right? So I'm doing a kubectl get endpoints here, which is a command that you can run against Kubernetes, and I'm...
D: ...looking at that particular service, and I'm watching those endpoints change. So as a pod goes away, or comes back, or a new one gets created, I'm watching the endpoints, you know, populate with a new IP. And this explains some of the other interesting behavior that we're seeing here, that we're not down for a really long time, we're not incrementing errors for a long period of time, because as soon as that pod becomes unready, we stop sending traffic to it.
D: It gets removed from the healthy endpoints of the service, and Vegeta, well, the load balancer, will no longer be able to actually route traffic to it, because it won't be in the rotation. So I thought that was pretty neat, and that was that part of it. Now, there's one more test that I wanted to share with you, which is actually kind of an artifact of the tooling that these folks wrote and of how Kubernetes runs things.
D: So on the left side, my theory was that if I kill a pod, I'm going to be able to capture some 500 errors, and I proved that I could. On the right, what I'm going to do here is actually force a pod to restart by having it fail a liveness check, and my theory is that I will not be able to see any errors, because as soon as it fails the liveness check, it will also fail readiness.
D: The pod will get rescheduled, it will come up on a new node, and it will enter the pool for readiness, for availability, again, which is a completely different lifecycle than what we saw on the left here, where I was not using health checks or liveness checks to actually handle this; I was just shooting it and seeing what would happen, right? And so, because it died out of hand, I was able to see those error rates increment, which was interesting. So let's try this out on this guy.
D: The kind-chaoskube repository, under my GitHub name, mauilion; it's all documented. Okay, here, you know, it's all out here. The Vegeta stuff is laid out here, how we're actually doing it, the reasoning; it talks about the instructions a little bit. It's not a super novel document, but I think it's enough to get started. So I hope that was interesting. Any questions, anything we want to try while we're here, anything crazy?
A: That was super interesting. I really liked that test. It was, I don't know, some really awesome way to show off, one, this tool, which is a really awesome tool, but also, like, one of the points of liveness probes and, like, the readiness checks, and all those things, which I think kind of get looked over a lot.
A: They're very valuable and useful tools in Kubernetes, the probes, and people don't leverage them as much as I think they could, and that was a cool way to show them off. But yeah, we simulated a failure, but the system itself was resilient enough to manage it in a way that the end user would not know about. So cool, really cool, dude.
A: We do, yeah, that's fine. Yes, thanks, everyone. By the way, if you haven't added your Twitter account, or anything that you wanted to talk about, or anything in the HackMD, that should be available, and we can add it to the channel. It's not useful to Jessie anymore; Jessie just bounced, yeah.
A: Happy Friday, everyone. I hope you all have a great weekend. Thank you for joining us on the Cloud Native Social Hour. We will be picking this back up in a little bit, usually it's every fortnight, but I will be on vacation next week, unless somebody wants to pick up hosting duty, and I certainly wouldn't mind that continuing. There was something else I was going to say, and I totally just blanked on it. Well, alright, thank you so much, and I hope you all have a great weekend.