A
Hello everyone, welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Annie, and I'm a CNCF ambassador as well as a senior product marketing manager at Camunda, and I will be your host tonight. Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer all of your questions, so join us every Wednesday to watch live. This week we have two amazing speakers here with us to talk about using the Litmus chaos engine and a microservices demo app to demonstrate automated RCA. As always, this is an official live stream of CNCF, and as such it is subject to the CNCF Code of Conduct, so please do not add anything to the chat or questions that would be in violation of that code of conduct.
B
Hey everybody, my name is Seamus, and I'm joined today by Braden. We're both DevOps engineers at Zebrium. We're going to talk today a little bit about Litmus, which is a cloud native, open source chaos engineering framework, and we want to talk a little bit about what that means and what we're doing with it. What we're doing with it is actually a little bit unique.
B
At Zebrium we built a product that analyzes logs to find the root causes of issues, so being able to cause issues on demand is absolutely invaluable for us. When we're validating and demonstrating our product, we need to be able to create problems within our Kubernetes demonstration clusters. We generally don't have access to customers' or prospective customers' environments, so it's important for us to be able to do this ourselves on demand, and Litmus provides on-demand chaos by simulating issues that can occur in environments: bad configurations, heavy infrastructure loads, rainy days, just any stability-threatening issue you can think of. Really, the only limit is imagination.
B
So, a quick thousand-foot view: what exactly is chaos, and what is chaos engineering in this context? There are many different testing methodologies available in the world right now. The thing they all have in common is that they all have blind spots, and the problem is that the blind spots can overlap, and then you can sometimes get really bad, unpredictable behavior. If the first time you find out about a resiliency issue is when a customer reports it at three o'clock in the morning, that's an ops fail. That's bad!
B
We don't want that to happen, and chaos engineering allows you to create these doomsday scenarios in a more controlled environment, as a way to test resilience before bad things occur in the wild. Litmus is by far the best cloud native framework for inducing chaos that we've found. We've been searching, and we've actually developed some stuff on our own, and Litmus is just by far our favorite tool for it. So what exactly is Litmus? It's a framework for conducting chaos experiments: individual little blurbs of bad things that can happen. This is done in a declarative way via experiment templates, and experiments can be orchestrated into chaos scenarios, which can include things like chained experiments. You can run experiments in parallel and sequentially.
B
You can set up and tear down experiment resources, and you can even deploy entire environments as part of a chaos scenario. It's an extremely versatile platform. Litmus was originally accepted into the CNCF Sandbox in 2020 and actually just moved into incubation, so huge congratulations to them for that. That's a big step up, and we're really happy for them.
B
So what kinds of experiments are available right now? There's a fantastic library available at hub.litmuschaos.io. There are currently 58 on the shelf, minimal-configuration-required experiments; they're pretty much just drag and drop, hit go, and receive chaos. They're available for a wide variety of cloud platforms, including Kubernetes, AWS, Azure, and VMware. If you use a major Kubernetes platform, there are compatible experiments waiting for you. So I'm going to hand things over to Braden now, and Braden's going to give us a live demo of setting up Litmus in one of our demonstration environments, configuring it, running some experiments, and seeing what happens.
C
Yeah, thanks Seamus. Let me go ahead and share my screen — which one? This one. Cool, and hold on, my mouse is probably there. There it is. All right, so what we're going to run through real quick is the Litmus install directions: we're going to spin up Litmus on the cluster, we're going to access Chaos Center, their UI, we're going to run their default test scenario, and then we're going to connect it to one of our live apps and actually break some stuff.
C
So the first thing we want to do: Litmus offers an install through kubectl — an apply with YAML — or through Helm. I'm just going to use the Helm one. So the first thing you do is add the Litmus Helm repo, and then make sure it added; yep, it's in there somewhere. I have a lot of repos. And then let's run the install command, so we're just going to install the base configs that come with Litmus.
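The steps he runs here follow the Litmus Helm install directions; a minimal sketch looks like this (release name `chaos` and namespace `litmus` follow the docs' convention, so adjust to taste):

```shell
# Add the Litmus Helm repo and confirm it's registered
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo list

# Install (or upgrade) the base Chaos Center components into a "litmus" namespace
helm upgrade --install chaos litmuschaos/litmus \
  --namespace=litmus --create-namespace
```

These commands need a reachable Kubernetes cluster in your current kubeconfig context.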
A
Should we have a bit more zoom on the terminal?
C
Okay, so going back and forth: the first one was just adding the repo and listing the repo, and now we're running the upgrade command.
C
Oh — so I'm using Lens as a kind of web UI in front of our cluster; it's a little easier than trying to remember and type 5000 kubectl commands. As we can see, it's going through and applying it right now, so we'll wait a little bit for that to finish.
A
Apparently there's a question from someone: will this be recorded and be available? Yes, it will be. It will be available on the CNCF YouTube pretty much immediately after this live ends, so you can tune in to watch it there.
C
Apparently if you close that, it wants the update. All right, so it looks like it's fully installed. The next step: when you deploy this out of the box — as you can see if I go look at the services (and apparently now Alexa's going off) — when it completes, straight out of the box, if you look here, there are no NodePorts, just cluster IPs.
C
There are instructions for exposing it: either going through and editing it to a NodePort, creating a load balancer, or putting an ingress object on it. I'm not going to dive into that. I'm going to cheat: Lens does a great thing where you can do an internal kube proxy and just proxy from the cluster to local. So that's what I'm going to do, just to get straight to a login.
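Without Lens, the equivalent shortcut is a kubectl port-forward; a sketch, assuming the Helm release was named `chaos` in the `litmus` namespace (the frontend service name is derived from the release, so check `kubectl get svc -n litmus` for yours):

```shell
# Tunnel the Chaos Center UI to localhost instead of exposing it
# via NodePort / LoadBalancer / Ingress.
kubectl port-forward -n litmus svc/chaos-litmus-frontend-service 9091:9091
# Then browse to http://localhost:9091 and sign in with the default
# credentials (admin / litmus), which you should change afterwards.
```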
C
It wants to save it to Google; no, and we're not going to change the password for now. Cool, so just like that, we have it stood up, we have the intro UI, and now we're ready to rock and roll. A couple of things to walk through here: chaos delegates — you can see this one's pending. What we installed was just what they call Chaos Center. It's the UI; it's kind of the command and control center.
C
You can specify scenarios, you can download from the hub, you can do analytics stuff. The actual real meat, or the bread and butter, of how this works is installing the self-agent, which — okay, there we go. Think of it as a runner.
C
The idea is that you can install different chaos delegates inside different clusters you own, so that the UI doesn't have to be inside the cluster where the chaos is actually going to happen. When you first open up the UI and sign in for the first time, it actually installs the self-agent for the cluster that you have it running on, which is what you see here. If we hop back into Lens, which is here, we can see that.
C
Let me close this. We can see that there are three or four servers, and that's the one that just installed. This is all part of the chaos operator and some of the subscriber buses — this is all of that self-agent that just installed in here.
C
So let's actually break something. The first scenario we're going to run through uses their demo app, which is called Podtato Head — it's actually a funny play on Mr. Potato Head. It's pretty funny, yeah. So you have Mr. Podtato Head here — potato head, sorry, my bad.
C
Yeah, so we're going to run through; we're going to leave this the same. So we have the sequence of what's going to happen. Since this is their predefined template, it's going to actually install the Podtato Head application.
C
It's going to install the chaos experiment that we're going to run. The chaos experiment for this one is a pod kill — a pod delete, as you can see right here: it deletes the pod. And then once that completes successfully, we are going to revert and uninstall the chaos container, as well as delete the application that we installed, directly here.
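Under the hood, the pod-delete step the UI schedules boils down to a ChaosEngine object; a sketch of the shape, per the Litmus docs (the name, namespace, and labels below are illustrative, not what the predefined workflow actually generates):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: podtato-head-chaos       # illustrative name
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:                       # which workload the chaos targets
    appns: podtato               # assumed namespace for the demo app
    applabel: app=podtato-head   # assumed label selector
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # how long chaos runs, in seconds
              value: "30"
            - name: CHAOS_INTERVAL        # seconds between pod kills
              value: "10"
            - name: FORCE                 # graceful vs. forced deletion
              value: "false"
```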
C
These workflows are customizable; it's all YAML-based, so you can upload YAML. I believe they also have an API, so you can actually just apply directly rather than having to go through the UI steps. We'll circle back on what the weights do in a second. So, within a couple of clicks we have our first thing. We want to go ahead and schedule it now; it's just going to ask us to verify everything. It all looks good to me.
C
So let's hit finish, and let's go to the scenarios. Now we see it's running; you can see the experiment. The main experiment is a pod delete, and now it's just going to sit here. If we go back into Lens, you can see that Mr. Podtato Head actually went through and is spinning up a couple of containers: you've got one for the head, hats, left arm, left leg, main body, right arm, right leg. I don't actually know what this one does.
C
Yeah, so that looks like — let's see. If we click onto here, we can see where we are: we're in the middle of running the actual pod delete command. The one thing about chaos experiments — let me see if I can find it in here.
C
Oh, here it is. Every experiment gets spun up as a job, and that job does whatever the experiment does. So, as you can see, this is a pod delete. It has a target argument; that target argument is probably one of these mains that just got deleted — probably this one. And so that's what this job will do. You can see it terminated one pod and it's spinning up another now, so it's just a little job that runs in there.
C
They can be really complex, or they can be as simple as this one was, where it's: hey, we're going to delete a container and we're going to delete a pod and see if the pod comes back. As you can see, because of the way this app is designed, our hello server is still available — oh no, it's not. I lied.
C
If we had load balancing around that, it would have been available. So we're just waiting for this to finish.
C
Cool, all right, so it ran, and now we're doing cleanups. I know this is kind of the cheesy side, but does it help with declared configurations?
B
Well, what we're seeing here is actually the execution of declarative configuration. We didn't configure anything; this is all just completely off the shelf. This is just default behavior that comes with the Chaos Center installation, but yeah, every individual step — oh, there you go, there's the manifest.
C
So it does actually provide manifest files. You can go through and do a declarative manifest file and apply it directly like that; that will work. We're just doing it through the UI because I didn't write manifest files.
C
Yeah, no, it does. All right, so we ran it. You can see the experiment; the experiment has a result, and everything passed. So let's do some stuff where it doesn't. The first thing I'm going to hook up — the other thing that this allows you to do is you can tie into a previous data source.
B
It would be good to not allow just any arbitrary person on the internet to run chaos experiments on your production stuff.
C
They actually have the ability, inside the settings, to go into user management. You can add users to authentication — create new users, login details using passwords and all that. As part of the install directions, you can also do OAuth authentication.
C
I believe so — you can do that with OAuth. I'm not 100% sure; we haven't set that up or gone in that far yet. But yes, I do believe it supports SSO and OAuth. To also answer the other question about declarative: you can set it up for GitOps as well.
C
I think that's everything we're caught up on so far. Did I answer your question, Mark, about the OAuth and the authentication? I believe so — I've got to check the docs; I know there's a section in there. We haven't done it yet, I haven't personally done it yet, but I think so.
C
Because I don't want to get paged — that's the main reason, I don't want to get paged. All right, so we're actually going to play with the pseudo-production. It's one of our demo apps that we have installed, and it's been on here for a while.
C
Everyone's familiar with this, I think; if not, we're going to walk through the UI real quick, after I find my UI. So, Weaveworks Sock Shop: it's another open source project, one of the big microservice demo applications. I think the two big ones are Weaveworks Sock Shop and Google's Boutique app. We like Sock Shop better.
C
Because socks are cool, and it has more applications and several different database layers on the back side too. So this is Sock Shop. It's a fully functioning, basically, marketplace store, so you can go in and buy socks. We can buy Seamus some more socks.
C
Yeah, so it has a full catalog. You can go see colorful socks, non-colorful socks, super soft — oh, super sport socks. I can't really pick.
C
It has a full working cart, as you can see. Oh, we're missing shipping and payment. Seamus, can I have your credit card, so we can...
B
Yeah, yeah, absolutely — just start with the 404.
C
Okay, yeah, so it's a full app. Now that it's up, let's break this thing, because breaking's fun. That's the wrong tab — all right. Oh, the other thing to note: if I switch to the right namespace, we do have a load generator on this site that's been running for like 15 days. We just permanently keep it running, so there is load, and we're hopefully going to see some fun things in the graphs. This is all a Grafana Prometheus stack, so we'll see some fun stuff.
C
The Breaker of Socks — Game of Thrones fun, if anybody got that. So, as you can see, there are a lot of different experiments. Actually, let me back this up; I'm doing things out of order. So, Chaos Hub: he talked about it earlier. It is where Litmus stores all of the experiments they've written. This is the predefined scenario — we don't care about that right now, that's what we ran. So, chaos experiments: they currently have 58.
C
Yeah, so they have a little bit of AWS SSM, they have some Azure stuff, some CoreDNS, some GCP stuff, some generic stuff. So we'll play with the generic stuff — the generic pod stuff.
C
We'll stay away from the kube-AWS stuff because, like I said, I don't want to get paged. I don't know if I'm on call for this cluster.
C
Yeah, we should, yeah — and some of the EBS stuff. So we're going to do network corruption to start with, and then we'll do some fun stuff, if anybody has any suggestions or just wants to see anything. I keep hitting the wrong tab — all right, so let's go.
C
I misspell everything; it doesn't really matter. All right, so you wanted network.
C
Did it add it? Apparently it's thinking.
C
Just thinking about it — let's do it again. Hey — oh, I didn't click on it; I don't think it's done. There we go, all right. So we've added it; as you can see, it kind of defaults to app and nginx. So let's edit it and change some stuff. Experiment name — I'm not going to mess with that default. So this is where you're asking: can you target stuff? The answer is yes. There are two different ways to install Litmus out of the instructions — there's a — I think there might not be a DNS poison.
C
There's a DNS spoof, though, so that's kind of the same — but maybe not; it's close-ish. But there are two ways to install this: you can do it either with cluster-wide or namespace-wide scoping. I did this cluster-wide, so you can see all of our namespaces, but we're going to target sock-shop, and we're going to target an app. It targets based on labels, and let's do the carts. That's fine; we'll crash a TV next.
C
It's true — so you can add probes. What can probes do? I'm not going to bother looking up what the endpoints for this are, but yeah, you can add a probe: probe names, and it does an HTTP endpoint. You can do HTTP, command, k8s, or Prometheus probes — these are all probe types you can use. You give the timeout period and the retry period, and this will actually probe the endpoints of your application. And this is where the weighting comes in.
C
This basically says: hey, is this thing up, and is this thing healthy? If the thing is up the entire time, you consider it successful. If it's not up and the probe fails, then the test is considered failed and you're not resilient. That's where, if we look at this next section — that's fine.
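The probe he describes attaches to the experiment in the ChaosEngine spec; a sketch of an HTTP probe against the carts service, following the Litmus probe schema (the probe name and URL here are illustrative, and `runProperties` units have varied between Litmus versions, so check the docs for yours):

```yaml
experiments:
  - name: pod-network-corruption
    spec:
      probe:
        - name: check-carts-endpoint      # illustrative probe name
          type: httpProbe
          mode: Continuous                # evaluated throughout the chaos window
          httpProbe/inputs:
            url: http://carts.sock-shop.svc.cluster.local
            method:
              get:
                criteria: ==
                responseCode: "200"       # "healthy" means HTTP 200 the whole time
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 1
```

If the endpoint ever fails the criteria during the run, the experiment verdict is marked failed, which is exactly the "up the entire time" check described above.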
C
The weight thing: basically, if I were to schedule five or six different of these things — let's say I do a pod network corruption, I take out an AWS node (and oh, by the way, this is running on EKS, so that's why I keep referring to AWS), say I take out a node, and then I do a memory or CPU load test — I can weight the different tests accordingly, each of them on a one-to-ten point system.
C
So let's say I don't really care about network corruption; I can rate it at four. But if I were to go in and do our node kill, which is something that's much more likely to happen, I can rate that at 10. It will actually hit the endpoints, it will test everything, and basically, if it succeeds and it doesn't go down at all, you get that percentage of points calculated into the resilience score.
B
And this is part of the really cool stuff you can do as far as CI/CD: you can actually integrate Chaos Center with your GitOps, so that every time you're updating things, if you make changes, you can actually run resiliency tests automatically to see, okay, numerically, what is our score? What's our resiliency like? Like, for instance, when we ran Mr. Podtato Head, we came back with a perfect score. Okay — what if it wasn't so perfect?
B
What if one of our chaos experiments did actually cause a service disruption — like, how bad was the service disruption? This allows us to tune that to a more high-level view, one that especially management is really interested in seeing.
C
Yeah, basically. The chaos workflow won't actually remove the app's own resources when you run it. That thing I did at the end, where I said "let's clean up" — in this aspect, it's talking about cleaning up the network corruption pod that it is running. It's not talking about cleaning up the Sock Shop workload that actually exists.
C
The only reason it cleaned up the application, the pod — the Podtato one — is because it actually deployed that internally. So if it deploys it, you can then have a step to clean that workload up; but since we're using a pre-existing workload, the only thing it's going to clean up is the job that actually ran.
B
And I mean, if you really want to make life difficult for yourself, you totally can have it set up to delete something that already existed when you ran the chaos experiment on it. It has the flexibility to allow you to do that. I personally would prefer it not to do that, but no, that is something you can configure if you want.
C
And I think I actually got a different aspect of your question too: if we're doing node kills and stuff, this is where it would probably be beneficial to run Chaos Center in a cluster you're not trying to test, on the off chance you do nuke the node that it's actually running on and all that.
C
That's where the chaos delegates come into place: they're only a small, tiny subset of pods, and hopefully your environment is running on more than just one AWS node, so they can just get rescheduled.
C
I haven't actually tried the instance of doing a node kill, mainly because I don't want to be paged, but I believe in that instance it would kill the agent and the test would just fail — basically saying, hey, we've lost contact with the delegate.
C
All right, so now it's running that experiment; it's installing the chaos experiment.
C
Yeah, so the one interesting thing to point out: when you install Litmus and the chaos delegate, the workflows will actually run — you can see right here the workflow is running — inside the namespace that the Litmus chaos delegate is running in. So it's not even running inside my sock-shop namespace.
A
Can I ask a question? Perfect. So, how frequently do you use logs when troubleshooting?
B
Oh, constantly. That's the absolute constant thing for us, which is partially the reason why our software exists, why Zebrium exists in the first place. We were trying to alleviate some of the head-banging headache that goes into diagnosing root causes as issues occur, so we have a very powerful artificial intelligence engine that can actually help identify the root causes of your problems as they happen.
B
Well, there we go. So yeah, the normal log volume we would have to look at for Sock Shop: for a five-minute range, probably about two and a half million lines of logs, and we do not have the time or interest to actually try to look through that many log lines to figure out what exactly went wrong in our environment.
B
With Zebrium, we can actually pick out only the 30 to 50 log lines that are the actual relevant ones. It's much more user friendly, much more human readable, to present information like that.
C
Yeah, so all of that drops slightly because all the network activity drops — since this thing's communicating with itself, it all plummets. So we'll just wait for this to finish. "An error occurred fetching data" — well, yeah, that actually makes sense.
C
You know, yeah, I love it. So once the interface comes back up — which should only take 60 seconds, so it should be fine — it should be coming up. We should just be in the lag of scraping.
C
What the network corruption does is actually corrupt — I believe it removes the network interface for Docker on that container, or on that node. Oh — oh wow, this is a single-node cluster.
C
Yeah, and kind of to go full circle about this — full disclosure — this is our own widget that we have installed in Grafana. Like we talked about at the beginning, we use this tool to induce live alerts and stuff. So, as you can see here, it's a Grafana card that says: hey, the master pod was restarted and the kube-probe restarted. I don't know if that's actually — seriously, you're going to make me sign in now?
A
And then there's a new audience question as well, when we have —
C
Actually, yes and no. Let me go back and show you — let me go back to schedules, hold on. Let me get back into the manifest of this.
C
Everything's time-bound, so inside of this massive chunk of spec somewhere, there is...
C
It is. So everything's time-bound in seconds; for that test, we ran it specifically for 60 seconds. There's also a flag you can set inside — let's say you apply this with a YAML file directly: you can reapply that YAML file, and there's actually a —
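The time bound he's scrolling for is just an environment variable on the experiment spec; a sketch of the relevant fragment for a network-corruption run (values illustrative, env names per the Litmus experiment docs):

```yaml
experiments:
  - name: pod-network-corruption
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION              # total chaos run time, in seconds
            value: "60"
          - name: NETWORK_PACKET_CORRUPTION_PERCENTAGE  # how much traffic to corrupt
            value: "100"
```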
C
Yeah, from that aspect, everything's time-bound. As for a glass-break — it's like there's a mixture; I believe you can cancel a test inside of here, as it runs, as like a hard glass-break stop.
B
Yeah, yeah.
B
What kind of log lines do we get out of that detection, by the way?
C
Yeah, so that's our end-to-end. I think the last one — I think Mark wanted to see us...
B
Honestly, this is one of the things I like so much about Chaos Center: I can just sit in here and I can just play. You know, I've never done something so catastrophically bad that I've not been able to just hit a button to reset everything — but yeah, today might be the day; let's find out.
C
It's a little buggy right now. I think most of that's due to me using a kube proxy and it being on a VPN; the combination of the two is a little fun. If I weren't lazy, I would actually set up a load balancer, or actually set up an ingress object and add in the proper annotations for it to spin up an internal ALB. It would actually be a lot better, but like I said, I'm lazy.
C
Let's do that; I'll schedule it now. Yeah, finish — put a scenario. This will be fun, because I have no idea what this is actually going to do.
B
So yeah, there are GitOps integrations that I have messed around with. I would assume there is something we can do to interact with Slack; I honestly don't know off the top of my head.
C
Yeah, I don't know either, and in full disclosure, us moving to v2 is kind of — this is definitely newer for us. We first grabbed onto Litmus v1; v2 is when they came and put the UI and the Chaos Center and everything in front of it.
C
V1 was entirely server- and API-based, and so what we were really doing was crafting YAML manifest files and just doing kubectl applies with a series of files that added the RBAC process and all that. This is definitely much easier, and after diving into this a lot, we both have tickets now to go look at, you know, full-scale implementing this all the way through with the UI, because it just makes everyone's life so much easier.
B
Well, yeah — it limits the grief factor and all that. Turns out people actually really like GUIs. That's interesting innovation, man, fresh out of the 70s.
C
Yeah, so I haven't honestly dove into that, so I can't really say. I believe they have a Slack integration, but again, I don't know.
C
No — I wonder if it's because it's looking for something that's not there to actually target. That would be my guess; it's looking for, like, kube-dns or something. If I actually went and read the manifest — like I said, we haven't ever done it. It's a cool thing to tinker with. Let me see if Prometheus says something interesting. Yeah.
C
Yeah, it doesn't look like that actually targeted anything correctly. I probably could have set it up wrong.
B
I mean, we could do something like a pod delete or something like that — something a little bit more innocuous.
B
Yeah, you might want a node selector for that one.
C
An EC2 instance ID — give me one second.
C
On some of it, like I said, if you're using EKS or GCP stuff, obviously authentication and authorization are going to be some form of an RBAC role — you know, defaulting back to their IAM management, using a role or something like that. Internally, I don't really know; I don't have a good answer for that, really. Like I said, we haven't played super much with isolating them and actually sharing it out to different clusters yet. I'm sure I'm going to have that same question in about a week.
C
I don't know if this will work, but we'll try it. Next, next, finish. Next — that should be the instance ID.
C
Well, they'll exist; they're persisted by a PVC, so I still have access to them. Here we go — yeah, so you can kind of see, here are the pod metrics. I mean, it's a limited use case — I think they're still building this out — but it's still kind of cool. You can see this actually needs a Prometheus scraper, which I didn't set up: there's a chaos exporter that will dump all the chaos intervals into Prometheus, as part of an exporter with a service monitor.
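If you already run a Prometheus Operator stack, wiring that chaos exporter into it looks roughly like the ServiceMonitor below. This is a sketch: the `release` label, the `app: chaos-exporter` selector, and the port name are all assumptions, so verify them against the exporter's actual Service in your cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-exporter
  namespace: litmus
  labels:
    release: prometheus      # must match your Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: chaos-exporter    # assumed label on the chaos-exporter Service
  endpoints:
    - port: tcp              # assumed metrics port name; verify on the Service
      interval: 10s
```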
C
It didn't play nice with our already-installed Prometheus — the directions they have kind of install Prometheus itself — but you can hack through it to get it running. Did this thing actually kill the node, or is it still up? Did it actually run? I might have done that wrong; we'll see. Oh, it failed.
B
Yeah, I've also seen some really good demos. Chaos Carnival is an annual conference specifically for Litmus, and there are some really good demos that came out of that.
B
We should open Prometheus and check this out.
B
Yep, yep, yeah. I really appreciate the opportunity to come show everyone us messing around a little bit with what Litmus can do.
A
Yeah, loved it — particularly the speed running was very, very nice. Perfect. Thank you so much, everyone, for joining the latest episode of Cloud Native Live. It was great to have a really good session about using the Litmus chaos engine and a microservices demo app to demonstrate automated RCA.