From YouTube: Cloud Native Live: Introducing LitmusChaos 2.0
Mario: Hello everybody, welcome to another Cloud Native Live. My name is Mario Lauria — you've seen me before, I'm sure. I've been hosting Cloud Native Live on and off since we started the Cloud Native TV network this year. Thank you so much for joining us for another wonderful episode, where we are going to dive into chaos engineering.

Today I have Karthik with me from ChaosNative. They're a company that is working on resilience engineering in our world of cloud native, and leveraging chaos engineering to achieve that with their product, LitmusChaos. I'm not going to introduce Karthik myself; what I'm going to do is try to get all of your questions answered. Please leave your questions in whichever platform you're using to watch this right now. We thank you for tuning in and spending time with us today.

This is going to be a really, really fun session. I have a lot to learn. I know of chaos from the Chaos Monkey project on GitHub, and from talks by Netflix engineers like Adrian Cockcroft and others, who have been pushing for chaos to increase the resiliency and reliability of your platform. It's very, very difficult to hone in and get just right, and it's also very scary for a lot of organizations. Knowing the companies I've been in SRE at — what they've been able to do and what they've been comfortable with doing — introducing chaos is actually really scary, and it's hard to take that first step.

So I think Karthik is going to be teaching us a ton of great stuff. I'll leave him to get into his background, but again, I thank everybody for joining. Please leave your questions, comments and thoughts in the chat; I am monitoring those, so I will be sure to get those questions asked to Karthik. We'll be going through a wide gamut of different areas in chaos engineering, the ecosystem, and cloud native's role. So thank you so much.

If you want to chat with more people, myself, Karthik and others that have been on the show are definitely hanging out there. Check out ChaosNative while you're watching: @ChaosNative and @LitmusChaos on Twitter, chaosnative.com. I'm scrolling through here, and I even see Viktor Farcic — from, I think, a DevOps podcast — has even done an episode on integration with Argo Workflows, which is super exciting. I didn't know it existed, and I use Argo.

So I have a lot of work to do this week to share this with my team. I think, without further ado: Karthik, thank you so much for joining us. I'll let you take it away — go ahead.
Karthik: Thank you, Mario. Really excited to be part of Cloud Native Live and to discuss chaos engineering and LitmusChaos. Thank you for the introduction. Like you said, my name is Karthik. I work for a company called ChaosNative, and I'm one of the maintainers of the open source CNCF sandbox project called LitmusChaos. Today we just want to discuss what cloud native chaos engineering is and talk about the Litmus project.
Litmus has been around as 1.x, and in the process of talking about the project I'll introduce you to what Litmus 1.x did, the feedback we received from the community, and how 2.0 was brought up — how it was created. We will go through a couple of demonstrations to discuss what this platform is about, and I hope this encourages you to start on your chaos engineering journey in your organizations.

So please feel free to ask us any questions; we'll be happy to answer. With that said, I just want to introduce what chaos engineering is. I'm sure a lot of folks already know about chaos engineering — you might be practicing it, you might be practitioners, you might have heard about it or read about it.
Netflix, along with a few other organizations that were early adopters of chaos, created the basic tenets of chaos engineering. You can look at the website principlesofchaos.org, which carries a lot of information about what it is, why it is important, what its principles are, and so on.

Okay, I am hoping the screen is visible. Let me just quickly run through this. One of the basic reasons chaos engineering is so important is that downtime is very expensive. We have had past incidents at organizations that are generally very resilient where downtime has cost a lot of money, and that is something we want to avoid.
A lot of services that we consume on a day-to-day basis have publicly available SLAs — for example, a Google Cloud service might commit to being 99.95% available. Chaos engineering is a practice that actually helps you verify whether you can provide that kind of availability all the time.
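For a sense of scale, that SLA figure translates directly into a small downtime budget (simple arithmetic, assuming a 30-day month):

$$(1 - 0.9995) \times 30 \times 24 \times 60\ \text{min} \approx 21.6\ \text{minutes of allowed downtime per month}$$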
By definition — there are a lot of definitions on the internet; here are a couple that I've picked.
One basically says it is a method of testing a distributed computing system such that it can withstand unexpected disruptions. You're testing whether some of the components — or some of the assumptions you had when you built the code — hold up: for example, you always assumed the network would be alive, or that you had unlimited compute, storage and bandwidth. That's not often the case.

There are failures that happen all the time, and we need to check whether we can withstand the disruption, recover quickly enough, and continue to provide the service at an acceptable level to the users. And chaos engineering is not about reckless fault injection. It's a scientific process: you identify a control group and an experimental group, you inject faults in a very controlled way, you ensure there is minimal blast radius when you're injecting faults, and then you see what happens. You try to learn about the system. Sometimes you go in with a hypothesis.
The hypothesis is sometimes proved; sometimes it is disproved. If it is disproved, that's actually better — it means there is something you've newly learned. You can go back and fix your application, or fix your deployment practices; maybe you can improve the underlying infrastructure and make it more resilient. There are a lot of things you can do: repeat your experiments, gain confidence, and so on. That, generally, is the practice of chaos engineering.

You might call your system resilient to the fault you injected and move on to the next one; or, if there is a weakness you found and your hypothesis was disproved, then you go back to the drawing board, check what went wrong and what needs to be better, make those fixes, and repeat. Chaos engineering has traditionally been done in production — for a long time, that was the philosophy: chaos engineering is most effective and useful when you do it in production.
The Principles of Chaos says as much. But with the recent proliferation of Kubernetes and the evolution of the cloud native paradigm — wherein a lot of organizations are re-architecting their applications, moving away from the monolith, creating everything as microservices, containerizing it, and running it in new deployment environments, mostly Kubernetes —

there's a lot of apprehension about how things are going to work, and folks are probably not ready to do chaos engineering in production from the get-go. There's a lot of chaos experimentation being done in pre-production environments to gain confidence before it is really done in production.

As at the beginning of this discussion, there is a principle that we at Litmus are big advocates of, called the chaos-first principle. It is about doing chaos engineering in a more ubiquitous and democratic way: you start doing it in development environments, then in staging, then in production.
Maybe you add failure tests as part of your CI, applying them in CI/CD pipelines, and you do SLO validations during chaos experimentation — basically, you validate that your system continues to stay alive and your service-level objectives are met under duress,

before, let's say, your application is moved into production. And then, once you're confident, you start doing the actual game days — chaos experiments in your production environment — and see whether the system holds up there. That is what we're seeing happen in recent times. I mentioned how Kubernetes and cloud native are a factor in getting people to do chaos engineering earlier and more often. That's because in a Kubernetes-based deployment environment there are so many variables, so many factors — Kubernetes itself is quite dense.
Then you have your direct application dependencies — your databases, message queues — and then your app with all its services: Kubernetes resources, your middleware, your user-facing services, and so on. A lot of things can go wrong, and it is important that all these components work well in sync to provide the best user experience — the one you have guaranteed to the users of your service.

So it is important to test out varied scenarios, and to test often. One of the pillars of the cloud native way of doing things is to release fast, to keep everything as microservices, and to ensure everything is declarative, with Git as the source of truth; you have controllers that ensure your infra and the code in your source are always in sync. Changes are happening at a fast pace, so you need chaos engineering to borrow
some philosophies from that same model: ensure your chaos intent can be declarative; ensure you automate steady-state hypothesis validation as part of the experiment; ensure it lends itself to GitOps; and ensure you have the same homogeneous experience you've had while doing other things on Kubernetes. Whether you're defining application lifecycles, security or resource policies, everything is defined as resources and you have controllers in Kubernetes to manage them — and you would like to bring that into chaos engineering as well.

So that is an introduction to what chaos engineering is generally, and what this category of cloud native chaos engineering is. Let me introduce you to the Litmus project. This is an open source project which has been around for about three years or so now, and it provides an end-to-end platform for doing chaos engineering on Kubernetes. We've also started expanding it, providing capabilities to do chaos against non-Kubernetes infrastructure as well: EC2 instances on AWS, GCP VMs, VMware VMs, and so on.
The Litmus platform runs as a set of microservices on Kubernetes — it uses Kubernetes as the substrate to run the chaos services, so to say. You can pull ready-made, off-the-shelf experiments that are available in what we call the ChaosHub. The ChaosHub is an open marketplace with a lot of the common scenarios you would like to execute. You can pull the fault templates, install them on your cluster, and define a custom resource that maps the fault to an object on your cluster.
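A minimal sketch of that flow, assuming the hub URL format used in the Litmus docs (the namespace and the trimmed template contents here are illustrative):

```yaml
# Pull a ready-made fault template (a ChaosExperiment CR) from the public hub, e.g.:
#   kubectl apply -n demo -f \
#     "https://hub.litmuschaos.io/api/chaos/2.0.0?file=charts/generic/pod-delete/experiment.yaml"
# A trimmed view of what the pulled template roughly contains:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: demo
spec:
  definition:
    image: litmuschaos/go-runner:2.0.0    # container carrying the fault's business logic
    args: ["-c", "./experiments -name pod-delete"]
    env:
      - name: TOTAL_CHAOS_DURATION        # defaults that a ChaosEngine can later override
        value: "15"
      - name: CHAOS_INTERVAL
        value: "5"
```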
It could be a node resource or a pod resource, and you go from there. That's what Litmus is about, very simply speaking: it has a set of custom resources to define chaos and steady-state validation at its heart, it has a controller that reconciles these resources and carries out the chaos experiment — the fault-injection business logic — and there is a way for you to look at the results of the experiment so performed and glean some information about your application's behavior under chaos.

Litmus as a project was started, in fact, to serve the resilience-testing needs of another CNCF project called OpenEBS, and over a period of time it acquired a roadmap of its own and became more popular in the community. Requirements started coming in, and we went from being a platform that can do chaos in a cloud native way — that is, you define intent in a CR and there is a controller that carries out the experiment for you — from there,
we went on to make it an end-to-end platform, because chaos engineering has a lot of other requirements: in terms of observability needs; in terms of defining blast radius in a very controlled way; in terms of ensuring your chaos results are analyzed over a period of time to give you useful information about your system. There are some general KPIs associated with the chaos engineering practice, so it helps to surface how your KPIs are doing as far as your practice goes.

You want to provide the steady-state validation intent as part of an experiment run, and sometimes there is a very diverse set of ways of verifying steady state, so you want to pack all of them in. All of that is roughly what we did to get Litmus from its initial stage — the 1.x release — to what 2.0 is now.
Mario: Yeah, for sure — thank you so much. Oh my gosh, so much to chew on; this is great. Okay, so I just have a couple of lightweight questions that you'll be able to smash pretty quickly. I think when most people think about testing an environment — to test as close to a real-world example as possible — what they often consider is something like emulating a DDoS attack: something outside is causing harm, coming in right at the ingress layer, hitting certain API endpoints or sending malformed requests. They're doing something at maybe a higher level.

It sounds like what Litmus does is actually run in-cluster, and this is kind of going back to chaos engineering: you're actually inflicting chaos on yourself — internally, not externally. You mentioned Litmus has a few different types of, maybe, an attack library. I'm actually looking at Gremlin as well, which is another chaos engineering platform, and I'm interested in some of the differences there.

So: what is the go-to MO — the default patterns — that you find people using Litmus for? And can you expand a little on what you're doing? It sounds like you're taking Litmus to the next level, building a platform. How do you intend to leverage that platform to help provide continued reliability insights, SLAs, things like that — not just for your Kubernetes cluster, or an API for your application, but for the entirety of a platform?
Karthik: That's a great question. You're right about the first part: when Litmus started, a lot of members of the community started using it — and a large set of users still do, predominantly — for inflicting chaos within the Kubernetes cluster and the services inside it.

Litmus has a feature to do some asset discovery, which we will show very shortly during a demonstration, where you can identify the different services living inside your cluster, and you can access them and target them with different faults. For example, the generic category consists of most of the pod-level and Kubernetes node-level faults: you can kill pods, or send termination signals to containers.
You can do chaos on the pod network — introducing latencies — or eat up resources and slow down the applications (your PID 1, essentially, running inside your containers). Similarly for nodes: you can take a node into maintenance by draining it, or cause an eviction by tainting it and pushing out all the pods — things that can be done within Kubernetes. And Litmus has this model, especially in 2.0, where you run the control-plane services within one cluster, and you can register several other target environments — Kubernetes clusters — into the control plane.
You have label and annotation selectors and namespace filters that you can use, and you can also set up affinity policies, node selectors and things like that, to ensure the Litmus application services themselves are not impacted by the fault being injected — and you can point at the specific resources against which you want to do chaos.
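A sketch of how that scoping shows up in a ChaosEngine spec — the field names follow the v1alpha1 schema, but the values here are illustrative:

```yaml
spec:
  annotationCheck: "true"          # only workloads annotated litmuschaos.io/chaos="true" are eligible targets
  appinfo:
    appns: payments                # namespace filter
    applabel: app=balancereader    # label selector picking the target
  experiments:
    - name: pod-network-loss
      spec:
        components:
          nodeSelector:            # pin the chaos runner pod to chosen nodes
            kubernetes.io/hostname: worker-2
```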
But when you think about things happening outside of your cluster — for example, application services receiving requests from an ingress, requests from outside — Litmus also has the capability to ensure that traffic toward certain destination addresses is inhibited. Let's say there is a service talking to another service inside the cluster: you can ensure that the traffic between those two services alone is disrupted, for example. That's something we developed recently. And a lot of organizations still have a hybrid model.
The chaos business logic runs from inside Kubernetes — it drops a chaos pod there — but we make use of the APIs provided by the cloud provider, for example AWS, GCP or Azure, and we can go ahead and invoke those APIs. They all provide very good SDKs, so you can have access secrets created on the Kubernetes cluster where the experiment business logic runs, and then you target some non-Kubernetes resource that lives completely elsewhere and still invoke the failures there.

You can cause instance failures and disk detaches, and also cause, let's say, resource burn or network degradation within those VMs, depending on the capabilities the APIs offer. That way you still run Litmus from within the Kubernetes control plane, and you are still able to target resources that exist outside of Kubernetes.
That is the model we are working towards. There is a limited set of experiments for AWS, GCP and VMware today, and we're in the process of expanding those experiments — the faults you can run on non-Kubernetes targets — so that you get a wholesome platform: you're able to use the same centralized platform for doing chaos against different kinds of targets.

You brought up Gremlin, which is another great tool that has been around for quite some time; they have added capabilities to do Kubernetes-based chaos as well, but they primarily started out doing chaos against VMs.
I think the chaos engineering community and the tools — in the open source as well as the closed source space — are really growing today. Litmus is differentiated in terms of its architecture, in how it runs as a Kubernetes app, as well as in the way it treats an experiment. What we're trying to do is align with the Principles of Chaos and provide an end-to-end experiment.

The notion of a complete experiment has fault injection at its core, but also blast-radius control, the ability to do steady-state validation, and the ability to simulate complex scenarios by stitching experiments together. You could actually go ahead and run more than one fault. Let's say you have a node which is almost exhausted of resources — there's not much that can be scheduled there.
Then you have another node on which, let's say, an eviction happened, or a pod got deleted there for some reason, and it cannot get scheduled anywhere else, because the other node is already running at full capacity. This is a condition that is sometimes seen in production. You might want to bring up this scenario; it's a complex one. You might have to do two faults and tie the right validation along with them.

Litmus enables that through what we call chaos workflows. So, to summarize, we're trying to build an end-to-end chaos platform for doing complex experiments and also visualizing the progress of the experiments. And you asked: how can I get information — how can organizations take a look at how their chaos engineering practice is going, whether there is an overall resilience view? That's what we're trying to build with Litmus as well.
There is an analytics section here which goes through all the past workflows that you have run against your services. There are comparisons you can do: you might have run these workflows or experiments against different environments — maybe dev or staging and production — or maybe across releases. You're trying to compare how your experiments went and see whether you're improving or regressing. That is something we are adding.

There are other views, too — other viewpoints we're trying to add here based on community feedback, on what people are most interested in when they're running experiments. For example, people would like to see how their application behaves — we'll see that in one of the demos: okay, I'm peering into my application dashboard; now I want to see when chaos is actually running. When did it start? When did it end? How did the application behave during this process? So that's some amount of observability we're trying to add into the platform as well.
Mario: Yeah, that was amazing. Thank you so much for getting into that. This platform — this UI — really helps seal the deal in terms of what am I actually getting from an end-to-end perspective. With the analytics, you need to be able to measure progress, right? Where am I at now? What is my desired state, and what are the incremental changes or pieces to getting to my desired state — whether that is being able to support so many requests per second, or being able to sustain failures of database connections, or whatever it might be. So, principlesofchaos.org — you've heard Karthik mention it a couple of times — I think that's a good starting point. That's actually a GitHub project as well. Literally just search chaos engineering on Google; you can find tons of great resources that look like what Karthik and I have been talking about here, and why this is so important.

The other thing I wanted to mention, too, Karthik, is that a lot of people don't really understand: why do I need chaos engineering? No one on our SRE team is going to go in and just delete pods; no one's going to go mess with ExternalName objects in Kubernetes, or screw with our CNI DaemonSet. No one's going to do that, right? But it's not about the humans as much as it's the natural elements of a cluster — the churn, what's going on. There's maintenance, there's updates, there's autoscaling. There are devs constantly deploying things; there are people hopping into pods to look at things and test things. There are objects coming in and out; there are many different namespaces.

So I'm kind of talking about some of the SRE core principles here: assuming failure, and using strong measurements — SLAs, SLIs, SLOs — to track your services and your endpoints. And once you think about that — let's say, because I have experience with an e-commerce platform — at any given time you might have a marketing event and have millions of people coming into your platform in the scope of five minutes.

How are your applications going to work? I've seen so many different sorts of problems where you can throw compute at something, but if the things it depends on break — if it can't get out to the internet, if the NAT gateway is broken, or if other services it depends on in the chain of doing its operations, of doing some processing and producing an output, are broken — there isn't scaling there as you'd expect, and you're not going to know about that until it actually happens.

I fundamentally think there is no way to anticipate problems until you have actually experienced them. So I think chaos engineering is basically saying: you have to commit to being okay with this — and again, it's going to be scary — but things are never going to be perfect. You're never always going to have five instances of your application 100% working perfectly, hitting health checks, responding in under one second. You're never going to have things at perfect capacity. So what actually happens then? What does that mean for your end users — the people using your platform day in and day out, who might be buying something from it or depend on it for whatever reason? This is all together making Kubernetes a better platform for you to continually use and ship applications on, and really get that feedback loop of analytics, metrics and other data, so you know what's actually going on — having that intelligence. It's not the older world, where we just throw it on there, systemctl says the service is running, and hopefully everything's good. I think this is the new model for thinking about how we do things, especially in the cloud native way.

So with that, I'm going to give it back to you, Karthik. You've already shown us a little bit of the platform; you probably want to dive into some of the differences between 2.0 and 1.0. I'd love to hear more about what the intent was with 1.0, and what key learnings you and the team leveraged to figure out what 2.0 should be. Take us through that a little bit, and I'm sure we'll have some questions from there — I know I have a few other questions as well.
Karthik: Sure. I think when we built 1.0, it was, like I said, driven by the need to create something cloud native for doing chaos. One of the things we felt was that in Kubernetes everything happens to be a resource, whether native or custom, and then you have a controller that reconciles things.

We wanted to bring that experience to Kubernetes chaos engineering as well, and that's when we created some custom resources — the ChaosExperiment, the ChaosEngine and the ChaosResult — along with the operator that carries out the chaos process. This is a very brief summary of what is on the hub I showed you a few minutes back: there are a lot of prebuilt templates that each define a particular fault.
Then there is the ChaosEngine — the user-defined one, the one users deal with on a day-to-day basis — where you provide run characteristics for the experiment and map a fault to some component living on your cluster right now, which is either a workload or a service, or maybe, in the case of non-Kubernetes chaos, some instance living somewhere in the cloud.

That's what you create and run. The ChaosEngine is the one that actually triggers injection — its creation kicks off the injection process — and the results of the experiment are stored in a ChaosResult. We keep these as separate resources because there is huge scope for expanding the schema and what each can hold.
In the ChaosResult you can store the experiment status and the verdict of the experiment upon completion, based on certain steady-state validation constraints. You would like to know how each of those constraints fared when you ran the experiment — we use something called probes to define those constraints. And then, of course, you can repeat experiments with different scheduling options: you might want to randomize your experiment runs over a prolonged period of time, either strictly scheduled or with some randomness thrown in.
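A sketch of what the operator writes into a ChaosResult — the field names are assumed from the v1alpha1 schema, and the values are illustrative:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: hello-chaos-pod-delete     # conventionally <engine-name>-<experiment-name>
  namespace: demo-space
status:
  experimentStatus:
    phase: Completed
    verdict: Pass                  # Pass / Fail / Stopped
    probeSuccessPercentage: "100"
  probeStatus:                     # how each steady-state constraint fared
    - name: check-downstream-svc
      type: httpProbe
      status:
        verdict: Passed
```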
So these are the chaos CRDs. When we started the chaos engineering project Litmus, we had just this one deployment — called the chaos operator; the rest that you're seeing here came later. Initially you had this operator, and you could create a ChaosEngine manifest, something like this: you just have an application that you're identifying by namespace, label and kind — these are the identifiers for a given application — and you can go and run this experiment with a specific service account.
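Roughly what such a manifest looks like — a sketch against the v1alpha1 schema; the hello-service names and the run tunables mirror the demo, but the exact values are illustrative:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-chaos
  namespace: demo-space
spec:
  appinfo:
    appns: demo-space              # namespace of the target application
    applabel: app=hello            # label identifying it
    appkind: deployment            # kind of the target
  chaosServiceAccount: pod-delete-sa   # the persona running the experiment; its RBAC bounds what chaos can do
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"          # total fault window, in seconds
            - name: CHAOS_INTERVAL
              value: "10"          # kill a pod every 10 seconds
```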
Litmus allows you to run your experiments with defined service accounts, so you can choose who is doing the experiment and what permissions that persona has, and therefore you can control what can be done as part of the experiment. We might provide permissions, say, just to delete pods and nothing more — so you cannot do a node-level experiment. That way Litmus lends itself to a self-service model of experiments: maybe there is a shared cluster where different service owners and developers are running their own experiments.
You can do that by adopting different service accounts. And then you have something like this: a total duration for which the chaos runs, and some tunables to control the specifics of the experiment. We're going to run this experiment against a very simple service called the hello service — it's basically a hello-world application that lives in a namespace called demo-space. Let me just take you to that. It's just a pod.
What I'm going to show is like the hello world of chaos engineering — we're just deleting a pod. With 1.x we provided this capability to create the resources and run the faults, so you can actually see these run. People found it very convenient to use this in their scripts and automation pipelines, CI/CD, things like that. You basically run it, and the fault happens a few times.
We just set a 30-second duration with a 10-second interval, so it's going to do the kill two or three times; you're going to get a ChaosResult, and you'll get some events on the ChaosEngine resource that might be of interest, so you can actually see what's happening as part of the experiment.
Each experiment in Litmus does a pre-chaos check to see whether the application we're doing chaos on is in a good state, because we don't want to degrade an already degraded system. So we make some checks, then we carry out the fault, then we do some post-checks, and then we finish the experiment. That is essentially what was available. You can see the chaos injection is in progress, like I showed you. This is going to complete the experiment and then allow you to
take a look at the results and see what happened to your applications — you can draw your own inferences from what is happening. So that is what we had. You get a summary based on the constraints. There are essentially no constraints in this experiment, so it says the experiment passed, because the only checks used to verify that it passed were whether the app was good before, and whether it recovered afterwards within a specific period of time.
But as we went along — and I think I just missed showing you the ChaosResult; the ChaosResult is quite simplistic in this case.
You can see that the experiment ran once, it passed, and this was the target. So this is something very simple, but real-world scenarios are more varied and real-world requirements are more demanding, and we got that feedback as we went on building the project in the open. People asked: how can I visualize the impact of chaos?
I have an application dashboard and I want to see exactly when chaos starts and when it ends — that's how they'd like to visualize the impact of chaos. They want to see what stage it is in. Of course, the events are helpful, but events are not for everybody. There are a lot of ways
Kubernetes applications are operated, and different personas are involved. There are some folks who are very deeply involved, know the Kubernetes API, and are very happy to navigate things like logs and events; there are others who are looking for more graphical representations — dashboards. So: how do I visualize the impact of chaos, and how do I validate application behavior? The verdict you're giving me is too simplistic.

How can I go ahead and do more faults, maybe as part of a larger scenario — like the case you mentioned some minutes back, where one node is run to exhaustion and an eviction happens elsewhere, and a pod is stuck in pending state? How do I simulate these kinds of scenarios?
How do I do benchmarking? You mentioned cases where thousands of users are using your platform at a single point in time: how do I simulate that load and then inject a fault? How does the system respond to faults under such loaded conditions? Doing chaos under utopian conditions is not great — I don't want to do it in the idle case; I want it while the traffic is at full tilt. How do you do that — basically do multiple things at once?
With particular experiments there can be a lot of parallel processes. And how do you express resilience — what is the metric you can show that says how your fault, your experiment, and your application or service are correlated? What is the metric that says your application is resilient to this fault, and by how much? And then there are other operational challenges: how do I get different team members to come and collaborate on my chaos artifacts and visualize them?
How do I ensure there is a single source of truth for chaos experiments? YAMLs are great — you can store them in Git — and GitOps is really the in thing: I want to ensure that when a change is made in my Git repo, the experiment definition gets changed on my cluster when it is run. So how do you ensure that? And how do you use a single platform to target different environments?
We touched on this a little bit: doing chaos on Kubernetes, but also doing chaos against other components while still running as an app on Kubernetes — how do you do all this? These are the requirements we got. We spoke to the community, there were several meetups where we went and presented, we got talking to people, and this is what we brought back into the project and built 2.0 with.
The result is an architecture which basically gives you a single, centralized, cross-cloud control plane — we like to call it a centralized management platform — where you can connect one or more target environments for chaos, depending on where you want your chaos to be done. I might have a fleet of clusters, but I can use a single management platform to manage chaos across them. You have the self agent: this is the cluster on which the portal — the ChaosCenter — is installed.
It automatically registers itself as a candidate target environment for chaos; then you can add other clusters as well for doing chaos. Essentially, this is an execution plane. You can run your chaos business logic on the same cluster where the ChaosCenter — the control plane for your chaos — resides.
You can also use a different cluster as an execution plane. By doing that, you are able to discover the microservices sitting on that cluster, and you'll be making use of it to run your chaos pods, targeting resources that live there. And we have this ability to create workflows.
Now, instead of the single ChaosEngine that you saw getting created, you can create a workflow that stitches together more than one ChaosEngine in a different order — in parallel or in sequence — and you can also have load tests embedded within the workflow. You could use Locust or Vegeta or k6.io, or a lot of other tools that run as a job — things like that, that we know the community is using today — and that can be done.
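To make the shape concrete, here is a heavily trimmed sketch of such a workflow. Litmus generates these as Argo Workflows, but the template names, the load-test image, and the litmus-checker invocation below are illustrative assumptions, not the exact generated output:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bank-resilience-
spec:
  entrypoint: chaos-scenario
  serviceAccountName: argo-chaos
  templates:
    - name: chaos-scenario
      steps:
        - - name: start-load           # load generator and first fault run in parallel
            template: run-load
          - name: memory-hog
            template: run-memory-hog
        - - name: pod-delete           # second fault runs afterwards, in sequence
            template: run-pod-delete
    - name: run-load
      container:
        image: locustio/locust         # any containerized load tool can be a workflow step
        args: ["-f", "/scripts/load.py", "--headless"]
    - name: run-memory-hog
      container:                       # these steps apply ChaosEngine CRs and wait on their verdicts
        image: litmuschaos/litmus-checker:latest
        args: ["-file=/tmp/engine-memory-hog.yaml"]
    - name: run-pod-delete
      container:
        image: litmuschaos/litmus-checker:latest
        args: ["-file=/tmp/engine-pod-delete.yaml"]
```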
If you can containerize it and run it as a pod, you can run it as one of the processes in the workflow, along with the experiment, so you get a better scenario to test. We also have the analytics, which compares workflows using what is called a resiliency score. The resilience score is essentially a metric that connects your experiment and your service; it is calculated from the importance, or weightage, that you give to an experiment.
You take a product, and you take a summation of that product across the experiments listed within a workflow, divided by the total points possible, and you get a resilience score. That resilience score is something you can use to compare over a period of time, and you can see whether your setup is improving in resilience or regressing.
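Roughly formalized, as described here (the symbols are our own, not official notation):

$$\text{resilience score} \;=\; \frac{\sum_{i=1}^{n} w_i \, p_i}{\sum_{i=1}^{n} w_i \, P_{\max}} \times 100$$

where $w_i$ is the weightage given to experiment $i$ in the workflow, $p_i$ is the points that experiment earned from its probe and verdict outcomes, and $P_{\max}$ is the maximum points possible per experiment.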
So that is the analytics. The platform also gives you the option to pick workflows — chaos artifacts — from a Git repository: you can commit workflows that you construct with the wizard provided in the ChaosCenter into your Git repositories, and any changes made there are reflected back.
We also have the embedded ChaosHub here, so when you construct your workflows you can pick experiments from the hub and stitch them together. These are all the capabilities we created in response to the requirements we gathered. I'll do a very quick set of demonstrations to show you how you can leverage this. Mario was talking about e-commerce applications; we've taken the example of a sample e-banking app called Bank of Anthos, which is comprised of several microservices.
You can see that I have Bank of Anthos deployed on this cluster. I have links here just to make visualization easy. There is a balancereader service, which enables me to read the balance here, make payments, do all sorts of things. I set up this application without much resilience, and what I'm going to do is inject a black-hole attack — something very similar to the DDoS attack we talked about.
We're going to black-hole the balancereader service, and that's going to give us a semi- or quasi-operational e-banking application — which is something you generally want to avoid. So let's take a look at how we do it. We're just scheduling the workflow and selecting an execution environment — I'm selecting the same cluster where the Litmus ChaosCenter resides — and I go ahead and select the ChaosHub from which I can pick my experiments. You can define your own hub here as well.
If you're in a private environment, you can create your own ChaosHub and pick experiments from there. Then you go ahead and give it a name — I'm going to call it bank-of-anthos-blackhole — and we're just going to pick an experiment: in this case, pod network loss is the instrument I'm going to use to create this attack. Once I've selected the experiment, I can tune it the way I need. I am interested in the balancereader service residing in the default namespace.
So I'm just going to select that. I have the option of validating some behavior as I run the fault, but to keep it simple for this first round I'm just going to say next, and I'm going to keep the fault active for 60 seconds. This is going to be 100% network loss that we're going to inject.
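These are the tunables behind this attack, as they land in the generated ChaosEngine — the env names come from the pod-network-loss experiment, with the values as chosen in the demo:

```yaml
experiments:
  - name: pod-network-loss
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: "60"              # keep the fault active for 60 seconds
          - name: NETWORK_PACKET_LOSS_PERCENTAGE
            value: "100"             # 100% loss: a black hole
          - name: NETWORK_INTERFACE
            value: eth0              # interface inside the target pod
```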
So I'm going to say finish, and I'm going to leave the revert-chaos option off — this is a feature that cleans up the chaos resources, the custom resources created to run the fault, afterwards; I'm going to keep them just to show you the logs. And this is the step I mentioned where you can provide the weightage, or criticality, of your experiment.
I'm going to give it full points. You can run this once, now, or on a recurring schedule; what we're going to do is just run it once now. This is going to give us an Argo workflow underneath — it constructs an Argo workflow with a couple of steps: pull the fault template for network loss, and then actually run the experiment. You can see the ChaosEngine is auto-generated; you don't have to create it by hand.
At this point Bank of Anthos looks healthy and good. Once the fault starts running, you will be able to see that we cannot read the balance or make payments. Like I said some time back, Litmus does pre-chaos and post-chaos checks: the pre-chaos check verifies that the pods carrying the label we just specified are healthy and alive before it actually starts the fault. You can see that the step that triggers the fault has started.
The ChaosEngine schema is quite rich — you can do a lot of things with it. The documentation for that is available here; you can take a look at the concepts section and see all the specifications it contains, and all of that can be tuned. You can set resources, you can define affinity policies for where the pod has to run, and you can inject annotations into it.
You can define, say, the amount of time spent trying to validate whether an application is alive. A lot of things can be tuned, but our setup here is something very simple. The fault is active at this point, so if I refresh the application you can see it cannot read the balance, and if I try to make a payment, that shouldn't go through, because I cannot read the balance to see how much I have in my wallet to actually make the payment.
These are the kinds of things we would really like to avoid in our applications. We need to have the right middleware behavior to direct us to a different replica that is working and ensure things keep working. It is always risky when you have semi-operational applications — for example, I'm able to make deposits, but I can't read how much I have deposited, things like that. So this was a very quick demonstration of how you can inject a fault and how it runs, and you will see that this experiment —
Mario: Yeah, this was wonderful — sorry, I didn't mean to cut you off. I think a lot of people think of down and up; they think in binary terms about how a platform is operating — it's just black and white — and that's not actually how most outages work. Most outages are actually kind of like a brownout: some things are working, some things are not working.

The problem is the totality of maybe some of the critical flows for using an application. This example you used is fantastic: some things might stop working, and things might seem okay, but when a user actually goes to do something, that's when dependency services, other API endpoints that are called, and certain microservices that make up the overall platform are not doing their job as expected. I think that's one of the major use cases for this. And I'm loving this workflow dashboard.

Obviously the schedule interface you have here, and some of the agent and hub stuff, is fantastic — it's all tied together in kind of one unified interface. I think this is the next evolution of what it looks like to be comfortable — feeling like you have the control to test your environment in the way you need to, to really get the correct signal instead of just noise about, oh well, you've got this one pod using lots of resources. That's one of many little things going on, but there's a lot more signal that you need around the flows and what's actually happening in these certain scenarios. So this is fantastic to see, Karthik.

While you click around a little bit, I'm going to ask a couple of lightweight questions, because we have just a few minutes left. I think one of the big ones here that I'm thinking about is: what are some of the key things — what is the next step? You did a great job of talking about what 2.0 is delivering versus 1.0, so what's next on the roadmap? What can we see here over the next few months, as we approach the end of the year? What is Litmus looking to implement that has been something huge that a lot of users have been talking about, or things that you've been asked about — people saying, hey, I'm using this, and if it did X, Y and Z I would be so much more efficient, I'd be able to really nail down certain things? What do those things look like for you and the team?
Karthik: That's a great question. One of the things we're being asked to improve is exactly what you mentioned: you want to see what's happening to your applications. It's not binary; it's generally a brownout, and you want to validate a lot of things, and get insights into a lot of things happening on the cluster even as you run the fault. So probes are one thing we introduced to help do that.
For example, in this case I'm just repeating the pod-delete fault from the 1.x experiment, where we're trying to kill a pod, but along the way we are doing some checks: we are checking whether a downstream application is alive throughout — basically checking that I get a 200 OK against it, at a polling interval of every one second. If it is not, then I would probably like to abort the experiment — you can see stopOnFailure is true. So these are the kinds of checks you can attach.
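A sketch of what that probe looks like in the experiment spec — the schema follows the Litmus probe docs, while the URL and timings here are illustrative:

```yaml
probe:
  - name: check-downstream-svc
    type: httpProbe
    mode: Continuous                 # evaluated throughout the chaos window
    httpProbe/inputs:
      url: http://downstream.demo.svc:8080/health
      method:
        get:
          criteria: ==               # expect HTTP 200 on every poll
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 1                    # poll every second
      retry: 1
      stopOnFailure: true            # abort the experiment on a failed check
```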
These are the probes I can run today. There are also probes that use Prometheus metrics to check for deviation in your steady state — is it within your SLOs or not? So there is support for different kinds of probes that work with the different observability tools in the CNCF landscape today.
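And a sketch of the Prometheus flavor — the promProbe fields follow the docs; the query and threshold are an illustrative SLO-style check:

```yaml
probe:
  - name: p99-latency-under-slo
    type: promProbe
    mode: EOT                        # evaluated at the end of the test
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc:9090
      query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))
      comparator:
        criteria: "<"                # steady state holds if p99 stays under 0.5s
        value: "0.5"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```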
That's one of the things we're being asked for and are working towards, along with more experiments for non-Kubernetes environments — AWS, GCP and so on.
And the VMware environments — these are environments where people are still running a lot of their services, and we have an initial set of experiments there, but there are requests to add more. Folks who are trying to use this in enterprise environments also want support for different kinds of authentication and authorization mechanisms for how people can access the ChaosCenter and run it.
That is one of the things we're looking to add to the roadmap as well. And there are other fault types within Kubernetes that we can add — using a base set of faults, you can actually create thousands of scenarios.
Litmus provides you a good set in the ChaosHub, but we are committed to increasing that set of faults for Kubernetes and non-Kubernetes targets, and also to providing a better resilience view that people can make use of. Analytics and resilience scores are great, but there are other things people would like to know about when there are experiments running — some probes are a step in that direction. People also want to know things about how their recovery went: how much time did it really take?
Mario: That is all fantastic. We just have one more minute, and I want to end on a strong note for people that are looking to dip their toes in and get started — and maybe even evangelize this a little bit in their organization, or play around in a kind of lower-end environment like a dev environment or a playground. What would you say is the best way for them to get started understanding chaos engineering and starting to leverage it in their day-to-day, on their laptops, whatever they're trying to do? What are some resources?
Karthik: Yeah, I think we have done a fair bit of refactoring on the docs as part of the move from 1.x to 2.0. This is the docs site, docs.litmuschaos.io; there are a lot of resources here that you can use to learn. Some of it is still coming in, but there is a good set of concept docs
that you can use here to learn about Litmus, and you also have the experiment documentation: a lot of information about how you can run each experiment, the different tunables it provides, and how you can run it with different options. For example, when we talk about a given fault,
there are different ways in which you can run it, a lot of which is explained here. So I would recommend a couple of good resources: the docs, and the pages of the repository itself — that's where you can find information about Litmus. When it comes to general information about chaos engineering, you can take a look at the Principles of Chaos,
and CNCF has just gotten started with a Chaos Engineering Working Group, which is actively trying to put together information about chaos engineering for beginners, and for practitioners who've been doing chaos engineering for a long time and are jumping into the cloud native world, looking at doing chaos engineering the cloud native way. We meet once every two weeks, and we are trying to put together a white paper that talks about chaos engineering.
There is some information there, for example, if you are looking at common terminologies associated with chaos engineering: this dictionary that we're creating is not really an alphabetically sorted glossary; it's more about chaos engineering as you learn it. You start with the principles, and then you talk about what an experiment is and how to understand each part of it — blast radius, hypothesis, SLIs, SLOs — and how you can practice it as an SRE: how you can conduct game days. That's information we are trying to expand upon. So this is probably a good space to look at.
Mario: For sure — yeah, this is perfect. I did not know there was a working group; that is amazing to hear. Thank you very much, Karthik — so much great content today. You really did some amazing demos as well, and the demo gods were clearly with you. litmuschaos.io — and ChaosNative is the company. Thank you to Karthik today, and thank you to the people working behind the scenes. My name is Mario Lauria; it's been a great pleasure to host today's session. Talk to everybody later — have a great rest of your day.