Description
Blog: https://everyonecancontribute.com/post/2021-06-23-cafe-35-litmus-chaos-engineering-kubernetes/
Litmus: https://litmuschaos.io/
Twitter thread: https://twitter.com/dnsmichi/status/1407731465509654530
A: We're now live on YouTube for our 35th Everyone Can Contribute cafe, and today we're not breaking something by accident; we're breaking things on purpose. We're doing something around chaos engineering, and I'm super happy that we have our friends from Litmus here, showcasing, or rather teaching us, what is new and what is coming in chaos engineering, what it is, why we want to use it, and many more insights.
B: Sure, Michael, thank you so much for introducing us, and thank you so much for inviting us to this Everyone Can Contribute show. We are really, really glad to be a part of it. With the cloud native world having come so far, and chaos engineering as a technology having come so far, it's an honor to present in front of this audience.
B: First, I'd like to talk a little bit about how this journey has come together and how the community has responded to it. Back when I started last year, the number of users of cloud native chaos engineering was pretty small; we had a community of 40 to 50 members. Chaos engineering adoption in general was picking up, but the cloud native chaos engineering segment still had few users and wasn't getting that much traction.
B: But a few days ago I saw a hundred thousand installations, and a lot of folks taking keen interest in the community, in the events that are happening, and in the support from CNCF, KubeCon, and all these things. Today I can claim that chaos engineering as a technology has come really far and has become an interest of many, many people, not just SREs. Initially we thought it was an interest of just the SRE persona, or the testing side.
B: That's right, thanks. So we'll be starting off. As you can see, the presentation at hand is "Introducing LitmusChaos: Cloud Native Chaos Engineering", and we are glad to be a part of the GitLab Everyone Can Contribute cafe. I hope the folks who are joining in will have a good run through this presentation and the demos that we have. Starting off, I'll just introduce myself.
B: My name is Prithviraj, and I'm working as a community manager at ChaosNative, which just spun off from a company called MayaData. I had been at MayaData since last year, working as a community manager for the LitmusChaos project, which is, as you know, a CNCF sandbox project. Alongside that I work as an organizer or co-organizer of meetups and events; as you can see, Chaos Carnival is an event based on chaos engineering.
B: Chaos engineering has grown so much that there's an event around it, and a lot of folks, from Red Hat to Intuit, from different companies around the world, came in and talked about various chaos engineering use cases. Then we have our chaos engineering meetups, which happen every last Saturday of the month, so folks who want to join in can come and join the chaos engineering meetups.
B: We pretty much have an informal session, or some agenda items, then pick up Q&As and try to bring this journey of chaos engineering adoption further. The next community meetup, the Bangalore edition, is happening this Saturday, in case anyone wants to join. Kubernetes, again, is seeing adoption at a level that even chaos engineering hasn't reached yet: where Kubernetes has crossed the chasm into major market adoption, chaos engineering is still coming up.
B
So
moving
on
the
agenda
that
we
have
in
hand
today
I
mean
there
are
a
lot
of
things
we'll
be
talking
about,
but
starting
off
we'll
be
introducing
chaos.
You
you
can
ask
your
questions.
You
can
put
your
questions
in
the
stream
chat
as
well,
and
then
we'll
move
on
to
cloud
native
chaos,
engineering
and
then
introduce
litmus
to
you.
B
So
what
is
chaos?
Engineering?
That's
that's
what
you
all
wanted
to
learn.
What
is
chaos
and
how
did
it
come
up
so
I'll
just
start
with
the
story
and
then
perhaps
I'll
move
on
to
what
chaos
engineering
is
so
so
in
india,
or
you
know
around
the
world,
there's
there's
this
black
friday
sale
around
the
world
or
in
india
we
have
great
indian
festivals
or
a
big
billion
day
sales.
B
These
are
these
are
huge
sales
organized
by
flipkart
or
amazon,
these
these
companies
and
what
we
noticed
was
when,
when
back
when,
I
was
not
introduced
to
chaos.
What
chaos
is,
I
used
to
wonder
how
these
spikes
are
actually
causing
an
outage.
What
exactly
is
an
outage,
and
why
are
resiliency
goals?
Not
being
met
in
spite
of
so
much
being
spent
on
quality
and
assurance
penetration,
so
white
box,
black
box,
so
many
types
of
testing
is
around
there.
But
still,
why
is
this
one
factor
which
is
not
available
to
meet
these
resiliency
goals?
B: We'll talk about it. But moving on: let's not just talk about the spike in users, but about what it actually causes, which is downtime. Your system goes down, and there's not just a loss of time but a loss of money as well. Billions and billions of dollars are lost by a huge enterprise if the system goes down due to spiking users or due to some outage that happens.
B: With these systems, it was pretty essential to actually understand how they are functioning. So downtimes are expensive, and that is where chaos engineering came into play. What is chaos engineering? It's nothing but deliberately inducing an outage in the system, in a controlled way, so that you understand what a future vulnerability or outage could be: how exactly your system reacts when there is an actual future outage in production. And when the system goes into production, that is where the DevOps feedback loop is activated.
B: That's something we learned through the chaos-first principle. A lot of people ask: what is the chaos-first principle? It's nothing but this: why wait for an outage? Why not test first? Test your systems first, before moving into production. That is how chaos engineering came into play.
B
So
what
has
been
the
state
of
chaos?
Engineering
till
as
of
now,
as
I
mean
there
are
standard
practices
that
exist.
It
started
all
all
of
this
started
with
chaos
monkey,
but
it's
it's
still
a
limit.
Not
everyone
is
practicing
chaos.
Not
everyone
has
moved
on
to
practice,
chaos
and
what
exactly
chaos
is
doing
is
still
the
the
amount
of
unawareness
that
is
out.
There
is
something
which
is,
I
mean
mind-boggling
for
me,
but
it's
very
important
for
you
to
understand
how
how
to
start
chaos,
engineering
practices-
or
this
is
very
important.
B
This
is,
this
should
not
just
be
limited
to
experts
or
large
deployments,
but
this
should
come
as
a
practice
for
each
and
every
one
whose
looking
forward
to
resiliency
so
as
of
now
those
who
have
burned
their
hands,
those
who
have
gone
through
an
outage
for
them.
Chaos,
engineering
has
been
a
resolution
and
you
can
see
a
lot
of
folks
from
amazon
to
netflix
to
disney
hbo.
These
these
folks
have
started
working
on
chaos
already.
There
are
some
chaos
engineering
stories
already
out
there
from
them.
B
So
how's
it
done.
Typically,
usually
it's
it's
done
in
some
these
ways,
but
this
is
not
what
it
should
be
limited
to.
This
is
not
what
it
should
be.
You
know
just
should
be
the
ways
to
follow
chaos.
I
mean
they
are
done
through
chaos
engineering
game
days.
Usually
there
is
a
game
day
where
one
of
the
applications
is.
I
mean
chaos.
Experiments
are
run
in
in
one
of
the
applications
and
a
few
kiosk
tests
are
carried
out
to
see
how
how
the
application
reacts
it
can
be
specific
to
to
the
company.
B
It
can
be
specific
to
the
use
case,
and
basically
this
is
how
usually
the
chaos
engineering
experiments
are
done,
and
then
it's
rarely
integrated
to
the
ci
or
the
cd
of
your
system.
It's
it's.
I
mean
everyone
hasn't
even
thought
about
chaos
in
your
ci
cd.
Why
not?
I
mean
case?
Chaos
can
be
anywhere.
It
can
be
in
your
ci
cd
in
in
your
developers,
environment.
It
can
be
in
your
pre-staging
staging
production
environment,
so
kiosk
engineering
has
to
come
in
everywhere
in
some
form
or
the
other.
B
As
I
said
before,
the
start
of
the
presentation
that
only
sres
are
focusing
on
chaos.
As
of
now
the
developer.
Persona
is
not
still
not
adopted.
Chaos
in
a
huge
way.
Everyone
has
not
adopted
or
engaged
in
chaos,
been
practicing
chaos,
engineering,
but
slowly
slowly.
It's
it's
coming
up
and
I
mean
there's
a
manual
planning
and
execution
that
is
going
on,
but
you
have
to
understand
that
chaos.
Engineering
is
also
a
part
of
automation.
It's
it's.
I
mean
with
systems
getting.
B
I
think
you
need
you'll
be
able
to
run
automated
chaos
as
well.
That's
that's
the
goal
eventually,
so
the
manual
planning
and
execution
is
good,
but
slowly
you
need
to
move
on
towards
the
automated
way
of
doing
it.
Where
you
can
schedule
them,
you
can
create
workflows.
You
can
see
how
various
chaos
tests
together
function
and
observability
is
key.
I
mean
you
have
to
see
what
is
happening
to
your
system
right.
It's
it's
not
a
commodity.
It
is
actually
every
enterprise
that
everyone
has
to
see.
What
is
exactly
what
is
going
down?
B
What
is
going
up
and
how
is
your
system
being
affected
when
a
chaos
test
is
induced?
And,
lastly,
I
mean
there's
a
lot
of
things
to
see
your
results
and
how
they
increase
your
reliability.
I
mean
there's
a
custom
measurement
process
to
managing
all
these
things,
but
the
typical
way
needs
to
change.
That
is
what
the
agenda
is
of
talking
about
all
these
things.
I
mean
that
these
typical
practices
are
good.
I
think
the
adoption
which
has
come
in
has
been
good,
but
this
typical
way
has
to
change.
B
I
mean
more
and
more
tests
need
to
be
curated
and
you
need
to
create
more
and
more
use
cases.
I
mean
write
your
own
experiments,
think
about
the
system
and
and
it
should
come
out
in
the
community
as
well.
I
think
this
event
talks
about
that.
I
mean
such
such
streams
and
such
events
are
very
important.
So
what
are
the
benefits?
What
what
exactly
are
the
benefits
of
chaos?
Engineering,
I
mean
the
first
and
foremost.
Is
you
run
your
services
without
an
outage
chaos
tests
help
you
and
you
you
test
repeatedly.
B
It
is
a
process,
it's
just
doesn't
happen
once
you
you
test
repeatedly.
What
happens?
Is
you
need
to
understand
the
steady
state
of
your
system?
What
exactly
is
the
steady
state
and
how?
How
does
it
function
and
then
you
need
to
run
your
services
without
an
outage.
You
need
to
actually
take
a
look
at
repeated
testing,
so
I'll
just
perhaps
take
you
through
just
just
give
me
a
sec
I'll
I'll,
take
you
through
something
which
which
shows
how
how
it
is
tested.
Basically,
a
workflow.
B: ...and running alongside that, it can be MongoDB; in the cloud native world you run all these applications, like Kafka or TiKV, and then come the services you know: CoreDNS, Prometheus to get your metrics, or storage, where OpenEBS is one example. Then there are the Kubernetes services in an application, and then come the platform services. So it's like a pyramid, right?
B
It's
not
just
your
application,
but
there
are
various
layers
which
are
also
functioning
alongside
that.
So
what
is
very
important
is
that
each
layer
should
you
know
the
resiliency
of
each
layer
should
be
maintained.
Each
layer
should
be
tested
so
that
all
the
components
and
infrastructure
are
strong.
All
the
components
and
infrastructure
become
reliable
because
resiliency
doesn't
just
depend
on
one,
but
it
depends
on
all
the
components
and
how
is
it
done
so?
Basically,
if
you
can
see,
I
I
mean
you,
you.
B
The
steady
state
you
identify
what
is
the
steady
state
of
your
system?
I
think
I
mean
this
diagram
is
going
here.
I
don't
know
why
and
then
you
introduce
a
fault,
you
introduce
one
test.
It
can
be
anything
I
mean,
for
example,
just
giving
an
example
in
a
kubernetes
application.
It
can
be
a
part
delete,
so
you
introduce
a
port
delete
experiment
and
then
you
see
if
the
steady
state
conditions
are
required
or
not.
These
are
part
of
the
principles.
B
The
principles
of
chaos
tell
you
that
first,
you
have
to
identify
what
are
the
steady
state
conditions
of
your
system
and
if
the
steady
state
conditions
are
regained,
then
yes,
your
system
is
resilient,
but
you
go
on
testing.
It's
not
it's
not
a
single
step
process.
Then
you
inject
a
new
fault.
It
can
be
upon
cpu
hog.
B
Just
another
example
that
if
a
body
elite
experiment
works,
then
you
induce
another
fault
to
identify
the
steady
state
conditions
and
know
if
if
your
system
is
not
resilient,
then
there's
a
weakness
found
you
fix
it
and
then
you
again
introduce
a
fault,
and
this
is
a
process
which
continues
going
on.
This
is
a
repeated
step
and
because
you
never
know
when
in
an
outage
occur,
so
this
this
is
how
a
chaos
engineering-
this
is
basically
a
diagrammatic
representation
of
the
principles
or
how
your
system
gets
resilient,
with
chaos
being
induced
into
your
system.
B: Service-level agreements and all these things are very, very important for running your services to meet your business goals or your individual goals. We just had external confirmation of how important SLOs have become and that everyone is thinking about them. So chaos engineering and SLOs go hand in hand: you're able to meet your business-level SLAs and SLOs by running these chaos tests. And then scalability, as I talked about: your services are scaled...
B
According
to
these
tests,
I
mean
these
tests
help
you
understand
the
resiliency,
and
that
is
how
you
scale
your
services
and
earn
demand
and
then
upgrade
them.
According
to
your
requirement,
however,
you
want
add,
adding
more,
I
mean
systems
to
it
and
doing
it
without
an
outage.
That
is
what
the
benefit
of
chaos
is
and
why
we
talk
about
it.
As
of
now
it's
I
mean
there
are
a
lot
of
projects,
as
you
can
see
these.
B
These
are
the
projects
out
there,
there's
kremlin
there's
vmware
manual
and
then
some
open
source
projects
as
well
like
yours,
mesh
kiosk,
plate,
cube
invaders,
chaos
cube
and
then
also
on
the
enterprise
side
of
things.
Also,
there
it's
a
trending
technology.
Cncf
has
called
it
one
of
the
top
five
technologies
to
look
forward
to
in
2021
and
with
all
these
projects
coming
out,
I
think
the
the
options
are
pretty.
B: The chaos engineering ecosystem is developing fully. So why exactly has chaos engineering been given so much focus? Because reliability is a budding challenge. And where does chaos engineering come into play in these container-level challenges? I think all of them have some role for chaos to play. Complexity, for one...
B
Of
course,
chaos
engineering
helps
coming
to
understand
where
your
system
is
getting
complex,
then
security
challenges
again.
There
are
32
security
challenges.
If
you
talk
about
reliability
there
again,
12
are
reliability,
container
level
challenges,
so
I
think
and
cultural
changes.
I
think
one
thing
is
we
just
conducted
a
survey
recently,
and
I
mean
people
talked
about
a
lot
of
challenges
they
face,
and
that
was
all
part
of
cultural
challenges.
Someone
doesn't
have
observability
in
place
that
its
chaos
is
not
in
their
road
map
or
their
systems
are
not
ready.
B: So what is cloud native chaos? Karthik will be explaining a lot more about it with Litmus, but I'm just going to point out a few principles that, in our belief, are part of cloud native chaos engineering. According to us there are a lot of principles of chaos, and the majority of them are laid out at a website called principlesofchaos.org; I'll forward the link here as well. As you can see, the principles of chaos...
B
Engineering
are
already
out
there
and
it's
it's
an
open
source
repo.
I
think
he
has
put
up
the
principles
and
these
are
the
major
principles.
You
build
a
hypothesis
around
your
steady
state,
as
I
mentioned,
and
then
you
vary.
Your
real
world
events
run
experiments
in
production
and
then
automate
them.
So
that
is
how
you
it
continues.
And
lastly,
you
minimize
your
blast
radius.
While
you
continue
experimenting
in
production.
B: While we move ahead with the demo and the presentation, we'll go in depth into all these principles of cloud native chaos engineering, how they build its foundation and how they came up, and that will be explained by Karthik; these are mostly the principles which were followed in Litmus as well. So, moving on to what Litmus is: this was the best time to introduce it. Litmus is nothing but an open source toolset, part of the CNCF sandbox, which is used to practice...
B
These
chaos
engineering
practices
in
a
cloud
native
environment
just
not
limited
to
kubernetes
by
the
way
in
case
you're
thinking,
it's
just
limited
to
communities.
It's
it's
also,
I
think
I
mean
the
the
with
the
the
community
demands
coming
in.
It's
it's
more
than
kubernetes
for
aws,
or
I
mean
azure
and
all
the
non
kts
scenarios
as
well.
This
comes
into
play
in
a
very
huge
way,
with
various
experiments
and
obviously,
while
I
talk
about
the
features
it
provides,
I
think
you'll
get
some
more
idea
on
it.
B: As of now there are 58 chaos experiments; you can find them on the ChaosHub, and Karthik will take you through the ChaosHub. These are a few stats. As I mentioned, it helps you identify weaknesses and potential outages by inducing chaos in a controlled way, and Litmus is the early leader, given the amount of adoption we have seen and the number of use cases that have come up with Litmus.
B: We founded it in late 2017, early 2018, while testing another project we were working on: OpenEBS, a project for cloud native, container-attached storage. We realized we needed a toolset to actually validate our resiliency goals; we needed chaos testing back then, and we started writing it out as just a chaos testing tool.
A: Can you maybe share the URL to the Litmus SDK? Is it the litmus-go repository on GitHub?
B: Thanks a lot, thanks a lot, everyone. I think Karthik can share it.
E: All right, thanks. Thanks, Michael. Let me go ahead and share my screen.
E: Yeah, let me introduce myself: I'm Karthik, one of the maintainers of the Litmus project, and I've been having a blast trying to contribute to this project and maintain it.
E
So
what
we'll
do
as
part
of
this
segment
of
this
session
is
we'll
try
to
go
through
the
basic
architecture
of
fitness
and
the
written
version
that
is
being
used
by
folks.
All
over
is
one
dot
x
that
is
1.13.6.
That's
the
version
that's
being
used,
there's
also
some
efforts
going
on
to
come
up
with
litmus
2.0.
E: How do you expect your application or infrastructure to behave when it is in its optimal operational state? Then you inject a fault, right, and you check whether your steady-state hypothesis is met. The steady-state hypothesis you're checking can basically be anything; it could be a lot of parameters.
E: If not, then you've found a weakness, so you go back to the drawing board and fix the business application, or you might fix something in your deployment environment, and you essentially repeat the experiment, see whether things are as per expectation, and then move on to the next fault, the next resiliency test or experiment, as you would call it. So this is the typical flow. Now, why did we do it the Kubernetes way?
E: You can implicitly read that as the community's way of doing things, because Kubernetes is the predominant platform driving cloud native innovation today. Everything on Kubernetes is declarative...
E: Everything is declarative; it is basically YAML, and it is reconciled. And when you come to resilience checks, resilience testing, or chaos intent, you would want that also to conform, to adhere, to this user experience that people are having with Kubernetes. We talked about how Litmus was brought up for the resilience-testing needs of OpenEBS, and OpenEBS, being another cloud native, container-attached storage solution, offered everything in terms of Kubernetes resources.
E
So
when
we
were
trying
to
test
open
ebs-
and
we
were
trying
to
get
the
users
of
open
eb
is
to
test
their
environments
using
weakness,
we
were
trying
to
give
them
that
homogeneous
experience
and
trying
to
put
chaos
intent
in
a
decorative
way.
So
apart
from
this
huge
dependency
tree,
that
you
saw,
the
pyramid
that
was
explaining,
where
in
case
of
kubernetes
and
dense
environment,
contains
your
platform,
services
and
kubernetes.
On
top
of
that,
and
then
you
have
all
your
tooling
from
the
cloud
native
ecosystem
that
you
so
dependent
on
from
your
services
successfully.
E: Then you have your application dependencies, database, storage, etc., and then you have your middleware and application services, and the front-end, user-facing services. There are so many points of failure, so that is one reason for doing chaos. There are more good reasons for doing chaos in Kubernetes environments, better done in pre-production environments, and when you do that, you would like to do it in a cloud native way, using a declarative approach. I have mapped out a couple of diagrams here.
E: So there needs to be a way to define your steady-state hypothesis, or application validation, as part of your experiment in a declarative way, not just the fault. Then you put all that information inside a resource which you can consume in an easy way, in another CR, and you basically go ahead and generate useful reports. That's what we try to do with Litmus, and at the heart of Litmus are some chaos resources.
E
We
will
take
a
look
at
them
as
we
do
our
demonstration
in
a
second
chaos.
Experiment
is
a
template.
It
is
pre-built
and
it
contains
granular
definition
of
your.
E: ...chaos intent, and the experiments run as Kubernetes jobs, like I was saying. All the Kubernetes chaos experiment templates are available in this ChaosHub; you can pull them, and there are categories defined here. "Generic" contains most of the experiments that you would run on a day-to-day basis on standard Kubernetes.
E: ...and then there is the SDK. The ChaosResult holds information about your run: what happened to your hypothesis intent, what the current state of the test is if it's a long-running one, and what the verdict of the experiment is. The verdict is something that is optionally consumed: if you are doing your experiments in a freestyle, exploratory manner, you don't need to depend on the verdicts that Litmus provides so much...
E: ...you have your own observations to make. But if you're running it in an automated way, you would like to know what happened to the experiment, whether the expectations were met or not, and that's where the verdict is going to be useful. There are some other, auxiliary CRs, like the ChaosSchedule, which helps repeat experiments on a schedule that you would like.
E
You
can
define
start
or
end
time
stamps
or
you
can
define
miranda's
intervals.
You
can
do
it
strictly
scheduled
or
randomly
etc.
So
the
the
flow
of
the
litmus
one
dot
x
is
something
like
this.
As
a
devops
engineer
or
an
sre
or
a
qa
or
a
developer
or
whoever
the
persona
is
you
have
your
application
on
your
cluster?
E: You pull the experiment from the ChaosHub, you create the ChaosEngine to apply this experiment against a particular application, or maybe some infra component like a disk or a node on your cluster, and then you apply the ChaosEngine. The chaos operator is then going to create some child resources: the experiment runner and the job.
E: All right, so yeah, the experiment CR looks like this. You have an image here that actually runs the chaos, and you can see spec.definition, which holds information about the experiment. spec.definition.permissions just states what minimum permissions are necessary to run this experiment; it's indicative, and you can create an RBAC using that information.
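For orientation, a skeleton of such a ChaosExperiment CR might look like the following. This is a sketch modeled on the pod-delete experiment; see the ChaosHub for the authoritative manifest, and treat the image tag and env values as illustrative:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    permissions:                 # indicative minimum RBAC for the run
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list"]
    image: "litmuschaos/go-runner:1.13.6"
    imagePullPolicy: Always
    command: ["/bin/bash"]
    args: ["-c", "./experiments -name pod-delete"]
    env:                         # bare-minimum tunables, overridable from the ChaosEngine
      - name: TOTAL_CHAOS_DURATION
        value: "15"
      - name: CHAOS_INTERVAL
        value: "5"
      - name: FORCE
        value: "false"
```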
E
There
are
some
tooling
on
the
right
side
that
can
help
you
do
that,
but
you
could
create
it
yourself
and
then
you
could
define
how
the
job
runs.
What
is
the
image
that
is
going
to
be
used?
What's
the
pull
policy
and
there's
some
entry
point
here
and
then
some
minimal
environment
variables
provided
the
bear?
Minimum
ones
are
necessary
ones
that
are
just
enough
to
carry
out
the
chaos.
The
mandatory
inputs
are
provided
here,
which
you
will
eventually
override
from
the
chaos
engine,
because
that's
the
instance
thing.
E
That's
the
dynamic
entity
that
you're
going
to
create
on
a
per
chaos
basis.
This
is
something
that's
global
to
the
name:
space
global
at
the
name,
space
level-
or
you
could
put
it
at
a
cluster
wide
level
as
well.
So
in
case
of
air
gapped
environments,
you
could
have
your
own
images
coming
from
your
own
registries,
and
these
crs
can
be
maintained
in
your
own
repositories.
E: You can just create a private fork of this repository, called chaos-charts. Let me just show you: the hub here is a canonical front end for the manifests stored in chaos-charts. You can clone this and maintain your own set; it's heavily inspired by Operator Framework's OperatorHub. We have the experiments defined here in their respective folders inside each category.
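Cloning the repository is a one-liner; the per-experiment layout sketched below (experiment CR, recommended RBAC, and card metadata) follows how the hub renders it, though exact file names may vary by release:

```shell
# Fork or clone chaos-charts to maintain your own set of experiment manifests.
git clone https://github.com/litmuschaos/chaos-charts.git
ls chaos-charts/charts/generic/pod-delete/
# -> experiment.yaml, rbac.yaml, engine.yaml, chart metadata, ...
```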
E
The
generic
category
contains
the
experiment
folders,
and
there
is
an
experimental
that
you
can
pull
and
there
is
a
recommended
r
back
that
you
can
use
to
run
this
experiment
that
is
going
to
be
consumed
by
the
job.
That
runs
the
experiment
and
there
is
just
some
metadata
files
here
to
render
the
information
that
is
on
the
chaos
all
the
metadata
that
you
see
within
a
card.
So
you
could
run
this
by
pulling
these
manifests
into
your
own
repositories.
C: Yes, thank you. One other question before we start: the operator and the jobs it spawns, are they privileged? Do they need extra rights? Because I saw some host-level experiments, so are they privileged, or...?
E: Great question. Some of them are not: some of these experiments use purely the kube API and just go ahead and run without needing to be privileged containers.
E: Then there are some experiments, for example the ones that make use of certain low-level utilities on the host, or within the system, to inject network loss or stress chaos. For example, you're trying to simulate resource exhaustion within a pod by running some stress processes, CPU-burn or memory-burn processes, or, in the case of network experiments, you're trying to create egress packet loss.
E: To do that, we make use of the runtime APIs, containerd, Docker, or CRI-O, and for that we need to be living on the same node where your target application resides. We inject these chaos processes into the network namespace of the target pod, or the process namespace of the target containers, and in doing so we need to be able to run them in a privileged manner.
E: So those are the few experiments which do need privilege escalation for the jobs that run. Most of the others, the pod-delete ones, and the node-related experiments that drain nodes or cause eviction, things like that, don't really need privilege escalation. Does that help?
E: Yeah, and since we're talking about privilege escalation: one of the differentiating features of Litmus is the fact that, as you can see in this architecture diagram...
E: ...there might be some gatekeeper policies, which we may document in time, that help you create security policies and RBACs that make use of that security policy, rather than a service account you might be using for other purposes on your cluster, and thereby find means to minimize the impact of running something like this.
E: Okay, coming back to the demonstration of Litmus 1.x: we're going with the "hello world" of chaos, so to say. The pod delete is like the hello-world program of chaos engineering, a very popular test. In the later part of this demo, if we have time, we can show how effective it can be by adding the right kind of intelligence into the experiment. Deleting a pod, simple as it sounds, can unearth a lot of impact on the cluster or on the applications.
E: Right now I just have a GKE cluster with me, a single-node cluster, and I'm going to install the operator. I've come to the Litmus docs at docs.litmuschaos.io; this is the getting-started section here.
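The install step boils down to applying the version-specific operator manifest from the docs. A sketch, assuming the 1.13.6 manifest URL pattern; check docs.litmuschaos.io for the current command:

```shell
# Install the Litmus chaos operator (cluster-wide mode); it lands in the litmus namespace.
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.6.yaml
kubectl get pods -n litmus
```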
E: If you are in some kind of service model where you would like the admin persona to put the Litmus operator there, individual service owners or developers can create their own service accounts to run their own chaos; that's the model we're going to use right now. And the operator itself can be installed in a cluster-wide mode, like we did now, or in a namespace mode, where it works only within a particular namespace.
E
That's
also
supported
now
that
I
have
got
I'm
just
going
to
verify.
If
my
crds
are
there,
my
api
resources
are
created
sometimes
on
some
clusters
or
some
distributions.
It
takes
times
for
time
for
this
api
endpoints
to
be
available.
Now
that
we
have
them.
My
next
step
would
be
to
pull
the
experiment
cr
from
the
chaos
hub,
I'm
just
going
to
pull
them
from
the
hub
dot
atmospheres,
dot,
io
and
I've
just
been
the
cube
ctl
command.
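Roughly, the verification and the hub-provided pull command look like this (the hub URL is the pattern rendered by hub.litmuschaos.io for the 1.13.x generic chart; adjust the version and target namespace to your setup):

```shell
# Verify the chaos CRDs and API resources are registered.
kubectl get crds | grep chaos
kubectl api-resources | grep litmus

# Pull the pod-delete ChaosExperiment CR into the application's namespace.
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.6?file=charts/generic/pod-delete/experiment.yaml" -n nginx
```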
E: All right, so I have an application; I just selected nginx, because it's a popular Kubernetes deployment. I have this pod which I'm going to try to kill; that's what we're going to do, and I already created this nginx ahead of time. We have a note here that asks us to create a sample application, and I've already installed the experiment, as I showed you. My next step is to create an RBAC for running this pod delete.
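The RBAC he applies is modeled on the recommended rbac.yaml shipped with the pod-delete experiment; here's a trimmed sketch (use the one from chaos-charts as the source of truth):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: nginx
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: nginx
rules:
  - apiGroups: ["", "batch", "litmuschaos.io"]
    resources: ["pods", "deployments", "pods/log", "events", "jobs",
                "chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: nginx
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: nginx
```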
E: That part is left to the user and the use case. I have this file copied onto my harness machine, so I'm just going to go ahead and apply it, and that's going to create my service account in my nginx namespace. I'm just going to confirm that.
E: You can see that we're running the pod-delete experiment, the fault, against an application instance as defined by the label, namespace, and kind. There are a lot of options you can provide here. The auxiliary app info is there in case you want to check whether some other application, beyond the one you selected here, is good and ready and healthy; sometimes you want to do those checks on downstream applications. And you can provide the service account of your choice.
E
We've
selected,
the
one
that
we
just
created
and
the
job
cleanup
policy
is
just
to
state
whether
you
want
to
retain
the
chaos
parts,
the
job,
essentially
that
rand
default
or
you
can
clean
things
up
automatically,
and
then
we
have
some
env
here.
These
are
the
overrides
for
the
tuneables
we
saw
within
experiment
cr.
I
want
to
do
a
few
iterations
of
the
kiln.
Sometimes
may
not
be
so
meaningful,
but
this
is
just
for
illustration
purposes.
E
E
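Put together, the ChaosEngine being described looks roughly like this (field names per the Litmus 1.x API; the nginx label and the durations are illustrative):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: nginx
spec:
  appinfo:                      # the application instance under test
    appns: nginx
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  jobCleanUpPolicy: retain      # keep the chaos pods around after the run
  experiments:
    - name: pod-delete
      spec:
        components:
          env:                  # overrides for the tunables in the experiment CR
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL   # seconds between successive pod kills
              value: "10"
```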
E: And you'll be able to track the kill of the nginx pod. You saw that it was already killed; it has come up again just now, but since we gave more than one iteration of pod delete, it's going to keep doing that a few more times and then complete. You could create similar chaos engines for a wide variety of faults: resource chaos like CPU or memory hog, or the network suite of faults like loss, latency, or corruption, etcetera.
E: You could eat up some disk space, you could do pod IO stress, and you could do things at the node level. Node-level resources can be eaten up too; there are some very subtle differences between the pod-level and node-level ones, in terms of how the resources are consumed and the use cases where they're employed. And you could also drain your nodes, or taint them for eviction and move all the pods out of that node.
E: ...something like an ungraceful loss of the node. You could do restarts, DNS errors, etcetera. So these are the kinds of faults you could run with a very similar approach to the one I showed. And now the experiment is completed; these pods continue to live here because we selected "retain" as the cleanup policy, and the chaos result can be checked here.
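The verdict lives in the ChaosResult CR; assuming the usual `<engine-name>-<experiment-name>` naming convention, it can be inspected like this:

```shell
# Phase, verdict, and probe status of the run.
kubectl describe chaosresult nginx-chaos-pod-delete -n nginx
```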
E: It basically says that the phase of the experiment is "completed" and the verdict is "pass". So you may ask: on what condition did we call it a pass? Right now, by default, Litmus is going to check whether the fault was injected successfully, and whether the application that was subjected to the fault is ready and available after the fault ceases to be injected. By "ready and available" we mean the pod phase should be running...
E
The
containers
listed
within
the
part
should
be
ready,
and
once
that
is
satisfied,
we
go
ahead
and
say
it
is
pass,
but
you
can
define
complex
conditions
here,
which
we
will
see
in
later
part
of
this
demo,
where
you
can
look
for
custom
conditions
around
kts
resources
or
maybe
some
rest
calls.
You
want
to
make
to
check
some
health
of
applications
of
downstream
applications,
or
you
might
check
some
metrics
that
you
are
trying
to
get
on
prometheus
or
some
such
observability
mechanism
and
use
all
that
information
to
arrive
at
this
passive
page.
E
And
then
you
also
have
some
history,
and
this
is
the
resource-
that's
actually
going
to
be
parsed
or
queried
by
the
chaos
exporter,
which
is
another
component
which
will
provide
prometheus
matrix
around
experiments,
and
this
is
the
source
for
all
that
information,
all
the
metrics
that
it
generates.
So
this
is
what
we
had
in
one
dot
x
and
what
we
try
to
provide
is
a
declarative
way
of
putting
out
your
chaos,
indent
and
preview
mention
something
about
by
oc.
E: ...you can create your own experiments, your own faults: wrap them up in an image, define that in a ChaosExperiment CR, test it out as a job, and then get it orchestrated by Litmus. That's essentially how you would do BYOC. So we have a declarative construct for defining chaos intent and for gathering information about what happened during chaos, and it basically makes it simple to run chaos engineering for Kubernetes. But there are a lot of requirements on top of this that you would need, right? Let me go back to my slide.
E: Sometimes you would want to simulate complex scenarios. So how do you do it, how do you stitch faults together? That is one thing. Here's another: let's say you want to run some benchmark or load job or performance test as you do chaos, and you would like to see what's happening under those stressful conditions. You don't want to inject chaos under idle, utopian conditions of the app; you want to do it when there are active loads.
E: How do you factor that in, how do you weave those capabilities into an experiment? And then there is this need for visualizing chaos: not just running it, but knowing what is running, how it is running, and what's happening as you run the experiment. We talked about observability, and observability is really important for deriving benefit out of an experimentation process; you need to know what's happening when you run chaos and how the application is behaving.
E: You need to be able to visualize that. You need the chaos framework to generate observability information, and also to consume the observability information that gets generated, so that you can correlate that data with observability you already have set up around your application or infrastructure.
E: Say I have a fleet of clusters and I want to do chaos against all of them, manage them centrally, and also visualize what's happening with them; compare chaos reports over a period of time; associate an experiment with some quantifiable resiliency metric, and thereby associate it with my applications. A three-way association.
E: You have your application or infra, you have your experiment or fault, and then the resilience metric that relates these two: your application or infra is resilient to this degree against this chaos workflow or scenario. You want to be able to do that; it's a more useful story that you can use in your organization. All these requirements prompted us to come up with the next version of Litmus, Litmus 2.x, which is in advanced beta right now.
E: The agents are the ones that actually carry out chaos: they speak to the portal, get information on what chaos to do, and then go ahead and do it on your clusters. You can connect, or register, your clusters to a centralized portal through the agents and get a single pane of glass for managing all your chaos environments, so to say. And we also have Prometheus metrics that are collected from each of these clusters or namespaces.
E: Depending on what kind of target you connect to your portal, you can connect either an entire cluster, which means the agent is in cluster mode, or a namespace, which means the agent is in namespace mode. The agents collect some metrics around your experimentation; they derive these metrics from the ChaosResult resource and populate them, and you can basically scrape them with Prometheus and instrument...
E: ...your Grafana dashboards to see what happens to your applications as chaos is happening; you could use Grafana annotations and things like that. And with this model of execution we also started supporting chaos against non-k8s applications. So let's say you have a hybrid infrastructure, where you have some legacy components and you also have some k8s services that you would like to do chaos on.
E: ...we can use the AWS SDK or the Google Cloud SDK and go ahead and do out-of-band chaos against instances living on those clouds. In case you have a hybrid environment, or you have some services residing on vanilla EC2 instances or GCP instances, you could run them from the portal: the experiments, or jobs, will run in Kubernetes, but the impact, the subject of chaos, will still be outside, and we make use of the APIs provided by those providers to carry out...
E: ...the chaos. So these are some of the things we can do, and this is just another schematic which is trying to reinforce the same thing. The architecture here: you have the Litmus portal, which comprises the GraphQL server, with MongoDB to store the chaos state of your workflows; then you have the auth server.
E: Then you have the agents sitting on your cluster; the green block here represents the cluster where chaos is happening. The subscriber speaks to the Litmus portal and creates the chaos workflow resources, and this workflow resource, which is essentially an Argo workflow, embeds the Litmus ChaosExperiment and ChaosEngine resources within it; they are applied as part of individual steps within the workflow. And you could string the chaos engines or chaos experiments together in different orders, in parallel or in sequence, etcetera.
E: When you apply that, the operator, which is also one of the installations that happens when you do the cluster registration, along with the other agents, takes care of reconciling this ChaosEngine, and the same process that we saw in the 1.x demo is going to play out.
E: Awesome, thank you. So I've just opened up the Litmus portal documentation here; it's in the Litmus readme, the Litmus repo is litmuschaos/litmus, and I've gone inside the litmus-portal folder. I'm selecting this particular command to install my Litmus beta, and before that let me just go ahead and create a namespace called litmus and apply the manifest. As you can see, it has created these control-plane components, which I'm going to check by looking at the pods in this namespace.
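A sketch of that install sequence; the manifest path here is an assumption, so use the exact command from the litmus-portal README for your version:

```shell
kubectl create ns litmus
# Apply the 2.0 beta portal manifest (path illustrative; see the litmus-portal README).
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/master/litmus-portal/cluster-k8s-manifest.yml -n litmus
# Watch the control-plane components come up.
kubectl get pods -n litmus -w
```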
E
The
last
time
we
did
something
similar
on
one
dot
x.
You
could
see
that
we
had
just
the
litmus
operator,
and
this
is
the
cluster
on
one
dot
x.
I
had
this
operator,
I'm
just
going
to
close
this
now
for
ease
of
navigation,
so
we
have
the
litmus
parts
that
are
going
to
come
up
and
we
also
have
these
services,
which
we
are
going
to
make
use
of
by
default.
It
is
going
to
use
node
port
output
is
not
recommended
for
production.
E: It's going to take you through a very, very simple set of steps to configure the Litmus portal. Typically you will be asked for some user information and to reset the password, things like that; I'm going to use the default credentials, and I'm going to keep this namespace in watch mode so that I can see what new things get installed.
E: ...as I proceed with my project setup. I have a little bit of a slow internet connection, so it's probably going to take some time to load; just bear with me and give me a few more minutes. Once the portal comes up, we're going to set up a project. Each user is given his or her own organization, or project, within the Litmus portal, into which you can actually invite some team members, and when you set up the project, the cluster on which the portal is installed...
E: ...that is this one, the GKE cluster where I have the portal installed, automagically registers itself as a target of chaos: the agents get installed the moment you set up the project. That means you could start doing chaos against some applications that are already residing in this cluster, but you could also connect external agents to the Litmus portal using litmusctl, which is a very useful CLI tool that helps you install agents on remote clusters and subscribe them to the portal.
E: So let me go ahead and use the default credentials to log in to the Litmus portal; the defaults are admin and litmus. Once I get in, I can take a look at the project dashboard, and you can see there are some tabs on the left-hand side.
E: The workflow is the unit of execution of chaos in the Litmus portal. In Litmus 1.x it was the ChaosEngine, which you are all aware of; the ChaosEngine is what we basically ask users to create, and while that remains the central piece even in 2.x, by default the workflows are the overarching manifest that is going to do your chaos, with all the dependencies baked in and a lot of faults stitched together. You have the option...
E: ...you will also have the option of creating a pure ChaosEngine using the portal, without selecting an Argo workflow; that's coming up in beta 9, an upcoming version of the beta. So, going back: you can see I don't have any workflows right now. I can schedule one, and I have this ChaosHub embedded inside the portal; you saw the ChaosHub on the outside, but you have the same thing available here.
E: You can provide that here: through an access token or SSH you can connect a private hub, which will get rendered here, and then you can pick from it to construct your workflows and select faults for a given workflow. And you can see that there is an agent that's automagically registered, the "self" agent, meaning the same cluster where the portal resides. And now you can see...
E: ...the workflow controller is going to basically reconcile the Argo workflows and launch a set of pods that carry out the individual steps within a workflow. One of those steps will be to install the experiment and install the engine, and when you install the engine, the operator is going to reconcile it and carry out the chaos. The exporter is going to provide metrics, and the event tracker is something that is going to help you do event-triggered chaos.
E: That's something we'll hold for a later discussion; I don't want to bring in too many things here. So let me go ahead and also show you a few other things: you have an analytics tab.
E: ...by running workflows, constructing them from the portal and running them, or running them by hand, as you saw in the previous demo, etc. But sometimes you want chaos to be triggered in an automated fashion, as a response to certain events that you see happen on your cluster, and we have an event tracker policy, a CR which you can go ahead and define. Right now it's configured to look for image changes. So let's say you have Argo CD or Flux...
E: ...that's actually upgrading your applications on your cluster as a result of your GitOps flow, and you want to check the sanity of your new change once it's deployed onto your cluster. The way you could do that is to have this application subscribe to a predefined workflow that you have stored in Git. The portal actually has this thing called a GitOps flag, a setting you can enable, so that workflows created or constructed in the portal...
E: ...if you have GitOps enabled, get committed to a Git repository, each with a workflow ID. Applications can subscribe to one such workflow that you have stored, and if an application change happens, maybe a change to a particular image tag or something like that, the event tracker will go ahead and watch the applications for those changes.
E: And if the change has happened, then it's going to trigger the workflow which has been subscribed to by that app, and you can track the progress of that workflow within the portal and analyze it there. Right now we're in the process of enriching what policies you can fit into that event tracker. Beyond image changes, there are so many other things in the cluster that you might want to trigger chaos in response to, and not only for GitOps flows.
E: You might want to do it for non-GitOps flows as well, and that's where we're looking at different kinds of inputs, or triggers, for chaos. Events are a great one; we're thinking along those lines, but right now we don't have that.
E: Going ahead, let me create a workflow; I want to explain a simple use case. I'm going to show you two things with this workflow: I'm going to do a network loss, just for variety, and show you a very interesting use case. There's this Bank of Anthos application, which you might all be aware of.
E: ...it shows you how simply you can construct an experiment from the portal, run it, and see the impact of chaos. Once I do this, the next part will be to see how you can visualize the metrics of chaos, and for that I'm going to use the podtato-head application: we're going to kill it, and I have set up a blackbox exporter to check the availability and access latency of this particular service.
E: I have a dashboard for it, and I have annotated it with some chaos annotations. I'm going to run an experiment which kills a single replica of this particular service; it has been deployed with just one replica, and you're going to see the application basically change here. I hope my screen continues to stay shared.
E: I might have just flipped something that disabled it, sorry about that. Yeah, so we will see these parameters changing in Grafana, and we'll also see an interleaved dashboard; it will have some area here covered by the chaos metrics. But before that, let me show you how you can run a simple experiment. The Bank of Anthos has some money here, and I'm going to inject network loss, a black-hole attack, against the balance-reader service, so that I cannot read my balance: even if I deposit something, it cannot show what my balance is.
C: One question in the meantime: with version 2, do we have to use the portal, is that the default, or can I just use the operator, like in version 1?
E: You could do both. So I just selected the cluster; that's the same cluster where I have the Bank of Anthos as well as the portal. Let me go ahead and select next, and I'm just going to select my hub. There are different options: you could create a workflow from a predefined template that's sort of burned into the image, or you could clone an existing workflow and reuse that template, or import a YAML file that you've constructed by hand, or you can select experiments and construct your workflow newly from the wizard.
E: So I've just selected the public ChaosHub that's embedded into the portal, and I'm just going to call it "black-hole-bank-of-anthos". The workflow is going to run in the litmus namespace, and I'm just going to select an experiment.
E: I'm going to select pod network loss; that's the one we're going to do, and we're going to go ahead and tweak some tunables here. I just clicked on the experiment, I'll hit next, and let's see where I have my application: I have it in the default namespace, and it's of kind deployment, and I'm trying to see which application I have. There is a deployment by the name balance-reader, which you can also see in the default namespace.
E: That is the application here; this is the one we select, and I'm going to keep the cleanup policy as "retain". I'll select next, and I can choose some probes. Probes are a concept we'll come to in a while; I'm just going to say finish, and with this I'm going to take a look at how my YAML has been constructed.
E: You can see this is the workflow, and it pulls the network loss experiment from the hub as the first step. The second step is to install the engine, and the engine is defined like this: I have selected the default application, the app is balance-reader, the kind is deployment, and we're doing a pod network loss. I'm going to do it for 60 seconds at 100% packet loss, and I've just provided some socket paths and runtime details, and yeah.
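The embedded engine amounts to something like the following. This is a sketch using the pod-network-loss tunables from the generic chart; the application label and the runtime details are assumptions matching the demo cluster:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: black-hole-bank-of-anthos
  namespace: litmus
spec:
  appinfo:
    appns: default
    applabel: "app=balancereader"   # assumed label of the balance-reader deployment
    appkind: deployment
  chaosServiceAccount: litmus-admin
  jobCleanUpPolicy: retain
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "100"          # full black hole
            - name: CONTAINER_RUNTIME
              value: "docker"
            - name: SOCKET_PATH
              value: "/var/run/docker.sock"
```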
E: ...it's going to multiply the points that you gave to an experiment by the success factor of the experiment, the probe success percentage which we saw earlier, and it's going to take the summation of that over all the faults, divided by the total points available. That is going to give you a resilience score. It's like a weighted average that tells you how resilient your application is to this overall scenario.
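In other words (my notation, not Litmus's), with $w_i$ the points assigned to fault $i$ and $p_i$ its probe success percentage:

```latex
\text{resilience score} = \frac{\sum_i w_i \, p_i}{\sum_i w_i}
```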
E: So I just have one experiment; I'm just going to give all the points to it. And I could do a recurring schedule, or I could do a one-time schedule, so let me just go ahead and finish, go ahead and apply it. That's going to start it. We have a workflow visualization graph, which is going to show you the progress of the workflow as it happens, and you will see some experiment pods get created in the litmus namespace.
E: In order to carry out the experiment, there are some pods that will get created alongside the Bank of Anthos ones. The first step, as you saw, was to pull the pod-network-loss experiment from the ChaosHub, and it's going to apply that; the next step would be to apply the ChaosEngine and launch the actual chaos process.
E: So once this starts, and this one is not yet created, which is why you basically see this, it's going to turn blue, which means it has started doing the network loss experiment, and once it starts doing that, we will be able to go ahead and visualize some of the impact that happens here.
E: What we've done to the vanilla Argo workflows is instrument them with some Litmus images to carry out these steps of creating the engine, tracking its status, waiting for its completion, etc. These images understand the Litmus API, so they've been used within these workflows. So it's basically going to go ahead and create that; it's still doing it. It's a fresh cluster that I've created, which doesn't have some of these images already, so it's probably going to take some time.
E: Meanwhile, while this is happening, I could also probably show you what probes are and why they are useful. For that, I'm just going to show you a sample workflow that we have for Kafka, where we're doing a simple pod-delete experiment, but we're going to kill the broker pod that's handling the IO, the message stream, and that's going to trigger a failover. In that process, we want to check that the message stream continues and doesn't break.
E: It's going to give me a container... okay, I think it finally started; it took a very long time to pull, and it doesn't usually take this much time. Okay, so now that it is actually running, you'll see some new experiment pods that get created; you have the pod-network helper. Let me go to the Bank of Anthos and refresh it. Let me sign in, and you will see that I will not be able to read the balance inside this particular application.
E: ...you can think of it like that. And I'm going to check at the edges, at the beginning of chaos and after the end of chaos, that I don't have any under-replicated partitions, which is basically a check to say that you leave your system in a sane state and that it has self-healed, auto-recovered. Then there are also checks you would like to do continuously, throughout the chaos: there is a check we're doing for offline partitions, which is again through a promProbe with a different metric.
E: You want to do this as you do the chaos. Sometimes there are negative checks you might want to do just for the chaos period and not throughout the experiment, which has all the pre-chaos and post-chaos phases, etc. So I'm going to check this consumer container, the one that's actually running this message stream that you saw, and I want it to always be available and ready, which means there has been no exception, the failover has been successful, and the message stream is continuing. So I'm just verifying...
E: ...that the container status is "true", and I'm doing this using a cmdProbe. There are also other probes, like the k8sProbe and the httpProbe, which you can use to do REST calls, GET or POST requests, etc. All of this can be put inside an experiment, and the chaos result that you get in such cases is more elaborate and gives more information about what's happening in your experiment.
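For the Kafka scenario just described, the probe stanza inside the engine's experiment spec could look like this. It's a sketch following the Litmus probe schema (promProbe and cmdProbe with Edge and Continuous modes); the endpoint, metric name, and command are illustrative assumptions:

```yaml
# Goes under spec.experiments[].spec in the ChaosEngine.
probe:
  - name: check-under-replicated-partitions
    type: promProbe
    mode: Edge                    # evaluated at the start and end of chaos
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc:9090   # assumed Prometheus endpoint
      query: kafka_server_replicamanager_underreplicatedpartitions
      comparator:
        criteria: "=="
        value: "0"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 3
  - name: check-consumer-ready
    type: cmdProbe
    mode: Continuous              # evaluated repeatedly during chaos
    cmdProbe/inputs:
      command: kubectl get pod -l app=kafka-consumer -o jsonpath='{.items[0].status.containerStatuses[0].ready}'
      comparator:
        type: string
        criteria: contains
        value: "true"
    runProperties:
      probeTimeout: 5
      interval: 5
      retry: 2
```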
E: These are very useful for automated runs, and this is what we meant by consuming observability information when we do the chaos experiment. Now, coming back: the Bank of Anthos, I assume, will have recovered, because the chaos duration is over; it was 60 seconds, so it's back, and the experiment has finished here. When you click this, you'll be able to see the logs of the experiment and also some results here. We didn't have any probes, so it just basically says the experiment passed and was successful.
E: You could run another iteration of this workflow to see whether the results were improving or decreasing, and you get your resilience score as 100 percent here. Going ahead, I will now show you the other aspect of chaos, and I'll probably end after this: it's about how you generate metrics as you do chaos. You can see there is a chaos exporter that we're running, which is exposing some metrics, and I'm going to create a ServiceMonitor in the litmus namespace.
E: I actually have ServiceMonitors defined for the other components; the blackbox exporter is already defined, which is why you see this Grafana dashboard. And I'm going to create one ServiceMonitor for the chaos exporter in the litmus namespace. Let me just check that my Litmus services actually have the label; yes, it has the label app=chaos-exporter.
E
Because that is what the exporter is. So I have this named app: chaos-exporter, and, sorry about that, I have already added it into my Prometheus CR. I'm using the kube-prometheus stack, so it's basically using the Prometheus operator. app: chaos-exporter is what I have, and that's what we are going to provide here as well. tcp is the port name, and that, I suppose, is the name we already have on my service as well; the whole manifest comes out roughly as below.
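Putting those pieces together, the ServiceMonitor being created here would look roughly like this. The release label is an assumption: whatever label your Prometheus CR's serviceMonitorSelector matches in your kube-prometheus installation is what belongs there.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-exporter
  namespace: litmus
  labels:
    release: kube-prometheus    # assumed: must match your Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: chaos-exporter       # the label verified on the chaos-exporter service
  endpoints:
    - port: tcp                 # the port name mentioned in the demo
```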
E
And once I have Prometheus, let me go take a look at my targets; I'm going to check if there are litmus metrics. Yeah, you do get litmus metrics; you can basically check that in the targets as well. So now let me go ahead and do another chaos experiment, a very simple one this time: trying to delete the hello pod replica. For that I'm going to use a very similar method: select the self agent, and I'm going to select my hub.
E
You might be doing experiments with a defined context in mind, so you can provide that. We have the app labels, namespace, and deployment for the hello pod; we're doing it with litmus-admin, and I just want to do not too many iterations of chaos, just one iteration. But nevertheless, let me go ahead, I'm just going to finish this. Let me go ahead and run chaos; the resulting engine looks roughly like the sketch below.
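For readers following along, a pod-delete ChaosEngine equivalent to what the wizard generates here might look like this. The app namespace and label are assumptions standing in for the demo's hello app values.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-pod-delete
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: hello                # assumed namespace of the demo app
    applabel: app=hello         # assumed label selected in the wizard
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"       # seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"       # seconds between successive pod kills
            - name: FORCE
              value: "false"    # graceful pod deletion
```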
E
We just have a single replica; it's a deliberate attempt to make sure that we lose this service for a while. I have not put in multiple replicas, in keeping with breaking things on purpose. So it's still pulling the experiment here, and now it's going to run the pod-delete step; it's going to inject the pod-delete chaos.
E
So you can see that there is the success percentage that has dropped and the access duration that has spiked, right? And from Prometheus we're trying to get some metrics on what's happening to this. So there is an awaited-experiments metric, which is what we are now actually trying to show in Grafana. There should actually be a red area that starts up here; since that's not appearing, I need to take a look at what happened to my annotation, whether it is right or wrong.
E
So let me take a look: it was the awaited chaos experiments query, the chaos namespace is litmus, the result namespace is basically litmus, and the job that I've provided is chaos-monitor. But unfortunately that's not what we have; it's chaos-exporter. I think that's the difference. So let me go ahead and change this to chaos-exporter, and the rest remains the same.
E
So I'm just going to update this and save my dashboard, and let me go back and take a look at my dashboard; that should probably update it now. Let me take a look; something else is also needed: the namespace, litmus. Okay, there was the ServiceMonitor, that's again an oversight; I think the service is also called chaos-exporter here. All right, now I hope things will work.
E
Yeah, and now you can actually see the chaos happening. This period is when chaos actually occurred, and you could also use other metrics to give you the verdict on Grafana, if you are so inclined. But this is the period when the chaos actually happened, and you can see that the availability dropped, the access duration spiked, and then it recovered, right?
E
So the hello service, I'm sure, is going to be back anyway. But this is how you can use the observability data generated by the chaos framework, instrument your application dashboards with it, and basically get a closer look, a better idea of what's happening. You can even run this automated and look at it later, and that gives you an idea of how things have run.
E
So, we talked about being open source and community collaborated: without the community getting involved, you cannot have a rich library of experiments and scenarios, which is why we believe that cloud native chaos should be open source and community collaborated. Open API and lifecycle: you took a look at the chaos engine, and the experiment and result CRDs are themselves APIs, along with the BYOC, bring-your-own-chaos, approach that we enable with the ChaosExperiment CRs.
E
You can write your experiment in a standardized way and have a standard structure for orchestrating it as well, which makes it easy for people to come and contribute their own experiments, while still having a defined, expected way of going about it and of operationalizing it; a minimal skeleton of that structure follows below. GitOps is the event-tracker story.
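As a rough illustration of that standardized structure, a contributed fault is registered through a ChaosExperiment CR along these lines. The name, image, and entrypoint here are hypothetical placeholders standing in for your own business logic.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-custom-fault                     # hypothetical experiment name
  labels:
    name: my-custom-fault
spec:
  definition:
    scope: Namespaced
    permissions: []                         # RBAC rules the experiment job needs
    image: example.io/my-chaos-lib:latest   # hypothetical image holding your logic
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name my-custom-fault
    labels:
      name: my-custom-fault
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
```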
E
That is what we just discussed some time back. And then open observability is about being able to generate metrics, consume metrics, and have a standard for how you can speak to existing open source observability infrastructure and leverage it within your chaos experimentation process. So these are the principles that we thought we should inculcate into the platform as we were building it out, and that's what Litmus is all about. With that, I'm concluding the presentation.
E
Thank you so much for giving us so much time and for being patient through this process. I think there was a contribution aspect which we probably missed out on, but there's a link that I've shared with the SDK.
E
You could contribute new experiments, using that tool to just bootstrap them, fill in your business logic, and share them. You could also contribute to documentation; docs contributions are really welcome, that's one area we are really seeking help on, and so are the infrastructure components of the portal: the control plane, the operator, etcetera. These are the areas you can collaborate on, and Prithvi will share some information on where you can join the community and how you can sync up with us. There are monthly sync-up calls and meetups that happen.
A
So much, to be honest; I'm now flooded with ideas of what I want to try out immediately. I think one of the contributions could also be: you've talked a lot about GitOps workflows and Argo and Flux, and there are potentially other tools around that. So, combining it with GitLab CI/CD or any other specific integration, and not only writing a blog post or a tutorial around it, but also providing example repositories, providing the insights in a session with us, which everyone can benefit from.
A
It was basically a proposal to do it, because the more different tools and different environments we can gather, the better an idea we can get, and the more popular Litmus chaos will get. It's also a diverse setting of saying: hey, does it work in that scenario, or do we need to add a certain, I don't know, feature or code change to actually enable everyone to follow the chaos engineering workflow?
E
Let me just share my screen to show you what is already there. It's a little bit rudimentary, and it's more on the 1.x side of things, but we have what we call GitLab remote templates.
E
So
if
you
have
a
gitlab,
ci,
dot,
yaml
and
you
are
running
your
own
stuff
in
your
pipelines
and
want
to
introduce
like
your
stage
or
as
a
residency
test
into
it,
you
could
make
use
of
some
of
these
templates
that
we
have
already
you
could.
Just
gitlab
has
an
amazing
feature
where
you
could
import
templates,
remote
templates,
and
you
can
just
import
this
and
override
some
of
these
variables
from
the
gitlab
ci
dot
tml
and
get
the
experiment
done.
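A minimal sketch of that include, assuming a hypothetical template URL and variable names; the actual raw links and the overridable variables live in the Litmus templates repository and should be taken from there.

```yaml
# .gitlab-ci.yml (hedged sketch; the remote URL and variables are placeholders)
include:
  - remote: https://raw.githubusercontent.com/litmuschaos/litmus/master/gitlab-remote-templates/pod-delete.yml

variables:
  APP_NS: hello               # hypothetical: namespace of the app under test
  APP_LABEL: app=hello        # hypothetical: label of the app under test
  TOTAL_CHAOS_DURATION: "30"
```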
E
So this is going to actually pull Litmus in. There is also another flag for installing Litmus, true or false, so you can install Litmus, run the experiment against the app of your choice, and have the GitLab job pass or fail based on the chaos result: the success of the GitLab job depends upon the chaos result's verdict.
E
That can be done with the GitLab templates. We also have an integration with Keptn, where there is a control plane service, called the litmus-service, that runs in the Keptn control plane, and you can add chaos stages within a Keptn pipeline. You could combine this with the load generation that Keptn natively supports, running JMeter or Locust tests against your application, and you could also drive Keptn's quality gate feature using Litmus experiments.
E
So some of these integrations are going on, especially in the CI/CD space. Another space where we would really like to integrate and get better at is the observability platforms. Prometheus is great; it's very common and probably the most used one. Similarly, there are other cloud native platforms we would like to hook into, to enable people to get their experiment details onto their own app dashboards and APM environments.
E
There are also what I would call not just integrations but collaborations happening with teams like Pravega, which is another CNCF sandbox project, and Strimzi, the Kafka-focused CNCF sandbox project. Here we are trying to enable them to run chaos experiments for testing those projects as well; Pravega has in fact been using us for nearly a year now for their e2e tests.
E
So these are some of the things that we have been doing on the community front and on the integration front, within the larger context of the cloud native ecosystem. We'd like to continue doing that, be more engaged with the community, and make it a tool that helps in shipping other tools; that's the intention. We're also trying to dogfood some of our own templates and scripts to test Litmus components themselves: some of the faults are being injected against Litmus components.
E
These are some of the directions that we are going in. We are trying to provide some documentation around all these things, which is missing right now, to be honest; that's something we need to go ahead and add in the days to come, before we go ahead and do the GA of 2.0, and it's where we're also trying to seek some contributions from industry folks.
E
So yeah, that's what's going on.
A
This is really great, to be honest. I just pinged everyone on Twitter to try it out and shared all the things you shared, like reverse-engineering your screen share. I would say we will potentially try it out; well, I will definitely try it out, especially the observability parts: integrating it into the existing Kubernetes cluster, the monitoring stack, and the Prometheus operator.
A
I think this is a really great and nifty idea, and I also really admire your passion in presenting it; I just wanted to say that out loud. I also tweeted it. The other thing is, I really like the UX: you have a wizard, you have workflows, you have immediate feedback, you get the logs, you get the chaos results.
A
So I think it's a really great first impression when you start it, and you can keep going; you want to stay in the UI. I think what you have created, what you have designed, is really great. So thanks for your passion and also for your open source spirit. And, I don't know, Philip, do you have any other questions?
C
I guess, yes, sure, just two; one technical. When I use the UX and the portal, okay, how can I make it declarative? Because, for example, let's say I migrate my management cluster from A to B; I don't want to click through the UI again to create all those experiments.
E
Right, so there is a section called dashboards, in analytics, where you can compare experiments, take a look at how each was done, and then download things.
E
All of that is declarative; as in, if you look at it, it's a mixture of being declarative and imperative. Argo is reconciling the workflow and the chaos engine is reconciled, but both the Argo workflows and the Litmus chaos experiments here allow you to define what you want to run, and it can be imperative to a degree within those manifests, so it's a good mix of both. We also have some features coming in to manage the status of these tests.
E
When, let's say, the pods are evicted or jobs are lost and things like that, we will still be able to reconcile, and will be able to clean things up without needing further resources. All of that is there to a degree and is also being added, and we kept it this way to support both modes of usage.
E
As regards how you can run the portal without the dashboard, if that was the question: you could do that with the APIs, which we are documenting and will expose to the community shortly. You could start using those and write scripts around the day-0 and day-1 operations you want to perform on the portal, things like connecting agents, creating workflows, triggering schedules, and downloading reports. But most of what is defined as YAML, and hence the git-stored component of the portal, would be the workflows, along the lines of the skeleton below.
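To make that concrete: the git-stored artifact is an Argo Workflow wrapping the chaos resources, so re-applying it to a new management cluster recreates the experiment without clicking through the UI. A trimmed, hypothetical skeleton might look like this (the service account name is an assumption):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-chaos-
  namespace: litmus
spec:
  entrypoint: chaos
  serviceAccountName: argo-chaos          # assumed service account
  templates:
    - name: chaos
      steps:
        - - name: run-pod-delete
            template: pod-delete
    - name: pod-delete
      resource:
        action: create                    # declaratively creates the ChaosEngine
        manifest: |
          apiVersion: litmuschaos.io/v1alpha1
          kind: ChaosEngine
          metadata:
            name: hello-pod-delete
            namespace: litmus
          spec:
            engineState: active
            chaosServiceAccount: litmus-admin
            experiments:
              - name: pod-delete
```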
C
Yes, good, thanks. My second question is more of a general one. As you said, you started around 2018, when this became, let's say, a movement. How is it? I mean, chaos engineering, to be honest, is like exposing your weaknesses. For example: okay, let's try this, let's start differently; when do I have enough trust to run chaos engineering in production? Basically, it can feel like showing a weakness in my work to my boss.
C
You know, how can you pick up those people to get engaged in chaos engineering? Because, yeah, we are humans; we are not people who want to show weaknesses in the team and to our boss. It's like saying: hey, your work is bad, because when your pod is restarted, then your work is bad, you know. So how can we, all together, pick up those people to engage more in chaos engineering and in more resilience?
E
It's a great question; in fact, one of the initial big challenges that we faced was exactly on this account. It's a culture thing, also related to what we discussed earlier: people don't want to do chaos because, like you said, as humans we don't want to show weaknesses, and we don't want to disturb what's already running well. There are also other inhibitors to the adoption of chaos; one is the lack of a proper observability infrastructure.
E
Sometimes teams are in the initial stages of setting up microservices environments and all that, and they do not have the right tools in place to see what's happening when they do chaos. So they feel a lack of control, and they are basically very reluctant to go and do chaos, because they don't know how to remediate, when to step in, things like that. So observability goes hand in hand with chaos and is a real prerequisite.
E
And the other thing is: doing chaos in production is really the pinnacle of chaos engineering practice, the pinnacle of the chaos discipline within an organization. There are not so many people who do it in production; they want to do it in staging environments, pre-prod environments, dev clusters, etc. And one of the things, like you say, about the people to pick: which teams have the lowest entry barrier to take up chaos? The observability teams are the ones that will probably jump onto this first. Why? Because you don't want to disturb your main business applications.
E
You don't want to basically go ahead and disturb what's running well, but you will still be okay to kill some observability systems, or some of the services you are using for monitoring, to see if you are getting the right reports, notifications, and alerts for your main application, and whether your observability is highly available or not. Let's say you have Prometheus with just a single replica in the deployment, and you went and did a pod kill or a network failure there.
E
You would still want to know how your end service is behaving; you would want to get the right alerts and notifications; you would want to see the graphs there. So is that still happening when your monitoring framework is being subjected to some kind of fault? That makes it a good service to target, and the team that works on observability is a good early adopter of chaos, which is what we have seen in our community as well, and then they slowly gain confidence from it.
E
They then pitch it to the other teams, the service teams, and those teams go ahead and do chaos, and it sort of becomes a movement with which everyone becomes comfortable at some point. So that is one easy way of pushing things. Another way is to start in dev clusters, even in dev testing; and when I say testing, I'm not even talking about the tests that the QA teams are doing, because we also need to sell it to the QA teams.
E
Sometimes they have their own test plans and their own schedules, and this comes in as sort of an additional thing. So developers themselves could do it in the sandbox environments they have, where they do their initial tests before they push their code into the CD pipelines or over to the QA teams. That's another place where they can do chaos and show the value.
E
Then it can catch on with a few more teams, and they can do a lot. One of the huge consumers of chaos in the last few months, as we see in the Litmus community, has been the QA teams, not only the traditional ops and SRE teams whom we generally associate with chaos engineering. Of course, that persona is there, but what we see are a lot of QA people trying to do failure testing with the chaos mindset, with the exploratory mindset that is associated with chaos experimentation.
C
Yeah, I think, I mean, if you have pre-prod, staging, and tests, it's all good; you definitely need this. But, I mean, at the end, in staging and pre-prod you never have the production workload, okay? You never have one-to-one the situation which you have in production, and I'm the kind of person who would rather have it planned.
C
I don't know, have a maintenance window announced: hey, we do maintenance, and then run our chaos experiment in production, rather than have, I don't know, some unplanned outage which nobody knows of. And then, you know how it is as an SRE: in the night the Kubernetes cluster goes down, and then you have the unplanned work. I mean, if we refer to the DevOps Handbook, unplanned work is the worst work. Yeah, so hopefully it will grow as a movement.
E
Have all remediation plans ready to go before chaos is injected, take notes on observations, share the findings, and then repeat it. Because, like you said, nothing can simulate production; the dynamic nature of the production environment is real. I think it should slowly build up to that point where people are comfortable running chaos in production; they should have tested it enough, and the framework they're using should enable them to do that, give them that feeling of control.
E
For example, one of the features in Litmus that we've added in recent times is called the deadman's switch, or the automated rollback, where, let's say, you have a prom probe defined to say you are exceeding a certain metric.
E
Which means things are getting out of hand and you need to stop: you can immediately patch the engine to abort, and it automatically stops the experiment. And then you can specify the blast radius in a very controlled way. Apart from just the labels and namespaces, we give you annotations to say: among the many deployments in this namespace with this particular label, choose, zero in on, just one particular deployment, or maybe one specific pod, or something like that. So you can do this blast radius control, as sketched below.
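Both mechanisms just mentioned are declarative; a hedged sketch, reusing the hypothetical hello-pod-delete engine from earlier:

```yaml
# Abort a running experiment (the deadman's-switch style stop described above):
#   kubectl patch chaosengine hello-pod-delete -n litmus \
#     --type merge -p '{"spec":{"engineState":"stop"}}'

# Blast-radius control: with the annotation check enabled on the engine,
# only workloads carrying the opt-in annotation are eligible targets.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-pod-delete
  namespace: litmus
spec:
  annotationCheck: "true"       # engine ignores apps without the annotation
  appinfo:
    appns: hello
    applabel: app=hello
    appkind: deployment
---
# The single deployment opted in for chaos:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
  namespace: hello
  annotations:
    litmuschaos.io/chaos: "true"
```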
E
Having the right hypothesis, the ability to abort, the ability to schedule it at a specific time when you know the traffic is low, things like that: these are all enablers that you can add from the framework side to get people to run it in production in a more confident way. That's been our endeavor as well; we've been trying to think along those lines too.
A
You mentioned before that Kubernetes is not a dependency; did I understand this correctly? So you're also targeting AWS or something else?
E
Yes. So you do have experiments that can terminate EC2 instances or detach EBS volumes, whether they are just attached disks or marked as PVs. We have also recently introduced experiments that can do not only these out-of-band kinds of things, using the AWS APIs to kill instances or detach disks, but can also go inside an instance and simulate some resource faults, or latency, network latency, and things like that; a sketch follows below.
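A hedged sketch of what such an AWS-targeted engine can look like, using the ec2-terminate-by-id fault; the instance ID, region, and service account are placeholders. Credentials typically come from a mounted cloud secret or an IAM role mapped to the experiment pod, as mentioned next.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ec2-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: ec2-terminate-sa   # assumed SA with the needed permissions
  experiments:
    - name: ec2-terminate-by-id
      spec:
        components:
          env:
            - name: EC2_INSTANCE_ID
              value: i-0123456789abcdef0  # placeholder instance id
            - name: REGION
              value: us-east-1            # placeholder region
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```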
E
So the experiment, with Litmus running on Kubernetes, the chaos engine, is going to look very similar to the pod-delete one you saw defined, but you will have some secrets created, or IAM roles mapped, within the experiment, and you will use those to go ahead and do the chaos against the cloud resources over the network. The control plane and the experiment plane, the execution plane, will still be on Kubernetes, while you are targeting entities which are not really Kubernetes; they may be infra components, be they EC2 instances or GCP instances.
E
You could do that as well, because a lot of people run hybrid environments and have not completely migrated to Kubernetes: they have some services here and some services living in legacy infrastructure, and they want the same tool set or platform to be able to orchestrate chaos across all of it. So that was also a requirement that we got, and we enabled it in recent times; the same goes for bare metal systems as well.
E
In fact, there are experiments that we have, or are going to add, or have been working on, where you can use IPMI APIs to do out-of-band power-off and power-on of machines. For example, some Dell machines provide iDRAC capability and give you APIs, so you can do things there. These are things you can do while retaining a common, homogeneous experience: on Kubernetes, with declarative definitions stored in git, etc., while still covering a wide variety of infrastructure around you.
A
Thanks, that's really great to hear. Also, from my own experience: I often hear that Kubernetes is hard to get into, and I experienced it myself. The learning curve is like, wow; but once you're into it, it's like: oh, it's amazing, and I'm totally into it. So, when I want to start as a developer, or maybe as a DevOps engineer.
A
Whatever the role title is, you potentially want to hide the Kubernetes component and just say: hey, here, execute something, or these are the steps; and maybe you have a QA cluster or something else, and the entry barrier is not: oh, this is Kubernetes, I'm not using it. Because I think right now the messaging on the website is also: chaos engineering for your Kubernetes, or your Kubernetes cluster.
A
This could also be opened up when you have a directional roadmap, where you say: okay, our potential target group is 90 percent Kubernetes, but we're also moving into, like you said, hardware, bare metal, or specific hyperscale clouds, the smaller cloud providers, anything which can be tested, maybe even going into IoT, or, I don't really know what edge is, but all the new stuff which potentially has some, yeah.
A
You don't have any competitors in there, because there is no chaos engineering there yet; there's potentially some unit testing or extreme testing somehow, but, like, people finding, or putting, chaos in production is something where they sit there saying: how should I do that?
A
Everyone is using that, we can rely on that, and we can invest in our product and in our infrastructure, and adopt tools that scale, for example Litmus, in that specific sense. I think that's a really great idea, and I will talk to our product managers and also to our teams around this and see how far we can get, or maybe even work in a similar fashion.
A
As we tried with the Keptn project: just to collaborate even more, see where we are and where we are going, and build a great cloud native, or even broader, chaos engineering idea or collaboration around that. That would be amazing.
E
Awesome, that would be just fantastic. Thank you for that, Michael; I think you summarized it very well. Yeah, we have seen some interesting use cases in the community around chaos.
E
Some of the end-user stories are very fascinating, and there have been some good discoveries made as a result of chaos. We are also trying to get those users to speak on some forums like this, or go to the CNCF and talk about some of the very interesting use cases that they have, so that the community at large benefits from those insights.
E
So hopefully we will see more such information and more awareness. Chaos engineering has been there for over a decade; it has really gathered pace, accelerated, in the last couple of years, I would say, in no small part because of this whole cloud native paradigm and the digital transformation journey people are making, migrating to Kubernetes and microservices.
A
And I also submitted some CFPs for later this year with chaos engineering inside; I knew that I needed to look into Litmus, so I'm really grateful that you took the time today to educate us, to show us what is hot and what is new. I would especially love to have you back, I don't know, in half a year or something, when 2.0 is released, or 2.6, and we have some more use cases, and maybe we have a deeper collaboration or something around that. Other than that.
A
I really appreciate you taking the time; I know it's late for you. We kind of extended it to 125 minutes now, but it's totally fine; as I said, I was very much enjoying it. And yeah, thanks to you, Karthik and Prithvi, for joining today, and I would just say: bye bye on YouTube. Thanks for listening; the blog post will be published later on.