A: Hey everyone, welcome to this webinar on the project updates of open source Litmus and the new set of features that we have been working on and added to the open source product. My name is Shayan, I'm a senior software engineer at Harness, and I've been closely working with the Litmus open source project for the past two years, mainly on Litmus 2.0 and the features associated with it.
A: We'll cover the developer-centric features that we have added to the open source platform, and we are also going to take a look at two different use cases to help understand how Litmus is actually being used in the world right now and how Litmus has been adopted by different clients. We are also going to finish with a small demo, where we'll see how the new, enhanced features of Litmus can be put to use in a commercial application.
A: So before we jump into the new developer-centric features that we have added, here's a quick refresher on what Litmus is, for all the new users out there. Litmus is a cross-cloud, cloud-native chaos engineering toolset, or a framework you could say, which helps not only SREs but also developers, and really any persona, try out chaos, with seamless integration and automation that will help ease your chaos engineering journey. So you can choose different experiments.
A: You can create scenarios out of them, and then you can run your workflows and do chaos in a much simpler way, with the help of Litmus. Now, assuming some of the community users present here have already used Litmus, we are going to take a look at the developer-centric and developer-focused features that we have added to help both the developers as well as the community users and contributors that have been working closely with us.
A: So, previously we already had the ability to add probes, but we have also worked on that and improved the probe addition capability. Now, once you've added a probe, you can also go ahead and edit the same probe. Previously we did not have that capability, but now you can edit it.
A: Previously, you had to delete the entire probe and then create a new one, which was quite a bit of a pain when you are constructing new probes, thinking about the hypothesis, and doing the steady-state validation with probes. So it's better that you get an edit option. Now you do, and you can change the probe type in the same edit feature, so you can completely change the probe altogether with this new probe-editing feature.
A: You can also update the same steps, and that should be visually reflected in the experiment sequence as well. There's also the option to go to the configuration wizard, which is the pencil icon that you see on the different experiments in the table. There you will have the option to tune your environment variables and give them certain values, which will be reflected in the ChaosEngine. And at the bottom of the table you also get advanced options, where you can select the pod GC strategy, that is, whether you want to clean up all the different pods post-chaos. And if you want to add tolerations to your particular chaos experiments, you can do that kind of advanced configuration there as well.
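As a rough sketch of where those knobs land: the values chosen in the configuration wizard ultimately populate the ChaosEngine manifest. The field names below follow the Litmus 2.x CRD, but the target app, labels, and toleration are placeholder assumptions:

    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: cart-chaos                 # hypothetical engine name
      namespace: litmus
    spec:
      appinfo:
        appns: litmus                  # namespace of the target app (assumption)
        applabel: app=cartservice      # label selector for the target deployment
        appkind: deployment
      engineState: active
      chaosServiceAccount: litmus-admin
      jobCleanUpPolicy: delete         # "delete" cleans up experiment pods post-chaos; "retain" keeps them
      experiments:
        - name: pod-delete
          spec:
            components:
              env:                     # values tuned in the wizard land here
                - name: TOTAL_CHAOS_DURATION
                  value: "30"
              tolerations:             # advanced option: tolerations for the experiment pod
                - key: dedicated
                  operator: Equal
                  value: chaos
                  effect: NoSchedule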
A: Third, we have the ability to upgrade a chaos delegate. Let's say you are running your Litmus workflows on a delegate deployed with the latest version of Litmus, but you're a bit behind on the upgrade, and the community has pushed a new feature, a new upgrade of the chaos delegate. With the new versions of Litmus, what you would notice is that there's an upgrade option on the chaos delegate side.
A: So if you're on the latest deployed version of Litmus and there's an upgrade available for your chaos delegate, then Litmus would suggest that you go ahead and update your chaos delegate to the latest version. If you're already on the latest version, the button would be disabled; but if you are lagging a little behind on the update, then Litmus will show you the option to upgrade your chaos delegate to the latest version.
A: Next, we have added a more secure RBAC update to the APIs. These RBAC security updates have been added both to the API as well as to the UI. This was mostly a bug reported from the community side and addressed by us: there are two RBAC permissions, the editor and the viewer.
A
So
as
a
viewer,
you
shouldn't
be
able
to
access
certain
pages
certain
apis,
certain
you
shouldn't
be
able
to
do
certain
perform
operations
as
a
viewer
which,
where
there
was
a
few
leaked
cases
where
you
would
be
able
to
create
a
scenario
or
view
and
like
go
to
the
editing
section
of
a
particular
workflow
which
you
shouldn't
be
able
to
do
as
a
viewer.
So
those
things
are
addressed
and
now
we
have
added
a
more
secure
hardback.
So
now
let's
say
a
viewer
is
given
the
specific
screen
through
different
api
calls.
A
Even
you
wouldn't
be
able
to
do
so,
because
the
our
back
checks
are
also
added
to
the
api.
So
now
the
apis
are
hardened,
as
well
as
the
ui.
So
now
a
viewer
should
not
be
able
to
view
screens
which
are
only
accessible
via
the
admin
and
the
editor
also.
There
have
been
requests
of
adding
the
support
of
running
or
scheduling
a
basic
cargo
workflows,
rather
than
just
chaos
workflow.
A: So previously we had support for running different kinds of chaos workflows, which were also Argo workflows, but we didn't have support for running a very stripped-down, simple version of a basic Argo workflow. So now we have modified our back end and added support for scheduling a basic Argo workflow as well. So if you directly take an Argo workflow from Argo and try to run it on our platform, that should work as of now.
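For reference, the kind of stripped-down Argo workflow being described is roughly the stock hello-world example from the Argo documentation, nothing Litmus-specific:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-       # Argo appends a random suffix to the name
    spec:
      entrypoint: whalesay
      templates:
        - name: whalesay
          container:
            image: docker/whalesay     # prints a message and exits
            command: [cowsay]
            args: ["hello litmus"]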
A: Similarly, once Kubernetes updated to 1.22.0 and above, there were a few APIs and a few manifest deployments that users coming from the community side were trying, and they were unable to execute them, because we didn't have support for 1.22 and above.
A: We have also addressed that, and currently we do support all Kubernetes versions 1.22 and above. We have also added support for IPv6. And to ensure that we have better end-to-end coverage and strictness across all the different builds, and continually do these kinds of deployments via nightly builds or via regular CI checks, we have ensured that we do better testing, and we have added more e2e test suites to cover all aspects of this development.
A: So when we push or complete our work by the end of the day, there are always multiple nightly builds happening, and they also give you a report, so that you know the status of your deployments. We have also added the ability to skip SSL verification. Let's say, in the case of applying a manifest, you are trying to connect an agent: if you want to skip the SSL verification, there is very much an option to do so now; we have added that feature.
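As a sketch of what that toggle looks like on the agent side: the subscriber deployment in the agent manifest carries an environment variable for it. The variable name below is an assumption, so verify it against the manifests shipped with your Litmus version:

    # excerpt from the chaos agent (subscriber) Deployment
    env:
      - name: SKIP_SSL_VERIFY          # assumed flag name; check your agent manifest
        value: "true"                  # skip TLS certificate verification when connecting to the control plane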
A: You can just provide the flag, and you'll have the ability to skip your SSL verification if your use case demands it. Also, for developers, both the community developers as well as the developers in the core team, we have improved the logging functionality of the different servers that we use. That way, whenever you encounter a bug, or you're looking for a specific log, let's say for an agent connection or a subscriber connection, those log metrics have been enhanced.
A: So now you'll be getting better events and better results, so that you know what exactly went wrong or what exactly is happening. We have enhanced that part of the logs both for the development as well as the production setup. For production, when you visit the chaos engine logs, that is, when you open the workflow and check the pod logs there, they have better highlighting. Let's say a probe resulted in success or failure, or the experiment verdict resulted in success or failure: you'd have that individual line highlighted in either green or red, or a warning color. So we have taken measures to enhance the logs both on the production and the development side.
A: Now, on the internal side, we have also migrated the project collection that we use. The project collection was used mainly to store metadata of your Litmus projects. Previously it was in the litmus database; now we have shifted it to the authentication database. And apart from just this shift, we have also done an internal code refactor of the authentication server to enhance and improve security. We have also added enhancements in the cmd probe, specifically in its source mode.
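For readers unfamiliar with it, the cmd probe's source mode lets the probe command run inside a user-supplied image rather than inline in the experiment pod. A minimal sketch, in which the probe name, command, and image are all placeholder assumptions:

    probe:
      - name: db-check                   # hypothetical probe name
        type: cmdProbe
        mode: Edge                       # run before and after chaos
        cmdProbe/inputs:
          command: "mysql -e 'select 1'" # assumed command
          source:
            image: mysql:8.0             # source mode: run the command from this image
          comparator:
            type: string
            criteria: contains
            value: "1"                   # expected output for a healthy database
        runProperties:
          probeTimeout: 5
          interval: 2
          retry: 1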
A: We have also hardened the Litmus Alpine images that were used in the different LitmusChaos tools, and the e2e pipelines to monitor all the pipeline builds have been created. So there are nightly builds and a whole e2e dashboard that you can explore, where you can see the different builds for individual workflows, for each of the experiments, etc. And lastly, we are of course working on new experiments, like the AWS AZ experiment, Azure disk loss, etc. Those different kinds of new experiments have also been added to the public ChaosHub.
A: Our first use case is iFood. They deliver more than 60 million orders each month. iFood was founded in 2011, and it aimed to provide a better and quicker solution for online food delivery with innovative systems, so that users can order deliveries on the internet with no hassle and with ease of use. It has over 80 percent of the market share, and geographically iFood covers most cities and regions in Brazil, especially Brazil's financial center of São Paulo.
A: This came with its perks, but it also came with much more complexity and additional costs, so that was one reason why they were looking for solutions to handle these kinds of complexities and to deal with resiliency. The more and more they scaled, faults like database services going out of service, message brokers crashing, entire regions of cloud providers going down due to different kinds of outages, and network bandwidth dropping without notice were definitely some of their challenges; these are the kinds of outages that might happen to your systems and your applications.
A: So these were some of the major problems that iFood was facing in Latin America, and they wanted to mitigate these challenges, because the user base was growing and they wanted to give users seamless access to their delivery platform. Now, they decided to tighten up the reliability by continuously doing load tests and bare-minimum chaos experiments, but the solutions that they were using at the time lacked specific use-case-driven functionality.
A: So if they had certain scenarios in mind, they wouldn't be able to do them; they were only able to target and do basic chaos. Based on the problems mentioned above, they wanted a solution which could be tailored to their specific use cases, along with the ability to customize their own kinds of scenarios based on their experience, and to use them in an automated fashion.
A: They also had the requirement to know which users performed what kind of chaos, to enable better RBAC control in production. Chaos testing requires a certain amount of responsibility when you're doing it, so they wanted to know which user is performing what kind of chaos. If they're doing it on production, or on a specific environment, and something goes horribly wrong because of a specific chaos experiment, then they would like to know which user performed what experiment, and what the scale of it was.
A: What exactly did it target? Those are the kinds of things they wanted to know, which user performed what, to enable better RBAC controls. And the current chaos engineering solution they were using was not really automated, and it also had a limited number of experiments. With the amount of ideas that iFood had regarding the scenarios they wanted to create, they wanted to customize the experiments and eliminate manual effort as much as possible: take these ideas, create custom scenarios, automate them, and have everything running by itself rather than doing it manually.
A: So these are some of the challenges that they faced, and one of the main challenges was downtime, which is why the thought came into their mind to switch to an automated chaos tool. So what exactly goes wrong if you have downtime? Right off the bat, there's a loss of customer confidence, which is the biggest letdown.
A: If you have an application with a huge base of customers and there's even a slight amount of downtime, you would have a loss of customer confidence, not to mention the amount of costs that you might incur in that time frame. The average duration of an outage is reported to be about 79 minutes, and the average cost of these downtimes is about $84,000, which is huge considering that period of time.
A: Let's say it's not even 79 minutes, even if it's five minutes: in those five minutes, millions of users could have ideally clicked on the app, wanted to get food, wanted to have something delivered, or just wanted to check out your platform.
A: Now, Litmus came in with the idea of providing a lot of chaos experiments which suited their requirements, because Litmus has the ability to add your own private hubs, as well as the public hub, which is also filled with over 50-plus experiments covering a range of different types of experiments. So they can usually pick up one of those experiments and then add on top of it.
A: They really liked that idea of being able to customize something for their specific requirements, because there were multiple options, multiple areas that Litmus was touching. So they went with a declarative approach, which helped them customize these chaos experiments and then tune the ChaosEngine further, adding their own ENVs and attuning it specifically to their requirements. Litmus also gave them the ability to fine-grain it.
A: We also gave them the ability to construct a workflow as a cron, because they wanted to automate and save manual labor. We also have the option to save the different scenarios as templates for later use, so that aided easier automation and automated chaos at specific intervals. So that is one feature that they considered handy, and that is something that is helping them automate this entire process and remove manual labor as much as possible.
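As a rough illustration of the cron idea: in Litmus 2.x, a scheduled scenario is ultimately an Argo CronWorkflow wrapping the chaos steps. A minimal sketch, where the schedule, names, and checker arguments are assumptions modeled on the workflows the Chaos Center generates:

    apiVersion: argoproj.io/v1alpha1
    kind: CronWorkflow
    metadata:
      name: cart-chaos-cron            # hypothetical scenario name
    spec:
      schedule: "0 2 * * *"            # run the scenario every day at 02:00
      workflowSpec:
        entrypoint: chaos
        templates:
          - name: chaos
            container:
              image: litmuschaos/litmus-checker:latest   # applies the ChaosEngine and waits on its result
              args: ["-file=/tmp/chaosengine.yaml", "-saveName=/tmp/engine-name"]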
A: That was one of the stories of what iFood is currently using Litmus for. To continue with the next use case and also the demo, I would like to hand it over to Nilanjan, and he can guide you through the rest.
B: Our second use case is Halodoc. It's not an isolated incident where newly added services go down and eventually get mitigated after much effort, affecting the team and the end users. In a system with the kind of dependencies that Halodoc had, it was prudent to test and measure service availability across a host of failure scenarios. This needed to be done before going live and occasionally after it, albeit in a more controlled manner. Hence, chaos engineering was found suitable to supplement the existing QA, with comprehensive automated test suites and periodic performance testing analysis, to make the platform more robust.
B: Considering that the microservices span several frameworks and languages, such as Java, Python, C++ and Golang, it was vital to subject them to varied service-level faults. Add to it the hybrid nature of the infrastructure and the varied AWS services, and the need for the ability to target non-Kubernetes entities like cloud instances, disks, etc. becomes clear.
B: Furthermore, the application developers were required to be able to build their own faults, integrate them into the suite, and have them orchestrated in a fashion similar to the cloud-native faults. As for chaos scenario definition, there was a need for full-fledged chaos scenarios that combined faults with some custom validation, depending on the use case, as the chaos tests were expected to run in an automated fashion after the initial experimentation, or after establishing a good fit.
B: Halodoc needed a tool with the ability to isolate the chaos view for the respective teams, with admin controls in place so the possible blast radius stays contained. This, allied with the standard security considerations around running third-party containers, was required. As for observations, Halodoc relies heavily on observability, both for monitoring application and infrastructure behavior (the stack includes New Relic, Prometheus, Grafana, Elasticsearch, etc.) as well as for reporting and analysis.
B: They use Allure for test reports and Lighthouse for service analytics. It was only judicious to choose a chaos framework that can provide enough data to ingest in terms of logs, metrics and events. Lastly, community support: Halodoc saw value in an open source project that has a strong community around the tool, with approachable maintainers who could see reason in the issues raised and the proposed enhancements, while keeping a welcoming environment for users who can contribute back.
B: Hence, Halodoc chose LitmusChaos, which met the requirement criteria to a great extent, while having a roadmap and release cadence that aligned well with their needs and pace. Another reason for choosing LitmusChaos is the GitOps support, which allowed for the automation of chaos experiments. Halodoc has also contributed towards better user experience in the Chaos Center and towards improving the security of the platform from their side.
Halodoc's
initial
efforts
with
litmus
involved
manually,
creating
chaos,
engine
custom
resources
targeting
the
application
ports
to
verify
their
behavior.
This
in
itself
proved
beneficial
with
some
interesting
application.
Behavior
unheard
in
the
development
process.
Eventually,
the
experiments
were
crafted
with
right
validations
using
litmus
probes
to
form
chaos,
workflow
resources
that
can
be
invoked,
programmatically
and
automate.
The
process
of
hypothesis
validation
during
the
chaos
today.
B: While the chaos experiments on staging are used as a gating mechanism for deployment into production, the team at Halodoc believes firmly in the merits of testing in production. Scheduled chaos experiments are used to conduct automated game days in the production environment, with a mapping between the fault type and the load conditions, devised based on the usage and traffic patterns.
B: The upgrades of the chaos microservices on the clusters are carried out in much the same fashion as any other tooling, with the application undergoing standard scans and checks in the GitLab pipelines. With that, we are all set for a demonstration of Litmus, where we'll see how we can inject chaos into a Kubernetes application to assess its resiliency. See you in the demo.

B: Hello there, and welcome to the demo on LitmusChaos. But before we actually jump into creating some chaos: as you can see, I'm here in my Chaos Center.
B: I would also like to show you this dashboard, which is a Grafana dashboard for our boutique application. As you can see right now, the metrics that we can observe here in the dashboards are indicative of normal system behavior. We have a blackbox exporter, which indicates the service endpoint is quite healthy, and the probe success percentage for the same is 100.
B: We can also see that the queries per second for the cart lie somewhere in the range of 40 to 60 QPS, or OPS basically, which is indicative of normal system behavior, and the access duration, or you could say the latency, is also quite low right now, in the vicinity of somewhere between 2.4 and 2.8 seconds, which is quite normal.

B: So with that, we can actually go ahead and target chaos at our application, using an HTTP chaos experiment. To do so, in my Chaos Center I'll first of all schedule a chaos scenario. I'll choose the self-agent that I have and go next. Then I'll choose a ChaosHub, because that's where my pod HTTP latency experiment is situated. Then I'll go next, and we can name this the cart chaos scenario, since we are targeting the cart.
B: Now that we have added our chaos experiment, we just need to simply fill it in, specifying the exact details of our chaos, so that the experiment can target the requisite pod and the resource that we want to target.
B: So for that, I'll first of all go next over here, and here in the application namespace we need to choose the namespace in which our boutique app lives. That happens to be litmus for now, since we have installed it in the same namespace as LitmusChaos. And for the application kind, well, it is a deployment.
B: So, to do so, I'll first of all add a new probe. I'll go for, let's say, a cart probe over here (that's the name of the probe), and it will be of the type of an HTTP probe, which will be running in Continuous mode, that is, throughout the experiment in a continuous fashion. Before we fill in any of the probe properties, I'll first of all try to bring your attention to what we are going to do as part of this HTTP probe.
B: So we are just going to provide the URL over here for the cart, and the condition that we are going to enforce is that we are expecting a response code of 200 whenever we perform a GET request. So what will happen is that we'll be performing an HTTP GET request at this particular endpoint in a continuous fashion throughout the duration of the experiment.
B: Now we can go ahead and specify a few probe properties. First, the timeout, after which the probe attempt would time out and basically fail: let us give this as three seconds. Then how many times we shall retry in the event that our probe is actually failing: we can set this to one, we can retry once just to be sure. And then, what is the interval that we want to have between successive probe iterations?
B: We can set that to one. So with that, we are pretty much done with expressing our probe in a declarative fashion, and that's all you basically need to initialize a probe and check your application's steady-state conditions during the chaos. With that, I'll add the probe and go next.
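Put together, the probe built in the wizard comes out looking roughly like this in the scenario manifest; the URL is a placeholder assumption, and the field names follow the Litmus 2.x httpProbe schema, with the run properties in seconds as in the wizard:

    probe:
      - name: cart-probe
        type: httpProbe
        mode: Continuous               # run throughout the chaos duration
        httpProbe/inputs:
          url: http://cartservice.litmus.svc.cluster.local:7070   # assumed cart endpoint
          method:
            get:
              criteria: ==             # compare the observed response code...
              responseCode: "200"      # ...against the expected 200
        runProperties:
          probeTimeout: 3              # fail an attempt after 3 seconds
          retry: 1                     # retry once on failure
          interval: 1                  # wait 1 second between iterations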
B: Lastly, in the last step, we just need to specify a few environment parameters for the experiment; these are the parameters with which the experiment will run. First is the total chaos duration: we'll be running this for a duration of 60 seconds, which seems plausible. And for the latency, what I'll do is add a very big latency, which would essentially go ahead and block our HTTP requests for that latency value, and this is in milliseconds. So I'm adding an HTTP latency of 80,000 milliseconds, which appears to be very large, and we'll see what happens when we apply this large a latency in this experiment.
B: Also, we need to provide our target service port; this is the port that we are targeting on that deployment's service. So let us try to see what this target service port looks like in our Kubernetes terminal, that is, using kubectl. What I'll do is list all the services that we have over here. You can see that we have a cart service, and the cart service has a port of 7070, so we'll be using this as our target port.
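For context, the Service being inspected looks roughly like the cartservice definition in the boutique demo application; treat the selector and names as assumptions:

    apiVersion: v1
    kind: Service
    metadata:
      name: cartservice
      namespace: litmus                # deployed alongside Litmus in this demo
    spec:
      selector:
        app: cartservice               # matches the cart deployment's pods
      ports:
        - name: grpc
          port: 7070                   # the target service port used by the experiment
          targetPort: 7070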
B: With that, we are ready to go ahead, but not before we actually specify the pods affected percentage. This is the percentage of the pods that we mean to target. The minimum number of pods that this experiment will target is one, and above that, whatever percentage we specify over here is the percentage that it will go ahead and target. So I'll mention 50 over here.
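Taken together, the tunables chosen in this step map onto the standard pod-http-latency environment variables; a sketch of the resulting env block with the values from this demo:

    env:
      - name: TOTAL_CHAOS_DURATION
        value: "60"                    # run the fault for 60 seconds
      - name: LATENCY
        value: "80000"                 # injected HTTP latency, in milliseconds
      - name: TARGET_SERVICE_PORT
        value: "7070"                  # the cartservice port found via kubectl
      - name: PODS_AFFECTED_PERC
        value: "50"                    # target 50% of matching pods (minimum one)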
B: So with that, we can finish up over here, and I'll set revert schedule to false. What this will do is that, basically, it won't delete any of the experiment metadata that gets created during the experiment execution; this includes all the pods and the workflow resources that we have as part of the experiment.
B: This will allow us to retain the logs so that we can view them. With that, I'll go next, and we can specify a weight for the calculation of the resiliency score at the end of the test. We can keep it at 10; since we have only one chaos experiment, it doesn't really matter what weight we provide here. We can go next now, and I would like to schedule it now, then go next.
B: So our chaos scenario has been successfully created over here; as you can see, it's running. So let us actually wait for a while for the experiment to get initialized. You can see right now that the chaos experiment is getting installed: the pod HTTP latency experiment is being installed as part of this step.
B: All right, so with that, our installation of the chaos experiment is over, and now we can see that the pod HTTP latency experiment has in fact started. So what we can do is verify the effects of this experiment in real time using our observability dashboard, that is, the Grafana dashboard.
B: We are essentially applying a very large latency to our cart service application, and therefore we can see that the access duration, or the latency, is increasing exponentially, while the QPS is also taking a hit. You can see that the mean QPS, indicated in yellow, is going up, while the 99th-percentile, the immediate QPS, is in fact going down.
B: The response is still pending, and we can see that it says something has failed; there are some logs and basically some information for debugging. It says 500 internal server error, which makes sense, because we have essentially added a very large latency, and right now the front end is not getting any information from the cart, and hence we are observing this error.
B: If we go back to our application dashboard right now, we can see that the chaos duration has in fact passed, and at this stage we can see that the experiment is effectively getting over. We are right now in the post-chaos stage, where the fault's effect has been reverted, hopefully, and what we would like to understand right now is what this did to our application. What we saw in real time is that our application was unavailable during the chaos, but it's very much important to validate what is happening in an automated fashion.
B: As we wait for our experiment to conclude, we can see that the service metrics are again regaining normal system behavior: the access duration is going down, the cart QPS is returning to its normal state somewhat, and the blackbox exporter, that is, the probe success percentage that we have, is also getting back to a normal 100 percent.
B: We can see that there's no remnant of the 500 internal error that we were getting, since the effects of the chaos have been removed for now. So if we go back to our Chaos Center, let us try to analyze what went wrong in this experimentation and what Litmus has to say about it.
B: If we go inside the table view and try to view the logs and results, we can see that we have all the experiment logs over here. As part of our experiment logs, of course, we are first of all getting all the different experiment metrics, for example the running pod over here; this is the name of the pod that we are targeting, the cart service. Then we are also seeing the run properties of the probe over here, that is, the timeout, the interval, and the retries.
B: So, over the course of time, you can see that initially we were getting an actual value of 200 when the probe was running, which makes sense, since before the chaos the service was working correctly, and hence we were getting a 200 response code as expected. But during the chaos, well, we didn't quite get a 200 response code, which can be seen over here in this log.
B: It says that the actual value is 500, which does not match the expected value of 200, and this is in sync with what we saw earlier in our application in the browser as well. Basically, we saw that we were getting a 500 internal error, and therefore this has been the cause of the failure of this experiment.
B: As you can see, the probe status has failed, and therefore the experiment has failed. This shows how Litmus probes can be leveraged to automate the process of hypothesis validation during the chaos, and how you can use the logs for verifying the precise cause of the failure, or the passing, of an experiment using LitmusChaos. We can also get a quick summary of the entire experiment using the chaos result, where we can see that the experiment status is completed but the verdict is failed, and the probe success percentage is zero for the probe that we defined, that is, the cart probe. We can see that it says 'better luck next time' for the continuous mode, which means that, well, it has failed.
B: So with that, we saw how we can validate the experiment, how we can use Litmus probes in order to validate our chaos experiment's resiliency. We got the information and the validation that, well, something is not quite right with our application, and some component of it, at least, is weak. So how can we make it more resilient in this case? The most plausible fix could be to just bump up the number of pods that we have as part of our application deployment.
B: So let us actually try to do that. We have one pod right now; let's scale it up to maybe two pods, re-run this experiment, and see how the experimentation goes. I'll go back to my terminal, and what I'll do is try to scale up the cart service deployment, which is the deployment backing the cart, to two replicas.
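Declaratively, that scale-up is just a bump of the Deployment's replica count; a minimal excerpt, assuming the boutique demo's cartservice names:

    # excerpt of the cartservice Deployment; equivalent to
    # `kubectl scale deployment cartservice -n litmus --replicas=2`
    spec:
      replicas: 2                      # was 1; the second pod keeps serving while the other is under chaos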
B: All right, with that, our pod HTTP latency experiment has actually started. So let us head back to our application dashboard, the boutique dashboard. We can see that slowly our chaos experiment is taking effect over here; the chaos annotation is quite prominent. But what we are observing here in this case is that, so far, our steady state seems to be maintained: the probe success percentage for the endpoint of the cart service seems to be stable.
B: The access duration of the service is also spiking up a bit; it's in the vicinity of 2.5 seconds right now, which is not too bad. So it seems that the chaos is doing something to our application: the QPS is steadily increasing, and the access duration, the latency, is kind of flattening out. But the most important question remains: is the application still available? For that, I'll simply refresh.
B: And as you can see, this time around it's not going down; we are not waiting on any response code or anything as such. It's still accessible, no matter what the application dashboard is showing. To compare with the result on the application dashboard, I'd actually like to compare them side by side, maybe over a 15-minute window.
B: So this time around, you can see that although we observed a spike in the access duration for the cart service, it's much less compared to our earlier run; it's almost half, which makes sense, because we have added one more pod, and that is mitigating the effects of the chaos and therefore helping the application sustain it.
B: So with that, we can see that our chaos duration has essentially passed. We can go back to our application right now, and we can see that, well, even after the removal of the chaos, everything is fine and working in order. We can wait for our chaos experiment to complete to observe its effect, and, as you can see, this time around the chaos experiment has completed over here.
B: So let us try to observe the logs this time. Although we can already see that it has passed, let us still validate using the logs in the chaos result. If we take a look at the logs this time around, we can see that, of course, before the experiment as well, we were getting an expected value of 200 as well as an actual value of 200, that is, the response code, which makes sense.
B: And this time around, every time we performed this check, we always got a 200 response code, and there was no response timeout. That is, we are right on track with what we observed in the browser as well: the website was available throughout the experiment duration, and as a result, our probe has in fact passed. As you can see, the cart probe has passed, and this in turn ensured that our experiment is passing.
B: We can observe the same from our chaos result. As you can see over here, the experiment status is completed, while the verdict is passed, with a probe success percentage of 100: since we had only one probe and it passed, the probe success percentage is 100, and the continuous probe that we had defined, by the name of cart probe, has passed. So with that, we conclude the demonstration of LitmusChaos.