From YouTube: Cloud Native Chaos Engineering Preview With LitmusChaos
A
Hello, and welcome to this webinar about chaos engineering with LitmusChaos. I am the head of chaos engineering at Harness, and along with me, my colleague Karthik is a principal engineer at Harness.
A
Here is the agenda. We'll first look at the why and what of chaos engineering and its relevance to cloud native, and I'll also talk about how chaos engineering generally matures in an organization that practices DevOps. Then I'll delve a little into an introduction of Litmus, its features, and its use cases. There have been a lot of learnings about how Litmus is used by the community and by enterprises.
A
I'll talk a little about the newly found use cases for Litmus, and at the end, Karthik will do a demo of how to get started with Litmus. He'll show an example of running chaos using Litmus, connecting new targets, and so on, to the Litmus-based control plane.
A
So let's start with some facts about what is happening in IT today.
A
Digital services are growing fast, and digital traffic has been growing at a staggering rate over the last few years, fueled by the digital transformation that is happening.
A
Why is the digital transformation happening at this rate? Because of the enablement given by Kubernetes: the transformation of IT into the digital world is actually fueled by the adoption of Kubernetes.
A
There is a subsection of DevOps that we call cloud native DevOps, which is a little different from traditional DevOps in that it's faster. Deliveries happen in a more automated way, with a new set of tools, and the application of configuration changes, upgrades, and the overall delivery are all automated.
A
Everything is happening fast, and service reliability is extremely important for businesses. When you have less reliability, it really means you are facing service downtimes or outages, which is not good.
A
Microservices are generally built with redundancy as a goal, and even with that, we are seeing outages happen. This is because, however carefully you build, there is always some new thing that fails in a corner case you have not taken care of. So outages will still happen, and that's why it becomes a matter of how many nines of reliability you have: 99.999 percent, or one more nine on top of that percentage.
A
There are general metrics that have evolved over time for talking about reliability. Mean time to failure: how often you fail, so the gap between failures has to become bigger and bigger. And even when things do fail, how fast you recover, so that the service downtime during an outage can be reduced. This is the context of reliability with respect to the business and its reputation.
A
So there is a need to increase reliability, or reduce downtime. How do you do that? We have seen DevOps continuously focusing on various degrees and types of testing, including functional, system, and load testing. A well-architected system with good QA provides good reliability, or fewer outages, but there is a limit to how far all these types of testing can guarantee the quality or reliability of a product or service.
A
So the next option, the next innovation, in increasing the reliability or resilience of a service is to introduce chaos testing, or to practice chaos engineering. And what is chaos engineering? In simple terms, it is introducing faults on purpose and observing the system. If a fault can happen, it will happen, so why not trigger it right away? You want to be able to increase your responsiveness to downtimes.
A
So let's actually introduce these faults to see if there are weaknesses, and then reduce the recovery times and fix issues. In general, it's an iterative process. In chaos engineering, you pick some systems or applications, then you pick certain chaos experiments or scenarios, and you run them in a focused manner, with a low blast radius, and then see if the outcome matches your expected results, or matches your expected steady-state hypothesis.
A
If not, you have a learning: you go and fix that issue, and then you keep doing it. So this is an iterative process. What it means is that you start small, then you start covering various services and various components within services; the degree of randomness of the faults increases, and you start covering more complex faults that could happen. That is what chaos engineering is, and its relevance keeps growing in cloud native ecosystems.
A
Simply because you have a proliferation of microservices. Developers focus on their code within a microservice, which is now the tip of a pyramid. Unlike in legacy systems, where you had tight integrations with the stack underneath or surrounding you, in the microservices or cloud native world you are operating totally independently.
A
You're designing your code totally independently, to run within a container as a microservice, and because of this you're shipping faster. Instead of shipping every quarter, we now see shipments happening every week or on a daily basis, sometimes multiple releases in a day.
A
So in effect, you have multiple microservices to deal with, and you are receiving updates to these microservices at a phenomenal rate. Together, you are seeing more dynamism. How can you make sure that service reliability stays intact if there are failures, either within these microservices or in the infrastructure that hosts them? The answer is that chaos engineering can help deal with this increased dynamism.
A
Chaos engineering is becoming, or has become, the more obvious choice for dealing with the reliability problem of cloud native business services.
A
So for a chaos engineering practice in cloud native, we've been advocating certain principles. Cloud native generally rides on open source, so your chaos engineering stack or solution could be an open source one. The chaos experiments need to be well tested, rugged, and flexible, so they're better built through the community, or placed in an open marketplace. And chaos experiments are like any other software code, which really means you keep changing them, and there are various versions.
A
You need version control management around chaos experiments, so it's better to have chaos operators and chaos experiments as custom resources on cloud native; see the sketch below. Then, obviously, the real problems happen at scale, so you need to ensure reliability at scale, which really means chaos engineering also has to be applied to systems at scale, and you have to increase the scale of chaos engineering as well. GitOps is one possible answer to chaos engineering happening at scale. The last principle is observability.
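As a concrete illustration of experiments as custom resources, here is a minimal ChaosEngine manifest in the shape Litmus uses; the application labels, namespace, and service account below are placeholders for this sketch, not values from the talk:

```yaml
# Minimal ChaosEngine: binds a chaos experiment to a target application.
# The appinfo values and the service account are illustrative only.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: demo-chaos
  namespace: default
spec:
  appinfo:
    appns: default            # namespace of the target application
    applabel: app=demo        # label selector for the target pods
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete        # experiment definition pulled from a ChaosHub
```

Because the engine and experiment are plain Kubernetes objects, they can be versioned in Git and reconciled by the chaos operator like any other resource, which is what makes the GitOps-at-scale model mentioned above possible.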
A
Observability is very important in chaos engineering, in that you introduce faults but always observe through your established observability practices, so the context of the chaos has to be placed alongside, or on top of, your observability system. What this really means for cloud native chaos engineering is that you'd better have open APIs, or APIs that are easily available, to integrate with chaos engineering tools; a framework should provide APIs to upload, or to pull, the chaos metrics from the chaos engineering platform.
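For example, Litmus ships a chaos exporter that publishes experiment results as Prometheus metrics. A quick way to look at them is to scrape its endpoint directly; the service name, namespace, and metric prefix below are what I'd expect from a default install, so treat them as assumptions to verify against your deployment:

```sh
# Inspect the chaos metrics exposed by the Litmus chaos exporter.
# Service name and namespace are assumptions; check `kubectl get svc -n litmus`.
kubectl port-forward -n litmus svc/chaos-exporter 8080:8080 &
curl -s http://localhost:8080/metrics | grep '^litmuschaos_'
```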
A
Together, these principles cover all the requirements of a good cloud native chaos engineering platform, and one such platform, of course, is LitmusChaos, which I'll talk about in a bit. Before I do that, let's also see how chaos engineering happens: how it starts and matures within an organization.
A
First of all, chaos engineering is relevant in any of the fields: dev, QA, or ops. It's not necessary that chaos engineering is only for ops, so you can shift right or shift left. And similarly, you generally start chaos engineering at the infrastructure layers.
A
You start with simple experiments, and as you develop and gain experience running chaos tests at that layer, you go one level up into the middle layer (APIs, message queues, or API servers), then into more stateful services, databases and data services, and finally, in your own applications, you can start introducing simulated faults. When you reach that level, you can say that you have reached the maximum level of chaos engineering in your organization. At that point, you can generally observe the maturity in the language being used around chaos engineering: are your SLO metrics linked to chaos engineering or not, and are your developer metrics linked to chaos engineering or not? So, as you can see, chaos engineering plays a core role in cloud native DevOps, and a good chaos engineering solution within DevOps, or cloud native DevOps, spreads the culture of chaos across all functions within DevOps.
A
What I mean is that in dev, QA, and ops alike, both developers and SREs practice chaos engineering. Developers run chaos in their pipelines; SREs run SLO validation using chaos and introduce randomized chaos in pre-production, or in production through game days. Similarly, the test teams introduce deeper chaos tests in their test systems and into their CI and CD pipelines.
A
So when you introduce chaos engineering in all fields, there is a general maturing toward deeper tests, and you are validating the products being shipped and being run against more error scenarios, so your services generally become more resilient. LitmusChaos is a CNCF incubating project that has high adoption levels by popular companies, and it also has a large community of around 2,000 users.
A
Let's talk about the Litmus use cases that are generally practiced. From a vertical standpoint, it is used most in banking, retail, and e-commerce services, which is directly related to digital traffic. We have also seen Litmus being used in edge computing scenarios. In general, Litmus usage is happening more and more wherever cloud native DevOps is commonplace.
A
The use cases can take many different forms: primarily various game days to start with, then slowly getting integrated into your CD systems (pre-CD, post-CD, and auto-triggered chaos for continuous deployment), and one more level of maturity is to include Litmus in CI pipelines themselves.
A
And if you have good scale and performance testing, chaos engineering, or Litmus, can help validate your current scale and performance testing strategy. Similarly, if you have invested in good observability systems around cloud native, Litmus-based chaos engineering can help validate whether your observability systems will really help when outages happen. So, how does LitmusChaos work in general?
A
With Litmus, you can start very small and grow in a distributed fashion, because it's a Kubernetes application. You start with a simple Helm chart; a sketch of such an install follows. When you install Litmus, you get the Chaos Center, where users can go and create, orchestrate, and manage chaos scenarios, and the platform comes with a bunch of experiments.
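Here is a minimal install sketch using the Litmus Helm chart; the repo URL and chart name below match the project's published Helm repository at the time of writing, but verify them against the current Litmus docs:

```sh
# Install the Litmus control plane (Chaos Center) into its own namespace.
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install chaos litmuschaos/litmus \
  --namespace litmus --create-namespace
```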
A
These are chaos experiments for Kubernetes resources, and there are some cloud chaos experiments as well. It also comes with a good SDK as a way to introduce your own chaos logic into the platform; we call it "bring your own chaos." You use all these chaos experiments, put them into chaos scenarios, or chaos workflows, and stitch them together however you want, to make a meaningful chaos scenario.
A
Once you have a meaningful scenario, you can trigger it through GitOps, or directly trigger or schedule it using the Chaos Center, and then apply it to any of the use cases. These are the general chaos experiments around Kubernetes.
A
The platform provides good coverage of chaos for various Kubernetes resources, and the experiments are highly tunable: if you know how to manage Kubernetes resources, you can manage the chaos experiments the same way. The other very powerful feature of Litmus, apart from the various experiments, is probes.
A
Probes are a way to declaratively define the steady-state hypothesis of a given experiment, and you can use probes to decide the meaning of chaos experiment results, or to implement SLOs. There are many types of probes, such as the HTTP probe, command probe, Kubernetes probe, and Prometheus probe, and you can run these probes in different modes: continuously, at the edges of the chaos, or only when a failure happens. An example probe definition follows.
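As an illustration, here is what an HTTP probe attached to an experiment can look like; the URL and timing values are placeholders, and field spellings such as `httpProbe/inputs` follow the Litmus 2.x probe schema, so check them against the version you run:

```yaml
# Steady-state hypothesis, declared as a probe on the experiment spec.
# The URL and run properties below are illustrative values.
probe:
  - name: check-frontend-availability
    type: httpProbe
    mode: Continuous          # other modes include SOT, EOT, Edge, OnChaos
    httpProbe/inputs:
      url: http://frontend.demo.svc.cluster.local:8080
      method:
        get:
          criteria: ==        # pass only if the response code matches
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```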
A
So with these probes, you can actually think of a chaos scenario, declaratively define it, and keep moving.
A
This is how the Litmus landscape looks when it comes to CI, CD, and observability. In CI pipelines, you can log into the Chaos Center, create a workflow that produces a manifest, and include that manifest in a stage of a pipeline. We have tested this CI integration; there are tested examples available for Jenkins, GitLab, GitHub Actions, and Harness Drone.
A
Similarly, in CD, you can trigger it pre-CD, post-CD, or based on an event happening at the time of deployment (for example, a pod version changes, so you run some chaos tests). It's well tested with Spinnaker and the Harness continuous delivery module. And similarly, on the observability side, it has integrations with well-known observability platforms; in general, you can do the observability integration with these platforms by using the Litmus HTTP probes.
A
Now for the newly found use cases. Primarily, they are around Kubernetes becoming a configuration control plane, like Google Anthos or Azure, or Crossplane, or even Kubernetes managing OpenStack. The usage of Kubernetes is increasing, and that really means Kubernetes becomes really, really critical: the reliability of Kubernetes becomes critical. There are many users who are using Litmus to validate such platform implementations by introducing Kubernetes faults.
A
The other one is that you will always have hybrid infrastructures. Even though you start Litmus with the Kubernetes experiments, you will see the need for integrated chaos around bare metal and other gear: switches, rack servers, load balancers, and so on. And nowadays you also have multi-cloud scenarios or deployments, where you're using various cloud services (database services, app services, or serverless systems), and where you will start using other tools in conjunction with chaos, for example load testers. A chaos tool plus a load tester together can simulate a chaos scenario for you.
A
So these are some of the newly found use cases of Litmus. To summarize the benefits: with Litmus, or with chaos engineering, you increase your capability to inject faults, so the mean time to identify a fault actually decreases; you're faster. And whenever faults or service outages happen, you recover faster: your MTTR decreases. Because of one and two, you are more agile, and you are fixing issues or weaknesses, so the failures will eventually reduce.
A
So these are the general benefits of chaos engineering, and they're especially true with Litmus. We also have a hosted service: the Litmus chaos engineering control plane as a service. You can get your own control plane by signing up for the LitmusChaos cloud service, connect your targets, run chaos experiments, and observe resilience, or find opportunities to improve resilience. With that, let me turn it over to Karthik to do a demo.
B
Hello, everyone. In this demonstration, let us take a look at how you can install the LitmusChaos platform and how you can get started by running a simple chaos workflow against a sample application and observing the impact of the chaos. My name is Karthik; I am one of the maintainers of the Litmus project.
B
Now that we have our Chaos Center up and available, and the load balancer is working, let us go ahead and log in with the default credentials, that is, admin and litmus. Each user in the Chaos Center is allocated a dedicated chaos workspace, also called a chaos project, which is where they will be performing chaos workflow management: creation of workflows, visualization of the workflows, creation of new teams with invited members, comparison of workflow runs, and so on.
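If you are following along, the Chaos Center UI is reached through the frontend service created by the Helm install. The service name and port below are what I'd expect from a default chart install; treat them as assumptions and confirm with `kubectl get svc -n litmus`:

```sh
# Find the externally reachable address of the Chaos Center UI.
kubectl get svc litmusportal-frontend-service -n litmus
# Without a LoadBalancer, port-forward and open http://localhost:9091 instead.
kubectl port-forward -n litmus svc/litmusportal-frontend-service 9091:9091
```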
B
The chaos dashboard is the first page that greets us once we log into the Chaos Center. We do not have any workflows executed yet, as this is our first login, but one of the most important things to look for and check immediately after logging in to the Chaos Center is whether we have a chaos agent connected.
B
What this means is that the cluster where we have installed the control-plane microservices automatically qualifies as an environment in which you can do your chaos. In our demonstration, we will be connecting an external cluster which houses our sample application, but this self agent is available for you: you could have microservices that you want to run chaos against on the same cluster, and the self agent helps you execute that. The next useful thing to look at in the Chaos Center is the ChaosHub.
B
The ChaosHub is an open catalogue of faults, or chaos experiments, which you can piece together to form a chaos workflow. There are different types of experiments, totaling about 50 experiments in the ChaosHub, and the ChaosHub is also the place where you have some demo or illustration workflows that you can use to kick-start your evaluation of the Litmus project.
B
The settings page in the Chaos Center is useful for managing your account and creating new users on the platform. As an admin, you could create new users, and you can also create teams in each project, where you invite members that are already created on the platform. That is, the admin can go ahead and create different users, providing them with a username and password.
B
The respective users can log in with their credentials, and once they get into their chaos project, they can come to settings, go to the teams tab, and invite members into their own project with a specific role. At the Chaos Center level, the users are classified as owners, editors, and viewers, each with a desired set of permissions.
B
There are a few other settings available on the Chaos Center, typically used as part of day-two operations of a chaos practice. For example, you could influence where the images for the chaos workflow pods come from: you could be using the default docker.io registry and the litmuschaos repository, or you could use your own registries, from where you pull the experiment and workflow pods.
B
The usage statistics page provides you with a very quick view of the executions thus far: you can see how many experiments you've run, how many workflows you've run, how many users are on the platform, and so on. This is something that is useful for the admin to generate some helpful reports.
B
With this introduction to the various screens available on the Chaos Center, let us go ahead and connect our target environment. We spoke about a GKE cluster on which we have our sample app residing, so let us connect that to the Chaos Center, adding it to the control plane, and then begin with our first chaos workflow.
B
The first step here would be to set the account on the workstation where you have installed the litmusctl tool. So let me go ahead and perform the litmusctl set-account step; this is going to help you set up the right keys and auth in order to access the control plane with your credentials for your project. We're going to provide the endpoint where Litmus is running.
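The shape of that step, as I'd sketch it; the endpoint and credentials here are placeholders, and the exact subcommand for connecting an agent differs across litmusctl versions, so check `litmusctl --help` for yours:

```sh
# Point litmusctl at the Chaos Center and authenticate.
litmusctl config set-account \
  --endpoint "http://<chaos-center-address>:9091" \
  --username admin --password litmus

# Then register the target cluster as a chaos agent / chaos infrastructure.
# (Named `connect agent` in older releases, `connect chaos-infra` in newer ones.)
litmusctl connect agent
```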
B
The subscriber, as you can see, is installed with a set of other components, most of which are Kubernetes controllers. The chaos operator, the event tracker, and the workflow controller are custom Kubernetes controllers that act upon different custom resources and participate in the chaos execution process. The exporter happens to be a Prometheus exporter that provides chaos metrics.
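On the target cluster, you can verify these agent components the usual way; the pod name prefixes below are what a typical agent install produces, though exact names vary by Litmus version:

```sh
# List the agent-plane components in their namespace (often `litmus`).
kubectl get pods -n litmus
# Expect pods along the lines of: chaos-operator-ce, chaos-exporter,
# event-tracker, workflow-controller, subscriber
```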
B
These are some properties that you could provide. I am providing minimal values: a short polling interval, just one retry if the response code is not 200, and we're going to begin this, let's say, a second after the experiment execution begins. You also have an option to abort the workflow in case the constraint is not validated: you could set stop-on-failure to true.
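Those knobs map onto the probe's run properties. A sketch with the minimal values just described; the field names follow the Litmus probe schema, though the exact units of some fields vary across releases:

```yaml
# Run properties for the probe, mirroring the values chosen in the demo.
runProperties:
  probePollingInterval: 1   # poll frequently while chaos is in progress
  retry: 1                  # a single retry if the check (HTTP 200) fails
  initialDelaySeconds: 1    # start probing ~1s after chaos execution begins
  stopOnFailure: true       # abort the workflow when the probe fails
```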
B
You could run this experiment for a total duration of 30 seconds with a 10-second interval. That means a pod is going to be picked and killed every 10 seconds, with the total chaos duration of 30 seconds as the upper bound. We just want one iteration of chaos, so I'll just change the duration to a much smaller value; see the tunables sketch below.
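For reference, these tunables are the env entries on the pod-delete experiment inside the ChaosEngine (the same kind of object sketched earlier); the values mirror the demo's starting point:

```yaml
# Tunables for the pod-delete experiment, set on the ChaosEngine spec.
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION   # overall chaos window, in seconds
            value: "30"
          - name: CHAOS_INTERVAL         # kill one pod every N seconds
            value: "10"
          - name: FORCE                  # graceful (false) vs. forced deletion
            value: "false"
```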
B
One aspect to note is that the duration here refers to the chaos duration alone, not the entire experiment. An experiment in Litmus performs a pre-chaos and a post-chaos check to ensure the system is left in a healthy state, and that takes a few seconds over and above the chaos duration you have specified.
B
You could also add more environment variables depending on your need; let's finish here and keep it simple. We have the option of cleaning up the chaos resources immediately after the run, and we have the option of keeping them as well. In our case, let's keep them, because that helps us look at the experiment pod logs.
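That retain-or-delete choice corresponds to the engine's cleanup policy; a sketch of the setting, with the follow-up command to fetch the logs afterwards (pod name and namespace are assumptions to adapt):

```yaml
# On the ChaosEngine spec: keep experiment pods around after the run.
jobCleanUpPolicy: retain   # `delete` would clean them up immediately
```

```sh
# Inspect the experiment pod logs after the run (adjust the namespace).
kubectl get pods -n default | grep pod-delete   # find the experiment pod
kubectl logs -n default <pod-delete-pod-name>
```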
B
Let's take a look at what is happening as part of the workflow run. You could use the workflow visualization graph to track the progress of your chaos workflow. For each experiment that you picked within a workflow, you would typically have two steps performed, though that is easily customizable.
B
We did have an aggressive check against our application's availability, and it failed. What is the mitigation? You could scale up your podtato-main microservice to multiple replicas and repeat the same fault, in which case the workflow will succeed, as per the initial hypothesis, or the desired case, where you would not see any downtime in the probe success percentage across the chaos duration. Once you have the workflow runs executed, you can go to the observability section; a sketch of the scale-up follows.
A
Thank you very much for watching this webinar. We are reachable on the Kubernetes Slack community, in the litmus channel; please do reach out to us. Thank you.