Description
In this tutorial, we'll set up a demo application and have it undergo some chaos in combination with load testing. We will then use Keptn quality gates to evaluate the resilience of the application based on SLO-driven quality gates.
Follow the tutorial: https://tutorials.keptn.sh/tutorials/keptn-litmus-08/index.html
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on GitHub: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
Hi everyone, in this video I want to show you how to evaluate the resilience of your applications with LitmusChaos, Prometheus, and Keptn. For this, I'm following a tutorial from tutorials.keptn.sh. I will also link the tutorial in the video description, so make sure to check it out.
For this tutorial, all you need to do to get started is bring your own Kubernetes cluster. I brought my cluster from GKE. It has enough nodes and vCPUs: I came with a cluster with eight vCPUs and around 30 GB of memory, so I can install everything that is needed in this tutorial.
The first part is that we are going to install Istio; we will need it later on for traffic routing. I've already downloaded the Istio CLI here, so I can go ahead and install it on my cluster.
Here in my terminal I'm already connected to my cluster, so I just execute the Istio installer and let it finish the installation for a second.
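The Istio installation step boils down to a single command; a minimal sketch, assuming `istioctl` is on the PATH and `kubectl` points at the right cluster (exact flags can differ between Istio versions):

```shell
# Install Istio onto the currently connected cluster with the default
# profile; -y skips the interactive confirmation prompt.
istioctl install -y

# Verify the Istio control-plane pods came up.
kubectl get pods -n istio-system
```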
In the next part, we will download the Keptn CLI and then also install Keptn on our Kubernetes cluster. There are different ways to download the Keptn CLI.
I have chosen the first way, simply using curl to download the most recent Keptn version onto my local machine; then I'm going to install Keptn from my local machine into the Kubernetes cluster that I'm connected to.
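The curl-based download looks roughly like this; a sketch of the pattern used around Keptn 0.8 (check the tutorial for the exact command and version pin for your platform):

```shell
# Fetch and run the Keptn CLI installer script onto the local machine.
curl -sL https://get.keptn.sh | bash

# Confirm the CLI is installed and on the PATH.
keptn version
```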
Let's check back whether Istio is already installed. Not yet; we're still waiting for the installation to finish, but it should be done in a second. So I can already go ahead and copy this command, which will download the Keptn CLI onto my local machine, and then we're going to install Keptn into the cluster.
So let's download the Keptn CLI, which should just take a second, and in the next part we're going to install Keptn into our Kubernetes cluster. We will use the ClusterIP endpoint service type for Keptn; we won't expose it via a NodePort, because we will do that part with Istio. The exposure of the Keptn API and of the services that Keptn manages will be done with Istio, which will be our ingress in this example. The use case we are targeting with this installation is the continuous delivery use case.
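The installation described above maps to one CLI call; a sketch based on the Keptn 0.8-era CLI (flag names may differ in other versions):

```shell
# Install Keptn into the connected cluster, including the
# continuous-delivery execution-plane services; the API stays on a
# ClusterIP service, since Istio will expose it later.
keptn install --use-case=continuous-delivery
```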
Now let's install Keptn into the cluster. Let me just move this a little bit; we're going to do this by copying this command into our terminal. It will ask us whether we actually want to install it on the cluster (mine is called litmus-tutorial); I confirm with yes and it installs into the cluster, which will take about three minutes. Let's take a look at what is actually being installed here: we're choosing the continuous delivery use case.
That means we are not only installing the control plane of Keptn, which you can use, for example, for the quality gates use case, but we're also installing some services from the execution plane, like the helm-service and the jmeter-service. The first is responsible for deploying our application; the second one, the jmeter-service, is able to run load tests (JMeter tests) against the services that we manage with Keptn.
You can also install just the Keptn control plane and add the execution plane services yourself later on, or you can simply pass this use-case flag to the Keptn installation procedure. In my case I want to install Keptn including the execution plane services, the helm-service and the jmeter-service, as we're going to need them in this tutorial.
As I said, it will take a couple of minutes to finish. Let's take a look, and it's already done: "Keptn has been successfully set up on your cluster." Perfect. Let's move on and now configure Istio to be the ingress for our Keptn installation. For this we have provided a sample script; we just download it, make it executable, and then execute it.
In the first step I just downloaded the script; now I add the executable flag to it and run it. It will basically set up an ingress and a gateway for Keptn, and it will also restart the helm-service pod so that it picks up the new configuration we've just changed with the script.
The tutorial itself also describes what is really going on here, but we want to focus on the use case later on; I just want to walk you through this so you can see how long it takes to actually work through the tutorial.
So let's now connect our Keptn CLI to the Keptn installation that we just put on our Kubernetes cluster. For this, I'm copying these two variables, the Keptn endpoint and the Keptn API token; they will be fetched by these kubectl commands, so there is no need for me to know the Keptn endpoint and the API token myself.
We have provided some utility functions here, and in the next part I can now authenticate the Keptn CLI against the Keptn cluster. And here we go: I am now successfully connected to my cluster. Let's just open this, and here we are: I have access to the Swagger UI, which is our API documentation, and I can see it is the correct version of Keptn that has just been installed.
That's great! Let's move on. The tutorial also provides a couple of demo resources for you to use; we just download them from GitHub so that we have everything locally available on our machine. I think I already did this earlier, so let me remove the folder where I downloaded it and download it again.
The first one is our prometheus-service, which will be responsible for configuring and creating a Prometheus instance, and the other one is our Prometheus SLI integration, which is responsible for retrieving data from Prometheus. I'm installing both into the Kubernetes cluster.
This service.yaml has actually been downloaded from the repository earlier: in step number eight we downloaded the demo resources. We're now just going to apply the litmus-service, whose definition is stored in this folder. Let's run this, and since it is now created, let's take a look at all the pods that are already running in our Keptn installation.
The Keptn installation lives in the keptn namespace, and we can see a couple of pods running, including the litmus-service that is now part of our Keptn installation. Perfect. Now let's go ahead and create a project. We want to create a project based on this shipyard definition. This definition has one stage, the chaos stage, and one delivery sequence, in which we first want to do the deployment, then a test, and then an evaluation.
A
So,
let's
sorry
one
copy,
let's
first
copy
this
line,
and
now
let
us
create
a
project
here
we
go
project
successfully
created.
I
have
not
yet
defined
a
git
upstream
for
this
repository.
So
right
now
everything
is
stored
in
the
git
repository
managed
by
captain
locally
inside
this
kubernetes
cluster,
which
is
fine
for
the
moment.
Let's now onboard our service; we call it the helloservice, and we have the Helm charts already defined in our demo resources repository. So we're just going to create this service here in our demo Kubernetes cluster, and in the next step we are going to add some JMeter tests: first a load test, and then some configuration for how we want to execute this load test. I'm copying both lines and adding this to the Keptn installation as well.
Now let's take a look at the SLO file, helloservice-slo.yaml. What we can see here is that we have defined two SLOs. What's really important are the criteria in the warning and, in particular, the pass properties of our SLIs, and here specifically for the probe duration: if the probe is faster than 200 milliseconds, we get the full score for this criterion, and if not, we get zero points in the quality evaluation.
Okay, so now let's add our chaos experiment to the Git repository that is managed by Keptn. Here, too, the experiment is already defined for you to use; we just copy this and execute it, so we are adding the experiment YAML from our local machine onto our Kubernetes cluster.
One final step for the setup: we want to configure Prometheus for our litmus project and our helloservice application, and we also want to add the Prometheus blackbox exporter. We do this because our application is actually going to be removed: the pod in which our application is running will be deleted by the chaos experiment, and we need some way to check the availability of the service from the outside, because the service won't be available for a couple of seconds. We're going to do this with the blackbox exporter.
So first we run keptn configure monitoring prometheus, which will set up Prometheus and also alerting rules for Prometheus. We won't use the alerting rules in this example, but usually that is what this command does. In the next part, we're going to install our blackbox exporter.
The hello from the demo web application already shows up here, and we can see that currently the litmus-service and the jmeter-service are running. The litmus-service is sending the instructions to LitmusChaos to start the chaos experiment, and the chaos experiment that we added to our Keptn-managed repository is the pod-delete experiment, so LitmusChaos will go ahead and delete the application pod.
With this we can make sure that the application we are testing is actually under some load. That is important because we don't want to evaluate the resilience of our application when there is nothing going on: if it is not under load, we would measure the impact of chaos in an idle situation. We want to evaluate the impact of chaos in a real-world setting, where there is actually some traffic fired against our application.
While I was talking, both the jmeter-service and the litmus-service finished their jobs, and we can already see that the evaluation scored zero points. We can even look at why it scored zero points: clicking on this little icon here, we see the probe success percentage was only 65, so the service was reachable for only 65 percent of the evaluation period, which is around two minutes.
So what does that mean? The application could not respond to all the requests that were sent by the Prometheus blackbox exporter, and it would likewise not be able to respond to all the requests sent by real users.
So here is what we want to do now: let's say we want to increase the replica count of this application. Instead of running only one instance, let's run three instances of this application. That means that if our chaos experiment deletes one of those instances, two instances should still be available and should be able to serve the traffic that JMeter fires against our application.
Everything that we just saw has also been recorded here in the run-experiment step, where we can see that the experiment failed. So let's now increase the resilience of this application, and let's take a look at the deployment event that we want to send to our Keptn installation.
The important part I want to focus on here is the replica count. By sending these instructions, a CloudEvent holding all the information Keptn needs to trigger a new delivery sequence, we change exactly that: we are using the same image as before and just changing the replica count to three.
That means we will now run three instances of our application. Before we do this, let me open another tab here so we can watch the actual application, how it will be scaled up and how LitmusChaos will inject the chaos experiment.
So here: kubectl get pods, running in the litmus-chaos namespace. Right now we have one instance of our application running. Sorry, here we go. Now let's do a keptn send event for our helloservice with the deployment event I just showed you. Keptn will pick up this event and trigger the Helm integration to start the deployment, and we can already see it is now scaling up to three instances.
Two are not ready yet; they are not yet available because there is a readiness probe defined as part of this application, and it will take 30 seconds for each pod to initialize before it can serve traffic. So after around 30 seconds, we should have two more ready instances available. Here we go: one is already available, and the other one should come up quite soon.
Once these are available, Keptn will be informed that the deployment is finished, and it will go ahead and trigger the next phase, which is now the actual chaos experiment. We can already see the pod-delete experiment has started; a chaos runner managed by LitmusChaos has started as well, and one of those three application pods is then deleted by the chaos experiment. We can see exactly this one is now terminating and another one is coming up.
This pod here will still take around 30 seconds to be fully ready again, since the readiness probe is running again, but the other two instances were able to serve the traffic all along. So in this run of the experiment we have not deleted all the instances of the application: we now have three instances of it running and we only deleted one of them.
The other two should be able to respond to all the requests and serve the traffic, and we will see in the evaluation how this actually impacts the resilience of the application.
We just saw that the LitmusChaos-managed resources, the chaos runner and the experiment itself, are now finished; they have been removed from the namespace again, and everything has been reported back to Keptn. So let's take a look at the Keptn Bridge.
Here we go: the delivery succeeded in this case, and we can see (let me just remove this) that the deployment finished successfully, both tests have actually finished, and the evaluation has now scored 100 percent of our quality score.
So let's take a look: as we had hoped, the success percentage is now 100. Instead of killing all the pods, we just killed one pod. We also did that earlier: we killed one pod, but back then it was the only pod.
Only one instance of our application was available then. We did the same thing now: we deleted one instance of our application, but the other two were still able to serve the traffic. So the success percentage of all the probes we sent to our application was 100, and due to the high availability, with at least two replicas available to respond to the requests, we could also satisfy the probe duration criterion of being faster than 200 milliseconds in response time.
So with this, we have found a way to increase the resilience of this one-microservice application. There might be other ways, but for our tests we did it by scaling up the application, making it, let's say, highly available, with three instances of this application running instead of one. Every time one instance crashes, there are still two other instances that can serve the traffic.
We can also see this here. You might get some other numbers; it really depends on your setup and on the sizing of your cluster which numbers you end up with. But nevertheless, we can see that we did one run that actually failed the quality gates: we identified that our application is not resilient and should not be moved to production.
We fixed this issue by scaling the application up to three instances of the same application and having the traffic shifted between these three instances, and we did another, successful run of our delivery sequence with a deployment, a test, and an evaluation. In this case, everything went fine.
So thanks for watching this short video. Again, the tutorial is linked below this video, and you will find all the tutorials at tutorials.keptn.sh.