Cloud Native Computing Foundation Kubernetes Community Days (KCD) Chennai 2022, 30 Jul 2022

From YouTube: Using GitOps to Increase System Resiliency with LitmusChaos by Amit Das and Saranya Jena

Description

Chaos engineering has come a long way from its early days at Netflix. While companies are widely adopting Kubernetes for their microservice architecture solutions, for the much-required competitive edge, it is important to ensure the resiliency of these systems for ensuring reliable services for everyone. As a large number of systems are moving towards the Kubernetes-Native approach, an important problem arises, testing these applications the right way to prevent outages in production. In this talk, we will introduce LitmusChaos and how we can leverage GitOps to increase the system resiliency. Further, we'll talk about how GitOps can be used by engineers and SREs to automate Chaos in their applications.

A

This talk is on using GitOps to increase system resiliency with LitmusChaos. This is Saranya. I work at Harness as a senior software engineer, and I am one of the core team members of LitmusChaos, which is now a CNCF incubating project. I've been contributing to it for around two years, and it has been a great journey so far. Now Amit will introduce himself.

B

Hey everyone, I'm Amit Kumar Das. I'm a senior software engineer at Harness and a core contributor to LitmusChaos. I've been contributing to the project for the past two years, and I'm very excited to be a part of KCD Chennai. That's pretty much about me, and I'm looking forward to it. Thank you.

A

Thank you, Amit. For today's agenda, first we'll talk about chaos engineering and why it is required. Then we'll introduce LitmusChaos and talk about its core components and features. Then Amit will take us through GitOps and give us a small demo of how you can use GitOps to increase system resiliency using Litmus. So, without further ado, let's get started.

A

First of all, we need to know what resilience is. Resilience is basically the system's ability to sustain a fault and bring itself back up. For example, let's say a pod gets evicted from a node. What is its state? Is it healthy or not? Does it bring itself back up? If it does, then it is resilient, and the period from going down to coming back up is a measure of its resilience.

A

The case is similar with a node failure or a memory leak. Talking about downtime: downtime is expensive, not just in terms of money but in other respects as well, such as customer confidence.

A

There is a loss of customer confidence, damage to brand integrity, and loss of productivity and employee morale as well. Considering all these aspects, we definitely want to avoid downtime at any cost, and one way to do this is by adopting the practice of chaos engineering. Chaos engineering is the process of testing a distributed computing system by injecting faults intentionally.

A

The goal here is to identify the weaknesses in the application through controlled experiments, to check whether it can withstand unexpected situations or not. How is it done? First of all, you identify the steady-state conditions, which are the desired behavior of the application in a given scenario when it is healthy. Then you introduce a fault into the application. Then you check whether the steady-state conditions are still met.

A

If yes, then the application is resilient. If not, you can go and fix it, so that if a similar case happens in production, you are already covered. Next, let's talk about the foundations of cloud native chaos engineering and how it can be practiced effectively. The cloud native definition itself includes some mandatory principles: it requires declarative configs, scalability, flexibility, and cross-cloud support.
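
The experiment loop described above (identify the steady state, inject a fault, check the steady state again) can be sketched in a few lines. This is only an illustration under stated assumptions: `run_experiment`, `ExperimentResult`, and the toy replica counter are hypothetical names and are not part of LitmusChaos.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_after: bool

    @property
    def resilient(self) -> bool:
        # Resilient only if the system was healthy before the fault
        # and is healthy again after it.
        return self.steady_before and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    wait_for_recovery: Callable[[], None],
) -> ExperimentResult:
    """Hypothetical loop: verify steady state, inject, wait, verify again."""
    if not steady_state():
        # Do not inject chaos into a system that is already unhealthy.
        return ExperimentResult(False, False)
    inject_fault()
    wait_for_recovery()  # stand-in for the platform healing itself
    return ExperimentResult(True, steady_state())


# Toy system: a replica count that the "platform" restores after a fault.
state = {"replicas": 2}
result = run_experiment(
    steady_state=lambda: state["replicas"] >= 1,
    inject_fault=lambda: state.update(replicas=0),
    wait_for_recovery=lambda: state.update(replicas=2),
)
print(result.resilient)
```

In a real setup the three callables would wrap health checks and fault injection against the cluster; here they only mutate a dictionary so the flow can be followed end to end.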

A

These are some of the major principles we have been following for the past few years. Cloud native communities and technologies revolve around open source, and a chaos engineering framework being open source gives it the opportunity to keep getting better and to add more and more features. The chaos experiments need to be simple to use, highly flexible, and highly tunable, with little or no chance of false positives or false negatives.

A

Then, as more and more people get involved in chaos engineering, changes happen very frequently and the requirements keep being altered. So it becomes very important for the chaos engineering framework to enable proper management of the chaos experiments, and to do so in the Kubernetes way. As you start practicing chaos and fixing the little issues that come up, more and more chaos scenarios come into the picture, and gradually the set becomes very large and comprehensive.

A

So these chaos scenarios need to be automated and triggered whenever changes are made either in the application or in the chaos experiments, and tools around GitOps are one way to achieve that. This is the section we'll be talking about in detail and giving you all a demo of. Lastly, there is open observability, which is also one of the principles: introducing chaos engineering should not require any new observability system.

A

The existing ones should fit in perfectly. So that's that. With that, I would like to introduce LitmusChaos, which is an open source, cloud native chaos engineering framework. It also has cross-cloud support, it is currently a CNCF incubating project, and it has adoption across several organizations. Now, talking about the features that ChaosCenter provides, let's start with chaos workflows.

A

A chaos workflow is a collection of several experiments, which can be combined sequentially, in parallel, or in any manner. These workflows can be created from custom templates that you upload, or you can create your own custom workflows from ChaosHub, which is a kind of repository where all the chaos experiments are present. You can choose from there, and there are also some pre-created YAMLs.

A

You can choose from them as well. You can schedule your workflows either as recurring ones, the cron workflows, or as a single one-time workflow. Lastly, you can attach a priority to each of the experiments in a particular workflow, according to your own requirements. Then there is workflow management with GitOps, which is the section we'll talk about a bit later.

A

Litmus also allows you to use your own image from your own image server, a custom image registry that can be public or private. Once the chaos injection is done, you can measure and analyze the resilience score of each workflow, and see how your application performed in that particular chaos workflow. So that's that. Litmus also supports multi-tenancy, which means you can create your own team.
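
The talk does not spell out how the resilience score is computed. A minimal sketch, assuming it is a weighted average of each experiment's success percentage using the per-experiment weights mentioned above, could look like this. The function name `resilience_score` and the sample numbers are hypothetical.

```python
def resilience_score(results):
    """Weighted average of per-experiment success percentages.

    `results` is a list of (weight, success_percentage) pairs.
    The formula itself is an assumption made for illustration.
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        raise ValueError("at least one experiment needs a non-zero weight")
    return sum(weight * pct for weight, pct in results) / total_weight


# Three experiments: the highest-weighted one passed, one failed, one passed.
runs = [(10, 100.0), (5, 0.0), (5, 100.0)]
print(resilience_score(runs))  # 75.0
```

Under this assumption, giving a critical experiment a higher weight means its failure pulls the score down more than a failure in a low-weight experiment.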

A

You can invite other users to your team with viewer or editor permissions. It has fine-grained role-based access control, which gives the necessary privileges to the users. Scope support is also there: you can install it in namespace scope or cluster-wide scope. For authentication, you can choose local authentication or OAuth. So that's that. Coming to monitoring and observability, you can connect your own data source and monitor the workflows.

A

There are also graphs where you can visualize the workflow run statistics or the schedule statistics. Once the workflows are running or have completed execution, you can compare how two or more workflows performed. In case you do not like the interleaved dashboard that is provided, you can upload your own dashboards from those available in the community, edit them, and tune them according to your own requirements.

A

And lastly, you can monitor the chaos in real time with the interleaved events and metrics from the Prometheus data source.

A

With LitmusChaos, you can target not only Kubernetes applications but also non-Kubernetes targets, such as infrastructure resources, bare metal, or virtual machines.

A

Lastly, GitOps for chaos. It integrates with any Git-based source control manager to provide a single source of truth, provided that you have enabled GitOps. Once GitOps is enabled, it kind of switches off MongoDB as the data store, and Git then acts as the single source of truth.

A

This sync is also bi-directional. All the workflows are stored in the Git source, so if any change happens in either ChaosCenter or the Git source, both of them are automatically brought in sync. It also provides an event tracker service as a microservice, which launches the subscribed workflows automatically.

A

If there is any change in the application, such as an upgrade, it automatically launches the chaos workflow. So that's that. Now Amit will talk about GitOps in more detail and give a demo. Over to you, Amit. Thank you.
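
The event-tracker behavior described above, launching a subscribed workflow when the application changes, can be illustrated with a small decision function. This is a sketch only: `images` and `should_trigger_chaos` are hypothetical helpers, and treating a container image change as the "upgrade" signal is an assumption, not a description of how the Litmus event tracker is implemented.

```python
def images(deployment):
    """Return the container images of a Deployment-like dict, keyed by name."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return {c["name"]: c["image"] for c in containers}


def should_trigger_chaos(old, new, subscribed):
    """Trigger only for subscribed workflows, and only when an image changed."""
    return bool(subscribed) and images(old) != images(new)


def deployment(image):
    # Smallest Deployment-shaped dict needed for the comparison.
    return {"spec": {"template": {"spec": {
        "containers": [{"name": "app", "image": image}]}}}}


before = deployment("example/app:1.0")   # hypothetical image names
after = deployment("example/app:1.1")
print(should_trigger_chaos(before, after, subscribed=True))
print(should_trigger_chaos(before, before, subscribed=True))
```

The point of the sketch is the shape of the decision: a change in the watched resource plus an explicit subscription, rather than running chaos on every event.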

B

Thanks, Saranya. Before moving on to the demo, I'll talk about GitOps and why we need chaos engineering with GitOps. GitOps is basically an operational framework that uses Git as a single source of truth: any change in the code or in the Git repository needs to be fully synced with the cloud infrastructure of the organization.

B

It follows the principle of infrastructure as code, where the infrastructure is managed and provisioned through code rather than manual processes. Now, moving on to the main question: why do we need chaos engineering with GitOps? Chaos engineering with GitOps enables a vast scope of automation with CI/CD pipelines. Currently, chaos engineering is performed in a closed environment or in a pre-production stage.

B

But what if we enable chaos engineering in the CI/CD stage? This would make developers aware of the faults before the change goes to the pre-production stage. Some advantages of GitOps are as follows. The first is an increase in productivity: developers are more focused on development rather than on the CI/CD of the infrastructure, and it reduces the mean time to deployment. The second point is higher reliability.

B

GitOps is considered one of the best practices because it reduces the mean time to recovery: if we have any fault, we can simply roll back to a previous stable version.

B

The third point is better security. Git is a very secure platform because it is backed by strong cryptography, and the ability to sign your changes provides ownership of each change to the source code. It also improves auditing. Since GitOps uses Git, we can keep track of the audit logs and know about any change that goes into the Git repository.

B

So it improves auditing as well. Now, moving on to the demo: I have set up the Litmus ChaosCenter.

B

For this demo, I have installed ChaosCenter on GKE, and along with it I'll be using two cloud native applications: the Bank of Anthos application and the Online Boutique application. Bank of Anthos is a banking application where we can perform operations like sending a payment or making a deposit. Similarly, Online Boutique is an e-commerce application.

B

You can see a lot of products listed here. We have a catalog, a function to change the pricing according to different currencies, and a cart option. We'll be performing some chaos engineering on these two microservice applications, and for this I'll be using ChaosCenter. Enabling the GitOps functionality of ChaosCenter is very simple to do.

B

In the settings, we have a tab named GitOps. Simply select the Git repository option, and I will provide a Git repository. Moving here, this is an empty repository that I have created for this demo. To connect this Git repository, I'll provide the repository link.

B

I'll also provide the branch where I will be pushing all my changes, which is the main branch. We can use one of two authentication methods: access token or SSH. I have my access token with me, so I'll use that.

B

I'll delete my access token later. I'll just click Connect, and it will take a few seconds. So we have successfully enabled GitOps for our project. To verify this, we can go to the Git repository again, and if I refresh it, I should see a litmus directory being created, and the directory structure shows the project ID. This 205ed is actually my project ID; we can also verify it from here.

B

The project ID is given here. So we have successfully configured GitOps within our application, and now we'll start doing some chaos engineering. Let's get started with the Bank of Anthos application.

B

I have deployed this application along with all its services in the namespace called bank. Here we can see a lot of services, like balance reader, contacts, load generator, and transaction history. What I'll try to do is delete this pod, the transaction history pod, which shows all the transaction history within this application. So let's get started. I'll schedule a workflow. I'll click on the Self-Agent, and here we have four options to create a chaos workflow.

B

We can run a predefined workflow, clone an existing chaos workflow, use ChaosHub, which is a marketplace of all the chaos experiments, or import a workflow manifest YAML. For now, I'll just use ChaosHub. I'll click Next and provide a name here, delete transaction pod. I'll click Next, and now I will add the pod-delete experiment. Here it is.

B

To target the pod, I have to select the namespace, which is this bank namespace, and we have the transaction history label here, so I'll select that. For the time being, I'll not add any probes. I'll just continue to tune the experiment. Here I can provide different environment variables to my experiment. For this experiment, I will set the total chaos duration to 60 seconds and the chaos interval to 30 seconds.
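
The tuning step above can be pictured as a fragment of a ChaosEngine-style manifest built as a Python dictionary. This is a hedged sketch, not the manifest from the demo: the helper `pod_delete_engine` is hypothetical, the label value `app=transactionhistory` is an assumption about how the pod is labeled, and the fragment leaves out other fields a complete manifest would need.

```python
import json


def pod_delete_engine(namespace, app_label, duration_s, interval_s):
    """Build a partial ChaosEngine-style dict for a pod-delete experiment."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"namespace": namespace},
        "spec": {
            # Which application to target (namespace plus label selector).
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "experiments": [{
                "name": "pod-delete",
                "spec": {"components": {"env": [
                    # Tunables are passed as string-valued env entries.
                    {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                    {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
                ]}},
            }],
        },
    }


engine = pod_delete_engine("bank", "app=transactionhistory", 60, 30)
env = engine["spec"]["experiments"][0]["spec"]["components"]["env"]
print(json.dumps(env, indent=2))
```

The two values set in the UI, 60 and 30, end up as the `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` entries, which is also what gets edited later through Git.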

B

I'll finish up all my changes, and I'll turn off this revert schedule option, since I want to see the logs and other details of my workflow.

B

So I'll click Next, and here I can select the weights of the experiments. I'll select the Schedule Now option and verify all my changes. It's the delete transaction pod workflow, and I'll check that the labels are correct: the namespace is bank and the label is transaction history. I'll just finish here. We can see that the workflow has started, and if I click here, I get an Argo graph that shows the live changes taking place in the workflow.

B

Interestingly, if I go to my Git repository and refresh, I'll see that this workflow manifest is also here. So any change that happens in this workflow will also be reflected in my Git repository. Let's wait a few seconds or a few minutes for the workflow to complete.

B

Meanwhile, we can observe the chaos that will be happening in this Bank of Anthos application. We can see that the pod-delete experiment pods have just started up, and if we go to the litmus namespace, I can confirm that the pod-delete runner has just started and the transaction history pod is terminating.

B

If I refresh this page, I should see that this service is under chaos and we don't have any data related to the transaction history. Once the transaction history pod is back in its running state, we should see the details here. Let me refresh this page again: it's still under chaos. Once the workflow is finished and this service is in the running state, we should get the details.

B

So let's wait for a while.

B

We can see that the workflow has completed and the pod-delete experiment has run successfully. We'll go back to the Bank of Anthos application and refresh, and we can see that the transaction history is now available. To cross-verify this, we can also see that the transaction history pod is now back and running. So we have induced chaos on this service, the transaction history service of the Bank of Anthos application.

B

Currently in this manifest, we can see that the chaos duration was 60 seconds and the chaos interval was 30 seconds. What if I need to change these environment variables? Instead of creating a completely new workflow, I can go to my Git repository and simply update these values in my workflow manifest.

B

Let me go here and change the environment variables. I'll change the chaos duration to 100 and the chaos interval to 50 seconds, and now I'll commit these changes. So we have made the required changes in our Git repository, and it will take a few minutes for them to sync with ChaosCenter. Let's wait for a few minutes.
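
The edit made in Git above amounts to replacing two values in the experiment's list of environment entries. A minimal sketch of that change, assuming the tunables have already been parsed into a list of name/value entries, is shown below; `update_env` is a hypothetical helper and does not reflect how ChaosCenter processes the manifest.

```python
def update_env(env, updates):
    """Return a new env list with the given tunables replaced.

    Raises KeyError if an update names a tunable that is not present,
    so a typo does not silently leave the old value in place.
    """
    remaining = dict(updates)
    result = []
    for entry in env:
        name = entry["name"]
        if name in remaining:
            result.append({"name": name, "value": str(remaining.pop(name))})
        else:
            result.append(dict(entry))
    if remaining:
        raise KeyError(f"unknown tunables: {sorted(remaining)}")
    return result


old_env = [
    {"name": "TOTAL_CHAOS_DURATION", "value": "60"},
    {"name": "CHAOS_INTERVAL", "value": "30"},
]
new_env = update_env(old_env, {"TOTAL_CHAOS_DURATION": 100, "CHAOS_INTERVAL": 50})
print(new_env)
```

Because the function returns a new list, the old and new values can be compared side by side, much like reviewing the commit diff in the repository.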

B

If I refresh the page and load the manifest again, I can see that the duration was previously 60 seconds, but since I've changed the environment variables, the updated values can be seen here. These are the values I provided in my Git repository: the total chaos duration is 100 and the chaos interval is 50. These changes are now available in my ChaosCenter.

B

To run this workflow, I just have to do a quick rerun, and the same workflow will start with the updated values. We can cross-verify it from our manifest: the chaos duration value is 100 and the chaos interval value is 50. So all the changes from my Git repository, as well as from my ChaosCenter, are synced together.

B

So this is basically it. Apart from that, you may want to add changes by some other method, such as a pull request. We can do that too, and for that, let me go ahead and create a new branch.

B

Let me create a new branch named test branch.

B

In this test branch I'll add a file, a new chaos workflow. I have created one.

B

I have this workflow, the delete catalog workflow, which targets the Online Boutique shop. Here we can see that the namespace is shop and the app label is product catalog service. Instead of configuring it from ChaosCenter itself, I'll just add a new file here. I'll upload a file by dragging and dropping it here, and I'll provide a workflow name for it.

B

This is the workflow name. One thing we need to keep in mind is that the workflow name must be the same as the file name.
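
The naming rule above lends itself to a quick check before committing. The sketch below is an assumption-laden illustration: `name_matches_file` is a hypothetical helper, the path and workflow name are made up, and it assumes the workflow name lives under `metadata.name` in the parsed manifest.

```python
from pathlib import PurePosixPath


def name_matches_file(path, manifest):
    """True when the file name (without extension) equals metadata.name."""
    return PurePosixPath(path).stem == manifest.get("metadata", {}).get("name")


# Hypothetical repository path and manifest, already parsed into a dict.
manifest = {"kind": "Workflow", "metadata": {"name": "delete-catalog-example"}}
print(name_matches_file("litmus/example-project/delete-catalog-example.yaml", manifest))
print(name_matches_file("litmus/example-project/other-name.yaml", manifest))
```

A check like this could run in a pre-commit hook or a pull request job, so a mismatched name is caught before the sync picks the file up.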

B

Now I'll commit the changes, and from this branch I'll make a pull request to the main branch, where all the changes are being synced. Let me compare, and let me cross-check that the manifest is uploaded. Yes, the delete catalog service manifest has been uploaded.

B

I'll raise a PR to the main branch ("Add files via upload") and create the pull request. The pull request has been successfully created. Once I merge these changes into my main branch, we will see a schedule getting created here, as well as the workflow getting started, since it's a one-time workflow. So let me merge this pull request.

B

The pull request is successfully merged, and in the main branch of my Git repository I can see that this catalog service workflow from the PR has been added. Let's wait a few minutes for the changes from Git to sync with ChaosCenter.

B

Now we can see that, since the PR got merged and the changes are in the main branch, it has triggered the GitOps sync. We can see that the schedule named delete catalog service from PR, which is the same as the file name here, has been created, and similarly the workflow run has also started. Let me click here and see all the related information.

B

We can see that it's currently installing the chaos experiments, and in a few minutes we'll see the catalog service going down. Let me show all the services here. This is the shop namespace, where I have all the services running, like the cart service, the currency service, the frontend, the email service, the payment service, and the catalog service. With the current experiment we'll be terminating the catalog service pod, and we can see that its status is Terminating.

B

If I refresh this, I see that something has failed, with some details for debugging below, and the service is down. Even if I refresh again, I think it should be down, but I guess it's back in its original state, since the chaos injection time was pretty low in this case. So we can see that we injected chaos from the Git repository, and it was visible in the application as well.

B

So these were a few operations that can be performed from ChaosCenter. The major scope for GitOps with LitmusChaos is to add this GitOps functionality to your CI/CD pipelines, or to use it in your GitHub Actions, to run chaos within your CI/CD stage.

B

So I think that's it for the demo, and that's pretty much it from my side as well. Thank you.