Cloud Native Computing Foundation Kubernetes Community Days Bengaluru 2021, 10 Jul 2021

From YouTube: Chaos Engineering for cloud native systems

Description

Kubernetes Community Days Bengaluru'21

Tired of recurring production outages that nobody benefits from? You aren't alone! Introduced as a tool to test the resiliency of its infrastructure in 2011 by Netflix, Chaos Engineering is one of the top 5 technologies to watch out for in 2021 per CNCF. This talk covers all the important aspects of Chaos Engineering from a Cloud Native perspective & will focus on LitmusChaos, an open source framework helping orchestrate Chaos on Kubernetes. Towards better cementing of concepts, we shall also have a live demo of the tool in action.

Slides: https://drive.google.com/file/d/1gbFu9kGC-I8L8nLxF45DySur1mYRXC15/view?usp=sharing

A

Hello, everyone, and I hope you are having a fabulous time at Kubernetes Community Days Bengaluru so far. Saranya and I are super stoked to be speaking with you all about chaos engineering for cloud native systems. But before we go ahead, it's only fair that we introduce ourselves to you. So Saranya, why don't you go first and introduce yourself?

B

Yeah, thank you, Divya. Hi everyone, this is Saranya. I am a software engineer at ChaosNative and also one of the contributors to LitmusChaos. Thank you.

A

Awesome. So hello, everyone, I am Divya, and as part of my day job I am a team lead with HSBC. Probably from the slide that you're seeing on your screen right now, you've figured out that I'm an avid open source enthusiast.

A

I've contributed for the past one and a half years to the Kubernetes and LitmusChaos projects in different capacities, and I'm also one of the co-organizers of the CNCF Student User Group, which is owing to my passion project of getting more folks into open source and having them benefit from the community and the learning that we have here. So today, as I mentioned before, we are going to be speaking about chaos engineering for cloud native systems.

A

Now, before we get there, we obviously need to understand how the situation was before chaos engineering came into the picture, which is where most of us are currently. Then we need to look at why there was a requirement for chaos engineering to come into the picture, what exactly chaos engineering is, how it can be extended to the cloud native context, and some benefits.

A

Thereafter, for the second segment of the presentation, I shall hand it over to Saranya to speak about the various chaos tools and platforms. Since we both work on the LitmusChaos project, she'll be doing a quick deep dive into LitmusChaos as a product, with a really cool demo. And towards the end, we'd be completely remiss if we did not lay out what we think is the future roadmap for chaos engineering as a discipline.

A

So, that being said, without further ado, let's dive right in. Before chaos engineering: that is where a lot of us are even right now, even though chaos engineering is not a very new discipline, so to say. Before chaos engineering was implemented in a lot of organizations, and where many of us are currently, there are customer- and service-impacting outages. That is something that's there irrespective of whether chaos engineering is there or not, to be really honest.

A

But when those customer- and service-impacting outages occurred, the mean time to resolve, or the mean time to recover, from such an outage used to be high, because more often than not during such an outage you required SMEs, and getting them on the call or coordinating with them towards recovery was somewhat of a struggle.

A

That was not because the processes weren't laid down or you did not have recovery documents in place; it simply wasn't, or isn't, a streamlined process altogether. Like the Shaolin monks say, the more you bleed in practice, the less you bleed when you actually have a problem, and I'm paraphrasing here. But when an actual outage occurred, we weren't really prepared for it if chaos engineering was not something we had considered implementing in our piece of infrastructure.

A

And of course, after this outage occurred, there would be what we term blameless postmortems, but they really are far from being blameless, because it's always about finding the root cause, yet we tend to target the symptom instead of the actual cause.

A

So these would eventually end up being the burden of your SREs or your infrastructure engineers, as opposed to diving deep inside your particular code base or the way your application is designed.

A

And, of course, all of this information is gathered from your monitoring and observability infrastructure, and in the absence of a discipline that streamlines this process, that was really rudimentary. It's only in the recent past that monitoring, observability, distributed tracing and logging have started gaining importance towards better recovery processes and towards ensuring that the applications we deliver are highly resilient and highly available.

A

So we can clearly see why there was a need for a discipline: although we knew what was required, we did not have all of it streamlined in a proper way. So, chaos engineering: let's look at what it is. According to principlesofchaos.org, chaos engineering is basically just experimenting on your systems towards building confidence among your customers that those systems can withstand turbulent conditions in production as and when they occur.

A

How do we do that, and why is this really any different from testing? How we do that is by running experiments, which is something Saranya is going to demonstrate in the latter half of our presentation. And how is it any different from testing? That is a very good question, because it's something that I myself struggled with when I started off learning about chaos engineering. In testing, you are basically aware of what you are giving as an input and what you should get as an output.

A

So if your system is subjected to particular conditions, it should give you either an A, B or C result, and if it doesn't, it is failing the test. You get the point: in testing, the system's input is known, and the output is known as well.

A

Experiments, on the other hand, don't really have a fixed output, because when you're experimenting you are just subjecting the system to unstable conditions. You do not know what the outcome of that experiment is going to be. It could be anything, and that is what we are looking to learn more about. Chaos engineering is jumping into the unknown and taking a leap of faith towards understanding your systems better.
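
To make the test-versus-experiment distinction concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption and not something shown in the talk: `FakeService` is a hypothetical stand-in for a replicated service with naive round-robin routing and no retries, and the 99% success threshold is an arbitrary steady-state hypothesis. The test checks a known input against a known output; the experiment picks a replica at random, disables it, and only then observes whether the hypothesis still holds.

```python
import random


def checkout_total(prices):
    """Deterministic function: a test knows both the input and the expected output."""
    return sum(prices)


def run_test():
    # Testing: known input, known expected output.
    return checkout_total([10, 20, 30]) == 60


class FakeService:
    """Hypothetical stand-in for a replicated service (illustration only)."""

    def __init__(self, replicas):
        self.healthy = [True] * replicas

    def kill_replica(self, index):
        # Fault injection: mark one replica as unavailable.
        self.healthy[index] = False

    def success_rate(self, requests=100):
        # Naive round-robin routing with no retries: a request succeeds
        # only if the replica it lands on is healthy.
        ok = sum(1 for i in range(requests) if self.healthy[i % len(self.healthy)])
        return ok / requests


def run_experiment(service, rng, threshold=0.99):
    """State a steady-state hypothesis, inject a fault, observe what happens."""
    baseline = service.success_rate()
    victim = rng.randrange(len(service.healthy))  # which replica fails is not fixed up front
    service.kill_replica(victim)
    observed = service.success_rate()
    return {
        "baseline": baseline,
        "victim": victim,
        "observed": observed,
        "hypothesis_held": observed >= threshold,
    }


if __name__ == "__main__":
    print("test passed:", run_test())
    print("experiment:", run_experiment(FakeService(replicas=4), random.Random()))
```

In this toy setup the experiment is useful precisely because it can surprise you: the result tells you whether the routing layer copes with a lost replica, which a plain input/output test of `checkout_total` never exercises.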

A

Now, when we speak about chaos engineering, it's a common myth, one that a lot of us really believe, that chaos engineering is only for cloud native systems. There's often a question thrown at me when I speak about chaos engineering as to whether chaos engineering is only cloud native.

A

So the answer is no, chaos engineering is definitely not only cloud native because, as aforementioned, we have experiments done on various systems, and systems really do not judge amongst themselves whether they're cloud native or not. It's we as humans who give them that context. So chaos engineering can be performed on cloud native and non-cloud native systems alike.

A

The only plus one here in terms of cloud native is that the availability of tooling is much greater as compared to the others, because as a sector, IT is moving towards the cloud native side of things. So it's just generally evident that the chaos engineering tools being developed cater more towards that move, rather than being stuck to the way we used to do things.

A

So you absolutely can have chaos engineering performed on your non-cloud native infrastructure, which is in your own data center. The only thing is, you will have a lot of challenges because of the lack of a lot of associated monitoring and observability tooling. And obviously, if you are trying to do it by yourself, it's going to be a little more difficult, because with the available tooling you are leveraging a lot of associated monitoring and observability tooling. With your own customized scripts or your own customized tool, you probably might not be able to leverage that.

A

If you create something from scratch, it's a little more difficult to actually have anything related to chaos done. So it's not impossible, but yes, doing it by yourself is going to be a problem. I'm sure Saranya is going to cover some of the available tools in the market, both cloud native and non-cloud native, open source and proprietary, in the section that she is going to take.

A

That being said, I've been preaching about chaos engineering, what it is about, and how it can be extended to the cloud native context. But really, when you're talking about something, you need to understand what the benefits are and how it impacts you as someone from the application development team or someone who designs applications.

A

So in that case, this slide is a very good overview of how it benefits you. The very first impact, and I'm going to go from top to bottom, is that because of resilient systems and better processes, your systems aren't going to fail as frequently. They will fail, which is a given, but they are more resilient towards failure, and because you have experimented on them and have learned where their vulnerabilities lie, you are aware of what could possibly go wrong.

A

So you are in a better position to recover from such failures. When you are in such a good position to recover from failures, you have a better MTTR, which also means that customers have to spend less time being frustrated. And obviously, resilient systems directly correlate to having fewer outages, which is again pretty good for a customer, because nobody likes their apps being down.
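
As a rough illustration of the MTTR figure mentioned here, the sketch below averages the time between detection and resolution across a list of incidents. The function name and the sample timestamps are made up for illustration; real incident data would come from whatever incident management tooling is in use.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average of (resolved - detected) over (detected, resolved) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents: one took 30 minutes to recover, the other 90 minutes.
    sample = [
        (datetime(2021, 7, 1, 10, 0), datetime(2021, 7, 1, 10, 30)),
        (datetime(2021, 7, 5, 22, 0), datetime(2021, 7, 5, 23, 30)),
    ]
    print(mean_time_to_recover(sample))
```

Tracking this number before and after a team starts running chaos experiments is one simple way to check whether recovery is actually getting faster.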

A

Right. Then comes the business. When we look at the business side of things, we are looking at better processes, and when we look at better and more efficient processes, we are definitely looking at efficient incident management. And when we are saving on so many overheads in terms of outages, coordination and recovery times, we are definitely aiding in the prevention of losses.

A

Last but not least, technical staff are the most important in this ecosystem, because they are the ones who ensure that all of this is working as per requirement. So when your technical staff are happy, be rest assured that the above two layers are going to be happy as well.

A

When we look at chaos engineering being implemented in organizations, we are looking at better and more reliable systems, because during these experiments you will discover what vulnerabilities there are within your systems, and you're able to address them ahead of time. And when an actual outage occurs, you are going to be better prepared to understand how to recover from it, which for the technical staff directly translates to not having to be present 24/7, in case you're a subject matter expert or even the on-call for that particular application.

A

So that will make sure that your technical staff are able to cater to actual incidents and to effectively prevent them from happening the next time around.

A

So that being said, I think it's time for me to hand over to Saranya to talk about the various tools available in the ecosystem and maybe walk us through a quick demo of the LitmusChaos project. Over to you, Saranya. I'll just stop my screen share.

B

So, thank you. I'll be sharing my screen.

C

Can you just stop sharing the screen?

C

Okay, actually, it is showing as disabled. You have to enable screen sharing for non-host participants.

A

Just a second. Yeah, you should be able to share your screen now. Yeah.

C

Yeah, so hi everyone.

B

I'll just present it. So, as you can see, these are some of the tools that are being used for practicing chaos engineering.

B

Practically, it won't be possible for me to go through each one of them one by one. Being a contributor to LitmusChaos, I have used it and know it much better, so I'll be talking about it in the upcoming slides. Other than that, we have Chaos Monkey, which is one of the earliest chaos engineering tools, introduced by Netflix. You can go through them and pick according to your requirements. These can be classified into various categories: some of them are cloud native and some are not.

B

Some of them are open source, and some of them are specific to networking or other particular platforms. So you can find out your own use case and then get started with it.

B

Then, coming to Litmus. Litmus is an open source, cloud native chaos engineering toolkit, which provides all those features that Divya has already talked about while explaining the major principles of cloud native chaos engineering. As you can see, it is open source, and it supports community collaboration. We have kept the API and lifecycle management open to maintain transparency, so that users can know what the APIs are and how they are being managed.

B

It is GitOps enabled: Git being a primary tool for developers, integrating it with Litmus has proven to be very useful. Then we have open observability. As of now we are using Prometheus and Grafana, but since it is open, you can bring your own observability tool and add it to your solution. Other than that, Litmus is a CNCF sandbox project, and we have been getting a lot of traction lately.

B

So we have around 56 chaos experiments with more than 100k installations, and we believe that in the upcoming years Litmus is going to be the go-to platform for practicing chaos engineering. So let's get started with how we can start inducing chaos in our system. We can install Litmus using a Helm chart.

B

To induce chaos, we can either use custom templates or build our own custom workflows, using either the public ChaosHub, which contains all the publicly available experiments, or our own private ChaosHub, which we can connect and use within Litmus.

B

I'll just give a demo of how we can induce chaos, but let me tell you, there are actually two ways: we can either induce it using the terminal, or we can use Litmus Portal, which is a web UI. It provides better visualization of workflows and comes with a lot of other features such as teaming, analytics, etc. So I'll be giving a demo of Litmus Portal itself. Without much delay, let's get started.

B

So, as you can see, I am running it on my own minikube, but you can use any other platform such as kind, k3s or any public cloud. I have applied the Litmus 2.0 beta YAML, and as is visible, the pods are up and running. This is the frontend pod, this is the server pod, and this is the MongoDB.

B

And if I take the port number of the frontend service and the minikube IP, I am able to open Litmus Portal. So this is the login screen of Litmus Portal, and by default the username is admin and the password is litmus. Once I log in, a default project is created for me. This is the onboarding screen, where I'll be asked to change my password, so for now I'll keep it as one two three.

B

So once I have onboarded, this is the main home page. As you can see, here is the default project that has been created for us, and if you are part of any other project, you should be able to see it here. These are the user-level details, and you can edit your profile by going to my account under settings. Then, if you are an admin, you should be able to create new users.

B

So once the users are created, I should be able to invite people to my own project. Currently we support just the viewer and editor roles. This is the teaming section, where you can change the project and where we get the list of members, the invitation status, any active invitations and any active projects that you are a part of. Other than that, we have the GitOps section.

B

So this is the analytics section, where you can see the analytics of the workflows that you have created, and this is the agent section. When you install Litmus, a self agent gets created by default, and the chaos operator gets installed there, so we should be able to run any chaos experiments on this target agent. I can also connect multiple external agents by following this procedure, but for now I'll be using my own self agent.

B

Then comes the ChaosHub, as I have already mentioned. This contains all the experiments that are publicly available for use. Then comes the workflow page. A workflow is a single unit that contains several experiments, which can be sequenced in parallel, serially, or in a combination of both according to our own requirements. Here we can see all the details related to any workflow which has already run or which is currently running.
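
The idea of sequencing experiments serially, in parallel, or in a combination of both can be sketched in a few lines of Python. This is only a conceptual model and not how Litmus implements workflows: `run_workflow` and the stage layout are hypothetical, the experiment names are illustrative labels, and the `run_experiment` callable stands in for whatever actually injects the fault.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(stages, run_experiment):
    """Run stages one after another; experiments inside a stage run in parallel.

    stages: list of lists of experiment names (hypothetical layout).
    run_experiment: callable taking an experiment name and returning its result.
    """
    results = []
    for stage in stages:
        if not stage:
            # Nothing to run in an empty stage.
            results.append([])
            continue
        with ThreadPoolExecutor(max_workers=len(stage)) as pool:
            # pool.map returns results in the same order as the inputs.
            results.append(list(pool.map(run_experiment, stage)))
    return results


if __name__ == "__main__":
    # One serial step followed by two experiments run side by side.
    workflow = [["pod-delete"], ["pod-cpu-hog", "pod-memory-hog"]]
    print(run_workflow(workflow, lambda name: name + ": done"))
```

Modelling a workflow as a list of stages keeps the serial and parallel cases uniform: a purely serial workflow is a list of single-item stages, and a purely parallel one is a single stage.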

B

And then this is the schedule page where, if I have scheduled any workflow, I should be able to see all the details regarding it. So I'll go ahead and schedule a workflow. First of all, I need to choose the target agent, which is the self agent for now. While choosing a workflow, I'll be given four options. The first one is to choose any predefined chaos workflow template. Second, if you have an existing workflow, you can clone it and use it.

B

Then you have the option to choose experiments from MyHubs. For now, since I have just the ChaosHub, I can just choose it and go ahead. And the fourth option is, if you have any workflow YAML, you can just drag and drop it and get started. So for now I'll be choosing podtato-head, the predefined workflow. Then, if I hit next, this is the page where you can change the description according to your own requirements. Then, this is the actual visualization of the workflow, how the experiments will take place.

B

You can edit the sequence if you want by just dragging and dropping, but I'll keep it as it is for now. If I hit next, I can tune the experiments of my workflow according to my own requirements. Once this is done, I'll be asked to schedule the workflow. We have two options: either we can schedule it right now, or we can have a recurring schedule. I'll go with option one for now.

B

So this is the summary page of the workflow that we are going to schedule, and if you want to make any changes, you can just go back and make them. Once I hit the finish button, the workflow should get created. So here we are in the browser, and we can get all the details here. As you can see, the workflow is currently running. If I click on this particular workflow, this is the graph view of how the workflow is running.

B

As you can see, it is running right now, and the application installation is pending. If you click on each node, you can get all the details related to it. Here it is pending right now, and you can get the logs here. So this is the graph view, and we also have the table view, where you can get all the details related to the current workflow.

B

So this is the schedules page where, since I have scheduled one, I will be getting all the details regarding that particular schedule. So this was a very brief demo of how we can induce chaos in our own targets. So, moving ahead.

B

Yeah, so Litmus in a nutshell. We have the central Litmus Portal, and all we need to do is pick any predefined workflow or any custom workflow and run it against a target agent, and as I have already mentioned, we can have multiple agents. Once the experiments are run, the chaos metrics are exported to Prometheus and the chaos analytics are pushed back to Litmus Portal.

B

So we can monitor the chaos results using Prometheus with the help of chaos-interleaved analytics. Litmus can also be run on bare metal environments or on any public cloud such as AWS, Azure, GCP, etc.

B

Then, the future roadmap. It is pretty evident that chaos engineering is one of the fastest growing technologies in 2021. As the number of industries adopting chaos engineering increases, these chaos engineering tools need to be smart too.

B

So it is very likely that these tools will be integrated with machine learning or artificial intelligence tools, so that they can help automate the analysis activities and also help improve the monitoring of applications while the experiments are being run. And who knows, they might even be able to detect some threats or faults that have gone unnoticed before. So that's all from our side. We hope that you enjoyed this session. Thank you.

B