From YouTube: Chaos Engineering WG Meeting - 2018-05-22
A: Agenda today: we'll do some introductions, then we'll have some community presentations, about 15 minutes each, from the Gremlin and Chaos Toolkit communities. We'll talk a little bit about the status of where we're at with the landscape — mostly me asking for help from everyone, trying to come up with some reasonable categories for how to categorize the different technologies in chaos engineering. We'll also talk a little bit about the white paper: based on community feedback I uploaded it to GitHub, so we can bang away at it via pull requests instead of Google Docs, which seems to be favorable to a lot of folks, and then we'll kind of end things out. But first off, it would be great if we could get some introductions from folks, especially if you're new on the call. So, any new faces from last time — feel welcome to speak up and say hello. Cool.
A: So there are simple ways you could categorize: hosted solutions versus, you know, maybe frameworks that run client-side, versus maybe chaos engineering tools that are focused on security. Having an idea of how to categorize this thing is definitely something I've been trying to tackle, so I'm asking folks. We could probably have a five-minute discussion on the call today, but there's a GitHub issue open, and if you have ideas of how to categorize things, they would be great to hear, because that's essentially what's going to drive the landscape that will be produced by CNCF for this work. So I don't know if anyone wants to take a stab at this. I know Sylvain, you had some thoughts that we chatted over in email, but I'd kind of love to hear from the group.
A: Well, I mean, it was a lot of bike shedding forever to come up with the categories. First we started with a list of serverless projects and things, right, and then from there we tried to break it up into categories that kind of made sense. So that's the approach we took — it took a long time. And I was trying to use that framework to apply to chaos engineering, because, you know, obviously there seems to be an upsurge of hosted offerings, with Gremlin etcetera, and then there's a bunch of tools that have different focuses — whether they're trying to do chaos engineering in a, what's it called, security context, or some other context. So, you know, I'm just asking how to categorize things — what are people's thoughts on this?
C: What gets applied versus what it's being applied to, I suppose — broadly, trying to see whether it's, you know, like what I might do, which is run on a snapshot of the environment, almost like a staging environment; whether you actually have, say, the Gremlin agent deployed and you're running it on production with a blast radius; or whether you're actually just doing something loose like chaos monkey, almost.
A: Yeah, yeah, I mean, you know, I'd also ask people to have some empathetic thoughts from an end-user perspective. If you're trying to evaluate the tools out there — like, "hey, I want this to work against AWS or Kubernetes" — making it easy for them to find that is super useful, so I'd like us to also make sure that's somehow possible from just an end-user perspective. There could be multiple axes of filtering. You know, I'm not asking for a complete solution, but for folks to put their thoughts on that GitHub issue and for us to keep iterating on this. Eventually I want to get to a state where we have a rough agreement, and then I could work with our design team to start sketching out how this is going to look, and we could continue to iterate on that.
E: With the other contributors, what we came out with — the main idea was we split everything into use cases. So: who is it for? Not "what does it do," but "who is it for." From there most people find the common terminology, can relate to keywords much more easily, and find a way to contribute to the list as well. So, yeah, it will involve the community around it.
A: Yes, so this would definitely be done in an open-source fashion, where anyone could contribute their thing to the landscape. If you go to l.cncf.io, you can kind of see what we've done for the wider cloud native landscape. It works pretty well, because community members tend to police themselves, which is beautiful — it's a beautiful thing to watch, making sure that things are categorized properly. But it's on us to come up with the categorization scheme. So, yeah, that's basically all the time I want to spend on this one, just because we have demos and I'm really excited to see them. All I ask from the group is to throw some ideas on the GitHub issue that I've linked to, and we'll kind of go from there and try to do that work stream in as async a fashion as possible.
A: All right, so moving on. Next up we have two community presentations, to kind of see how people are doing chaos engineering in the wild. First off we'll have Eugene from Gremlin talk a little bit about what he's up to, and then we'll have Sylvain talk about Chaos Toolkit. Okay, let me stop sharing my screen so Eugene can get going. Oh, thanks.
I: Thanks, everyone. My name is Eugene; I've been at Gremlin since July last year. Just one — whoa, it's the slide. Okay, seeing a little bit of artifacts around — you got it, good now, very cool. So the company itself was founded in 2016. Kolton left at that point in time and, you know, recruited Forni to be CTO, to start working on this product to make chaos engineering available to basically the rest of the world — since we've all seen it work at places like Google with their DiRT exercises, at Dropbox similarly with Tammy and her DiRT exercises, with Kolton at Netflix doing chaos engineering, and then Kolton and Forni, you know, building a chaos engineering tool internally at Amazon retail. So we know the value at the corporate level, and we're just trying to make it available to everybody else. Gremlin itself installs in a myriad of ways. One of them is a Linux package: all you really do is pull down our repo and then install the package. Or you can install us as a Docker container — pull us from gremlin/gremlin on Docker Hub — or as a Kubernetes DaemonSet; I've seen a lot of our customers do that as well. We have a CLI interface, so you could SSH into any host to run Gremlin experiments.
I: We have an API as well for automation, as well as a web app, just to make it really easy for people to dive into all the types of failure modes that we can introduce into your ecosystem, and otherwise to properly scope your attacks. One of the things that I find very valuable when helping our customers scope out chaos engineering experiments is that they go, you know, "where do I start?" Well, if you don't have this nice rich UI that gives them all the parameters, it's really hard to get started — because, you know, I could just blow up a whole auto scaling group in AWS and see what happens, right? Well, start small, and we can help you scope that out properly. We also have a built-in scheduler. You know, I think one of the things that we all hear a lot about is automation, and so putting things into the scheduler really helps you maintain that floor of resilience in any of your applications. Alternatively, just to stop things from going wild — say your client loses its connection to our control plane — we have TTLs, which on our client will kick in and roll everything back to steady state automatically, such that you don't have any uncontrollable chaos within your ecosystem. So that's this slide; let me go ahead and get into the demo really quickly.
I: Works for me. All right, so when you log into our service, this is the UI that you're going to get. If you have clients already hooked up to our control plane, you can already begin to create your first attack, or otherwise put things in the scheduler, and finally manage your clients, or the users connected to us. I already talked about the ways of installation — I usually talk about that with our prospects and customers — so I'll just skip that and go right into the types of attacks that we can run right now. The client itself is focused on infrastructure-level attacks: things that happen on your host, things that happen on your operating system, things that happen on the network. For all of those, we have a good suite of attacks that you can run with Gremlin. So, resource attacks, first off, are things that happen on your host, right?
I: We can consume host cores — out of the amount of cores that you have on your host or your instance, you can specify the exact amount that you want consumed. We can fill a specified amount of disk space — what happens if your log rotation doesn't happen? We internally got bit by that; you should check out our blog post at gremlin.com. We can also introduce disk read/write activity; for those of you that have a lot of disk-intensive tasks, this might be a good gremlin to run if you have a lot of heavy I/O operations. And finally, we can also consume gigs of memory. Notice that every single one of these attacks can also be saved as a template, for when you have some attack that you find you're going to recall fairly frequently. Some of these attacks might be a little bit more specific and highly targeted, so as a result the configuration can take a little bit of time — save it as a template.
I: That way you can bring it back in the future, or otherwise throw it into the scheduler so that you can just automate that particular attack. State gremlins alter the state of your operating system. So, for example, we have a process killer here, where we can string-match the process that you send to us and we'll just kill it perpetually. What happens if your web server, like httpd, were to go away, or your Java or Tomcat app were to go away? Does a health check pick that up and then otherwise terminate the host or start recovering from it? A good thing to test with this. We also have a shutdown gremlin. So what happens in a public cloud, right, like AWS — a failed health check on your instance? Good, it's going to get terminated. Or melt down your AWS account with a rolling reboot across your fleet.
I: You don't know when it's going to hit, but it's going to hit — use the shutdown gremlin for that. One of the key values that we have here is that if you use this in conjunction with our scheduler down here, and you say, oh, run this during business hours — right, 9:00 to 5:00 we're running the shutdown/reboot gremlin five times a day — well, you basically then have that chaos monkey experience right out of the box, for yourself, right? The final state gremlin that we have is time travel, where we'll break NTP and introduce clock skew. Many times I hear from our customers that when you introduce this in, say, your Cassandra cluster, terrible things happen. Maybe you want to see what happens in your own world, for your data layer. Otherwise, things like daylight savings time are also something to consider, or if a certificate on your host were to expire.
I: That's a good thing to simulate as well, or a leap year. Now, the final set of gremlins that we have are the network ones, and these tend to be the most valuable and most powerful ones, because in a distributed system — in AWS or any kind of cloud provider — the network tends to be the most fragile point. As you're breaking your applications up from monolithic to microservices, your network is now basically part of your application stack, right? You expect it to be high-performing, but really, some things happen because, well, cloud happens. Network devices have built-in entropy to them, so you definitely want to test for that. Similarly, what happens when things become degraded or otherwise unavailable? The blackhole gremlin here drops all packets going from one place to another, so you can simulate something like full service unavailability. Notice that all these network gremlins have the most arguments that you can pass to them, and I really just want to highlight the concept right here that we can actually simulate full service outages. Now, some of you might remember this great outage.
I: That happened last year, called the S3 outage — and we can simulate that for you out of the box, just by adding S3 as a service provider right here. The next gremlin that we have is the DNS gremlin, and we can break DNS for you. You want to test what happens if your primary DNS server were to go away — do your hosts actually fall back to your secondary? Definitely something worthwhile to check. Otherwise, you can simulate bigger DNS outages, like the Dyn DNS one that happened a few years ago, or maybe Route 53 being unavailable as well. These last two gremlins, latency and packet loss, are the ones that I would call gray states of failure. You know, your systems are running, but due to things like a noisy neighbor, or having to traverse through a lot of Internet traffic, things become slow or otherwise degraded, so the system is not operating at its most efficient point.
I: The problems usually manifest themselves in the form of latency — things become a little bit slower than what you're expecting them to be. You've seen the menus here, so I'll just go ahead and talk a little bit about them. You want to dial in how long you want to run the attack for; sometimes your observability or your monitoring tools take a little bit longer for the metrics to show themselves, so you can definitely specify how long you want the attack to run. You can specify things at the IP address, IP address range, or CIDR block level; at the device level, such as your eth0 or your eth1; by hostname; or by endpoint — say you want to just inject some latency going to google.com, you can just type it in. Or if your service has third-party dependencies, something for messaging like Twilio, you can do that as well. If you want to whitelist particular traffic, it's just a caret right here — so maybe you want to whitelist your monitoring, just so you have some observability into this chaos experiment. We also support ports and port ranges, and finally, you can specify which protocol you want when defining the attack.
I: Once you're done defining the attack, you now want to specify the targets that you want to attack — and by this we're talking about the blast radius, if you will. To help you with this, we pull down your AWS instance metadata here, where you can click on any of these bubbles to filter by, say, region or availability zone. And finally, we also support services that you can pass on to us, so that if your hosts are serving up a particular application, you can specify that. One of the things that I like to highlight here is our concept of random. While most of the tools that we see say, you know, do it at random, do it in prod, we kind of say you want to still be a little bit targeted. So our concept of random here is that we take all of the clients that you have installed with us, and then through filtering — say I only care about the things serving our API, for example — you'll filter down to all the clients that are serving that particular service, and then you can specify, well, I only care about maybe two hosts here, so go ahead and use those as my targets. Or, otherwise, we can also support a percentage of your environment being impacted, right? So maybe I'll say 50 percent of it is going to get impacted.
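The "random, but still targeted" selection Eugene describes — filter the registered clients down to one service, then impact only a slice of them — can be sketched in a few lines of Python. The client records, tag names, and hostnames here are invented for illustration; they are not Gremlin's actual data model.

```python
import math
import random

def pick_targets(clients, service, percent):
    """Filter registered clients down to one service, then impact only
    a percentage of them (rounded up), rather than the whole fleet."""
    pool = [c for c in clients if service in c["services"]]
    count = max(1, math.ceil(len(pool) * percent / 100))
    return random.sample(pool, count)

# Hypothetical client inventory, as a control plane might see it.
clients = [
    {"host": f"host-{i}", "services": ["api"] if i < 8 else ["batch"]}
    for i in range(10)
]

random.seed(42)  # deterministic for the example
targets = pick_targets(clients, "api", 50)
print(len(targets))  # 4 of the 8 "api" hosts
```

The point of the two-stage filter is that the randomness only ever applies within an explicitly chosen blast radius, never across the whole inventory.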
I: Now, if you want to do container attacks, like in your Kubernetes environment or something along those lines, you can send us your labels — maybe the ones you have put on your pods — and we'll just go ahead and attack the matching pods within those hosts. We won't use that right now, so let me just go ahead and kick off this attack. Once you've finished specifying your attack and kicked it off, it'll take us over to our attacks page, where you see all the current attacks and also all historically run attacks. We eat our own dog food, so you're going to see a lot of attacks in the Gremlin account.
I: For example, once you click in, you get to see the full attack definition. All client logging comes back up to us, so that you don't have to remote into a host to see what's going on. At any given point in time, if you feel like you've done enough damage — or "I need to roll this back because I made a mistake, fat-fingered my attack," for example — we have a halt button right here. Our client will pick that up within seconds and go back to steady state within seconds. Everything I've shown you, again, has full feature parity with our API, right? We don't circumvent ourselves via the web app or anything of that sort. So you can go ahead and orchestrate your own tooling, or otherwise put it into your CI/CD pipelines, such that alongside your smoke tests and your regression tests you'd spin up a canary cluster, install Gremlin, and run some resilience tests on it as well, to get some confidence in the resilience of your systems. So that's the demo for Gremlin; I hope you all enjoyed it.
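Driving attacks from a CI/CD pipeline, as Eugene suggests, comes down to POSTing an attack definition to the control plane's API. The sketch below is illustrative only: the endpoint path, auth header, and payload shape are assumptions for the example, not Gremlin's documented API.

```python
import json
from urllib import request

API = "https://api.example-chaos.test"  # placeholder, not a real endpoint
TEAM_KEY = "REPLACE_ME"                 # would come from your CI secret store

def build_cpu_attack(percent_of_hosts, service, length_secs=60):
    """Assemble a small, scoped attack request: one CPU impact, limited
    to a percentage of the hosts serving one service (shape assumed)."""
    return {
        "command": {"type": "cpu", "args": ["-l", str(length_secs)]},
        "target": {
            "type": "Random",
            "percent": percent_of_hosts,
            "services": [service],
        },
    }

def run_attack(payload):
    """POST the attack to the (hypothetical) control plane."""
    req = request.Request(
        f"{API}/attacks",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Key {TEAM_KEY}",
                 "Content-Type": "application/json"},
    )
    return request.urlopen(req)  # returns the HTTP response

payload = build_cpu_attack(percent_of_hosts=10, service="api")
print(json.dumps(payload, indent=2))
```

In a pipeline you would call `run_attack` after the canary cluster is up, wait, then assert on your monitoring before promoting the build.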
L: Definitely something we've thought about. You know, we're sort of just getting past the POC stage right here, so for the things we want to do, it's a "the squeaky wheel gets the grease" sort of thing. So containerization has been a real big part of, you know, our product roadmap. Okay.
B: Can you go through a little bit of the telemetry that you get after you've run an experiment? I saw there were some logs on your dashboards — do you get any sort of visualizations or anything like that?
N: Cool, my question was actually centered around monitoring as well. So how do you interface — or do you interface at all — with things that are tracking SLOs or SLAs, to, like, put them in a quiet mode? Or would you even want that? And, kind of a side note, I'm new to this, so maybe that is something that you want to see — whether your SLAs are affected by such a thing.

N: That's fine — like, I heard your statement on core competency, and I think that's great. But being able to tie into something like Prometheus, or maybe making it so it's not going to go page an entire engineering team while you're doing tests like this — at least in a controlled setting. But, you know.
L: You're totally right. I guess what we usually tend to advocate is overcommunication in that regard. So if somebody does get a page, they know that they're getting paged because we're running testing. And yeah, you can turn off your paging, right — you can turn off PagerDuty, or silence these sorts of things. Often, though, we actually want to do this to see that a page actually goes off when something happens, right? So if you peg a bunch of cores, you expect pages to go off.

A: Fair point, yeah. No, that makes a lot of sense. Thank you.

I: Right, I do agree. For me, at this point, many times when I run game days with our customers, it's not so much finding faults — more so making sure that they have their observability, their monitoring, their paging dialed in and tuned. A lot of times they go, "this happened, and I never got a page for it." What you found here, you want to fix real quickly to get that dialed in, because otherwise it's going to bite you — this is already in production.
L: Go ahead, Forni. — I'm not sure I understand the question entirely. I mean, if you want to trigger your ASG cycling pretty quickly, you can just use the shutdown gremlin on a loop, I suppose. I'm not sure that there's a direct integration in terms of CloudWatch Events. I'm not sure what your hypothesis is here that you're trying to accept or disprove, I suppose.

L: People have asked — I don't know if you know engineers, but they ask for everything; they ask for everything under the Sun, yeah. It's been asked. You know, we're slowly rolling out more integrations and, sure, prioritizing what our customers ask for. They definitely ask for that; they definitely ask for other sorts of things as well.
A: Cool. I just want to make sure that we're sensitive to the time, but thank you, Eugene and Matthew, for the presentation — that was super cool. All right, so we've got about 20 minutes left, so that's about 15 minutes for Sylvain to present, with some five minutes for questions. So, Sylvain, are you there? — Yes, I am. — Yeah, all right.
F: The idea was roughly that we saw that tools like Gremlin, Pumba, you know, chaos monkey obviously were out there, but as we were trying to figure out how to apply the experimental pattern that we had read about in the Chaos Engineering book, we felt that those tools, while actually delivering the goods, weren't helping us forge the experiment, if you will. So we decided to create the Chaos Toolkit to do that, basically.

F: So the Chaos Toolkit by itself does nothing, unlike Gremlin and the others. What it does is it helps you declare your experiment, and then you decide what tool or what API you want to drive to actually inject the chaos. So the Chaos Toolkit wouldn't actually provide anything itself — it's not like Gremlin — but it could actually drive their API, if you wish to use Gremlin for that matter. So, basically, it's just an open API for your experiments.
F: It's a CLI-driven tool; we felt like we wanted something we could automate. That's why we didn't care for a UI at first, initially. And simplicity — I'm talking about the code itself — we wanted something that other people could actually contribute to, and we tried hard to make things as simple as we could. So, basically, it's just a bunch of functions glued together in Python — well, it's a bit more than that, but that's the rough idea. If you want to contribute, you do need to know Python, to a very basic level. What it does is orchestrate existing tools, or the APIs of existing tools. That means that if you have a binary that you want to drive from the Chaos Toolkit, you can call it; but equally, if you want to call an API, you can also just call it, by passing all the parameters the API requires, and it will simply call that for you. We've already actually implemented a set of drivers.
F: We call them drivers, but they're just extensions to the Chaos Toolkit, really. We don't claim to support all the APIs of those providers — that would be foolish and, you know, just a lie — but we try to target the APIs that people don't necessarily trigger very much, like, yes, stop a service, or remove a Kubernetes service, or things like that. Basically, anything that you probably don't call, except if you're a developer doing that on a daily basis — but now with the idea of chaos engineering in mind: you're stopping something to see it restart after that. Well, those ideas are very powerful in production, or pre-prod, or wherever you want to run them, to actually impact your system, if you will. So, for example, if you want to remove a service, or if you want to terminate something, you just call it — and basically, that's it.
F: We have that for all sorts of providers. Azure Service Fabric is a bit different, because they already actually have chaos services native to the platform, so you don't actually call anything yourself — you just call start chaos or stop chaos. Much like Istio, I think, which has fault injection. And then we realized, you know, causing trouble is fine, but we really need some probes as well — like you guys said earlier about monitoring. Basically, you can query Prometheus, or Humio if you use that as a central logging platform, and all of that is contained in the file that I'll show you in a minute. We've got a bunch of plugins for creating reports and sending Slack notifications. And finally, the future of the Chaos Toolkit: we're going to go more native in Kubernetes with cron jobs, so that you can schedule things and just let them run.
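A Prometheus probe of the kind Sylvain mentions boils down to calling Prometheus's HTTP instant-query endpoint (`GET /api/v1/query`) and checking the returned samples against a tolerance. A minimal sketch — the threshold and the canned metric are invented for the example:

```python
import json
from urllib import parse, request

def query_prometheus(base_url, promql):
    """Call Prometheus's instant-query endpoint and return the result list."""
    url = f"{base_url}/api/v1/query?" + parse.urlencode({"query": promql})
    with request.urlopen(url) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

def within_tolerance(result, threshold):
    """Steady-state check: every returned sample value stays below threshold."""
    return all(float(sample["value"][1]) < threshold for sample in result)

# Canned response in Prometheus's wire format, so the check can be
# exercised without a live server.
canned = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"job": "app"}, "value": [1527000000.0, "0.02"]},
]}}
print(within_tolerance(canned["data"]["result"], threshold=0.05))  # True
```

In an experiment file, the same idea appears declaratively: the probe names the query, and the tolerance is the value (or range) the result must satisfy.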
F: We're looking at operators as well, to actually control the experiments a bit more when we run natively in Kubernetes, but we're just starting to think about that. The drivers and those runtimes we run are in Python, but, you know, we love everything — so if you want to run your extensions in Go, or anything else, we're going to try to make it easy to actually call them from the Chaos Toolkit, as best as we can.
F: Yes, that's right. So that's the website. Like I said, it's CLI-driven, so nothing very fancy to show here. The idea is just to walk through the important bits. Like I said, we tried to create an open API, so the open API is, you know, just a definition of the various elements of a Chaos Toolkit experiment. That's what it looks like, briefly: you've got a set of metadata. What's interesting is you've got the steady-state hypothesis here — what is "normal" in your system? What we do is use a bunch of probes — you can have as many as you want — to query for some things in your system, and see if any of them fails. The tolerance here is just a boolean; it could be something else.
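The control flow Sylvain describes — verify the steady-state hypothesis, abort if it doesn't hold, otherwise run the method and verify again — can be sketched generically. The probe and action callables below are stand-ins, not the Chaos Toolkit's actual internals:

```python
def verify(hypothesis):
    """A hypothesis holds when every probe's value meets its tolerance."""
    return all(probe() == tolerance for probe, tolerance in hypothesis)

def run_experiment(hypothesis, method):
    # Bail before injecting anything if the system isn't normal:
    # a deviation afterwards would be impossible to interpret.
    if not verify(hypothesis):
        return "aborted: system was not in its normal state"
    for action in method:
        action()
    # Re-verify after the turbulence: a failure here is a finding.
    if not verify(hypothesis):
        return "deviated: potential weakness found"
    return "completed: steady state held"

# Toy system: a health flag the action flips, and a boolean tolerance.
state = {"healthy": True}
hypothesis = [(lambda: state["healthy"], True)]
method = [lambda: state.update(healthy=False)]  # the "chaos" action

print(run_experiment(hypothesis, method))  # deviated: potential weakness found
```

The two verification passes are the whole point: the first one guards the experiment's validity, the second one is the measurement.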
F: If any probe fails its tolerance, we bail the experiment — because, you know, if your system is not normal, at least in what you expect to be normal, there's no point actually in going through with injecting the chaos, because you won't be able to read, analyze, and make sense of what you see. So we bail early. Then we run that steady-state hypothesis again once we've caused trouble, to see if we deviated. If it doesn't pass again, that means one of two things: either you have some questions to ask, or you found a potential weakness. At that stage, what you want to do is basically go into the report, see what happened, and make sense of it as a team. Now, you've got the method: once you've run the steady-state hypothesis, you run the method, and it's just a bunch of actions or probes.
F: Usually you've got one or two actions, because you want to make sense of what's happening, so you don't change too many things. I guess it's similar to what Gremlin is doing — you try not to create attacks that conflict with each other, otherwise it's probably much harder to actually make sense of what's happening. And then you've got rollbacks. We tend to call them remediations these days, because to us "rollback" is a strong promise that you can come back to your normal state, which is not always the case if you've really broken your system; but the idea is that sometimes you want to come back to, you know, the steady state. Now, with Kubernetes, usually the rollbacks are empty, because Kubernetes is meant to actually support and deal with failures automatically, so you don't actually do anything there, right?
F: So that's it — it's a JSON file that is declarative. What happens is you define your probes and actions, and they all have the same format. So let's pick that one: it's a provider, it's in Python, it takes that module and that function. In this case it doesn't actually have any parameters or arguments; in this other one you can actually specify the name or the label, you know, things like that. They are just functions, basically, in a Python module somewhere, but you declare them, and you can share them on GitHub. We wanted to have something in a file so that you can really use it inside your CI/CD pipeline, as usual — it's just another step. So, although I'm trying not to make it sound like a test, it only, you know, overlaps with that tooling in some fashion. And you can reference existing probes, so that you don't have to actually duplicate things.
F: Sometimes we do have pauses. I personally dislike that, but I couldn't see any other fashion, except if I was doing synchronization with the system telling me when it was done — so sometimes it's a bit freaky, that thing. So, you know, contributions are very welcome to actually improve the API, definitely. All right, let's try to show an experiment here; we're going to use a very stupid demo.
F: We've got this application, which is this one here — it does nothing but serve data that is pulled from a Postgres database — and what we want to see is, under some medium load, what happens if the database master actually dies. Behind the scenes, what we're using is Patroni from Zalando, which has a leader and followers for Postgres and which should switch from one to the other if the master dies. And that's what we want to prove, because we expect that if the master dies, we don't actually impact our users.
C
F
Sweet,
so
what
we
have
it's
exactly
the
same.
What
I
showed
you
before
we
select?
You
know
the
application
that
were
interested
in.
That's
we
check
that
the
body's
alive,
the
application
must
written.
You
know
it
does
respond,
and
if
that
does
happen,
if
that
doesn't
happen,
you
know
the
the
experiment
bales
immediately.
Otherwise
it
goes
to
the
method.
F: And here what we do is terminate the DB master. We don't actually have a function doing exactly that; what we do is terminate the pod — the pod that has that label — and luckily enough, we only have one; it's a demo. The "rand" here means nothing, because again we have only one pod that matches that label, but if you had many, it would actually, you know, pick one, per the documentation. And then we've got a bunch of probes. Now, you might wonder why I am, you know, going and fetching logs during the experiment — it doesn't actually do anything, but it's interesting when you do the analysis, because you can come back and look at whether you had the logs that you were looking for in your application, or in the various parts of the system. So, for your analysis, sometimes it's nice to actually go and fetch them as you run the experiment. And that's basically it. So, let's pray that this works — you just run "chaos run" with the experiment that you're going to run.
F
I don't actually see it, but there is a notification on the top right, because we send that to Slack saying it started. It basically doesn't do much: it runs things in the order it reads them from the file, and that's why it looks like tests. In this case it should fail, if I'm correct, because the application will collapse. There you go, so here's the reason it fails.
F
What's interesting here is that we see the first check did succeed, so the system was normal, so we went on and killed the DB master. But it failed when we ran the check again, and the reason is that in that specific case my code was not good enough to actually reconnect, if you will, to the new database master. So when I called the application, it failed, because the connection at that stage was stale.
F
F
There you go, it's right there, and now we're going to run what I did. But I don't have the setup right, you know, properly. Oops, it's not set up yet. So in that case, what you saw is the steady state failing your experiment initially, because the system was not normal. I went too fast and the previous system was not yet ready.
F
It is now, there you go. But what you'd probably do is look at that with kubectl or whatever and say, well, okay, I've produced the fix, rerun it, all those things that you would do with any sort of test, basically. And hopefully the fix does work, and you've proven you had a weakness: you found it, you fixed it, and you try it again. That's the before and after that you would want to see from a chaos engineering experiment.
F
Now, there you go: it shows that the steady state now is met. So that means our application is now able to actually sustain that kind of loss. If the master goes away and the connection is lost, we are able to sustain that error. That's the kind of failure handling that you would want to see. That's a basic one, and if I have time (I don't know, perhaps not) there is another one I would want
F
to show you. I'll stop now, but I don't think there is time. It would be: let's say you're using GKE or something like that, and you realize one of your nodes, the virtual machine, actually has, I don't know, a security issue or something. What you want to do is roll out a new node pool, a new set of machines with a fix, but you want to see whether or not switching from one node to another is going to impact your users.
F
Well, that's an experiment you can run with the toolkit. You begin to bring up a new node pool with new machines, drain the old one through Kubernetes, see the load spread from one side of the node pool to the other, and see if your application is actually impacted. That's the kind of thing you can do with the toolkit, because all we do is drive existing APIs; we don't try to create new sorts of tooling, because they already exist. And I invite questions like that. Right, that was my demo. Cool.
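A node-pool swap like this could be expressed as method actions in a Chaos Toolkit experiment. This is a rough sketch under several assumptions: a GKE cluster, the chaostoolkit-kubernetes extension's `drain_nodes` action, and Chaos Toolkit's process provider for shelling out to gcloud; the pool names, cluster name, and node labels are all placeholders.

```json
[
  {
    "type": "action",
    "name": "create-patched-node-pool",
    "provider": {
      "type": "process",
      "path": "gcloud",
      "arguments": "container node-pools create patched-pool --cluster my-cluster --num-nodes 3"
    }
  },
  {
    "type": "action",
    "name": "drain-old-nodes",
    "provider": {
      "type": "python",
      "module": "chaosk8s.node.actions",
      "func": "drain_nodes",
      "arguments": {
        "label_selector": "cloud.google.com/gke-nodepool=default-pool"
      }
    }
  }
]
```

With the same steady-state hypothesis as before wrapped around these actions, the experiment answers exactly the question posed: does the load moving to the new pool impact users?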
L
F
Yeah, it's a good question. In the steady state we use probes only; in rollbacks you can use actions. Basically, all you do is try to revert something. For example, if I had the time to show you the node pool one: I actually did create a node pool, which is just a bunch of virtual machines really, and when that runs I kill the node pool in the rollbacks. But in the example I showed you, because I'm using Kubernetes, it takes care of the rollback.
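In experiment terms, the rollbacks section holds actions (never probes) that undo what the method did. A sketch of how the node-pool rollback just described might look, assuming Chaos Toolkit's process provider and a GKE cluster; the pool and cluster names are placeholders. In the pod-kill demo this section can stay empty, since Kubernetes restores the pod by itself.

```json
{
  "rollbacks": [
    {
      "type": "action",
      "name": "delete-temporary-node-pool",
      "provider": {
        "type": "process",
        "path": "gcloud",
        "arguments": "container node-pools delete patched-pool --cluster my-cluster --quiet"
      }
    }
  ]
}
```

Rollbacks run after the method, whether or not the experiment deviated, which is why they should be idempotent reverts rather than new experiments of their own.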
L
A
C
G
D
B
D
A
F
A
All right, well, that's good for now. We have a couple minutes left, so I just want to be sensitive to people's time. Thanks everyone for showing up. I want to continue to do all the white paper and landscape work as quickly as possible; hopefully people can give their input on that while we build it out. I think other than that, that's it. Anyone else have anything to say? Otherwise we can thank our presenters and meet again in a couple weeks.
A
So, I love forcing functions, so picking a date and hammering towards it generally works well. I think it's going to take a lot of consensus building, but I would love to get something out, probably in a one to two month time frame. We're also going to have to give the designers on the CNCF side about two weeks to kind of make things pretty, and that's.