Cloud Native Computing Foundation Chaos Engineering Working Group, 24 Jul 2018

From YouTube: Chaos Engineering WG - 2018-07-24

A

Let's get started. Thanks again, everyone, for joining. Before we go over a couple of demos from the community and a little update on the landscape, is there anyone new on the call who would like to introduce themselves and say hi to the group before we move on?

B

I'm here. This is Tammy from Gremlin. This is the first time I've been able to join; I've been traveling a lot, so the times didn't work, but I'm in San Francisco right now and it's great to be here. I do chaos engineering at Gremlin, and I previously did chaos engineering at Dropbox, where I got a 10x reduction in incidents and no high-severity incidents for 12 months. Before that, I did it at the National Australia Bank for many years, really since 2009. So yeah, great to be here.

B

Thanks for having me. Awesome.

A

Great to bump into you in time. Anyone, anyone else?

A

Cool, all right, moving on. So on slide five: we've been discussing this landscape and fighting a little bit over how to categorize the different tools out there. I was getting a little bit frustrated, so I just decided to start as simple as possible.

A

Right, so I went out and did a pull request to the Cloud Native Landscape, starting with four projects for which I could find a high-quality SVG logo, a, you know, code base, and also information on Crunchbase and so on, the different requirements that we have in the CNCF landscape. So I issued a PR. If there are other projects that you want to add, please let me know; I'm more than happy to add them.

A

I think we'll start with a flat structure first, and then later on we can go fight about how we want to subcategorize things, because we definitely had a bit of difficulty trying to figure out how to do that. So hopefully people are okay with this approach of starting simple and iterating, but I'd love to open it up to feedback from the group before moving on.

A

I guess silence means there's no disagreement. Have a look after the meeting. Okay, yeah, it's pretty straightforward: I just started with four and we'll go from there. Eventually, once those are in the interactive landscape, we have a design team that could take that and break it apart into categories, but first I just want to collect the information that's out there.

A

The example is a good place to start. Oh yeah, I was getting a little frustrated after working on things; I just needed to get something out there so we could get iterating. Cool. So I'll continue to do that and give you an update in a couple of weeks, but hopefully that should get merged in soon. In terms of community presentations,

A

we have two things today: one from Michael to talk about fire drill, and the other one from our... oh cool, thanks, Michael. All right, I think you're next, Uma, so feel free to share once Michael stops sharing his screen. Perfect.

C

All right, thanks Chris. Let me share the screen.

C

Can you see my screen?

A

I can, thank you.

C

I think Karthik and I are going to talk quickly about what Litmus is. We're still in the early stages. With Litmus, we've been toiling for some time on what to call it. You can call it a tool, but our vision really is to take all the best open source tools and use them together, so we actually call it a framework, and it's a framework for chaos engineering, right now on Kubernetes. So the tagline concentrates on chaos for stateful workloads on Kubernetes.

C

Stateful is kind of the tough area right now. All the problems come the moment you bring in stateful applications, and the underlying storage and networking play a major role in the stability of stateful workloads. At the moment, Litmus gives you a set of Ansible playbooks, and each Litmus test is a playbook that runs inside a container.

C

In today's demo, we're going to show what a Litmus test looks like on GitHub, and also introduce chaos into a MySQL app and see what Litmus does: how Litmus helps in introducing chaos and saying whether the application works as expected or not. Primarily, Litmus is meant to be used by developers and DevOps groups in their Kubernetes clusters and CI/CD pipelines, and sometimes before you put things into production. For example, a Kubernetes cluster is being upgraded: how do you make sure this Kubernetes cluster is going to work for your DevOps teams?

C

That's when, as a DevOps architect, you can put together several Litmus tests, run them in a pre-production pipeline, observe, and then roll it out.

C

This particular test has three configurable variables: how you display the logs, what kind of storage you want underneath, and the chaos type that you want to apply. Let me quickly show the anatomy of a test.

C

The project is hosted as a repo under the OpenEBS organization.

C

There are multiple tests here. We are in the process of moving many of our e2e tests onto the Litmus framework, but right now I think we've got a couple of sets of tests that have already moved onto Litmus.

C

To actually get started with Litmus, it's just a matter of cloning the project. You get all the tests, modify the test according to your need, and then run the test using the kubectl command. Every test has a file called run litmus test, which kick-starts the actual Litmus test. Let me show you. So, for example, take the MySQL application, the Percona application.
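As a rough illustration of the "clone, modify, run with kubectl" workflow described above, the sketch below only builds the command lines instead of executing them. The manifest path and the namespace value are hypothetical placeholders, not names taken from the Litmus repository; `kubectl create -f` and `kubectl get pods --watch` are standard kubectl invocations.

```python
from typing import List


def create_job_command(manifest_path: str, namespace: str) -> List[str]:
    """Build the kubectl command that submits a test job manifest."""
    return ["kubectl", "create", "-f", manifest_path, "--namespace", namespace]


def watch_pods_command(namespace: str) -> List[str]:
    """Build the kubectl command that watches pods in one namespace."""
    return ["kubectl", "get", "pods", "--namespace", namespace, "--watch"]


if __name__ == "__main__":
    # Hypothetical path and namespace, used only for illustration.
    print(" ".join(create_job_command("path/to/test-job.yaml", "litmus")))
    print(" ".join(watch_pods_command("litmus")))
```

Passing either list to `subprocess.run` would execute it against whichever cluster kubectl is currently pointed at.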

C

We've got two tests here: one is storage benchmark and the other is data persistence. The test file has the actual steps to be done, the run litmus YAML has the configuration for how to control your test, and MySQL is really the application part.

C

If you see here, right now the underlying storage is OpenEBS storage. You could perhaps use Rook or Portworx, which follow the Kubernetes way of attaching persistent volumes to the pod. For this particular test we currently support three types of chaos. One is through Pumba. I think we heard from Alexei last session that Pumba is a chaos tool that can introduce two types of chaos: one is network latency, and the other is the docker stop/start type of thing.

C

So in this test we're going to kill an application pod using docker stop through the Pumba APIs. The test can also be used to introduce different types of chaos. You can use eviction, where Kubernetes is set up by Litmus so that the pod gets evicted, and then see what happens to the underlying application. Similarly, there is node drain, which really means that the Litmus job will go and kill one of the nodes.
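One way to picture the three configurable variables and the three chaos types mentioned above is as a small validated settings object. This is only a sketch: the class name, field names, and string values are hypothetical and are not the variable names Litmus itself uses.

```python
from dataclasses import dataclass

# Hypothetical labels for the three chaos types mentioned in the talk.
SUPPORTED_CHAOS_TYPES = {"pumba", "eviction", "node-drain"}


@dataclass(frozen=True)
class ChaosTestSettings:
    log_output: str        # where test logs are published, e.g. "stdout"
    storage_provider: str  # storage under the application, e.g. "openebs"
    chaos_type: str        # which kind of chaos to inject

    def __post_init__(self) -> None:
        # Reject chaos types the test does not know how to inject.
        if self.chaos_type not in SUPPORTED_CHAOS_TYPES:
            raise ValueError(f"unsupported chaos type: {self.chaos_type!r}")


if __name__ == "__main__":
    print(ChaosTestSettings("stdout", "openebs", "pumba"))
```

Validating the chaos type up front means a typo fails before anything is deployed to the cluster.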

C

Those are the kind of configuration parameters that we provide. Before we do a quick demo, I want to take you through the setup that we have and the demo flow. We have a set of nodes; this is a Kubernetes cluster on Google Cloud (GKE). We have a couple of nodes where GPD is configured as the data source, and we expect to run an application that uses this data.

C

So what we do is run a Litmus job that does the following. It launches a Litmus pod, which does the real test. It launches the MySQL pod and makes sure that the underlying data connectivity is done through OpenEBS, or whatever is configured as part of the test. Then it launches the chaos framework, Pumba, which introduces chaos. The MySQL pod is watched to see whether it is running fine or not. The moment you introduce chaos, it gets killed and Kubernetes relaunches it.

C

The same process can happen again and again as you keep introducing chaos; you can configure how many attempts you want. After the end of the chaos introduction, you go and really verify the data. The test asks: okay, the chaos has been introduced, so is the pod now rescheduled somewhere? If it's not scheduled at all, that's a failure.

C

Even if it's scheduled, is it really connected to the data, and is it seeing the right tables underneath? That's really what it is. Then it cleans up one by one, all the Litmus pods, and gives you back the cluster in the same state. So the idea here is that, with Litmus, DevOps teams can take a given test end to end into their pipelines in an easy manner.
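The flow described above (inject chaos, wait for the pod to be rescheduled, repeat for a configured number of attempts, verify the data, then clean up) can be sketched as a small function. This is an assumption-laden illustration, not the Litmus implementation: every function and parameter name is hypothetical, and the cluster interactions are passed in as callables so the sketch runs without a cluster.

```python
from typing import Callable


def run_chaos_test(
    inject_chaos: Callable[[], None],
    pod_rescheduled: Callable[[], bool],
    data_is_intact: Callable[[], bool],
    cleanup: Callable[[], None],
    attempts: int = 3,
) -> str:
    """Return "Pass" or "Fail" for one chaos test run."""
    try:
        for _ in range(attempts):
            inject_chaos()
            # A pod that never comes back is a failure in itself.
            if not pod_rescheduled():
                return "Fail"
        # Being rescheduled is not enough: the data must still be readable.
        return "Pass" if data_is_intact() else "Fail"
    finally:
        # Always hand the cluster back in its original state.
        cleanup()


if __name__ == "__main__":
    print(run_chaos_test(lambda: None, lambda: True, lambda: True, lambda: None))
```

Putting the cleanup in a `finally` block mirrors the point made in the talk that the cluster is returned in the same state whether the test passes or fails.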

C

Let me quickly go through this test.

C

So I'm on one of the Kubernetes nodes on which Litmus will run, and eventually the logs are going to be put onto the nodes. Currently I've selected to publish the logs onto a local node in my Litmus test, so we are going to see the Litmus test results coming out here as part of the demo.

C

I've got three windows here. In one I'll be watching the pods in the litmus namespace, in another I'll be observing the logs coming out of the Litmus pod, and in this window I'm going to kick-start the test. Let me just show the test again: I've taken OpenEBS as the storage, the logging setting just puts all the logs onto stdout, and for the chaos type I'm taking Pumba here.

C

I am running this test and watching the litmus namespace. It has already started, and this is the container that it's creating; I'm observing the logs in this window. As you can see, it has already deployed the application, which is coming up, and you can see the OpenEBS volume controller and three replicas are already deployed. Once the MySQL Percona application comes up, you will see Pumba also getting launched and chaos being introduced, and then you can keep watching how Percona behaves in the meantime.

C

Let me go and see whether it started creating. Yes, it did create.

C

So you will see a test result JSON file here when the test is completed.

C

As you can see, the application is running right now. We configured the entire test to finish in about two to three minutes, and you can see that some test data is being written. Pumba is also being launched; that's the chaos tool being used for this particular test. The moment Pumba comes up, it starts introducing the chaos, which is nothing but killing the application pod, and then we expect that pod to come back up.

C

So this pod has gone into an error state and we're waiting for Kubernetes to reschedule it. It's rescheduled, and again it's in that state, getting killed and coming back. Just to keep the demo short, we set a shorter duration for the chaos than you would want in a real test.

C

Sometimes Kubernetes puts the pod back onto the same node. So the best practice in this case is to introduce the same chaos, say, ten times, observe the pod going across multiple nodes, and finally see whether the data is persisted or not when it comes back.

C

So Pumba is going down, which means the chaos introduction is done. Here, whether the data is persisted or not is checked and put into a result file. This is the most primitive way of recording the result. As you can see, the test has passed.
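Recording the outcome in a result file, as described above, could look something like the sketch below. The JSON field names, the test name, and the file name are hypothetical; the talk does not show the actual layout of the Litmus result file.

```python
import json
from pathlib import Path
from typing import Dict


def write_result(path: str, test_name: str, passed: bool) -> Dict[str, str]:
    """Write a minimal pass/fail record as JSON and return it."""
    result = {"test": test_name, "status": "Pass" if passed else "Fail"}
    Path(path).write_text(json.dumps(result, indent=2))
    return result


if __name__ == "__main__":
    # Hypothetical test name and output file, used only for illustration.
    print(write_result("result.json", "mysql-data-persistence", True))
```

A flat record like this is easy for a pipeline step to read back and use to decide whether a rollout should proceed.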

C

That's the kind of chaos introduction and verification of whether the data is still there or not. What we did is one typical way to do this; we could also have put this into a pipeline and repeated it as many times as you want, and the Ansible jobs can be configured automatically to alter the configuration parameters of the playbooks. So these tests are intended to be used in a friendly way by DevOps teams. That's the quick demo. Hopefully it made sense to you. Any questions?

D

Yeah, I've got one. Do you feel like, since you've developed Litmus, implemented it, and started using it, it has you thinking about OpenEBS resiliency in general? I mean, has it changed your ways of approaching that in OpenEBS?

C

Yes, in fact, Litmus was born out of OpenEBS.

C

We started writing these e2e tests, and then we started introducing chaos as part of the development of OpenEBS. Then we thought, okay, we are using multiple types of tools, and we should put them all together into a project. Even if we have a stable product out there, our end users need to rerun the same tests when they put their applications into production. That's when we thought we would open source it and make it more of a framework that puts everything together.

C

Yes, so we are following all of this inside OpenEBS.

C

Yeah, so it's a community project. Again, as I was saying, in the next few weeks to months more tests will get moved into the Litmus framework, and we would like to see various application developers or users bringing their expertise into these tests and then using these tests for their own needs or requirements.

C

All right, any other questions? Thank you.

A

Thank you, Uma and Karthik. So, not too many things before we close out the meeting. If you go to slide 14, I'm mainly calling out the white paper. There's been some discussion and iteration there, so I encourage the group to continue to do that. And of course the landscape: I have the initial PR out, so if you have another project you want to add there, please do.

A

Once we build that up, we could have more discussions about breaking those apart into subcategories. To wrap things up, a gentle reminder on slide 16: the first chaos conference is happening in San Francisco, hosted by our friends at Gremlin, so we'll be there, and it would be good to have some folks show up there. We're also going to be doing a chaos engineering track at KubeCon + CloudNativeCon in Seattle in December.

A

The chaos engineering working group will be entitled to essentially two talks at KubeCon, so I'm going to try to figure out how best to divvy that up with the group. Essentially I'm looking out for introductory content and maybe an overview of the different tools out there, with demos. We don't have to figure that out right now, but it's something to keep in mind. Other than that, any other questions? I'm always seeking volunteers for community demos.

A

So if someone wants to volunteer next time, please let me know. Our next meeting will be the second Tuesday of August at 8 a.m. Pacific, which should be August 14th.

A

Any questions, thoughts, or volunteers for next time?

D

If nobody else does, I'll try to come up with something. [inaudible] has got some cool stuff coming up, but I don't know if.

A

Pressure makes diamonds, it'll be good. Deadlines are good, exactly. And then, I don't know if there is someone from Lyft too. I'd be curious to see if Lyft would be willing to talk about some of the stuff they do, especially since Envoy has some baked-in ability to do chaos testing.

E

That's me, Zach here. There are a couple of things that I could demo. I'm not sure what I need to do; I need to talk to you before doing these things. But okay, there's a red line test that we run across all of our services, where we adjust the load-balancing weights through the Envoy discovery service.

E

We also do fault injection randomly. Yeah, I'll look into that.

A

No worries. I know Peter well, so I'll just tell him to give you permission and it'll be all good. Okay, cool. Yeah, that would be good, because not a lot of people know about those Envoy features, so it would be good to disseminate that a little bit more.

A

Sound good? All right, any other questions? Otherwise we'll wrap it up, and I'll tentatively slot Sylvain and Zach, depending on whether they have content, for next time on August 14. Okay, cool, thanks everyone, and I'll see you next time. Enjoy the rest of your Tuesday.