From YouTube: Chaos Engineering WG Meeting - 2018-09-25
A
I hope everyone has had a couple of good weeks since we last chatted. Really brief agenda this week: I know many of you will hopefully be at the chaos conference coming up on Friday, so I hope to catch up with some folks face to face in San Francisco. Today's meeting will start with introductions for any new people on the call. Then we'll do a community presentation from Julian, who's going to do a little bit of a demo around Istio and chaos engineering, which I'm personally excited about, since Istio is kind of a growing project and there are a lot of, let's say, knobs and switches you can play with in it, so I'm kind of curious what Julian has to share. And then I'll just do a brief update on where we are with the landscape and the white paper.
A
Good, good, good to hear from you all. So let's go on. Oh my gosh, my computer is going crazy. So to start things off, we'll hear from Julian. He linked his slides. For people that are not familiar with Kubernetes, he'll give a brief intro, and then talk a little about service meshes and chaos engineering. So I'm happy to keep sharing my screen, or happy to give up the sharing if you want to steer things, Julian. So let me go.
D
Right, cool. So welcome, everybody. Can I zoom? All right, that is better. So we're going to talk about chaos engineering with a service mesh, and even though many people are more interested in either service meshes or chaos engineering on its own, I hope you take away a lot of knowledge for implementing chaos engineering in your organization. So, who am I? I'm a software engineer turned DevOps, and I used to work for Unity.
D
The game engine company. I am now in Stockholm, Sweden, working for Discovery, and you can contact me, or feel free to drop me a message, if you want to know more about these topics. So, as I said, in this presentation we're going to cover a little bit of the background of why a service mesh and how it came to be. We'll also follow with a few demos.
D
I will demo Envoy separately from Istio, because I've done this talk a few times now and people have kind of a hard time grasping the power of Envoy and how it fits with Istio. And of course, since this is basically intended to introduce people to chaos engineering, I will introduce the concept of chaos engineering and demo some fault injection. Feel free to stop me at any time if you have a question. I don't think I can see everybody, but that's all right.
D
So at the beginning there was an app, and the app was code, and that needed to scale. Most companies have this big monolith, and inside that monolith they have a few components that are pretty independent from each other. Instead of scaling vertically, meaning buying a bigger box, they try to break it down into what we call microservices, into separate services.
D
The problem is that, instead of just calling a function that is nearby in the code, they have to go through the network, and that brings a whole new set of problems. From there, the deployment is no longer one thing that's done atomically. Now we have tons of little microservices that each do one thing, and to solve that, Docker and containers came to be, along with the changes that brings to the code.
D
How do you schedule them? How do you make them talk to each other? How do you make sure that they are healthy? And the secrets and the configuration are also kind of hard. But this is where, for all those points, Kubernetes can help. Kubernetes is a scheduler. Basically, Kubernetes fixes deployments and the rolling out of new versions. It is very good for increasing a team's development speed, but it doesn't solve everything.
D
So, as I said previously, the network is still a problem. There are the eight fallacies of distributed computing that people might be familiar with, and you see the first one is "the network is reliable." That's the biggest lie people tell you: just send it and you will receive something. It's never true. You can have hardware failures, you can have misconfigurations, and packets get lost all the time. To cope with that, there is also this RFC that I highly recommend; it's three pages long.
D
It was written 22 years ago, it's still relevant, and it's actually quite funny; it's totally worth reading. Of course, once you have Kubernetes, you still have all those problems to figure out, and it becomes tremendously overwhelming for a team to really handle each of these building blocks separately.
D
If you want to do A/B testing and just route a small part of your traffic to a new release to test it, or handle failover, all those things are really hard to get right. I don't know if anybody here has implemented retries, for instance, if you had to retry a service; I did those myself, so I know it's quite hard to get right without breaking something. So the way developers solved that is they used more code.
D
There's a talk on abstraction, I think by Zach Tellman, that describes what a good abstraction in code is, if you are interested. But basically, all those problems come with deployment: if you do rolling updates so you don't take downtime, that means that in your code you have to handle two cases if you are, for instance, migrating from one database schema to another. And debugging becomes tremendously complicated.
D
If you have all those microservices talking to each other, you want to know basically what happened. And so there is this trend now where developers take that code and make it a separate part of the infrastructure. We see that trend very much with the service mesh trying to fix ("fix" is not the right word), to basically abstract away, the network. The genesis of the service mesh, and of Istio in particular, is...
D
...I think a talk from Google, from touring in Europe and in the U.S. and asking the main Google customers what their main problems are. For instance, banks have, I think, truly billions of dollars invested in hardware to encrypt all traffic. They don't want canary releases, because nobody wants their bank to try out how they handle money today and see how it works, but they very much want the encryption. Especially if you are in Europe, GDPR actually forces you to encrypt your traffic.
D
So that's one thing to think about. Also connecting traffic: even if Kubernetes provides some service discovery, managing canary releases in Kubernetes is quite hard. And the last piece is observability, because it is really hard for a developer to know how a service behaves once it's in production. Here's the link to the video if you are interested in going further. You don't need the whole thing; you can just pick one block and go with it.
D
It will make your evolution towards a complete service mesh easier. When talking about what a service mesh is, I like to explain what problem it solves, and the only problem it solves is communication between services. It's no longer just one function call away; you're not on the same box. You have to go through the OS, through the network, to the other OS, and then the application receives it.
D
So there are a lot of components that can go wrong, but the idea is that a service mesh can be summarized as a network for services. You don't want to describe all the IP tables and every route; especially with all those moving parts, it becomes quite hard, even with automation, to keep up. And how does a service mesh handle the entire service-to-service communication? Basically with the sidecar pattern. Here you can see that service B is in what we would call a Kubernetes pod.
D
A pod is the smallest unit of deployment in Kubernetes; a pod can contain one or more containers. The goal here is to inject a proxy inside that pod that listens to every network packet service B is sending or receiving. So if service A wants to talk to service B, it has to go through the proxy. The interesting part is that you also have the control plane; the proxy-to-proxy layer is called the data plane.
D
The control plane is where all the overall governance decisions get made. For instance, you have three parts in Istio specifically. There is Pilot, which is in charge of making sure the routes are spread consistently to the proxies. If you create a new service, you want to let the other proxies know about that new service; you just publish an update and Pilot will be in charge of updating their route tables.
D
Then you have Citadel, which is in charge of encryption and of rotating the proxies' certificates, to make sure that service A is authenticated with service B. Actually, it's not services A and B that communicate; it's the proxies that encrypt. So the encryption is abstracted away; it's taken out of the code and put inside the proxy, so you don't have to worry about "is my call encrypted?" You don't care. And for the data plane...
D
...the proxy is actually Envoy, made by Lyft. If anybody hasn't heard of Envoy, it's basically a single binary that takes ten megabytes in memory. It can handle two million requests per second, so before you have a scaling problem with that... well, if you do have a scaling problem, just let me know; I would be very curious to see what you're doing. So here I want to demo Envoy. I have httpbin, which is just what I'll use to demo Envoy. Is the font size okay? All right.
D
You see here that Envoy added some headers, so you have the request ID, which allows for tracing. All your requests are marked that way, and it's the responsibility of the application to take that ID and pass it to the next service it's going to call, so that you have all the tracing necessary.
D
So that's a very quick demo, but we can actually create errors and see. So there is a 500, and you can see that the telemetry is quite interesting. You see that I actually defined some retry logic, so you know exactly what happened in the service and how many times it retried; it returned the 500 after the third retry. So that was the demo of Envoy. Is everything okay? I cannot see whether it is easy for people to follow; that's good.
D
Where was I? Envoy, that's done. So, as you can see, usually you would have to configure this in the OS network stack. If you have to configure an overlay, that is quite low level; you have to deal with IP tables, and it becomes tremendously complicated. If you have 20 services that talk to five databases, that's hundreds of rules that you have to implement. But with a service mesh, which sits on top of TCP/IP...
D
...you can actually just name this service "web" and that database "database," and you have one rule that scales, instead of having to be specific about every address. It's a little bit like authentication: saying "I want to talk to that person" rather than "I have this phone number" is a completely different thing. So the control plane here is Pilot, which does service discovery. And I was wondering what was in the code...
D
So how do you have to code your application in order to gain from the service mesh? And it's so smart, because you just give it a name. It doesn't matter what you use; you just give a name to the service and you stick with it, and Envoy will make sure that the name is resolved to the service that you defined in a rule. Here, I cannot understand for the life of me why they use this port inside their URL, but I...
D
...guess it's just the demo app that the Istio team built, so they might have some reason; I have no clue. Maybe it doesn't matter; maybe it's just "we use a port and we stick to it," so it doesn't default to 80. And if you want to see a manifest of, for instance, what Istio calls a VirtualService...
D
It's quite easy. You can define the hosts that it is going to apply to, and you can define rules to match how the traffic is going to get routed. So here you see that I have an HTTP clause that will match a header containing an end-user that is exactly "jason" (in YAML these parts go together), so if a header in the request contains end-user and it is "jason," the request will be routed to the service called reviews v2.
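The rule being described matches the standard Istio Bookinfo routing example; a sketch of what that VirtualService likely looked like, assuming the `networking.istio.io/v1alpha3` API that was current at the time and the Bookinfo sample's names:

```yaml
# Route requests whose "end-user" header is exactly "jason" to reviews v2;
# everything else falls through to the default route, reviews v1.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
```

The subsets `v1` and `v2` are defined separately in a DestinationRule that maps them onto the Kubernetes version labels of the deployments.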
D
So v2 here is like the tag name in Kubernetes, but you can implement it a different way; otherwise the request gets routed to v1. I will do a demo to clarify what all this means. Another thing is that resiliency is basically out of the box. If you want to implement retries, you just have those three lines to add at the end of your YAML. I mean, I've implemented retries a few times, and this gets so much easier. You just know how much you can allow with your timeout.
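The "three lines" for retries would look roughly like this; a sketch assuming the same `v1alpha3` API, with the host name and the timing values illustrative rather than taken from the demo:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
        subset: v1
    # Resiliency out of the box: retry up to three times,
    # bounding each attempt and the overall request.
    retries:
      attempts: 3
      perTryTimeout: 2s
    timeout: 10s
```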
D
You can understand the overall behavior of the whole request. About authentication and security: you can, of course, implement mutual TLS between those proxies, you can have namespace- and service-level policies, and of course it integrates very well with Kubernetes RBAC. Observability is actually quite interesting. I have this dashboard here, and you see I deployed an application; it's a bookstore, the tutorial one, and all the traffic reaches this product page. The product page talks to this reviews service.
D
Of that reviews service, only versions two and three talk to the ratings service. And you see that this is the application that gets called, and sometimes you see the color of the reviews changing; those are the different versions. So you see that v2 is black, v3 is red, and v1 doesn't have stars. So you get load balancing for free, and that's how you can visualize what happened.
D
So you see that the details page is called before the reviews page, but now that you see that, maybe you can say: can't we make those calls in parallel and save a lot of time? You can improve your service that way. So yeah, let's do a little demo. That's the application that I showed you, and the best thing is that it's code independent. It doesn't matter which language is used, because it's separated from the code.
D
So all the requests go to v1; I don't have to worry about anything, and the other versions are not used anymore. What we want to do is specify that one user, let's call him jason, is going to be routed only to v2. He's the only one to have access to v2, and the way we can do that is by creating a new rule. This page is a "super secure" app, and there I'm logging in as jason.
D
So you see that, without impacting the users, I can have everybody on the main version and have a specific version just for my user, so I can test whatever is happening. And I remember, at the beginning of the working group, I think it was Michael from LinkedIn who gave a demo about how they do chaos engineering, and it was exactly the same: they choose, for one user, which problem they want to generate, so that they can see. And from there we can come back to the demo.
D
I put up a list of the various service meshes that have more or less the same features; some have more, some have less. I would say that Linkerd was made in Scala, but they reimplemented it in Go and Rust, so they provide simpler adoption; you don't need to map the whole cluster, you can just introduce a proxy gradually. Consul: HashiCorp made Consul Connect and made it very easy to implement security, meaning encryption.
D
So we want things to go well even when they are mistreated. It's a little bit like a vaccine shot, where you inject a little bit of harm in order to create a reaction from the body, so that the body is able to defend itself. And the thing is, it's not so much about causing problems; it's very much about revealing problems.
D
We want to know how the system behaves under certain circumstances, because we can have a lot of ideas about how things work, but in reality it's completely different, and we all end up surprised: oops, and there it goes. It's like the Nintendo Switch snafu that happened at Christmas; that impacted the business a lot. And here I really like this explanation: chaos engineering is the exploratory testing of non-functional requirements, where non-functional requirements are the requirements that, if not met, render the service non-functional.
D
So it's quite hard to define what would make the service withstand turbulent conditions, but that's why we need to explore and test. That doesn't mean we have to blow up half the cluster to find out, because that way you will know exactly what happens, and you probably won't like it.
D
Yeah, I love this one: having a child is chaos engineering for everything in your life. And what chaos engineering is not is having the belief that if you do things without paying attention, problems will go away, because hope is not a strategy. It's really doing things in order to find out, because there are known unknowns and there are unknown unknowns.
D
There are variables that you don't know yet, and we need to find them out. Most of the things that usually go untested are, for instance, draining the requests. Sometimes during a deployment there is a shift of traffic, and those requests get sent back as 500s and nobody notices, because it's during the deployment. So sometimes things go wrong, but the health checks and all those timeouts are super hard to detect, and there are different types of errors you might want to test. For instance, what happens...
D
...what is the difference between a service that is late and a service that is unreachable? Those might be difficult things to detect. Or what happens when some service replies with a 504, taking too long, but the others are fine? You might not want to start with breaking everything. And I have this story about doing a database migration on a huge MongoDB set, where the migration was running before the program started. The thing is that on Kubernetes there are these liveness and readiness probes, and nobody ever thought the migration would take longer, but it did.
D
So basically the SLA is money: it's a contract you made with the customer saying, OK, we can provide this level of service. And with a service mesh, you can see that you have all your data right there. You know exactly how many 500s you sent back over the last month, because everything is stored in Prometheus, and you can query everything for as long as you want, because all your data are stored.
D
For doing chaos engineering, it's a good idea to do what they call a game day: fill in the blank when you want to answer "what happens when...?" There's this good article about breaking DynamoDB. One thing to note is that DynamoDB doesn't scale down, so after a certain load DynamoDB scales up and you cannot reduce the size of the box after that; you are stuck with a big bill. If you want to scale down, you have to migrate the data.
D
How to do that migration: that is a very good game day. Say we're going to practice and recover a data migration from one instance to another. That is a nice example. But nothing can get done if the organization is not behind it. The mentality of the organization should be to expect failure and learn from it. We should not fear and cover up failure; it's something to be dealt with. And a very good idea is to have a high-severity incident management program, to know what to do.
D
Who to update, who should communicate that, and how we can resolve it in a useful manner. So it's very much a cultural approach. And never underestimate the power of root cause analysis. It's really nice to have a document with proof saying: we did this, this happened because of that, and here are the results, everything documented. Because if you don't learn from a mistake, it's bound to be repeated.
D
I found this recently about the Toyota assembly line. They have this motto called kaizen, which, if you take the characters separately, means "change" and "good"; it basically means continuous improvement. And they have this andon cord on the car assembly line, and as soon as an employee detects a problem, they pull that cord. The manager comes to the station to check, and if the problem is severe, it can actually stop the whole line. At the scale of Toyota, where they produce thousands of cars per day...
D
...that's a lot of money they might lose, so problems get fixed really fast, because the detection of problems is really fast. They catch the problem early and they fix it early. So they have a procedure for how to fix problems, basically. And for people who would like to start chaos engineering, I really recommend, from experience, being careful with the word "chaos," because it means different things to different people.
D
People in IT are really excited about it, but to a manager, saying "oh, we're gonna blow up half the cluster to see how it reacts" might not sound like a good idea at the time. So I would recommend that you use the word "resiliency" at first. Once you have the results, it's easier to explain that what you were doing was chaos engineering.
D
So that's the little setup, because I think there are a few steps that might be important. Instead of mentioning chaos, mention the results, mention the goals, like: we want to improve the resiliency of the database. How do we do that? If we don't have monitoring, it doesn't exist; something that doesn't get measured is just going to be forgotten. The good thing is that, since we now have all those nice graphs about how the system reacts, you can get a feel, almost like...
D
"Oh, this is not normal; it was not like that last week." You have some kind of good feeling for how things should look, and that could be called your steady state. Once you understand the steady state, it's easy to make hypotheses, like "what happens if...?", because you're used to how things should look, and you can challenge that by forming an input and a hypothesis. And with an input, once you have...
D
...your hypothesis, you can set in place real-world events, like, I don't know, shutting off one instance to see how the service reacts, or killing a pod in Kubernetes to see how it gets recreated, and so on and so forth. The goal, once you've done that, is to write a report so that others may benefit from it; you don't need to reinvent the wheel over and over again. Another good thing is that, once the report is written, you can talk about the chaos experiment, and then it becomes...
D
...a positive thing instead of something scary, because the only reason chaos is scary is that people don't know about it. It's nothing different from testing in engineering: you want to know about something, so you experiment with it. And the last thing is to keep on doing it; doing it only once won't make things improve, so it's better to set a practice in place and do it often. So yeah, let's see a little bit of the demo for chaos engineering.
D
I have this manifest here that will introduce a delay for our friend jason, who is connected, and I set a timeout; maybe we can also look at how easy it is to set a timeout. So basically, what I did here is: if it's jason, for a hundred percent of the requests, just add a seven-second delay.
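This matches the fault-injection task from the Istio Bookinfo tutorial; a sketch of the manifest being applied, assuming that is what the demo used:

```yaml
# Inject a fixed 7s delay into 100% of requests to ratings,
# but only for the user "jason"; everyone else gets the normal route.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    fault:
      delay:
        percent: 100
        fixedDelay: 7s
    route:
    - destination:
        host: ratings
        subset: v1
  - route:
    - destination:
        host: ratings
        subset: v1
```

Because the delay exceeds the caller's timeout, the reviews service surfaces an error for jason only, which is the controlled blast radius described below.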
D
Otherwise, for all the others, just send the normal response, so we don't impact anybody else at all. So here I'm not impacted, but if I look at jason, you can see that the requests are stuck, and we should see an error soon. Oh, there you go: the error fetching product reviews. So you have a very small blast radius, a controlled blast radius, to do your testing and to see what the service will look like if this other service doesn't answer.
D
So it's a very good way to become comfortable with errors; it allows for a very easy way of handling them and of showing that we can recreate problems easily: what happens when... So, where was I? I want to clean up, because I might have another demo later. Okay, and that was easy too, just cleaning everything up. And, let's see, everything is back to normal now.
F
That was awesome, Julian. I really appreciated that. I also especially agree with the whole idea that we should try to change the culture around reporting outages and being very upfront about what failures happened, so we can all kind of learn and grow together. I did have a quick question about that little YAML file at the end there. That's the Kubernetes YAML, right? So how does that actually interact with Istio? Is that the delay...?
D
The delay gets sent to Kubernetes through the custom resource definition, so Kubernetes sends it to Istio, to Pilot. Pilot takes that info and spreads it to all the proxies, to all the Envoys. The thing is, I can show you what the Envoy version of it looks like, and it's a little bit more verbose. So Istio comes in as a manager of the fleet of Envoys, and it allows for... here you can see that we retry on 503; those are the Istio retries.
G
I have a question on the same point that Matty was talking about. I tried injecting the delay, and I realized that if I try to access the endpoint using curl, the delay is not enforced. However, if I go through the web browser, which then goes through the ingress gateway, then it is observed. And I guess that can explain the point you were making earlier: if you go through the proxy, then the delay is observable, but not with curl.
D
Yeah, exactly. The thing is, the main question I also get from that is that people who already have a service, for instance with an SDK inside that connects with TLS, might run into some issues. If you try to go egress, you have to configure that, to allow a service to go outside the cluster, and you have to configure traffic to come into the cluster. So the north-south traffic is also part of the YAML definition that you need to think about.
D
The basic gateway, the Gateway, is interesting to see. If you look at this, okay, so that's where you define the virtual service, which is the product page; this is the web page, and you can define routes, and for those routes you send the traffic to that service. So that's the ingress gateway. It's a little bit like you allow this traffic to reach the cluster, because by default nothing is reachable; it's an opt-in mechanism. Does that answer the question? Exactly.
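A sketch of that opt-in pairing, based on the Bookinfo sample of the era (port and host values are the sample's, not necessarily the demo's): a Gateway admits traffic at the edge, and a VirtualService bound to it routes the admitted traffic.

```yaml
# The Gateway opts external HTTP traffic into the mesh...
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: bookinfo-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
# ...and a VirtualService bound to that Gateway routes it.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookinfo
spec:
  hosts:
  - "*"
  gateways:
  - bookinfo-gateway
  http:
  - match:
    - uri:
        exact: /productpage
    route:
    - destination:
        host: productpage
        port:
          number: 9080
```

Without a Gateway, nothing outside the cluster reaches the mesh, which is why a browser going through the ingress gateway sees the injected fault while a direct curl to the pod does not.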
G
It does, and that's exactly what I was doing as well, because when I was starting to play with this, I realized that it cannot be done using curl; it has to go through a gateway, and that's when it works. As a matter of fact, funny enough, I'm actually giving a talk at O'Reilly Velocity next Monday on this exact same topic, so this is a very relevant thing.
G
I have a lot of content prepared already. I wish I could have gone through the Envoy concepts, but I only have a twenty-five-minute slot, so I'm jumping straight into the code and saying: all right, here is the issue, here's the rule that applies, and here is how you can inject faults. And I like how you pitched it to your manager: do not even mention the word chaos, you know they're gonna freak out.
D
Thank you, and I'm very happy to hear that I was not the only one who got that kind of reaction from, you know, non-technical people. If you didn't know, I heard that even Netflix renamed the chaos engineering team the resiliency team, or something like that; they want to state the goal, not what they're going to do. You know, resiliency.
D
Good. I have more: one thing I'm really excited about is dark launches. You can actually mirror your traffic from the ingress, so you could have a secret cluster clone with a new product, and you can test it live with real traffic going there, because all the answers back to the proxy are discarded automatically. So you can do super powerful things, and just with configuration.
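Traffic mirroring is likewise a one-field change in a VirtualService; a sketch following the Istio mirroring task (the httpbin names come from that task, not from this demo):

```yaml
# Serve all live traffic from v1, while shadowing a copy of each
# request to v2; responses from the mirrored v2 are discarded,
# so users never see the dark-launched version's behavior.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin
  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
      weight: 100
    mirror:
      host: httpbin
      subset: v2
```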
A
All right, so we only have probably about five minutes left, so I'm not going to take too much time. As I mentioned last time we met, we have a chaos engineering landscape now for CNCF. Please, you know, send pull requests; there are obvious things missing, like Netflix's Chaos Monkey and so on, but it hasn't been easy for me to edit it myself, so I'm hoping someone else...
A
...does it. More importantly, and more timely: there's a chaos conference this Friday, where I know some of you will be, that I think Matthew and some other folks from his organization are putting on, so I hope to meet some of you face to face there, and if you have any topics you want to discuss, let me know. And then finally, I had some volunteers for the intro-to-chaos-engineering topic for KubeCon + CloudNativeCon in December.
A
So thank you to everyone volunteering; we'll have a session there hosted by a few folks. And finally, there's some work for us to iterate on the white paper. I've just been super busy and haven't had time to go look at it myself; I just wrapped up traveling to China. So I appreciate it; I think Sylvain's been driving it.
A
That's mostly on me, but I'd appreciate it if folks have time to contribute and take a look at it. We'll meet again in a couple of weeks. If there's anyone in this group that would like to volunteer to present on a topic, let me know. I found Julian's topic today awesome and very interesting, so if there's anyone else that wants to talk about something, let me know. I know Mikhail from Bloomberg is up to present, but I don't know if he can make it in two weeks.
A
No, you know, there's no official drop date, but I would say it would be advantageous for us if we got it ready for early December, for KubeCon, mostly because we'll have a lot of PR and analyst people there at the conference, and it's just a way to kind of drum up interest. And we actually could have folks, if folks are actually planning to be there, set up some meetings.
A
Cool, yeah, I think it's totally doable. A lot of us have been busy with conferences and so on, but I think now we've got a good group of folks that can iterate on it and get there. Lane's been doing a good job of pushing it along as far as possible, I think.
A
I mean, that's a challenge of open sourcing in general, right? The idea would be, you know, it doesn't have to be this chaos engineering tome slash Bible. We're basically positioning it to introduce the topic to the wider cloud native community and basically offering an initial landscape, right? So, you know, I would love more additions to that landscape.
A
So when we officially launch the working group and announce the white paper, we'd have the updated landscape to go with it, right? That's what we're really trying to do here: educate the wider CNCF and cloud native community on chaos and resilience engineering. Cool. Any other thoughts, concerns, or questions before we cut out?
A
Yeah, most likely; I'm planning to make it work, so maybe we could do something. Let me think about that. Okay, all right, thanks everyone for your time, and thank you again, Julian, for that amazing demo. We'll get this published on YouTube so people can watch it. I'm going to go update the readme to link to previous videos; I've been a poor steward of that readme, so I'll get it done by the end of today. All right, take care.