From YouTube: Chaos Engineering WG Meeting - 2018-10-09
A: Right, so the agenda: as she's not here, I'm trying to do my best, so hopefully, you know, raise your hand if I miss something or if I'm not clear enough. Today we're going to have one demo, around PCF, from you guys, Karun and Ramesh, and then we'll have a quick landscape and white paper update; I'm not sure there is much there, but let's keep the discussion going. I think Chris was meant to do a quick recap of Chaos Conf, or hopefully there will be someone who can, you know, tell us all about the conference. I heard it was great, so probably you can fill that role, if you don't mind. Right, so let's get started. I think we are ready to roll for the demo, guys, so I'll leave you to, you know, share your screen; I think that's probably best. Sure, I'll stop sharing.
C: Do you want to start?

D: Yeah, sure. Thanks, Kevin. Good morning, folks. My name is Ramesh; I'm the senior engineering manager for the platform engineering team at T-Mobile. First off, thank you for giving us the opportunity to present here. Very excited to know that there's a community around this, and that we can now tap into an extensive community network as well as get help with what we're trying to do at T-Mobile. A quick background on me: I've been with T-Mobile for 10 months.
C: So, my name is Karun Chennuri and I'm a senior software engineer working at T-Mobile. I joined this team in March 2018, so I'm fairly new to T-Mobile. Before that I was with Huawei, as a cloud security engineer for them. I have about 13 years of experience in the field of information security and enterprise security. So yeah, that's pretty much it. Here at T-Mobile we take care of Pivotal Cloud Foundry operations, a DevOps kind of role, and we also have a Kubernetes in-house cluster; these two come under Ramesh.
D: So one of the things that I'd like to start off with is the Joker's, you know, analogy, how he interprets chaos in the movie The Dark Knight, and we kind of spoke about that at our conference, at SpringOne. The Joker calls it a fair act: every time you disturb the harmony of systems, good things can come out of it, right? And that analogy kind of started my thinking as well. My team built all of these capabilities on top of a massive infrastructure, and behind the scenes there's compute, network and storage, and things will always go wrong, right? I'll get to the actual problem statement, but I always like to start off any presentation with who we are, what we do, and what services we provide in the stack, to kind of drive you towards the problem statement, and then hand it over to Karun.
D: One of the capabilities in the stack is containers-as-a-service, which is our, you know, offering for Kubernetes, so people can bring their own containers; and then a platform-as-a-service abstraction, where people can just give us their code and, you know, we'll run it within our abstraction layer, and they don't have to worry about other capabilities, right? It's like a self-driving car. We give them the bells and whistles of operating their code at scale, and at the same time, you know, they get the best experience in terms of dealing with live customer issues. So, depending upon the kind of abstraction you choose, different flavors come in. That's really what my team does, in a nutshell. Let's move on to the next slide. One of the things that folks have asked me is: okay, so what's the big deal? Every company is doing this.
D
You
know
what
what's
really
in
my
portfolio,
which
is
driving
the
need
for
chaos,
engineering,
I'll
focus
on
the
fact
that
we're
building
our
services
on
top
of
on
Krim
infrastructure.
Today
there
is
compute
network
and
storage
we've
gotten
the
business
used
to
agility
already
in
the
last
two
years,
roughly
4,000
applications,
500
active
users
per
day
for
a
31,000
containers
across
development
and
prod
their
production.
For
me,
even
though
this
development
environments,
business
has
gotten
used
to
the
ideality,
which
is
faster
applications
faster,
mean
time
to
respond
and
resolve.
D
The
last
iPhone
launch
event
saw
a
max
peak
of
like
16,000
requests
per
second
provider.
The
minute
iPhone
launch
was
launched
right
and
then
since
then,
it
has
been
trending
around
an
average
of
14,000
thing
on
that,
culturally,
when
to
move
on
to
the
next
five
weeks.
So
then
we're
moving
towards
the
feature
which
is
like
you
know
where
we
want
to
extend
our
capabilities
around
simplicity,
security
and
scalability,
we're
trying
to
deliver
net
new
capabilities
on
the
function
of
the
service
and
also
exploring
new
capabilities
as
the
platform
as
a
service
layer.
D
So
a
lot
of
work
that
is
already
being
planned
in
terms
of
a
Denarius
meant
enhancements,
all
of
which
entails
infrastructure
in
the
in
the
in
the
background.
The
next
slices
okay.
So
our
bit
of
the
explanation
of
the
problem
statement,
little
more
have
folks
here
on
the
call
seen
this
before,
like
blue
and
the
black
dots
anybody.
D
Okay.
So
let's
actually
go
through
the
animation
here
girl.
So
basically,
what
you
see
here
is
what's
called
as
a
death
star
diagram
and
it's
a
representation
of
the
kind
of
the
ecosystem
you
micro-services
deal.
You
know,
live
in
and
the
kind
of
interaction
that
they
have
independent
services.
The
snapshot
on
the
left
is
from
Amazon
from
2010
in
the
Netflix
that
star
diagram
is
the
blue
version
and
then
all
looks
a
little
more
less
chaotic,
but
we're
getting
there
in
terms
of
you
know
what
the
chaos
is
going
to
look
like
the
future.
D
So
the
key
message
here
is:
you
know
we
as
engineers
we
write
services,
the
services
have
a
back-end
systems,
they
interact
with
and
obviously
even
system
scale.
You
know
things
your
customer
is
going
to
take
the
impact
in
terms
of
like
any
customer
impacting
events.
So
what
we're
trying
to
do
here
is
like
us,
in
terms
of
our
digital
transformation
initiative,
we're
trying
to
think
about
failure
in
a
different
way
and
trying
embrace
failure,
because
we
know
failure
is
inevitable.
D
X,
like
this
I
mean
so
and
I
actually
start
off
with
this
problem
statement
with
karoon
and
a
couple
of
months
ago,
we
we
wanted
to
look
at
like
two
kinds
of
failures.
Obviously
there
is
the
platform
level
failures
that
I
care
about,
because
I
run
the
platform
and
what
I
mean
here
is
what
are
the
kinds
of
things
that
could
go
wrong
with
my
platform?
What
are
the
assumptions
engineers
make
when
they
build
services
on
on
top
of
an
infrastructure
right
think
about
things
like
about
and
that
would
be
homogeneous?
We
have
infinite
bandwidth.
D
The
fact
that
we
have
infinite
compute
resources.
All
of
these
assumptions
needs
to
be
validated
because
when
you
fail
to
validate
at
this
event,
problems
start
happening,
and
you
know
you
could
get
into
a
disaster
scenario
like
these
two
guys
on
the
left
side,
which
is
not
our
data
center,
but
somebody
else
is
there
a
center.
The
fact
here
is,
you
know
our
data
center
is
in
a
North
Quay
prone
zone.
You
know
and
anything
could
hack
happen
here.
D
We
have
active
volcanoes
in
this
region,
so
we're
trying
to
be
cognizant
about
the
fact
that,
okay,
if
things
fail
how
our
system
is
going
to
react,
how
can
t
mobile
continue
its
business
or
and
because
a
lot
of
the
business
critical
applications
run
on
this,
and
then
there
is
the
containers
that
are
running
within
the
platform.
Containers
have
applications,
and
it's
not
just
one
target
application
right,
there's
several
application,
so
you
want
to
launch
specific,
targeted
attacks
on
containers
and
just
affect
that
one
application
under
context
right
all
the
different
servers
it
interacts
with.
C: So, looking at the problem statement, we have two problems there. One is the platform-level attacks, and the other one is attacking the applications that are running on the platform. So we started exploring, you know, whether there are any existing tools, because we didn't want to reinvent the wheel. We started with an open source solution called Chaos Lemur, but all we could make work with Chaos Lemur is the killing of virtual machines. In fact, at the platform level we wanted to achieve killing virtual machines, killing a process, introducing latency into the system, and introducing memory and CPU hog, right? All of these come under the infrastructure-level attacks, but Chaos Lemur for us was more like a Chaos Monkey: it can only go and, you know, turn off a random virtual machine, which is definitely not something that we were looking for.
C: We were looking for a bigger solution, so we started looking at Gremlin as one of the commercial offerings as well, and what you see here is the version that we evaluated. Gremlin is a very, very powerful tool, I must say, because it comes with a very neat UI. There was some initial doubt whether Gremlin would work on the PCF environment or not, but we made it work, and it seemed to work fairly well.
C: You know, it performs the operations at the infrastructure level, like killing of virtual machines, killing of processes, introducing latency. But again, the version that we evaluated doesn't have application knowledge. It seems Gremlin is working on that: even in the recent intro from the CEO, Mr. Kolton, it seemed to be coming up, and they have added this capability in the latest release of Gremlin, which we have not looked at yet, right.
C: But again, Gremlin comes with a cost, and we are also very conscious about the cost involved, you know, in running on the infrastructure. So we looked at Turbulence as another alternative, and it's open source. As you can see, it performs fairly well, and it's very native to Cloud Foundry as well, which is pretty much what we were looking for: anything BOSH-hosted.
C: Chaos engineering attacks, or failure injection attacks, can be performed on BOSH-hosted virtual machines: killing virtual machines, killing processes, introducing latency, and CPU and memory hog. But again, it lacks application knowledge, right? So for us, as I said, as the animation explained, those are the two problem statements: the infrastructure-level, or platform-level, chaos engineering attacks, and the application-level attacks. So here is what we looked at: Chaos Toolkit, a safe framework.
C: It basically orchestrates multiple other solutions, like Gremlin and Turbulence, as drivers. So what we built is two drivers. One is a driver for Turbulence itself, and the other is a custom, homegrown driver built from scratch, which has application knowledge, so it can go and discover where your application is running within the cluster. If I have a cluster of two thousand nodes, within those two thousand nodes this driver can go and figure out that your application is running on those particular nodes. It can also figure out what its service dependencies are.
C: Let me explain more clearly what exactly we are talking about here with simulating failures at the platform level. You can see the component diagram for PCF, Pivotal Cloud Foundry. There are various components here; each component could be a virtual machine, and the multiple boxes here could be processes running inside a virtual machine. So failure can happen.
C: There is a lot of interaction happening, so it is obvious that, you know, failure can randomly happen at any point, and, you know, it might eventually lead to a disaster as well, right? So let us take a simple example here: the rep process going down. The rep process, which runs inside the Diego cell, is responsible for managing the life cycle of the containers running in it; the Diego cell is like a worker node in Kubernetes.
C: Let's say a set of apps have auto-scaling enabled, which means that based on the load, or CPU stress, or HTTP latency, the apps are going to scale up and scale down in terms of instances. For that, the autoscaler-as-a-service depends on the Cloud Controller. So what happens when there is huge traffic to the app and the Cloud Controller goes down: is the app going to scale up or scale down? You know, those kinds of failure injection tests can seamlessly be performed
C: via the driver that we are talking about here. So how do we perform this? This is Turbulence. Turbulence comes with an API server and an agent; the agent goes and sits in each of the virtual machines in your cluster, listening to the API server, which is the control plane. So we use CTK, the Chaos Toolkit, and initiate a few attacks, which go through the API server, and then the agents fulfill the request.
C: So, of the ones highlighted here, some of the attacks that we can perform are: killing a VM; killing a process; pausing a process, which will be one of the demo scenarios here; introducing stress into the system, by introducing CPU and memory hog; corrupting a disk associated with, you know, a virtual machine; and network delays, limiting the bandwidth, reordering of the packets. What happens if the packets are reordered? How is your system going to behave?
C: Obviously, how is the platform going to react? Firewalls: attacking the firewalls at the platform level, with targeted-level blocking; you can go and perform, you know, IP-table-rule-level failures as well; shutdown; blocking DNS; and duplication of packets. These are some of the features that come with Turbulence, and the highlighted ones are the ones that we have added and contributed back to open source. So let me show you the first demo. For this I would like to run the video from my desktop, so just give me a second here.
C: Yeah, so this is the demo I was talking about. What we will do here is demonstrate how Chaos Toolkit has been used with the Turbulence driver we added. It demonstrates two scenarios. The first scenario is pausing a process: here we are going to pause the SSH process. What happens if the SSH process on the Diego cell is paused, right? And then the other one is killing a particular VM itself, like killing a Diego cell.
C: What happens to the containers running in it, right? It's a very short video, and it's on YouTube as well; the reason I run it here is that it has better quality and the zoom-in effect. So first I go in here: pause-process.json. This is my experiment file in Chaos Toolkit, with a title and description, a steady-state hypothesis, and the configuration information that I am supposed to pass as part of the experiment.
C: These can again be enhanced: instead of putting a username and password here, they can come from Vault as well. What I'm doing here is on a one-box environment called BOSH Lite, which gives you a Cloud Foundry running on one laptop, and in the Turbulence deployment's virtual machines you can see the Turbulence API server running. As I said, you know, there is a Turbulence API server, and a Turbulence agent running in each virtual machine.
C: So that's the configuration we provided in the experiment, and then the method here is the attack: pause the SSH process for one minute, on deployment cf and group diego_cell, with limit one, which can be any Diego cell. Right now in this environment I have only one Diego cell, since it's a one-box environment, but we tested this successfully on the staging environment, with hundreds of VMs there.
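(For illustration: the experiment file itself isn't reproduced in this transcript, but a minimal Chaos Toolkit experiment of this shape might look like the sketch below. The chaosturbulence module, function and argument names are assumptions based on the description above, not the actual API of the T-Mobile Turbulence driver; only the top-level layout of title, description, configuration, steady-state hypothesis and method is standard Chaos Toolkit structure.)

    {
      "title": "Pause the SSH process on a Diego cell",
      "description": "Pause sshd for one minute on one Diego cell and observe the platform.",
      "configuration": {
        "turbulence_api_url": "https://turbulence.example.com:8080",
        "turbulence_username": { "type": "env", "key": "TURBULENCE_USER" },
        "turbulence_password": { "type": "env", "key": "TURBULENCE_PASSWORD" }
      },
      "steady-state-hypothesis": {
        "title": "Diego cell accepts SSH connections",
        "probes": []
      },
      "method": [
        {
          "type": "action",
          "name": "pause-ssh-process",
          "provider": {
            "type": "python",
            "module": "chaosturbulence.actions",
            "func": "pause_process",
            "arguments": {
              "deployment": "cf",
              "group": "diego_cell",
              "process": "sshd",
              "duration": "1m",
              "limit": 1
            }
          }
        }
      ]
    }

(Such a file would be run with the standard Chaos Toolkit CLI, i.e. chaos run pause-process.json.)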
C: There's the pause, you know, and it pauses for about sixty seconds, and we can go and check the UI as well; there's a UI aspect to Turbulence. It is saying the pause-process attack is in progress, and it will continue to, you know, spin until about the one-minute mark. So after about a minute you will see the lock is released, right? You can try again to do an SSH, and it's again, you know, very responsive after one minute, because there is no longer a lock on that SSH.
C: Since we could do it for the SSH process, you can do it for anything: you can do it for the rep process, or anything. The second scenario is killing a Diego cell, any Diego cell. Again, there is a separate experiment file for this; we go with the standard experiment file definition. The steady-state hypothesis is actually empty for now; you know, we are not doing much, but we can add some probes.
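(For illustration: a non-empty steady-state hypothesis could use Chaos Toolkit's built-in HTTP probe, as sketched below. The URL is a placeholder; the http provider type and the tolerance-as-expected-status convention are standard Chaos Toolkit.)

    {
      "steady-state-hypothesis": {
        "title": "Applications on the platform still respond",
        "probes": [
          {
            "type": "probe",
            "name": "app-responds",
            "tolerance": 200,
            "provider": {
              "type": "http",
              "url": "https://spring-music.example.com/",
              "timeout": 3
            }
          }
        ]
      }
    }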
C: The method there is the attack: kill a Diego cell which is running in the deployment cf, with limit one; so you kill one Diego cell, any Diego cell, for that matter. Then I'm going to run this experiment, kill-diego-cell: validating hypothesis, and the experiment ended with "completed". As you could see, there was a Diego cell running here; now let's go and print the VMs. Clearly, there's no Diego cell here, right? So it's killed, right.
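(Again purely as a sketch, with the same assumed driver module as before: the method section of such a kill-VM experiment might read as follows.)

    {
      "method": [
        {
          "type": "action",
          "name": "kill-diego-cell",
          "provider": {
            "type": "python",
            "module": "chaosturbulence.actions",
            "func": "kill_vm",
            "arguments": {
              "deployment": "cf",
              "group": "diego_cell",
              "limit": 1
            }
          }
        }
      ]
    }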
C: So what happens to the applications running in that Diego cell, or the containers running in there, right? That's another way of looking at things. Again, the UI shows that the attack has been successfully completed; that's the reason it is in green. And then the Resurrector: as I said, it's a BOSH-hosted environment, so BOSH will go and, you know, bring back the Diego cell again.
C: So that's the first half of the problem statement: how you perform the platform-level chaos engineering attacks, via Turbulence and the Chaos Toolkit driver for Turbulence. The second half of the problem statement, which is crucial for us, is application-level chaos engineering, because we have about 4,000 applications running on our platform. We are not a single-application company, right, where you just attack one application and all the components of that application and see what happens.
C: It's not that. We have different teams building different stuff every day, and those roughly 4,000 applications are not at all independent; they are interdependent applications, right? So we don't want to randomly go and kill a Diego cell: it would impact multiple teams within T-Mobile, and that's a big problem for us. We wanted to make very conscious, very targeted attacks, in such a way that, okay, a specific application is targeted for the chaos engineering; what would happen then?
D: If I may interrupt: we deal with an open support model, where we get a number of different questions, and I want to talk about four favorite questions here, as to how we can actually not just be enablers but also be guardians, right; not necessarily just relying on best intentions, but providing tools and capabilities that will also kind of, like, help with those best intentions when working with such a large internal customer base. The first one is "my app isn't picking up the latest configurations", and our first reaction is:
D: this is because of bad karma, and your app is misbehaving, right? The second one is "my app isn't connecting to Cassandra", and our first reaction is: because you can't, and that cluster was potentially decommissioned, or you must check with the Cassandra team. And then the next one we see is "my app works locally, but not on PCF".
D: That is likely because you misbehaved with the PCF team, it is assumed. And the last one we see is "it was working till Friday and then it stopped working", and this is because we believe you've not paid the bills, all right? But jokes aside, guys, these are some serious concerns, which we classify as "debugging as a service", and oftentimes customers like to start with us because we're very nice to them; we try to enable them.
D
We
try
to
like
make
them
some
sufficient,
but
that's
not
enough,
so
you
need
to
like
provide
tools
and
capabilities
which
will
also
provide
guardrails
for
them
to
operate
with
them,
and
that's
where
to
like.
This
is
going
to
come
in
all
going
to
be
very
effective,
which
is
it's
going
to
like.
You
know,
enlighten
them
as
to
what
they
can
do
to
actually
validate
some
of
these
off
soil
conditions
right
and
help
them
be
more
self-sufficient.
C: Perfect, sounds good. So, having said that, extending the same problem statement, I would like to touch on the cascading effect, popularly the butterfly effect, as well, right? We all know what a cascading effect is; I don't want to dig into the details, but quickly explaining this: we have "concert" and "weather" micro-services running in our platform, and in this case weather is dependent on a third party.
C: What happens if the third-party application, you know, goes down? It totally impacts weather, and that times out concert, right? And it may so happen that, you know, the database which concert is dependent on might also go down. All of this put together creates a cascading effect and gives an unfavorable experience in the web application running on the front end, to the client, right; the client will experience something very unfavorable. So, zooming in a bit:
C: what happens if these applications are running in a Spring Cloud kind of ecosystem? These are the Diego cells; imagine that, you know, the weather and concert micro-services are both scheduled and running on the same Diego cell. The Diego cell is a virtual machine, so both these containers are scheduled and running on one particular node.
C: So what I mean by a targeted attack is: what happens when you block the traffic to concert, coming from the Go Router or the load balancer, to the concert service only, and block the traffic from the weather service to the MySQL database? As you can see, the weather and concert services are both dependent on the MySQL database, so I'm doing a very targeted attack here.
C: Both are running on the same node, but for these two apps I can go and, you know, do fine-grained attacks: blocking the traffic to concert, and blocking the traffic to the database from weather. Again, the failure can happen at different levels as well; there is a lot of interaction happening here. Weather might lose the connection to the service discovery or the circuit breaker; concert can lose the connection to the config server as well.
C: All these attacks can be simulated, all right? So how we do that today is CTK CF blocker, a new driver that we wrote, and we target specific CF apps and application hosts. What it does is it discovers where your application is running, and then it discovers what services your application is bound to. In this case, my weather and concert micro-services are bound to the MySQL database, the config server, Eureka, and the Hystrix service; and now it can also go into a service instance. For example, it can...
C: So, Spring Music. Just to see what happened here: the app got loaded, and these are the breadcrumbs; this is the album information that got loaded from a database. You can see here which database it is: it is a MySQL database called music-db. The two spring-music apps are both bound to this music-db. So now what we're going to do is bring down the connectivity to music-db for one spring-music app.
C: And then the method here is blocking a service and unblocking a service. First we block the service for 60 seconds; the service that we are going to block, for the app spring-music, is named music-db, and we have to provide some information like the org and space names, which is specific to PCF. Then, in the rollback, unblocking the service, we again provide the org and space names and the app name, and then the service to be unblocked.
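(A hedged sketch of what such a block/unblock experiment could look like; the chaoscfblocker module, function names, and org/space values are illustrative assumptions, not the published API of the CTK CF blocker driver described here.)

    {
      "title": "Block spring-music's bound MySQL service",
      "description": "Block traffic from spring-music to music-db for 60 seconds, then unblock.",
      "method": [
        {
          "type": "action",
          "name": "block-music-db",
          "provider": {
            "type": "python",
            "module": "chaoscfblocker.actions",
            "func": "block_service",
            "arguments": {
              "org": "my-org",
              "space": "my-space",
              "app": "spring-music",
              "service": "music-db",
              "duration": 60
            }
          }
        }
      ],
      "rollbacks": [
        {
          "type": "action",
          "name": "unblock-music-db",
          "provider": {
            "type": "python",
            "module": "chaoscfblocker.actions",
            "func": "unblock_service",
            "arguments": {
              "org": "my-org",
              "space": "my-space",
              "app": "spring-music",
              "service": "music-db"
            }
          }
        }
      ]
    }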
C: So we block the service; and there is a verbose flag you can enable to get deep-dive stack-trace information, but we will not use that for this demo. What this timer is doing is trying to block traffic to music-db bound only to the spring-music app. Now it has found where the applications are running; the app has three instances, so it figured out all three instances of the app running in your cluster, and then it started attacking.
C: You can just refresh and, as you can see, there was a successful attack: there's no data now, right? Again, the other spring-music app, which is running on the same VM and pointing to the same database, can fetch the data; there is no issue there. But for the attacked spring-music app there's no data. So that proves a successful attack, and it takes about 60 seconds to, you know, bring the system back, because we have a rollback policy after 60 seconds, right.
C: So let's go back and see what's happening: the rollback, unblocking music-db, has kick-started, and there you go, you see that spring-music is back in action. Within those 60 seconds you can see what happens to your application; that's the actual goal here. And it can be any service: not just music-db, it can be anything, MySQL, or Eureka, or the Hystrix service, any of the services the app is bound to. The second scenario is blocking traffic to an app.
C: What we're going to do is block traffic to this spring-music app only; again, these two are running on the same virtual machine. Blocking traffic: let us look at the block-traffic.json experiment file, and you can see it's the same, you know, similar pattern of definition. The configuration goes in here, then the steady-state hypothesis; let me come straight to the method here. What we are doing is blocking traffic to the app spring-music; there is no service involved here, and that's fine.
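(Sketched the same way as before, with assumed module and function names: blocking all traffic to the app would simply omit the service argument.)

    {
      "method": [
        {
          "type": "action",
          "name": "block-traffic-to-spring-music",
          "provider": {
            "type": "python",
            "module": "chaoscfblocker.actions",
            "func": "block_traffic",
            "arguments": {
              "org": "my-org",
              "space": "my-space",
              "app": "spring-music",
              "duration": 60
            }
          }
        }
      ]
    }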
C: It's blocking all traffic to spring-music, so it's still working, but let's give it a moment... and there you go, you see the traffic has been blocked, and you see the 502 Bad Gateway. The gateway is aware of the route, but it doesn't know how to make a TCP connection to the app. So that counts as a successful attack for us, right.
C: And the unblocking has been initiated; you can see that, you know, the unblock happened, because there is no 60-second timeout here, so it happened very quickly. So that brings us to the end of the second demo as well. Let me go back to my slides; I have like two, three slides now, and that's all I have.
C: We have the Chaos Toolkit CF blocker driver and the Chaos Toolkit Turbulence driver; we want to bring these two under the Chaos Toolkit umbrella, and that's the effort we are putting in as well. These demo videos are available for you to go through again. Also, Turbulence is built by another person, you know; we are not the ones who created Turbulence. So we did a pull request with all the new feature add-ons, and the PR is still pending approval; we will wait and see.
C: We want to conduct some game days. This is still a slightly matured proof of concept right now; we want to make it work, productize it, and call out teams on a team-by-team basis and perform game days in a war room, randomly attacking our own infrastructure and seeing how their applications behave. So we want to build this capability and give it to the application teams to, you know, perform chaos engineering attacks on their apps. So that's all I have.
A: Well, thank you very much, Karun and Ramesh; that was a really sweet demo. It was really, really interesting to see; thank you very much for that. I'll carry on with the, you know, the slides, if you don't mind now. Should I stop sharing? Yes, please, yeah. Thank you, it was really a very interesting demo, very fun to watch. Alright.
A: I think that's all, right? So, yep, I think we are back to, you know, the usual discussion around the landscape and, you know, crafting the categories. I personally didn't get any chance to actually look at it yet, so if anyone has, fantastic. I see there is a pull request; I'll be looking at it as best as I can. But I think the idea, and we'll be talking about KubeCon probably, is to have that wrapped up in some fashion, you know, soon-ish.
A: Basically, this is where we are now as a community with chaos engineering. I think, at least I know, that I and Gillian, who presented two weeks ago, will be doing a joint demo, building on the one he started, and showing how we can actually automate that. Please ping Chris if you are interested in talking, or being there, you know, in some capacity; obviously we...
A: Reach out to him, yes; don't reach out to me, because I don't have any control over that, I'm just passing on the message. Yeah, please do talk to Chris about it. Alright, it would be fantastic if we could see, you know, as many people as we can, not just on screen this time but, you know, face to face, and, you know, just have a coffee or something; that would be perfect.
A: We'll see you there, definitely. Alright, this is where I'll be leaning on the people who were at Chaos Conf. I heard it was a very good thing, so I'm willing to listen a bit more about it. So who would be, you know, willing to talk about it? I think Michael, yeah; and I don't know if there's anyone else from the...
E: So, Chaos Conf was about a week and a half ago in San Francisco. We kicked off the day with Kolton from Gremlin talking about the sort of evolution of actual chaos engineering attacks, and he sort of concluded by talking about Gremlin's new product ALFI, which is application-level fault injection, where you can write various attacks into your application, which was pretty cool. Adrian Cockcroft from Amazon then spoke about the history of chaos, which I really enjoyed; it was a really good talk. We had a number of other presentations.
E: Later in the afternoon we had Tammy and Ana from Gremlin do a really cool demo of breaking AKS and EKS, which was pretty cool, and we finished the day with Jessie Frazelle, who was talking about containers and breaking stuff in containers. So it was a really good day. Sorry; no, no, go on, it was nothing. Yeah, it was a really cool day, good to meet more people in the community, and I look forward to next year; I hear there may be a new venue, we'll see.
A: That's back to what I was saying about the working group work. I need to pay attention personally to the state of the PRs and everything, not just the one I mentioned earlier, but the various ones on the white paper, and just try to merge everything into one document and see where it stands. Right now it probably needs a lot of polishing, so PRs are, you know, basically welcome. And that's about it for this week's, you know, meetup.
A: Anyone? OK. I think this usually ends with me saying that we welcome any demo; if anyone wants to do a new demo, or a better one, you know, whatever, I think it's always cool to have that. So that's the standing invitation to people that aren't aware of this working group. I think it's important; you know, I'm sure there are plenty of companies who are doing chaos engineering, or resiliency engineering, whatever you want to call it, and they should come and just show it to us. Right, all right.
A: A good question; yeah, if anyone wants to respond, I don't want to hog it all, you know. I should actually have done this with the screen showing, so we can see each other. Happy to let anyone answer that one, or I can go for it. I think the working group is not yet a working group; that's important to note from a CNCF point of view. It's not yet a working group; it needs to be proposed and accepted and blah blah blah.
A: The point of what we're trying to do as a community is to bring everyone across and basically work together, and for that we're trying to at least add a white paper. It doesn't have to be the complete, you know, comprehensive, canonical white paper for chaos engineering, but enough to showcase what it is and where it goes. I know Chris is looking at expanding on the landscape side of things.
A: Michael did mention that last time, and it's a good point, right. I think the idea is, you know: I don't work for the CNCF, I don't know that process, so all I can do is echo what we discussed before. But as far as I understand, the idea would be to put that, you know, GitHub repo in good shape, where we can, you know, agree on that white paper, at least as a, you know, phase-one, version-one.
A: Or anything like that, and increase the landscape, so that we have a good overview of what exists. As a milestone, I think it would be before the end of the year, and best would be if we could, you know, do it before KubeCon, because during KubeCon I think Chris is trying to raise more awareness and more, you know, promotion. Yeah, so earlier would be better. Okay.
B: Let me know, maybe Chris has better visibility, but the best case is to fix the target: to finish the landscape before the end of the year, and to motivate the group to grow into a real working group, with the target of writing specifications, or best practices, I mean, guidelines for chaos engineering, for all engineers working on services, I think.
A: I think that's interesting, because, as far as I understand, the CNCF tries to avoid specifications; initially they don't want to, you know, be governing us, at least as I understand it. So best practices, or something like that. My assumption here is that we do the tidy-up, you know, within like six weeks or something, so that it gets ready, you know, during November, so that at KubeCon we can meet up and probably discuss, for next year.
A: If we look at what the Serverless working group has done, it's subtly interesting; you were talking about specifications, and they came up with a specification on the side. I don't know if we need a specification, or I don't know what we need as a community, but KubeCon and, you know, the other meetups, for those who can come to KubeCon, are the good places. This is where I would like to go, and as a community we can basically decide that.
A: Every time we meet up in person, it's probably a good, you know... it's faster, it's more concrete, and, you know, it seems to work better. Even though we're all distributed and, you know, all over the place, somehow, when we meet up in real life, that seems to be faster.