From YouTube: CNCF SIG App Delivery 2020-11-04
C
Yeah, let's actually put in brackets here. I'm going to just say: let me space my data, all right. I can rename... I'm just renaming.
D
Honestly, I'm afraid that not necessarily a lot of people are going to show up today. Elections, yep.
D
Yes, but let's still give people a chance to... and I know that you wanted to present something, so you might not have the biggest of audiences, but anyway we have the recording and we can share it later on. That's still something we can do. All right, I'm posting, as always, the link to the doc in here, and waiting a couple more minutes for people to join.

Then, the Litmus integration with Keptn. Again, as not too many people are going to join today, we'll just record it and then post the link on the mailing list later on. The recording should then be available tomorrow. And then I briefly also wanted to give you, maybe as one of the first items, a brief update on the podtato-head project.
E
So, a short update from the operator working group. We are meeting once a week as a small, very small group, three more people and myself, and in that meeting we are talking about the operator definition, trying to get it into something we can share. We are sharing the results week by week; in the operator definition working document (I will edit in the links) you can see.

Progress is being made and we're open to comments there. And if you want to join the small group, I will also share how, or you can just ping me. Last time we talked about the capabilities of an operator, and we are trying to define phases in the life cycle of an operator and how an operator can say "I'm doing that" or "I'm not doing that."
D
Sorry,
it's
been
a
long
day
thanks.
When
do
you
think
would
you
would
want
to
share
it
with
a
a
broader
audience
that
just
started?
There
is
already
like
just
quickly
sharing
it
here.
E
I think this is something we already discussed. The yellow is something we discussed, but it's not in final words, so we're hoping that week by week we will refine it. So we talked about it; this is the output of what we talked about in the meeting, and now every one of us has taken a small chunk and we are writing it in better and more explained words.
D
Yeah, I think that's going to be great. This has been in the works for quite some while, then we kind of lost track of it, and now we are back on it. So I think that's great, and ideally we can also share it with a bigger audience later this year, but it looks very, very promising already. Thanks for taking on this work, and if you want to discuss specific topics, also feel free to use the mailing list.

Next up: CNF conformance and a new CNCF working group. Who added this one? "That was me." Taylor, okay. So I'm just looking at also having the demo, and I assume you're here to really give the team here a small update. I was wondering whether we should do this before the demo. How much time would you need for that topic, roughly?
D
Yeah, then I would propose you go first and then we dive into the demo, if the demoing team is fine, because that gives you more or less time for the demo until the end of the meeting, and makes sure that Taylor gets some time to present on this one.
A
All
right,
I
can,
I
guess,
share
my
screen
if
that's
something
that
y'all
like
to
should.
A
All right, so I don't know who's aware of the CNF Conformance in general. It's been going for about a year as far as the program start, but it's been focused on a test suite project up until recently. What's just recently started is a new working group.

That's the new thing here, but to give you a little bit of background, and by the way this may be a repeat for some people if you were on the TOC meeting yesterday, because this was shown there, so I'm not going to go through all of it; you can go review that. These slides are from the TOC meeting.

But the main thing here is that the CNF Conformance is looking at how to help telcos, the telco industry, be more cloud native with their applications. We're looking at the service providers themselves, so AT&T, Vodafone, Orange, whoever, and the creators of the telco applications, so vendors and everyone else, and at how to help them move. I'm just going to kind of go through very quickly.
A
So this is just an overview of what's been happening. The telcos are more and more interested; you're starting to see more adoption of Kubernetes, and on the platform side it seems like everyone wants to do Certified Kubernetes. But one of the big things, and this is from recent surveys, is moving the actual networking and telco-type workloads, which they usually refer to as network functions, onto Kubernetes. There are issues with the philosophy and viewpoint on how a lot of those have been built that are quite a bit different from normal applications. The conformance goals as a program are similar to Kubernetes conformance for interoperability at a platform level.

But the new thing here is this working group going around defining a certification and definitions. You can see an open pull request if you go to the CNF Conformance repo, where the charter and all the other things for this new working group, and what's in scope, are being added.
A
I don't want to take up a lot of time in this call to go over it, but the main point here is that there is going to be overlap with many communities, CNCF and Kubernetes, and one of the main ones that we've talked about is SIG App Delivery. Another one I would just point out would be SIG Network, because there's an overlap with both of those.

When you look at telco and the type of applications that are run, you have a lot that deal with things like user data planes, so not using the standard networking. What do you do for those types of applications? And what do you do for other applications that may go beyond what the standard, common applications are? So the focus of this group right now is to work with the telco and cloud native community to try to look at what the scope is.
A
That
platform
is
likely
to
be
part
of
it
in
some
ways
when
you
talk
about
stuff
like
a
user
data
plane,
and
you
look
at
other
mechanisms
within
the
the
platform
or
you
could
say
layers
and
add-ons
of
kubernetes,
and
that's
where
it
ties
in
with
the
network,
plumbing
group
and
sig
networking
from
kubernetes
and
as
well
as
the
cncf
sig
networking,
which
looks
at
service
meshes
and
stuff.
A
So
this
is
more
of
this
working
group
is
forming
right.
Now,
we're
going
to
be
having
community
meetings
and
to
talk
more
at
kubecon
that
we'll
start
having
regular
working
group
meetings
and
we
would
we
expect
to
protect
collaborate
more
directly
with
sig
app
delivery
than
most
most
of
them,
and
probably
sig
networking
as
well.
A
I'm
sure
there's
going
to
be
a
lot
of
overlapping
different
ones,
but
if,
if
we'd
love
to
have
your
participation
in
that
group
to
help
telcos,
I
think
that's
probably
it
that
the
test
suite
itself,
maybe
just
throw
out,
is
this.
This
is
going
to
be
put
forward
as
a
a
a
sandbox
project
in
the
next
three
to
six
months.
A
On the test suite, maybe a quick overview: you could think of it as a combination of what the e2e tests are in Kubernetes. When you think Kubernetes conformance, you have the e2e tests, which are managed by SIG Testing, and then you have Sonobuoy, which is what people usually use to make it easier. You don't have to use Sonobuoy, but many people do. So what's currently called the CNF Conformance Test Suite, which may be renamed, is kind of that.
D
Thanks for sharing. I definitely think that there are points of overlap here going forward, so I think recognizing that makes sense. Interestingly, the first time we had a touch point regarding telcos was really on the air-gapped work that we kind of... well, we didn't really stop it, but it didn't get as much momentum as we thought it would.
A
Absolutely, and I'd say it's more of a complement than anything else. Potentially it's going to be specific to the application side, versus when we get down lower, which may be more like the platform or SIG Network. I would say it's a specific case, and if it wasn't for the platform items, where it starts getting a little bit gray, this probably would be a working group under SIG App Delivery. There's a little bit of discussion on that:

does it fit more under SIG Network or SIG App Delivery? And I see Matt Farina; I think you said in the TOC yesterday that it kind of bridges both, and I agree with that. But yes, we're up for a quarterly update, we'll continue to collaborate, and we think of SIG App Delivery kind of as a base for a lot of the cloud native application pieces.
D
Okay, then the next item on the agenda. I see you wanted 35 minutes; we won't make it for the full 35 minutes, so you might have to speed up your demo, but it's great to see updates from two projects, and especially two CNCF projects, working together on the chaos engineering and delivery integration between Keptn and Litmus. So I would ask you to try to stay in time. Can you do it in 24 minutes? Because that will keep us on schedule.
F
Yeah, I think that should be possible for us. I'm going ahead and just sharing my screen.

Yeah, thanks. As you already mentioned, this was really a joint effort between the Litmus and the Keptn teams, and together with Karthik from the Litmus team we are going to present our joint effort here: how to integrate these two projects, and actually why we're doing this and which problems we want to solve. We have a little bit of a presentation prepared and then the demo. Karthik, maybe you just let me know when I should move on with the slides, and I will do that. I will let you start and then I will hand over.
B
Thanks. Hi everyone, great to see Taylor, Watson and Luna here; we've met recently and should get around soon to contributing to the CNF Conformance this week. Okay, so we have some items on the agenda: a couple of slides explaining what we're trying to do, and then we'll go to the actual demo. We'll try to keep it short on the slides and move to the demo quicker.

We could probably go to the next slide here. Again, thanks. Okay, so we are all aware that resiliency of microservices is hard, and whenever we are deploying applications on Kubernetes, we know that there are a lot of services running there which we have not picked ourselves, and 90 percent of our resilience depends upon all the infrastructure components that we are running on top of. So it's really difficult to predict what kind of outage can happen, and it's very important to have a system where we go ahead and inject these failures ourselves.
B
So with Kubernetes, all aspects of managing applications, resources and policies are done in a declarative way via YAMLs, and we wanted to adopt the same UX when people are attempting resiliency tests or chaos on their clusters. That's where the custom resources came in: we were able to provide the same user experience, describe the chaos intent in custom resources, and have an operator understand that and implement the right functions.
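For illustration, a chaos intent expressed as a custom resource might look roughly like this (a minimal sketch only; the names, labels, namespaces and service account are hypothetical, and the field layout follows the Litmus 1.x ChaosEngine CRD as best it can be reconstructed here):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: helloservice-chaos          # hypothetical engine name
  namespace: litmus
spec:
  appinfo:
    appns: default                  # namespace of the application under test
    applabel: "app=helloservice"    # hypothetical label selector
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete              # the experiment used later in the demo
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
```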
B
We have a hypothesis describing some steady state for the application or infrastructure, which is validated first, followed by a fault injection, and then there is a steady-state verification that's done at the end of it, or even during the fault as the chaos occurs, in parallel. If our hypothesis is right and the steady-state conditions are met, or they are regained within the toleration seconds that we have specified, we are going to qualify that infrastructure or application as being resilient.
B
So
when
the
chaos
engine
is
created
because
operator
launches
a
set
of
chaos,
parts
which
actually
implement
the
experiment,
the
clears
injection
logic
is
implemented
by
them
and
the
results
are
captured
in
another
cr.
So
they
are
different
crs,
as
each
of
these
has
a
scope
for
a
lot
of
information
to
be
placed
inside
it
and
acted
upon.
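A hedged sketch of what such a result CR could look like once an experiment has finished (names mirror the hypothetical engine above; the status fields follow the Litmus ChaosResult shape as recalled here, not the talk itself):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: helloservice-chaos-pod-delete   # typically <engine-name>-<experiment-name>
  namespace: litmus
status:
  experimentStatus:
    phase: Completed
    verdict: Pass                        # Pass or Fail; what a pipeline stage can key off
    probeSuccessPercentage: "100"
```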
B
So this is a declarative way of doing chaos. That has been the approach for Litmus, it's something that has been adopted and taken up quite well, and it lends itself to a lot of paradigms in the cloud native space, like GitOps, which we will talk about in a short while when Keptn is being discussed.

Initially, chaos engineering was viewed very much as an ops thing, very much in the domain of the SRE, but there has been a shift left, and people are using it as part of the release, that is, as part of the delivery pipelines.
B
So chaos experiments with frameworks like Litmus lend themselves to being used in this model, so you could actually run a chaos experiment as part of a CI pipeline with the different popular CI frameworks, GitLab, GitHub Actions, etcetera, where you can store the experiment configurations beforehand, run them as part of the pipeline, and use the result to determine the success of the CI stage or CI job, and which application artifacts pass.
B
Artifacts that pass this stage can be placed into other clusters which are doing some kind of scheduled chaos, typically for longer durations, and then we can interface with CD mechanisms, with CD solutions which can actually take this and put it into a staging namespace, run their own set of validations, and promote it to other stages depending upon the success.
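As a hedged sketch of what such a CI gate could look like (a hypothetical GitHub Actions workflow; file and resource names are illustrative, it assumes kubectl access to the test cluster, and it keys off the ChaosResult verdict described above):

```yaml
# Hypothetical workflow: run the pod-delete experiment and gate the stage on its verdict
name: chaos-gate
on: [push]
jobs:
  chaos-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Inject chaos
        run: kubectl apply -f chaos/helloservice-chaosengine.yaml
      - name: Wait for the experiment to complete
        run: |
          # poll the ChaosResult until the experiment reports completion
          for i in $(seq 1 30); do
            phase=$(kubectl get chaosresult helloservice-chaos-pod-delete -n litmus \
              -o jsonpath='{.status.experimentStatus.phase}' 2>/dev/null || true)
            [ "$phase" = "Completed" ] && break
            sleep 10
          done
      - name: Fail the stage unless the verdict is Pass
        run: |
          verdict=$(kubectl get chaosresult helloservice-chaos-pod-delete -n litmus \
            -o jsonpath='{.status.experimentStatus.verdict}')
          [ "$verdict" = "Pass" ]
```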
B
So this is something that we are seeing; in fact, this is one of the dominant use cases for Litmus in recent times that we are seeing in the community, and it has sparked some very interesting integrations, which we will talk about. That's where I would like to hand it off to you again.
F
Thank you. Oh, there was a little animation, but here we go.
F
Yeah, thanks. So before we jump into the demo and show you how Keptn and Litmus Chaos work together, I just want to give you a brief heads-up on what Keptn is. Maybe you have already heard about it. It's also a CNCF sandbox project, and it's really a control plane for cloud native delivery and operations; in this sense it can orchestrate the whole application life cycle.

For today we will really focus on SLO-driven delivery. That means that Keptn has, as an essential piece, quality gates based on SRE principles like SLIs and SLOs. It has this as a central piece already baked into the Keptn platform. So we have this part, and we can trigger different tools, different integrations like Litmus, and then, after they do their job,

Keptn can take its next actions, for example evaluating how these tools have been executed and how the quality of a microservice that should be delivered to production is actually affected. Same as in Litmus, everything in Keptn is declarative, and Keptn itself comes with its own GitOps approach, so everything that's stored inside Keptn and managed by Keptn is also versioned and stored in that GitOps approach, and that is true also for everything that we are doing now with the Litmus experiments.
F
So
everything
really
works
or
is
moved
into
the
github
or
git
repository
and
can
be
linked
to
github,
gitlab
or
whatever.
So
how
captain
works
in
the
sense?
How
can
you
really
connect
other
tools?
The
two
main
definition
files
for
captain
are
a
shipyard
file
that
describes
your
environment
and
a
uniform
file
that
this
or
a
the
concept
of
a
uniform
that
describes
which
tools
you
want
to
connect
to
captain
and
captain
is
this
control
plane
then
then
connects
all
the
different
tools
together.
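For illustration, a single-stage shipyard for the chaos environment used later in the demo might look roughly like this (syntax sketched from the Keptn 0.7-era format; the stage and strategy names are assumptions, not taken from the talk):

```yaml
# Hypothetical shipyard.yaml describing one "chaos" stage
stages:
  - name: "chaos"
    deployment_strategy: "direct"     # deploy straight into the stage via Helm
    test_strategy: "performance"      # run the connected test integrations after deployment
```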
F
So, for example, Keptn starts by deploying a new version of an artifact, your container image. After the deployment, Keptn will automatically trigger the testing tool; you define which testing tool you want to use, and in the demo we will be using two testing tools at the same time that are both connected to Keptn. After the tests are finished, Keptn will execute the evaluation of the tests, but other tools can also be linked and connected to Keptn, like a chaos tool.

You can control Keptn by a chat bot, or you can have everything that's going on inside Keptn sent as a notification to Slack, and of course you can link tools for observability, like Prometheus and dashboarding with Grafana. You can link these to Keptn, and when you're using Keptn, it can distribute the events from the control plane to these tools, and they can then take action. So with this a lot of automation is possible; a lot of automation is baked into Keptn for this use case.
F
Today
we
will
only
focus
on
the
deployment
testing
and
evaluating.
We
won't
take
a
look
at
the
automatic
configuration
of
dashboards
or
promotion
to
production.
We
will
only
take
a
look
at
one
one
piece
of
captain,
let's
say
so.
What
we
have
prepared
for
today
is
really:
how
can
we
include
chaos?
Engineering
into
a
pipeline-
and
we
start
with
deploying
the
potato
head
application-
I
I
just
saw
it's
also
an
item
or
for
today's
agenda
of
this
seek
meeting.
So
we
will
hear
maybe
more
about
this.
F
It's
a
small
demo
application
that
will
that
we
are
going
to
use
for
for
this
demo.
We
are
going
to
deploy
this
into
a
chaos
stage,
so
it's
a
single
stage
environment
in
this
case,
because
we're
just
interested
in
evaluating
the
resiliency
of
our
microservice
we're
using
help
for
deployment
and
captain
is
executing
the
deployment
with
kevin
with
helm
for
us
after
the
deployment
is
finished,
captain
will
trigger
chain
meter
tests,
but
not
only
meter
tests
but
actually,
at
the
same
time,
also
trigger
litmus
chaos.
F
The Keptn quality gate is then triggered, and this will automatically reach out to Prometheus and gather the data that is defined for this quality gate. It will gather this data for the exact testing time frame and for the service under test, in this case our podtato-head application, and then Keptn will go ahead and evaluate it based on the quality criteria of the quality gate, which we have defined in the SLO YAML file. We will see this also in the demo.
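A hedged sketch of what such an SLO file could look like for this chaos quality gate (the SLI names are assumptions that mirror the blackbox-exporter metrics discussed below; the syntax is sketched from the Keptn quality-gate format of that era):

```yaml
# Hypothetical slo.yaml for the chaos stage
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  aggregate_function: "avg"
objectives:
  - sli: probe_success_percentage      # availability while chaos is injected
    pass:
      - criteria:
          - ">=100"
    weight: 1
  - sli: probe_duration_ms             # response time of the health probe
    pass:
      - criteria:
          - "<100"
    weight: 1
total_score:
  pass: "100%"
  warning: "75%"
```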
F
Capture
will
go
ahead
and
evaluate
this
quality
gate
and
we'll
come
up
with
a
total
score
and
the
score
can
be
either.
Let's
say
a
thumbs
up.
That
means
that
the
resiliency
is
satisfying
based
on
our
service
level
objectives
or
if
we
cannot
reach
the
score.
Our
as
resiliency
is
not
satisfying
and
we
can
just
rerun
the
whole
workflow
again,
for
example.
Usually
what
captain
can
also
do
is
we
can
just
promote
it
to
the
next
stage
like
from
pre-production
to
production?
B
Cool. So the demo environment is quite simple: we have a GKE cluster on which we have the podtato-head app with the hello service running. It's basically giving you a hello page at its service endpoint, and it's configured with a readiness probe. In this demo we will basically look at the difference between a single-replica and a multi-replica deployment and see how the quality gate actually fails in the first case and goes through in the second case. So it's highlighting a deployment issue.
B
So
there's
going
to
be
a
deployment,
that's
going
to
be
managed
by
captain
and
once
the
deployment
of
the
hello
service
is
complete,
the
litmus
experiments
are
going
to
be
triggered
on
that
and
in
that
process
the
we
also
have
a
black
box
exporter
and
amity's
instance
running
managed
by
captain,
and
we
have
also
a
black
box
exporter.
B
We
can
see
that
when
we
do
the
demo,
what
are
the
rules
that
we
have
set
as
part
of
this
evaluation,
and
if,
like
you
can
said,
if
we
are
successful
in
meeting
the
criteria
that
we've
set
against
the
matrix,
then
we
go
ahead
and
pass
the
stage.
Otherwise
we
are
going
to
say
the
quantity
it
is
failed.
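As a hedged sketch of how those rules tie the SLOs above to the blackbox exporter, an SLI file mapping each SLI name to a Prometheus query might look roughly like this (the queries, labels and placeholder variables are illustrative only, not taken from the talk):

```yaml
# Hypothetical sli.yaml for the Prometheus integration
spec_version: "1.0"
indicators:
  probe_success_percentage: "avg_over_time(probe_success{job='blackbox',instance='helloservice'}[$DURATION_SECONDS]) * 100"
  probe_duration_ms: "avg_over_time(probe_duration_seconds{job='blackbox',instance='helloservice'}[$DURATION_SECONDS]) * 1000"
```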
F
Sure, okay. So what I have here is the user interface of Keptn, what we call the Keptn Bridge; I'll just open this. It's actually from earlier today when we did our experiments, but what we are going to do is send a new deployment event to Keptn, and I will do this with the Keptn CLI, but with the option that I just send the cloud event to Keptn. The reason I do this via the cloud event and not via the built-in capabilities of the Keptn CLI is that we can take a little look at the details of the cloud event.

So I will go ahead and deploy my podtato-head application, it's the hello-server microservice, and I will just do this with one replica. I will instruct Keptn here to start the deployment, and what I've already told you on the slides will happen: Keptn will start to actually deploy the service, and we will see this in the Keptn Bridge.
F
The
captain
will
deploy
the
service
and
what
we
can
already
see
here
that
the
chaos
runner
and
the
pot
delete
will
start
it
so
after
deployment.
It
was
really
fast
because
actually
it
was
already
running
in
this
exact
version
on
my
cluster,
so
the
deployment
was
finished
and
captain
is
triggering
two
different
kind
of
tests.
The
first
one
is
the
chain
meter
tests.
They
are
running
in
a
different
name
space,
so
we
are
not
seeing
the
part
here.
F
We
are
only
seeing
the
part
in
the
litmus
chaos
namespace,
but
what
we
can
see
is
that
captain
also
triggered
the
chaos
tests.
We
can
already
see
it's
killing
a
part
here,
so
our
chaos
tests
are
the
pod
delete
chaos,
experiment.
That
means
a
part
that,
from
our
halo
service
replica
set,
a
random
pod
will
be
killed
by
the
litmus
chaos
experiment
and
it
takes
a
couple
of
seconds
for
the
next
part
to
come
up.
F
We
have
added
a
readiness
probe
with
a
little
bit
of
delay,
so
we
can
make
sure
that
once
the
pot
will
receive
some
traffic,
it's
already
up
and
running
and
now
after
a
couple
of
seconds,
is
up
and
running.
So
let
us
now
take
a
look
on
the
captain
user
interface.
What
we
can
see
here
so
we
just
triggered
the
chaos
and
we
can
see
the
configuration
change
was
received
and
the
deployment
is
finished
after
the
deployment
is
finished.
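The readiness probe with a small delay mentioned above could look roughly like this excerpt of the hello-service deployment (a sketch; the image name, port and timings are illustrative, not from the demo):

```yaml
# Hypothetical deployment excerpt for the hello service
spec:
  replicas: 1                  # increased to 3 in the second run of the demo
  template:
    spec:
      containers:
        - name: helloservice
          image: helloservice:v1          # illustrative image reference
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 10       # the small delay before the pod is marked ready
            periodSeconds: 5
```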
F
Keptn will start the tests, and we have already seen the test execution of the... sorry, of the Litmus chaos experiment, but we will actually wait here for the full execution of the JMeter tests. They usually take about two minutes to finish, so we should receive the event in just a second.

So we can already see the tests are finished. Our JMeter test with a couple of thousand requests has been executed against the service, triggered by Keptn. Keptn is now doing the SLI retrieval, that's some Keptn internals, let's say, to reach out to Prometheus and query all the data, and Keptn is doing an evaluation. I'll make this a little bit bigger so we can see the evaluation is done. We can already see the total score here: it's red, and red of course means it's not good. We got a result of zero.
F
The reason for this is that there was a time when the pod was actually not available for a couple of seconds: since the chaos experiment killed the pod, there was no other pod able to receive the traffic, and it took a couple of seconds for the pod to come up.

So with this experiment we could already evaluate the resiliency of our application, and what we can now do is play a little bit with our replica set or with other settings in our experiment. We will just increase the replica count, let's say to three. I just save this file and send it again, so in this case I'm sending the same instructions: I want to deploy the same version, but I want to have a replica count of three this time. So, taking a look, we can already see:
F
Keptn was triggering Helm to deploy two other replicas, two other instances of this pod. Once they are finished, Keptn will go ahead and trigger the tests again. It's always the same workflow or pipeline, whatever you want to call it; it's always the same: first the deployment, then the tests, the evaluation, and then, based on the result of the evaluation, the promotion or rollback of the artifact, for example.

So Keptn is starting the chaos runner, the chaos runner is then starting the pod-delete pod, and this one will kill a random instance of those three instances of our hello service. I'm not sure which one will be killed, but one of them will be removed and another one of course will come up in just a second. Here we go, a new one is created, because this one was killed by the experiment.
F
Everything is good. It will take about 30 seconds for it to come up and be ready, but in this case we would assume that the two other instances are ready and are there to receive all the traffic that should also have gone to the killed instance. So we would assume that our quality evaluation regarding our resiliency would be quite good this time, because we have a higher replica count, so we have some backup pods that will receive the traffic if one of them goes down.

So let us take a look at the Keptn Bridge. Again, it's a new instance of our configuration change; it's the same version, but this time with a higher replica count. The deployment is already finished, so right now the tests are being executed in the background, and we are not only waiting for the Litmus experiment to finish, but also for the JMeter tests to finish.
F
We can see the JMeter service has finished and we can see the evaluation, and this time we got a score of 100, since both of our SLOs are evaluated as satisfying. So our success percentage is 100: there was no downtime this time, although one of the pods got killed, and also the probe duration was quite fast, faster than the 100 milliseconds that we had set as a limit.

So what we can see here is that basically we included chaos testing in our pipeline, in our CD, next to performance tests, and we called it the chaos stage. That's the reason why you see "chaos" here; it's indicated right here because it was failing. But with this you can easily integrate chaos tests into your CD and make sure to also evaluate the resiliency of your microservices.
F
That's
the
main
idea
that
was,
it
was
a
short
and
easy
demo,
but
if
there
are
any
questions,
we're
happy
to
answer
the
questions.
Otherwise,
we
just
have
two
more
slides
as
an
outlook.
What
will
come
next
with
this
integration?
But
we
are
happy
to
answer
questions
here
as
a
we're
just
inside
of
the
demo.
F
Sorry, yeah, yeah. I just want to share this, but please, please go ahead.
C
After this slide, I wanted to ask if I can take two to three more minutes to give quick updates and run through three to four slides.
F
Okay, okay. So what will be next for the Litmus and Keptn integration? We will also take a look at the Litmus chaos results. Right now we're using Prometheus as the data provider for this evaluation, but Litmus is also exposing a lot of data for the experiments, and we're also taking a look at the Litmus data. We also want to improve this work by not only including it in the CD part, but actually testing self-healing methods and testing auto-remediation, by having Litmus experiments as the tests, and then auto-remediation scripts and auto-remediation instructions orchestrated by Keptn as the solution for the problems or chaos introduced by Litmus.

And if you want to learn more about this in a little bit more detail, we have a joint webinar next week on November 11th. We will share the slides, and if you want to give it a try yourself, you will find all the resources here. Thanks so much, and please, please share your thoughts.
C
Yeah, sure. No, that was a great integration, and I wanted to thank both project teams; they were running working sessions for three to four weeks, so that was great. Can I share my screen as well again? Yeah? Okay.

This is just an extension of the presentation. We were chatting with Harry about what the next steps for the project are, and he suggested that before going to the TOC, it would be good if we could present it to the SIG chairs and the App Delivery SIG. Primarily, we have been seeing great momentum in the last four to five months, just before and right through after the project became a sandbox project.
C
And
the
maintainers
feel
that
you
know
we
already
put
incubation,
so
we
wanted
to
state
that
intent
and
provide
a
quick
snapshot
of
why
we
think
so.
So
the
real
reason
where
why
we
feel
that
we
are
good
for
applying
for
or
just
showing,
the
intent
to
move
to
incubation
is
adoption
and
which
indicates
the
project
maturity.
C
We
are
1.10
release
and
also
there
are
some
huge
deployments
in
production
for
some
time
now
and
they're
being
become
the
contributors
as
well.
So
we
feel
that
we
will
definitely
pass
their
due
diligence
very
easily
and
we
also
are
happy
to
state
that
this
project
now
has
been.
You
know
managed
not
just
by
my
data
but
great
contributions
coming
from
a
lot
of
others,
but
we
have
foreign
maintainers.
Also
in
terms
of
indeed
and
amazon.
C
You
know
red
hat
and
autopilot
container
solutions
have
been
contributing
recently
to
the
project
and
we
have
established
the
open
conference
model
where
we
have
a
lot
of
six
inside
litmus
and
there
are
about
16
meetings
that
have
happened
and
telecom
or
some
of
some
of
the
contributors
or
sig
chairs
coaches
in
inside
litmus,
and
we
actually
integrated
with
the
captain.
Octato
spinnaker
argo,
that's
one
of
the
big
ones,
all
right.
So
for
all
these
reasons
I
know.
C
Team
maintenance
team
things
that
we
will.
We
need
to
move
to
the
next
step
as
a
quick
snapshot.
These
are
our
users.
Some
of
them
are
vendors,
vendor
users
and
some
of
them
are
they
qualify
the
end
user
category
and
definitely
intuit
is
one
of
litmus
biggest
users
and
our
cncf
end
user
and
then
iag
autopilot
and
fi
networks
they've
been
using
litmus
in
production.
C
So in the last two months, since we presented the project here to the SIG chairs, we have added six experiments and we ran about eight community meetups, and I'm very happy to say that there are 39 contributors and 16 members, and our Slack has been busy. You can see that in the last two months we again grew by about 40 percent in terms of experiment runs. We do a release every month, so the community can expect that.

Definitely there is a release coming out on the 15th. In the last six months a lot of development happened and we released the Litmus Portal, which is a full UI-based, GitOps-friendly platform for running chaos engineering on Kubernetes. As part of that effort, we integrated with Argo Workflows.
C
Now you can run super-complex chaos workflows that are closer to your real-life scenarios, in parallel or in sequence, so it is possible to take chaos engineering to a production level. As part of that, we also introduced probes, where we don't dictate what the end result or verdict should be; you can define what you think should be a pass or fail for a given test in the hypothesis. And as part of the requirements for incubation, we have updated all our CI/CD inside the Litmus project's own development; it's all open.
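A hedged sketch of how such a probe can be attached to an experiment, so the hypothesis is declared alongside the fault (the names, URL and timings are illustrative, and the field names follow the Litmus probe schema as recalled here, not the talk):

```yaml
# Hypothetical probe inside a ChaosEngine experiment spec
experiments:
  - name: pod-delete
    spec:
      probe:
        - name: hello-service-available
          type: httpProbe
          mode: Continuous                 # evaluate throughout the chaos injection
          httpProbe/inputs:
            url: http://helloservice.default.svc.cluster.local:8080
            method:
              get:
                criteria: ==
                responseCode: "200"
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 1
```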
C
So we have a full-fledged setup for running that CI/CD, and monitoring is also a very important aspect. We started on the architecture and completed it for showing chaos-interleaved dashboards: for example, you might have a Sock Shop application dashboard or any other dashboard in Grafana, and now we have chaos interleaving as well, so when you run some chaos test there's a red mark that comes right on top of it.

Also, chaos can now be limited to a namespace scope and you can have multiple operators within the cluster, so developers can run it as if it's their own small application. So that's a very quick update and, as you can see, this is from DevStats; it shows contributions from MayaData, Intuit, HSBC, Microsoft, Autopilot and others.
C
These are the top 10 contributors since sandbox, so I'm very, very happy and thrilled to say that this project has received great adoption, and users are actually contributing back; it's working the way it's supposed to for a CNCF project. This is another chart that shows how the forks are increasing, which shows that people are taking it, changing it and upstreaming it back.

This is just a quick update on who's actually contributing. Also, one of the primary contributors, a team from Israel, is adding infrastructure changes to Litmus so that you can run chaos from Kubernetes on a resource that's outside Kubernetes, for example a VM that is orchestrated by oVirt, and Container Solutions, which runs Litmus on some of the huge production deployments.
C
They've been contributing back with network chaos tests and OpenShift support, and there are others: a Microsoft team from Japan has completely tested Litmus on AKS, certifying all 33 experiments at that time, and the Autopilot team has been using Litmus but is also contributing back on the Helm charts. So these are some of the... in the interest of time, I just wanted to skip this. I've shared the slides, and thank you guys for giving a few more minutes in the agenda, but this is really about

how the Litmus project is running things: an open governance model, meetings, integrating with the ecosystem projects, and actually gaining adoption in production, all that stuff. So with that I really want to stop this presentation, and I wanted to say thanks.
D
Yeah, yeah, thanks for the heads-up, and let's work on this. It's great to see the project moving forward, and also great to see you collaborating closely with other CNCF projects as well. I think with this we conclude for today. I wish everybody a nice rest of the day or a nice evening, depending on your time zone, and we'll talk again in four weeks, not in two weeks: in two weeks we have KubeCon and we decided not to have a meeting during KubeCon. Thanks.
C
Thank you. All right, great job again. See you at KubeCon, bye, thanks.