From YouTube: Cloud-Native Chaos Engineering WG Weekly Sync Up - October 29 2021 | CNCF TAG App Delivery
Description
Check out the recording from the Cloud Native Chaos Engineering Working Group Weekly Sync Up from October 29th, 2021.
For more information check out the WG Charter: https://docs.google.com/document/d/1scr9uuvG1g1xpIHPs3314FqeFufE31ustTVnRMrX3gI/edit?usp=sharing
B: Hi, good morning. Thanks for joining today's Chaos Engineering Working Group call. We'll just wait a few minutes to see if other members join in; we'll probably give it five minutes and then we can get started.
B: I know we've chatted earlier, and you also expressed interest in being part of this work group, sharing your views, and contributing to the chaos engineering community in general. So thank you for agreeing to participate and for taking an interest. I've shared the meeting notes in the Zoom chat. We also have a Git repository that we just created for artifacts related to this work group; it still has only some initial content, so please feel free to take a look at it, add yourself as an interested party in the charter, etc.
B: We do most of these user discussions in a question-and-answer style, so if there's anything you would like to demonstrate or talk about, you're very welcome to do that. If you'd rather do that in a subsequent session, where you've had more time to prepare, showcase, and demo things, that is completely understandable as well. We could treat this as the context-setting discussion, and you could do that in a later one.
B: Let me give you a little bit of background as we wait for the other members to join in. This chaos engineering work group started quite recently, a few months back. It was an initiative from CNCF to foster interactions between the end users and the chaos projects in the CNCF landscape.
B: The expectation is for end users to benefit from the content we deliver. The white paper gets them started on their chaos engineering journey, helps them understand chaos engineering better, and covers the paradigm shift taking place in the cloud native world in terms of how people practice chaos engineering in a cloud-native setting.
B: So that's what this work group is focused on; I just wanted to share a little background. I see Sergio has joined the call. Hi Sergio, would you like to introduce yourself?
C: I was a software engineer at Arena World, I contributed to the Brave browser, and I'm planning to be an ambassador for the Academy Software Foundation.
B: Awesome, great to have you here. Please feel free to add yourself as an attendee in the meeting notes; I've just shared them in the Zoom chat.
B: We have had other end users appear in this working group before. Michael came in last; he spoke about the chaos philosophy and practice at IAG and TPG Telecom, where he has served as a DevOps consultant. We hope to invite other interested end users to come and share their learnings with us, and we'd like to take that back into the white paper.
B: So with that context, I think we can get started; about six minutes have passed. Mani, maybe you could begin with an introduction of yourself. Please feel free to talk about your background, where you're working, what you do, and how you basically got started on your chaos journey.
A: Yeah, sure, thanks for having me. I'm working as an SRE engineering manager at Halodoc. I have 17 years of overall experience in the IT industry, and I've played different roles at different points in time. At Halodoc, my primary responsibility is to take care of infrastructure provisioning, bring in all the best practices required at Halodoc, and identify and analyze the tool and tech stacks.
A: Those are the primary responsibilities. Beyond that, as an SRE team we have a set of deliverables: we support all our tech verticals, which come with sets of requirements, helping them provision infrastructure or configurations at different levels. Apart from that, we analyze different sets of tools quarter on quarter and see which is the right fit for Halodoc.
A: During this journey we started evaluating chaos tools. Basically, we considered a set of tools and evaluated them on certain parameters.
A: What we did: we ran an analysis in, I think, the month of January, looking at how we could implement chaos engineering at Halodoc. We identified around four tools: one is Chaos Toolkit, the second is Pumba, Litmus is the third, and the last one is Chaos Mesh. The key parameters we looked at while exploring these tools we categorized into six areas.
A: One is how easy installation and management are. The second is the set of experiments and definitions each tool offers. The third is security, and the fourth is observability. The fifth, one of the primary points on which we wanted to make the decision, is whether the tool is Kubernetes-native; we basically wanted any tool we picked to be a Kubernetes-native tool. The last one is the running behavior.
A: Those are the six key categories we wanted to analyze before taking a decision. We felt all six parameters were met very well by Litmus, so based on that we decided to go with Litmus. A few factors drove this decision. First, it's a Kubernetes-native tool; that's the primary point.
A: Second, it's very easy to install, lightweight, and stateless, and it has a variety of experiments covering almost every area we wanted to cover.
A: The third thing is the security aspects: it has experiment-specific service accounts, roles, and role bindings. Since all these experiments talk to Kubernetes, we can control them at the namespace level or at the cluster level, so that was an additional benefit we got. And by default it gives a very detailed report of how each of these experiments ran, in a very simple format.
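To make the namespace scoping Mani describes concrete, here is a minimal sketch using the official Kubernetes Python client that confines a chaos service account to one namespace. The namespace, account name, and the exact resource/verb list are illustrative assumptions, not Halodoc's actual setup; Litmus ships its own recommended RBAC manifests per experiment.

```python
# Sketch: scope a chaos service account to a single namespace so experiments
# can only touch workloads there. All names here are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
rbac = client.RbacAuthorizationV1Api()
ns = "payments-staging"  # hypothetical namespace to confine chaos to

role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "chaos-runner", "namespace": ns},
    "rules": [{
        "apiGroups": ["", "apps", "litmuschaos.io"],
        "resources": ["pods", "events", "deployments",
                      "chaosengines", "chaosexperiments", "chaosresults"],
        "verbs": ["get", "list", "create", "delete", "patch"],
    }],
}
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "chaos-runner", "namespace": ns},
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                "kind": "Role", "name": "chaos-runner"},
    "subjects": [{"kind": "ServiceAccount",
                  "name": "chaos-runner", "namespace": ns}],
}
rbac.create_namespaced_role(ns, role)
rbac.create_namespaced_role_binding(ns, binding)
```

Using a Role rather than a ClusterRole is what gives the namespace-level blast-radius control mentioned above; swapping in a ClusterRole/ClusterRoleBinding widens it to the cluster.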
A: Optionally, it can also be integrated with Prometheus, so we can see more and more metrics from the endpoints. The last point is the running behavior: since this comes as a CRD, we can go and define as many experiments as we want, as an entire workflow, drawing on the full list of built-in experiments. I think that's what made us decide to go with Litmus.
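As a rough illustration of the CRD-driven model Mani describes, the sketch below creates a Litmus ChaosEngine (the litmuschaos.io/v1alpha1 custom resource) for a pod-delete experiment via the Kubernetes Python client. The target labels, namespace, and tunables are hypothetical; treat this as a sketch of the mechanism, not Halodoc's configuration.

```python
# Sketch: declare a pod-delete chaos run as a Litmus ChaosEngine custom
# resource. Labels, namespace, and env tunables are illustrative.
from kubernetes import client, config

config.load_kube_config()
crds = client.CustomObjectsApi()

engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "checkout-pod-delete", "namespace": "staging"},
    "spec": {
        "appinfo": {"appns": "staging",
                    "applabel": "app=checkout",   # hypothetical target app
                    "appkind": "deployment"},
        "chaosServiceAccount": "chaos-runner",    # the scoped account above
        "engineState": "active",
        "experiments": [{
            "name": "pod-delete",
            "spec": {"components": {"env": [
                {"name": "TOTAL_CHAOS_DURATION", "value": "60"},
                {"name": "CHAOS_INTERVAL", "value": "10"},
            ]}},
        }],
    },
}

crds.create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="staging", plural="chaosengines", body=engine)
```

Because the whole run is just a Kubernetes object, it can be stored in Git and applied by any GitOps tool, which is the workflow described next.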
B: As an SRE engineering manager, as you just introduced yourself, I'm sure there were already some resiliency efforts in place, and you said you started looking at chaos tooling in January. So what was really the trigger point for you to consider doing chaos engineering? Was it an outage that happened in the organization, or was it a conscious decision that you took as a technical leadership group to adopt chaos engineering practices, because the rest of the world is doing it and it's a beneficial thing to do?
A: Yeah, that's an interesting question, what made us incorporate it. We often onboard new services and new utilities for end users, and when we get such a requirement we have to make sure the service is resilient and, when there is a failover, see how quickly it recovers. Those are some of the points we wanted to consider when onboarding new services.
A: Initially we came across a few challenges: we weren't checking the resiliency or reliability of the utilities we were onboarding, they went live directly, and we then saw some hassles. To overcome that, we decided we wanted all of this checked up front, meaning resiliency testing for every service:
A: what is the service availability, and if there is a failover, how quickly does it recover? Those are some of the pain points we wanted to capture at an earlier stage. That's what made us pick up a chaos engineering tool to create some chaos within our infrastructure and see how resilient we are. We started this basically in the stage environment.
A: Initially it was more about experiments there; we ran them through CRDs directly. At a later stage, when we started showing more and more interest in incorporating this tool, we moved to a GitOps approach where we maintain all the sources. Litmus helped us a lot in setting up the tool through Helm charts, so we keep everything in our source code management tool, and from there
A: we deploy and enable all the workflows we want. We started creating workflows in a portal for all the services, and we wanted this integrated with Litmus. That's where the Litmus team, the chaos engineering team, really helped us set up a kind of portal through which we created the whole set of workflows. We took a control-plane approach, where the control plane lives in one cluster, all the other clusters run the infra part, and we can target our experiments at them. That really motivated us to set this up in stage, and then we propagated it to production at a later stage.
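A sketch of the kind of automation implied here: watch for newly deployed services in the stage namespace and kick off the corresponding chaos run. This is not Halodoc's actual pipeline (they drive everything from Git); it just illustrates the "deploy event triggers chaos workflow" coupling, with illustrative names throughout.

```python
# Sketch: naive trigger that starts a chaos run whenever a new deployment
# appears in the stage namespace. Halodoc drives this from GitOps instead;
# this only illustrates the deploy-event -> chaos-workflow coupling.
from kubernetes import client, config, watch

config.load_kube_config()
apps = client.AppsV1Api()

def trigger_chaos(deploy_name: str, namespace: str) -> None:
    """Hypothetical wrapper: would create a ChaosEngine as in the earlier sketch."""
    print(f"would launch pod-delete against {namespace}/{deploy_name}")

for event in watch.Watch().stream(
        apps.list_namespaced_deployment, namespace="staging"):
    if event["type"] == "ADDED":  # a new service was just deployed
        trigger_chaos(event["object"].metadata.name, "staging")
```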
B: Great insight, Mani. This is consistent with what we heard from Michael in the previous calls as well: the chaos experiments were introduced in a non-production environment at first, and the reason chaos was considered in the first place was to test their systems for resilience, and to test the resiliency of new tools before onboarding them into the production environment. Then they gradually started introducing it at some level in production.
B: The original chaos engineering philosophy, the principles of chaos that you see on the website principlesofchaos.org, which was put together initially by the Netflix team, talks about the necessity to do chaos in production. For a long time people believed that chaos engineering is practiced only by SREs, and only in production, and that it was taken care of using game days and things like that.
B: But there has been a gradual shift, especially with the introduction of the cloud native landscape, where there are a lot of tools in your environment that you picked from different sources and probably had little control over developing yourself. They're all needed to provide a seamless experience to the user, and they're also needed by the SREs to maintain the sanity of your systems.
B: Was there a challenge, some kind of apprehension or hesitation from your SRE team, when you said you needed to run these chaos experiments in production, or was it something that was actually welcomed by them? Was there mutual agreement right from the beginning, with people invested in the process, or were there cultural challenges you had to face in convincing people that this needed to be done in production? And on a similar note, you mentioned your staging and pre-production environments.
B: I'd be interested in understanding who actually runs these chaos experiments. In production we can understand that it's the cluster admins, the SREs, and the service owners who do it, but in the staging and pre-production environments, is it again the same group of people, or is it somebody different, say a performance engineer, a QA engineer, or a developer running the experiments themselves? Has the philosophy and thought process of chaos propagated that far?
A: To give more info about that: we actually execute all the experiments automatically. As I mentioned earlier, it's a GitOps approach; we have created all the workflows the way we want them, and they are stored in the repositories. As and when any new service gets deployed in stage, it automatically triggers the workflow and starts executing. So no individual goes and manually triggers the workflows.
A: If there are failures, then yes, we call them out and analyze why the failure occurred, and accordingly the team fixes it and comes back. This is something we have done a couple of times in a quarter. Also, every month we have a day we call chaos day, where we run the experiments against our production environment and share reports with the leaders. That's something we do.
B: Awesome, yeah. By automating this you've taken the human hesitation out of it, one could say, so that is nice to hear.
B: The other question I had was about how you determine what scenarios you run. For example, you mentioned you have experiments scheduled on staging and on production, and experiments triggered upon deployment as part of the GitOps flow. So how does the team come up with scenarios? Is it based on past experience and your analysis of your code base and the application weaknesses you think might be manifested, or is it something you determined some other way?
A: We looked at what happens if certain services go down. As an example, take an OTP service: say a person is doing an enrollment and needs to receive an OTP; if this service goes down, they may not get the OTP to subscribe, so that is a showstopper. So we identified a list of services and a set of experiments we wanted to run against them, sequentially.
A: The sequence of these experiments is basically: first we wanted to see, if a pod gets terminated, how quickly the pod recovers; that is the first thing. Then we wanted to play around with the memory and CPU hogs, so that was the next set of experiments we arranged. The other area we see most of the time is network loss, and often we have DNS errors.
A: And when we do a health check on the pods, sometimes a pod doesn't come up: it keeps spinning but never gets terminated. These are some of the past challenges we have seen, so considering them, we wanted to keep the scenarios relevant.
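The escalating sequence Mani describes maps onto standard Litmus experiment names. A sketch follows; run_experiment is a hypothetical wrapper that would create a ChaosEngine like the earlier sketch, and the ordering and comments are paraphrased from the discussion above.

```python
# Sketch: the escalating experiment sequence described above, as an ordered
# list of Litmus experiment names. run_experiment is a hypothetical wrapper
# over the ChaosEngine creation from the earlier sketch; here it just logs.
SEQUENCE = [
    "pod-delete",        # does the pod recover quickly after termination?
    "pod-memory-hog",    # behavior under memory pressure
    "pod-cpu-hog",       # behavior under CPU pressure
    "pod-network-loss",  # injected packet loss between services
    "pod-dns-error",     # the DNS failures seen in past incidents
]

def run_experiment(name: str) -> None:
    print(f"running {name} and waiting for the chaos result...")

for name in SEQUENCE:
    run_experiment(name)
```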
B: You brought up some very valid points about checking the health of applications even as you inject the faults, so this brings up the question of how you track that.
A: What we have done is, while running these experiments, probe certain endpoints; that is supported within a chaos experiment anyway. For whatever experiments we design, we can probe different endpoints: it can be HTTP, or on the Kubernetes side as well, so a kubectl-style probe can be done.
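For reference, steady-state checks like these are declared as probes on the experiment spec in Litmus. Below is a sketch of an HTTP probe fragment; the field names follow the Litmus probe schema as commonly documented, but the endpoint, thresholds, and exact key spellings should be verified against your Litmus version.

```python
# Sketch: an httpProbe attached to the pod-delete experiment from the earlier
# ChaosEngine example. Endpoint and thresholds are illustrative.
http_probe = {
    "name": "checkout-health",   # hypothetical probe name
    "type": "httpProbe",
    "mode": "Continuous",        # keep checking while the chaos is running
    "httpProbe/inputs": {
        "url": "http://checkout.staging.svc:8080/healthz",
        "method": {"get": {"criteria": "==", "responseCode": "200"}},
    },
    "runProperties": {"probeTimeout": 5, "interval": 2, "retry": 1},
}

# Attach it to the experiment spec from the ChaosEngine sketch:
# engine["spec"]["experiments"][0]["spec"]["probe"] = [http_probe]
```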
A: So that is the list of probes we have enabled. And as I said, when there is a failure in these experiments we capture that too, and currently our team is working on integrating this with Prometheus: we want to enable all the services with service monitors and then scrape the metrics about the executions. Litmus by default already has a very good dashboard under analytics.
A: We can see the level of experiments, how many times each was run, when the last failure was, the success rate, and the resiliency score for each of these services. Multiple things come by default, but beyond that we are also working on integrating this with Prometheus.
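Once chaos metrics land in Prometheus, pulling a signal out is a single call against its HTTP API. A sketch, assuming a reachable Prometheus and a Litmus-chaos-exporter-style metric name; the metric name varies across exporter releases, so treat it as an assumption to verify.

```python
# Sketch: query Prometheus for chaos-run results via its HTTP query API.
# The metric name is assumed from the Litmus chaos-exporter conventions;
# check your exporter's /metrics endpoint for the exact names.
import requests

PROM = "http://prometheus.monitoring.svc:9090"  # hypothetical address

resp = requests.get(f"{PROM}/api/v1/query",
                    params={"query": "litmuschaos_passed_experiments"})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```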
B: The other question we had: a lot of the time, people doing chaos engineering in pre-production environments tend to use synthetic workload generators to simulate the traffic they see in production. Is that a practice you follow, and is that an area where you interact with the performance group? Do they have any inputs into your chaos engineering practice? Are they also stakeholders in it?
B: Do they generally participate in analyzing the results, root cause analysis, things like that? The reason for asking this question is that we're trying to establish, or understand, the interoperation and the communication channels that a chaos engineering team would have in an organization. So how frequent are your conversations with the performance teams, and do they participate in constructing scripts or help you with load generation for chaos?
A: What we often do is give them a perf environment set up for a specific set of services, which they want to generate load against and see how it performs. What we recommended to that perf team: within our organization we have software developers in test, who have the right set of test cases and run them against the environment based on the use cases, so we tag them in and discuss with them.
A: We discuss how the setup can be made on top of that, and we have also explained to them how the Litmus chaos experiments can contribute. In that respect, on the load side, to be honest, we haven't done much yet.
B: Great, thank you, that helps a lot. Just a couple more questions from my side, and then I'll probably open the floor for anything you would like to ask or share, Mani, and Sergio can put in his thoughts or questions. Another question I had was about the importance of multi-tenancy in chaos engineering frameworks, and whether that is a requirement you had. For example, when you're doing chaos experiments in a shared cluster
B: and you have different people doing their own experiments, sometimes it can mean stepping on each other's toes. You mentioned that you've automated it to a large extent, so there's the possibility that multiple apps get upgraded at the same time, triggering their own respective workflows in their own spaces. Is this something you've given thought to? Did you do some kind of intelligent scheduling, or is there an agreement between the teams that they're okay if experiments run, or do you basically just inform people?
B: That's one way to interpret multi-tenancy, for sure. The other way to interpret it: you have a large staging cluster, like you mentioned, where you're carrying out experiments, and staging environments are typically owned by different people; when I say owned, I mean used by different people. You could have namespace A being used by, let's say, one application services team and namespace B being used by someone else, and they're all trying to test their own code under review in the staging environment.
A: At the stage where we run the Litmus experiments, what we have done, through automation, is try to replicate what we have in production: the RDS data points remain the same, and the replica counts match how the pods are running in production. That is what we want to simulate exactly in our stage environment; then we run the experiments, and once that is complete,
A: teams can still go ahead and deploy the changes they want, but subject to the experiments we wanted to run. So we are exactly replicating what we have in production.
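A minimal sketch of the "mirror production into stage" step described here, assuming two kubeconfig contexts named prod and stage and a shared namespace name; all of these are illustrative, and a real setup would also copy resource requests, config, and data shapes.

```python
# Sketch: mirror production replica counts into staging before a chaos run.
# Context and namespace names are hypothetical.
from kubernetes import client, config

prod = client.AppsV1Api(config.new_client_from_config(context="prod"))
stage = client.AppsV1Api(config.new_client_from_config(context="stage"))

for dep in prod.list_namespaced_deployment("services").items:
    stage.patch_namespaced_deployment_scale(
        name=dep.metadata.name, namespace="services",
        body={"spec": {"replicas": dep.spec.replicas}})
```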
A: I think if it doesn't work in stage, it will not work in production.
B: Do you have a hybrid kind of setup, where some services run out of a private data center or on cloud instances that are not part of Kubernetes clusters? Do you extend chaos engineering to those systems as well, or is everything completely on Kubernetes?
A: Right, we have a plan to cover the non-Kubernetes-specific parts, but for now our major focus is only on the Kubernetes-specific side, and the nodes outside Kubernetes are more vanilla-flavored instances rather than the equivalent of these nodes.
B: As an SRE engineering manager, you might have some KPIs for yourself and your team around improving resiliency through chaos. So how do you evaluate the efficiency of your chaos engineering process? Is it the number of tickets that has gone down, the number of outages that has gone down, or increased uptime? Do you actively have some SLOs that
B: you look at to say you've performed better this month compared to last month? How do you generally evaluate the efficacy of your chaos practice, how is it communicated to the larger group in the organization, and how is it bubbled up as a report that everyone can consume and understand?
A: Yeah, basically we have error budgeting, and we have defined SLOs for all the services. With respect to our chaos executions, the experiment executions, we make sure availability is not going down in terms of the percentage, and we also have an alerting system in place. If we create chaos against any service and it goes down for some time, the error percentage gets incremented, which indirectly impacts our reported percentage as well.
A: Considering this, we always run these experiments on stage, and we have a scheduled run on production. As I said earlier, if there is any discrepancy in these data points, we recommend the development team take a look at it, and on our side, provisioning the infra, if the service goes down for certain reasons we also analyze and fix it.
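To make the error-budget bookkeeping concrete, a small worked sketch follows; the 99.9% SLO and the downtime figures are made up for illustration, not Halodoc's numbers.

```python
# Sketch: how chaos-induced downtime eats into a monthly error budget.
# SLO and downtime numbers are illustrative only.
SLO = 0.999                   # 99.9% availability target
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

budget = (1 - SLO) * MONTH_MINUTES   # 43.2 minutes of allowed downtime
chaos_downtime = 12.0                # minutes lost to chaos runs
incident_downtime = 18.5             # minutes lost to real incidents

remaining = budget - chaos_downtime - incident_downtime
print(f"error budget: {budget:.1f} min, remaining: {remaining:.1f} min")
# If `remaining` goes negative, alerts fire and further chaos runs pause.
```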
B: Some of these questions are part of the standard template in the meeting notes, questions that help us learn how the end-user community is practicing chaos, and a few others just came to my mind as you explained the use cases. Thank you so much for being patient and giving us all this information.
B: I'm sure the community is going to make good use of all the answers you provided. We will try to crystallize this information as part of the user experience and user story sections of the white paper. I'd now like to bring in Sergio for his views on what was just explained. Are there any questions you'd like to add?
B: Cool, I see Brad has joined the call. Hello Brad, would you like to introduce yourself?
C: Yeah, my name is Brad, I'm from New Zealand. Normally I'm living in Australia, but I just found that this meeting suits my time zone, so yeah, I'm happy to be here.
B: Awesome, thank you for joining, Brad. This is the chaos engineering work group meeting, as you know. It's a cross-TAG, cross-project working group in which we're trying to build knowledge around cloud native chaos engineering and crystallize all these learnings into a white paper. As part of this effort we're inviting end users to come and talk about how they practice chaos engineering, what challenges they face, and what tools they use.
B: Also what plans they have for the future, things like that, so we can take their learnings back to the team and put them into the white paper, so that beginners or enthusiasts who are planning to do chaos engineering can derive some learning from it. Today we had Mani from Halodoc ID, a med-tech organization based out of Indonesia, come and talk about their use cases, and we captured some interesting answers and perspectives on this call, which will be made available in the recording.
B: So please feel free to take a look at that, Brad. We also had an insurance group speaking in the previous meeting, and there's a recording available for that as well. So that's some context for you. And Mani, I just want to leave the floor open for any questions from your side. For example, what is your ask of the CNCF community in terms of chaos engineering? What kind of resources would you like to see here? What kind of standards might you like to see? Anything you would like to see from this community?
C: Yeah, sure. So I haven't done chaos engineering before; I guess from this working group it would be really nice to see some sort of introduction to it. A lot of people coming to these webinars are beginners as well, so it would be cool to see some sort of hands-on labs. And do you have a specific tool that you'd like to use, Litmus or one of the other few?
C: Do you use a specific project, or is this sort of about chaos engineering in general?
B: Mani today explained how he has been using Litmus, why they chose Litmus among the tools they had, what the basis of their comparison was, and things like that.
C: Chaos engineering, yep, yep, perfect. And if you ever want to do webinars, I'm happy to help; I'm a CNCF ambassador as well, so I can help get you webinars and things. Just let me know how I can help.
B: Sure, that would be awesome; thank you for offering to contribute, Brad. I've just posted the meeting notes and the charter link in the Zoom chat; let me go ahead and paste that once again. Please feel free to add yourself as an interested party.
A: To answer your questions: we, being a startup, wanted to evaluate tools which are open source.
A: We're also using other tools: Argo CD is one tool we're using, we're using the Prometheus and Grafana stack, and we're using OPA Gatekeeper. These are some of the tools we have already evaluated and adopted and are using. Our primary point is that, since we have all our hundred-plus services running in Kubernetes, we wanted the tools we bring in to be cloud native,
A: I mean Kubernetes-native tools; that was one of the things. And if you ask me about the roadmap: yes, in this quarter we have planned to cover all the node-specific areas. And based on the discussions we've had in the past, one of the key things I would like to see integrated is user roles and user groups in the user management.
A: That is something we're looking forward to working closely with you on, because currently only the SRE team has access. We have to manually go and add users to the user management app and then provide access; it doesn't have G Suite or any other integration, whereas all the other tools we have do have G Suite integration, so any user who wants to view the dashboards or the metrics can do so. For this one, though, we have to add users manually.
A: So that is something we're looking forward to in the coming quarter, and, as I said, the node-specific and storage-specific chaos experiments need to be a focus in this period.
B: Great, that sounds like a very exciting roadmap. We'll be very happy to extend any assistance you might need in terms of approaches you would like to validate, and if there's anything you would like to share about your chaos journey in other public forums, such as the CNCF end user forum, or if you would like to do a webinar, we have Brad here who is basically going to help us with that aspect.
B: If you'd like to present anything about your chaos journey, if you want to talk about specific use cases and incidents and how you've constructed scenarios around them, you are most welcome. As a CNCF group we'd also like to thank you for using the other tools in the ecosystem for continuous deployment and policy enforcement.
B: I'm sure the chaos ecosystem is going to integrate tightly with all these platforms as well, giving you a wholesome, integrated experience in terms of maintaining and running your applications in production. And as you rightly pointed out at the beginning of this call, as a medical tech startup, reliability is very critical: you need to be able to connect your patients to doctors.
B: You probably need on-time reports, and your websites and OTP mechanisms have to work. So I think this is a very good use case for chaos engineering, similar to the criticality it has in domains like financial groups. Thank you for joining today and sharing all the thoughts, perspectives, and insights.
B: The notes are at this point still under construction, so we will share them on the CNCF chaos engineering work group Slack channel, probably in a week's time or so. Feel free to add to them, and you're most welcome to join this work group any time in the future, perhaps do a couple of quick demos, or bring up any other questions or challenges you might have around chaos engineering.
B: Right, thank you everyone; hope you enjoyed this, and we'll meet at the next meeting.
C: Thank you, yeah. It was very nice to meet everyone, and I'm looking forward to the next one.