From YouTube: Keptn Auto-remediation Working Group - May 19th, 2021
Description
Meeting notes: https://docs.google.com/document/d/1y7a6uaN8fwFJ7IRnvtxSfgz-OGFq6u7bKN6F7NDxKPg/edit
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on Github: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
Sign up to our community newsletter: https://keptn.sh/community/newsletter/
A: This time, I think, Jürgen and I think it would make sense that we take maybe one step back and think about what we should achieve, or what we want to achieve. Let's do a very short recap at the beginning: in the last meetings we had really good ideas and thoughts, and we were making up this mind map that helped us a lot in ordering our thoughts around auto-remediation.
A: This was in meetings one and two, and we also did a little bit of work in the third meeting, and then we decided that we want to provide a template for a use case. The use case in our situation is a JVM exhaustion, and based on this problem we want to derive the remediation process that is necessary to fix this particular problem. In the last two meetings we were discussing this template, and in the last meeting we then also did a little bit of it.
A: Great, and yeah, this is what I have summarized now, meaning what we have worked on in the last couple of meetings and what we have achieved so far: the mind map. Then we also defined the user story and the charter.
A: And last but not least, we talked about the template for a JVM exhaustion example. This is where we stand right now, and I want to be honest with this working group now.

A: I would like to ask all of you: what should actually be the outcome that we want to achieve next? Because I think we got a little bit lost with the template and where we are heading, so we're not focused on what we want to achieve. I think this should now be an open discussion on what our desired outcome should be in, for example, the next three meetings.
D: It's an interesting dilemma to have JVM exhaustion, because it can have multiple different ways of being remediated, and it can have multiple different sources. Of all the things we've done, I think we kind of got our mission and our charter, and that stuff is fairly well nailed down.

D: Anything could go on there, but the thing that seemed to be one of the more exciting conversations we had, at least in this group, was starting to talk about the JVM exhaustion as an example to explore the model or framework for what an auto-remediation process looks like, and what all the steps and decisions along the way in such a process are. Because, remember, in the mind map we talked about permissions and governance at different levels, pre-approved remediations versus remediations that need approval.
D: If we walk through that example of a JVM exhaustion, that will help us decide where those decisions take place and how an organization creates those policies. But it's also technically exciting, I mean, to everyone.

D: It was like: all right, now we're getting our hands dirty in an actual problem to remediate, so it makes us feel kind of excited, because that's kind of the fun: I didn't have to go through those 45 steps to diagnose, determine, apply, and then evaluate whether it worked or not. You know, that's a huge process to evaluate and do all that, and Keptn auto-remediation just did it for me automatically.
D: So to me, that's the light bulb moment that excited me about it. At least it triggered for me the idea of different levels of remediation, meaning levels of intrusiveness or disruption. How disruptive is a remediation? At the top, I have to do a code change and a push; maybe a little less than that would be an application configuration change and a push; a little bit less than that would be...

D: Maybe it's a machine-level configuration change, something in the operating system or in the container framework, and a push. Or I don't have to touch any of that: I just need more instances, or I need more of a resource, memory, a faster network, something like that. That's at a very, very low level in the infrastructure.
D: Maybe it's even DNS changes, maybe it's something outside the application stack that would help remediate something. Caching is a good example of that: something completely separate, the application doesn't even know it's happening, but we're starting to cache things externally. So getting that model, to walk through that example, I think, is good. For me, that's an outcome that focuses the other things we've talked about on what that "aha, that's cool" moment really becomes, and then it's also something that I believe is deliverable.

D: Like: here is the auto-remediation package, if you will, a collection or assembly of things, the auto-remediation assembly for JVM exhaustion. We could do one for, you know, controlling threads. We could do one for general network latency, at a very general level. They can also get very specific: here's caching in .NET, you could do that. So you'd have very narrow, scoped packages or assemblies of policies for auto-remediation, and very general ones. But to me, walking through that example gets us to those kinds of outcomes.
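The idea of escalating disruption levels and a narrow-scoped remediation package can be sketched roughly as follows. This is a hypothetical illustration of the concept discussed here; the names, the ordering, and the `JVM_EXHAUSTION_PACKAGE` structure are invented and are not part of Keptn.

```python
from enum import IntEnum

class Disruption(IntEnum):
    """How intrusive a remediation action is, least disruptive first."""
    INFRASTRUCTURE = 1   # more memory, more instances, faster network
    MACHINE_CONFIG = 2   # OS or container-framework setting plus push
    APP_CONFIG = 3       # application configuration change plus push
    CODE_CHANGE = 4      # code change plus push through the pipeline

# A narrow-scoped "remediation package" for JVM exhaustion:
# candidate actions ranked by how disruptive they are.
JVM_EXHAUSTION_PACKAGE = [
    ("add heap memory", Disruption.INFRASTRUCTURE),
    ("tune container memory limits", Disruption.MACHINE_CONFIG),
    ("push new JVM heap settings", Disruption.APP_CONFIG),
    ("fix leaking static structure", Disruption.CODE_CHANGE),
]

def allowed_actions(package, budget):
    """Actions permitted under a disruption budget, least disruptive first."""
    return [name for name, level in sorted(package, key=lambda a: a[1])
            if level <= budget]
```

The point of the ordering is exactly the escalation described above: an operator (or a tool) tries infrastructure-level fixes before anything that requires a config or code push.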
B: Cool, yeah. Maybe I'll just add my thoughts here. For me, the most interesting discussions, and what I really liked around the mind map, were the brainstorming: opening up all the different possibilities, all the things that are involved in auto-remediation, to find out: okay, what are, let's say, the organizational boundaries? What are the technical issues which are more related to the infrastructure side? What are the parts that an application actually has to provide to be able to be auto-remediated?
B: Maybe not each application can be auto-remediated. What are the approval steps? So kind of laying out, let's say, a framework, but not a framework in the technical or implementation sense, more a framework where you can see: okay, these are all the different parts, and if you want to build auto-remediation, then you have to think about all of these things, and if you want to implement it, you have to think about this.

B: Those different parts are actually the leaves in our mind map; these are the different aspects, so let's call them aspects. And I also think that applying this to an example really gives kind of the confirmation that it makes sense, that it's generally applicable. That does not mean it's fully worked out in all detail; maybe it's not fully matured in each detail, but it should give a very good understanding of all the different parts that are involved.
B: Regarding the example itself, the JVM exhaustion: I think it's a very good practical example to do. For me personally, it's great to see how it works for the JVM exhaustion.

B: Just me personally, I would not go into other examples and then kind of have a look at how, I don't know, a process crash would look, what the different options for crashes of processes are, and dig deeper there. Maybe this is something where we can have more folks from the working group, or just interested ones, joining and saying: okay, now there is this framework, how can we use this framework for my example?

B: So we do the exercise where we provide the framework, or the template, and others can then do the exercise themselves, and then maybe, as a larger group, we can come out again with a catalogue of different things. But for this group I would see the desired outcome more as something written down:

B: How one can apply the framework that we have kind of developed, yeah.
D: Not to totally interrupt you, but the word "model" is coming into my brain. When we wrote a book at Microsoft on ASP.NET, one of the old original ASP.NET performance and scalability books, we, as part of the SQL Server team, were cooperating with the IIS team and the ASP.NET team. I have PDFs of them somewhere; I'm sure we can find them on the internet.

D: Very, very elaborate flow charts, logic flow charts, and of course they were fairly linear in terms of how you branch your thinking, and there are multiple branches of thinking to remediate.

D: In a very simple example, for SQL Server performance: you look at locking, and then you look at blocking, and then you look at resources. If I'm in a SQL performance problem, I would go through these steps, and there was a logical sequence, meaning it doesn't make any sense to look at CPU until I resolve these two other issues and bring the data from those two steps into the third step. And so, at a very, very low level, you might ask: all right.

D: I looked at locking; at the same time, according to the model in the mind map that we went through: all right, I found something that I can fix. Do I have permissions to fix it? How long will it take to fix it? Does that fit within an SLO, or a fixed budget, or an error budget? So there are all kinds of things from our brainstorming model that would come into each step of the process. So I'm kind of connecting the two here.
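Those per-step questions (permissions, time to fix, remaining error budget, prior approval) could be sketched as a simple gate on each step of such a process. Everything below, names and parameters alike, is an invented illustration of the checklist, not an existing Keptn API:

```python
def may_auto_apply(has_permission: bool,
                   estimated_minutes: float,
                   error_budget_minutes: float,
                   pre_approved: bool) -> bool:
    """Gate one remediation step: run it automatically only if we hold the
    permission, the fix fits within the remaining error budget, and the
    change was pre-approved (e.g. it previously passed a quality gate)."""
    if not has_permission:
        return False                     # escalate to a human instead
    if estimated_minutes > error_budget_minutes:
        return False                     # too slow for the remaining budget
    return pre_approved
```

A flow-chart step from the book analogy would then be: gather data, propose a fix, run it through this gate, and either apply automatically or escalate.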
D: Hearing you talk, there may be another outcome that is sort of: hey, here's some elaborate logic. And making it look huge was great for us, because I could increase my bill rate: people would go, "holy crap, that's what your brain has to go through to figure out a SQL Server performance problem?" And that's not a deficiency of SQL Server, that's just any database!

D: You have to go through all these steps to figure out how to fix it. So it was an interesting way to display the complexity of performance, interdependent resources, and performance problem identification and remediation. I think we end up talking about it casually, at least I do, because it's just baked into my brain, but if you really spell it out like we did, and like we would do here, it gets really elaborate. But it could map back to the mind map, and you'd say: oh yeah.
D: Yeah, and it's like: all right, in this example of JVM exhaustion there are some parts of the model that we didn't have to visit. Meaning, let's say we're remediating something where I don't have to escalate, because it's a fairly non-disruptive step. Okay, first step: let's say you're going to hot swap memory, but it's not available to the operating system until you restart that system. So you could do step one without additional permission; then in step two you're going to restart that node.

D: So now there's another level of notification in the model: who do I talk to, do I have permissions, do we have, you know, that kind of stuff? Is this a highly sensitive application, versus pretty much a throwaway part? Can we just throw that container away and fire up another container with more memory?
D: That kind of remediation. We'd be able to go through each step and say: okay, for JVM exhaustion we only used four of the areas of the model. Let's come up with another example that now goes into, say, database indexing: a missing index, very common in almost every RDBMS, and even in NoSQL you'll find other issues where indexes and keys are wrong.

D: You would just pick another example to illustrate other parts of our model, like a high-risk part of an application, where there may be other controls and permissions, so we have to add steps. We could find maybe two or three different examples down the road that would flesh out sort of the whole model, because I think, for all of our thoughts, finding one problem to solve in production that hits all parts of that model might get a little bit crazy, a little bit overwhelming to somebody.
D: I just want the concept. So here's a fairly simple remediation: I need to expand the cluster, add more nodes. Okay, fairly disruptive; we don't have code changes, we don't have config changes, we just need more nodes in a non-elastic, non-auto-scaling type of situation. And maybe the last step of the remediation is then a recommendation for the future. We talked about that: hey, did we learn something, and can we recommend what we think should be fixed down the road? Hey, get your application to enable auto-scaling.

D: You know, "with XYZ feature" kind of thing. JVM exhaustion is somewhere in the middle, depending on the source of the GC pressure. Do you just not have enough memory, and need more memory so that GC is working efficiently? That could be it, just a limited heap. Or do you have a memory leak somewhere in the application, where it's growing and growing, maybe a static structure that's not coded properly, just growing and growing over time, depending on throughput?
B: Do we have in the mind map some kind of requirements for the application itself? You just said: if there are some issues with JVM memory, then just widen the boundaries so that GC can come up with a nice way to do the garbage collection. But are there maybe some other prerequisites that applications have to meet? I don't know; maybe one thing is they have to be stateless.

B: They have to be configurable from the outside, or something like these twelve-factor applications, or maybe it's just three of the factors, or maybe it's something else; I'm not even sure what it is. Or, for infrastructure: it has to be accessible via some kind of infrastructure as code.

B: You have to have infrastructure as code. If there is only a manual way to do some things, and there's no way to do an API call to approve something... If you do an apt-get install and you cannot pass -y to automatically accept, the prompt sits there waiting over the weekend for someone to type in "yes". So maybe there are some kinds of prerequisites for applications that we can also put into this framework.
D: Sure. Even in a microservices versus monolith type of situation: in microservices you could tee something up to push, but certain organizations don't allow it to go all the way through, because of either a compliance issue or whatever. There is somebody sitting there going: well, it's a manual package drop; CD only took us so far, and the zip file is sitting right there, all you have to do is double-click on it and it'll go. But then no one's available to do that.
D: So, remediation. To your original question: I think with the mind map we got to the precipice, we got right to the edge of opening that discussion. And to me... I'm going to try a little wizardry here for you, if you guys can see this. Do you see this? Yeah? To me, it's those levels, where there are things at a very low level, which is just adding memory, right at the very bottom.

D: I can take an existing machine, or I can just reconfigure, and the app doesn't even know; it just has more memory available, but the container framework would know, or whatever. Even if it's a physical redeployment, or no redeployment, hot swap the memory in: I didn't have to call a developer. I can throw memory at a problem, and maybe it buys me something; you just need more memory. And maybe Keptn, or the auto-remediation, would basically do a little diagnosis, saying: we can see your GC pattern changing over time, and we think, if you just give us a little more overhead... We're having forced GCs because of a heap limitation, but if we get more heap space, without changing the JVM configuration or anything, just more available memory, we can hit that peak and have GC be more intelligent. Especially G1 GC in Java 11 and later; G1 really is smart about knowing how to stay ahead, but if you've crunched it, it's like:
D: "I really look at these structures; I would really like to not have forced GC because of heap exhaustion." To me, in the model from the mind map, that would be level one. Then let's say you actually had to do a JVM config push. If, Jürgen, to your point, config is code, that JVM config has to be pushed back through CD, and maybe I do that change. And because that's more elaborate and a longer process, with more people involved, then our mind map, where we talk about governance, asks: is this a pre-approved remediation?
D: Has it been tested in pre-prod, where everything's a lie? Have we previously passed a quality gate in Keptn to say that this is an approved remediation? You can just push the button, because we tested it, which means there is no "-y" prompt problem; it'll just go tweak the JVM automatically and push different heap settings.

D: If you went further than that, you could get into even more detail, but now you've got app config, things within the application layer that sit on top of the JVM, and that would be like a level-three remediation, with again even more permissions. And if you took this entire stack and said, I'm going to apply this to a high-risk app, now I've got additional steps at each of those levels to get permissions and notifications: oh, this is a high-risk app.
D: Even if we think it's benign and it was a pre-approved remediation, we're going to give a notification to the app stakeholders to say: oh hey, one of our high-risk apps, high revenue generating, with many dependencies, went through a change that is supposedly benign, and here is extra information about what to do if something doesn't seem right, who to call, etc. So there are other steps in our mind-map thinking that would play in here. Whereas for a low-risk app, let's say this one is medium and this one is low risk:

D: We could do all three levels of remediation over here in low risk without any extra notification; we didn't have to call anyone, nothing. We just take care of that app and keep it running. On the medium side, maybe when it gets to app config, the technical product owner gets a notification, and maybe the lead technical dev gets invited to a 15-minute "hey, can you be on call while we push this app config change", etc.

D: So that's the idea. For me, walking through an example with different levels allows you to say: here's where I would not have to use a lot of our thinking from a very elaborate mind map, whereas high risk, or compliance, or high throughput (high risk could be high throughput as well: it's under tremendous load, extra steps for load balancers)...
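The governance being described, app risk class crossed with remediation level, amounts to a small lookup table. The sketch below is a hypothetical illustration of that idea only; the role names and table entries are invented, not an existing policy format:

```python
# (app_risk, remediation_level) -> (who to notify, manual approval needed?)
POLICY = {
    ("low",    "infrastructure"): ([],                            False),
    ("low",    "machine_config"): ([],                            False),
    ("low",    "app_config"):     ([],                            False),
    ("medium", "infrastructure"): ([],                            False),
    ("medium", "machine_config"): (["product_owner"],             False),
    ("medium", "app_config"):     (["product_owner", "lead_dev"], False),
    ("high",   "infrastructure"): (["stakeholders"],              False),
    ("high",   "machine_config"): (["stakeholders"],              True),
    ("high",   "app_config"):     (["stakeholders", "lead_dev"],  True),
}

def governance(app_risk: str, level: str):
    """Who must hear about this remediation, and does a human sign off?"""
    return POLICY[(app_risk, level)]
```

The low-risk column is the "no notification, just take care of it" case; the high-risk column adds stakeholder notification even for supposedly benign, pre-approved changes.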
B: Yeah, I really think that these concepts can really help us understand the whole complexity of auto-remediation. And from the very beginning, it was never a goal to do the whole implementation.

B: The goal was always to showcase what has to be part of a modern auto-remediation tool, so to say, something that does not exist yet. So we don't have to focus on Keptn as it is right now; we should focus on how we would need to build it, and all these examples help with that.
B: I think it really helps if we can think about: okay, where does this classification of low, medium, and high risk live, for example? Is it only in the remediation part? Maybe it's already in the deployment part; maybe it's in the whole application life cycle. Maybe high-risk applications need to be treated differently; let's say for a high-risk application we enforce security scans during deployment, so this could be one of the things where Keptn already takes care of it.

B: I know, now I'm again talking about Keptn, but the point is that it's not only for the remediation part, yeah. And then it gives a justification for why it's so important for remediation, because it's not only important for remediation. So we really have to put this into a concept that is broader than remediation, because it actually concerns the whole application life cycle.

B: If we think of a rollback, then maybe a high-risk application has to be treated completely differently in a rollback, because manual approvals are needed, and they are needed because it's high risk, yeah. And this also involves the deployment already in the first place.
D: Yeah, I obviously lean heavily into Keptn's definition, I would say definition and implementation, of the quality gate concept, because that's near and dear to my heart; it's in my experience and in my mind more than anything. And if you're not using Keptn to do quality gates, you might be doing that in very old-school ways; you might have a fairly elaborate quality gate concept on your pipeline.

D: The other question that expands on that for me, and the reason I see Keptn quality gates and the automation of your pipeline working in tandem, is that you have multiple pipelines flowing into an entire app landscape, so you're doing auto-remediation not just on a single pipeline. You've got to go back and say: oh well, wait a minute, I can make this change, but that also affects this dependency. Database indexing is a good example of such a dependency.
D: If I just go: hey, this application uses this table with this index, but there are 40 other applications that hit that same table, because we still have a monolithic database or a centralized database concept. Okay, I can't just tweak this index for my case; now I'm hitting some other roadblocks in the idea of auto-remediating a missing index. There's an index there already that covers five of the six attributes that I need, but I can't just add the sixth in there, because it may affect that other application's query.

D: This is really what we pay the human brain to do all day long as a DBA, designer, developer, modeler. It's like: all right, I'm going to figure out how many different stored procedures talk to account data, whether I can really support them under different workloads, and whether I need to split them out. I have three different account tables now, in the NoSQL world.

D: They have some similar data, and when I need to coalesce all those tables and go do reporting somewhere, that's a whole separate process. So there are different ways to solve that problem of layered interdependency, and to me, that's where remediation generally gets really complex. We picked a fairly simple one, even JVM exhaustion, but when you think about it in that context, it quickly gets more complex, even when you just start applying our thinking to it. Anyway.
D
High
permission,
high
permission
versus
low
permissions,
meaning
manual
intervention
for
permissions
high
pre-validated
remediation
steps
versus
low
pre-validated,
meaning
they
have
has
the
have
these
changes
ever
gone
through
the
quality
steps
or
quality
gates
to
to
say,
hey,
we're
pretty
we're
pretty
comfortable
in
our
experience
about
changing
jbm
settings,
we've
done
a
lot
of
them
over
the
years.
We're
just
going
to
tweak
this
one,
like
all
the
other
applications,
so
that
that
you
know
previous
experience
prior
knowledge.
D
Permissions
things
like
that
could
go
into
the
mind
map,
and
maybe
that
does
become
a
model
and
I
think
they're
there
to
me
there's
still
a
flowchart
in
there
there's
a
you
know
what
are
all
the
things
I
have
to
go
through
as
almost
a
checklist
before
I
take
any
action
or
even
try
an
action.
The
other
thing
you
think
about
the
akamas
guys
I
mean
they're,
they're
sort
of
they'll
put
in
a
setting.
They'll
run
something
and
they'll
get
the
data
back.
D
You
could
even
do
in
production,
say
well
we're
gonna
of
the
16
different
containers,
we're
going
to
tweak
it
and
move
one
of
them
put
a
little
traffic
onto
there,
monitor
that
and
then
see.
If
that
change
is
good.
That's
like
when
you're
developing
the
remediation
there's,
even
people
that
under
pressure
are
like
well,
let's
try
it
try
one
see
how
it
goes.
If
that
looks
good,
we'll
push
it
across
all
the
rest
of
the
containers
and
restart
them.
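That "tweak one of the 16, watch it, then roll out" approach is essentially a canary step. A minimal sketch of the loop, with `apply_change` and `evaluate` as stand-ins for the real configuration push and monitoring (both hypothetical callbacks, not a real API):

```python
def canary_rollout(containers, apply_change, evaluate):
    """Apply a change to one container first; only if monitoring judges the
    canary healthy is the change pushed to the rest. Returns the containers
    that were changed and whether the full rollout happened."""
    canary = containers[0]
    apply_change(canary)
    if not evaluate(canary):
        return [canary], False            # stop: only the canary was touched
    for container in containers[1:]:
        apply_change(container)
    return list(containers), True
```

The evaluation step is where a quality-gate-style check (SLO evaluation on the canary's metrics) would plug in before the change spreads.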
B: I don't know what I want to buy, maybe more coffee beans... But we could also think about auto-remediation as: we're not meeting SLOs, and the SLOs are defined as running very performant. Like the Akamas guys, who are always optimizing the JVM settings. So maybe, if we're not hitting our SLOs, we treat it as an auto-remediation problem, and we could go back and forth with it to do auto-remediation steps.
B
It
would
be
more
on
the
low
impact
part
and
would
would
hopefully
go
without
approvals,
but
these
kind
of
doing
one
thing
giving
it
a
try,
testing.
It
then
evaluating
it
doing
another
thing,
giving
it
a
try,
then
evaluating
it.
It's
it's.
Basically,
it's
the
same
workflow
or
it's
the
same
flow
chart
for
auto
remediation
as
well.
B
Just
that
the
trigger
is
not
an
incident
or
an
issue
or
a
problem,
but
the
trigger
is
because
we
failed
our
slos
that
are
specifically
only
made
for
optimizing
the
application,
so
it
could
also
be
like
on
the
it's.
It's
not
the
core
problem,
but
it's
the
same
idea,
optimization
as
a
remediation
problem.
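So the same try-and-evaluate workflow would fire from two kinds of triggers. A tiny sketch of that classification (the event names are invented for illustration, not Keptn event types):

```python
from typing import Optional

def remediation_trigger(event_type: str, slo_met: bool = True) -> Optional[str]:
    """An incident triggers remediation, and so does failing an SLO that
    exists purely to optimize the application; only the trigger differs,
    the downstream try/evaluate workflow is the same."""
    if event_type == "problem":
        return "start_remediation"
    if event_type == "slo_evaluation" and not slo_met:
        return "start_remediation"       # optimization as remediation
    return None
```
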
D
What
about
something
famous
from
about
10
years
ago
was
this
idea
of
cost
optimization,
which
was,
if
you're
looking
at
something
in
like
cloud
found
or
abstraction
layer
in
the
cloud?
Hey,
it's
cheaper
to
run
my
containers
over
on
those
guys
than
it
is
on
these
guys.
They
announce
a
pricing
change
and
like,
if
you
put
the
cost
of
the
platform
cost
of
the
infrastructure
into
the
model.
D
Could
I
just
don't
change
a
line
of
code?
No
one
even
knows
hey
we're.
Now
we're
running
on
these
containers
over
here
we're
running
on
that
with
that
provider
over
there.
So
it
there's
not
even
the
the
upside
is
not
great
performance.
It's
just
same
performance
costs
you
a
thousand
dollars
less
a
month.
B: That's what the folks at Iter8 are doing, yeah. You've heard of them? So Iter8 is, I think of it as a little bit of a Keptn quality gate, but they don't call it a quality gate; they more want to do A/B testing and canary releases. They say: Iter8 makes it easy to optimize business metrics and validate SLOs when you release new versions of Kubernetes apps. Yeah.
D
I
think
you're
right
there,
there
is
a
that
should
be
in
the
model.
You
know
what
is
the
impetus?
What
is
the
motivation?
Is
it
and
then
it
could
be?
You
know,
refinement
or,
like
you
say,
optimization
as
as
iterating
on
the
the
continued
we've
been
remediating,
this
same
way
the
entire
time,
but
is
there
a
feedback
mechanism
that
says
you
know
every
time
we
do
that
we
should
have.
D
We
should
wait
longer
between
changing
the
load
balancer
and
then
give
it
more
time
or
there's
very
technical
ways
of
refining
that,
but
then
to
these
other
goals
yeah
could
I
is,
does
it
change
conversion
rate,
the
a
b
testing,
honest
to
god
that
kind
of
testing?
I
always
think
you're
right
just
it
should
be
one
of
many
quality
gates
it.
Just
it's
not
coming
from
a
hard
requirement.
It's
coming
from
our
friends.
Call
it
a
desirement
I
desire
to
make
more
money.
I
desire
to
have
a
better
conversion
rate.
D
I
desire
really
nice
colors
or
you
know,
on
my
website.
I
want
to
change
my
colors,
but
you
know
what
suddenly
some
graphics
engine
just
goes
right.
It
looks
terrible.
So
optimizing
other
telemetry
yeah
yeah.
B: Yeah, you could optimize for availability, or for revenue, sure. Of course you need some kind of availability for revenue, but yeah.

D: I mean, it can also be sentiment, right? We know that's a big thing, you know, if you've got feedback mechanisms on your website: we made some changes, and, unrelated to the change, you had a 10% uptick in people that said your website is great. Awesome, okay, cool.
A: As we were working on the mind map... All right, gentlemen, what should be the next step?
B: Sorry, I just had to put down my phone. I would suggest taking the mind map and putting it into a written document, because right now all the leaves are just bullet points. We know what we mean by them, but I think adding a sentence or two for each of them would be good.

B: We'd then come up with kind of a large document with just the description of the mind map, and once we have this, we can go on to the template. Johannes and I already did a couple of things; we used this template today and put in, for example, the tools which are used, because this was also one of the action items from the last meeting.
B
It's
I
think
today,
we've
discussed
it's
not
about
the
tools,
it
should
be
nevertheless,
tool
agnostic
and
that
can
later
on
it
can
be
a
nice
framework
for
us
to
find
in
the
in
the
ecosystem.
What
are
the
tools?
What
are
they
doing
now
nowadays
and
then
reach
out
to
them
and
invite
them
also
back
to
to
this
working
group,
or
maybe
we
initiate
a
new
working
group
under
a
new
umbrella
or
whatever.
D
Or
a
working
session
specific
to
what
they
do,
like
it
iterates
a
great
example
akamas
a
great
example
they're
already
a
partner
to
like
hey.
You
know
we
want
to.
We
want
to
take
what
you
do
and
try
it
out
in
this
new
model
and
get
some
feedback
from
them.
Yeah.
B
Yeah
we
were
discussing
today
that
also
that
there
are
these
special
interest
groups
that
are
part
of
the
cncf,
and
maybe
so.
This
could
be
something
where
we
then
open
up
a
little
bit
more
and
ask
others.
Okay,
can
you
please
present
how
you
are
doing
this?
Does
it
fit
to
the
model
that
we've
already
developed?
Does
it
or
do
we
need
to
extend
the
model?
Maybe
we've
only
seen
this
from
one
angle
and
one
perspective.
B
I
think
we
had
already
great
discussions
here,
so
I
I
would
pretty
much
be
be
surprised
if
we
missed
one
part
if
we
totally
missed
one
part
but
having
this
and
then
giving
this
as
a
kind
of
as
an
input
for
for
further
discussions
could
could
really
help,
but.
B
To
have
a
written
document,
do
the
the
mind,
mapping
describe
all
the
the
beliefs
and
then
all
the
all,
the
other
parts
of
the
mind
map
also
describe
the
template.
I
think
kevin
has
also
done
a
great
job
here
in
describing
everything
that
was
going
on
in
their
organization,
very
detailed,
marcus.
All
you
have
also
done
this
great
job
on
on
the
user
story
that,
where
that
can
fill
in
here,
so
I
think
we
already
have
something
me
personally.
B: Use the model for the template; we have the example for it. And this could be something... I think it might be way too long for a blog article, but, like a lot of working groups, we could then come up with a white paper at the end. It can also be a blog article, it can be whatever, but something to share with others that have not yet been part of this working group, so that we can also get their feedback and use it.

B: This eventually also becomes part of a new implementation, or a further implementation improvement, in Keptn, yeah. But it's also food for thought for the whole community.
D: Yeah. So, things that would go in that document, Jürgen: I see maybe a flow chart, but also, beyond the process flow, there could be a block model. What are the building blocks that you have to think about? Like, when I go in to remediate problems, it's like I could go anywhere I need to within these different considerations: well, what application does this touch?

D: Another thing we could do in the paper: here's the list, a checklist of questions that would even be inputs to an automated auto-remediation product or process. You're always ingesting information that tells you: okay, here are the blocks of the model that are going to be most important for what you're dealing with. And the third thing, next to the block model and the flow chart: I think there's a list of questions for that ingest. What are the questions?
D
It always starts, in the first n number of hours, with gathering data, gathering information, pulling it into my mental model, and then figuring out exactly, okay: what then? Then I loop back on questions: all right, I see what the thing could be. Is this something we try? Is this something that I know will fix it? How confident am I? That makes me think of the vectors in our model.
D
Here again, the vector was low, medium, high, and I just said risk. So risk would be a vector. Permission, or experience, would be another: we're very experienced in tweaking JVMs, we've been doing it for 20 years, we know a lot about it, so we can automate that knowledge. On another vector, disruption would be one; dependencies would be another. These different vectors, I think we described them as: when I'm looking at the block model, are some of those blocks more complex or less complex, more sensitive or less sensitive?
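The low/medium/high vectors described here (risk, experience, disruption, dependencies) could be sketched as a small decision helper. This is a hypothetical illustration, not part of Keptn; the vector names come from the discussion, but the function, gating rules, and thresholds are invented.

```python
# Hypothetical sketch: rate each candidate remediation action along the
# vectors discussed (risk, experience, disruption, dependencies), then let
# the ratings gate how autonomously the action may run. All rules invented.

RATING = {"low": 0, "medium": 1, "high": 2}

def decide_execution(vectors):
    """vectors: dict like {"risk": "low", "experience": "high", ...};
    missing vectors default to the most conservative value."""
    risk = RATING[vectors.get("risk", "high")]
    experience = RATING[vectors.get("experience", "low")]
    dependencies = RATING[vectors.get("dependencies", "high")]
    disruption = RATING[vectors.get("disruption", "high")]

    # Well-understood, low-blast-radius fixes can be fully automated.
    if experience == 2 and risk == 0 and dependencies == 0 and disruption == 0:
        return "auto-apply"
    # Anything risky or heavily entangled needs a human in the loop.
    if risk == 2 or dependencies == 2:
        return "require-approval"
    return "auto-apply-with-extra-verification"

# The "20 years of JVM tuning" example from the discussion:
jvm_tuning = {"risk": "low", "experience": "high",
              "dependencies": "low", "disruption": "low"}
print(decide_execution(jvm_tuning))  # -> auto-apply
```

The point of the sketch is only the shape: the same action can land in a different bucket depending on the environment's ratings, which is exactly the microservices-versus-monolith contrast raised next.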
D
Given the environment, because you could walk into another system and say: well, there are very low dependencies, because we're in microservices and we can make changes without really putting the majority of the system at risk. But if I walk into a more monolithic application, I'm like: well, you know what, this is the Jenga tower. I can't just go pulling pieces out and start trying to fix things.
D
There has to be extra testing, maybe extra retries, before simple rollout stuff. So I think those could be some of the things we write up in terms of the model: how it would visually appear, a flow chart for how to apply it. Because once you take an abstract model and apply it step by step, now you're going through the logic flow, the vectors,
what data points need to be ingested. And blog articles could come out of each of those. You could just say: hey, here are the top 10 questions that we ask when we're going to start to build an auto-remediation engine; what kinds of data inputs do we need? We start with some questions.
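The "checklist of questions as data inputs" idea could look something like the sketch below. Every question and data source here is illustrative, invented only to show the shape of such a checklist, not to prescribe its contents.

```python
# Illustrative sketch: each question an auto-remediation engine would ask,
# mapped to the data source that could answer it automatically. Both the
# questions and the source names are hypothetical examples.

INGEST_CHECKLIST = [
    ("What application/service is affected?", "topology / service catalog"),
    ("What changed recently (deploys, config)?", "CI/CD events"),
    ("How severe is the user impact?", "SLO / error-budget metrics"),
    ("Have we seen this problem signature before?", "incident history"),
    ("How confident are we in the known fix?", "past remediation outcomes"),
]

def open_questions(available_sources):
    """Return the questions we cannot yet answer from the available data."""
    return [q for q, src in INGEST_CHECKLIST if src not in available_sources]

# With only two sources wired up, three questions remain unanswered:
print(open_questions({"CI/CD events", "SLO / error-budget metrics"}))
```

Framed this way, the checklist doubles as a gap analysis: whatever questions stay open tell you which data integrations the remediation process still needs.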
A
Listening now, and ordering my thoughts: I always have this chart in my mind that has four quadrants, two axes, two directions. We can put different vectors on it; maybe one is for risk and the other one for dependencies. Then, depending on which quadrant you are in, there are other recommendations and aspects you need to consider. We could provide this.
A
This quadrant chart could tell the reader, or the person that is interested, which aspect is relevant and which kind of recommendation applies when you are in which quadrant.
D
You can have a really secure system, a really performant system, or make your boss really angry because you took a bunch of risk without their permission. But I think the point is to keep it simple, to communicate something simple, and then ease people into the idea: now that you've gone through this first quadrant map to figure out where you are, let's double-click on that and drill down. Now you've got these other things at the next level of complexity, which is, I think, what makes this area...
D
First of all, the exciting thing to me is just how your brain works to go through this. If you've done it a couple thousand times in your career, you don't often stop to even examine your own brain. The most interesting things are the stuff that I apply apathy to: what are the things that I feel safe putting out of scope? Okay, I know I'm not going to touch X, because that has nothing to do with my next hypothesis for going down and fixing something.
D
So there's even a whole other article there; maybe I'll give a talk about this: "I don't care", apathy as my guiding light. I go into a situation where I'm nervous and I care about everything, and then, by process of elimination, here are all the things I don't care about, and then I get to the root of something. So maybe there's a whole other inverted thing: instead of "hey, we need to expand everything we're aware of", you also need to know how to safely ignore stuff.
B
Yeah, we just had a talk recording today with our friends from Litmus Chaos, and they also had this: the known knowns, the known unknowns, the unknown knowns, and the unknown unknowns, and why it is so important to also test your applications. Actually, it was the other way around: we have a lot of assumptions when we develop applications. We assume we always have network; we assume we always have power.
B
We assume we always have unlimited CPU. A developer would not assume his application would just be moved from one node to the other, and that during this process it will not be available. This is not what the developer thinks. The developer assumes that if I build my Docker container, whatever, and it runs, then it will always run, because there will always be a safe place for it. But that's not the case.
D
The epistemological model for chaos: I gave a talk once about that. It was about precognition in your experience of interacting with an application while you're doing exploratory testing. It was like: there's stuff I'm not even aware that I'm thinking as I do this, and how do I tap into that stuff?
D
Does that help us, Johannes, get to narrowing down what we would like to produce?
A
I think what we discussed right now is two aspects. First of all, we derive a model or framework from the mind map.
A
We just took a couple of notes on what the model should contain, and then, based on that, we do an example based on the JVM exhaustion scenario, to validate how the model fits and how it fulfills the requirements. I think this makes total sense, because it summarizes the work we did at the beginning and what we did in the last two meetings, so that we then have a written document, something that we can collaborate on and share with others.
A
Yeah, I have to leave the building in two minutes, but I really would like to define what we should work on in the next two weeks. I mean, I can, or I will.
D
I can definitely. I would like to do a few more of these, and then go back and find the stuff that's coming out of my brain and flesh it out: actually take what's on the mind map and work backwards from the example, which is, I think, the reason we wanted to do an example anyway.
D
So let's go through what we think that process is and see if we can find our way back home to what the block model is. And then let's throw out a kind of generic outline for what we think the white paper, or the report, would look like, and the different sections that we want to dig into. That will lay out, generically, an outline where we can start writing, then editing and tying it back together. And I think our objective, the way we look at the charter, is that that kind of deliverable fits with what we want to contribute to the industry: some pretty in-depth thinking on how you would model this out.
D
So I'm good with that, and for the next two weeks I'm pretty good. I don't have any... yeah, hopefully I don't have any escalations, but I don't have any key customers on my plate right now. Cool.
A
And
you
may
ask
you
to
help
out
a
little
bit
on
writing
the
the
framework.
The
model.
B
No worries. Cool, yeah. I also need to jump here, but I think it was a very, very open discussion today, a lot of new thoughts, and I'm really looking forward to bringing this all together.
D
And before I let you guys go: given that we only had a handful of us here today, and we started with more folks, maybe we find a way to give an update of what we're working on, once we have something that's a little more visual or written up, to bring it back out to the major working group or the advisory board and say: hey, just share an update of what we're doing.
B
Iter8: it's not a company, it's... I think it's also an open-source project.
C
They're kind of like StormForge, if you've seen those; it's a different performance optimization company.
C
StormForge, I just saw; we had a discussion internally, and they're pretty similar to Akamas. They try out different settings.
C
Yeah, basically it's AI/ML-based optimization, based on Prometheus.
D
Yeah, yeah, no, that's right. I remember I saw a white paper from them, very cool. It's the other one, Jürgen; I think they call it loop testing, where they're trying different stuff, like you say, for A/B and that kind of stuff. But Johannes has to drop; he's gonna get out of here.