Keptn Working Groups, 7 Apr 2021

From YouTube: Keptn Auto-remediation Working Group - April 7, 2021

Description

Meeting notes: https://docs.google.com/document/d/1_WlLP6oLcHe0yyC7kXH2hB3i9bOPvIArp83NohE78FU/edit#

Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on GitHub: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject

A

All right, hi and welcome everyone to the next meeting of this working group. Let me just share my screen first to go over the agenda items for today.

A

Actually, first we should take a look at what we defined as action items for today. One was about adding some thoughts and then restructuring and polishing the charter, so that we have one document in focus that we are working towards. We should discuss this charter today to come to a final conclusion. From the last meeting we also still had an open action item: the success story that Mark and Jürgen were working on.

A

We also want to go over this story today. After the story, we think we should discuss here in the group what the next milestone is that we want to achieve. Jürgen and I talked about that a little today, and we think it would make sense to write a white paper. But that's just our opinion, and we should discuss this in the group to decide what we want to achieve next.

A

All right, then let's go directly to the first item on the agenda: the charter.

A

What I did is some rework. I did not change the wording; I just removed things. For example, in the first sentence there was Dynatrace in there, but we agreed on not focusing on a vendor, neither Dynatrace nor any other monitoring solution. The outcome of this working group should not be vendor-specific at all. And the mission is that the Keptn...

A

This is also a little bit questionable, whether we want to have Keptn in there, but I think for now we can leave it. The Keptn auto-remediation working group establishes proposed automated remediation requirements based on the input and collaboration of a diverse set of users.

A

This is the mission we have in focus. Based on input from the working group, this team seeks to establish a living set of automated remediation requirements and deliver these requirements to the community on a periodic basis. What I also did is take the ideas from Mark. I really liked them, because last time he said: as a member of this working group, you innovate best practices for auto-remediation.

A

In general, you innovate ideas for leveraging AI and machine learning for remediating issues in an automated fashion. And last but not least, you contribute these best practices back to the industry and community. That's really cool, I like that one.

B

To be honest, maybe it's just me, but I don't really understand the mission, or the first part of the vision. It's the remediation requirements that kind of confuse me.

B

In the working group document, in the very first part, we also have a goal of the working group. This language I understand, but the other language is very hard for me to understand. I'm not sure what the remediation requirements are. Maybe it's just me.

C

No, I'll second that. Remember, in our last meeting our discussion was around something that was really specific to Dynatrace product development or product management. So maybe some of that, like specifying the requirements, was coming from that perspective. So on the word requirements, I'm with you, Jürgen. It just doesn't sit quite right if we're expanding the idea: these are not really requirements of the community but proposed automated remediation practices. Maybe instead of requirements...

C

You could even just put the word practices in there.

D

Yeah, I was thinking practice or process. We're developing a process to follow more than requirements.

C

And the goals are good; that's a good paragraph as well. I like the goal too.

A

Let me remove that one. The goal is: establish an approach that defines the future of auto-remediation for both cloud-native environments and traditional implementations. This will help SREs and DevOps engineers to define auto-remediation processes and allow testing these processes and instructions to validate their effectiveness.

A

True, I'm also... yeah.

D

See, I like that. It doesn't have requirements in there; it's talking about processes.

A

And best practices, that's true. Then some sub-goals: identify suitable remediation actions, then prevent issues, and then also this concept of closed-loop remediation, which means that we want to have an approach or a process that provides visibility and answers that are valuable for the remediation process itself.

D

Yeah, personally, and it's just a personal preference I think, but when I see "prevent issues", a lot of the time I think that's marketing material. You know: hey, let's catch issues before they happen. Well, that's not possible, right? In realistic terms we're trying to minimize impact, or do timely mitigation, more than actually prevent issues. I mean, I understand the concept, but when you read things like that, it's great for C-level executives: "Oh, really? I can prevent stuff?" Well, realistically, no, but we can fix them a hell of a lot faster.

C

Yeah, right. You're still going to catch the coronavirus; it just won't kill you.

A

That's true, preventing is not always possible, but what we can achieve is mitigating issues and problems. All right.

C

I wonder if there isn't one more thing. When I think about the vision as a member of the working group, one of those lines above is contributing these best practices back to the industry and the community. As a goal, aside from technically inventing auto-remediation actions, processes, and practices, and applying those in an approach or a context for mitigating issues, and I like the idea of working on closed loop,

C

I wonder if there isn't something also about... I don't think adoption is the right word, but it's the barriers. Let's say everyone was inventing their own ways of doing trace IDs, all the way back to old Tealeaf systems back in the day. Now we have observability frameworks that get built in, we're tracing stuff through OpenTelemetry, and away we go.

C

There's something to the maturity of the industry being able to adopt or integrate these more comprehensive ideas, versus an off-the-shelf product that you plug in that's just for you. So I'm wondering, as with OpenTelemetry, if there's an auto-remediation body, whether one of our goals could be to accelerate the adoption or the acceptance. Both in the sense of "here are some standard or good practices for auto-remediation", but also because there are people averse to the AI and machine learning stuff, around,

C

you know: hey, can I just let a few of these things go? The system knows how to keep itself running in certain cases. So there might be some pushback or some resistance to that. So I wonder what you guys think about one of the goals being to take everything that we're working on and also ease or accelerate its adoption into the industry.

D

Yeah, I like that concept, because I think everybody's on board. Everybody likes all the buzzwords of AI. In my own organization we're very anxious; we're actually doing a demo of an auto-remediation that we put together for our CTO this afternoon. Everybody loves to see it, and then the first thing that pops up is: "Well, that would be risky, right, if we did it in production?

D

Well, let's put the brakes on." So bridging that cultural adaptation, getting people past that risk acceptance, I think requires some pushing, and some good community acceptance and documentation as to the guardrails around this type of thing that says: okay, this is truly safe, and here's why. So I think we need to publish some of those guardrails too. I don't know where that would fit, but...

C

I like how you said that. You can build some amazing technical things, and then you're like: well, how does this fit? How do I encourage people to start using it, giving us feedback, and engaging? Like, okay, there is risk, but we also thought about the risks in the working group. We thought about those things, and here's some way of saying how we can ease or improve the adoption of these practices. Yeah, cool.

A

Totally valid point. We should not forget about the acceptance and the adoption at the end, because we can come up with a cool solution, but in the end it must also be usable, and someone has to trust in it as well.

A

Is it okay, the way I phrased it here? I mean adoption, it's about acceptance: ensure that users will accept the proposed... reduce the risk.

D

Yes, I think keeping that goal in mind obviously drives some of your architectural design behind all this. Keep it in mind. You know, like OpenTelemetry: if they made it too difficult to implement, well, nobody's going to adopt it. Same with this. We have to architect something that can easily be adopted. So I think that's a key goal to keep in mind when we come up with our solution.

A

Okay, any additional thoughts on the goals? We can always come back to rethink these goals and refine them as we make our way through the topics that we want to tackle.

C

Johannes, did we just want to replace the word "prevent" with "mitigate"? You know what I mean: it's not a separate goal, but it'll make us think about it differently.

A

All right, cool. Then this is what I wrote down as I started working on this charter: a non-goal should be that we define a workflow engine or incident management. I think there are great solutions out there that do this job better than we can think of, so I just defined it as a non-goal. If you think there are other non-goals, please feel free to add them. In scope:

A

this was also kind of my first idea on this charter. We should also think about the end-user view, in the sense of: how does a developer think of remediation? How does an SRE think of remediation, or a DevOps engineer?

A

Then maybe we can define some standards for how I can define or declare my remediation actions, then define best practices to validate these actions. The last in-scope item would be to identify a real-world use case, or the story that Mark and Jürgen were working on, so that we have something more tangible. We can then discuss certain use cases around this story, jointly work on it, and drive it forward.

B

I think the real-world environment here is actually maybe a Keptn user or a Dynatrace customer. What we've worked on was more a fictional scenario of how it could help a user, but it was not a real-world scenario yet. It was more like just defining something that could happen.

B

This bullet point is more about really having a Keptn user, or maybe a Dynatrace customer, or someone who really has the pain, and we can solve this with auto-remediation, with the processes and concepts that we are developing here.

A

Then let me define it like this: either we use a fictional success story, as we have right now, or we identify a real-world use case that we can then use to validate our approaches.

B

I think we can keep both. It just says it's in scope of this working group; it's not a deliverable that we need to produce in order to have a successful working group. I think we can have both.

A

All right. I think we've said already that we can always come back to this document and reconsider it when required. But from my point of view, I think we are good enough to go on.

C

Just throw in management and leadership; that could be something to consider. Just a suggestion.

A

True, as they are very important stakeholders when it comes to adoption and acceptance. We should not forget about them.

C

Okay, cool all right, awesome.

A

Right, this goes a little bit hand in hand with how we encourage people to start using remediations here.

D

And maybe it's just semantics, but maybe you would not say "setting up" but just "end-user view of remediation actions". Obviously your management and leadership aren't setting up any remediation actions; they just view them.

C

Yeah, an end-user view of auto-remediation actions.

C

Or a view of an auto-remediation solution.

C

I really think the bulk of what we do is dig really deep into goal number one, which is really: what are all the different actions that we can do at multiple levels, and can we sequence them? I mean, yeah, okay, cool.

A

All right, yep, that one.

A

Then I think we are good to go on with the next item for today, which is the success story.

B

Johannes, sorry, but can we go back for just one more second? Since we have now discussed this, I thought maybe we can remove "draft" here and just put a date on this charter, let's say as of April 7th, so we know the working group agreed on this today.

C

Just so that we have the version history too. So yeah, definitely.

D

Not a draft anymore.

B

Previously it was a draft, and the comment that Mark had was: is it ready, can we just merge it? Then we have also agreed on one version for today, and I think it's very good.

C

I think, Johannes, yours should read "end-user view of an automated remediation solution".

A

Of an automated... thanks.

C

Yeah yeah, no, it's fine. It's just words.

A

Solution, this way. And should we specify what we mean by stakeholders?

A

Because stakeholders are a group of people. Should we break them down into more concrete ones?

C

I think, given the idea that if I'm a developer, I can think about an auto-remediation action appropriate to the component or the code I just wrote. Say, as a developer, I think about try/catch and what's in the catch: okay, I catch something, and I could trigger an auto-remediation from my exception handling.

C

That's one idea around a developer view. An SRE or a DevOps engineer might say: all right, these are things that I need to deliver with this version of the code. These are the appropriate auto-remediations, plus which ones are validated, which ones are approved, which ones can I do fully automated, and which ones need oversight?
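The developer view described here, triggering an auto-remediation from exception handling, could look roughly like the following. This is only a minimal sketch that was not shown in the meeting: `RemediationRegistry`, `restart_connection_pool`, and `fetch_orders` are hypothetical names, and the failure is simulated so the example is self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RemediationRegistry:
    """Hypothetical registry mapping exception types to remediation actions."""

    actions: Dict[type, Callable[[Exception], str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def register(self, exc_type: type, action: Callable[[Exception], str]) -> None:
        self.actions[exc_type] = action

    def handle(self, exc: Exception) -> bool:
        """Run the first action registered for this exception type, if any."""
        for exc_type, action in self.actions.items():
            if isinstance(exc, exc_type):
                self.log.append(action(exc))  # keep a record for later auditing
                return True
        return False


def restart_connection_pool(exc: Exception) -> str:
    # Placeholder for a real remediation action (assumption for illustration).
    return f"restarted connection pool after: {exc}"


registry = RemediationRegistry()
registry.register(ConnectionError, restart_connection_pool)


def fetch_orders() -> list:
    # Simulated failure; a real implementation would query a database.
    raise ConnectionError("database unreachable")


def fetch_orders_with_remediation() -> list:
    try:
        return fetch_orders()
    except Exception as exc:
        if not registry.handle(exc):
            raise  # no known remediation: surface the error as usual
        return []  # degraded result after a remediation was triggered
```

Keeping a log of triggered actions also gives the SRE view something to review: which remediations ran, and which ones should stay fully automated or need oversight.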

C

That's another view. Then I think it really is just management and leadership, because people will be in a position where they're like: we're going to start doing auto-remediation at the company.

C

How do I, as a leader or a manager, start talking about that across the organization? You might get into ROI statements and things like that. So stakeholders... maybe we drop stakeholders. I'm just thinking of that persona of management and leadership at a company being able to say: okay, I see what you guys are doing. This will be great. It will allow us to take care of all the things that are fully approved to be auto-remediated.

C

We can do other stuff, et cetera. So I'm okay if we leave stakeholders out for now.

D

So maybe it's because I work for a bank, and my last job was at a bank as well, so I've been in the financial industry for a while. But obviously auditing is huge for us, and that's one of the first questions that comes up. If you try to automate something, you have to prove that it's being done, who has the access to do it, and all this.

D

I don't know if every company is that way, but with anything that we develop, and we have this in our mind map because it came up, we have to be able to show that this stuff is happening behind the scenes somewhere, so that people can audit it for compliance. So I don't know if that's worthy enough to mention specifically here as a view, but maybe it is.

D

Because if it's not a view tailored to them, they're going to go to the developer or to the SRE and say "gather this information for me", and then they're attempting to do it from their view, which may not be an operational view. So a reporting view versus an operational view, I think, are two different but necessary things.

C

Yeah, I'm cool with that.

A

For me, it's currently hard to think of a persona that represents this person, but I get what you mean. There's always someone who wants a report on how the automation is done and how it is audited. Yeah, I totally agree.

A

All right, cool.

A

Then I think we're now good to move on to the success story. Jürgen or Mark, does one of you want to share?

A

Or we can also do it from your screen: okay, yep! I'm fine with that.

B

If I find it, I can also share, but you are already sharing it. So maybe I can start with the main idea and then hand over to Mark, because he has the nice words for everything that we have put together in here, and he can explain it in better words than I could. But I will just set the stage. The idea here was this.

B

We already had very good discussions in the last working group meetings, and our mind map was sometimes already very technical:

B

how we want to do this, and which parts are actually involved in remediation scenarios, like business remediation and also technical remediation such as infrastructure or application remediation. Then we have a couple of other parts, like testing, validation, and the process, in a lot of different aspects. What we tried to do here, and Mark, feel free to correct me if I'm wrong, is mainly to use these aspects that we already identified and add some value to them, so that once it's finished and also polished with some marketing wording,

B

it can be a very nice success story: how Keptn auto-remediation, all the concepts, everything that we have developed, is actually helping a fictional customer save a lot of money, since they no longer expect any downtime, or drastically reduced downtime, let's say. We tried to attach some value to all the aspects. They are not yet prioritized in any way.

B

It's just a couple of phrases and ideas that could be added to this kind of press release or success story. But Mark, I will let you do the phrasing and the presentation of this. I can also go ahead, but I think you can do it in a better way than I could.

C

No, no, you do a fine job, but I'm happy to. I kind of like this exercise, because personality-wise I can relate to an interview. I do all the podcasting and all sorts of stuff, and I'm always interviewing people.

C

So I was imagining a fictional interview that you'd turn into a column story or a press release, with future Keptn customer X from Acme Corporation. I also like the idea just as a mental exercise, because then you work backwards into what we're going to spend our time doing over the next number of months:

C

if we really imagine, in our wildest dreams, that somebody would say these kinds of statements about our work, then okay, what are we really going for? Are we doing the things that are going to keep us heading towards this? That's just one of the ideas. So I kind of changed things:

C

Jürgen had some great ideas in there, and I just reworded them as if it were a narrative interview of a customer giving you quotes for a press release. The one I like is: "After the validation, the days of guesswork and crossing fingers are over for us." It doesn't mention anything about a product; it's just about pure benefit. Like: oh, I used to cross my fingers, I'm emotionally nervous. To Kevin's point about risk, it's like: wow, I've got some risk.

C

Should I really do this? Can I hot-swap this memory? Can I hot-swap some CPUs? Back in the old physical world we used to do that, and it was nerve-racking. And now I'm going to let an AI bot go ahead and do this for me. There's an emotional quality to it that says: hey, we used to have guesswork and crossing our fingers. We don't have that anymore, because the remediations have been validated. We actually tested them.

C

We actually know that they work with this particular code base, which is cool. So that was one I just really liked; it really put my head in there. I don't know, Jürgen, if you had another favorite one in there.

B

Let me just go through this. Yeah, actually, the one right on top of this one: "We already know it will work in production because we tested it." It relates to the one that you just mentioned. We already tested our remediation instructions. We finally had a way to do this, because we are simulating outages.

B

We were able to plug in our remediation instructions already, and one big part is that we were confident that our monitoring or observability tool would actually alert us if something goes wrong. I was talking to Johannes earlier today about a conversation I had with Julius Volz, one of the co-founders of Prometheus. He told me that he has now founded his own company, it's called PromLens, and he does a lot of consulting. With PromLens, what his software is actually doing is this:

B

it can analyze all the PromQL queries or statements that you write, and you can analyze them on a tree basis. What he often finds is that a lot of the alerting rules are actually broken. They will never give you a result. The result set is always empty, because there is some kind of issue in the query itself.

B

It's not dividing by zero, but there is some statement that will always break the whole expression, and it will never alert. With Dynatrace, for example, that's not the case, because you would not override the AI to completely break it. But as an outcome of our solution, what we are proposing here is that you can validate your alerting, you can validate your auto-remediation instructions, and you will validate this not in production. This is basically those three lines.

B

I think it's only three lines, but it's very strong.
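The idea of validating alerting outside production can be sketched as follows. This is a hedged illustration, not PromLens or any real Prometheus API: alert rules are modeled as plain Python predicates over hypothetical metric samples, and `find_dead_rules` is an invented helper that reports rules which never fire against a simulated outage.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]           # one scrape of (hypothetical) metrics
AlertRule = Callable[[Sample], bool]


def rule_fires(rule: AlertRule, samples: List[Sample]) -> bool:
    """Return True if the rule triggers on at least one sample."""
    return any(rule(sample) for sample in samples)


def find_dead_rules(rules: Dict[str, AlertRule], outage: List[Sample]) -> List[str]:
    """Names of rules that stay silent even during the simulated outage."""
    return sorted(name for name, rule in rules.items() if not rule_fires(rule, outage))


# Simulated outage data (made-up numbers for illustration only).
simulated_outage = [
    {"error_rate": 0.02, "latency_ms": 180.0},
    {"error_rate": 0.35, "latency_ms": 2400.0},
]

rules = {
    "high_error_rate": lambda s: s["error_rate"] > 0.10,
    # Broken on purpose: a value cannot be below 0 and above 1000 at once,
    # so this rule can never alert, like the always-empty queries described above.
    "high_latency": lambda s: s["latency_ms"] < 0 and s["latency_ms"] > 1000,
}

dead = find_dead_rules(rules, simulated_outage)
print(dead)
```

Running a check like this against a staged outage is one way to learn, before production, that an alert would have stayed silent.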

C

Yeah, and then further down, Jürgen had written something really describing the automated process that happens, and maybe one of the auto-remediation actions, about how it "restored our initial landing page and kept the Instagram income stream high". And then I threw in a fake quote: "I like to go fishing on the weekend, so that's pretty awesome." I thought that was fun; that one's my favorite. You make it a little relatable, right?

C

I mean, what are we really putting in all this whiz-bang fantastic technology for? If it's keeping the income up, that's great, because I'm going fishing on the weekend. But the last one goes to this: there are a lot of things in automated testing, automated everything, where we're like: we're going to try to automate the most complex parts of our jobs so that we can dig into more complex stuff. And it's kind of the opposite approach. Really, as humans, we would probably benefit more

C

if we spent our minds the other way around, right? We should be digging into the more complex stuff, and validated auto-remediation would be the same. So the last sentence there, the paragraph, gets to this same old adage of:

C

hey, if you get automated tests, that means I can get rid of a bunch of the manual testers, and if you buy something that is like a framework for development, then I don't need as many developers, because they don't have to write all the code from scratch. There's something in the balance of how companies run that is like: hey, if I can automate this, I don't need people. Truth be told, if you automate this, you still need some of those people, and they're going to do slightly deeper digging.

C

There's some history, at least for me in the automated testing world, where this is a good thing to state. When anyone says "I'm auto-anything": I hit the autopilot, and then we've got the Boeing 737s, unfortunately. So hitting the automate button doesn't mean you don't have a pilot, and even if you do have a pilot, you could still have things whacked out. So there are some interesting things to think about in that last paragraph too.

C

Otherwise, I loved spending some time writing on this. I don't know, it's cool. I don't know what you guys think.

D

Yeah, I've got one that I'd like to add to this that could add value, I think. When we talk about SLOs, we're talking about error budgets, right? And if you exhaust your error budget, you have to stop, if you follow the true mantra of it. So by instituting auto-remediation, you get to increase your features, which brings value to your company. You are automatically increasing the amount of work you can spend improving your product.

C

Boy, that's a really good point. I like that. That's good thinking. You're right, because if you can auto-remediate those things, then you change what's considered consuming the budget, right? So you get bandwidth in that quote-unquote error budget or remediation budget.
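The error budget argument can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions: a 99.9% availability SLO over a 30-day window, and made-up incident durations (three incidents of 20 minutes each when fixed manually, 2 minutes each when auto-remediated). None of these numbers come from the meeting.

```python
from typing import List


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(incident_minutes: List[float], budget: float) -> float:
    """Fraction of the error budget used up by the given incidents."""
    return sum(incident_minutes) / budget


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)

# Hypothetical incident durations: manual fixes vs. auto-remediation.
manual = budget_consumed([20.0, 20.0, 20.0], budget)
automated = budget_consumed([2.0, 2.0, 2.0], budget)

print(f"budget: {budget:.1f} min")
print(f"manual fixes consume {manual:.0%} of the budget")
print(f"auto-remediation consumes {automated:.0%} of the budget")
```

Under these assumed numbers the manual case overspends the budget, which would mean a feature freeze, while the auto-remediated case leaves most of the budget available for shipping changes.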

A

That's a really good idea.

B

Cool, yeah, very good idea. Anyone else with some input that we should add here? I really like Mark's idea of having these answers to the questions, so we can just throw in some answers or quotes that we want to collect here. It can even be something like "I like to go fishing on the weekend".

A

I mean, we're still at the top. We have a couple of items left, but we don't have to go through all of them. I think we picked the highlights, and it's really cool what you came up with. It would be awesome if we can achieve that: when we show someone our outcome at the end, they implement it in their company, and finally we get quotes like that. This would be really awesome.

C

Johannes, just one more thing. If you scroll down in that, we put some other ideas for what these things could be. Other stories wouldn't necessarily have to be press releases. It's actually more like ideas for other types of this fictional writing that might be helpful. One of them was "The guns... they've stopped", which is a line from Star Wars. It's actually Episode IV, I think, right?

C

Sorry, that would be A New Hope, right? It's right when they're in... I've got to see, I've got to repair that. That's Episode IV. But it's the classic infrastructure person who's constantly getting crazy alerts, and suddenly they're just getting information that things are being auto-remediated. It's not really an alert, but it's like: "The alerts... they've stopped." I thought that could be the one. And Jürgen, remember, we talked a little bit about how

C

we could do a film or a short movie or something, which could be fun: visiting somebody who hasn't pushed the button to turn on auto-remediation, and so everything is just chaos. That could be fictionally really funny. Suddenly you hit the button and everything goes quiet, and then all of a sudden all the lights start going green, green, green, green.

C

"What happened?" Exactly. So that could be fun.

D

It's funny; on that topic, at my previous employer we would often have people contact us and ask: is monitoring down? Because the alerts have stopped.

C

Is the APM tool just dead? Yeah.

B

But this is something that you see in large organizations, that you always have tons of alerts open. Yeah.

D

Yeah, it's the FOMO, the fear of missing out. So many people feel like they need that information. It's a hard cultural shift to tell people: you need to change, and you don't really need that information.

D

They've always had it, and they just feel more comfortable having it. So you often give it to them.

C

I'd say the other thing that we've seen recently, with all of the remote working from home through Slack or Teams or whatever the collaboration chat is: all of the alerting started out being on a single channel, and then you quickly blow out the capacity of a human to keep track of the barrage on an automated channel. So there's the escalation and communication of what's happening with just alerts, and then you've got PagerDuty flooding something, you've got other tools flooding.

C

It became this: different sections of the architecture have different channels dedicated to them. So when you see the unread messages show up in the Slack channels, you'll see, let's say, frontend servers, services layer, database layer. If all of a sudden the unread messages go up, you're like: okay, that's just throwing alerts into Slack, but I'm seeing it as a flow of where things blew up. That might be something that we consider in the other

C

part of the working group, to say: maybe it's not totally in scope, but we're aware of leadership and management. How do we escalate auto-remediation and the process? I'm kind of excited about that part as well.

A

All right and think I think this was not a good hand over to the last bullet point for today on the agenda, which is the topic. What is our next milestone that we want to achieve? um I think we should discuss this here um in an open discussion should not be predict defined by someone. We should think about what we want to achieve. Next, let me just give you a quick recap. In the first meeting we did.

A

We started with the mind map, and we put quite a lot of ideas into it to structure our thoughts and to start discussing the topic. We continued with that in the second meeting. In the last meeting, I showed you how Keptn does remediation, and we also started talking about the charter, which we continued today, and we now also have the success story.

A

We took a look at it to have this company in mind, the one that we want to help and want to make successful, which is also helpful for working backwards. But now comes the question: how should we go on? How should we proceed here in this group? Jürgen and I talked a little bit about this topic today, and we think it would make sense to start thinking about writing a white paper.

A

We would have to define the outline first. Then, when we have the outline and structure of the paper, we can distribute the chapters so that people can do some research on the topic they are responsible for and then contribute the knowledge they have collected back into the paper.

A

Would this make sense for you? Feel free to tell us what you think.

C

I think that works; a white paper would be a good milestone. I could see, well, not necessarily chapter chairs, but focus areas would be interesting, because I think a lot of the good work for us is digging into,

C

let's say, how you would auto-remediate by workload: standard web services, different kinds of servers or workloads, data workloads, data warehouse workloads as distinct from online ones. So maybe it's by focus area. It could be chairs; it doesn't have to be that formal. But it's starting to take our own mind maps one level deeper.

C

My thought would be that we do some work there to get those things brewing, and then think about a white paper based on those ideas and some of those remediation actions, and on what's possible. With certain technologies it's easy to do some kind of automated change to remediate; with other technologies it's not. If I need to partition a database table, okay, in a NoSQL world

C

that might be very different from doing it with a big old Oracle database or something. So I'd rather take some time to dig into those focus areas and start getting our hands a little dirty: what are all the different things we would potentially automate for remediation in each of those areas? Then bringing that brainstorming back into some kind of paper would be cool. That's my thought.

D

I would personally like to maybe walk through a specific focus area. I think as you branch out into your different focus areas, you're going to have many things in common. So maybe we start with more of a PoC. The auto-remediation we're showing today is for JVM memory exhaustion: we're just going to detect it and restart the JVM automatically through some Ansible.

D

So maybe if we walk through something like that, we can sift through all those different things in the mind map that we came up with and get an outline of how we would incorporate all those areas.

D

Then that would be a base template that we work from, and then you would hit your focus areas: okay, this one's a little different, and this one's a little different. Even with the JVM restart, we started talking about how WebSphere is going to be different from Tomcat, which is going to be different from just a standalone JVM.

D

So as you start digging through the weeds, those differences start to creep out, but there are a lot of base similarities: how did you even know this was a problem? Where did you get that data from?

D

What tools are you incorporating for your auto-remediation? So yeah, maybe we start with that template and walk through it as a group first, before we hit the focus areas.

C

Yeah, and we would learn what the model of an auto-remediation is.

B

D

C

D

Yep, exactly.

C

Yeah, let's do that. That sounds cool.

A

That's also a very valid point and a very good idea: to have a PoC kind of setup where we as a team define how it should look for this particular use case, this particular scenario, like a JVM restart. Then, when we are confident and think that's the way it should be, we can start to add more content and more topics on top of that.

C

Yeah, but I think, Johannes, it becomes like a template when we dig into how we would do simple remediation, like a restart, or changing the resource configuration and doing the restart, or changing the JVM's configuration itself and doing a restart. I'm thinking a little bit of the Akamas stuff, which is really interesting, particularly for JVMs.

C

There we could find that the model can be very elaborate, with level one, level two, level three. How does it work? We'd figure out what that model is:

C

this is what we think a template for an auto-remediation looks like at multiple levels or multiple tries: level one, level two, level three; what's validated and what's not validated; am I approved to do a restart? Yes, but you're not approved to change the memory configuration. So we're really walking through what that model looks like: just pick one as a group and go through it, like a JVM memory issue.
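The multi-level model described here (try level one, check the result, escalate only if the next level is approved) can be sketched roughly as follows. This is a minimal illustration, not Keptn's remediation API: `RemediationLevel`, `remediate`, and the `approved` flag are hypothetical names introduced only for this example, and the "restart" action is simulated.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RemediationLevel:
    """One rung of a hypothetical escalation ladder (level one, two, three...)."""
    name: str
    action: Callable[[], None]  # what to do at this level, e.g. trigger a restart
    approved: bool              # may automation run this without a human?


def remediate(levels: List[RemediationLevel],
              is_healthy: Callable[[], bool]) -> Optional[str]:
    """Run approved levels in order and validate after each one.

    Returns the name of the level that fixed the problem, or None if the
    ladder is exhausted or the next level is not approved for automation.
    """
    for level in levels:
        if not level.approved:
            # Stop and hand over to a human instead of skipping ahead.
            return None
        level.action()
        if is_healthy():
            return level.name
    return None


if __name__ == "__main__":
    # Simulated service state; a real check would query monitoring data.
    state = {"healthy": False}
    ladder = [
        RemediationLevel("restart", lambda: state.update(healthy=True), approved=True),
        RemediationLevel("change-memory-config", lambda: None, approved=False),
    ]
    print(remediate(ladder, lambda: state["healthy"]))
```

Stopping at the first unapproved level, rather than skipping it, is one possible design choice; it keeps the automation from jumping to a more invasive action than the one a human would have to sign off on.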

B

A

Okay, then, we have just a couple of minutes left. How can we prepare the next meeting to get started on this PoC?

C

Well, I think we could create a doc on JVM memory exhaustion and then just start dumping ideas. What are the different ways we detect it? What are the different ways we would validate it? What should we validate in the process of conducting the remediation?

C

What are the steps you would look for in a healthy JVM restart? Start laying that out as a process. It could be a process flow, but also just different ideas. We could relate it back to an SLO to validate whether the remediation succeeded or not, and capture the gotchas and so on. But I think we just need to start collaborating on a doc and sharing some ideas.
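Relating a remediation back to an SLO could look roughly like the sketch below. It assumes the indicator is a response time in milliseconds sampled after the restart and that "success" means a given fraction of samples meet the objective; the function name, the threshold, and the 95% default are illustrative assumptions, not something defined in the meeting or by Keptn.

```python
from typing import Sequence


def remediation_succeeded(post_fix_samples_ms: Sequence[float],
                          slo_threshold_ms: float,
                          required_fraction: float = 0.95) -> bool:
    """Return True if enough post-remediation samples meet the objective."""
    if not post_fix_samples_ms:
        # No evidence is treated as "not validated", not as success.
        return False
    good = sum(1 for s in post_fix_samples_ms if s <= slo_threshold_ms)
    return good / len(post_fix_samples_ms) >= required_fraction


if __name__ == "__main__":
    # Hypothetical response times collected after a JVM restart.
    print(remediation_succeeded([120.0, 130.0, 110.0, 140.0], slo_threshold_ms=200.0))
```

Treating an empty sample set as a failed validation is deliberate in this sketch: a restart that produces no traffic or no data should not be reported as a successful remediation.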

C

Yeah, I like that. So, Kevin, to your point, it's JVM memory exhaustion.

C

And also the indicators, right? When can I see, say, virtual memory exceeding the physical memory allocation, above and beyond the current level? There are things you can see pre-issue to forecast something. There are all sorts of interesting things that we as professionals know how to look for, and we're going to try to brain-dump them into this proof of concept. For sure, I like that. Thank you, Kevin. That's good, that's cool.
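One simple way to turn a "pre-issue" indicator into a forecast is to fit a straight line through recent heap-usage samples and estimate when it reaches the configured maximum. The sketch below assumes heap usage in megabytes sampled at known timestamps; the function name and the linear-trend assumption are illustrative only, and real JVM heap curves are sawtooth-shaped because of garbage collection, so post-GC values would be a more sensible input.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_exhaustion(samples: Sequence[Tuple[float, float]],
                             heap_max_mb: float) -> Optional[float]:
    """Estimate seconds from the last sample until heap usage hits heap_max_mb.

    samples: (timestamp_seconds, heap_used_mb) pairs, oldest first.
    Returns None when there are fewer than two samples, when all timestamps
    are identical, or when usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: growth of heap usage in MB per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (heap_max_mb - last_used) / slope)


if __name__ == "__main__":
    # Hypothetical samples: usage grows by 60 MB per minute.
    history = [(0.0, 100.0), (60.0, 160.0), (120.0, 220.0)]
    print(seconds_until_exhaustion(history, heap_max_mb=400.0))
```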

A

All right, there we have our use case.

B

I will just create the document and link it here, so that we already have something we can share and everyone can already add some ideas.

C

Just as a side note: obviously there are different performance engineers we all know who have specialties. I know who to go to if I've got Oracle performance issues; I can go to the academic book guys, I can go to consultants. There are people we may want to pull in as we develop these things and make them sort of anointed reviewers, like when you write a book, you have reviewers. So these are sort of reviewed by

C

maybe some industry names that people might recognize, depending on the technology we're working with. It could be entire companies, as we build a framework that they can adopt and say: oh, we do auto-remediation, but now we can talk to Keptn and other tools for open auto-remediation.

C

But that's for when we get into this template and think: okay, now I'm at a point where it's really super-detailed memory configuration within the JVM, so maybe we can get a second opinion from an external source. Say, the person who invented the G1 garbage collector: hey, would you like to review this idea? That's just a tangential idea as we move forward, in terms of being focused on the industry and adoption, so that people would say: oh yeah,

C

No, this is cool. We would totally use that. So yeah.

A

That's a really good idea, to bring in specialists and other folks who may have more knowledge in particular topics than we have as a group. As we progress and define the use case for us, we can do an outreach. All right, Jürgen, thanks for creating the doc.

B

But everyone should be able to access it.

A

Then let's define this as an action item: we add our thoughts and ideas for this use case into this document. It's kind of a brain dump again, but it should be focused on the problem, JVM memory exhaustion, and how we want to get that remediated. Maybe also think about how this could fit into our success story with the Acme Corporation.

D

A

We can add an additional quote here that says this company can save x million dollars because we can fix their JVM problem. Think about that as well. But let's continue working on the use case.

A

It's really cool.

A

And if you need the mind map, it's also linked.

A

Down here, yep. It was in the second meeting; there's a link to the mind map, and there you'll find all the initial thoughts that we had.

A

I think there's already a JVM restart on there, but just go over the mind map and get some additional thoughts and ideas.

C

Yeah, one of the things I think we did brainstorm, for our scenario, our first template or walkthrough of the proof of concept, was being able to publish auto-remediation rules, actions, etc., like a plug-in. So if I'm vendor X with some cool new thing, I could say, just like I would say: hey, we can talk OpenTelemetry, so if you install our stuff, we'll cooperate with trace IDs, just like anything else in your entire ecosystem.

C

Same thing for auto-remediation. If I get a new vendor X component and I put it in, it comes with the auto-remediation things that can talk to whatever auto-remediation framework you're using. For us, it would be Keptn. That plug-in ecosystem is something we might also talk about as we work through the proof of concept, get out the other end, and say: all right, what does the process really look like?

C

What if, let's say, Tomcat got on board and said: hey, we're going to build auto-remediation. How would you publish this as a plug-in? And then, of course, you're building a marketplace, which is a whole other thing: do I charge an extra ten dollars? You can buy the plug-ins for auto-remediation or something. But that's a whole other ball game.

A

All right, cool. Great, thanks. Like always, a cool meeting and great talking to you. Let's jointly work on this use case, on this PoC, and next time we'll see what we came up with.

C

A

And talk about that.

C

Did you guys see JP's comments in the chat?

A

C

No, it's just nice. He was here for a while, lurking and listening.

B

Cool all right.

C

Thank you. Thank you very much, as usual.

B

A

D