From YouTube: Keptn Auto-remediation Working Group - May 12th, 2021
Description
Meeting notes: https://docs.google.com/document/d/1y7a6uaN8fwFJ7IRnvtxSfgz-OGFq6u7bKN6F7NDxKPg/edit
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on GitHub: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
A: Hi everyone, and welcome to the next episode of the auto-remediation working group. Today's topic is the tool-agnostic remediation process: why it's needed, what the benefit of setting it up is, and why it's important. Last time we talked about the remediation template that we want to work on.
A: I will open it in a couple of minutes, but during that discussion we figured out that it's not really obvious why we should build a remediation workflow that is tool-agnostic. I mean, there are solutions out there like Ansible where you can build remediations and then automate your operations.
A: So why does it make sense to have a separation between the process and the tooling?
A: This is what I want to demo today, and then I also want to drive the discussion. But before doing that, let's jump to the template, to the scenario we agreed to work on: it's a template for defining a process around JVM memory exhaustion. When we detect this problem, we want to kick off a remediation process that fixes it for us. And what did we define during the last two meetings?
A: It's the process as laid out right here. First there is the data gathering that we want to kick off. Then the change process that needs to be tackled, in other words, who is allowed to trigger the remediation and who should be involved in fixing the problem. And then the recovery process itself, where Kevin gave us really great input on how they are solving this problem.
A: They use GitLab to spin up an Ansible Tower and execute the playbook. Then we want to do the validation step to make sure that the JVM is running again and accepting traffic, and finally, in case we could not resolve the problem, we escalate it to inform other parties that should be involved. This is the process we talked about last time, and we also defined action items on it.
A: Kevin, you texted me, or sent me an email, about the action item of working on the sample app. Can you share with the group what you accomplished or what you can contribute here?
B: Sure. A member of our company had created a sample app that basically just had two buttons: one increments a counter to show that the application is working, and the second button kicks off a loop that exhausts the memory.
B: So I reached out to him to see if we could turn that over to this group, since it was already done and working. He did remind me that we have a company policy about sharing any intellectual property created in-house, so even though it's very basic, we didn't want to share it directly. But he did give me the looping code that runs it out of memory.
B: Then we talked about possibly putting it into the sample app used for Keptn, the carts service in the Sock Shop app, and you had asked me for a good location. Oh, my computer wants to restart, all right.
B: You had asked me for a good location to put it, and I was looking at that sample app and realized that they already had a similar scenario built into it, based on CPU exhaustion, triggered by sending it a certain ID for an item in the cart. So I suggested: wait, why don't we do that same thing? We'll create a unique ID for JVM memory exhaustion, insert that looping code, and then we should be good to test it that way.
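The looping code itself is tiny; the real snippet is Java (the carts app is Spring Boot) and wasn't shared directly, but the idea is just an unbounded allocation loop that holds references. A hedged Python sketch of the same idea, with an optional cap added so it can be run safely (the cap is my addition, not part of the app described):

```python
def exhaust_memory(chunk_mb=1, limit_mb=None):
    """Allocate memory in fixed-size chunks, holding references so nothing is
    garbage-collected. With limit_mb=None this runs until the runtime raises
    MemoryError (the JVM equivalent throws OutOfMemoryError)."""
    hog = []
    allocated = 0
    while limit_mb is None or allocated < limit_mb:
        hog.append(bytearray(chunk_mb * 1024 * 1024))  # allocate chunk_mb MiB
        allocated += chunk_mb
    return allocated

# Safe demo run: stop after 16 MiB instead of actually exhausting memory.
exhaust_memory(chunk_mb=1, limit_mb=16)
```

In the carts app this loop would sit behind the special item ID mentioned above, so hitting that ID triggers the exhaustion.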
A: All right, yep, thanks for the update. I think doing this and going with the carts app would make sense, since it's a Java Spring Boot app. Adding this additional scenario into the app makes sense and should work out, all right.
C: Since Alex is joining here as well: I was talking to Johannes today about whether it would make sense to use the pod-tato-head application for this. Do you know, Alex, are there any plans to have a Java service also in the pod-tato-head application? Because right now everything is written in Go, and what the working group has agreed here is to first take a JVM process.
D: There's the distributed version that's currently in a branch; there is a PR open. It's not yet merged because the Litmus people have worked on it, and they will keep working on it. So you could easily provide a leg or a hat service that is written in Java. You just have to follow the same HTTP REST interface, so that wouldn't really matter. You wouldn't have to replace all of them; we'd just have to provide one of those services as a Java service, if you wanted to.
C: Okay, because if we have to build it into one service, then we can also think about reusing the pod-tato-head services.
D: But you would have to install it inside the image.
A: All right, okay, this could be option two if we don't want to go with carts. And as we also discussed last time, I want to demo today how Keptn can implement this process: the data gathering, change process, recovery and so on. This is what I actually want to do today, and then also drive the discussion of why to have a tool-agnostic approach. Let me get started by going to Keptn first, and here is what I did.
A: The stage is called production-customer-a, and what I basically did is add this remediation sequence, this one. Then I added the tasks: the data gathering, get-remediation, action (meaning really doing the execution), and then the evaluation task.
A: Then there is a sequence that takes an action when a remediation failed; this one I just called escalate. I configured a trigger, and this trigger is configured in a way that it starts this sequence whenever a remediation failed. It has just one task in it, the escalation task. This is what I added to my shipyard, to my process definition, and I'm now able to trigger that particular sequence.
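A shipyard like the one described might look roughly like this. This is a hedged sketch, not the exact file from the demo: the stage, sequence, and task names are taken from the discussion, and the trigger syntax follows the Keptn 0.8-era shipyard format as I understand it, so treat the details as assumptions:

```yaml
apiVersion: "spec.keptn.sh/0.2.0"
kind: "Shipyard"
metadata:
  name: "shipyard-remediation-demo"
spec:
  stages:
    - name: "production-customer-a"
      sequences:
        - name: "remediation"
          tasks:
            - name: "datagathering"      # collect context about the problem
            - name: "get-remediation"    # look up the matching remediation
            - name: "action"             # execute it
            - name: "evaluation"         # validate the fix
        - name: "escalation"
          # start whenever the remediation sequence finishes with a warning
          triggeredOn:
            - event: "production-customer-a.remediation.finished"
              selector:
                match:
                  result: "warning"
          tasks:
            - name: "escalate"
```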
A: Today I will use Postman to do this manually, but just imagine that this can be automated by an integration with a monitoring tool that monitors your environment, identifies a problem, and, when the problem appears, sends out an event to Keptn to kick off the remediation process. This is what I'm doing now manually.
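What Postman sends here is a CloudEvent posted to the Keptn API. A minimal sketch of constructing such a payload follows; the event type pattern and field names follow the Keptn CloudEvents spec as I recall it for this kind of sequence, and the project/service names and the endpoint in the comment are illustrative assumptions:

```python
import uuid
from datetime import datetime, timezone

def build_remediation_trigger(project, service, stage, sequence="remediation"):
    """Build a Keptn-style CloudEvent that triggers a stage's sequence."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "time": datetime.now(timezone.utc).isoformat(),
        "source": "postman",  # in production: the monitoring tool integration
        # Keptn derives which sequence to start from the event type:
        "type": f"sh.keptn.event.{stage}.{sequence}.triggered",
        "contenttype": "application/json",
        "data": {
            "project": project,
            "service": service,
            "stage": stage,
            "problem": {"problemTitle": "JVM memory exhaustion"},  # illustrative
        },
    }

event = build_remediation_trigger("demo", "carts", "production-customer-a")
# This dict would be serialized to JSON and POSTed to the Keptn API
# (e.g. POST {KEPTN_API}/v1/event with an x-token auth header).
```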
A: I see that the remediation has started, and when I click on it here, I already see the remediation is running and the data gathering step is executed. In a couple of seconds... let me just refresh.
A: For some reason the next task is currently not triggered, but let me just show you an execution I did a couple of minutes earlier, where the data gathering was done, the remediation was retrieved, the action was executed and the evaluation happened.
A: I have to be clear that this is currently not actually implemented in the background; I'm just showing that the process can be modeled and executed as laid out here, because the services that should do this job are not available yet. Instead, I'm using this echo service. It's a very simple dummy service that just receives the event and sends back that it will take care of doing the job.
A: It waits a minute, or a second.
A: Then it finishes the action and returns back to Keptn. For this particular task you can see that two services were actually listening, the echo service and the data-gatherer service; both were running in parallel doing their job, and after a couple of seconds they were done, and Keptn moved on to the next task in the process, which was get-remediation, also mimicked and simulated here by the echo service. As I said, I also configured my remediation process in a way that the escalation will always happen, because the trigger I'm listening to here has a filter on warning, and whenever the warning appears, the escalation will be triggered.
A: Kevin, do you have feedback on that? What's your take?
B: Yeah, just to recap what I was speaking of last time; as we talked about it, I kind of had an aha moment. In our remediation process, as we go through each of these steps, every one was done by Ansible, because it needed logic, programming logic, that Keptn itself wasn't really built to go do.
B: So, for instance, the data gathering: okay, we can send the problem event into Keptn from Dynatrace, but then some smarts needed to go and retrieve more information that we could act upon. So we were using Ansible for everything, and my statement, I guess, was: if I'm only using Keptn for auto-remediation...
B: ...should I install another product to do this, or should we just continue to use Ansible, since that was doing 100% of the work? Why put some other tool in the middle and have a dependency? Then we further discussed the fact that Keptn is also available for quality gates, and that kind of spurred the discussion of: oh yeah, right now we're talking with an operational mindset of "we have this thing in production."
B: Something happens, we want to fix it. But it can do more than that: hey, we just deployed new code out of this pipeline, we want to run tests against it, then validate, oh nope, it's bad, now we want to do a rollback. That process is also remediation. So I think the value grows the more Keptn is in your process of quality gates and in the pipeline.
B: So I do understand that value now, but I just don't know how you would sell this to people that are only interested in auto-remediation.
B: They kind of have a tool of choice right now, Ansible, Puppet, Chef, whatever their tool is, and that's where I guess I would need help seeing: where's the strict value-add where we could say, hey, you really need this product? And something I was thinking about as we were walking through this, too: now that we have this template built, maybe we should go back to the mind map that we did and start...
B: ...linking the things that we built in our template and see how many things we've created out of our mind map. Because one of the things we talked about is: hey, we want a tool that gives us some audit capabilities, or approval capabilities. I think we've built some of that into our template, but just as a side note, maybe we should step back and see how well we're following what we theorized in the beginning.
A: Okay, to your first argument, or the first note, why should I use Keptn now instead of Ansible: with Ansible, what you get is that you combine all the actions, or all the tasks you want to do as part of your remediation, into one playbook, or runbook.
A: Just imagine how difficult it is to exchange certain components within this runbook: you have to go through the runbook, modify it, and take care of updating all the parts that need to be changed. With Keptn in place, you add a tool on top, that's true, but it abstracts away the tooling underneath.
A: Because, in the end, you can easily exchange the components that are doing the individual tasks of the process.
B: It makes a lot of sense, but just for auto-remediation only?
E: Hey Kevin, this is Andy. I'm sorry that I'm joining late, and maybe this has been discussed before, but I just tuned into the conversation. One question on your auto-remediation scripts in Ansible: you said everything is in Ansible. Does that mean you have built libraries upon libraries, I don't know how many hundreds of thousands of lines of automation, in Ansible? Is that correct?
B: No, we're just starting down this journey, so we only have, I think, three or four playbooks. Each of these steps is kind of broken out into playbooks. And what we did, to sidestep Ansible Tower licensing, is actually run it through GitLab: we have Dynatrace call a GitLab pipeline via API, and that kicks off Ansible runners out of GitLab, so that the node licensing is actually just for that runner...
B: ...out of GitLab, instead of for each of the endpoints. That wasn't my choice, and I don't know if we're going to stick with it or not, but supposedly the licenses were going to cost us millions of dollars, so we came up with this workaround. So the GitLab pipeline has a few steps in it that do these different steps. Okay, so.
C: The main part, and I also mentioned this last time.
C: But if you want to get started, or if you want to build new integrations or new runbooks: if you have everything in Ansible, then you might end up writing every new playbook in Ansible too, but maybe there is another way to do it. Maybe for some action you want to execute, a webhook is fine and you just call it; or maybe the best action sometimes is just executing a bash script.
C: Sometimes it's toggling a feature flag. For all these kinds of smaller actions, with Keptn I guess it's quite easy to have these actions in whatever language fits best, and Keptn will call them, instead of having everything in large playbooks with conditions for whether they get called or not.
B: Yeah, I think the other thing to consider: one of the things we do with Ansible is update the ServiceNow incident that we created, basically updating it the same way Keptn is updating. So we say: hey, we just did our data gathering and we discovered the PID, and here it is, so that somebody can go look at the incident and see what the remediation process did. So as a developer of this: okay, I know Ansible.
B: I know how to connect to ServiceNow; in fact, I just copied somebody else's playbook that already did this function. If I'm using Keptn now, that's a different service, so now I have to learn: all right, if I wanted to use the ServiceNow service from Keptn, I have to understand how that one works.
B: What do I need to plug in? So you're kind of multiplying the knowledge that you need for the different services that connect to different systems, versus doing it with one tool like Ansible. Not saying it's bad or good, just an observation for discussion.
D: Yeah, I posted a link to a blog post where we describe this concept of micro-operations, where I outline a number of reasons why these traditional operations workflows usually do not work. And I think Johannes should make this into the demo, because one key thing usually is that you have a remediation file that you ship along with the release.
D: This is usually a big issue, because otherwise it's a multi-system update. So you can ship a new version of your application together with your remediation instructions. And that's a point about the demo: you currently have the entire instructions encoded in the sequence and not in the remediation file. If you, for example, wanted to change some of your procedures for the next release, so that it's not, say, a JVM restart but something different; for an out-of-memory it can obviously be something different.
D: Depending on what the remediation action is, you can ship it along with the code artifacts; that's one thing. Also, for microservice applications, these scripts tend to get super complicated. I think that was a good question from Andy: how many of those scripts do you have? If it's two or three, it's fine, but what if there's an individual remediation action for each service you're running, and you're running hundreds of interdependent services?
D: Writing these end-to-end remediation flows at some point gets super hard. Another point: obviously you have those scripts, and by the way, the way you could also run them in Keptn is that you simply have a service that does the Ansible run as well, because we could run the Ansible script directly.
D: So right now the tool integrations and the actual process that you're running are in the same file. We talk to a lot of companies, especially in restricted industries, that want to separate this out, and for them pipelines are a massive problem, because policies on how certain issues should be handled are directly intertwined with the rest of the workflows, and it often is a massive issue to audit whether those processes still work the way they were supposed to. But from the pure "can you do it" perspective, I totally agree.
D: If the only thing I really want to do is restart a JVM, then just calling an Ansible script directly is definitely the simplest solution, if you wanted to do it.
E: To what you were saying, I think this is the important piece, right? Ansible is what I call a do-it-yourself Swiss Army knife: you can do everything. But the question is, if you now start out with three teams and they're building an Ansible script, and you already mentioned...
E: ...if the next team comes on board, they probably copy and paste something from another team and just add a little piece to it. And maybe what they copy and paste includes the integration with ServiceNow, the call to ServiceNow, which means all of a sudden you have this piece of tool integration as part of 3, 4, 5, 10, 20 different Ansible scripts. In the end this is all technical debt that you're building up, because what if ServiceNow changes the API?
E: What if you don't want to do a simple status update in ServiceNow but something else, and you need to go through all of your different scripts and change them? The reason is that you have everything intertwined in one Ansible script, all the logic. And I think this is what we're trying to solve here, right? We're trying to avoid 80 to 90 percent of the automation code that you have to write...
E: ...because we are providing this integration in an event-driven way. We provide these tool integrations so that whoever defines your auto-remediation process can say: this is the process, and here is one step where our developers can specify what should happen to remediate a specific issue in their application. Oh yeah, they have an Ansible script that restarts it, but the rest can nicely and independently be defined by your, I don't know, SRE teams or whoever is taking care of it.
B: Yeah, I think it's maybe how we were approaching it, considering the remediation process more of a global project instead of having it packaged within each application, doing your monitoring-as-code and keeping it with the app. We were more or less doing it as a global type thing: any JVM memory problem that existed in Dynatrace would all hit the same remediation process, no matter what the app was.
B: And I think the other thing that compounds the confusion, maybe, is that when we started this working group, we didn't want to isolate it to cloud-native services. We said there are companies that are still dealing with monoliths, including ours, and of course those are the ones that our leadership wants to have solved immediately: hey...
B: ...why can't we auto-remediate these monoliths that we have, so we can use the extra help to start working towards microservices and cloud-native? So by starting with those monoliths, I think we're starting with some more antiquated ideas and a different design than what this was intended for, and maybe that just complicates it.
D: So you want to move away from a wiki page, or a traditional human-centered runbook, to an automated process, and everything you move in that direction obviously is great, because it helps you automate and replay this process, and, as you also mentioned, it can really be tested automatically as well. The next question is how many tools you still have to touch to implement it, and I think that's where you can gradually find the right balance for what you want to use.
B: Yeah, I mean, I personally want to use Keptn. I tried to get it instituted for our POC and kind of got outvoted by others that wanted to keep it simpler.
B: We want to start doing the quality gates and everything, and if you've already got it there for that, why not continue down the path with auto-remediation. But I think the one thing that kind of stopped us from going down the Keptn track was the fact that Dynatrace was going to build it in. So we were caught between: okay, we've got it stood up, but it's open source, and they're getting ready to build it directly into the Dynatrace product.
D: But I have one conceptual question here about running your Ansible script. I mean, nothing against Ansible scripts; my first ever demo was running an Ansible Tower instance and an Ansible script based on a Dynatrace problem.
B: Yeah, so at the end of the Ansible playbook it goes through a loop and checks: it hits the Dynatrace API and says, hey, has this problem closed or not?
B: Yeah, so the very first step is the data gathering, because the only thing that comes across is the problem ID. Then we go and get what entity was affected, so that we can find the host to go to, and the command-line arguments, so that we find the right JVM in case there are multiples.
B: So we gather all that, and then we do the kill and validate the kill, because this is just a very simple Java app run with a java -jar command line. After it's running again, it goes back and checks, I think it pokes it maybe every 10 or 30 seconds, to see if the problem has closed, and once it's closed, it'll update ServiceNow.
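That validation loop, polling the monitoring tool's problem API until the triggering problem reports closed and then updating the ticket, can be sketched tool-agnostically. The status fetcher is injected here because the real call (Dynatrace's problem-status endpoint, auth token, and status values) is environment-specific; the interval is just the one Kevin mentioned, and the timeout is an assumed default:

```python
import time

def wait_until_closed(fetch_status, poll_seconds=30, timeout_seconds=600):
    """Poll fetch_status() until it returns "CLOSED" or the timeout elapses.
    fetch_status stands in for a GET against the problem-status API."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if fetch_status() == "CLOSED":
            return True   # remediation confirmed; next step: update ServiceNow
        if time.monotonic() >= deadline:
            return False  # still open after the timeout: escalate instead
        time.sleep(poll_seconds)

# Example with a fake fetcher that reports CLOSED on the third poll:
states = iter(["OPEN", "OPEN", "CLOSED"])
wait_until_closed(lambda: next(states), poll_seconds=0, timeout_seconds=5)
```

The same shape works whether the loop lives at the end of an Ansible playbook, a GitLab job, or a Keptn evaluation task.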
E: And so I think this is exactly the logic that we were talking about. You can build all of this into the Ansible scripts, but now somebody has to maintain it, and this is the logic that we built into the core automation workflow engine of Keptn, right? Keptn will trigger your remediation, and then the next thing it does is validate back with Dynatrace whether it solved the problem. It's doing it in a similar way: it looks at the problem that initially opened up the remediation workflow.
B: Our own workflow documentation too, because after we did this, we discovered: well, okay, Ansible worked, but we can't see anything. So we had to build in updating ServiceNow, and then I created a business rule in ServiceNow to go update Dynatrace, to show it on the entity itself.
B: We have not yet. We did the Ansible creation and a POC to our CTO, I guess about a month ago, and then we went through and created that template where we said: okay, beyond the very simple case, we now need to find startup scripts and shutdown scripts, and, you know, capture logs or config files, or, for IBM, run...
A: All right, yep. We were actually thinking about the challenge of collecting all this data, and then: what do you actually want to do with this data? Because when you already know the problem and the root cause of what is happening, and you have the mapping between the root cause and the action, why is there a need to collect more data up front?
B: Oh sure, that's because we're a bank and we deal with vendors all the time, and IBM is notorious for it. If you have never had to open a support ticket with them: they are basically going to tell you, "without running these diagnostic scripts there's not much we can do for you." You can have Dynatrace and you can say, "but I see it in Dynatrace, here's your problem." Well, they don't rely on that.
B: "We need to look at our own data, our own logs." So they'll have you run heap dumps, or they'll tell you, "well, the next time it happens..." Maybe it's a delay tactic, I don't know, but you often have to gather those things for the vendors, and different vendors want different things. So it's maybe less for ourselves and more for the vendors.
D: But right now you actually made a good argument for a more modular approach, because you say that depending on which application, which type of JVM, needs to be restarted, this process is going to be different. That means you will need different versions of different steps in there. Obviously, the actual restart...
D: ...might be the same, I mean, you know whether the JVM is up or not, most likely, but the data gathering might be different in this case. In some cases even the restart scripts might be different, depending on which application you're restarting.
D: So while the overall process flow always stays the same, gather data, then trigger the remediation, check whether it worked, or escalate, what these tasks actually do might at some point turn out to be different. Which means you need a library where, depending on which process you're running, you're more or less compiling the individual steps on the fly.
B: Again, we're kind of still in the infancy, learning as we go and discovering: hey, this might be good, this might not be good. So it's ever-evolving for us, but yeah, I like the modular approach to this. It was just...
B: ...we were keeping it simple, and now that we're expanding it, as you said, I definitely see the need to have it modularized like this, for the different systems and the different scenarios that are going to come into play.
B: So we came up with an idea of tagging the services based on whether they were ready for this type of service, but we hadn't gone down that path too far, because we didn't want just any JVM restarting; we had to have that information about it first. Startup, shutdown, each one potentially could be unique. I think we need to do a canvass; let's say: okay, WebSphere.
B: Do we have all of our startup and shutdown scripts in a templatized location? Hopefully we do. Probably, as we canvass, we'll find that there are all kinds of edge cases, and if we do it globally, we would have to account for that. So I think, as we go down this journey, that's also going to be a good reason for us to start pushing back to each support team to model their systems in a fashion that can be automated.
A: All right, these are really great discussions that are going on, but let me just jump back to the agenda for today.
A: We do have a couple of minutes left, and what I want to touch on today is how we continue with our scenario and our template for getting the JVM memory exhaustion problem remediated.
A: First, the sample app and the setup to simulate the JVM problem.
A: Should we go with the carts app to get this simulation done, or should we give it a try and test it out with the pod-tato-head implementation?
C: I think the quick win is to do it in the carts app, since the PR in pod-tato-head is not yet merged. But for the long run, it just depends on when we want to have the sample app ready. For the long run I would go for pod-tato-head, as it's a more mature app and maybe a little bit more...
C: ...it also has some dependencies that we can then take into account. But if we need it really fast, then the carts app is very easy, because we already have the build process, we have everything; we just have to put in the loop that Kevin can provide.
B: For you and your team? There's a chance, yeah. I can ask; I can see what we can do generically that won't get us into hot water.
A: Actually, let's go back to the first item: is someone up for implementing this change into the carts service, I mean the JVM problem?
A: But let me just... I will take care of getting that one implemented and done.
A: The process orchestration via Keptn we discussed today, and I already showed you how it could look. I would propose to go this route and use Keptn for doing our remediation scenario. And the data gathering part: this is the first action where it's maybe not that obvious or clear what should be done. Can we use the next two weeks to figure out what we have to do there and how this should look?
C: Yeah, maybe I can also comment on this, because, Johannes, we were talking about this today too. I would be really interested in Kevin's example: what are you currently collecting, and where are you storing it? Is it only for archiving purposes, or is it also for analysis purposes and finding the right action? Thinking about it as a modular approach, the data gathering can be done in different parts.
C: There can maybe be an analysis part to find out what might be the right action. But what would be interesting is which data needs to be collected, and where the data is currently stored. Is it already in some kind of, I don't know, Prometheus or Dynatrace, or is there any other data hub that holds some of the logs, or do you have to go directly to the box and fetch more data there? So where is this located right now?
B: Yeah, in our case we were kind of just walking through what a support person would do. They get called and are told: hey, this JVM is out of memory. One of the typical things they'll do for the vendor is capture a heap dump so that the vendor can review it. Let's say Dynatrace, which would have the memory dump for you, wasn't the tool of choice that discovered this; or hey...
B: ...the ActiveGate wasn't configured for the memory dump analysis, somebody messed up and the data's not there. Again, you want to make sure there are no oopsies, so go ahead and kick off that JVM memory dump and capture it so that it can be analyzed later. Some vendors obviously have scripts that they like to run that go and collect different pieces of data: OS data, JVM data, it runs the gamut. So again, I think it's a plug-and-play of options depending on your app.
B: I would say yes, it's stored somewhere for someone to deal with, yeah.
D
Really
want
to
do
is
execute
a
generically
like
one
action,
it's
always
the
same,
but
we
are
ignoring
that.
The
developer
might
specify
that
for
their
service,
they
can
restore
it
or
cannot
restart
it,
and
we
could
show
the
different
behavior
where
we
say
well,
this
service
can
be
restarted
or
the
service
cannot
be
restarted
and
then
have
the
process
behave
differently
and
give
the
developer
control
what
is
allowed
on
their
service.
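A per-service declaration like this is what Keptn's remediation file is for. The sketch below is illustrative only: the `problemType` string, metadata name, and action name are assumptions made up for this JVM scenario, and the actual handling depends on which action provider is subscribed in the project.

```yaml
# remediation.yaml - per-service remediation config (Keptn 0.8.x spec format).
# Omitting the restart action entirely is how a developer would declare
# "this service must not be restarted".
apiVersion: spec.keptn.sh/0.1.4
kind: Remediation
metadata:
  name: carts-remediation
spec:
  remediations:
    - problemType: JVM memory exhaustion
      actionsOnOpen:
        - action: restart
          name: restart-jvm
          description: Restart the affected JVM after a heap dump was captured
```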
D
I know it's not exactly this example, but I think the example is to some extent so straightforward that you don't need any powerful solution such as Ansible to run it. But Kevin obviously also brought up the case that it's not like that for all of their JVMs right now. It was also what he brought up around tagging: you also have the possibility that a certain service, or a certain service version, is not allowed to be restarted, because it maybe doesn't support it.
B
Yeah, I wonder if that ties into step two, the change process there. If you have this documented in your CMDB, it might say: this can be restarted, but maybe only during the maintenance window, so you can't do it right now; or: yes, it can be restarted, but you need to get an approval first, whether that be a Slack bot approval or some other process. I don't know, maybe that ties into that change process of allowing or not.
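The CMDB-driven gate just described can be sketched as a single decision function. Everything here is hypothetical: the `ChangePolicy` record and its fields stand in for whatever the CMDB actually stores, and the approval flag stands in for a Slack bot or ServiceNow check. It assumes a maintenance window that does not cross midnight.

```python
from dataclasses import dataclass
from datetime import time, datetime

@dataclass
class ChangePolicy:
    """Hypothetical per-service change record, e.g. pulled from a CMDB."""
    restart_allowed: bool
    needs_approval: bool
    window_start: time  # maintenance window, local time
    window_end: time

def may_restart(policy: ChangePolicy, now: datetime, approved: bool = False) -> bool:
    """Gate a restart on the change process: the service must allow restarts,
    'now' must fall inside the maintenance window, and an approval (e.g. via
    a Slack bot) must exist if the policy requires one."""
    if not policy.restart_allowed:
        return False
    # Assumes window_start <= window_end (window does not span midnight).
    if not (policy.window_start <= now.time() <= policy.window_end):
        return False
    return approved or not policy.needs_approval
```

In a real setup, the remediation orchestrator would call a gate like this before step two fires, rather than encoding the rules in each runbook.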
A
I mean, you mentioned last time that you are using ServiceNow for checking the approvals and making sure that the person who is allowed to do the action gets asked. Is this still the case, that you're using ServiceNow for that?
A
Yeah
till
till
the
next
meeting
I
mean
I
will
take
care
of
the
cards
app
and
then
kevin
you're
about
the
ansible
script.
And
what
else
should
we
tackle
on
to
get
the
scenario
running
for
us.
C
Yeah, I think it's important to define what the parts of the whole process are that have to be there to go beyond this hello-world example. I think we agreed that there has to be a validation part, because that's one of the parts we identified early on. Data gathering, I think, is also an important part that we already identified, as is the approval process.
C
In
my
perspective,
it
makes
a
lot
of
sense
to
think
about
how
our
approval
is
currently
done,
and
I
believe
they
are
not
right.
Now
they
are
not
done
inside
character,
so
they
are
done
somewhere
else,
so
it
has
to
be
kind
of.
How
can
this?
How
can
we
process
this,
and
how
can
we
tools.
C
How
can
we
still
use
like
captain
as
the
orchestrator
in
which
tools
are
currently
doing
the
process,
and
I
think
we
also
have
to
take
a
look
here
on
tools
like
page
of
duty
or
other
tools
that
are
like
incident
response
tools,
how
they
are
currently
doing
this
and
what
is
their
strengths?
And
what
is
the
process,
what
they
are
lacking,
and
what
can
we
bring
to
the
table.
D
What I'm kind of seeing here: I think it would be useful to still model the entire process end-to-end, just to draw it, because in the last hour alone we talked about at least five or six different tools to run a single process. It would be good to lay out all the tool interdependencies we have here right now, and who's controlling which parts of those steps. So far, I think, we had Dynatrace as the monitoring tool in this case, and we had GitLab for execution and modeling.
A
That's a very good point. We already have identified those tools in our working document on the template; we have added a couple of notes here. But it would make perfect sense to now bring the picture together and make it clear to everyone how the tools are connected and how everything then works together.
D
I'm always a bit hesitant about the change process.

D
It would even be interesting on the service level whether it can be done automatically or not automatically, if I know it ahead of time. I'm just thinking that right now the restart procedure, the remediation procedure, is no different, but it's happening at 2 a.m. in the morning or at 10 a.m., depending on how fast people can reply. Obviously you have a 24/7 NOC, so somebody will eventually look at it.
A
All right then, I would summarize the action items as follows. As I said, I will take care of the app and Kevin of the Ansible script, but all of us should continue modeling the process and the tool integrations. Therefore, let's continue in our working document, where the proof of concept is already laid out, and add all the tools that are involved and the responsibilities they have within this process.