keptn Working Groups, 21 Apr 2021

From YouTube: Keptn Auto-remediation Working Group - April 21st, 2021

Description

Meeting notes: https://docs.google.com/document/d/1_WlLP6oLcHe0yyC7kXH2hB3i9bOPvIArp83NohE78FU/edit#heading=h.wqdeglxri66j

Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on Github: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject

A

All right, let's get started. Welcome to the next round of the auto-remediation working group. Last time we agreed to work on a template for an auto-remediation scenario, and the scenario should be that we remediate a Java app that is running into memory exhaustion. We therefore created the document, and some progress has already been made on it. I would propose that we go over this document, see what we have, and decide how we can proceed.

A

This should be the first part of the meeting. Then I would break down the scenario into more concrete items and artifacts, so to say, and we should come up with a way to work on these artifacts, for example the app. We should also think about the orchestration of the scenario, as well as the different components that are involved, like data gathering, the change process, recovery, and validation.

A

But that's for later. Let's first focus on what we have in our document right now. Let me just quickly close that one for a moment.

A

Okay. As we said, we want to work on a template, and I just added a very short abstract here on what this working document is about. It's for defining a proof of concept from which we can derive our template of how remediating problems should work. Now I have to ask the question: Kevin, was it you who started working on this document?

B

A

Cool, thanks. I didn't change the structure; I did not add, change, or remove anything in how it was. I just added a couple of things and a couple of comments. Can I ask you to go over this document now and explain to us what you added first?

B

Yep, all right. Using this as an example came about because at our organization, Truist, we were going through this exact scenario as our first try at auto-remediation.

B

So I just thought it would be good to bring this group into the fold, working on that same problem.

B

In our working group we went through the process as a group and came up with some of these steps. So let's start with the assumptions. Obviously it's going to run in a non-container environment, because if it was a container we could shortcut a lot of this. Bullet point number two is a sample application that runs out of memory, so one of the people on our team created a sample application.

B

It's a little Spring Boot app that has two buttons on it. You press one button and it increments a counter, and you press the second button and it calls a loop that runs it out of memory immediately. We have Dynatrace on the server where that was running.

B

It did open the problem, and then we kicked off auto-remediation through some Ansible by way of GitLab, and we can talk about that later. So we do have a sample app and a way to have Dynatrace detect it.

A

Awesome, great. I was adding this bullet point in the sense that we need to think about an app to take as a sample. But if you already have one, that's really great.

B

Yep, I can talk to the author of that. It's so simple, there's no private data or anything in it, so we should be able to share it, but I'll just get his okay. Then I can share that with the group.

B

So then we moved on to: okay, when this happens in the real world, in an enterprise-type environment, what's the process that people go through to resolve it manually? The first step is the data gathering. A lot of times these are supported by another company, say WebSphere and you have IBM, and any time you have a chronic issue they're going to say: well, did you capture all the logs? Did you run this script?

B

Did you take any heap dumps for us to analyze? So there's that process that the vendor will usually require, as well as the support people. So we've got the logs, heap dumps, and config files. We could continue adding to that.

B

I see you did have a comment on the config files.

A

What exactly do you mean by config file? Is this the config of the app itself or of the JVM?

B

Of the JVM process, yeah. If it's tightly controlled, you could possibly look at your source control and say, well, this is the last thing pushed to it. But did it get changed? So I think it's always good to get the running config before you restart something, just to validate that it was running with the configuration you thought it was running with. Step number two is the change process.

B

In our environment you really can't touch anything without documenting it in a change, and in our case that is through ServiceNow. There are often debates on things as simple as restarts, as to whether that could just be an incident record or has to be a full-fledged change record. But regardless, if we want to keep this more templatized in general, let's say it's more than a restart. Let's say we're doing something complex.

B

So somehow we need to be able to tie into where to record this change and have it match up with the CIs that we're changing. In this case we're restarting that JVM, so what's the service that we're affecting, and how do we document that? Along with that, there are the approvals. Again, company by company you might have different processes, but you might have to look up the support groups to see:

B

Does someone have to approve this? If so, who are they? How are you going to notify them? Is there a maintenance window right now? Could you do it without an approval if it's in a maintenance window? So that opens up a few different ideas.
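The approval questions raised here can be sketched as a small decision function. This is only an illustration in Python: the name `plan_approval`, the shape of the maintenance windows, and the rule "no approval needed inside a maintenance window" are assumptions taken from the discussion, not an existing ServiceNow or Keptn feature.

```python
from datetime import datetime


def plan_approval(now, maintenance_windows, support_owner):
    """Decide whether a remediation needs a human approval.

    maintenance_windows is a list of (start, end) datetime pairs
    (hypothetical input, e.g. looked up per service).
    """
    in_window = any(start <= now < end for start, end in maintenance_windows)
    if in_window:
        # Assumed policy: inside a maintenance window nobody has to approve.
        return {"approval_required": False, "notify": None}
    # Outside a window, the support owner has to be notified and asked.
    return {"approval_required": True, "notify": support_owner}
```

How the support owner is actually notified (ChatOps, a ServiceNow approval, and so on) is left open, as it was in the meeting.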

A

All right, got it.

B

So then step three. We now have the okay to do a remediation, so we need to stop the JVM and do any cleanup. Oftentimes, if it's running out of memory, the JVM might dump on its own and fill up some file systems, which may prevent you from starting it back up because the file systems are full.
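A minimal sketch of such a cleanup step, assuming the dumps land in one known directory and match the patterns `*.hprof` and `core.*` (both are assumptions for illustration). It only deletes when free space is below a threshold, and it presumes the diagnostics have already been archived elsewhere.

```python
import shutil
from pathlib import Path


def clean_dump_files(dump_dir, min_free_bytes, patterns=("*.hprof", "core.*")):
    """Remove dump files when the file system is short on space.

    Returns the names of the removed files. dump_dir and patterns are
    hypothetical; adapt them to where the JVM actually writes its dumps.
    """
    dump_dir = Path(dump_dir)
    removed = []
    # Enough free space: leave the dumps in place for later analysis.
    if shutil.disk_usage(dump_dir).free >= min_free_bytes:
        return removed
    for pattern in patterns:
        for path in sorted(dump_dir.glob(pattern)):
            if path.is_file():
                path.unlink()
                removed.append(path.name)
    return removed
```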

B

So there could be some cleanup work that you should check for, and then we would start the JVM back up. I also threw in the idea that maybe we would potentially make changes, especially if it's a chronic issue. It doesn't help to just keep restarting a JVM that keeps running out of memory. Maybe we need to increase the heap size or change the garbage collection or tweak it in some other way before doing a restart.

A

Do you already have that, or how is the recovery phase implemented? Is it an Ansible Tower script, or how is it done right now?

B

Sure. In the scenario that we demoed for our CTO, we have Dynatrace open the problem. We were going to use Ansible Tower, but Ansible Tower can be pricey per endpoint.

B

For our demo we could have used Ansible Tower directly, but we wanted to come up with a better solution for what we would use in the end. So it was decided we're actually going to go with GitLab.

B

Through the GitLab process we're able to kick off a pipeline that spins up an Ansible node, runs the Ansible playbooks that we created, and then spins itself down. So it was a way to get around Ansible Tower licensing.

B

And yet it gives us the breadth that we would need if we rolled this out across many, many hosts. As for the Ansible playbooks that we have: we pass the Dynatrace problem to Ansible, and it then comes back into Dynatrace via the API.

B

It looks up the problem ID and finds the impacted service.

B

Then it can query and get the host that the service lives on, and then it can go find the host.

B

We looked at the command line of the Java process to see which Java process was running on the host, so that it knew which one to restart. Then we had a kill, because this is just a standalone JVM, so there are no shutdown or startup scripts. Really it's just a java -jar command. So we just killed it and then started it back up with java -jar.
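The step of picking the right Java process by its command line can be sketched in Python as follows. This is not the playbook from the meeting; it assumes the process list comes from `ps -eo pid,user,args` and that the application is started with `java -jar <something ending in jar_name>`. The function name and the sample jar name are made up.

```python
def find_java_process(ps_output, jar_name):
    """Find java processes started with '-jar <jar_name>'.

    ps_output is the text printed by `ps -eo pid,user,args`
    (first line is the header).
    """
    matches = []
    for line in ps_output.strip().splitlines()[1:]:
        parts = line.split(None, 2)  # pid, user, full command line
        if len(parts) < 3:
            continue
        pid, user, args = parts
        argv = args.split()
        executable = argv[0].rsplit("/", 1)[-1]
        if executable != "java" or "-jar" not in argv:
            continue
        jar_index = argv.index("-jar") + 1
        if jar_index < len(argv) and argv[jar_index].endswith(jar_name):
            matches.append({"pid": int(pid), "user": user, "args": args})
    return matches
```

The returned PID is what a kill would target, and the user is the run-as information needed when starting the process again.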

A

All right, cool, great. I had no time to write it down, so I may have to take a couple of notes from here. It's really great that you already have this playbook in place that also does the job of remediating your problem.

B

Awesome. Yeah, I should be able to share that too, if you guys want to see those.

A

I think we should discuss later on which artifacts we need, or should work on, to get this scenario running. And then the last part of the scenario is validation itself. Is this also automated right now, or is it just an idea?

B

Yes, this was automated too, in the Ansible. The Ansible playbook, after it starts the JVM, will validate that the JVM is running. We did not do step b, because it's such a simple app.

B

We didn't actually generate any traffic against it, so that one's not done. We did not put in c either, checking for any defunct processes, but that should be simple enough as well. And then it hits the Dynatrace API and looks to see that the problem goes into the closed status.

B

So it does do that.
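Waiting for the problem to reach the closed status amounts to polling with a timeout. The sketch below deliberately takes the status lookup as a callable, so no particular Dynatrace API call is implied; the status string `"CLOSED"`, the timeout, and the interval are assumptions.

```python
import time


def wait_for_status(get_status, expected="CLOSED", timeout_s=300,
                    interval_s=10, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns `expected` or the timeout expires.

    get_status is a hypothetical callable that asks the problem originator
    (for example a monitoring tool) for the current problem status.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A `False` result is the signal that would lead into an escalation step.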

B

So I see you have the last one highlighted should be validation.

B

Yeah, either or. Oh, so we did add a few things to this, and I didn't come back and update this, but we do have another step of escalation, so that plays into this. You have to check to make sure it did what it was supposed to do, and if not, then move on to an escalation step.
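The relation between validation and escalation can be sketched like this: run every validation check, and escalate with the list of failed checks if any of them fails. The check names and the `escalate` callback are placeholders, not part of any tool mentioned in the meeting.

```python
def validate_or_escalate(checks, escalate):
    """Run named validation checks; escalate when at least one fails.

    checks is a list of (name, callable returning bool) pairs, e.g.
    "jvm running", "no defunct processes", "problem closed by originator".
    escalate is a hypothetical callback that receives the failed check names.
    """
    failed = [name for name, check in checks if not check()]
    if failed:
        escalate(failed)
        return "escalated"
    return "resolved"
```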

A

Okay, okay. No, this one is not part of the escalation. No problem originally, correct?

B

Yeah, that would be what would cause you to escalate: if whatever originated the problem does not accept that the problem is fixed.

B

Again, I tried to keep it general. I would not say "Dynatrace problem was closed", because anything

A

Fair enough, good. All right, very cool that there are already parts we can potentially reuse when we want to get this one up and running. You also worked on the artifacts that are required. I added a couple of notes here that I can explain when we go through them. Kevin, can you start with the first ones, as you defined them?

B

Yeah. Under the data gathering, we need some location information. First, where does the JVM live, on which host, so that we can jump into that host and run our remediation?

B

Then the JVM log files for archiving off, where the config file is located so that we can grab it, and the diagnostic archive location: where are we going to save these logs, configs, and heap dumps that we want to retain before we do the recycle?
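Saving logs and configs to a diagnostic archive before the recycle could look roughly like this. It is a sketch using only the Python standard library; the archive naming scheme and the `label` parameter are invented for illustration.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def archive_diagnostics(files, archive_dir, label, now=None):
    """Bundle existing diagnostic files into a timestamped .tar.gz.

    files: paths to log files, config files, heap dumps, ...
    archive_dir: the diagnostic archive location (hypothetical).
    """
    now = now or datetime.now(timezone.utc)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{label}-{now:%Y%m%dT%H%M%SZ}.tar.gz"
    with tarfile.open(str(target), "w:gz") as tar:
        for item in files:
            item = Path(item)
            if item.is_file():  # silently skip files that do not exist
                tar.add(str(item), arcname=item.name)
    return target
```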

B

Next, JVM process identifiers. We need those to detect the process that we're going to kill or shut down.

B

We also need the process identifier to see who it is running as, so that when we start it back up, we start it as the proper user or service account.

A

Ah, okay. So this one is for run-as information, meaning which user was responsible for starting that one, correct?

B

Yeah, we don't want to accidentally start something as root and then have some other thing not have access because the process is running as root, or have a security issue of running as root when it should not. Or your Ansible playbooks might jump in as a different user than what that JVM process needs to run as.
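The run-as concern can be made explicit when the restart command is built. The sketch assumes a plain `java -jar` start (as in the demo) and uses `sudo -u` to switch users; the function name, the root guard, and its `allow_root` switch are illustrative choices.

```python
def build_restart_command(run_as_user, jar_path, current_user, allow_root=False):
    """Build the argv for restarting the JVM as its original user.

    run_as_user is the account the killed process was running as;
    current_user is the account the automation is logged in with.
    """
    if run_as_user == "root" and not allow_root:
        # Guard against accidentally bringing the service back up as root.
        raise ValueError("refusing to start the JVM as root")
    java_cmd = ["java", "-jar", jar_path]
    if run_as_user == current_user:
        return java_cmd
    return ["sudo", "-u", run_as_user] + java_cmd
```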

A

Okay, got it. All right, the next one: capture JVM heap dump. I did some research on that one and found different ways to gather information for the Java heap dump.

A

Here I referenced a blog post, or an article, but mainly these two options are quite convenient. You can use the native Java option HeapDumpOnOutOfMemoryError, and that one generates the dump file for us. Then there is also the tool called jmap, which also allows us to generate the heap dump and get more information about the problem. That one is very similar, just another way of gathering more information. What I also found interesting is a blog post from Netflix.
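The two options mentioned here correspond to the HotSpot flags `-XX:+HeapDumpOnOutOfMemoryError` / `-XX:HeapDumpPath` and to the `jmap -dump` command. The small Python helpers below only assemble the command-line arguments; the helper names and the example paths are made up.

```python
def heap_dump_jvm_flags(dump_path):
    """JVM flags that make HotSpot write a heap dump on OutOfMemoryError."""
    return [
        "-XX:+HeapDumpOnOutOfMemoryError",
        f"-XX:HeapDumpPath={dump_path}",
    ]


def jmap_dump_command(pid, dump_file, live_only=True):
    """argv for taking a heap dump of a running JVM with jmap."""
    scope = "live," if live_only else ""
    return ["jmap", f"-dump:{scope}format=b,file={dump_file}", str(pid)]
```

The first variant has to be in place before the JVM runs out of memory; the second can be run on demand against a process that is still alive.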

A

They described the same problem they were facing. What they did is use Linux to create a core dump, because they kill the JVM by sending a SIGABRT instead of a SIGKILL. It was then required to look into the core dump and not the heap dump, because you would not get the heap dump when you send a SIGABRT. It was just a problem that they faced, and they fixed it already.

A

They then went the way of the Linux mechanism to get more information about the memory usage. Interestingly, they also have a little script that uploads this archive into an S3 bucket, which is then used for investigating the problem and identifying the root cause. Maybe just some fruitful thoughts as we continue working on that issue. All right, then, Kevin, the change process: what do you mean by that?

B

Yeah, so this was the change documentation location. As I mentioned earlier, in our case it's ServiceNow. So where exactly do you have to document that? Is it a change? Is it an incident?

B

If it's a change, how are you correlating it to the proper CI?

B

Then, change approval needed and acquired. In that case, who's the support owner for that service, and if they need to give approval, how are you detecting that? Are we going to tie into a ChatOps type of thing, where you notify them: hey, a remediation is sitting here, do you approve or not approve? Then it's yes or no, and it moves on. Or is it a ServiceNow approval that they would have to hit?

B

There are various options there.

B

We also talked about maintenance windows. Maybe we're looking up whether that service is in a maintenance window, and it doesn't require any approval during a maintenance window. And then the problem originator details.

B

What did I mean by that one for the change process?

B

Oh, maybe that was where I was thinking of getting the details of the problem to figure out which service is affected, so that we could tie that to the proper configuration item in your

B

And then the recovery and remediation. In our case it was very simple.

B

We were just doing a kill, but in more complex scenarios there should hopefully be a JVM shutdown script that will shut it down more gracefully. Then, are there any cleanup scripts that could be called and that already have some predefined information in them to do the cleanup you would like, pointing at log files or anything else you're looking to clean up before restarting? And then we have to know where the startup script is, to be able to start it back up.

A

What I also figured out, just let me open that one for a second, is that Java already provides a feature where you can define a script like that one for when you have an out-of-memory error. You just give it the path to a script, and that one then kicks in when out-of-memory is detected.

A

I think it's also supported with Java 8, and it would already help, but this would then be automated by Java itself and not by a remediation process.
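The feature described here is presumably the HotSpot option `-XX:OnOutOfMemoryError`, which runs a user-defined command when the first OutOfMemoryError is thrown; the speaker does not name the flag, so that mapping is an assumption. A sketch of a start command that uses it, with a made-up helper name and hook script path:

```python
def java_command_with_oom_hook(jar_path, hook_script):
    """argv that starts a jar and registers a script for OutOfMemoryError.

    hook_script is a hypothetical path to whatever should run when the
    JVM detects an out-of-memory condition.
    """
    return ["java", f"-XX:OnOutOfMemoryError={hook_script}", "-jar", jar_path]
```

As noted in the discussion, this moves the reaction into the JVM itself, outside of any orchestrated and documented remediation process.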

B

Yeah, that really takes a lot of control out of your hands.

A

True, yeah. And you would never know when something gets restarted or who did that.

B

That's nice to know, though. I didn't know that they had that feature, so thanks for pointing that out.

A

Of course, no problem. All right, and then finally the validation step. What I added here is that we could use service level objectives and finally check whether they are fulfilled or not.

A

This already gives us an indicator of whether we could fix the problem or still have it open. And Kevin, what do you mean by acceptance?

B

Oh, I think that was checking that the originator agreed that the problem was resolved.

B

In the case of Dynatrace, validating your problem closure, or, if you had it tied into ServiceNow, validating that the incident was resolved.
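The idea of validating a remediation against service level objectives can be sketched as a comparison of measured indicator values with thresholds. This is a generic illustration, not Keptn's SLO file format or evaluation logic; the indicator names and thresholds in the usage note are invented.

```python
import operator

_COMPARATORS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}


def evaluate_slos(sli_values, objectives):
    """Check measured SLI values against objectives.

    sli_values: {"indicator": measured_value}
    objectives: {"indicator": (comparison, threshold)}, e.g. ("<=", 500)
    Returns (all_fulfilled, per_indicator_result).
    """
    results = {}
    for name, (comparison, threshold) in objectives.items():
        value = sli_values.get(name)
        # A missing measurement counts as a failed objective.
        results[name] = value is not None and _COMPARATORS[comparison](value, threshold)
    return all(results.values()), results
```

For example, `evaluate_slos({"heap_used_percent": 40}, {"heap_used_percent": ("<=", 80)})` reports the objective as fulfilled, which would indicate that the restart actually helped.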

A

All right, very cool. What we already have, and this is what I then added to the document under the title "Proof of Concept", is a very detailed explanation of what a remediation process could look like for fixing a JVM problem, and I think that's already very good.

A

We know the artifacts that we should take a look at and consider, and now I would bring that into a PoC, into a running prototype, so to say. I would propose, from a Keptn point of view, to use Keptn for the orchestration of this remediation process, where we have the different tasks of this sequence or scenario.

A

And then we need to provide the tooling that helps to fulfill these tasks and, as Kevin already pointed out, a couple of things are already in place. But I would now open a discussion, or invite everyone to collaborate, on how we could build up this proof of concept.

A

Is it clear what I mean?

A

Let's be more concrete. We have the different parts in our process, like data gathering, change process, recovery, validation, and finally also the escalation, and at some point we need to automate these steps with external tooling or with a custom implementation that I think we have to provide.

B

I'm trying to think of the different services in Keptn that may help us do that. In our scenario, as I mentioned, we pretty much did it all with

B

All of that. So, to break it into the Keptn pieces:

B

I don't know. I mean, are we thinking just the generic service offered by Keptn? I know there's not a whole lot of services, I don't think.

A

That's true. We do have an Ansible service, a ServiceNow service, and also the generic executor for doing any kind of job.

A

I think we don't need to

A

To take services that are already available as sigma.

C

If I can just add my thoughts here: I think, especially for the data gathering, what I've seen in this document is that the data gathering part here is very Java-specific.

C

But the important part for me, once we want to move from a specific, let's say Java-specific, use case to a more generic use case, is: okay, how do we get this data? If it's a Java process, then in my opinion it means you have to SSH into this box, you have to find the Java process, and you will find the dump.

C

Basically you have to go to this instance and retrieve the data, and it has to be very specific where you will find it. Maybe the name of the process is important. The file path is important, where it is.

C

I assume there won't be a service that is basically listening or sitting in the container or in the process and then pushing the data out if there is a crash. So we need to know exactly where to look. Although this can be done for this specific use case, my thought is more: how can we make this generic, so that it works

C

for, let's say, most Java processes and in most organizations? Because SSH-ing into a box is probably not the way that will be

C

accepted in a lot of organizations. Maybe there are some restrictions. So I was just thinking that the data gathering part alone is already quite tricky. In my opinion: how do we define where we get this data, whether it's a pull or a push approach, and if we have to pull this data, if we have to go there and find the data, how can we do it? I think that's not an

A

Yeah, that's totally true. In this scenario it would be, as you said, going to the server and then working there locally on the machine.

C

Exactly, and maybe this is true for a lot of different process crashes: if you need the data, you have to go there. Maybe you can also find the data in the monitoring. Maybe with Dynatrace you can take a look there. Maybe you have something like, I don't know, Prometheus or Jaeger or anywhere else where you can take a look at traces, logs, whatever.

C

This is already a huge variety of different options for how to really find this data, and I think we have to find a way that is a little bit generic. Because if we assume we can just go there and get it first-hand, directly from where it happened, I think that will be really difficult.

C

A way of basically fetching this data: we have to assume that this data is also available in Dynatrace, or we have to assume it is available somewhere in an accessible data store or log store or whatever. But yeah, Kevin, I think you are the expert here and you can share the details. This was just my thought, that it already sounds like a very big problem, a very big challenge, how to actually get this data.

B

Well, I agree with parts of that, but then I would also say that this entire process means we're going to go into this same server and stop a process and start a process,

B

which also has security things around it. So knowing that we're going in and doing something destructive like that, I guess I feel like grabbing diagnostics while we're doing that is not as big of an issue.

B

I like your idea of getting that information from somewhere else, and in most cases that's probably true. When we're talking about config files or log files, I'm sure we could do that. The one thing that we probably can't really get around as much might be special diagnostics that a vendor requires, for example IBM. I don't know if you guys are familiar with their MustGather script.

B

You can Google it and it'll take you to their documentation, but they have a MustGather script that you've got to run, and it goes and does all kinds of things: it grabs OS information and process information. It's a big script, and when you open a support case with them, they're going to say: but did you run the MustGather? So, obviously, having been a Dynatrace customer for a long time,

B

the mantra in my head is: let's catch it the first time, right? Let's not have to wait for another event to be able to react to it. So it'd be nice if we could run those diagnostic scripts the first time we see it, instead of restarting it and then going to the vendor and them saying: well, you've got to wait for another event, and then, when it happens, run the script.

B

Yeah, and I think... No, please.

C

Yeah, I just wanted to add that we often think, for let's say the JVM restart, that some other party is responsible for doing the restarts. Johannes already mentioned that we have this, let's say, Ansible Tower integration. So basically it's not Keptn

C

that is doing the restart and Keptn going into this machine, but Keptn is triggering the runbooks that you already have. It can be Ansible Tower, but it can also be the bash scripts that you already have, and maybe these bash scripts are started from, let's say, a bastion host, or they are not

C

living inside the machine that has the issue, but part of their automation is already connecting to the machine that has a problem and doing something there. I was thinking more that maybe we can use Keptn in this way: we orchestrate the scripts or actions or remediations that are already available. But that would also mean that for the data gathering part we would be relying on some other tool that will give us the data.

C

It will provide us the data. Otherwise we are another MustGather script, and that will put Keptn into the bucket of log analysis or data gathering tools, which I think we are not. But we can utilize these tools and make sure that we bring together everything that can be leveraged for the process of automated remediation.

B

Yep, I agree with that.

B

Keptn would orchestrate, but you would have a different tool that actually did the work for you.

A

Correct, thanks, because this is exactly what I wanted to express with this part: we have the orchestration layer, and then the tooling underneath that takes care of executing the jobs. For me it would be about getting an understanding of which tools we could reuse, or how we could, for example, do the data gathering part or the change process or the recovery part.

A

Should we take those examples that are already available? As you said, Kevin, for the change process you have ServiceNow in place.

A

Recovery was done by an Ansible script.

A

And the data gathering and archiving part, is that one already available, or

B

No, I believe the only thing that we did with that was a little cleanup, just looking for dumps and getting rid of those, but that was Ansible as well, and I would probably move forward with Ansible on that too.

A

Okay. And for validation we have here a check on the SLOs and, above, validating whether the JVM is running, the JVM is accepting traffic, and no defunct process is running.

A

Let's see, this is something that Dynatrace can

B

It can, but you would still need something to check the Dynatrace data with. So I think that again would be Ansible.

B

I know that's what we used to check that the problem was closed and that the JVM was up and running.

A

Would it now make sense to you to continue working in a way that we try to provide an Ansible script, integrate with ServiceNow, and build up this end-to-end scenario with the tooling underneath?

B

One of the other things: we did not really do a change process per se, but we went in and updated the ServiceNow incident. We tied Dynatrace to the developer instance of ServiceNow and had the Dynatrace app in that ServiceNow instance, so it automatically opened the incident and resolved the incident, and as Ansible was going through, it would update that incident,

B

saying: yep, I found the JVM, I killed the JVM, I restarted the JVM. Then Dynatrace came in and resolved it, and then Ansible says: yep, I validated that I see it resolved. So we did a lot of Ansible stuff in there as well. We didn't do an escalation, but we could have, and we would have done it with Ansible. So, long-winded.

B

As you look at this, we basically orchestrated everything with Ansible, so my question would be: why do we use Keptn?

B

Why not use Ansible for everything?

B

Because that's essentially what we did.

A

Yeah, you used Ansible and an Ansible playbook to orchestrate everything.

B

Yeah, so the main playbook would have separate tasks under it, working through the flow that we had up above, and any time it had an issue we could kick it out of that flow and hit the escalation flow, but otherwise move through that workflow.

B

I just figure I've got to at least ask the question. In my head I'm thinking: well, is it simpler just to use Ansible?

A

It's a very good question. But when you have an Ansible playbook, you always connect to the tools that you know today. In the future, maybe you want to integrate with another service. Maybe you want to exchange ServiceNow with another tool. Then you have to go back to your playbook and adapt everything that is ServiceNow-related. With the concept of Keptn,

A

you would easily exchange one of these components, because you don't have to touch the process. You just plug in another component, and that one then takes care of executing this particular task.

A

You get more flexibility, and you are also ready for future changes, as you can easily plug in the tools as you need them. This is the main benefit you would get with Keptn.
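The separation argued for here, a fixed sequence of tasks with exchangeable tools behind each task, can be illustrated with a few lines of Python. This is a conceptual sketch only and not how Keptn is implemented or configured; the task names follow the working document, and the handlers are placeholders.

```python
SEQUENCE = ["data-gathering", "change-process", "recovery", "validation"]


def run_sequence(sequence, handlers, context):
    """Run each task of the process with whatever tool is plugged in for it."""
    log = []
    for task in sequence:
        handler = handlers[task]  # the tool integration for this task
        log.append((task, handler(context)))
    return log


# Placeholder tool integrations (hypothetical, for illustration only).
handlers = {
    "data-gathering": lambda ctx: "archived by ansible",
    "change-process": lambda ctx: "incident updated in servicenow",
    "recovery": lambda ctx: "restarted by ansible",
    "validation": lambda ctx: "slo check passed",
}

# Exchanging one tool touches one entry; SEQUENCE stays as it is.
swapped = dict(handlers, **{"change-process": lambda ctx: "ticket updated in other tool"})
```

The counter-argument from the discussion still applies: if every handler is Ansible anyway, the extra layer only pays off once a second tool enters the picture.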

B

And I do understand that concept, but in this specific case we're saying that everything is Ansible.

B

So if we had Keptn doing this and we replaced Ansible with something else, we're still in that same boat, I think, of having to go back and change everything.

B

I can easily be talked out of this.

B

I just wanted to bring it up as food for thought and have somebody convince me about putting Keptn in the middle. We went through a process of trying to come up with: okay, we need somebody to orchestrate it, and then we need somebody to do the work. We looked at possibly using ServiceNow, GitLab, Ansible; Keptn was on the list. As we started going through each scenario, it was: we don't want to throw a tool in just to orchestrate if it's just adding another failure point.

B

So when we started working through the process, it came down to Ansible being the orchestrator.

B

That seemed to be the simplest option.

A

Okay, I see.

B

I know it's a buzzkill. I just want to bring it up to get discussion around it, and maybe even have you guys expand on it and bring it up with other people to get more people thinking about it. But why put Keptn in the middle if we could potentially go from Dynatrace or Prometheus or something right to Ansible and have it orchestrated, if each of your steps is already Ansible?

A

Okay, yep, as I said.

B

There's a better way to do it too, maybe. We'll say: well, for data gathering it's actually better to use this tool, for change it's better to use this, and for recovery it's better to use this. So this is just our first scenario, but, you know, just food for thought.

C

Am I back? Can you hear me?

C

Okay, yeah, my computer did not let me switch any windows, so I could hear you, but I could not unmute myself because I couldn't see Zoom. But I just wanted to add that I always think of Ansible a little bit like a general-purpose automation language. Like, you can use Java for building whatever application you want to build, but for some parts there are programming languages that fit the purpose better, even though you can do a lot of it with Java as well. And I think Ansible is also one of those automation languages, let's say, or platforms, where you can do a lot of things. But especially when it comes to validating the quality of the services again.

C

Then in Keptn we already have the quality gates baked in. It's based on service level objectives and service level indicators. So we're kind of using a couple of those best practices from the SRE community there, and you can just use it out of the box.
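To make the quality gate idea concrete, here is a minimal sketch of an SLO-based check: each objective names a service level indicator and an upper bound, and the gate passes when enough objectives are met. This is not Keptn's implementation or file format. The `Objective` class, the `evaluate` function, the indicator names and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    sli: str          # name of the service level indicator (hypothetical)
    max_value: float  # pass criterion: measured value must be <= max_value


def evaluate(objectives, measurements, pass_ratio=1.0):
    """Return (passed, details); the gate passes if enough objectives are met."""
    details = {}
    for obj in objectives:
        value = measurements.get(obj.sli)
        # A missing measurement counts as a failed objective.
        details[obj.sli] = value is not None and value <= obj.max_value
    met = sum(details.values())
    passed = bool(objectives) and met / len(objectives) >= pass_ratio
    return passed, details


objectives = [
    Objective("response_time_p95_ms", 500.0),
    Objective("error_rate_pct", 1.0),
]
ok, details = evaluate(
    objectives,
    {"response_time_p95_ms": 420.0, "error_rate_pct": 2.5},
)
print(ok, details)
```

In this example the error rate exceeds its bound, so the gate fails even though the response time objective is met.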

C

You don't have to write it yourself. And then also the tool integrations: there are, for example, the Slack or Teams or whatever integrations. So whenever you want to switch those tools, it's basically, as Johannes mentioned, just the one tool integration that you change. You don't have to change the process. You don't have to change all the automation scripts. It's just a new approach and a new take on this, but for sure it can be done with Ansible as well.

C

I would even argue it can be done with Bash as well, but we just moved on to other concepts that maybe fit the purpose better. But yeah, if you already have everything in Ansible, then I think it's not good to just throw it away, but rather to find a way to bring those things together. And yeah, Keptn can be one of those orchestrators. Also, when we are building Keptn, it's not only about the remediation part, but also a lot about the deployment and delivery part. You can also do this with Ansible, but it's a lot of custom scripting if you want to do this. So these are just my thoughts on this.

B

Yeah, I really like the quality gate stuff. I demoed that for some other people in our organization, and they were all very impressed and anxious to go down that route too. So I think, obviously, Keptn will be in our organization, definitely for that use case. And so I guess, if you already have it, it makes sense for auto-remediation to say, maybe use it. I guess my only thought was: okay, say I don't have Keptn.

B

Would I install Keptn to do the auto-remediation? Just working through this flow, I would say probably not, since I could just go from Dynatrace to Ansible and do it all there. There's not, I guess, that strong a value add that Keptn brings on the auto-remediation side of things.

B

If that makes sense.

C

Yeah, I think so. I would want.

B

Go ahead, guys.

C

Yeah, I think this is the part where we are now building something with Keptn. This is why we are reaching out and building this working group: because we want to figure out what is needed to support organizations in this process. And because.

B

Everyone is building.

C

Everything for his or her organization. This is also one of the problems of Keptn: everyone is building kind of an automation, maybe not a platform, but everyone is building CI/CD automation, because everything that is out there does not really work for them. So everyone is, let's say, starting with Jenkins, but building everything a little bit differently.

C

But after all, everyone is building CI/CD automation. What we are trying with Keptn is to bring this together, not by just bringing another tool to the table, but more like: okay, you have now built your custom, let's say, way of deploying your applications, but you still need a way to validate the application. So you can then either use your own thing or go for Keptn quality gates, and the deployment itself you can connect to the process that Keptn will orchestrate. So you can use your own testing scripts. You are not tied to any of those, or to one specific tool or one specific way to do it. Basically, you can interact with the Keptn control plane, or the Keptn control plane will interact with your tool. We're really trying to build a tool or a platform where you can connect everything that you already have. But all this glue code, like what happens when, and "I need to do the rollback, but when is it triggered?", and how to do the evaluation: this is what we wanted to take away, all this burden of integrating all tools with each other.

C

This is, I think, where one of the strengths of Keptn lies.

B

Yeah, and I think so. You got my brain going here now. So I think that is the big difference. We are showing an operational auto-remediation: something, a monolith, is sitting out there and has been running, we're monitoring it, something goes bad, and we want to correct it. The other scenario that you brought up, that makes more sense, is the CI/CD pipeline.

B

We've now done a deployment and something went bad and we need to auto-remediate that, and that makes a ton of sense in Keptn because, as you say, you've got all the deployment information in there. It's process-flow driven, we're going to insert some auto-remediation pieces in that, and it's all going to be documented in one place.

B

So I think, yeah, maybe we started with a scenario that's not as Keptn-friendly per se, but auto-remediation on a CI/CD pipeline makes a lot more sense and has a lot more vision in my head.

A

But still, the scenario we have right here in front of us is also a use case that can be covered using Keptn. I mean, we skip the entire deployment part and start working when a problem occurs, and this can occur in a deployment scenario as well as in normal operation.

A

Okay, I can propose to show this workflow we have right here next time: just a skeleton of how Keptn would execute these five steps. Then, in a follow-up meeting, we discuss how to bring in the different components, like, for example, Ansible to do the first step, then ServiceNow to do the second one, and so on. Would this make sense to make progress here?

B

Yeah, I think so.

A

Because what I can do is just create this process, and as a receiver or as a tool, I don't use actual tooling and just use a Slack notification or anything else to show that this is now triggered and executed.

A

And then, when the execution is done, I just go on, and I can demo that so that we can think about exchanging these components for the tooling we have or should have available.
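The skeleton described here can be sketched roughly as follows: a fixed sequence of steps where every step starts with a placeholder handler that only records that it was triggered, and where a single step can later be swapped for real tooling without touching the sequence. This is an illustrative sketch, not Keptn code. The `RemediationSequence` class, the handler functions and the step names are hypothetical; the step names reuse the four components named in the discussion (data gathering, change process, recover, validation) rather than the five steps of the working document.

```python
from typing import Callable, Dict, List

Handler = Callable[[str, dict], None]


def notify_only(step: str, context: dict) -> None:
    """Placeholder handler: records that the step was triggered, does no real work."""
    context.setdefault("log", []).append(f"{step} triggered")


class RemediationSequence:
    """Runs named steps in order; each step has exactly one pluggable handler."""

    def __init__(self, steps: List[str]):
        self.steps = steps
        self.handlers: Dict[str, Handler] = {step: notify_only for step in steps}

    def register(self, step: str, handler: Handler) -> None:
        """Swap the handler of one step without changing the sequence itself."""
        if step not in self.handlers:
            raise KeyError(f"unknown step: {step}")
        self.handlers[step] = handler

    def run(self, problem: str) -> dict:
        context = {"problem": problem}
        for step in self.steps:
            self.handlers[step](step, context)
        return context


def run_playbook(step: str, context: dict) -> None:
    """Stand-in for a real tool integration (hypothetical, does not call any tool)."""
    context.setdefault("log", []).append(f"{step} handled by playbook stub")


sequence = RemediationSequence(
    ["data_gathering", "change_process", "recover", "validation"]
)
sequence.register("recover", run_playbook)
result = sequence.run("JVM memory exhaustion")
print(result["log"])
```

Only the `recover` step is replaced here; the other steps keep the notification placeholder, which mirrors the idea of demoing the process first and plugging in tools afterwards.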

A

Okay, and Kevin, can I ask you a favor? As you mentioned, you have the sample app that allows you to simulate the JVM problem. Yes. Can you take care of getting this example?

B

Yep, I can do.

A

But with tooling.

A

Okay, so the sample app, the process orchestration is done by me, and we started discussing the tooling of others.

A

We discussed the tooling, but I think let's just continue next time to get more concrete on how we will do certain things like data gathering, the recovery process and, finally, the validation. Is that okay for you? Yep, perfect. Then next time: a short demo of the remediation process.

A

We take a look at the sample app and then figure out how we connect the template with tools that are available.

A

Great, sounds good. Perfect, then. Good.

A

Thanks, you too. Have a nice day and see you next time. All right.

B

Sounds good. Thanks, guys.

C