From YouTube: Keptn Community & Developer Meeting - Jul 26, 2023
Description
Meeting notes: https://docs.google.com/document/d/1y7a6uaN8fwFJ7IRnvtxSfgz-OGFq6u7bKN6F7NDxKPg/edit
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on GitHub: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
Sign up to our newsletter: https://bit.ly/KeptnNews
D
Sure, let me share my screen. I put together some slides — I love a good slide deck — and please excuse my voice: it is currently 4 AM my time, so you're getting a very tired version of myself. All right, so today we're going to be presenting our Botkube and Keptn plugin. Just a brief overview of Botkube: it's an open-source, collaborative Kubernetes troubleshooting tool. What that actually means is that you're able to monitor and troubleshoot events right in your communication tool.
So you don't only receive alerts — you're able to act on them as well, in a space where you and your team can see everything that's going on. The goal is really to improve the developer experience and give developers and platform teams self-service access to their resources, without needing to be a Kubernetes expert. You can respond to your alerts and access your cluster from any platform; because it's connected to all those communication platforms, you can do it on the go — you can be sitting in a Starbucks
running kubectl commands right from your phone. A bit more of an overview: we currently work with Slack, Microsoft Teams, Discord, and Mattermost. Right now we have source plugins for Kubernetes events and Prometheus in production, and today we'll be showing off our source plugin for Keptn. We hope to expand our plugin system. We also have executor plugins — currently kubectl and Helm — and we are hoping to expand the plugin system on that end as well.
So you're not running the same commands over and over again. And with our web app, you're also able to audit your events — you get a log of all the commands and who ran them, in a timeline — and with that web app it's also easier to make configuration changes. And then, more about the plugin system,
which is what we're here to talk about today. With the plugin system that we launched a few release cycles ago, our goal is to be able to automate all the tools across the CNCF landscape — it feels like there are 400 million tools — and the goal of Botkube is to have it all in one place, so you can have them interacting with each other. I really like that it's in Slack; for me it feels less scary than the terminal.
So it's nice to have almost a tool suite that we're trying to build out, and I hope our collaboration today will allow us to add Keptn to that tool suite. Some basics on the plugin Hussein is going to show you: right now it's a source plugin, but we have the goal of expanding it to also be an executor plugin. So the goal is to receive notifications about Keptn events with Botkube in your communication platform — today we'll be demoing with Slack.
We would also want to be able to trigger commands based on the source events that we get, and to take action based on a particular Keptn event. So I'm going to hand it over to Hussein, the creator of this plugin — I'm just here to highlight his work — so take it away, Hussein.
B
Thanks, Maria! Okay, I hope you can see my screen.
So, with Keptn: I already have a Keptn installation here, and I only have one Botkube project, so let me first show the metadata of this plugin. As you can see — in our Helm chart, by the way — this plugin is already in, so it will be released to production in two weeks or so. Once we release it, in our cloud version you will see the Keptn plugin in our production plugin list.
It will be available. We started very, very simple, because we just wanted to get used to it and then see, I don't know, a couple more alternative scenarios. Here in the Keptn plugin we have only two parameters: you can provide a project, and, if you are not inside the Kubernetes cluster that Keptn lives in, you can provide your API gateway URL here. You also need to provide your token, of course. So it's a simple installation.
In Slack — as Maria said, it works on different platforms — here I am saying: Keptn is enabled, the token is this one, the project is botkube, and then I am binding this Keptn plugin to a channel. I already installed this one; you can see my installation here — all the Keptn components are here. So what I will do: I already have the Keptn CLI integrated in my terminal, and I will trigger a sequence for this project and the service "hello". It contains one task; I am using the job-executor-service.
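For context, enabling a Botkube source plugin and binding it to a channel is done through Botkube's Helm values. The fragment below is only an illustrative sketch of the configuration being described — the plugin reference, parameter names, service URL, and channel name are assumptions, not taken from the demo:

```yaml
# Illustrative Botkube values.yaml fragment (key and parameter names are assumptions)
sources:
  keptn-events:
    botkube/keptn:                  # hypothetical plugin reference
      enabled: true
      config:
        project: botkube            # the Keptn project to watch
        url: http://api-gateway-nginx.keptn.svc.cluster.local/api
        token: <KEPTN_API_TOKEN>    # required when outside the Keptn cluster

communications:
  default-group:
    socketSlack:
      channels:
        default:
          name: keptn-demo
          bindings:
            sources:
              - keptn-events        # route Keptn events into this channel
```

The two plugin parameters mentioned in the talk (project, API gateway URL) plus the token map onto the `config` section here.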
So this is my channel, and all the Keptn events are received right here in it. Basically I can see that something is coming from Keptn: the sequence was triggered, the task started, the task finished, and I can also see a description if there is one. That means, if Keptn gives me extra metadata — like the description, the logs, etc. — I can see that this job-executor-service task finished and the output is "hello world". That's it; I can consume any kind of event from Keptn this way.
Under the hood, of course, I am using the Keptn MongoDB datastore API, and we are basically consuming that endpoint; whenever we see a message, we ship it to the platform that the user configured. That is our integration. Our goal is to also introduce a Keptn executor plugin — for example, using Slack: as you can see in Slack here, when I say "list executors"... I don't have any kind of... yeah, I do have executors.
For example, I can run a kubectl command here — "kubectl get pods" in the keptn namespace, say — and you can see I get the whole pod list. If I had a Keptn executor plugin, that would mean I could execute Keptn commands within Slack, or within Teams or Discord — whichever platform you need. That's it on my end.
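In chat, invoking the executor shown in the demo looks roughly like this (the bot handle is the Botkube default; the namespace matches the demo):

```
@Botkube kubectl get pods -n keptn
```

Botkube replies with the pod list and, as mentioned later in the discussion, can offer interactive block messages in Slack to refine the command (pick a namespace, then a resource, and so on).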
C
And I think we are also here today to discuss how this can be integrated into the Lifecycle Toolkit. So, question to the round: how would you envision an integration?
E
I might have a question: is it possible, with Botkube, to have a wizard-like command? Say I type: "Botkube, create me a new KeptnMetric", and then it asks, "Okay, what kind of provider do you want to use?" — I say I want to use Prometheus as the provider — and then it asks, "What kind of query do you want to have?", and then it creates such CRDs for KLT automatically. That would be nice.
B
Yeah — we already use the interactive block messages in Slack, so this is available for kubectl. When you type "kubectl", you can see there are a couple of options: which namespace; when you select a namespace, it asks which resource — this kind of thing. We already have this API, but my question was about the new toolkit — I'll take a look at that one. What about the existing APIs? I mean, we are consuming the API and polling for events; with the new toolkit,
if we switch to that one, will we still be able to consume these events, or will the Lifecycle Toolkit send push notifications to us? What will the model be?
C
We plan to have CloudEvents that we emit. Right now, when we monitor your deployment with KLT, we emit Kubernetes events, but we plan to also support CloudEvents — and maybe CDEvents, more specifically — so basically we would push events to you, if you configure your endpoint with us.
B
Okay — because our end users are mostly allergic to exposing that kind of endpoint. For example, we saw this with MS Teams and the legacy integration: it needs a webhook endpoint to push notifications to us, but the end users say they don't want to expose that kind of endpoint to the public. So if there were a way to still consume this by polling an endpoint, that would be best for us. And in the issues I already saw that the existing...
Most of the time they deploy Botkube in, maybe, a central Kubernetes cluster, so the Botkube deployment can be in a different Kubernetes cluster than the Keptn instance — you can think of it something like that. So they would need to expose that endpoint publicly, or do some kind of VPC peering to send the notification through a tunnel, or something like that.
C
I don't know if this works for you: can you ingest OpenTelemetry data? Would it make sense for Botkube to receive this type of data?
D
Thank you so much for having us and listening to our presentation. I guess a follow-up question: we do have a PR about getting onto the Keptn integrations page. Do we want to clean that up and switch to KLT before it gets merged, or can we officially be on the integrations page now?
C
I think we can go ahead and merge the existing PR, actually. Could you please send it in the chat so I can edit it here?
Yeah — I suggest we go ahead, since you already have something built, and then, when we finish the KLT integration, we can simply change it. Thank you very much.
C
However, there is a bit of a problem with the first part, which is supporting the scheduling gates of Kubernetes so we can get rid of our custom scheduler. We have a bit of a problem here, so I will remove these tickets from the milestone and push them back to 0.10. The problem is the controller-runtime: scheduling gates were introduced in 1.26 and went to beta in 1.27, and we plan to make use of them, but unfortunately there is an issue with the fake client which makes it impossible to test and breaks all our existing tests.
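For reference, a Kubernetes pod scheduling gate — the mechanism under discussion for replacing the custom scheduler — is declared on the pod spec; the gate name below is illustrative, not the one KLT actually uses:

```yaml
# Illustrative pod scheduling gate (gate name is hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  # The pod stays in SchedulingGated until a controller removes this entry,
  # which lets an operator hold back scheduling until pre-deployment checks pass.
  schedulingGates:
    - name: keptn.sh/pre-deployment-checks
  containers:
    - name: app
      image: nginx:1.25
```

This is exactly the pattern that removes the need for a separate scheduler binary: the default scheduler simply refuses to schedule the pod while a gate is present.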
Therefore we need to wait until they fix this problem and release a new version, and that takes between one and three months, looking at their past release schedule. So I would move all of these tickets to the following milestone.
G
We've got — oh, Yash is here. Hi, Yash! Yash is working on improving the architecture documentation. Would you like to give us a status on how that's going, Yash?
I
Sure. I recently started working on the KLT scheduler and the lifecycle-toolkit operator. So far I have drafted the PR for the documentation and got some suggestions on it, so I'll work on the suggestions that were made on the Keptn scheduler PR. I'd like to link it here.
G
Not really — I'm mostly working on migration. I got sidetracked: I found another sort of big problem in the background stuff. I thought we'd have more people here to discuss some of the architecture stuff, but since we've got a lot, can I cede my time and let you get started on refinement?
C
So this is a very large file created by Florian, who unfortunately is not here today, and I will try to guide you through it.
We give it some custom name, and then you define: based on the value of this SLI, I consider it a pass if it's either less than 600, or less than plus 10 percent compared to the last time I evaluated it. We also have some warning criteria, to say: it's not a pass, but it's also not failing — it's in a warning state. We can define this as being between 600 and 800, basically. And then we can compute a score.
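In the classic Keptn v1 SLO file format, the criteria just described look roughly like this (the SLI name is illustrative; the thresholds are the ones from the walkthrough):

```yaml
# Keptn v1 slo.yaml sketch for the criteria described above
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  aggregate_function: "avg"
objectives:
  - sli: response_time_p95       # illustrative SLI name
    pass:
      - criteria:
          - "<600"               # absolute threshold
          - "<=+10%"             # relative to the previous evaluation
    warning:
      - criteria:
          - "<=800"              # warning band between pass and fail
total_score:
  pass: "90%"
  warning: "75%"
```

Note the implicit logic mentioned later in the discussion: criteria inside one `criteria` list are ANDed, while multiple list entries are ORed.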
If you have multiple criteria listed here, each one gets a score, and you can say: I want at least 90 percent of my SLIs to be successful.
The project, stage, and service are passed through here from Keptn, but you can also use these filters — data that you provided before — and then Keptn will run this for the stage and the service that you specified, automatically fill in all of these variables for you, and query the data from your provider. Now, for KLT:
this should be one CRD per SLI that you want to use; this way you can reuse it across multiple SLO evaluations. For instance, for the standard four golden signals to evaluate a service, you just need to define them once here, and then you can reference them from multiple services, from multiple SLOs, and just replace the specific parameters for them.
The harder part is to map the SLO into the Lifecycle Toolkit, and for this we came up with a resource called AnalysisDefinition, which is similar to the SLO file but more descriptive. We saw that the old format encodes special logic behind the scenes: if you have multiple criteria, those are evaluated in an OR fashion, but multiple items inside one criteria block are evaluated with an AND — and this was a bit confusing for past users. So we said: okay, we want to make things more explicit.
So now, for the pass criteria, you can have "any of", "all of", "none of" — more descriptive names that define how the children of an element should be evaluated — and then we can have a standard target rule. Instead of encoding operators in strings, we would like to have explicit operators; this way, when you try to apply the definition to the cluster, a validation webhook can make it fail directly, without the need to parse a string at runtime — sorry, at the time we translate your SLO file.
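At the time of this discussion the resource was still a proposal, so the following is only a sketch of the shape being described — the API group, version, and field names are assumptions, not a settled schema:

```yaml
# Hypothetical AnalysisDefinition sketch (field names are assumptions)
apiVersion: metrics.keptn.sh/v1alpha3
kind: AnalysisDefinition
metadata:
  name: response-time-slo
spec:
  objectives:
    - analysisValueTemplateRef:
        name: response-time-p95   # the SLI (AnalysisValueTemplate) to evaluate
      target:
        failure:
          greaterThan:
            fixedValue: 800       # explicit operator, replacing "<800"-style strings
        warning:
          greaterThan:
            fixedValue: 600       # values between 600 and 800 score a warning
      weight: 1
  totalScore:
    passPercentage: 90
    warningPercentage: 75
```

The explicit `greaterThan`/`fixedValue` structure is what allows a validation webhook to reject a malformed definition at apply time instead of failing during evaluation.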
We can then add some extra metadata. We first called it "selector", but selector is a bad name because it has a specific meaning in Kubernetes; we agreed that it's a bad name and it should be called "metadata" or "context", something like that — similar to the filter field that we have here in the old v1.
You apply this resource to your cluster, and then, behind the scenes, the Keptn operator starts to crunch some data and presents in the result here the final score of your evaluation — whether it's good, bad, warning, etc. I won't go too much into the details of how we envision everything would work, but Florian also drew a bit of an example of which CRDs you need to write statically and which ones are created at runtime to perform the evaluation.
So if you have two services, you would need to define two different AnalysisDefinitions, one per service, because the two services will have two different SLO definitions. Maybe service one is a CPU-critical service, so you want to check a CPU value, and service two is more of a memory-consumption type of service, so you want to check some metric for memory. Then, starting from the definition of service one,
you refer to different AnalysisValueTemplates — what the industry also calls SLIs — and all of this part will already be statically available in the cluster. When you want to perform the evaluation, the only resource you need to create is the Analysis CR, which says which service I would like to evaluate; behind the scenes, KLT will pick up the SLO definition and know: okay, I now need to fetch these three metrics.
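A sketch of the two pieces just described — the statically defined SLI and the per-run Analysis — might look like this. Again, the field names come from the proposal stage and are assumptions, and the PromQL query is purely illustrative:

```yaml
# Hypothetical AnalysisValueTemplate: the reusable SLI, defined once
apiVersion: metrics.keptn.sh/v1alpha3
kind: AnalysisValueTemplate
metadata:
  name: response-time-p95
spec:
  provider:
    name: my-prometheus            # illustrative provider name
  query: 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{.service}}"}[5m]))'
---
# Hypothetical Analysis: the only resource created per evaluation run
apiVersion: metrics.keptn.sh/v1alpha3
kind: Analysis
metadata:
  name: analyze-hello-service
spec:
  analysisDefinition:
    name: response-time-slo        # picks up the SLO definition statically in the cluster
  timeframe:
    from: 2023-07-26T04:00:00Z
    to: 2023-07-26T04:05:00Z
  args:
    service: hello                 # fills the {{.service}} placeholder in the template
```

This mirrors the v1 behavior where project/stage/service were substituted into the SLI query, except that here the substitution arguments are explicit on the Analysis.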
Most importantly — Adam, I know you asked about this in the past — we are finally solving the issue where, in v1, we could have only a single provider per SLO evaluation. You could not fetch this metric from Dynatrace, this metric from Datadog, and this metric from Prometheus; you could only fetch all of them from a single provider. Now, instead, we allow you to pick and choose different providers for different metrics.
C
We're still evaluating whether we can reuse the KeptnMetric CRD, but most likely no — because if you define a CR called KeptnMetric, how can you differentiate between continuously polling this data from the provider and fetching it just once? In terms of feature parity, what a KeptnMetric can do versus an analysis value would be the same, but the purpose is different: a KeptnMetric is continuously evaluated, while the analysis value — our SLI, a single metric — is computed once, for the specific time frame that you define.
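For contrast, a KeptnMetric is the continuously polled variant from the metrics-operator; it looks roughly like this (provider name and query are illustrative):

```yaml
# KeptnMetric sketch: re-fetched on an interval, unlike a one-shot Analysis
apiVersion: metrics.keptn.sh/v1alpha3
kind: KeptnMetric
metadata:
  name: cpu-usage
spec:
  provider:
    name: my-prometheus                        # illustrative provider name
  query: 'avg(rate(container_cpu_usage_seconds_total[5m]))'
  fetchIntervalSeconds: 30                     # the "continuous polling" knob
```

The `fetchIntervalSeconds` field is what has no sensible meaning for a one-shot SLI evaluation — which is the differentiation problem described above.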
J
One other thing that came to mind, with GitOps: do you need — I don't even know if you can run a second analysis at any time — but do you need another CR? Do you need to create basically another YAML file for a second run? Or is it not a...
F
Yes — there are two ways, I think, from what Florian said. Either:
if you use KLT for the whole lifecycle, then the Analysis would be created automatically, as part of a post-deployment task for example — so right after load testing you can trigger the analysis automatically through the lifecycle operator, and it would be evaluated that way. Or, the other way: if you, for example, just want to use the metrics operator, you could keep it GitOps-style — apply the Analysis YAML yourself and trigger it manually, kind of that way.
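The first option — wiring the evaluation into the deployment lifecycle — is typically expressed through KLT's workload annotations; a sketch of what that might look like (the evaluation name is illustrative, and the exact annotation key should be checked against the KLT docs):

```yaml
# Deployment fragment: run an evaluation after the rollout finishes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        keptn.sh/workload: hello
        keptn.sh/version: "1.0.0"
        # evaluation to run post-deployment (name is illustrative)
        keptn.sh/post-deployment-evaluations: response-time-slo
```

The second option is simply `kubectl apply`-ing a fresh Analysis CR (as in the earlier sketch) whenever you want an on-demand run, which fits a GitOps flow where each run is a committed manifest.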
C
And here, for Hussein and Maria, I think this is a good integration point: with the bot, you could write "please evaluate my service", and it could maybe ask, wizard-style: for which time frame? Do you have any extra metadata that you would like to pass through? And then you get back the result: okay, the quality of my front end is 90 percent.
B
For example, with the Prometheus community operator, when we install it there are lots of useful things — for example, Alertmanager rules — that you can use by default. But with Keptn, once we install the Lifecycle Toolkit, the dashboard is empty. So will there be some kind of default set of analysis files, SLOs — will there be such a thing, or...
C
We came to a decision that, for the Lifecycle Toolkit, we will not have any special UI. We have the problem that we don't have many front-end developers contributing to the project, and maintaining a UI is hard. Therefore we decided to expose everything as OpenTelemetry, so you can use Prometheus, Grafana, and Jaeger as the UI.
We don't plan to expose any REST API for that, but you can do it by checking the status of the CRD. Okay — yes, Andre.
H
Yeah — also, I think for this use case we will implement Kubernetes events that will actually tell you what stage the whole evaluation, or the lifecycle, is in. So, those being already standardized, the events should be enough for at least basic evaluation management.
J
Actually, I'm just thinking a little bit more about Botkube: it'd be really nice to be able to say "get me the services on a cluster that Keptn — or the Lifecycle Toolkit — is managing", for want of a better word; where it's active. "Get me the namespaces where the Lifecycle Toolkit is active."