keptn Community, 28 Oct 2019

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Keptn Community Meeting - October 28th, 2019

Description

Discussion of Keptn Quality Gates
https://docs.google.com/document/d/1Vebjqs2JRtcH_GHBXTqddyowKGTUxeMCxCUIUvFd23U
Dirk: will provide a sample SLO file that includes the upper- and lower-boundaries until next community meeting
Christian: Details on SLI pass criteria and logical combination of those
Florian: Provide information on how to write SLI providers in the next community meeting

A

I started the recording hello everyone. This is the captain community meeting on October 28th thanks for joining I will just open up the agenda and start sharing. My screen.

A

Hopefully, you see what I see I want to continue the discussion of the captain quality care to action items from last meeting. It was that Andy and drop said they will provide a sample SLO file. That includes some more details and in the next community meetings. So we've already managed to include that or incorporate that sample as a low into the captain.

A

Quality gates, use case document and will briefly sweep through the changes and and just point out what is different now and Christian from the captain team will enlighten us more on the details of how that passed and warning criteria actually will work, and this was an action item for me. But I asked one of my colleagues to actually do that. So Florian will provide us with information on how to write SLI providers in this community meeting.

B

A

This one, but I got the file from Andy and Rob and with that I would just jump right into what has changed in the captain. Quality gates and use case document prerequisites prerequisites. They the same. We renamed their service level indicators to response time. We discussed this last time because request latency was well not that well received from some and I also think. The response time is more.

A

Clicking I'm missing the English word right now: self.

C

Explaining self-explanatory.

A

Thank you very much when coming to the service level objectives configuration a few changes happened. So this is the an example. Slo file and and Christian will walk us through an very detailed service level, objective configuration later on, but bear with me. So we we've discussed the filter section previously, where you can, for example, provide the idea of a Prometheus Crabb job, and you, of course, can override project stage or service values. If needed.

A

For comparison, we have the possibility to define if you want to compare with a single result of its several results and also define a filter. If you only want to compare to previous past results or if you also want to consider warning results, for example in the comparison as for the objectives themselves, they consist of a reference to an SLI.

A

So this is the name of the SLI, it's actually error rate, and then you need to define the pass criteria and you identify warn criteria if you want to, but more to that in the detail later on, and we do too many requests have reinstated the scoring into the SLO file, where you have the possibility to define waits for objectives and always have a total score of an evaluation. There is a value between zero and one, so it can be displayed as a percentage value, so it it does not change over over time.

A

So you always have a good reference to the previous builds when it's always a percentage. So it's better than an absolute value. These are the page. It changes that that came from the proposal that Andy and Rob did so. We have the comparison with a single result. We had that already, also with several results where, for example, we have a filter that you include past or warning results, you can define the number of comparative results you want to compare with. In this case, three and I also think three is the default value.

A

Are we gonna go with and you can define an aggregate function, so it will calculate an average of the three last past or world results and use that value as a comparison value in the actual evaluation and with that I think I will give the microphone over.

A

So nothing did really change in the user walkthrough there was one one question open, so Henry had a question about providing a data source and Florian will answer this question later on, but first I would like to go through a detailed example of the service level objective file. So you see there is all kinds of documentation in there where each and every field is document pretty neatly except the total score. But I would like to give the microphone to Christian to give him a chance to walk us through the detailed example of the service level.

A

Objective file.

C

A

Would you like to share yourself? Yes,.

C

It makes it way easier for you.

A

C

Will start with the objectives.

C

C

So you should see the Yama file now right.

C

Yes, yeah, so I'll start with the objectives that you can define, and the main thing here is an objective, so NS Li has a name and it has a pass criteria and the warning criteria, and if none of the criterias here match it will automatically fail- and we want to discuss today or yeah the syntax that we use here, in addition to the logic that we use here as an example, we have here a response time, 95 percentile, and we want to define that it should pass if the relative change between the last couple of values or the last value and the current value is less than ten percent, and we also want to pass if the absolute value is less than 200.

C

In this case, maybe milliseconds. The idea being is that these two criterias are combined with an or so this syntax here would translate into a passive relative change is less than or equal ten percent or the absolute value. Another error here is less than than 200.

C

You could also try to extend the criteria here, which we've done for the warning case. The idea here being is we want to warn as long as the change is relatively small sort of changes between relatively between 15% or minus 8 percent, so that would be a relative change with an upper bound and a lower bound, but we also want to say it needs to be less than 500 milliseconds.

C

So, for instance, this free criteria would be connected with an end, so it would be relative change less than 15%, more than minus 8 percent and in total, less than 500 milliseconds. To see this as an example when this would be true, if we had a response time of 300 milliseconds in the first run and in the second run we had a response time of 400 milliseconds. Let's say like this: this would result in a relative change of 33%.

C

So let me do it like this and there would be too much. Obviously so 33% would be above 15%.

C

It would be in total, less than 500 milliseconds, but this is an end criteria, and this is already above above the 15%. So this would fail, but when wood is not fail, if the first run would be 300, milliseconds and the second run would be 300 10 milliseconds, then this would be a relative change of about what is it. 3 percent I think and 3 percent is less than 15% it's more than -8 percent, and the total time is also less than 500 milliseconds.

C

With this warning criteria, or we can make sure, let's say we have 20 test, runs over the time and continuously this value response time. 95 percentile is rising and at some point it's maybe 490 milliseconds is what we have stored as an average and the new value that we get is maybe 501 milliseconds.

C

Now the relative change of this would also only be a couple percent to 3 percent or something so it would fall into this criteria, but this last criteria it wouldn't be fulfilled anymore, because it's more than 500 milliseconds now and we would send in this case it wouldn't even be a warning in this case it would go into the last part, so it would fail.

C

So this would be a criterion you would combine with end, and this way you combine it with an or obviously there can be different use cases where one or another is more important. We have one use case here. I have a typo here, for let's say you want to count SQL statements. For instance, you could say if the relative change of SQL statements between test runs is exactly zero percent. So it's the same amount every time then this should obviously pass because no SQL statement has been added. None has been removed.

C

That should be okay, but we can also say we want to have a warning if the number of SQL statements has changed slightly so, for instance, less than we got roughly five percent more or less SQL statements, so I will have an example for that in a second, but we still want to say as long as it's less than 100 SQL statements, we don't bother it's it's a pass anyway.

C

So in this example, let's say we had 100 SQL statements in the first run and in the new run in the second run, we have 105 SQL statement. Let's say: 100 fourth SQL statements: that's a relative change of 4%, so obviously it's a change, so this criteria is not fulfilled.

C

The second criteria says it needs to be less than 100 it's 104, so this is also not fulfilled, so it is definitely not a pass. Then we can go and look in the warning criteria and the warning criteria says: oh, it's less than 5% change. So warning is okay. So it's a warning. We will send a warning. What would have happened if this value would have been 98 and this value would have been 99, so we had 98 runs recorded for the last couple of runs. 98 statements, I mean and we know 2nd run.

C

We have now we have 99. This is a relative change from of about 1%, so it would be within the warning criteria. But the second criteria here, which is again connected as an or statement says it is less than 100 SQL statements and less than 1 at SQL statements is automatically a pass.

C

So with that kind of syntax, it allows any combination of end, as well as our statements that we can use for specifying our our lower bounds, upper bounds and thresholds. The same is obviously true. Let's say for security vulnerabilities. What if we only want one criteria, so it doesn't matter. What is what is happening? We do not want any security vulnerabilities detected or if one is detected, we don't want to go on. So we only pass if the number of criteria, the absolute number is zero and nothing else so yeah.

C

These are essentially some of the cases. You can obviously define more than two. So if you say you have a third criteria here that you want to connect with an or statement, let's say I, don't know if it's about five thousand milliseconds again, it's okay, for whatever reason you you wanna say: if it's really really high, then we don't care.

C

It must be there's something behind in the logic that says, if it's more than five thousand milliseconds it's okay, yeah doesn't make much sense in this example, but you could do that if you want all right. So that's basically part of the objectives you can define and the other thing.

D

That Christian can I ask a question here: yeah.

C

D

Thanks for explaining everything, so do I understand correctly. If the past criteria, if everything's evaluated and it's passing, then the warning is not evaluated anymore as.

C

Soon, as we have a pass, then we have a pass and warning doesn't matter anymore. There was the idea, so if you have, if you have criteria that might be overlapping, the first one would almost be passed pass is always stronger than a warning. Okay got it thanks.

C

This also allows so like this change between minus 8 and 15%. This change would also be triggered by this criteria. So if it's less than 10%, it's obviously also less than 15%, so it should always be that the pass is always the stronger criteria.

C

Thanks. Okay, any other questions for the objectives.

C

Or right dick, do you want to continue with the comparison strategy, a single result and, and what is the other one sure, because I think that that one was on your list still or do you want me to show it.

A

Thanks for your explanation around the pass and warning criteria.

A

So you handled the objective scent that we still need to talk about the scoring right, yeah.

C

Left it out on purpose, perfect I can talk about that as well. No, it's okay! So.

A

For comparison SS before we have two different possible comparisons in areas, one is where you only compare with a single value and of course you say compared with single result, and you can compare compare with several results than you would put several results here and you can define filter criteria for previous results. That you want to include in the comparison default is all so, regardless of if the evaluation has passed or has resulted in a warning or has failed for the previous value, it is included in the in the comparison.

A

That is the default value and other possible values are passed where, on the past, evaluations are included or pass or Warner also pass or warn or evaluations that resulted in a warning are included in the comparison. Any okay find and define the number of comparison results. You want to include an t, aggregate function, I think I've done that like ten minutes ago, but there we go again and then we can define objectives, and this is what Christian already presented to us and then it's now it's time to talk about the scores again.

A

So by default we will. We will handle it like that that, if an SLI, if the evaluation of one SLI is successful, it will result in one point: if the result is warning, it will result in half a point and if it fails, it will result in zero points and you have the possibility to define weights so for each SLI. So by default, the weight of each SLI is 1. So the maximum number of points that can be achieved in this example is three, because we have three SL is defined.

A

Now, if I say the security vulnerabilities SLI, is that important for me? I want to have it more weight in the evaluation. I can, for example, just see here wait then it counts twice as much right. So if it, if it passes, it counts for two points, and if it there is a warning, it counts for one point, and this is of course also reflected in the overall maximum score and then the actual score that is calculated from the current evaluation is divided by the maximum score, and that yields the total score of that evaluation.

A

So it's always a percentage and you can then define pests and warning thresholds that are interpreted as greater or equal them.

A

So, for example, if you have 90% it would pass, and if you have 89% it would be a warning and if you have 74% it would fail the overall evaluation and then we also had the discussion, because we had a lot of feedback around that that the weight alone is not enough.

A

Sometimes so it's good that you can actually wait some s.l eyes more than others, but there is the concept of chiesa lies and if one of the key SL eyes fails, then the entire evaluation should fail, regardless of the other results, and to accommodate for that. I think this is missing in the example right now, but let me just write it here.

A

We had the idea of including a simple field that says he SLI, that is by default, false of course, but if you set it to true and the evaluation for this SLI fails, then by default, its failed evaluation.

A

Of course, if you have warning criteria so if, for example,.

A

Let's add it to the SQL statement example here: if the evaluation result of this SLI falls in yields, a warning results, then the the result of the entire evaluation. It's also warning because, one of because the key SLI true statement is set here and I think these are all the details that there are to know about scoring the weights, the chiesa lies and the comparison modes that we have. So are there any questions at that point in time with regards to the service level objectives, definition, I.

C

Think one question that we raised earlier: we can discuss in in in this round if a key sli is defined, should it be only one key SLI or is it okay? If there is multiple key s allies and if there is multiple key s allies should the logic be such that if, let's say, there's four TS allies, one of them has a warning and another one has a fail.

C

Does this mean that the whole evaluation fails because one of the key is allies failed or do all key as allies have to fail or what is the desired? Behavior of a of multiple tiers allies.

A

Maybe someone in the meeting that was not part of the discussion yet can share his thoughts on that.

C

So, let's just not man in the maratha question question number one: should there be more than one key SLI allowed I.

D

Would go for it? Yes, yeah.

E

My yeah, this is Rob I, mean I, think that the key LSL I would probably be you know not.

E

It's a it's a key L SLI. So to me my opinion would be yes support. Multiple, probably people shouldn't and I would say if any one of them failed. The whole thing fails. That would be kind of my opinion.

C

E

Was similar to this.

C

Was similar to what we discussed, we would advise people against having more than one key SLI, but if they had more than one and one of them fails the whole pipeline, what the whole evaluation should fail. That was there was our idea to um because, because you can also.

E

Manipulate the weight to kind of give it a higher priority to so to me, it's there's different ways to achieve it, and then the key LSL I was just kind of a specific use case of if one metric, maybe two was really really important and it failed. It just was a mechanism to force the whole thing to fail when.

C

He actually discussed the use case why the weights are not enough from just to add that, let's say you have not just four metrics but 100 metrics. If you then increase the weights of certain metrics and all the other metrics, all of a sudden become absolutely useless, and therefore the the note the the key sli would be a little bit better for explaining what is happening. That's a very valid point. I agree with you.

A

Are there any other questions, rope you you, you yeah.

F

A

On the proposal, and what do you think of what? What came out of it? No.

E

I think it's very good I really like it. My question would be, you know more remain. Maybe it's outside the spec. You know what you know. What is what's the first implement a is the first implementation gonna be CLI. Rest both is the you know. The data store part of this, presumably the UI, as well as a later later change. You.

G

E

Kind of curious what how it's gonna get implemented, I guess the long story short you.

G

Know where's the implementation strategy, so it will. It will.

A

To a large percentage, follow the user walkthrough we've defined here, so you create a project. You add your secret. If you need it apply uniform. We don't support that yet, but that is also not needed anymore, because we just.

A

Don't work with with the events that we don't need anymore, so it doesn't matter what what services is is sitting behind a topic that is not used and I, think or maybe Johannes or or someone from from the from the development side could answer this question, because I can I think it would make more sense.

B

If you answer that sure yeah I can take over yeah, as you can see right here in the future, you will get support with the CLI.

B

Where you can say, captain sandy went, then you specify the name of the event you want to send to captain and then captain or forwards it to the the component that takes over the responsibility of doing the job, and you also can retrieve events from captain and therefore we have the captain get event come on and you just specify the type of event you want to get from captain and when the event is available, then you get your result yeah through a console output, and you can also use the API for retrieving events from captain and what you just have to specify is the captain context.

B

You want to get the event from and yeah and you you get it coming.

A

Back to Rob's question, so it works both in July and the API correct right.

E

And it's, but it sounds like it's gonna, be in sort of an asynchronous kind of mechanism where I submit my evaluation. I receive a captain context and then I have to use that cop in context and then query for results as a second step. Correct, correct, mm-hmm, that's okay, so there'll be some sort of. So if I'm like in a pipe a code pipeline, you know this scenario of salmon Jenkins, so I, don't I, don't know how in Jenkins, you might do that.

E

You know unless, because there's sort of a risk, I guess of a timeout right, you like hard code, some sort of wait time or will the rest interface be synchronous, so that you know it may take a little while the process. But at least it's a synchronous, rest call.

A

Most likely, not at least not in the beginning, so the reason why it's an SOS call is that it might take several and in fact several minutes to gather all of the SL ice and into the evaluation. So the evaluation part is the quick part, of course, but the gathering of the of them of the slice values takes considerate amount of time which makes way, which is why it makes no sense to make it a synchronous call.

A

Maybe at a later point in time we find a more elegant solution for that, but right now this is the way that we're going to implement that as.

E

Long as sort of that second call has some sort of you know, positive responses are not not an error. Message like you know, results not ready, or something like that. You know, then, that way you programmatically to keep retrying their.

B

Message will tell you that, on the event could be found for this context,.

C

Okay, would a certain HTTP status code be suitable, maybe that we can define it's not ready yet.

B

You will get the 400, meaning that no service is available, but we haven't talked about a specific response code.

E

Considerate I mean I, know we don't decide, but it might be, it might be. mmm You know, there's advantages to.

H

E

Saying like more like, you know that there is an event as opposed to. There is not an event, and it's just not ready. Yet.

E

Because I think that's what we're going to try I guess une, as you guys have it ready like I, know that I will immediately try to incorporate the quality gate and at least two maybe three different pipeline types. You know like Jenkins as your DevOps and then probably Mike villager. We probably tried in concourse with so we'll, definitely want to try it. So will this happen that way we can have good examples out there for how to do it.

E

In the in the any any just one thought I had was with the indicator file itself, do you have maybe have to go back once there's this there's the same concept of a data source? I'm. Sorry, if you already showed this, but we will support multiple data sources like we did before within the indicator, file itself, I think.

A

This is the perfect segue to to Florian's and topic, so maybe he can. He can share the bigger picture of what they were. The lighthouse service actually does and how the different, let's say, SLI providers, then then communicate with lighthouse service and how how the definition of the of also customize allies could work in the future.

A

So I will stop sharing and give the microphone over to Florian all right. Let me just share.

I

I

All right, you should now see sequence, diagram, yeah, all right so since we re implemented the whole logic of how we're evaluating results. Previously, if you recall it, we use the pedometer service and one of the main pain points we had. There was if we would want to add another data source, so, for example, in addition to dynaTrace and Prometheus, we wanted to support, for example, here some other data source like new T's new load.

I

We would have to change the source code of the pedometer service itself, and it includes the required node modules in that case and implement some logic around that that data source.

I

Now we wanted to make this whole thing more flexible and more extensible, so the way it works now is that we have the evaluation service or the lighthouse service will be called, and this lighthouse service will trigger the retrieval of metrics from an external data source by sending a certain type of captain event, which we will look at closely in a few seconds and basically, the new data source. If we want to implement the the data source, will be implemented as an HTTP service.

I

That can receive HTTP requests with a certain format that we're going to look at shortly, and this service is responsible for retrieving.

H

I

Required SLI values so, for example, the error rate, the throughput and response time that we saw in the document earlier did Big Shot. Then it will retrieve the metrics and then, as a result, the data source service should send out another event that contains the values for those sli values and now we're going to look at an example of that of those requests.

I

So in that case, the incoming event that the data source service should be able to process to is of the type captain internal event get SLI and this event will basically contain the type of the decide SLI provider. So in that case it's Prometheus, then it will also always contain the project, the service and the stage a start and an end times. Then, to enable the data source to calculate the duration of the tests and exact timeframe and an array containing the name of the indicators that should be retrieved.

I

So in that example, we have the throughput the error rate and 50% percentile for the response time and then the way the the data source retrieves. Those values is completely up to the to that other source itself. But the only obligation of that service is to transform it in an event that looks like this. So what are the most important properties here? So, first of all, the type needs to be like that, so Sh, captain internal event get SLI done. Just on a side note.

I

This is still under development and subject to change, but we will of course provide a documentation on how to write those evaluation, services or data source services and in the payload of the of the event, the actual values for the retrieved metrics will will be contained. So in that case we have the the name of the metric some throughput. Then the value then an indicator.

I

If the retrieval of the metric or SLI value was successful and if it was not successful, we also have the possibility of including a message that pretty much describes the the reason why it couldn't be received. So in that case, if, for example, we weren't able to retrieve the response time, P 50 as a live alyou.

I

We can send back an object that looks like that, and we also already have an example of one of those data source services available in the captain cantrip organization. So if you go to that or we you will find the Prometheus SLI service and in the develop branch, you will see an example implementation of such a service, along with detailed readme, which describes, for example, also how you can override queries and and define, for example, the parameters endpoint.

I

So if you, for example, need access or the credential of an external service, you could do that in a human at the secret. And if you want to configure certain certain aspects of the implementation of your data source or service, you can, for example, use a config map, as we did here, all right and yeah. Of course, if you, if you want to implement your own service, you can go to this example, but also, of course, contact us directly and ask for help. If you want to contribute datasource service.

E

Yeah, this is really cool. I mean this is this, is rough I mean I, mean I like I, really like how it's flexible and solves that problem of right, new back-end providers coming in without having to recompile like the service, so I think this is a really really good solution here.

E

One question I had was: just maybe: can you elaborate on its basically the the registry to me like there's a registry of the indicator files themselves? You know, is that obviously it's defined in at the moment in the indicator file, but are each of these name indicator names?

E

You know unique to the data source or were you trying to come up with a you know, standard set I.

I

Think I got your question so the way we did it right now or what the current state that you have right now is that we.

H

I

Three predefined metrics, so the error rate, the throughput and the response time that are also described in the in the Google Drive talked about this use case and those metrics have to be implemented by the by the data source service. But we will also allow users to define any type of queries and to extend those.

I

E

Example, you have there like throughput query that that would be the the in. That would be the name of the ID that we would put into the service level objective file right. It would be without the query, yeah.

I

Without the query.

F

I

This is still a subject to change. Okay, on a conceptual level, that's that's correct right, so the other name of the career is our fruit, boot error rate and respond time. In that case, this is not up to date, yet it.

E

In the scenario where I have two data sources that both provide say, error rate- and you can only have.

F

One, how how would you, how would you deal with that or what you're on the way we are doing it right now, is that we can configure one.

I

Sli source per project- oh.

E

I see so for like the standard SLI is you can only pick one and you just have to decide which one you want.

I

Exactly so, for example, if you have two projects or shop, you can configure captain captain to retrieve s alive, at least for this project, either from the interface or from prometheus or from another potential data source. But currently you cannot use multiple data sources within one project, but if you have another project, you can of course configure another data source for that project.

C

And architectural speaking, there is nothing from stopping us to extend this functionality at a later point, so to say that we have a Prometheus data source for metric a and the dynaTrace data source for matrix P, but it is. This is something that requires a little bit more thought and it's not something. We we support right now: yeah, okay,.

E

Yeah I would think that we would want I mean I, guess I was thinking the scenario of just an individual metric like error rate, and you could only have one error rate, but I was but I would think. We'd want to have that ability to say write. Some data comes from, you know, have the ability to say some data comes from some. Some of the metrics come from dynaTrace say some of the metrics come from neo load as an example and to be able to kind of combine those in a single.

E

You know, project service, so yeah I understand version one. So, yes, a version, two yeah future versions. Yeah I definitely see that that that that coming up as a requirement.

C

Yeah and maybe to to add on that, but Florian just showed them the very same or very similar implementation is already working progress for a dynaTrace, datasource yeah. The repo is already there it's just in a future branch right now. This is also still pending. Some.

H

Changes but it will be a configuration map as well.

C

Queries or you can configure dynaTrace to have certain queries and there is obviously the same set of metrics the free metrics already available. That will give it an address for error rate and I. Think this time.

A

All right are there any other questions regarding the topics we've discussed.

A

Because, if not, then we will just end the meeting and if there are any questions just reach out to us through the usual channels, either open a github issue, writers on our slack channel, write us an email call us find us and talk to us. Whatever suits your needs. Thanks for joining and see you in two weeks.

A