From YouTube: Keptn Community & Developer Meeting - April 15, 2021
Description
Community Rockstar Celebration 🚀
Meeting notes: https://docs.google.com/document/d/1y7a6uaN8fwFJ7IRnvtxSfgz-OGFq6u7bKN6F7NDxKPg/edit
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on Github: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
Join our Keptn community newsletter: https://keptn.sh/community/newsletter/
A: Hi everyone, and welcome to this week's community meeting of the Keptn CNCF project. Today is April 15th, and I have to admit we are a little bit late for a celebration, because usually we do it at the end of the quarter and we are already at the beginning of the next quarter. But anyway, I'm really excited to celebrate something today. But first let us walk through the agenda, as always.
A: First part: everyone who is joining, please add your name to the attendee list. And we are following the CNCF code of conduct here; I want to mention this at the beginning of each meeting. That basically means: please be nice and excellent to each other in this meeting.
A: Taking a look at the agenda, there are actually no follow-up action items from last week. We moved everything that we had to discuss into PRs and issues, and the discussions are going on there, so they don't have to be discussed in this meeting. There is no new blog content out there so far.
A: If you see any new blog content out there, just let us know, and we're happy to announce it here in this meeting. But there are two new video tutorials out there. One was just released today: a short video on Keptn in a Box, how you can use it, and what it basically gives you in terms of flexibility and a very fast Keptn installation, with a demo project already on board and everything ready for you to experiment with.
A: The other one is the Keptn resilience evaluation with Litmus Chaos and Keptn; this one was recorded earlier this week, and both of them are also linked here.
A: I will also make sure to have the tutorials linked here, so whether you are following the tutorials or just want to take a look at the videos, you will find the content here. Thanks everyone for joining, and with this, without any further ado, I want to announce our community rockstar of the first quarter of 2021, and he is also joining us here.
A: It's Adrian, and he did a lot. Thanks for joining, and thanks for all your contributions to the Keptn core, where you helped us enable custom MongoDB deployments, not only the one that comes with Keptn, making Keptn more flexible and, let's say, enterprise-ready by allowing custom MongoDB deployments.
A: You also helped with detecting the right version, with the version check on get.keptn.sh, and you already did a great implementation of Keptn at the organization you're working for, where you have Keptn quality gates as an evaluation of resilience tests. These tests are a combination of Litmus Chaos and Locust performance and load tests. But for now:
A: I really want to give you the stage to present the contributions you have already made to Keptn. I think everyone here is really excited to learn what you've done. I guess most of the folks joining might already know you from the Keptn community, but nevertheless, I'm really excited to see it again. Thanks so much for contributing to the Keptn project and for being part of the Keptn community; we really appreciate it.
C: Thank you very much. I have one piece of bad news: our VPN today is not working too well, so I'm sorry for the slides being reused from the previous presentation. But first of all, thanks for this amazing award. I'm really humbled to be part of such a great community and to have this badge. But of course it's not just a badge.
C: It's a sign that you welcome me as part of the team, of the community, and that's something super important for me. Unfortunately, right now the time is quite crazy for me, because I'm going through some transformations. As a spin-off from the Keptn community, I made the decision to go back to being closer to QA things.
C: So I merged my paths with Kitopi. Kitopi is the company we were working with on the integration, and one of the reasons why I joined them is probably also Keptn, so that's a good thing. The other, let's say, spin-off from this collaboration is that they're going to use Dynatrace, so we are going to have even more common points. As Jürgen mentioned, there are two, let's say, smaller things I don't even have on my slides today.
C: These are two smaller things where contributions were made along the way during this project of introducing Keptn at Kitopi. We realized that it should be super easy to use your own MongoDB instance, so, collaborating with some other people in the Keptn project, we were able to give people the possibility to use their own existing MongoDB instance instead of spinning up the one that was automatically deployed with Keptn.
C: Also, we learned that it's very hard to use AWS DocumentDB: even though it's advertised as MongoDB-compatible, for usage with Keptn it's not so close to MongoDB. So that's also a lesson learned. As a second thing, I was able to provide some simple validation for the Keptn CLI version selection.
C: So these are the two things, but I'm also proud of them, and it's very nice to work in this community. The main thing, though, as Jürgen mentioned, was introducing a very simple pipeline. Very simple for now because, as I mentioned, I'm still, you know, a side man at Kitopi, like a session player, but from May on we will be working full throttle, so I think we will develop it further.
C: For those who haven't seen it: as the outcome of our work at Kitopi, we were able to come up with a project, or a pipeline, that basically has three stages. We think about those stages as progressive chaos intensity, so the stages represent different, let's say, different conditions in your environment. The first stage would be like a perfect world, where you have near-zero chaos.
C: The second stage would be a little chaos that you introduce into your system, and we consider this light chaos. That means we want it to be something that our system is able to handle: let's say a daily dose of chaos that you can still withstand. And the third stage would be a more dramatic chaos that would most likely cause our system to fail, but we would be able to recover from it.
C: So we have the stages of no chaos; light chaos, so everyday work, daily things; and some more serious failure, which would of course lead to some real failures in the system, but from which we should also be able to recover. And we defined SLIs around requests: for now it's just the requests per second for each endpoint. For now we are monitoring four endpoints. Sorry, I should have put the source of my test data in my presentation.
C: As the source of the test data we have Locust tests, which are basically like JMeter tests: a load/performance framework that generates traffic. The traffic is generated for some particular, meaningful endpoints in our system, and of course, as with any load test, we can steer the number of users, the time between requests, and the endpoints we are hitting. For us, these are, let's say, the indicators that we take into consideration.
C: To make a long story short, because this is in my presentation and I don't want to go too deep: we are able to put the data related to the load tests into Prometheus, and from Prometheus we are able to say, okay, what was the average requests per second for each endpoint we are monitoring in a particular stage, and what was the total error rate, or, let's say, the average sum of the error rates across all those endpoints.
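The kind of PromQL queries behind these SLIs might look roughly like the following sketch; the metric and label names (`locust_requests_total`, `locust_requests_failed_total`, `endpoint`) are assumptions for illustration, not the actual metrics used in this pipeline:

```promql
# Average requests per second for one monitored endpoint over the test window
sum(rate(locust_requests_total{endpoint="/orders"}[5m]))

# Total error rate across all monitored endpoints
sum(rate(locust_requests_failed_total[5m]))
  / sum(rate(locust_requests_total[5m]))
```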
C: We assume that the state of no chaos, so the perfect situation, and light chaos, so everyday perturbance, should be pretty much the same. We should not see much difference in our system's behavior, so the requests per second, whether there is no chaos or light chaos, should be pretty much the same, and of course the error rate should be near zero, or zero.
For heavy chaos, we are expecting the system to fail, but we are expecting it to recover. Since we are running our tests for, let's say, five minutes, and the period of chaos lasts only, let's say, one minute, we will have a total failure, maybe 100 percent, during that one minute (or one minute with some buffer), but the average failure rate and average requests per second over the whole run would not be 100 percent errors and zero requests per second; it would average out to maybe 50 percent.
C: Maybe 30 percent; in our case it's more or less 30 to 40 percent, and that's acceptable, because, checking the results at the end, we know that the system has recovered and it was not a total disaster during this time. So in our case the objectives are pretty much like this.
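The averaging effect described here can be sketched with a little arithmetic; this is an illustration of the reasoning, not code from the actual pipeline:

```python
def average_error_rate(total_s: float, chaos_s: float,
                       chaos_rate: float, base_rate: float = 0.0) -> float:
    """Time-weighted average error rate over a test window that
    contains a chaos period (e.g. 1 minute of chaos in a 5-minute run)."""
    normal_s = total_s - chaos_s
    return (chaos_s * chaos_rate + normal_s * base_rate) / total_s

# 1 minute of total (100%) failure inside a 5-minute run averages to 20%;
# with some buffer time around the chaos window, the observed 30-40% is plausible.
print(average_error_rate(300, 60, 1.0))    # 0.2
print(average_error_rate(300, 105, 1.0))   # 0.35
```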
C: This is only to show you, of course, the SLO file, or the definitions; you already know them. This is the case of no chaos / light chaos. In our case, if the average RPS per endpoint is above 1.5 requests per second, it's the ideal situation, and it's still not that bad, meaning a warning, when we have at least one request per second for a single endpoint. For the fail ratios, as I said, the criteria is what we have set for our situation.
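A Keptn SLO file expressing these thresholds might look roughly like this sketch; the SLI name `rps_orders` and the total-score values are hypothetical, and only the 1.5 / 1 RPS criteria come from the talk:

```yaml
# Sketch of a slo.yaml for the no-chaos / light-chaos stages.
# "rps_orders" is a hypothetical SLI name for one monitored endpoint.
spec_version: "1.0"
comparison:
  aggregate_function: avg
  compare_with: single_result
objectives:
  - sli: rps_orders
    pass:
      - criteria:
          - ">1.5"    # ideal: above 1.5 requests per second
    warning:
      - criteria:
          - ">=1"     # still acceptable: at least 1 request per second
total_score:
  pass: "90%"
  warning: "75%"
```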
C: For the stage of heavy chaos, we are expecting the RPS to drop, even to 0.2, and in the worst-case scenario we can, to some point, accept a warning with 50 percent failures. But as I said, it's not going to be zero requests per second, so it's not going to kill our system totally, and of course it's not going to cause a 100 percent failure rate. And now the final architecture; sorry, maybe I should have put it as the first slide.
C: When the tests are started via kubectl, we are starting Litmus Chaos, which is also putting some metrics into Prometheus. Once the chaos is done and the tests are over (in our case it's just waiting five minutes for things to finish and then stopping the test), we are asking politely: dear Keptn, can you please calculate the result of our run?
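The manual flow just described (start the chaos, wait out the test window, then ask Keptn for an evaluation) might look roughly like the following sketch; the manifest name and the project, stage, and service names are assumptions for illustration:

```shell
# Start the Litmus chaos experiment (hypothetical manifest name)
kubectl apply -f pod-delete-chaosengine.yaml

# Let the 5-minute Locust run (with its 1-minute chaos window) play out
sleep 300

# Politely ask Keptn to evaluate that timeframe
# (project/stage/service names are made up for this sketch)
keptn trigger evaluation --project=resilience --stage=light-chaos \
  --service=orders-api --timeframe=5m
```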
C: As I said, for now we are doing it pretty much manually: the tests and the Litmus experiments via kubectl and, of course, Keptn via API calls, the Keptn CLI or rather the Keptn API, but I hope that's going to change soon. Nevertheless, we are observing very nice results. You can think about the pipeline like this: the first stage would be the Locust run without any chaos, and then evaluating that stage in Keptn; then the same again as a second job.
C: You can see that the pipeline takes pretty much up to 20 minutes. Right now we are also using a warm-up stage: we learned that, since we are testing our staging environment, our system needs a bit of time during the night to react, so we use a warm-up stage of three to four minutes.
C: We run some gentle load for the brokers to start up, for the database to open its connections, and so on, because, you know, after some time the staging environment just goes to sleep.
C: Then, as you can see, this is the stage of no chaos, so you see a pretty much perfectly even line. In the second stage we have some perturbance during the chaos phase, which is one minute, and then everything goes back to normal. And in the third stage we have some trouble right at the beginning, because the chaos is starting; then the failure rate rises and, of course, the requests per second stay high, because our system is responding too fast, with errors.
C: So we have errors and, of course, very fast, or too fast, responses with 504s and 503s, and then at the very end we again have a very high failure rate. So for now we are not able to quickly recover from the heavy chaos. It shows that, unfortunately, the pods need to be restarted after the chaos; that's why we have the heavy failure rate right here.
C: For those of you who are familiar with Litmus Chaos: you know that after the tests, if the pods are not going back to normal, Litmus will restart the pods automatically, and that's what happens here. Basically, the pods are getting restarted here, and that's why we have this high failure rate; that's something that needs to be addressed. And as far as the pipeline goes, or sorry, the stages in Keptn: as you can see, we have three different stages, including the light chaos stage.
C: So that's something that needs to be addressed, but this is on the developers' side. So far we've had over 1000 runs; we are running the pipelines every two hours, every night, and we are observing pretty stable behavior of the system.
C: We don't yet see any big peak changes from deployment to deployment. So, next steps: first of all, get back on track with this a bit more, because, as I said, this doesn't get as much attention as I would like it to have. There is also the topic at Kitopi of switching to Dynatrace: we will not be using Prometheus for much longer, we will be moving to Dynatrace and using Dynatrace metrics, and we will be exploring the possibility of using quality gates in tests, or maybe rather using quality gates together with tests.
C: Oh, that's going to be a perfect description. Basically, for some of our tests, especially the API tests, we are able to define a smoke test suite (as you probably have as well), and we would like to couple it with some quality gate definitions and use it as an indicator of a successful or unsuccessful deployment. Okay, I think I went way over my time, so thanks for this wonderful ceremony, and if you have any questions, feel free to ask now.
A: Cool, thank you so much, Adrian. I would actually be interested in this, or maybe, if there are some other questions, I don't want to go first. Okay, so I'll just start. On slide number nine, for example: is there any different behavior if you have more, let's say, pause or idle time in between those phases?
C: Yeah, we didn't explore the times before the chaos. That's a very good comment and something we need to address: there might be a change if we, you know, make a bigger buffer between those gaps. Very good comment, but I cannot comment on that yet.
A: Okay, cool. And I just saw from the screenshots that they are still from a previous version of Keptn, so make sure to upgrade to the latest version. You then also get the new sequence screen; with Keptn 0.8.1 this overview has improved.
C: Yes, we are still using an older version in our system. There are a lot of topics to address and a very small time frame; as I said, I'm still waiting for the full transition.
A: Yeah, I get it. The important part is that the quality gates are also working for you in the current version of Keptn.
C: I mean, for me, let's say, the SLO calculation. As I said... actually, I don't think I said it before: a few years ago I had a very similar task of basically evaluating the statuses of some, let's say, API responses. I built a very similar tool on my own; sorry, not a very similar tool.
C: I built a small portion of the logic that is behind the SLO evaluation in Keptn, and I know how this logic maps, and how complex calculating the percentages and the different values for the indicators can get. So really, for me, this is the winning point of Keptn: that with a very simple domain-specific language, the SLO definitions, we are able to define something very powerful. That's for me personally.
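The core of the evaluation logic he is praising can be sketched in a few lines; this is a toy reimplementation of the idea (checking a value against Keptn-style absolute pass/warning criteria), not Keptn's actual code:

```python
import operator

def evaluate(value: float, pass_criteria: str, warn_criteria: str) -> str:
    """Toy evaluation of one SLI value against criteria strings such as
    ">1.5" or ">=1". Returns "pass", "warning", or "fail"."""
    ops = {">=": operator.ge, "<=": operator.le,
           ">": operator.gt, "<": operator.lt}

    def check(criterion: str) -> bool:
        # Try two-character operators first so ">=" is not parsed as ">"
        for sym in (">=", "<=", ">", "<"):
            if criterion.startswith(sym):
                return ops[sym](value, float(criterion[len(sym):]))
        raise ValueError(f"unsupported criterion: {criterion}")

    if check(pass_criteria):
        return "pass"
    if check(warn_criteria):
        return "warning"
    return "fail"

print(evaluate(1.7, ">1.5", ">=1"))   # pass
print(evaluate(1.2, ">1.5", ">=1"))   # warning
print(evaluate(0.4, ">1.5", ">=1"))   # fail
```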
A: Because you just talked about the maths, and you said you don't really care about it: is there any, let's say, statistical comparison or any feature missing from your perspective that you would like to see in the Keptn quality gates? Maybe comparing not only to the previous run, but some kind of more complex calculation? Or is it fine as it is right now?
C: No, I think it's okay. You know, time series comparison is always a problem, basically, because we are limited to simple statistics: we are not doing any rolling average, any anomaly detection or whatever. So really, if there is something I'm expecting, it would be more anomaly detection, but I know this comes, for example, with Dynatrace. This is also one of the reasons why we are going for Dynatrace: because we want to be able to do anomaly detection.
C: As far as simple statistics go, you cannot do anything better, because this is all you need: compare to previous results, maybe the last result, and that's all. You have powerful functions in the Prometheus queries; if you are using open source, Prometheus gives you all the queries you need. You just need to be, you know, dedicated enough to go through all the documentation.
A: How many SLOs?
C: I think we still have four SLOs. No, four or five, sorry: it's four endpoints plus the average total error rate, so it's going to be five. For now it's a very small fraction, but it's a usable fraction.
A: Cool, yeah. And you already said more than a thousand runs of the quality gates to evaluate this nightly; that is pretty cool.
A: Per deployment, cool. Are there any other questions?
A: If not (we still can take some, but if not), then I would like to thank you again, Adrian, for your work, your presentation, and also your dedication to the Keptn project. I think now it's also official.
A: Here's our first community rockstar of this year. I'll link to your GitHub account here and also mention the latest video you did; really cool. There is a small gift; it's in the post, so hopefully it will be shipped quite soon. Thanks again for your contributions. I think we have covered everything from the agenda today.
A: If there is nothing more, then I think we can close a little bit earlier. For next week we will focus on a new release of Keptn, and the Keptn core team will take ownership of this meeting again.
A: Otherwise, we also have time to chat in the Keptn community Slack. Thanks everyone for joining today, have a great rest of your day, and see you all next time. Bye.