Description
Adrian Gonciarz is sharing their journey on how they implemented application resiliency evaluation with LitmusChaos, Locust, GitLab, and Keptn.
Meeting notes:
https://docs.google.com/document/d/1Om9pj16hGKP_w2vUaH-7Cp0ffEIj-Oe3IezeVCpFYAM/edit#heading=h.49tbq0kx1jf9
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on GitHub: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
A
Hi, and welcome everyone to this edition of the Keptn user group. Today we already have a couple of folks joining, and I'm very honored and pleased that we also have a great presentation from Adrian coming up. I've already seen parts of it, and I really love the use case and how Adrian actually solved the challenge of evaluating application resiliency with a couple of different tools.
A
He brought everything together, and he will be explaining both the challenge and the solution to us in the bigger part of this Keptn user group. For today, the Keptn user group is also following the CNCF code of conduct. That basically means: please be nice to each other, please respect each other. If there are any questions, please put them in the chat, and after Adrian's presentation I will open up all the microphones, or allow you to unmute yourself.
A
So we can have a discussion, and please feel free to ask your questions directly to Adrian, or also questions around Keptn and the Keptn project in general. If there are use cases you're interested in, please feel free to share your questions and your thoughts and give feedback. I will again put the link to this document that I'm sharing right now in the chat, so please feel free to also add your name in here, so we can keep track a little bit of everyone in the sessions.
A
Just write your name and your affiliation. That would be great, actually.
A
Yeah, with this, thanks everyone for joining, and please also put this on your calendar. We are doing these kinds of meetings every third Tuesday of the month, and as you already know from the first edition of the Keptn user group, it's more focused on Keptn users: how users are actually taking advantage of Keptn and what main use cases they want to implement. It's therefore not so much focused on Keptn development, like the Keptn developer meetings are.
A
We still do those on Thursdays, at the same time as the Keptn user groups, but we just have these two different kinds of meetings. So with this, I think I'm already handing over to Adrian. I know that you have a short introduction for yourself in your slides, so I will just give the stage to you. Adrian and we agreed that he will also share the slides afterwards, so we will put the slides in here, and we will also put the recordings up on YouTube.
A
We will also put the recording link in here. So thanks already for putting in your names, and I will be handing over now to Adrian. Really, thank you very much; over to you.
A
All good, we can see your screen. And again, if you have any questions, please put them in the chat or in the Q&A, and I will either throw them over to Adrian or we will just discuss them afterwards, after the presentation.
B
Perfect, thank you very much, Jürgen. First of all, thank you for having me here. Thank you for your continuous support, Jürgen and Andy and everyone at Keptn and Dynatrace. That's really a blessing, to have such support when you take on a difficult task you have no idea about. Basically, this is the first story and the first chapter of the journey.
B
I had no idea where it was going to lead me and how it was going to go along the way, but hopefully I was able to reach the first milestone. First of all, when you listen to my presentation, don't treat it as any kind of expert's story, or as huge expertise in this kind of implementation, because for me personally this is the first try at this kind of chaos engineering and resiliency testing. It was a hard but joyful time. So please be aware that this is nowhere near perfection.
B
This is just the first proof of concept, and hopefully a great start to something really important for us. So I will tell you a bit of a story, but first let me turn on my timer, so I don't talk all day.
B
So, what I would like to show you today. This is also a challenge, because I don't really like online meetups, but we are in these difficult times where we cannot meet each other, have a beer, and talk about this. So let me tell a short story about the idea, about the way that led us to a prototype, where we are now, and where we want to go with the idea of evaluating application resiliency, or basically implementing some kind of site reliability engineering at Kitopi. But first, two words about me.
B
Also, as I said, right now I'm a bit more on the SRE side, and who knows where I'm going to go. I'm absolutely obsessed with Python. I never fell in love with Java, but when I learned Python it totally got me, so most of the things I write, I write in Python, whether it's test automation, support code, some SDKs, and so on.
B
I'm also quite obsessed with Kubernetes and cloud. Actually, that's quite funny, because when I was working at a company called Impost, part of whose engineers are now working at Kitopi, that was the first time I heard about Kubernetes, and I was lucky enough to meet the people who basically taught me about Kubernetes, like Sebastian at Kitopi, and Martin.
B
Now we are back on the same track for another journey. I'm also teaching people, mostly in areas related to software testing; I teach at universities in postgraduate studies, and I also do some trainings. Most of all, I'm quite focused on automating stuff, not only test automation, but basically how to automate all kinds of things. We have some guests from GitLab, so I'll tell you that a lot of my problems are solved using GitLab pipelines.
B
Kitopi is a kind of software platform where you can put your restaurant, where you can put your shop, and which serves as a meeting point between people ordering food and the restaurants and companies that deliver food, mostly in the Middle East, but I'm quite sure they have much more to come. So basically you are dealing with quite big traffic related to putting orders into the system, processing those orders, and serving them to different kinds of entities, such as kitchens, restaurants, and so on.
B
And that's where the idea started. We met for a quick coffee with the guys at Kitopi a few months ago, I think it was November or something like that, and we had this idea, or they had this idea, which was sparked by, you know, incidents.
B
So we had this little production issue, and that's how it always starts. As an action point from this little production issue, they said that they would like to, or we would like to, test and improve the resiliency of our system. So we needed to find a way to test it.
B
So we thought it could be done in a modern but quite natural way, and I think for many cloud-native platforms the natural way today means chaos testing. If you are not familiar with the idea of chaos testing, let's just say that this is the kind of engineering that is supposed to introduce some kind of malicious behavior into your system. Malicious may not be the best word; some kind of unexpected behavior, such as network problems, or maybe problems with some other services.
B
Basically, you are introducing some kind of chaos into the services you depend on, so you are simulating, more or less, a real situation you would sometimes face in production. Sometimes one of your services is not available. Sometimes you have a problem with the connection to the database. Sometimes maybe you have problems with restarting your application, and so on. Chaos testing has been around for some time, but I had never been able to put my hands on it, and this time we decided maybe it's going to be a good way to choose for our testing.
B
So we wanted to introduce the unpredictable factor by using chaos testing, but on the other hand, we wanted to deal with, or simulate, production-like traffic. For me as a QA engineer, and maybe for some of you, the natural or intuitive way of simulating production-like traffic is load testing.
B
This is also part of what we call non-functional testing. Maybe most of you will be familiar with JMeter, or maybe Gatling; some of you may be familiar with Locust, which I'm using. But basically, on the client side, or on the production side, we have some load generated via some tool.
B
So the initial idea was very simple: let's generate some load, let's introduce some kind of chaos into our system, and let's see what happens; let's see if we are able to survive this kind of natural state of bigger or smaller disaster. First we sat down and thought, okay, what building blocks would we need for this idea? Of course we had the very rough idea, but we needed to put together the necessary tools, and basically at the very first meeting we established a set of building blocks, the building tools.
B
These are what we would need to achieve this goal, and the list pretty much went like this. We need a source of traffic, so we need to generate load, or stimulate our system the way our users would. Then, since we have the traffic and a working system, we will need some kind of chaos generation. At this point I'm talking about broad abstractions.
B
So, since we have traffic and we also want to introduce some chaos, we need some kind of chaos generator. Also, a lot of applications are storing metrics. So, in order to see how our system behaves, how we can track its behavior, and also how we perform in terms of traffic and in terms of recovering from the chaos, we would like to store those metrics somewhere, and possibly use them after the testing.
B
We would like to draw some graphs, see if we can narrow down our bottlenecks, and see what was basically happening to the system. And then we need some sort of evaluation. So we did the job: we generated traffic, we generated chaos, we are able to graph everything that was happening to the system; but are we able to evaluate it? Was the test green or red? Was the behavior positive or unexpected?
B
And last but not least, we need some kind of runner; we need some kind of tool that will lead the orchestration here. So, going step by step: the source of traffic. For me there was a natural choice, and luckily the QA team there also includes some of my friends, so I'm very happy to have my friends everywhere.
B
We decided to go with Locust, which is a very easy, Python-based load testing tool, very similar to JMeter or Gatling. So for our purposes you can think of Locust as doing basically the same load testing and performance testing as JMeter, pretty much the same in a black-box view. Then, for the chaos engine, the solution suggested by our DevOps team was Litmus.
B
After some research they did, they suggested Litmus and implemented the very first version of the chaos engine using it, and after some time I learned that Litmus is also a tool that the Keptn team is cooperating with, so that was a very natural and good choice. For metrics and graphs, I think most of you know Prometheus as a source of metrics and Grafana as the graphing engine.
B
This is a very popular stack, so I think most of you should be familiar with it. And for results evaluation, I said, okay, we will run these tests and I will find a way to script the evaluation: I will just basically pull some metrics, put them together, and see what the failure rate was and what the requests per second were.
B
I will do it; I mean, it should not be that hard, I will probably just script it, so I will use Python. And for a runner, as in nearly all of our projects, we use GitLab, which would basically be the orchestrating platform in these matters.
B
So we went for a beautiful prototype; I hope your prototypes are better, but the first idea was like this. We thought about it at the very first meeting, and we came up with the idea of three stages of chaos. And by three stages of chaos I don't mean the three lockdowns, or however many lockdowns you had in 2020.
B
It was a bit less heavy than that. These are the three stages we imagined, and once again, don't treat this as super expert knowledge in chaos engineering; treat it as my attitude, or our attitude, the idea we had, and maybe some inspiration, and if you have any comments, go ahead. But the initial idea, and the idea we still have now, was defined by three stages. The first stage would be no chaos: basically, we would like to see how our system performs.
B
What are the results of our load tests? By metrics from Locust, you can think about requests per second, or maybe the error rate for a particular endpoint. These are important metrics, so we'd like to have a baseline: we'd like to run our performance test without any chaos for some given period of time and treat it as our baseline. Then we would like to introduce some light chaos. Now, what does light chaos mean?
B
For example, as I told Jürgen today, you can imagine using AWS in the morning, and then when the USA wakes up and you are using AWS at the same time as the folks from the USA, you see that the response times of AWS are different. So, for example, if you are able to do something within seconds or minutes in the morning, it can take 15 minutes or half an hour in the evening. Also, when you have some kind of natural delay between applications and the database, you have this kind of unpredictability.
B
This randomization, where you don't really know if you're going to connect within 50 milliseconds, 500 milliseconds, or five seconds: maybe this is the light chaos. This is not bad behavior, but some kind of natural randomization, and our expectation for the states of no chaos and light chaos is that, for the users, our applications behave the same. Maybe we will have slightly less throughput, slightly fewer requests per second, but we should not observe any errors.
B
So, whether we have the perfect state of no chaos, or very gentle or light chaos, the natural production conditions, we should not observe errors and we should not observe any drop in our performance. On the other hand, we have heavy chaos. We introduced this concept of heavy chaos, which is the bad behavior: something goes really wrong. Maybe we have really bad network latency. Maybe our database has dropped.
B
Maybe we cannot connect between pods. Then, of course, we will observe failures, but at the same time we should be able to recover from those failures pretty quickly. This is a very important idea for the development of the whole project, or the whole concept: just keep in mind that no chaos and light chaos, for you as an end user, for example people ordering at a restaurant or people using the app in the kitchen, should look the same.
B
For us, for now, as a proof of concept, light chaos is a network packet drop: introducing some latency, some unpredictability in the network. For light chaos it's around 25 percent, light enough not to destroy our system but enough to introduce some randomization. On the other hand, for heavy chaos we selected a quite significant packet drop of 75 percent.
B
That would definitely cause us problems, but we should be able to recover. For the tests and metrics we've chosen Locust, and as an addition to Locust, the so-called locust exporter. We built our Locust architecture in a very easy way that is natural for Locust: we have one master node that orchestrates workers, and the workers are executing the requests. This is done in order to be able to scale up, whether we want to run 10 users, 100 users, 1,000 users, or 10,000 users.
B
We are easily able to scale our workers, and the master node is used to communicate with the exporter. The exporter is the layer between Locust, or the test results, and Prometheus, so our test results are basically stored as metrics in Prometheus. Then I put together a very simple GitLab pipeline in which I would store the artifacts from the tests.
B
So in the first iteration my test results would basically be stored as files containing the results of the tests, and the pipeline looked pretty much like this: I would run some load tests without introducing any chaos and store the results as GitLab artifacts. Then I would run the same load test, but introduce the light chaos, and also store the GitLab artifacts. And then, as the last round, I would run the load tests but introduce heavy chaos, at the very end of my pipeline.
B
A few years ago in Krakow I met Andy from Dynatrace and Keptn, and he was telling us the story about Keptn, the very early stages of Keptn; I think it was around three years ago. Then he came again and told us the story again, a bit more cheerful this time, but I still didn't even move a finger. But now, a few years later, there came the time when I thought to myself: hey, it's time to use this Keptn and see how it goes.
B
I only remembered that there was something called Pitometer, which I think was replaced by some other components, but nevertheless I synced up with the folks at Dynatrace, with Andy, and we thought that a good solution for the evaluation of these metrics would be Keptn. And here is a big shout-out to Jürgen: he's the guy that spent many hours with me, helping me out. That's very, very kind of you.
B
Thank you very much. So after, I would say, a month of R&D, different configuration issues, different stuff, and a learning curve for me, of course, we were able to come up with a new configuration.
B
We defined a Keptn project with three stages. Those of you who are familiar with Keptn know that we have projects, and within the projects we have stages. In a full version of Keptn those stages can relate to deployments to your test stage or production environment, but my use case was more towards the quality gates evaluations.
B
Therefore, I defined a project with three stages that reflect no chaos, light chaos, and heavy chaos. And as you might know if you are familiar with Keptn, or if you are not, at some point you will get familiar with SLIs and SLOs. SLIs are basically the metrics: these are our indicators, how to measure, how to calculate, and what to calculate in order to tell whether our system behaved properly or improperly. And for me, as a QA engineer, in this kind of situation there are two metrics that are important.
B
Those are the two important metrics I use as my SLIs: I defined requests per second for each endpoint I'm querying, or basically testing, and I calculate the total error rate for the period of time I was testing. And here you have an example; excuse the very bad code presentation on the slide, but this is just an example. This is the Prometheus query I use to get the requests per second for one of the endpoints.
B
This is basically the rate; rate is a Prometheus query operator, here applied to the number-of-requests metric exported from Locust. So you can say this is basically the average rate of requests for a particular endpoint; the endpoint is hidden here under those three dots. And of course Keptn provides you with some parameterization: using this duration-seconds variable, you can pass in, during the evaluation of the Keptn metrics, what the duration of your test was.
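To make the shape of that query concrete, here is a small sketch of building and running such a rate() query against the standard Prometheus HTTP API. The metric name `locust_requests_num_requests` and the `name` label follow the locust_exporter conventions and are assumptions here, as is the placeholder base URL.

```python
# Sketch: per-endpoint requests/second SLI via the Prometheus query API.
# Metric name and label are assumed locust_exporter conventions.
import json
import urllib.parse
import urllib.request


def build_rps_query(endpoint: str, duration_seconds: int) -> str:
    # rate() over the test window gives average requests per second.
    return (f'rate(locust_requests_num_requests'
            f'{{name="{endpoint}"}}[{duration_seconds}s])')


def query_prometheus(base_url: str, promql: str) -> float:
    # Standard Prometheus instant-query endpoint: /api/v1/query?query=...
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    # First sample value of the first result series.
    return float(payload["data"]["result"][0]["value"][1])
```

This mirrors what a Prometheus SLI provider does under the hood: one PromQL expression per SLI, evaluated over the test duration.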
B
So for me, the metric is parameterized with some duration in seconds. When I evaluate my test results, I tell Keptn: hey, please evaluate this particular metric for the time of the last five minutes or ten minutes, so the time when the tests were running. Basically, at the very end, this gives me one single number for each SLI. As I said, in my case I have three endpoints, so those are three SLIs, and the error rate is the fourth one, so I will basically get four numbers. And the SLOs are the important part.
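For reference, triggering such a time-boxed evaluation is typically done with the Keptn CLI's `trigger evaluation` command. Below is a hedged sketch that assembles that command for a pipeline job; the project, stage, and service names are placeholders, not the real ones from the talk.

```python
# Hedged sketch: build the Keptn CLI call a pipeline step could run to
# evaluate SLOs over the just-finished load-test window.
import shlex


def keptn_evaluation_cmd(project: str, stage: str, service: str,
                         timeframe: str = "5m") -> str:
    # --timeframe tells Keptn how far back to evaluate the SLIs.
    args = ["keptn", "trigger", "evaluation",
            f"--project={project}", f"--stage={stage}",
            f"--service={service}", f"--timeframe={timeframe}"]
    return shlex.join(args)
```

A pipeline job would run the resulting string once per stage (no chaos, light chaos, heavy chaos).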
B
I have two kinds of SLOs. SLOs are the objectives, the service level objectives. Basically, each stage of your pipeline, each stage of your test, can have different SLI results, and then you want to check, or ask Keptn to check, whether the numbers in the metrics are okay or not. And as I told you previously, I treat my stages of no chaos and light chaos as pretty much the same for the end user.
B
The number of requests per second should be more or less the same, very close, and the total error rate should be near zero. So in those two situations, where we have no chaos and light chaos, there should be barely any difference; I want those states to be similar. Why is that? Because our application should be resilient; it should withstand small variations.
B
If we have some delay on the connection between the app and the database, it should be okay for the application: it should reconnect, it should wait longer, it should retry. It should not immediately return errors to the user. So the light chaos is absorbed. Oh, I'm running almost half an hour, I'm taking a lot of time, so I'll get to the point: no chaos and light chaos are pretty much the same situation, so I expect the conditions to be the same, and then there's heavy chaos.
B
If you want to know more, I sincerely recommend the documentation, but for now let's just say these are our criteria for passing, or for a warning, for this SLI. For the average fail ratio, in an ideal situation I want it to be less than one percent, but if it's less than five percent, it's still not so bad. This is for the stages of no chaos and light chaos; and you can see that the numbers change for the situation of heavy chaos.
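The pass/warn logic described here can be written down as a tiny worked example. The stage names are invented, but the thresholds are the ones from the talk: under 1% fail ratio passes and under 5% warns in the no-chaos and light-chaos stages, while heavy chaos widens the budgets to 30% and 50%. (Keptn itself evaluates these from an slo.yaml file rather than code like this.)

```python
# Worked example of the per-stage pass/warning/fail thresholds.
def evaluate_fail_ratio(stage: str, fail_ratio: float) -> str:
    # (pass limit, warning limit) per stage; heavy chaos is more lenient.
    pass_limit, warn_limit = {
        "no-chaos": (0.01, 0.05),
        "light-chaos": (0.01, 0.05),
        "heavy-chaos": (0.30, 0.50),
    }[stage]
    if fail_ratio < pass_limit:
        return "pass"
    if fail_ratio < warn_limit:
        return "warning"
    return "fail"
```

So a 20% fail ratio fails the no-chaos stage but would only warn under heavy chaos.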
B
It should ideally be under 30 percent, but it's not that bad if it's under 50 percent. So this number changes, but the SLOs are pretty much the same. So, how does this final pipeline look? We have GitLab as our runner, and GitLab is able to communicate with Locust via its API, so I'm triggering my Locust tests via an API call.
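A running Locust master exposes a small web API that such a pipeline call can hit. The sketch below shows one way this trigger could look; the `/swarm` and `/stop` endpoints exist in recent Locust versions (older releases used `hatch_rate` instead of `spawn_rate`), and the host and user counts are placeholders.

```python
# Sketch: starting/stopping a running Locust master via its web API,
# the "API call" trigger approach. Host and numbers are placeholders.
import urllib.parse
import urllib.request


def swarm_payload(users: int, spawn_rate: int) -> str:
    # Form body that Locust's /swarm endpoint expects.
    return urllib.parse.urlencode({"user_count": users,
                                   "spawn_rate": spawn_rate})


def start_load(base_url: str, users: int, spawn_rate: int):
    # POST /swarm starts (or re-targets) the load generation.
    req = urllib.request.Request(base_url + "/swarm",
                                 data=swarm_payload(users, spawn_rate).encode())
    return urllib.request.urlopen(req)


def stop_load(base_url: str):
    # GET /stop halts all running simulated users.
    return urllib.request.urlopen(base_url + "/stop")
```

A pipeline job can call `start_load`, sleep for the test duration, then `stop_load` before triggering the evaluation.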
B
I also have my Litmus, and I am able to run the Litmus chaos engines, so the high latency or low latency, basically introducing light or heavy chaos, via a kubectl command. So basically I'm modifying the deployment in Kubernetes, and I'm activating the chaos this way. I think this is a very similar idea to what Jürgen and the team have for the integration with Litmus, but I'm yet to explore that integration a bit more; I think we more or less agree on that part. And then we have Prometheus as the source of metrics.
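One common Litmus pattern matching this description is to annotate the target deployment so chaos is permitted on it, then apply a ChaosEngine manifest. This is a hedged sketch of those kubectl calls assembled in Python; the deployment, namespace, and manifest file names are invented for illustration.

```python
# Hedged sketch of the "one kubectl command" chaos-activation step.
# Resource names below are placeholders, not the real Kitopi resources.
import shlex


def chaos_commands(deployment: str, namespace: str, engine_yaml: str):
    # Litmus's annotation check only injects chaos into annotated targets.
    annotate = ["kubectl", "annotate", f"deploy/{deployment}",
                "litmuschaos.io/chaos=true", "-n", namespace]
    # Applying the ChaosEngine manifest kicks off the experiment.
    apply = ["kubectl", "apply", "-f", engine_yaml, "-n", namespace]
    return shlex.join(annotate), shlex.join(apply)
```

A pipeline stage would run the first command once, then apply the light-chaos or heavy-chaos engine manifest as appropriate.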
B
So this is not a demo, because we don't have too much time for a demo and I really want to show you the essence, but let me show you how this looks in some screenshots. This is my GitLab pipeline, where I have the three stages defined as separate stages. It's not a very complex pipeline: in each stage I'm only using some curl calls, one kubectl command to activate chaos, and a final evaluation by Keptn.
B
So this is a pretty simple pipeline, but it corresponds to something like this. This is a Grafana graph, and it requires some explanation. As you can see, at some point at night we have three bursts, or three runs. The green line represents the total RPS, let's say the summed RPS of all endpoints, and the red line represents errors seen by Locust. As you can see, for our first burst without chaos we have pretty much no errors, and we have a nice average of around 2.5 RPS.
B
The RPS don't drop too much. Why? Because the endpoints are probably responding within a similar time, but they are throwing errors: they are throwing timeouts, they are throwing server errors. A very important, and yet very concerning, part of this graph is this part here. What happens here? Let's say our tests take five minutes and we are introducing the chaos only for 30 seconds or one minute; the chaos engine, the latency, starts here and then it drops.
B
So we should have the errors only here, and we should not have errors here, and still we see that there are some errors seen by Locust after this. My guess is that after the experiment ends, our application is not able to recover on its own. That's why Litmus is restarting the pods, and when it's restarting the pods, I'm seeing the errors.
B
They are not able to respond to me; they are showing timeouts, and this is wrong behavior. I mean, the result of this experiment should be only these errors here, and here I should go back to the green area, no errors. This is very concerning to us, because in our first try, in our first experiment, we didn't see that; now we are seeing it, and there is something clearly wrong.
B
So Keptn is starting to pay off. And just to show you how it looks in a real situation, I don't know if you are familiar with the Keptn UI, but here, do you see my marker? This is the stage of no chaos, and here we have the SLIs, and as you can see, the average RPS per endpoint is around four.
B
That's why I see a big average fail ratio. And one thing that's concerning here is this: the requests per second are only slightly lower than in the situation of no chaos. It means that we are throwing errors very fast, and this is concerning; it should not be like this. In heavy chaos we should have low RPS.
B
Okay, to the final result; sorry for taking so long. The initial goal was achieved. It was achieved with strong dedication from our team and with great help from the Keptn team, and I'm happy with the results for now. I was also able to submit two requests to the Keptn community.
B
One was, I think, about some scripting thing, so that was quite a funny lesson for me, but the other one was related to something we were discovering at some point. The question was whether you can use an external MongoDB instead of the one provided with the Keptn deployment, and the answer is yes, you can do it very easily.
A
I'm not sure if the network hiccup is on Adrian's side or on my side; please let us know in the chat.
A
Okay, thanks Michael. So I assume that Adrian will hopefully be back again soon; he also had some network issues earlier today. Anyway, thanks Adrian, maybe you can't hear this right now, but thank you so much for this presentation. We were already in the part where you were presenting the results, and I know there was also some outlook for the future, what you wanted to improve. I'll try to pull up the slides on my screen and show you the slides.
B
It's not the perfect way; in the Keptn world we could develop a Locust service, so we could develop a middleware for Keptn to connect directly to Locust and to orchestrate everything via Keptn. If we dedicate a bit of time, we can develop an integration between Keptn and Locust, which I think is yet to be done. But there is already an existing integration for Litmus, so we can use the Litmus integration and then shift to Keptn as the orchestrator, so we don't need to run the calls manually.
B
We can pass the task to Keptn, and then on our side we need to improve monitoring and graphing, and possibly we can integrate with Argo CD. So the future version of this diagram would look like this: GitLab will only pass some commands to Keptn, and Keptn would be the direct orchestrator for Locust and for Litmus, and also the evaluation layer.
B
So the diagram would look like this: we would not have to make all the API calls and kubectl requests ourselves, but we would use Keptn as the first point of contact. And that's pretty much all. Thank you.
A
Thanks so much, Adrian, for sharing all of this. I know that we have the Litmus community here, that we have someone from GitLab here, and we have a couple of folks from the Keptn community here, so please feel free to share your thoughts.
A
I will allow everyone to unmute, so please feel free to just jump in and ask your questions directly to Adrian. One thing that I noticed while everyone is joining in here: you mentioned that you were quite surprised that during the heavy-chaos stage the requests were so fast, or actually that the requests per second did not drop.
A
So I thought, if you want, you could also add an upper and a lower bound in the Keptn quality gate, to, for example, get a warning if the requests do not drop when you expect them to drop. You could do this so you get the evaluation result; you're not using it for the complete CI/CD part, but just for quality gates. This is something I was just thinking about during the presentation.
A
Hey Jürgen, hi Adrian, this is Karthik from the Litmus team. That was an awesome presentation; thanks for taking the time to describe the use case and explain how you went about this. I'm really interested in, I think, the Litmus service that we are working on. Are there any suggestions you have, Adrian, on how you would like to see Litmus more integrated into Keptn? We were having some discussions around using Litmus for some more use cases than are actually being used already.
B
Don't get me wrong: my development right now was focused on implementing the solution, and now that we have a working solution, it is time to do the actual R&D. For now we have, as I said, very simple chaos definitions, these packet drops, and we should improve them. I think now is the time to get into Litmus a bit more, learn about the experiments, learn about the possibilities, and see how our system behaves. Also, I'm yet to see the Keptn integration with Litmus.
B
Unfortunately, for now I don't have any advice, but if we can keep in contact, then, as Jürgen knows, I like to give suggestions, so I'm definitely up for it. I'm really happy to see what's going to come in the future.
A
Great, yeah, looking forward to that, thanks. Thank you for sharing. I think that's a great use case where the Litmus service is used in a real-world scenario, where we can really learn what is needed.
A
So with the current implementation we can already basically fulfill the last part, or the last image you had, where Keptn is doing this orchestration. What's still needed is the Locust service. So if anyone here in this group is working with Locust, has experience with Locust, and wants to join us in building this Locust integration into Keptn, please step up and get in touch with us. We will be starting on this implementation quite soon.
A
As we've seen, it might be helpful if there are issues going on: having an integration between the test execution and the orchestration means we directly know what was actually going on, and we have kind of a bridge, a central hub, where we see the information, whether tests have been executed, and what the result was.
B
Yeah, I think that needs one explanation here, for those of you who are familiar with Locust: Locust has basically two modes of execution.
B
So the first mode is the UI mode, where we deploy our test script, or load the script, and basically run everything from the UI; the other mode is when we run it from the command line. This UI mode is similar to the JMeter UI mode, but with the JMeter UI you are using an application, a client, whatever, and here you have a website, a standalone website. The other way to run it is just from the command line, a run for a particular time. So these modes are slightly different. So with the UI mode:
B
You are always running the website somewhere in your system, and then you can just start a run whenever you want and manually stop it: manually run it, manually stop it. With the command-line execution you say, okay, let's run this with 100 users for five minutes. We are using the UI mode, so we have Locust deployed somewhere, and whenever we want to test performance we can do it manually, but we are also querying, or starting, the test in a bit of a hacky way.
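For reference, the two modes described above map onto Locust's command line roughly as follows (flags as of Locust 1.x); the locustfile name and target host are placeholders:

```shell
# Headless / command-line mode: fixed user count, spawn rate, and duration,
# no web UI involved.
locust -f locustfile.py --headless \
       --users 100 --spawn-rate 10 --run-time 5m \
       --host https://my-service.example.com

# UI mode: starts the web interface (http://localhost:8089 by default);
# runs are then started and stopped manually from the browser.
locust -f locustfile.py --host https://my-service.example.com
```

The headless form is what a one-shot CI job would use; the UI form is the always-on deployment described in the talk.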
B
So that's the low-hanging fruit. Why? Because if you have this running instance, you are using the same instance that has the exporter that is exporting metrics, whereas if you run from the command line, you would need to start pods in Kubernetes ad hoc and connect them to the exporter, and that is more troublesome.
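The "hacky way" of starting a test on an always-running UI-mode instance usually means POSTing to the web UI's `/swarm` endpoint (and hitting `/stop` to end a run). A small Python sketch under that assumption; the Locust URL is hypothetical, and the parameter name `spawn_rate` applies to Locust 1.x (older versions called it `hatch_rate`):

```python
import urllib.parse
import urllib.request

LOCUST_URL = "http://locust.example.com:8089"  # hypothetical UI-mode instance


def swarm_payload(user_count: int, spawn_rate: int, host: str) -> str:
    """Build the form body Locust's web UI expects on POST /swarm."""
    return urllib.parse.urlencode({
        "user_count": user_count,
        "spawn_rate": spawn_rate,  # hatch_rate in Locust < 1.0
        "host": host,
    })


def start_test(user_count: int, spawn_rate: int, host: str) -> None:
    """Kick off a load test on the running UI-mode instance."""
    data = swarm_payload(user_count, spawn_rate, host).encode()
    urllib.request.urlopen(f"{LOCUST_URL}/swarm", data=data)


def stop_test() -> None:
    """Stop the current run via the UI's /stop endpoint."""
    urllib.request.urlopen(f"{LOCUST_URL}/stop")
```

Because the instance stays up, the same process keeps serving its metrics exporter, which is exactly the advantage described above.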
A
The details, so it's very interesting to see that there are different approaches to it. Also with the approach we took for integrating the litmus-service into Keptn, or developing it, we decided to go for one approach. We've implemented it, it's already out there for everyone in the community to use, and we will also be improving it over the next weeks.
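For context, wiring a service like this into a Keptn project comes down to declaring a stage whose events the integration reacts to. A minimal pre-0.8 shipyard sketch; the stage name and strategies are illustrative, not the exact setup from the talk:

```yaml
stages:
  - name: "chaos"                   # illustrative stage name
    deployment_strategy: "direct"   # deploy straight into the stage
    test_strategy: "performance"    # performance tests run in this stage
```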
A
So it's always great to see some more use cases that might be needed, and it's great to see that there's already some demand out there. If there are more questions, please just ask them away, and if not, then I would like to thank everyone.
A
Maybe I'll just share my screen again to give a little bit of background on what I'm talking about. I've put together a list of attendees, the ones that I could see; if you're not fine with having your name out there, please go ahead and remove it again, that's totally fine. Thanks, Adrian, for doing this presentation on how you did the implementation; it was really great to see all these things coming together. If it's fine for you,
A
I would also love to put the link to the slides in here, so that if someone wants to take a look again, maybe at some of the details, they can do so. Once the recording is online, I will put the link to it in here as well.
A
If there are no more questions, then thanks everyone for joining. We will have the next Keptn community meeting again on the third Tuesday of the month, which will be, and I actually have to take a look at my mobile to check when the next third Tuesday of the month is, February 16th. So February 16th is the next edition of our Keptn user group.
A
In the meantime, we are hosting our weekly developer calls on Thursdays, same time as today. If you're working with Keptn, if you're more on the developer side, or if you want to build integrations on Keptn, please join us on Thursdays.
A
As a last update, we just released Keptn 0.8 as an alpha release. Please give it a try, give us feedback on how you like it, give us feedback on the new features, and give us feedback if something is not working as expected. It's an alpha release, so we do not recommend overriding your stable Keptn installation and using it in production yet.
A
Please do not do that, but if you have some instances that you're developing or experimenting with, we would really appreciate it if you give the new version of Keptn a try and give us feedback on it. You can reach us on Slack, on Twitter, in the Google group, on all the channels that you already know. So thanks everyone, thanks Adrian, and have a great rest of your day. Usually after these kinds of events we would go for a beer.
A
Great, so thanks everyone, hopefully see you all again next time.