From YouTube: Progressive Delivery with Keptn
Description
In this talk Brad McCoy (CNCF/CDF Ambassador, Head of Cloud Engineering at Moula) and Adam Gardner (Automation Architect at Dynatrace) will talk about how to use Keptn to perform progressive delivery and how Keptn can help us in CI/CD pipelines.
Meeting link: https://community.cncf.io/events/details/cncf-keptn-community-presents-progressive-delivery-with-keptn/
This talk is a recap of the Cloud Native Guatemala Meetup in Spanish: https://community.cncf.io/events/details/cncf-cloud-nativegt-presents-progressive-delivery-con-keptn/
B: I'm still playing with the stream, but yeah, you're working fine.

A: Hey everyone, hey Brad. If you can make Brad a presenter, maybe.

B: Yes, yes, sir! I'm still waiting. Okay, so Brad, welcome to the panel. Oh yeah, I made him a host; that should be okay, too.
B: So, hi everyone. Should we begin, or should we wait an extra few minutes? Yeah, we will.
B: So I do not have a slide deck for today. I will just do a quick introduction and then I will hand it over to you.
A: I promised Brad that I'd just hand across to him and wouldn't speak so much this session. So whenever you're ready, Brad.
D: It's not working, it's not letting me do it, so I'm just going to have to go without it. Okay, well.
B: Hi all, welcome to the Keptn user group. We have our first ever user group meeting in the APAC time zone; thanks a lot to Brad McCoy and Adam Gardner for joining us. Today we have a presentation about progressive delivery with Keptn, and during this presentation you will learn a lot about how Keptn is used in the field. After that, please stay on the call: we will have an opportunity to discuss anything about progressive delivery with Keptn, or Keptn in general.
B: So, once the official part of the presentation is over, we will have a discussion, and everyone can get voice permissions so that we can have a live discussion with everyone. For Q&A during the presentation, please use the Q&A tool: I will be monitoring the Q&A in the chat, and we also have YouTube Live today, so I will be tracking questions there. This is our first attempt with YouTube Live, so apologies in advance if something goes wrong, but yeah, thanks a lot for watching us today. So...
D: So, Adam, would you like to introduce yourself?
A: Yeah, Adam Gardner. I'm one of the Keptn contributors; I've written a fair few Keptn services, and I'm starting to get involved in the core microservices as well, which is interesting. It's a good project, I enjoy it, and the thing I like most about it, obviously, is the community: everyone's very, very friendly and open to help.
D: Yeah, and my name's Brad. I guess I unofficially work with Adam, working on open source. It's pretty lucky that we both have the same interests and passion; I find myself talking to Adam sometimes more than to other people at my work. We've been working a lot with Argo CD, Keptn and Ortelius as well, so how I came about Keptn was that we had a use case.
D: Well, I want to have this in my company as well, but we had a use case where there's a CD Foundation project called Ortelius, and they were looking for a GitOps solution, and they needed other tools and capabilities to help them achieve that goal. So, essentially, what we wanted was a GitOps solution where you can deploy using the GitOps paradigm, and the thing we started to see is that we needed a little bit more. So we wanted to do more.
D: You know, testing: the quality gates and remediation were the first things that got us interested in using Keptn as well, and then, as we used it more, we realized that...
D: It's like a Swiss Army knife for DevOps. We love the fact that you can bring your own tools, and it's really not opinionated; that was something that really got my interest in it. So yeah, today I think we'll share a lot of the journey that I've had using GitOps.
D: The first step that we had was obviously... and it looks like I can't share my screen either.
B: Yeah, let's try, yeah. Usually you just need to get the permissions.
B: I started a meeting notes document, so everyone is welcome to contribute. As you can see from the history, we are not that great about doing these notes, but I like doing them, so I'll take care of that, and please feel free to contribute.
A: But I may as well describe what he's going to describe. So, on the one hand, we've got Argo. Obviously, we've got a GitHub repo with our code in it, and we've got Argo looking at that continuously, basically. So any changes you make in Git are built and reflected in Argo, then Argo is going to do the deployment, and then, once the deployment is done... ah, there's Brad, so he can talk through it now. There we go.
D: So here, the first step of our journey was to learn Argo CD. I'll talk about this more, but it was a good journey in learning, because we wanted to use an automated approach to this as well. So what I have is a DevOps cluster that I can get running at the click of a button: I'm using Terraform to provision it. This particular one runs in Azure for Ortelius, but we can run it in GCP or AWS as well.
D: So what it does is it will run the Terraform code, and then... I first started using Sealed Secrets. For those that don't know about Sealed Secrets, it's a way that you can declaratively manage a secret, because you can encrypt it against the cluster. After that, we found that it was still a bit imperative, because you had to run a CLI command. I think last month, though, they changed it to integrate with Helm, so I have to check that out. But at the moment I'm now using External Secrets, which is sort of the same, but I can leave some of my secrets in the Key Vault that the Terraform spins up, and then I can reference that as an external secret.
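For reference, an ExternalSecret that pulls a value from the Azure Key Vault that Terraform provisions might look roughly like this, using the kubernetes-external-secrets CRD; all names and values here are illustrative, not from the talk:

```yaml
# Illustrative sketch only: pull a secret out of an Azure Key Vault
# (provisioned by Terraform) instead of committing a SealedSecret to Git.
apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: registry-credentials
spec:
  backendType: azureKeyVault
  keyVaultName: devops-cluster-vault   # hypothetical vault created by Terraform
  data:
    - key: docker-password             # name of the entry in Key Vault
      name: password                   # key in the resulting Kubernetes Secret
```

The controller then materializes a normal Kubernetes Secret in the cluster, so the Git repository never has to hold the secret value itself.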
D: So the cool thing about Argo is that, once you install Argo, Argo can take care of itself: it can actually change itself and manage itself, which is really cool. In this case I'm using Azure Front Door as the entry point into this particular cluster; the main reason is that if I want to change the DNS records I have to ask the Linux Foundation, and that can be annoying.
D: So here you can see, this is Argo CD. They have a concept of apps, and this is actually using a new one called ApplicationSets, which is fairly new. We started off running this from the dev branch, but it's recently maturing, and it's coming into the Argo CD releases as well.
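For reference, an ApplicationSet that stamps out one Argo CD Application per tool might be sketched like this; the repository URL and element names are hypothetical:

```yaml
# Illustrative sketch: a list generator produces one Application per
# entry, so each add-on (including Keptn) shows up as its own app.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: devops-tools
spec:
  generators:
    - list:
        elements:
          - name: keptn
          - name: linkerd
  template:
    metadata:
      name: '{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example/devops-cluster.git  # hypothetical repo
        targetRevision: main
        path: 'apps/{{name}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{name}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```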
D: So this is a good way to learn Keptn as well, because it can be a harder thing to understand. I found that being able to visually see what's actually going on sort of helped me to...
D: So, essentially — do you want to talk a little bit about the components? — I'm happy to, yeah. So we have the concept of the bridge: essentially that's the UI, which we'll show you later on. And we have the approval service, the lighthouse service; these come with the standard Helm chart, and then the cool thing is that there's a great community with add-ons.
D: The Jira service, I think Trivy — there's more coming every day, and you can see here that we actually have the Ortelius service, which we're working on at the moment. That's written in Golang; it uses this template — it's like a fork of the template — and then we can update the logic in the Go to accept cloud events, to talk to Keptn.
D: You can also use things like the webhook service, and I think there's a generic script service — I don't know exactly what it's called. But essentially, if you don't want to build your own service, you can use easier ways to talk to other services. So yeah, as we said, this is the lighthouse service.
D: That's used to interact with tools like Dynatrace for the quality gates, etc. We have the MongoDB datastore, the remediation, and yeah, you can just see there are a lot of service events.
D: So one thing when I first started was that I wanted all of this to be shown here as an app, because I wanted GitOps; I didn't want to be doing a kubectl apply for my ingress. So I then did a pull request to get them to add the ingress to the Keptn chart, and then I found that I actually wanted more add-ons as well. So that brought me to a point in my research where I started using Kustomize with Helm.
D: Yeah, not to confuse it too much, but this is using Kustomize here. So you can see here that...
D: What I do is, in the kustomization, on the base, I can say: I want this version of Keptn, run this Helm chart from here, and then all my values that I want to override I can put here. So you can see here that I'm running my ingress, which will essentially let me connect to the bridge and the API, and then the cool thing as well is that you can have other Helm charts too, so you're sort of packaging it into the same app.
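For reference, a kustomization that inflates the Keptn Helm chart and layers extra resources on top might be sketched like this (recent Kustomize versions, run with --enable-helm; the version and file names are illustrative):

```yaml
# Illustrative sketch: pin a Keptn chart version, override values,
# and package the ingress into the same app.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: keptn
helmCharts:
  - name: keptn
    repo: https://charts.keptn.sh      # assumed chart repo
    version: 0.8.4                     # hypothetical pinned version
    releaseName: keptn
    valuesFile: keptn-values.yaml      # the overrides (e.g. ingress settings)
resources:
  - ingress.yaml                       # extra add-ons alongside the chart
```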
D: If you have any questions on that, please let me know — it's quite a lot to take on to start with. So that's sort of my journey and how I'm installing Keptn: it's button-click, from spinning up everything in the cloud — the Kubernetes cluster, Keptn — and at the moment I'm sort of, I guess, watching Thomas as he builds the GitOps operator as well. So once the CRD is ready for that...
D: ...I will simply put that in here as a dependency as well, and then that will install the GitOps operator, and then I can start adding my shipyard file. Adam can explain the shipyard a little bit better, but what the shipyard is, is a way of defining our workflow in Keptn. It has the concepts of stages, sequences and tasks. So yeah, in this particular instance, what we're looking at now is what I want to try for my use case at the moment.
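The shipyard Brad describes can be sketched roughly like this; the structure follows the Keptn 0.2.0 shipyard spec, but the concrete names are illustrative:

```yaml
# Illustrative sketch of a shipyard: stages contain sequences, and
# sequences contain tasks that Keptn turns into cloud events.
apiVersion: spec.keptn.sh/0.2.0
kind: Shipyard
metadata:
  name: shipyard-demo          # hypothetical name
spec:
  stages:
    - name: test
      sequences:
        - name: delivery
          tasks:
            - name: deployment
            - name: test
            - name: evaluation
            - name: release
```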
D: What I've also got working is something called Argo Notifications. What Argo Notifications does is that, when I do a PR to the main branch, that's going to start triggering my workflow. Argo will poll Git, and it will say: okay, this has changed. Let's say a simple thing: I have a new Docker image with my code. For this Docker image I'll then say: okay, I want to go test this in a test environment.
D: ...redeploying this code, the new version, and then it's going to send back a notification, and that notification is going to be a cloud event to Keptn to trigger a sequence. So you can see here, for example, the event "test delivery finished". What I can do is actually trigger that, and then what I want to use is a tool that I'm looking at at the moment called Testkube; essentially, with that you can run your Postman API collections, etc.
D: Essentially, Ortelius is going to catalog that metadata, and then it's going to say: yes, we've done the testing — it's automated testing — and the quality gate passed. So that's going to say: okay, that's good for the next environment, and then it will proceed. It will do the commit, then Argo will pick it up, and then it will go on to the next sequence as well.
D: So that's essentially what we're working on at the moment. The reason I'm using the Argo notification service is... I could use Argo Rollouts, but we're also thinking that a service mesh would give that capability as well, so we're installing Linkerd at the moment to get those different deployment strategies.
D: So I guess at the moment we're testing a lot of different things, and this is the fun part of researching: you can plug tools in — plug and play — where you want. So at the moment I'm going to try Linkerd as a service mesh for my deployment strategy, and then I can try Argo Rollouts as well. There's actually a tutorial on Keptn with Argo Rollouts on the keptn.sh website, I believe. So, quite a lot going on there. Did you want to comment on any of that?
A: Strong faces on the core, we're probably new, yeah.
D: I want to try... and I know this is a lot to take in, but yeah, maybe we can go back to basics a little bit, yeah.
A: Yeah. So you can see there we've got the configuration service, which is one of the core Keptn microservices; there's the Helm service, the JMeter service, and so on. And, as Brad says, you can write your own services, so if there is a tool out there that you like, you can bring that tool. That's one of the core things about Keptn: it's not opinionated in what tooling it uses.
A: So, how this works is — if you could flick to the shipyard, Brad? — sure — you define your shipyard, and that's basically your blueprint of what your world looks like to Keptn. What actually happens behind the scenes is that cloud events are generated for this automatically for you. So, for example, you as a human, or another tool — in this case we're taking Argo, and Argo is doing its thing and notifying Keptn — what Argo will do is actually trigger the sequence there.

So you can see, on line nine there's the delivery sequence in the test stage, and that's the cloud event that we fire in to tell Keptn to start its workflow. Keptn then goes away and says: okay, I need to trigger the delivery sequence — what do I do? Well, I need to trigger the deployment task on line 11.
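The cloud event that triggers such a sequence can be sketched like this — shown as YAML for readability, though on the wire it is JSON; the project, service and image values are made up:

```yaml
# Illustrative sketch of a sequence-triggering CloudEvent:
# the type is sh.keptn.event.<stage>.<sequence>.triggered.
specversion: "1.0"
type: sh.keptn.event.test.delivery.triggered
source: argo-notifications          # hypothetical sender
contenttype: application/json
data:
  project: demo                     # hypothetical project and service
  service: ms-ui
  stage: test
  configurationChange:
    values:
      image: example.azurecr.io/ms-ui:1.2.3   # the new Docker image
```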
A: So, after you've triggered the sequence, Keptn is going to manage and orchestrate the rest for you. Now, Brad is going to have to have a tool listening for that "deployment triggered" cloud event that Keptn is going to generate and send out. That tool is up to Brad — he can bring whatever tool he wants. As soon as that tool finishes, it signals back to Keptn to say: I've done the deployment, it's finished, and here are some results. And then Keptn says okay.
A: I know the deployment is now finished, because this tool has told me; and the process repeats — Keptn generates that release event, that cloud event, some other tool listens for that event, and so on and so forth. So that's what I wanted to try and map together: the idea of the Keptn services and the shipyard, and how they fit together. And notice we're not actually mentioning tooling in here; we're not saying "do a deployment with tool X", and that's deliberate, because those services are hot-swappable.
A: So if you don't like the tool that you use for deployment, or a new tool comes along tomorrow, you can just switch out that Keptn service, and everything else about your environment stays the same. So, as we're experimenting with this, Brad and I can really just swap tools and see which one is the best fit for his workflow, and you can have different tools responding in different environments: we could have Testkube responding in the test environment and JMeter responding in the UAT environment.
D: That's a good point because, for example, in UAT you might have more like-for-like production data, so it would make more sense to run JMeter just in UAT; it doesn't necessarily need to run in test, because test is traditionally more basic than the UAT environment. And because we're very much in the research phase, it is good that we can swap tools out as we work out the best way to do this.
D: Yeah, so I guess I haven't done much with the remediation yet, but I can imagine that, for starters, I'll use Dynatrace to get this sort of workflow working, because it's just easier, and, to my understanding, once Dynatrace sends in the problem event...
A: ...what we haven't got to yet would be, obviously, between these stages. So that's the other thing I didn't mention: these sequences are standalone by default. You ask Keptn to run a sequence and it will run to the end. But what Brad has actually got here on line 18 is that you can chain sequences.
A: So basically, here he's saying: whenever the delivery sequence finishes in the test stage, start the delivery sequence in the UAT stage — so he's chained those two together. So it starts to look a little bit like a pipeline, but remember, we're not actually building tooling integrations in this part. So my first thought there would be: as soon as we finish the prod delivery...
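The chaining Adam describes is done with a triggeredOn entry on the downstream sequence; a fragment of the shipyard's stages list might look like this (names illustrative):

```yaml
# Illustrative fragment: the uat delivery sequence starts automatically
# whenever the test-stage delivery sequence finishes.
- name: uat
  sequences:
    - name: delivery
      triggeredOn:
        - event: test.delivery.finished
      tasks:
        - name: deployment
        - name: test
        - name: evaluation
```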
A: ...a static version of the website, in the worst case. You know, it's flexible in what we do, but my first thought would be to run a final quality gate after the deployment, after the prod delivery.
D: I recently watched Andy and — I think his name was Daniel — from Crossplane, in one of their talks on using Keptn and Crossplane. There's possibly, you know, the concept of a remediation task where, if Azure was down, then... For people who don't know what Crossplane is, it's essentially declarative infrastructure as code, including these manifest files, using GitOps as well. So I want to start experimenting with Crossplane too, doing remediation with that.
D: Let's say, I don't know, some service in Azure goes down; then we can trigger a remediation to spin everything up and go to AWS. And my next task as well, because this is the DevOps cluster, is that I want to use Cluster API: for my environments, I want to be able to use GitOps to spin up my other clusters — my environment clusters — as well. So there's lots.
B: There is a question from Dmitry: how do you integrate the detail, the logging of the validation tests, with the Keptn dashboard?
A: So this is basically the UI that you get, and remember, all of the output is driven by cloud events. You get the raw cloud events — obviously there is an API behind all of this that you can interrogate to retrieve the cloud events, but of course they are also available in the UI, and the UI makes it nice and easy. So, the way the quality gate works...
A: You obviously give it your service level indicators, which are effectively the metrics that you want to look at — so in this case we're looking at authentication, the load time of the home page and of the store booking page — and then you give it some thresholds, your service level objectives. So you can see here that I'm saying: okay, if my authentication endpoint responds in less than 10 seconds, and it's within 20% of a previous set of tests...
A: ...then it meets the pass criteria. If it does not meet the pass criteria, then it may meet a warning criteria — this one is less than 20 seconds — and then, obviously, anything above the warning criteria is a fail for that individual metric. So every single metric gets its own score, and those scores are summed together to give you this score here, which is the overall result of the quality gate. So in this case all of my metrics were green.
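The pass/warning scoring Adam walks through is declared in an slo.yaml; a sketch with made-up indicator names and thresholds:

```yaml
# Illustrative SLO file: each objective scores one SLI against pass and
# warning criteria; the scores are weighted and summed into a total score.
spec_version: "1.0"
comparison:
  compare_with: single_result
  number_of_comparison_results: 1
  aggregate_function: avg
objectives:
  - sli: response_time_auth        # hypothetical indicator name
    pass:
      - criteria:
          - "<10000"               # absolute: respond in under 10 s (ms)
          - "<=+20%"               # relative: within 20% of the previous run
    warning:
      - criteria:
          - "<20000"
    weight: 1
total_score:
  pass: "90%"
  warning: "75%"
```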
A: They all passed, so my quality gate as a whole passes. So, if you think about Brad's workflow — he's got Argo, and he's got the quality gate at the end — this is the kind of output he would get, and then he would get a decision back from Keptn to say: that artifact, that thing that we've just deployed — not only do the liveness and readiness probes pass and the pod spin up, but is the performance of it actually acceptable? Can users make a booking? Can they do what they're actually wanting to do on the website or the API endpoint?
A: Metrics can be anything: they can be business-relevant, they can be security vulnerabilities — Keptn doesn't really care. It's a metric that you are interested in. So you get a result, and we can then use that result to decide what we do: we could use it to automatically progress the artifact into the following stage, or we can just use it as a notification — just send a message into Slack to say "I've run a quality gate in UAT and it passed".
A: So I'm guessing what you're going to do, Brad, is possibly have it automated up to prod, and then you might have maybe a manual approval just before prod — and there is actually... I don't know if you have it installed in your Argo.
D: Things we care about: lead time, mean time to restore. It would be really cool to capture all of that stuff as well and then show your DevOps metrics in Grafana, and then you can see how you're improving and just how successful it all is. So that would be a cool one to look at down the line, as we mature the solution as well.
A: Yeah. Anything else? "How do you troubleshoot the failed indicators?"
A: I've stopped sharing now, but hopefully you can remember the screen. Basically, you're going to have a metric backend behind this — Dynatrace or Prometheus or Datadog or Splunk or whatever, wherever you're pulling your metrics from — and obviously you've got the metric and you know how you define those, because it's defined as code in an SLI YAML file. So you can easily jump into whatever tool you're using and say: aha, my authentication endpoint was too high, for example.
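For reference, the SLI file Adam mentions maps each indicator name to a query in the metric backend; a sketch with a hypothetical Prometheus query:

```yaml
# Illustrative SLI file: the left-hand names are what the SLO file
# references; the right-hand side is whatever query the configured SLI
# provider understands (Prometheus here; the Dynatrace provider would
# use a metrics selector instead).
spec_version: "1.0"
indicators:
  response_time_auth: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{handler="/auth"}[3m])) by (le)) * 1000
```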
C: Right, maybe just to clarify my question here: the integration with Grafana is pretty much missing here. A click on the red indicator would ideally bring you to the Grafana dashboard, or to a Grafana plugin within Keptn, to see the metric over time and to be able to troubleshoot more easily. Because right now, from what you are saying, you have a separate setup for troubleshooting, using Grafana, where you can deep-dive into specific behavior, and here you have just a global indication.
A: You get a history of the response time metrics within Keptn, but yes, I get that: if you have a more fully featured tool at the back of this, like Dynatrace, where you could drill into the PurePaths, that would be a nice enhancement.
E: Hey folks, can I quickly add something to this? This is Andy. We had a similar discussion yesterday in our user group meeting with Raiffeisen Software, where they showed the automated validation of their online banking software. We had the same discussion there, and what we proposed — and I already gave that feedback to the Keptn team, especially Johannes — to make the drill-down from the heat map to a dashboard possible, is that the SLI provider should not only return the actual value of each metric...
E: ...that is queried, but also a link. Let's say this is a metric coming from Prometheus: then, you know, it could return a link to a Grafana dashboard. If the metric comes from Dynatrace, and the metric comes from a PurePath, from a distributed trace, it could also return a link to the PurePath dashboard directly, for that distributed trace. So the idea was to discuss an extension to the SLI provider so that it not only returns a value but also a link, to then do the troubleshooting, or the, you know, deep dive.
D: What do you think about this so far — you know, the research? I know there are so many different ways you can go, but what are your thoughts on it?
E: Yeah, I mean, I think everything we can do to let the community understand, and to make it easier for them to really get Keptn running and really embedded into their ecosystem — I think this is important.
E: The other thing we need to do — and I think this is all of our duty — is to make sure that people understand the role of Keptn in combination with all the other tools, because we all know that there are so many tools that are trying to do similar things. So we need to really come out with some best practices on, you know, what's the role — in your case — of Argo CD, what's the role of Keptn, what's the role of Tekton?
E: What's the role of Jenkins? I think this is still something where people are maybe also a little bit confused, and I think this is why it's great that people like you are actually showing best practices on how to combine these tools and really use the individual tools for their strengths. For me, the strength of Keptn — and that's what I hear a lot — is the automated analysis based on the SLOs, the data-driven aspect of orchestration, the loose coupling of all of your tools to really automate and orchestrate your sequences.
E: I think these are the real strengths, yeah. What you're doing here is great; I think that's the right way forward. Also, I'm not sure if you followed all of the comments in the chat when you were showing your Kustomize templates: Christian was also highlighting that you shouldn't forget about the new Helm chart repo — that's important. I think you're still pulling it from Google, from the Google artifact store, so check that out as well. Okay.
C: May I ask a theoretical question, more for brainstorming? This option, this alternative — and another question, in case you're not using that: it could be that an SLI is broken, but not because of the deployment. There could be so many external causes of the failure — environment-based, configuration-based, whatever — but not the delivery itself. How do you actually distill what is down to the deployment, and what is reflected because of other dependencies?
D: That's a very good question, and that's actually the phase that we're in at the moment. So we are testing with Argo Rollouts and also comparing it to see... What I'm confused about at the moment — and researching to understand — is whether Argo Rollouts is the same as, like, the service mesh, you know, for giving that deployment strategy capability. So I don't really have the answer to that.
D: But I'll have a play around as well, and I think that's a really good discussion to have, because I know that a lot of folks have done Keptn with Argo Rollouts as well. So I would actually be interested to see a demo of that myself, but I think that's a really good topic.
E: Yeah, thank you for that, and I also just posted two links in the chat: one with my initial tutorial on Argo Rollouts, and another one from a Keptn user group with one of our users that is using Argo Rollouts. And on that theoretical discussion: I think it is not theoretical, it's a practical thing, because in the end I exactly think that Keptn is perfect for that — you can use Argo Rollouts to deploy a new canary, or whatever you want, yeah.
E: Let's call it a canary: you deploy it, you then run tests against the new canary, and then, based on that, you really decide whether you want to roll it out to a larger user base. Or blue-green is another one — again, perfect: you're using a blue-green deployment, you then run tests against the new deployment that is not yet exposed to the outside world, and then use the SLOs — the automatic SLO validation — to figure out...
E: ...do we meet all of our criteria? And if so, then we make the switch, and our end users can access that app. So I think this is perfect, and I believe — Dmitry, if I understood your question correctly — this should all be part of the rollout process, right? This is part of the release process: really validating until end users can safely use it. That means that, for me, a perfect Keptn sequence not only switches between blue and green — so it deploys, tests, then switches if everything is good — but then keeps validating.
E: So something that I have seen is that people, after the actual deployment, after the switch to the live version — let's say after 10 minutes, 30 minutes, an hour — do additional SLO validations to make sure that the live system is also running. Obviously this approach depends on the maturity of your observability platform. If you're basing it on, let's say, pure metrics, then I think using Keptn's SLO validation is great: after 10 minutes, after 20, after 30, after an hour, to constantly validate — are we still within our thresholds?
C: Correct, but when you are using exactly the flow you just described, you are at risk of false positives and false negatives.
C: So you executed the validation on a blue version, but it has not yet been exposed to the customer, and it's red — but it's red not because of the delivery, but because of environmental settings. So how do you distill that — real failures versus unreal failures? So, for instance, currently we are experimenting with, in that case, executing the same SLIs on the current version, on the green version, and disabling — skipping — the failed KPIs, the SLIs.
E: I'm still processing the scenario you just painted. So, if something is failing because of an environment issue, does this then mean you still have a problem, right? I mean, you would not then... but...
E: Exactly, I mean, you know — and I think in practice the previous deployment, the previous evaluation you've done, is actually what is currently the green version. So I think this should work.
A: One other thing I wanted to quickly mention: of course you can define SLIs, but you don't necessarily have to give them criteria. They can then still be reported in this report, effectively, but you're just getting the values, so they don't affect the output or the pass or fail — if you just want them alongside in the report, that could be one other option.
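In the SLO file, that looks like an objective with no criteria attached; a sketch (indicator names illustrative):

```yaml
# Illustrative fragment: the first objective is scored, the second is
# informational only -- its value appears in the evaluation report but
# never affects pass/fail.
objectives:
  - sli: response_time_auth
    pass:
      - criteria:
          - "<600"
  - sli: throughput     # no criteria: reported, not scored
```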
B: Okay, so, any other questions or comments? I think that we have basically run out of questions in the chat, etc. So I will grant...
B
Presentation
permissions
for
everyone.
So
if
you
want
to
comment
and
just
share
your
feedback
about
the
session
about
your
use
cases
for
captain,
please
go
ahead,
and
so
you
need
to
just
accept
the
invitation.
If
you
want
to
speak
or
just
comment
in
the
chat,
I
will
wait.
Yeah
thanks
a
lot
to
brett
and
adam
for
joining
and
sharing
the
experiences
and
case
because
using
kaplan
with
argo
cd
for
progressive
delivery
is
one
of
the
hot
arrays
we
actually
have
tutorials
for
that
on
tutorials.captain.sage.
B: Some of the tutorials need to be updated to the recent Keptn version, because there have been some changes in services and structures, and it's somewhere on our list, but yeah, it's a really important topic and we need to keep maintaining that. And maybe just a generic question to you.
D: I'm just really interested in the GitOps operator that will eventually be coming.
B: So you already evaluated it? And how does it work for you?
D: I've been talking to Thomas about it — he's been really good — so I've just been trying to get past my next research phases, just with the rollouts.
B: Thank you, yeah. All of us are looking forward to seeing how this story evolves, because, for me, it always seemed quite strange that Keptn is basically a GitOps management tool, but you're not able to efficiently manage Keptn itself like GitOps. Well, you could do it before, but it would require some tricks; now, with this operator, I think it becomes available right away. Plus, since you will have full access to the CRDs, to the specifications, you can basically integrate Keptn management into any management flow you have. So it sounds really exciting.
B: We are working on that — well, for real. We actually started working on the contributing guidelines: extending them, removing some obstacles, because — maybe it's my personal opinion — I hated the DCO, and I'm really looking forward to getting rid of it somehow, and of other obstacles. We also need to improve things there; I heard that, for example, Adam recently created a PoC for Keptn on Docker.
B: So I think that would also be one of the opportunities for more contributors, and newcomer-friendly issues, that's for sure. We started creating some around documentation, but we also need to work on Keptn itself, and currently there are infinite opportunities to contribute on the services side, because we have the webhook service and the job executor service, which are basically generic automation frameworks. You can just take one of these services, quickly create your integration, and write some documentation based on them.
A: Absolutely. Brad and I are both in Australia, and obviously everyone on here is in the same region, so feel free to reach out on the Keptn Slack — we'll help. And definitely, if there are tools that you like that aren't supported by Keptn at the moment, feel free to reach out to me and I'll guide you through the best way; there are a couple of options to contribute, so yeah, happy to help.
B: Yeah, I guess Adam holds the record for the number of integrations, because recently he created the e50 integration and... what was the second integration?
B: So thanks, everyone — yeah, great presentation. We had our first broadcast, so basically the recording will be available right away — first time, let's see how it works — and yeah, thanks Brad, thanks Adam, and thanks a lot for all your contributions to the community and to the project.