Welcome, CNCF community, and thanks for giving me the opportunity to join in for "Automating SRE from Hello World to Enterprise Scale with Keptn." This is an overview and introductory session on our CNCF sandbox project, Keptn. Here are all the links you need in order to find out more about Keptn: visit keptn.sh, follow the Keptn project, star us on GitHub, or join the Slack channel.
I am Andi Grabner, a DevRel for Keptn, and if you want to know more about me, feel free to reach out. We will also have a live webinar on the CNCF webinar schedule coming up next week, where I will be joined by Jürgen Etzlstorfer. There we can both show you more of Keptn live, you can ask us questions, and we will navigate you through the product.
But as a first step, if you want to learn more, definitely check out our website. From there you can reach all the tutorials and get access to additional resources, like previous recordings on different use cases, as well as testimonials showing how other users are using Keptn and how they benefit from it. We have also just recently released Keptn 0.8. It's March 2021 as I record this; depending on when you watch, there might even be newer versions, but just to let you know, this is the latest and greatest as of the time of the recording.
One of the problems we saw is that a lot of DevOps teams are challenged with very monolithic automation in their pipelines, and it becomes hard to deploy. What does this mean? An example comes from Christian Heckelmann. He is a DevOps engineer, and he is constantly challenged with pipelines that are broken.
Here is one of his screenshots, and this might be something you can relate to: some of these pipelines start small, but all of a sudden, well, it escalates, and fast. We end up with very complex scripts that do a lot of amazing things but are really hard to maintain and keep up with, especially once you have different permutations. So this is the first problem we see out there and that we want to address with Keptn.
The second problem concerns DevOps teams, or people who are in charge of tool integrations and pipelining. These pipelines tend to contain tool integrations, and they are often custom-made, custom-built, and then copy-pasted around because of a lack of standards. This is an example from Dieter, a senior engineer here at Dynatrace, and he says: onboarding or updating pipelines is manual and often error-prone. Now, his environment is much smaller than what we saw from Christian.
What's interesting: we've done some analysis, or rather Dieter has done some analysis, to see how much duplicated code we have across all the different pipelines in our different projects, and it is very eye-opening to see that there is a lot of red here, which means a lot of duplicated code. That means if there are bugs in there, or something needs to be changed, you need to change it in many different places, and often you don't even know anymore where the changes have to go. This is another problem.
We want to make this easier, because we are spending too much time on it. Another problem that we solve, or want to solve: we see a lot of SRE teams trying to get SRE practices around SLIs and SLOs, around performance testing, around chaos engineering at scale into their organization, but it is really hard to automate that at scale. Roman Ferstl, managing director at Triscon, has been working with organizations where they are limited in the number of tests they can run per year, or the number of apps they can test and validate against their SLOs.
The reason they are struggling with this is that a lot of the work is done manually. A lot of tests have to be rerun, because they only run, let's say, 15 times a year, so a lot of things change in between. It also means they have only about 10 percent of the projects in an organization onboarded; they haven't scaled it across the organization.
The reason for all this is that a lot of manual time is spent on script creation, on configuring your monitoring, on analyzing your test results and your SLOs, which you want to do if you want to get broader with your SRE practice: not only in production, but also across the whole lifecycle. So these are three problems and three challenges. Now I want to show you three examples of how Keptn users have been helped by Keptn to solve their problems. Sumit is at Intuit.
Keptn has a capability called SLO-based quality gates. They run their tests with their existing tooling and then hand over to Keptn to fully automatically and continuously evaluate their SLOs, something they had done manually before. Keptn allows them to scale. Coming back to Roman, who I brought up earlier: remember, he had about 15 to 20 tests per year and only five apps. Well, now they run 15 times the number of tests against 10 times the number of apps.
That is thanks to the automation Keptn brings in: because Keptn runs tests more consecutively, more continuously, more automated, and also automates the analysis, it really enables them to do automated performance and resiliency testing. And the third one: remember Christian, who was challenged with the ever-growing number of pipelines.
Well, they have now moved over to Kubernetes, which means new microservices and new pipelines that have to be onboarded, and they didn't want to repeat the mistakes of the previous architecture. So now they are using Keptn to orchestrate the whole end-to-end delivery pipeline: calling GitLab, having Keptn trigger their automated tests with Gatling and JMeter, using Helm for deployment, and then also doing the automated quality gate evaluation.
So these are some of the stories, and you can actually find videos of these three gentlemen, and more, if you go to the Keptn website and look at the Keptn resources; there are some other nice testimonials you can find there as well. What I really like is Taras from Facebook, who says: Keptn feels like a reference implementation of Google's Site Reliability Engineering and the Site Reliability Engineering Workbook.
This was really nice for us to hear, because it seems a lot of people understand that we really try to help, especially the SRE community, to bring automated SRE into your cloud-native continuous delivery. All right, so now: what is Keptn? Keptn is something different for different personas, whether you are an ops person, an SRE, a dev, or a performance engineer.
Whoever you are, Keptn allows you to pick a use case where you are currently struggling with automation, whether with automating it in general, or with automating it the way you want and integrating it into existing automation tools. So Keptn lets you pick the use case you want to automate: quality gates, delivery, SRE automation, or auto-remediation for production.
Depending on the use case, you then bring your configuration. For the quality gate evaluation, you bring your SLI and SLO definitions; for your performance test automation, you bring your workload definition; for your auto-remediation in production, you bring your runbooks. And best of all, Keptn doesn't execute these things itself. Keptn is an orchestrator: Keptn connects to your tools, so you can bring the tools that work well in your particular environment. Everyone has a different environment.
Everyone has favorite tools they have investments in, so you can bring these tools and connect them to Keptn. Keptn then takes your configuration and your use case and really automates the configuration of your tools, connects them, and provides the use cases completely as a self-service, and it does it all through a declarative approach.
All the configuration files are persisted, stored, and versioned in Git. Everything is centered around service level objectives (SLOs): every action Keptn takes is validated to make sure it doesn't break anything and that you are still within your SLOs. And the whole communication from Keptn to your different tools is based on the CloudEvents standard, so everything is standards-based; there is no proprietary integration.
The architecture was really driven by the new requirements we have seen. Remember: we have seen pipelines and automation scripts that grew too fast because they mixed information about process, tooling, target platform, and environments. There was also no clear separation of concerns about what developers should do, what DevOps engineers should do, and what a site reliability engineer should do. We packed everything together, and these were, I think, the fundamental problems of most of the approaches we have today.
So what we said is: in the end, we have processes, but we want to automate processes without the hard dependencies on the tooling. If you have a process on the left and you have these hard-coded dependencies, why not just break these things apart? Why not break, or remove, these hard dependencies and say: hey, we have a process that we want to automate. It may be build, prepare, deploy, test, notify, rollback.
So if you have the process on the left and the capabilities on the right, and we have a process orchestrator, then we need some way for them to communicate, and this is where eventing comes in. Keptn uses an event-based model: just as when we break monolithic applications into smaller services and then use eventing to connect them, we do the same thing here.
We allow you to define the process and, as we execute it, Keptn will send the right event at the right moment, for instance to say: hey, I need somebody that has the capability to deploy container number one in dev with a blue-green deployment strategy. Then you may have one or two capabilities: maybe you have Helm that could do it, maybe you have a Jenkins pipeline that could do it, or you have Spinnaker. Then these tools can respond.
A tool can say: yes, I can do it, because I'm registered for this and I have all the config files that I need for that environment, so let me do it. And when it's done, it sends back that the job was successfully done, or maybe that it failed, who knows, and then Keptn can continue with the workflow.
So really, what we did is we asked which events we need, and what the capabilities on the right side are, and then we connect them through eventing. From 10,000 feet, the way this looks is: you install Keptn on Kubernetes. You install the so-called control plane on a cluster, and it manages all of the workflow and all the logic I just explained; we are using NATS as the eventing engine.
Now, in order to use Keptn, somebody needs to say which processes, which workflows, which sequences Keptn should actually orchestrate and automate. You specify what type of process it is: is it a delivery process, a remediation process, a testing process? You declare this in our config files; we call them shipyard and remediation files. Shipyard covers everything related to continuous delivery until it ends up in production, and remediation covers all the remediating tasks in production.
The nice thing is, because we have a clear separation of concerns between the process definition on the one hand and the tooling and capabilities on the other, you can even have a different team define and install the execution plane, either on the same cluster or on different clusters. We just introduced Keptn 0.8, which now finally has the capability to install the execution plane in all of your different target systems. That team can then decide which tools to use in a given target environment and install these capabilities there.
These capabilities listen to CloudEvents, so it's all based on standards, and once they receive an event, they execute the action and respond. This means that in the end, the real beneficiary is the user, the dev, the ops, the SRE, who can say: I have a new artifact and I want Keptn to run an automated process for me, let's say test automation or even delivery. Keptn then starts sending the events defined by your process definition, which triggers the right tooling in your execution plane; these tools do the action and then report back whether the outcome is good or not.
The nice thing is, you can now easily change the process without having to think about which tool integrations you need to worry about, or might break, but you can also change the tooling without thinking about the process. You can say: I'm swapping from, let's say, a Jenkins pipeline that used to do my deployments to using Helm natively; or you may switch from JMeter as a testing tool to something like Neotys; or you switch from one monitoring tool to another.
Whichever monitoring tool gives you the observability data, the nice thing is you don't have these integrations hardcoded anymore: it's all process definition plus tool capabilities, and they are connected through events. So now I want to go into my first demo and show you a little bit of Keptn. All right, let me show you something I have here. And by the way, as I said, in about a week or so we will do a live webinar.
In that live webinar we will do more live demos with Keptn, so I just wanted to let you know. I've installed Keptn on an EKS cluster; this is a standard installation where I have the control and execution plane installed, and you see a couple of pods here. I also have my Keptn CLI authenticated against my Keptn environment, and I can now do things like the following. Let me just search my shell history for the artifact command.
I want to kick off a new deployment, and I'm too lazy to remember all of this, to be honest with you. What I want now is to say: Keptn, please, I have a new artifact for you, for a particular Keptn project and service, and here is my new image; now off you go. That is the "keptn send event new-artifact" command, with the project, service, and image as parameters. While this runs, I want to show you a little bit of what actually happens behind the scenes.
So here is my Keptn installation, and here is my Keptn 0.7 project. Keptn internally holds a config repo for everything it does, so for every project you get a config repo, and you can also specify an upstream Git remote; this here is my GitHub repository. What you can see here in the main branch is my shipyard file. This is essentially my process definition. This is where I tell Keptn: I want you to provide me three stages, dev, staging, and prod. You can give it different types of metadata to change Keptn's opinionated workflow.
That metadata controls what type of deployment should happen, what type of testing should happen, what type of approval should happen, what type of remediation should happen. What you see here is a shipyard file for Keptn version 0.7; 0.8 was just released as I'm recording this, so I will show you later how this changed in 0.8, because in 0.8 you are more flexible about what should happen in a stage. But I start with 0.7 here, because in the end it gets the point across of what Keptn is doing.
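As an illustration (not the exact file from the demo), a Keptn 0.7-style shipyard declaring these three stages looks roughly like this; the stage names and strategies match what the talk describes:

```yaml
stages:
  - name: "dev"
    deployment_strategy: "direct"      # deploy straight into dev
    test_strategy: "functional"        # run functional tests
  - name: "staging"
    deployment_strategy: "blue_green_service"
    test_strategy: "performance"       # e.g. JMeter load tests
  - name: "production"
    deployment_strategy: "blue_green_service"
    approval_strategy:
      pass: "manual"                   # manual promotion into prod
```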
So this is what I specified; that is my whole pipeline code, so to speak. Now, what else do I have? For every individual stage, Keptn created a branch for me. For instance, if I go into the dev branch, this is where I have all of my supporting configuration files for the individual tools and capabilities, so that they can do their job.
For instance, I have my JMeter scripts in here, because I'm using JMeter; so when JMeter is triggered later on, it can access this config repo for dev and figure out: okay, what are my files, what is my configuration? I can either specify this on a global scale for the stage, or I can do it for an individual service, because a project in Keptn typically contains multiple microservices that you want to deploy. Then you can have more specific files for a particular service.
Something was changed here, because remember, I triggered my deployment. Let me just go back quickly: I told Keptn to send a new-artifact event for this project and this service, simplenode, saying I have a new version. And one of the things Keptn does, the way I specified it, is the following.
Keptn will first send an event saying: hey, Andi wants to change the version, so take this version information and update it in the files where it's necessary. So the first thing that actually happens is a version change, and it made the update here, which is nice, because I don't have to deal with it and you don't have to do it. You can also do it through your regular GitOps approach, where you change your configurations in that Git repo yourself and then trigger the rest of the Keptn workflow.
If I go back to staging, this is probably still build number three, right? Build number four is in dev and build number three is in staging, and because I sent it on its way, Keptn will now go through the whole process until, hopefully, it ends up in prod. And you will see here that I actually didn't clean up my environment from some previous demos.
I had a couple of builds and runs earlier that made it all the way into prod, or rather almost all the way into prod, because I specified in my shipyard file that I want direct promotion from dev to staging if a build is good, but from staging to production I always want a manual approval. This is why these builds are waiting here, and now I get the overview of my SLIs and my SLOs and can make a go or no-go decision.
These are some old test runs that I never approved; that's why they are still lingering around. But what's interesting: this gives me an overview of what is currently deployed in which stage. And if I click on services, then I see in the list all of my previous attempts from my previous demos when I ran deployments. They can be triggered through the CLI, through a webhook, or as part of a GitHub Action, whatever you want.
If I go to build number four now and click on it, on the right side you actually see all these events I talked about earlier. Remember, in my animation I said Keptn sends events and with this triggers the capabilities, and once they pick up the job they say yes, I'm doing it, and then they send a message back once they are done. This is really neat, because I see exactly what is happening: deployment, tests finished, quality gates enforced, and then it goes on into the next stage.
Now let me switch to build number three, which I ran a little earlier, because here I see a little more: in this case the build was promoted all the way from dev into staging, and in staging we also ran some JMeter tests and did some more extensive quality gate evaluations, until it ended up waiting in prod for approval. So I can actually now, finally, kick this off and push build number three into prod.
Build number four is already on its way, but I'm good with this. So this was a quick overview. What you should take away is that in Keptn everything is declarative, meaning you declare what kind of process you want to automate; we have the shipyard file, for instance.
You then also add all of your configuration files for your specific tools and capabilities into that Git repository for a particular stage, either for the overall stage, meaning all your test files for that stage, or specific ones for a particular service, and then Keptn orchestrates everything for you. Keptn also has a very rich API. We have a Swagger UI where you can explore the API, and this is where you can, for instance, trigger an event from the outside, or create your projects and services.
You can fully automate Keptn; there are a lot of different options here to access the Git repository, upload files, download files, up to you. And you may ask: okay, but how do I build a new service? How do I extend it? Well, services are basically listeners to events, and a good starting point is the keptn-sandbox organization.
That is a great way to first see what types of services exist. We have a sandbox, we have a contrib, and we have the core Keptn project. In the sandbox you find, for instance, the Locust service and the Litmus service; you see there is a lot of stuff already built, like the Monaco service and the GitOps operator. We have a lot of things here already.
So what you've seen is a quick overview of how Keptn works. Now, the nice thing is that Keptn can easily be integrated into existing pipelines and existing tooling. That was also our goal: we don't want to replace everything, we want to extend it. We want to automate things that are currently hard to automate. One of the examples is Patrick: they are using GitLab for CI, for building their containers and pushing them to the registry, and then they kick off Keptn, and Keptn is then doing the delivery for them.
SREs especially are asked to run more performance tests and more chaos tests. You need to bring in observability data from tools like OpenTelemetry or your APM solutions, and there is so much data that it is really hard to then analyze it from build to build, from deployment to deployment. It is all possible, but it's not easy.
So we said: we want to tackle this problem and make it core to Keptn, and for this we looked at Google's SRE practice. SRE stands for site reliability engineering; I guess I'm not telling you anything new, but for those to whom it is new, it's actually very simple. You have SLIs, your service level indicators: a metric that you can measure and that is important to you, like the error rate of login requests. Then you specify what your objective is for this metric.
For instance, you want to make sure that the login error rate stays below two percent over a 30-day period, especially in production. These are the things you define. And then SLAs, probably the most well-known, are things like what happens if you miss your SLOs: you may have a legal contract, you may have some obligation, or you may lose users, whatever that is. In the end, Google did a great job advocating for this principle as part of the site reliability engineering practices, with great videos and great books.
I like the tagline: SLIs drive SLOs, which inform SLAs. Now, we thought it's great that more and more organizations are looking into using SLOs as part of their production deployments and production monitoring; you can use SLOs for individual services and applications, for different types of metrics.
You use them, and the error budget status, to make decisions on whether or not to deploy. But we thought: why not take the same concept and use it for everything we do, from when you create your first container image until you deploy it in dev and run your tests? Why not use the same concept of looking at metrics and then validating whether they are within what I'm expecting?
Whatever my expectations are. This is why we bring Keptn quality gates as a core component, based on the concept of SLIs and SLOs: metrics compared against objectives. Keptn just analyzes the metrics that are important for you, with every commit and with every build, and then makes a decision: good or no good.
Now, these might be different metrics and different thresholds than the ones you have in production, I understand that. This is also where you typically use regression detection between builds, because you want to know: did the new build maybe increase CPU consumption by 20 percent, or are you making 50 new database calls to the back end? This is something you want to flag. These might not be SLOs that are interesting for you in production; well, they might be.
But what I'm saying here is that we allow you to also specify different SLIs and SLOs as part of a quality gate. So, at a very high level, how does this work? You specify your SLIs in Keptn: what metrics you want, from whatever tool and data source; it could be Prometheus, it could be Dynatrace, it could be Wavefront, it could be any of the other monitoring tools. Then you specify your SLOs, where you can say: I expect this metric to be within a certain range, or I don't want this metric to go above a certain baseline, looking back at previous builds.
So if build number one comes along and everything is green, then great, and Keptn will tell you you're good to go: one hundred percent. If build two comes along and it seems you are slower on response time and failure rate, then you get penalized, getting a 75, and then you can decide whether it's still good to go, yes or no.
If everything is green, we're good to go. So that's how it looks in Excel; this is now how it looks in Keptn. The way Keptn treats SLIs and SLOs: you specify your SLIs as indicators in SLI YAML files (I don't want to start the YAML versus JSON debate now). You basically say: these are the metrics, and then you put the query next to each, in the query language of the particular tool you are using. And then you specify your SLOs in a separate file.
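As an illustration of the shape of these two files (the metric names and queries here are examples, not the ones from the demo), the SLI file maps indicator names to tool-specific queries, and the SLO file defines the objectives and scoring:

```yaml
# sli.yaml: indicator names mapped to the data source's query language
spec_version: "1.0"
indicators:
  response_time_p95: "builtin:service.response.time:percentile(95)"  # Dynatrace-style query
  error_rate: "builtin:service.errors.total.rate"
---
# slo.yaml (a separate file): objectives evaluated per build
spec_version: "1.0"
comparison:
  compare_with: "single_result"       # compare against the previous evaluation
objectives:
  - sli: response_time_p95
    pass:
      - criteria: ["<=+10%", "<600"]  # no more than 10% slower, and under 600 ms
    warning:
      - criteria: ["<=800"]
  - sli: error_rate
    pass:
      - criteria: ["<=2"]             # error rate at most 2 percent
total_score:
  pass: "90%"
  warning: "75%"
```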
If you have those, you can then ask Keptn: please evaluate. This is also a valid use case on its own; you can just say: Keptn, the only thing I want you to do is evaluate my performance metrics, my SLIs and SLOs. Keptn will then send an event saying: hey, which tools can give me these SLIs? Here are all the definitions. Then whatever tool you've connected can report the values. Keptn takes these values, scores every single value based on the SLOs, and then comes up with a total score.
That total score is then translated into pass, warning, or fail.
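To make the scoring idea concrete, here is a minimal sketch in Python of this kind of weighted pass/warning/fail evaluation. It is an illustration of the concept, not Keptn's actual implementation; the thresholds mirror the defaults shown above (90 percent to pass, 75 percent for a warning):

```python
def evaluate(results, pass_threshold=0.9, warning_threshold=0.75):
    """Score SLI results the way an SLO-based quality gate does.

    results: list of (met_pass, met_warning, weight) tuples, one per SLI.
    Returns (score, verdict) where verdict is "pass", "warning", or "fail".
    """
    total_weight = sum(weight for _, _, weight in results)
    earned = 0.0
    for met_pass, met_warning, weight in results:
        if met_pass:
            earned += weight        # full points: all pass criteria met
        elif met_warning:
            earned += weight * 0.5  # half points: only warning criteria met
    score = earned / total_weight
    if score >= pass_threshold:
        return score, "pass"
    if score >= warning_threshold:
        return score, "warning"
    return score, "fail"

# Two SLIs fully green: 100 percent, pass
print(evaluate([(True, True, 1), (True, True, 1)]))   # (1.0, 'pass')
# One SLI only meets the warning criteria: 75 percent, warning
print(evaluate([(True, True, 1), (False, True, 1)]))  # (0.75, 'warning')
```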
Now let me go quickly back into Keptn to show you how we did this with Dynatrace. This also works for Prometheus and others; I'm just using Dynatrace because that's where my day job is, so I'm familiar with that tool. With Dynatrace, we allow you to simply build a dashboard, and Keptn will automate all of this. Let me go back to my Keptn instance.
Now remember, I told you that normally you would go into your Git repo. If I go to staging, since in my case I'm using Dynatrace as the monitoring tool, here is where I specify my SLIs: I could go in and specify all of my queries in the Dynatrace query language. So I say: hey, there is an SLI called process memory, and this is how you query it.
You can do all that, and you can then also specify your SLOs in the YAML, just as I showed you earlier; here are all the SLOs, and I'm sure somewhere in there is the process memory one, with pass and warning criteria, weight, and so on and so forth. So you can do this, and while this is great, we did something to make it a little easier, because not everybody is there yet to do everything-as-code from scratch.
For our integration, we said: if you're using Dynatrace, you can also just build a dashboard. I have an observability platform; I build a dashboard, which is what you would normally do anyway. You normally build a dashboard, put all your metrics on it, and then you typically have an idea of how the metric should look.
I basically put in all the metrics that are important for me, and then additionally, if I zoom in here a little bit: instead of me looking at them and deciding what value I expect and what is good, I can specify my rules. Pass: it should be faster than 600 milliseconds, and it should not slow down by more than 10 percent. I can do this on service-level metrics and on transaction metrics.
The dashboard is then the source of truth: Keptn generates the SLI and SLO YAML out of it, so that internally it can process it the same way as with all the other monitoring tools. And then back in my Keptn's Bridge I have all the results, for every single metric, for every single run. I can look at them, and I can also click on the chart and then see things over time, which is really nice.
Now I also want to quickly highlight Keptn 0.8. I mentioned 0.8 was just released, and I'm really happy about this, because there are some really cool new capabilities in it. You have a nicer way of visualizing the stages, so you can really easily click and focus on the sequences. For the SLO validation, we now get a nice table overview, which was missing earlier: what's the SLI, what's the value, what are the pass and warning criteria, what's the result, what's the score, how much does it contribute to the 100 points?
This is all here, and obviously you still have the charts, as I just showed you. You can also ignore tests or runs in case you had a major issue that you are aware of, so that it doesn't pollute your baseline. So there are really cool things possible now, and especially a nicer visualization.
Quality gate evaluation is core to Keptn; we always evaluate our SLOs. But it also means you can use it standalone, and this is, I have to admit, the first use case that people start with. They say: I may already have my pipeline, I already deploy, like Christian here, with GitLab, and I already kick off some tests, but I have not yet automated my test validation, so I want to use Keptn for that. In this case you just trigger the Keptn evaluation from your existing, let's say, GitLab pipeline.
We can do the same thing for processes that we want to trigger as part of a problem in production. For instance, if you have a monitoring tool that alerts you that the conversion rate dropped and the root cause is CPU pressure, then you can specify, similar to the shipyard file I showed you earlier, a remediation file that says: these are the steps I would execute as remediation. And again, Keptn takes these steps, these actions, and treats them just like the delivery process.
Keptn sends an event asking: who has the capability to execute this particular action on that system? Because this is what I want to automate. So, for instance, the problem comes in and Keptn says: well, the first step is scaling up, so please scale up, whoever provides that action. And remember, as part of the auto-remediation we also validate the SLIs and SLOs, and also what we call BLOs, business level objectives, because in production you typically also bring in some end-user metrics.
Really cool, but I know it's also really scary: a lot of people don't trust this in production when they are running it for the first time. This is why we are also partnering, and we've seen a lot of movement here; this is why it's great that Jürgen will be with us next week in the live webinar, on integrating Keptn in pre-production with chaos engineering. So Keptn can trigger your performance tests to run some load against your system.
Another cool thing, and I showed you this briefly in the demo: we went to a different shipyard model. We now have shipyard version 0.2.0, which allows you to be more explicit about what should happen in each stage. In the previous versions we were very opinionated; that's the right word.
We were very opinionated on what happens in a stage. Now we give you more freedom: you can define your own tasks and sequences, and you can say which sequence should trigger when and what should happen after which sequence. It gives you more flexibility.
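As an illustration (names and task properties are examples in the style of the shipyard 0.2.0 spec, not a file from the talk), a stage now declares its own named sequences of tasks:

```yaml
apiVersion: "spec.keptn.sh/0.2.0"
kind: "Shipyard"
metadata:
  name: "shipyard-example"
spec:
  stages:
    - name: "dev"
      sequences:
        - name: "delivery"
          tasks:
            - name: "deployment"
              properties:
                deploymentstrategy: "direct"
            - name: "test"
              properties:
                teststrategy: "functional"
            - name: "evaluation"      # the SLO-based quality gate
    - name: "staging"
      sequences:
        - name: "delivery"
          triggeredOn:
            - event: "dev.delivery.finished"   # chain after the dev sequence
          tasks:
            - name: "deployment"
              properties:
                deploymentstrategy: "blue_green_service"
            - name: "evaluation"
```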
So have a look at Keptn 0.8. The best way to get started is to go to the tutorials; make sure you choose the right version. And if you have any more questions, feel free to reach out to us: make sure to follow us on Twitter and visit our website.
Join us on Slack, and yeah, make sure to also join us live at the CNCF live webinar, where we will talk about Keptn; there you can ask all your questions and we will just go through the product. Thank you so much. Happy SRE-ing, and happy scaling from a small project to a large enterprise scale. Thanks.