Description
Originally part of the Konveyor project, the Pelorus project is designed to simplify the data collection needed around DORA metrics. Learn more about DORA DevOps metrics and the Pelorus project from Red Hat's own Wesley Hayutin and Michal Pryc.
Learn More: pelorus.readthedocs.io/
A
You have permanent responsibilities for it. Now, though, it's done. No, no, love it. Good afternoon.

Good afternoon, and welcome to episode 62. Wow, am I really at 62 now? Yeah. So this week's episode is on DORA metrics, which are DevOps metrics, and Pelorus, which is a very neat tool that originally came out of the Konveyor project, for gathering and measuring those. But real quick, first, before we begin with that, a couple of minor announcements. The first one is that ArgoCon North America, which is co-located with KubeCon in Chicago this October. October, yeah, something like that.

A
The CFPs for that close on August 6th, so that's coming up in a couple of short weeks here, so get those CFPs in.

A
We always like to watch those videos and figure out what people are really doing with Argo. The other minor announcement is that I am leaving on vacation next week and I will be out for two weeks. Therefore, the next GitOps Guide to the Galaxy stream will be sometime in August, the second week of August I think, instead of the last week of July. So we're skipping the stream in two weeks; I will be on vacation and celebrating my birthday. Instead, you may send me birthday wishes over Twitter, as per usual.

A
You know, I'm also accepting messenger pigeons and, you know, LinkedIn messages, and any other ways of telling me that you're celebrating my birthday, because it's the best holiday.

A
Perfect, okay, so then we'll get started. I will introduce Wes now and Michal, who are two of the brains behind Pelorus, we'll put it that way, and Wes, I'll let you introduce yourself and pass the baton.
C
All right, I get a full brain. Thank you for the compliment. Wes Hayutin; I've been at Red Hat for around 16 years, engineering manager for Pelorus, the community project.

A
That was really fast, you guys, but good for you, good for you. You two obviously don't like to talk about yourselves; you're not going to be streaming hosts anytime soon. We have to be really good at talking about ourselves, right, Johnny? That's...
C
Traditionally, before we became so famous, a pelorus was known as a navigational instrument used on ships to help determine what the ship's bearing is; think of it as a simplified compass without a true north. We hope that Pelorus, the software engineering project, also isn't going to tell you exactly where to go, but will help align your organization to where you want to go and keep you on track while you compare your organization to the bearing that you've set.

C
All right, let me attempt to get to the next slide. Okay, so we've established what the name is, so how do we help software organizations find their bearing? We are collecting metrics from various traditional sources, your SCM, your issue tracker, and OpenShift directly, and bringing those metrics together in a dashboard that hopefully is fairly easy for organizations to read, so they can start having conversations about how they want to steer their organization toward the bearing that they want to set.

C
The dashboarding and the information you get from Pelorus will help your organization, or any organization, realize what kind of business value they are getting from the various changes, processes, and procedures that they are implementing in the organization, or that they already have established. You can compare those against the various teams in your organization, or look at it as an aggregate as well; you can see all your organizations together, to see what kind of business value you're getting out of your changes, and make corrections: see who's doing something well, see who's doing something poorly, compare and contrast, and get on the right bearing.

C
All right, so this is just a small sliver of what Pelorus looks like. It's fairly easy to install; we're in the community operators for OpenShift. Installing it and getting some sample data in, we've done it and we've timed it, takes about five minutes, so hopefully folks in the community also find it easy to get it up and running, get some sample data in there, and get a feel for how things work.
C
Okay, so basically, as we're collecting information on various teams and their software delivery performance, and we'll get to that in a moment, it really boils down to a classic science experiment. You take your control, you measure that; you introduce some change, some experiment, you measure that; compare and contrast; rinse and repeat, over and over again, trying to constantly improve, which everyone in the software industry is always trying to do: do things a little bit better every day.

C
Not everyone is measuring those changes, though, and that's what we're trying to avoid: does it feel like we're getting better? Sure, it feels like we're getting better; people are saying we're getting better. That kind of boils down to gut checks. We're trying to help organizations avoid the gut check and actually have real metrics to understand if their changes are having any real, meaningful impact.
A
So I'm actually going to interrupt you here, because there's some context that I think you and I share, Wes, that other people might not. On Tuesday I participated in a Dynatrace panel on the state of SRE, and they asked me the question: what is the most surprising thing that you've experienced in the last year in your SRE practices? And I had recently come from a position of trying to help advance SRE culture across the broader Red Hat organization.

A
What I found was that teams were doing a lot, but with very little data and very little way of understanding and contextualizing the data that they did have. And it's not like Red Hat is unique in this position, right? I've talked to people all across the industry, and it's basically been the case that people are doing a lot with gut checks and general impressions, and not necessarily having great, hard data to really know how their perception compares against reality.

A
So my favorite metric here is actually change rate failure, or change failure rate, excuse me; that's the dyslexia happening right there. The change failure rate is probably my favorite of the DORA metrics, because that one is data, but it's also kind of a gut check: "I think we're getting better," and okay, here's something you can actually aim that towards, like, oh yeah, we are getting better now. Or, if you're only "getting better" because you're deploying a lot less frequently than you were before...

A
...maybe you're not really getting better, and that's a whole other soul-searching thing you have to do as an organization. But that's why I really like this project, and it's why I asked you guys to come here and talk about this: because holistically, across the industry, I think a lot of us are still, from an SRE perspective, fighting fires.

A
We're still not really looking at preventative stuff as much as we would want yet, and it's hard to make good, as you say, bearing, directional decisions without the data. So I want to re-emphasize: people are doing a lot of really awesome things without data, and this solves that problem.
C
Yeah, I've had that same experience. Some of these gut checks will come out at the end of a release, during a retrospective, and some of the best organizations I've seen do bring some data to that kind of retrospective.

C
But what I have not seen is, across a large organization, being able to break it down application by application or team by team: being able to understand who's doing some things particularly well and who's not, and what we can learn from each other in that way. That doesn't come out of gut checks, because people aren't going to call out other teams that way publicly; that's just not the way it works.

C
But if we're all unbiasedly bringing in data from our processes and then putting it up for the world to see, and, with good intentions, raising those problems up, or demonstrating who's doing well and who's doing poorly with the intention of correcting that bearing, saying "oh, I can learn something from this team over here," then I think this product helps. It's...

B
You've got this gut feeling that it's your team doing this, and it's like, no, I've got data that backs us up: you are introducing a regression quarterly, you know, and it's causing this thing. So it's tangible data that you can use to make a difference and essentially get better.
C
Yeah, that's awesome. Well said. Okay, so overall, what we're measuring is software delivery performance, and we're measuring that with the four key DORA metrics: lead time for change, deployment frequency, mean time to restore, and change failure rate; we'll get into a better definition of what those things are. Also, if this is starting to sound compelling to you or your organization, you can deep dive, and some good books to check out are Accelerate especially, and I like Lean Thinking as well.

C
So if this is starting to sound like something you need to have for your organization, I highly recommend reading these books here, especially those two. Today we're going to go kind of Cliff Notes edition, never fully recommended, but here we go. So what are those four key DORA metrics, and what do they mean? We'll start with lead time for change on the left, and this is kind of a measure of your market agility.
C
You want a low lead time for change, which is the time from when your software engineer committed the code to when it was actually put in a production build. So small changes, small features getting out to customers, bug fixes coming in often: this is what you want to see. You don't want to see the monolithic, huge release that overwhelms your customers and introduces a lot of new bugs all at once; that's not the ideal, according to some of the reading that I previously mentioned. Same thing with deployment frequency. Now I have a lawnmower here.
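As a rough illustration of the arithmetic behind lead time for change (a sketch with invented timestamps, not how Pelorus implements it):

```python
from datetime import datetime, timezone

# Hypothetical events for one change: when the commit was made and when the
# build containing it reached production.
commit_time = datetime(2023, 7, 20, 9, 15, tzinfo=timezone.utc)
deploy_time = datetime(2023, 7, 20, 11, 45, tzinfo=timezone.utc)

# Lead time for change = time from commit to running in production.
lead_time = deploy_time - commit_time
print(f"Lead time for change: {lead_time.total_seconds() / 60:.0f} minutes")  # 150 minutes
```

The lower that number stays across many small changes, the more agile the delivery.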
C
Okay, so moving on to deployment frequency, again an indicator of small batch size. You want to have lots of small deployments getting out to production; the more often you're doing it, the better you are at it, and the easier it is for your customers to consume.

B
And I just want to give an example of this, right. Within our team, you know, we do validated patterns and stuff like that, and there was one time I had this huge commit, right, and it had like 400 commits in it. It was ridiculous. It was massive.

A
When your customers or other software engineers just get big, bucket-sized changes dumped on them... I mean, how many things have I seen go wrong because a dependency dropped just a bucket of changes and something was missed, because the changelog was too big for one person to truly consume and comprehend, right? It happens all the time. So it's how you're a good friend to your friends and co-workers, and it's how you're a good vendor to your customers. It's...
A
...probably a little bit better for that, I think. You know, one of the fun things is that, although we have a lot of Red Hatters who watch this stream and participate in the chat, we do actually have quite a few customers who watch the stream and occasionally make it to participate in the chat too. So we actually get to live that "our customers are our friends" here on GitOps Guide.

C
Hi friends, hi friends. So the small size, I think, also comes into this: small batch size and frequent deployments come into that. On the far right is change failure rate, your favorite metric, Hillary. In case you have to roll back: if you have to roll back a huge monolithic change, oh, the pain. You might...
C
...be rolling back a feature that some other customer needs. Not okay. So what is change failure rate? It's the percentage of deployments, overall or in a certain time period, that you had to roll back to fix, and so obviously you want to keep that as small as possible, and in the worst-case scenario, if you do have to do it, the small batch size really helps.
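As a tiny illustration of that percentage (invented counts, not Pelorus output):

```python
# Hypothetical counts for one application over a chosen time window.
total_deployments = 20
failed_deployments = 2   # deployments that had to be rolled back or hotfixed

change_failure_rate = failed_deployments / total_deployments
print(f"Change failure rate: {change_failure_rate:.0%}")  # 10%
```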
C
Mean time to restore is a little bit easier to understand. It's simply how long it took once you did roll back: you opened up a bug, rolled back, and then how long did it take to get a fix in, push it out to production, verify it, close the bug, and...

A
...know that the incident is fixed, whether by a push to production or by a manual intervention. Because we're also using mean time to restore to track how we're doing against our error budgets and our SLOs, which is not really part of the intent of the metric, I don't think.

A
But it's one of the ways we started applying the metric, because it helps us understand cost, not just the dollars spent on cloud compute, but also the dollars spent on people's time and effort.

C
Yeah, I could see the same thing being opened up if there was a network outage or something like that and your service was down; if you wanted to track that, absolutely, I think that's totally appropriate. All right, so we know what the four metrics are, vaguely, Cliff Notes style.
C
What do software organizations do that impacts these four things? This is where I kind of start going crazy, so calm me down a little bit if you need to. In every software organization I've ever worked in, all of these things on this slide that I have under each metric are done a little bit differently, and they're also done a little bit differently each week and each month.

C
Let's use engineer engagement as an example. Are the engineers on your team highly motivated, coming running to their keyboards in the morning, drinking coffee, banging out features and bug fixes? Or is it coming up on a holiday, is it almost Christmas, and everyone's burnt out and just needs to get away? That happens on every team.

C
It's kind of like when you take your car to the dealership: you never want to get your car worked on on a Friday, because the folks working there are not thinking about your car, they're thinking about the weekend, and you don't want to get it done then. This engineer engagement is variable even across the best of teams, across time, and obviously very different across personnel. So it would be really neat if we could see that through the lead time for change. Same thing with code reviews.

C
Is it easy for them to find bugs themselves, with automated unit tests and integration tests? Is that running through CI? And is your CI fast feedback, or do you have to put in your patch and then come back the next day and read the results, when you're out of context and you need to get back into flow state?
A
So I'm not sure how much of a real question this is, because I do not know, but the question is: is code review relevant at all? And feel free to answer it too, because I don't know if you actually expected me to address it; feel free to clarify the question. Is code review relevant at all? I don't even know how to take that question; it's an interesting thought experiment. I will say something I've seen done well.

A
Some of the teams at Red Hat have somebody who's sort of almost on call to do the code reviews for a period of time, right, and then that changes; whoever's supposed to do the code review will alternate through a rotation, to help reduce code review burnout, and the people who are doing the code...

A
...reviews are not necessarily expected to do feature work during that period of time, so that they're not context shifting too much and compromising the integrity of the review.

C
I think code reviews are an opportunity for more experienced, senior engineers to help grow newer engineers with their patches. As long as everyone is doing it with good intentions, being nice and trying to help folks out, code reviews are an amazing learning opportunity for everyone. It's also something where I've recently thought that maybe a first pass with AI would be a good thing: hook into your PRs to automatically have whatever AI agent...
D
I also think that code reviews are important here because they may affect the change failure rate at some point: the fewer reviews there are, the fewer eyes on the code, and the higher the change failure rate may be, right? Yeah, yeah.

C
There's a big variance in code reviews. I've seen it done really well, where you're not only looking at the code, you're pulling the patch, trying it out, and really giving that feedback. It's going to be something where everyone's experience is a little bit different.

C
Well, to this point, an example would be: let's say there's a team that has a high lead time for change but a very low change failure rate. Maybe that difference is all in the quality of the code reviews; I don't know, but you wouldn't know unless you measured it. Moving on to deployment frequency: there's some really cool new stuff out in the industry, GitOps, Argo, MLOps, to help you get your code from GitHub to production.
C
All automated, all very cool stuff that will help your deployment frequency go up. If you're not cutting edge, if you don't have those things, there's something as basic as checklists: when I used to help deploy, let's see, it was Satellite hosted, back in the day, all we had was checklists. This is a long, long time ago. If I get wheeled into the emergency room to have my leg cut off, and I look down at my leg and the correct leg has the mark on it...

C
...for the one to be cut off, I would be very happy and more confident. I'm kind of paranoid about that. But checklists: surgeons use them, pilots use them. If that's all you've got, use it. Security scans, improved change management, all those kinds of things help your deployment frequency go up, and they vary greatly from team to team. I'm going to keep re-emphasizing that, because we want to learn from each other, experiment within teams, and take the best practices. That's your bearing; that's...

C
...what Pelorus will help you find. Flipping over to change failure rate, again on the far right: this is the percentage of bugs found in production. Are you doing deep planning and design, looking for edge cases while you're in design? Are your test environments like your production environments? Are you having retrospectives? Are you implementing "do not repeat yourself" when it comes to errors? All of those things vary from team to team and also help reduce your change failure rate. Mean time to restore: same kind of thing.

C
What's your rollback strategy? Do you have a backup of your production instance? Can you just take the last known good, or N minus one, and send it out instantly? Do you have failover across regions? Maybe you can do it that way. But getting from outage back to running as quickly as possible varies from product to product and is key for customer success.
A
So we have these four key ones that they've come up with, and they kind of left the door open: DORA metrics could be these four plus additional metrics that are really important to your team and your organization. As we look at the things within these things, I'm sure there's room for other metrics to be invented in a way that helps your team and your organization find its bearing. Everywhere I've worked, every team I have worked on...

A
...we have worked differently, but in a way that worked for the team, so I think that's really important, kind of with that whole conversation about no true north, just a bearing. I think that's important context, because a lot of the folks who watch this stream have been in the industry a long time and have that same shared experience, but some folks are newer to the industry and won't have that perspective.

C
As we've discussed, probably every team that uses Pelorus will have unique needs, and I think Michal will touch on that in some of the upcoming slides, I hope.
D
Okay, so far, Wes, thank you for your slides and for explaining what the DORA metrics are. In this part of the presentation we focus on how Pelorus measures those metrics. To start with, we first need to better understand the definition of what an exporter is, and then we will go to the architecture of Pelorus and each of the exporter definitions: what it captures and how it then translates into those DORA metrics.

D
So Pelorus is an operator, currently a community operator, as was explained on the first slide, available in the OpenShift Marketplace. We have opened a ticket to make it also work on native Kubernetes. We...
D
It installs a couple of components. On the right side of the slide you can see that it installs Prometheus and Grafana, and this is a simplified example, because later we'll go to a slightly more interesting example architecture with exporters. An exporter is a Prometheus concept; it's not something that Pelorus invented. It's a Prometheus concept for gathering data from different...

D
...sources. Every once in a while Prometheus just looks for the data there, and then it stores it in its own database for visualization, with some rules on top of that. Pelorus also installs, or sets, some of the rules inside Prometheus and Grafana, and this is the secret sauce, because those rules are then what becomes visible to us in the UI.
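For anyone who hasn't seen one, here is a minimal sketch of the exporter concept using the prometheus_client Python library; the metric name and values are invented for illustration and are not the actual Pelorus metrics.

```python
import time
from prometheus_client import Gauge, start_http_server

# An exporter is just an HTTP endpoint exposing metrics in the Prometheus text
# format; Prometheus scrapes ("pulls") it on its own schedule.
demo_metric = Gauge(
    "demo_last_event_timestamp",   # hypothetical metric name
    "Unix timestamp of the last observed event (illustrative only)",
    ["app"],
)

if __name__ == "__main__":
    start_http_server(8080)        # serve /metrics on port 8080
    while True:
        # A real exporter would fetch this from an external API (Git host,
        # issue tracker, cluster API); here we just record "now".
        demo_metric.labels(app="sample-app").set(time.time())
        time.sleep(30)
```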
D
We integrate with Prometheus directly there. Pelorus currently has three main types of exporters, commit time, deploy time, and failure, and as we go through these slides we will understand what the function of each exporter is. There was also a new exporter introduced, the webhook exporter, and this one does it all for us: basically, the webhook exporter can serve as any type of exporter, plus some extra types as well.

D
Some teams, some organizations, have special requests, so the webhook exporter also allows us to address those special needs. It's a pluggable architecture, just Python code, that allows us to gather more information than just those three metrics currently.

D
Okay, so this is a very simplistic diagram to show what I have already said. We can see those three exporters, commit time, failure time, and deploy time, pulling information from the given data sources. So this is a pull method: we connect to different APIs, to different services, and pull this data, and then our Prometheus instance, which is also deployed as part of the Pelorus operator, pulls the data from those exporters. On the right side...

D
...there is the webhook exporter. The webhook exporter acts as a kind of proxy to Prometheus: we push the data to it from any source, so it can integrate easily with third-party CI systems and failure tracking systems. We have a well-established structure for the data that needs to be sent to that webhook, and it will then expose this data to the Prometheus instance.
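Purely to illustrate the push model, here is a small sketch using the requests library; the URL, path, and payload fields are assumptions for the example, not the real Pelorus webhook schema, which is documented with the project.

```python
import time
import requests

# Hypothetical webhook exporter endpoint reachable from a CI job.
WEBHOOK_URL = "http://pelorus-webhook-exporter.example.svc:8080/pelorus/webhook"

# Illustrative payload for a "deployment happened" event; field names are made up.
event = {
    "app": "sample-app",
    "event_type": "deploy_time",
    "image_sha": "sha256:abc123",
    "namespace": "sample-app-prod",
    "timestamp": int(time.time()),
}

# A pipeline step could push this after a successful rollout; the webhook
# exporter then exposes it for Prometheus to scrape.
resp = requests.post(WEBHOOK_URL, json=event, timeout=10)
resp.raise_for_status()
```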
D
Okay, so now a bit about the exporters. The commit time exporter is the one that...

D
...connects the commit hash together with the image SHA. So we are building a container, with some image, with some SHA available to us, that was created from a particular commit in the Git repository, or...

D
...if we are using the webhook, then we can use any repository here. This information is then combined into one metric, which consists of information such as: what application we are deploying, what the commit hash is, what image SHA was used to deploy that application, the namespace where the application was deployed, and a timestamp of when this event happened.
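To make that concrete, here is a hedged sketch of what such a metric could look like when exposed through prometheus_client; the metric and label names are illustrative, not the exact series Pelorus exports.

```python
from prometheus_client import Gauge

# One time series per (app, commit, image, namespace); the sample value is the
# commit timestamp, so a deploy timestamp can later be joined against it.
commit_time_metric = Gauge(
    "demo_commit_timestamp",   # hypothetical name
    "Unix timestamp of the commit that produced an image (illustrative)",
    ["app", "commit_hash", "image_sha", "namespace"],
)

commit_time_metric.labels(
    app="sample-app",
    commit_hash="1a2b3c4",
    image_sha="sha256:abc123",
    namespace="sample-app-prod",
).set(1_689_840_000)  # e.g. the commit's author timestamp
```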
D
So
having
this
information,
we
can
either
use
on
the
this
webhook
exporter
or
we
can
automatically
query
some
of
the
apis
and
power
supports
the
from
the
git
endpoints
GitHub,
GTA
and
gitlab
Azure
devops.
Also
it
supports
the
image
stream
in
openshift,
so
openshift
have
concepts
of
the
internal
registry
and
the
objects.
Their
image
streams
can
be
annotated,
labeled,
and
then
we
can
take
advantage
of
this.
By
taking
this
information
and
the
commit
information
from.
B
D
D
D
This
goes
outside
of
openshift
objects
and
apis,
and
it
allows
developers
to
embed
this
information
directly
to
the
container
image.
Then
we
have
a
failure.
Exporter
and
the
failure.
Exporter
is
capturing
the
timestamp
at
which
the
failure
in
a
production
occurred,
and
it
was
resolved.
Why?
In
production,
because,
for
example,
we
can
measure,
we
can
store
our
failures
in
a
jira,
but
not
all
of
the
failures.
There
are
production,
one
from
the
dura
perspective.
We
are
really
interested
in
the
ones
that
are
affecting
our
deployments
on
a
production.
D
So
somehow
we
need
to
filter
out
from
all
the
jira
tickets
from
all
the
jira
cards,
the
ones
that
are
actually
causing
the
failure
on
the
production
and
every
deployment.
Every
group
every
organization
may
use
a
different
layout
of
jira
cars
of
jira
labels.
That's
why
we
have
our
own
defaults
plus.
We
have
included
a
custom
query,
so
really
anyone
can
adjust
this
failure,
exporter
to
its
own
jira,
workflow
and
based
on
a
well-known
giraffe
queries.
Also,
this
failure
exporter
allows
us
to
acquire
apis
from
the
GitHub
issues:
servicenow
pager,
Duty,
Azure,
devops,.
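As a hedged illustration of that kind of filtering, the sketch below uses the jira Python package to pull only the issues a team has chosen to mark as production failures; the server URL, project, and label are assumptions standing in for whatever convention your own workflow encodes through the exporter's custom query.

```python
from jira import JIRA

# Hypothetical Jira instance; authentication details are omitted/illustrative.
jira = JIRA(server="https://jira.example.com", token_auth="REDACTED")

# A JQL query narrowing all tickets down to production failures for one app;
# the project key and label are examples of a team-specific convention.
jql = 'project = SAMPLE AND type = Bug AND labels = "production-failure" ORDER BY created ASC'

for issue in jira.search_issues(jql, maxResults=50):
    opened = issue.fields.created            # when the failure was raised
    resolved = issue.fields.resolutiondate   # when it was resolved (may be None)
    print(issue.key, opened, resolved)
```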
D
Because the webhook, as we said, is a push data model that serves it all. Then there is the deploy time exporter, which captures the timestamp at which a deployment happened in a production environment. An application can also be deployed to a staging environment, or to some testing environment, some other namespace, but we are really interested in the ones that are in production.
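A rough sketch of that idea of looking only at production workloads, using the official kubernetes Python client; the namespace and label selector are assumptions for the example, and this is not the actual Pelorus deploy time exporter code.

```python
from kubernetes import client, config

# Use the pod's service account when running inside the cluster.
config.load_incluster_config()
apps = client.AppsV1Api()

# Only look at the namespace the team treats as production, and only at
# workloads carrying the label used to mark monitored applications.
deployments = apps.list_namespaced_deployment(
    namespace="sample-app-prod",    # hypothetical production namespace
    label_selector="app.kubernetes.io/name=sample-app",
)

for d in deployments.items:
    # The creation timestamp of the Deployment (or of its ReplicaSets/pods)
    # is the kind of event a deploy time measurement cares about.
    print(d.metadata.name, d.metadata.creation_timestamp)
```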
D
So it's really important to understand how we are deploying the application and how we are building the application, to make sure that Pelorus gathers the events that are really interesting, the ones that are affecting production and not staging. There are many, many different configuration options, and the first piece of homework anyone should do when trying to adopt Pelorus, for it to be useful, is actually to look at how the end-to-end application lifecycle, the application pipeline, is done, and then add those exporters to fit those application pipelines. Okay.

D
Thanos is not on this diagram, so as not to pollute it, but this diagram is here to show that we may have multiple clusters, with one Pelorus instance per cluster, and then, for that particular cluster, we may have the exporters that serve the needs of that particular cluster, or of a particular application living in that cluster. On the left side we can also connect this Pelorus instance, and there have been some attempts at that, to any external Prometheus exporter, because this is, as I said, Prometheus...

D
Okay, okay, thank you. Yeah, I don't see your faces or anything that happens in the chat, so let me know if there's anything there. Okay, and on the left side we can also see an external Prometheus exporter, because maybe there is another instance running somewhere else, off the cluster, so we also allow that, to customize the deployment of Pelorus. We also connect everything through S3 storage, to ensure that if the cluster goes away, or there is a redeployment on another cluster, this data is not lost. Why?

D
Because the data for the DORA metrics is the most important thing for us. Really, the purpose of Pelorus is to store this data for a longer period of time. It's pretty hard to actually show that in this demo, because here we store the data for only one to three days, but the real value of this tool is to monitor the deployments and monitor the commits across...

D
...a bigger set of instances, of clusters. This slide shows the Pelorus dashboard we use. So, let's go back: we use Grafana here, in the top left corner. We also use the community operator to deploy Grafana, and it points either to Prometheus directly or, if we are using the S3 storage, to a Thanos query endpoint, to aggregate this data from multiple clusters. Grafana then lets us represent this data in those four nice, simple views, and this is the view that Pelorus provides.
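Anything Grafana shows here can also be pulled straight from the Prometheus (or Thanos) HTTP API. A hedged sketch, with an invented metric name and URL, since the real recording rule names live in the Pelorus source:

```python
import requests

# Hypothetical in-cluster Prometheus/Thanos query endpoint.
PROM_URL = "http://prometheus.example.svc:9090"

# /api/v1/query evaluates a PromQL expression; the metric name below is a
# placeholder, not an actual Pelorus recording rule.
resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "avg_over_time(demo_lead_time_minutes[6h])"},
    timeout=10,
)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    print(sample["metric"], sample["value"])
```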
D
How Pelorus collects this data is one thing: we know that it connects to different APIs, it stores information from the OpenShift objects themselves, and we can push data to those exporters. But I think we also need to understand how this lives in the application pipeline. So, in a good DevOps environment, and I have split the phases into some simple blocks here, we have continuous integration: this is everything that happens before pushing some code to the delivery phase.

D
So there are some builds happening, some tests, some small CI, and the best case is if it's all automated, of course. Writing the code itself cannot yet be automated, though we don't know, in a couple of years from now maybe it will be easier to write code. But here we just capture the point where the code was reviewed, or not.

D
The source from which the application is being built: we take this artifact, call it an artifact, and then we move it to the production phase, and from that point we know that the code was committed at some point in time, and this is the first point that Pelorus is interested in: the commit time. The commit may have happened even a year prior to the build time; however, Pelorus, as we know, has a direction.

D
It's not a tool for going back in history, so we are really invested in the events that happen from the time that Pelorus was deployed; even if we do gather data from the past, sometimes we just ignore it. So the next part is the production phase, where we build the source code, we test it again and again and again, and I hope that the testing phase is good enough not to introduce problems, and then we create some artifact. It can be a container image; it can be just a tarball.

D
It can be just the source code stored in the Git repository, because, for example, OpenShift allows you to build the application from source to image, and then the artifact is just the code itself. Then this artifact is sent to a staging environment, a testing environment, further environments, we don't know, but at the end it's good to have it in production. So this is the point where Pelorus is looking at the second event, looking at when the application actually landed in the deployment, and going back, we could see in the slides...

D
...when the commit time was, when that first event occurred, and based on that we can calculate some of the DORA metrics. Then our application is working, and of course bugs happen, because why not, and the application fails, because bugs can cause production application failures, and we are invested in those events: the application failed, so we are checking when the application failed.

D
The entire application pipeline can then start from the beginning, because there needs to be a fix, there needs to be an artifact, there needs to be a delivery phase, and a new production-ready application, and then, when the application is fixed, we see the third data point, the event that the application is working again, and this is what our failure exporter does. I hope this is understood, and now let's focus a bit on the four metrics from the...
A
Just a quick time check, because we've got about 10 minutes to go, and I know that you did have a demo that I think you wanted to do, so I don't know if you... okay.

D
So I will be fast; the demo will occupy three minutes, and this will take around five more, maybe three. Let's see. So, tell me why...

D
The change failure rate, which was also explained at the beginning, is also a very simple calculation: it's the number of failed changes divided by the total number of changes to the system. Under the hood, the secret sauce, the Prometheus rules, are a bit more complex than just this, but we don't need to go there: just the number of failed changes over the total number of changes to the system, and a "change to the system" can be many things here. But if someone is interested, the rules are in the source code.
D
Then there is the mean time to restore, which comes from the failure exporter: how long does it take to restore service in production when a service incident has occurred? And this is the average over a period of time, so we can say, again, that within a one-month or two-month time frame our average was 20 minutes to restore from a production failure. This can also be applied to one particular application, or across a group, or across all the applications that we are monitoring. Okay, so, I'm sorry, I'll move back.
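A small sketch of that averaging, with made-up incident times, just to pin down the arithmetic:

```python
from datetime import datetime, timezone

# Hypothetical production incidents: (failure opened, failure resolved).
incidents = [
    (datetime(2023, 7, 1, 10, 0, tzinfo=timezone.utc),
     datetime(2023, 7, 1, 10, 25, tzinfo=timezone.utc)),
    (datetime(2023, 7, 9, 14, 30, tzinfo=timezone.utc),
     datetime(2023, 7, 9, 14, 45, tzinfo=timezone.utc)),
]

# Mean time to restore = average of (resolved - opened) over the window.
restore_minutes = [(end - start).total_seconds() / 60 for start, end in incidents]
mttr = sum(restore_minutes) / len(restore_minutes)
print(f"MTTR over this window: {mttr:.1f} minutes")  # 20.0 minutes
```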
D
We have in our source code, which is here, a demo, where anyone can go and quickly run it against an OpenShift environment, and this demo will deploy a Tekton pipeline and build a simple application using source-to-image or a binary build, and then...

D
...also label everything for Pelorus to monitor properly, and this demo will give us a pretty nice feeling for how Pelorus works. For today I created a slightly different environment, where I was using the webhook exporter to send the data, and I created three applications: a GitOps Guide to the Galaxy application, then a second application, the Galaxy application, and a third application, which is actually called here the second sample application. And here, in the Grafana dashboard, we can select a different time slot.

D
The average lead time for change within those six hours was 6.7 minutes; I hope that everyone understands now what the lead time for change is. Then the deployment frequency: we got 23 deployments over this period of time, and we got a couple of failures in production. You can see here there's also a very short-lived failure. So we have one, two, three, four, five failures, and we can see that over the 23 deployments there were five failures, and over this time it took 8.6 minutes on average to fix them. Yeah.
C
And one of the things I really like about this view for organizations, coming back to your bearing: obviously we haven't been running this particular instance for a year or two, but the real value of having this running for that long is: where were we, where are we now, where were we three months ago, six months ago?

C
And let's compare those, or even a year ago: time enough for these organizational changes to work their way in. Kind of like raising interest rates, it takes a while for the effect to be read. So that's where I think, on a day-to-day basis, this isn't going to be the most useful information, because change doesn't happen that quickly, but really being able to see, like a diary or a log from a year ago, where you were in your organization, that proves to be extremely valuable, I think. Exactly.
A
I think it's also really important, when you look at these metrics: in the actual DORA DevOps research there are these concepts of DORA "high" and DORA "elite" performers and so forth, and I think, rightfully so, those get a lot of pushback. Because if you slow down your deployment frequency slightly, maybe you drop out of DORA elite into DORA high, but your change failure rate also drastically dropped off and is now almost zero percent, right?
D
Okay, so to finish this presentation, I'd like to point to our documentation. It's under dora-metrics.io, and we put a lot of effort into making it readable and pretty clean. You have a quick start, an installation tutorial, and a long configuration list, because, as I said, it's important to understand the process and adapt those exporters to your needs, and we made quite a nice path for contributors to take easy first steps in the Pelorus project. Basically, everything is done through the Makefile.
D
Also, actually, Wes added a pretty nice thing, which is called "make help", which is unique here, so it shows you everything that make will do for you. Yeah.
A
Yeah, okay, okay, that's great, that's fantastic. Okay, well, that timing was perfect; we are exactly at the top of the hour. Thank you so much for coming here and talking to everybody about these concepts and showing us the demo. I really appreciate it. I don't have any closing thoughts, and I don't see anything in the chat, so if you snooze, you lose; just tweet me questions later and I'll relay them through the Slack ether to get the answers, should you so desire. And I think... is there a...?

C
Yeah, GitHub is the best way to do it. There are the GitHub discussions and issues; discussions, probably, is the key one, yeah.

A
Yeah, perfect. So yeah, you can find these folks on the discussions on the GitHub project, which I linked... somewhere here it is, I'll show it. There you go: pause your screen, copy the comment from the YouTube interface, however you do it. We will let these guys go. We are going to... I will hit end stream.