From YouTube: 2022-10-06 Scalability Team Demo
A: I'll do that. So, Tamland. This came up when suddenly nobody was available, and then Stephanie and Matt needed to review things that I'd created, and they'd never seen Tamland before. So I think this is a little bit of an introduction to what it does and how it does it. So, where shall we start? Any suggestions on where to start? So we'll start with the source metrics and the saturation points in the runbooks.
A: Whatever you think about when we're looking at things, throw it out, because, yeah, I don't know what to talk about either, and I've just been doing stuff. So, the runbooks. We have these things called saturation points in our runbooks; look at the Redis CPU, that was the one that I was recently looking at. The point of these metrics that we define, like the query that we defined here, is that it's supposed to spit out a percentage.
A: So, a number between zero and one: zero is excellent, one is burning. And then we set the SLOs on that, which is where we're going to alert. That's the short-term thing: we generate alerts from these metrics, so as soon as this goes above this SLO, the on-call gets an alert.
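A rough Python sketch of the convention just described; the names here are hypothetical, not Tamland's actual code:

```python
# A saturation point's query yields a ratio between 0 and 1:
# 0.0 is excellent, 1.0 is burning.
def should_alert(saturation_ratio: float, slo: float) -> bool:
    """Page the on-call as soon as the ratio exceeds the SLO."""
    return saturation_ratio > slo

# e.g. a Redis primary CPU at 0.95 against an SLO of 0.90 pages the on-call:
assert should_alert(0.95, slo=0.90)
```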
A: What we also do, so, from that we generate a whole bunch of...
A: Oh yeah, I'll... I won't. But, so, Tamland. Tamland is like an IPython notebook, so, a Jupyter notebook. Let's pick the Redis one, since we're following that around, yeah. And here...
B: This is the piece I'd love extra details on; I haven't worked with Jupyter notebooks at all.
A: All right, okay. So it's funny, because these are not... okay: normally you'd write the markdown file for a Jupyter notebook yourself, and the code cells are Python that gets executed.
A: We generate our markdown files, and this one is from service.md.jinja, so that's the template for this file. The front matter is, yeah, just like an introduction that's rendered on every page, not super important. It's this bit.
A: Then these imports, saturation forecasts and operation forecasts: those are the two sections that are going to be on the page. If you look at the page... let's open up the Redis one.
A: So first we've got the Redis non-horizontally-scalable resources. Here's the primary CPU one that we were looking at before, and that comes from...
A: ...and that's just a dump of what you just saw in the runbooks repository. That's synced every week, I think, but there's a bot creating merge requests for us every time saturation points and so on change. This manifest gets updated, and that's what we use to generate this markdown file, which here, in the components dict, has every saturation point that we pass in.
A: The operation rates are a little bit different, because there we just select all of the operation rates for a certain service, using the GitLab component operation rate, like, for every SLI. This is less clever: here we know which ones we should have, the horizontally scalable ones and the non-horizontally-scalable ones, so we know which ones we should have, and we render it all out in this dictionary.
A: So if you see here, it reaches out into this saturation forecast plot_series method, and it prints out this components dict. So, let's compare... okay.
A: So, let's... yeah, go ahead, Matt.
B: I wanted to let you finish, and then... oh, I had a couple of topics to kind of explore as tangents, so keep going. This is fantastic. Okay.
A: So first, saturation forecasts. That's where we do this saturation thing for all of the component saturation metrics. So those ratios that I just showed you in the runbooks, those get rendered out in these first two sections, and they're basically the same, except ones are marked horizontally scalable and the others aren't, and we treat the non-horizontally-scalable ones as more important, because they're more difficult to scale, so we want to look at them first.
A: So then let's go into the saturation forecasting itself, and the plot_series method is what we're looking at. So we get the page name; that's all stuff that's just used for the report. Here's the components dict: that's this huge dictionary that you saw on the other side that we pass in. And then for each component we do a plot_forecast here, and what that's doing is getting a bunch of the...
A: ...properties of that component, to know which query to build: so, like, removing the outer-join labels; removing the threshold, because that's not part of the labels that are on the series. And then we get the capacity planning strategy, to know... yeah, I can show you that. And load the historical data frame, to know which metric we're going to use, like, we have these quantiles.
A: When we specify other capacity planning strategies, then we pick another query, but this is the query that we're going to be performing in the end. This series passed in here, it takes the dictionary that we just saw, and whatever's left in it gets turned into the label selector.
A: Batched query range: we pass in the query that we just built here on the left, and then the start date and the end date. Tamland uses a 180-day history, so we load all the data from 180 days ago to yesterday at midnight; yeah, yesterday, end of day.
B: Out of curiosity, where is that 180 days specified?
A: That is in an environment variable. Let's start again at plot_series, so...
A: Yeah, okay. But it's an environment variable, and now that we have the cache we could extend it, but yeah. So where was I? Querying with ranges.
A: I'll show you, I was just... okay, let's keep walking through, like, that's where we're going to end up, I'm fine here. So what we do is we batch the thing: 180 days we can't load in one query from Thanos. Maybe after Matt and Igor are done with the Thanos compactor we would be able to, but I don't think so.
A: So we batch this into the 180 days, and we load 24 data points, step 3600; that's the resolution that we're using. So we load 24 data points per chunk of one day, and the chunk of one day is defined here.
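As a sketch of that batching, under the assumption of a 180-day history at a 3600-second step; the environment variable name and the fetch callable are made up for illustration:

```python
import os
from datetime import date, datetime, time, timedelta

HISTORY_DAYS = int(os.environ.get("HISTORY_DAYS", "180"))  # hypothetical env var name
STEP_SECONDS = 3600  # hourly resolution: 24 data points per one-day chunk

def day_chunks(history_days: int = HISTORY_DAYS):
    """Yield (from, to) pairs, one per day, from 180 days ago to yesterday end-of-day."""
    end = datetime.combine(date.today(), time.min)  # today 00:00 == yesterday, end of day
    start = end - timedelta(days=history_days)
    while start < end:
        yield start, start + timedelta(days=1)
        start += timedelta(days=1)

def load_history(query: str, fetch):
    """fetch(query, frm, to, step) stands in for the Thanos range query."""
    return [fetch(query, frm, to, STEP_SECONDS) for frm, to in day_chunks()]
```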
B: This is fascinating. The Thanos component called Thanos Query also breaks up large time spans into smaller ranges for caching purposes, so we're doing something similar here in Tamland. I just learned about this fairly recently. Thanks, yeah.
A: Me too, thanks to Igor walking me through them, like... because you can have overlapping blocks at that point, because you have the raw data and the downsampled data, which have the same... yes, yeah.
A: So yeah, there we go: we separate it into days, and then we iterate through all of the days, and this is the loop that iterates, and we stop if...
A: ...yeah, if we've reached the last date that we need to get. Now, as for the cache: every ranged query like this is cached, so we do that query_range_with_cache.
A: We don't do that anymore. So we have this prom cache manager thing that takes the query and the from and to datetimes; this is already sliced up, so here this from and to will be 24 hours apart, and this step is going to be 3600, so the 24 data points. And then we try to read from the cache manager.
A: This one, I think... yep, here we go. So, what does it do?
A: It loads the entire day and returns that data frame, and the data frame is a pandas DataFrame; prometheus-pandas is the library that we use to query Prometheus and so on. So it returns exactly the same thing as we would have if we queried: query_range would return the same kind of data frame.
B: The width of these individual... what are we calling them, chunked ranges?
A: The width of this... wait, the width of this data frame returned is one day, yes.
B: Okay, great. And I'm just realizing that earlier you mentioned that we sync the definitions for our saturation thresholds on, I think you said a daily basis; maybe you said weekly.
B: I was just kind of thinking about the... like, if we made... sorry, I said that I'd hold questions; let's keep going with this.
A: No, no, no. I see where you're going. I think there is likely something to go wrong with invalidating the cache if we change the same metric to be different suddenly.
A: Makes sense, yeah. So, awesome, okay: we read, and then we go through the cache, and the cache looks like this. So these, and this is the part of the thing that I might want to change, these are like a hash of the query, so exactly the query string that we pass into Thanos; it's SHA-256 or something like that. And then for each query we've got the day.
A: This is just stored as a directory.
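Roughly, the layout described is the following; the directory name and file format here are assumptions for illustration:

```python
import hashlib
from datetime import date
from pathlib import Path

CACHE_DIR = Path("cache")  # a plain, git-ignored directory

def cache_path(query: str, day: date) -> Path:
    # One directory per query, keyed by a hash of the exact query string
    # sent to Thanos ("SHA-256 or something like that"), one file per day.
    digest = hashlib.sha256(query.encode()).hexdigest()
    return CACHE_DIR / digest / f"{day.isoformat()}.parquet"
```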
A: And no, not in the git repo; it's not checked in, it's written and git-ignored. But we have this task, so this is how it loads the data. Do you want me to continue from here and go into how it gets the data and caches it, and makes sure that this thing doesn't take six hours to run? Or do you want me to continue into the forecasting bit?
B: Let's skip the forecasting for now and continue with the mechanics, okay? Okay with you, Stephanie?
A: Yeah, totally, okay. So here the query gets a cache hit, and if it isn't there, it's written. So here, the query is a cache miss: we perform the query with retries, so if there's one failing, then we try again. When we do get the data frame, we write it out, and that happens here, in the directory structure that I just mentioned to you.
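Sketched out, reusing the hypothetical cache_path from above; the retry and storage details are assumptions:

```python
import time
import pandas as pd

def query_range_with_cache(query, day, fetch, retries=3):
    path = cache_path(query, day)
    if path.exists():                 # cache hit: return the stored frame
        return pd.read_parquet(path)
    frame = None
    for attempt in range(retries):    # cache miss: perform the query with retries
        try:
            frame = fetch(query, day)
            break
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off, then try again
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_parquet(path)            # write it out for the next run
    return frame
```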
A: This populate-cache command... as you remember, we went into this coming from the book, which performs a bunch of queries and at some point hits the query_range_with_cache method. Here we're coming at it from the other side, so we don't start from the book, we start from this script. And here, let's go for... we were talking about saturation components, so here are all components and all saturation components.
A: We've collected all those here, and then we call populate-saturation-component for each of those. All of this here with the futures and so on, the thread pool executor, is a way of doing this concurrently, because most of the time most of the data is cached: of the 180 days, only one day is not cached. So we do this in several threads to just iterate over all this, like spinning wheels, over the 197...
A: ...of data that we're going to load, see if it's all there, and if it's not, populate it. Yeah, and we limit it to five, so that when we do hit the new day we don't have a thundering herd on Thanos. Yes. So here's the populate-saturation-component, and you see here that it takes the same plot_forecast method that we just saw, and it, yeah, does this thing with from and to, which are the same forecasting dates here, to load all of that.
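The concurrency described boils down to something like this; the function names are stand-ins, not Tamland's actual code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def populate_cache(components, populate_saturation_component):
    # Most of the 180 days is already cached, so the threads mostly spin
    # through cache hits; capping at five workers keeps the one uncached
    # day from turning into a thundering herd against Thanos.
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(populate_saturation_component, c) for c in components]
        for future in as_completed(futures):
            future.result()  # surface any failed query
```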
A: So it's going to go through all of this cache manager stuff here and write out the data directory, and it skips generating the forecast, which also takes some time, so it's slightly faster than running the entire book. That's what the historical-only flag here does: it skips calculating the forecast. The forecast, I'll get to it later.
A: That's how the cache gets populated. Any questions here before we go back to the forecasting bit?
A: Then, if we're not coming from the cache thing: this historical-only flag is set to true for populating the cache, and when we're rendering the book it's false. So then we're going to generate a forecast.
A: And this is, yeah, where the smarts happen. So, this Prophet is a little bit of a black box that you can add some parameters to, and that's going to...
A: Yes, so make_future_dataframe: that's going to make the same kind of pandas data frames as we got from the Prometheus queries, so we've got the same kind of things. We could actually write them out the same way if we wanted to, but those might change, obviously, so we don't.
B: I'm probably just missing it, but how do we pass in the input data for Prophet to consume? I see where we're configuring daily and weekly seasonality.
B: Oh, m.fit, maybe? Yes, okay, about five lines down.
A: Cool. And then here in forecast, we basically prepare for rendering it out in a pretty graph; yeah, or pretty, eye of the beholder, and so on. But yeah: the configuration of the axes and the graph lives in here.
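For reference, the general Prophet flow being walked through looks like this; Tamland's actual parameters and forecast horizon differ:

```python
import pandas as pd
from prophet import Prophet

def forecast_saturation(history: pd.DataFrame, horizon_days: int = 90) -> pd.DataFrame:
    """history uses Prophet's expected columns: 'ds' (timestamp) and 'y' (the ratio)."""
    m = Prophet(daily_seasonality=True, weekly_seasonality=True)
    m.fit(history)  # this is where the historical data goes in
    # Extend past the history at the same hourly resolution:
    future = m.make_future_dataframe(periods=horizon_days * 24, freq="h")
    # predict() returns yhat plus the yhat_lower/yhat_upper confidence bands:
    return m.predict(future)
```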
B: Yeah, that's fantastic. So there were a couple of things I wanted to... I guess I'm not sure if I want to chat about this, or brainstorm about it, or just kind of keep an eye out for ways to handle it gracefully.
B: When I manually look at the trending history, even outside of Tamland, just for any kind of time-oriented trending behavior, some common classes of gotchas that I've run into, and I'm sure we've all run into, are having a discrete event, like a change in system behavior where a workload change or an efficiency or inefficiency was introduced, or a defect in the measurement was corrected.
A: We can... well, Prophet supports it. So, I don't know the code by heart, but I've seen it in the documentation that you can do it in this, like here, like where we set the daily seasonality: we can add ranges that we need to ignore. So that's possible, like, ignore or do something else with; it's supported, but we don't have a way of doing it right now.
A: So when we have these things in Tamland and we see them as a human on the graphs, then we currently mark this in the issue, for however long we think: like, if it was a one-week thing, it might take a month or so before the prediction is over it, yeah.
B: I guess I was wondering if... yeah, that makes sense. So I was kind of thinking of this as a twofold topic. One is conveying it: when we do this kind of discovery work, it's nice to be able to pass that along to other humans, like we are right now; or, since we've got a rotation now, having a place to say, hey, for this particular, I guess they're not really alerts, but for this alert here, here's what I found, and this will probably continue to affect the projections.
A: We did database maintenance, or, like, we were close to saturation because this thing got deployed and it burned up quickly; we know, yeah.
A: ...and say: bad, ignore these three days. And then the next prediction will be... like, we'll see the dots of the actual data when we... yeah.
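Prophet's documented way of doing this is to blank out the bad window before fitting: rows whose y is set to None are ignored during the fit but still predicted over. A sketch, assuming the ds/y frame from before:

```python
import pandas as pd

def ignore_range(history: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Blank out a known-bad window (bad deploy, maintenance, measurement bug)."""
    history = history.copy()
    mask = (history["ds"] >= start) & (history["ds"] <= end)
    history.loc[mask, "y"] = None  # Prophet skips NaN rows when fitting
    return history

# e.g. drop three bad days before calling m.fit(history):
# history = ignore_range(history, "2022-09-12", "2022-09-15")
```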
B: So I guess ignoring a particular range lets us handle things like a discrete event that caused an abrupt, limited-duration spike or dip, and...
B: Yeah, exactly. What about cases where, for example, we made a capacity increase which dropped the percentage utilization, or we made an efficiency increase which also drops the percent utilization? Like we have with... like when we added memory to redis-cache, or when we performed tuning, or when we offloaded... you know, reduced the TTL for the runners cache. I'm just thinking of kind of recent events, and we do this in a bunch of different components.
B: You know, largely as incident response, but sometimes just as planned work as well. How do we... do you know of a way to kind of retrain the projections, like, where we can give it a hint that says: this was a discrete event that will have a lasting change?
A: No... we had that before, and we said we don't need this for redis-cache, because it's always at maxmemory and that's what maxmemory is for; and then we decided it's not what maxmemory is for, so we do need that metric, and then we introduced it. But this is the kind of event that Matt is talking about.
A: Like, we saw this coming, so we did the thing, and we had this drop; now be smarter about it, like, it's going to grow at the same rate, but yeah. And I was looking...
A: And the other one, the other Redis... like, if you look at, since it's single-core, single-threaded, I mean, then...
B: You scrolled past... I didn't catch what it was, but just when the screen was refreshing, there was one of the graphs that had an abrupt drop, and it had the confidence bands arc out.
A: Right now we deal with that humanly, like, humans say: no, no, we've got this. It often happens: we alert, we create the issue; when these blue bands hit the red one, then we create a capacity planning issue. Yeah, there's so much stuff that I still haven't shown, like, you don't know how these pages get generated, or, yeah, how these issues get created with capacity planning.
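The "blue bands hit the red line" check amounts to finding the first forecast timestamp where the upper confidence band reaches the saturation threshold; a sketch over Prophet's output columns (the issue-creation side is not shown):

```python
import pandas as pd

def first_threshold_hit(forecast: pd.DataFrame, threshold: float):
    """Return the first timestamp where the top of the blue confidence band
    (yhat_upper) reaches the red threshold line, or None if it never does."""
    hits = forecast.loc[forecast["yhat_upper"] >= threshold, "ds"]
    return hits.min() if not hits.empty else None
```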
C: I have a question that is probably a very simple yes-or-no one at this point in time. At this point in time, we're only focused on capacity planning in the sense of "something is going to get too big", correct? We're not actually looking in any way at "we are wasting resources on this and should use..."; yeah. Okay, no.
A: Yes. For some things, though, the... if... there...
A: Yeah, but if you look at... I can show you one, I think, if I look for trace chunks.
B: Yeah, like, I mean, that being a great example of a case where we have a bunch of CPUs on some of these Redis boxes, right, even though Redis can't use the CPUs. But explicitly, by having a bunch of CPUs, we also...
B: ...get a powerful machine, yeah, and we get prioritized network throughput, because that's also, stupidly, tied to the number of CPUs, I mean, from our perspective.
A: Like, from that perspective, yeah. What we're seeing here on my screen is the single-threaded, so single-core, CPU for redis-tracechunks, and I bet that's exactly what you were talking about, because we provisioned that machine to handle a lot of throughput, like network, yeah, yeah.
B: Exactly. So we can, like, using tracechunks as an example, we can do an assessment... I guess what I'm trying to work towards, this is a topic. So we have, kind of, just thinking of VMs as a collection of a handful of different resource types, right, and the way we're producing these VMs is generally in kind of standard blocks of those resources.
B: So if you want a machine of size X, then you're going to get this much CPU, and this much memory, and this much network, and this much disk. And we don't have to do it that way, but that's...
B: ...that's a common provisioning approach. And I guess what I'm kind of working up towards is: for a given workload, whatever workload means in the context of the service that we're talking about, there are generally going to be...
B: As human analysts, we can identify certain factors that are likely to influence... I guess what I'm getting at is: there's always going to be at least one resource that is the most critical bottleneck for a given workload, and...
B: Exactly, so, exactly. So from my perspective, what I'd ideally like, as a team, for us to be able to work towards, is building up a knowledge base of what the bounding resources are for given workloads and services, and what factors influence changes in that workload. Because changes in the workload drive changes in resource utilization and can shift the bottleneck to some other resource or component, and being aware of that is just solid gold in terms of being able to, you know, do both capacity planning and incident response, or cost reduction, exactly.
B: Yes, exactly. So that's like... you know, I don't know how much of that Tamland can help us with, but in terms of, as a team of human engineers, that's what I would ideally like us to be able to move towards, using whatever the right tools are for that kind of work.
B: But that's kind of my personal ambition for us to be able to accomplish as a team, I mean, but that's just one person's opinion. So I was really curious how you two felt about that, as kind of, you know, something to work towards, and also thoughts on how we can kind of begin to accumulate this kind of information in a structure.
A: How does somebody like, for example, Blake, go past a graph like this and say: well, the CPU is not utilized, why is that? Yes, reason about this.
B: Exactly. I kind of feel like having a set of kind of notes about bounding resources and known factors that influence utilization of those resources could be organized on kind of a per-service basis, and I had, at one time a couple of years ago, thought that we might use the runbooks for that, because we have sort of a similar need in terms of incident response to do service-oriented triage.
B: The runbooks are kind of a mess in terms of the organization, you know, no knocking; there have been several kind of attempts to tidy that up. So I guess what I'm kind of working up to is...
B: I don't really care where you put it, as long as it's accessible to folks and we have enough freedom to organize it. The more we do, the more we'll understand about what structure is useful for our purposes, for doing analysis and forecasting, and I think once we've kind of done a little bit more of this as a group, it'll be more clear...
B: ...what aspects of organizing this information are useful and reusable. In past work, prior to GitLab, we used a wiki for this; we're not big on wikis here, but yeah.
A: In some fashion... what would you like? Because now I'm thinking, like, yeah, starting small: like, everything in the runbooks, and saying, API, like, the web: everything there is mostly memory-bound, right, and we just mark that on the service.
B: Gitaly may be an interesting service because, depending on the workload, we can drive Gitaly to CPU and memory saturation either, you know, at the same time or separately; exactly, exactly, yeah, exactly, like representing a kind of cascading-failure scenario. I'm going to, just to have a concrete example to talk about: now that we've rolled out cgroups, there's a pattern where, I mean, this can happen without cgroups too...
B: ...but it's easier for it to happen, and it's easier to talk about, in the context of a cgroup. So you get one project that has, say... someone's got a fork of the Linux kernel, or a fork of GitLab.
B: Something that's got a lot of git objects in its history, and they do... I shouldn't talk about the specific mechanics of this particular kind of abuse, but say they run some gRPCs that are particularly memory-intensive and require a long-lasting traversal of the object history, and that drives up both CPU and memory utilization.
B: But in this context we're going to imagine that we run out of memory first. This anonymous memory usage depletes the file system cache in whatever scope we're working in; in this case, suppose it's a cgroup. So, just to put some numbers on it, and these numbers are smaller than what's in production: say we've got a 10-gigabyte budget for memory in the cgroup, and normally that's plenty.
B: Normally that's mostly file system cache pages, but when we run this particular workload, each time this gRPC gets called it gobbles up, say, four gigabytes of anonymous memory and holds it for a minute. So if you get two or three of these running concurrently, then you've just kicked out the entire page cache, and at that point any other action, whether it's these, you know, poisonous commands or not, has to do a lot more disk...
B: ...I/O. So at that point we've shifted the constraining resource from memory to disk, to block I/O, for the scope of the project that runs in that cgroup. I'm not sure how we capture that; as humans, we can write some prose that describes this pathology, but I don't know how we capture that in a forecasting framework. Except it does have some interesting properties, like that state transition, where we abruptly switch to having a large increase in block I/O. I guess this particular scenario I'm kind of talking about is more of an incident...
A: As well, we do... like, the way we match these saturation points to services is through tags. Okay, let me show... it's a bad example now, but, like: do we already have saturation metrics for the newly introduced cgroups, and do we need them?
B: For memory in particular, it's normal for the cgroups to have, you know, approximately 100% memory usage, but most of that memory should be file-backed cache pages, not anonymous memory. So that's differentiating between different types of memory usage; cAdvisor doesn't do a fantastic job of advertising the way in which memory is used.
B: So it's kind of like at that host level, where, you know, for Linux, unlike some OSes, Linux prefers to have a very small amount of actually totally unused free memory; it'll use most of its memory for page cache, and it'll give up those pages whenever processes need to allocate anonymous memory.
A: Right, but what I was driving at is that we have a mapping of which services depend on which of these resources that could get saturated, okay, and which applies here. So, for example, here's the CPU saturation one, which is using the metrics catalog to find everything that is provisioned on VMs, and that's based on...
A: So here we've got these tags for some code things, and then we've also got deployment.
A: I think that's also general: like, it's a stanza that's in this thing that says, this is deployed on VMs, this is deployed on Kubernetes.
A: Here... no, that's service dependencies.
B: Oh, okay, so you're thinking about... I think what you're getting at is, in the context of Kubernetes node pools, we may have some heterogeneous workloads where you've got some pods that are doing API work and some pods that are doing...
B: Got it, yes, okay, yeah. I was thinking you were working up towards contention between services that are sharing the resources of a single VM.
A: I think we're far away from that, but that's the dream, right? Like: this Redis thing is not doing quite so much, let's put some CPU-intensive Sidekiq jobs on it. Sounds brilliant, but it's very scary, yeah.
B: Yeah... yes, I'm remembering some bad outcomes from... yes, agreed. Okay.
B: Yeah, so, grouping: having some way to annotate, to, like, you know, Blake as another consumer of this data, to show that these resources are collectively provisioned, and among these five resources, this is the one that's the bounding resource and the other resources kind of come along for the ride, yeah.
B: Yeah. Sometimes it's hard, though; like, sometimes just seeing the graph, and the collection of related graphs, isn't enough. You need to see some more about the reasoning. Like, for example, I think the three of us just had an example of this...
B: ...a couple of days ago, where we were trying to reason about the sizing for the VMs for the registry rate-limiting pods to run on, and we kind of rehashed, in kind of rapid fashion, the series of discoveries from, like, the last couple of years, where we're like: okay, so we know that the copy-on-writes during RDB backups mean that we need to have up to double the maxmemory, and we also have the, you know...
B: More recently, we learned about the Redis replication buffer, times the number of replicas, being a factor for certain workloads; redis-cache in particular is big enough that that was something we had to pay attention to, whereas for the other Redis clusters we didn't have to. So I think having this kind of... I feel a little bad about this, but this makes me feel like we need a place to put some written prose to describe the interactions between these resources and kind of the reasoning behind...
B: ...like, per service: concise descriptions of the reasoning behind the sizing decisions for a given set of resources, whether that set of resources is, like, all of the metrics representing resources for a given VM or type of VM, or something more abstract. Most of the time when we're talking about resources, we're talking about machine resources: CPU, memory, network, disk.
B: Yeah, exactly. So that also, I think, benefits from having a little bit of context around some sizing and scaling choices. Like, I guess I started off thinking about... sorry, I'm not leading up to anything, I'm just kind of talking through it: there are a lot of people that are contributing to decisions about how we're making, you know, tuning and optimization trade-off choices.
B: For the Tamland trending, we're going to kind of rediscover, I think, some of those decisions being made, and having a way, or a place, to kind of, you know... and maybe this starts off as just the Tamland issues themselves. This...
C: ...I've got to drop off and learn about being on call, but I'll watch the last two minutes of this if you keep talking.
A: It's going to be great... like, I've not left you in a pretty good place, but let's prepare next week somewhere, okay? Yeah, on doing a run-through: I'll do a run-through and make sure that the book is ready. But I broke things the past two weeks, so, okay, like, yeah.
A: That's the part that I haven't shown, and I actually wanted to talk with Sean about it: the bits and pieces with pipelines on ops and pipelines on GitLab, and the pages site on GitLab, and, yes, the two different projects, capacity-planning and Tamland. And then the scalability project is adding images to the Tamland issues. So it's a little bit of a... yeah, right.
B: Yes, awesome, yeah. I would love to get a walkthrough. I just barely started to look through some of the issues, and it was kind of clear that there was some context missing, so I thought pairing on it for the first round or so would be super helpful.