From YouTube: Scalability Demo Call - 2021-06-10
Description
No description was provided for this meeting.
A
Bob, I was wondering if you wanted to show what we've just been talking about: the panels that have just been merged into the dashboard, and then the piece of work with the operation count that hasn't yet been merged.

B
Yeah, I was just reviewing that merge request.

A
Hey, Bob's just getting ready to present.

A
Well, the reason that I'm excited about what Bob is going to show is that there was a conversation yesterday where the EMs were talking about being able to see into what was causing problems, and this gives them the ability to see it. It would just be so nice to close the circle on what they're asking for by saying: well, here it is, you can see it here.
B
So, am I sharing my screen? Yeah, I think so. What we've added is this row, which has a bunch of text explaining what is being shown. This is the most important part: it shows the breakdown of the error budget and where it's being spent. This is supposed to resemble the number of failures.

B
So it's the total failures that have been tracked over the 28 days that are used for the budget. Fixing the highest one is what would help. And then, because I couldn't figure out how to add links here, we've added the links here. So if you see that the Apdex for Puma is the number one source of the budget spent, then you can click open this link, which will result in a very slow table, and this version isn't deployed yet.

B
It's including Sean's improvements, because I like it better. I'm just going to ask Sean to sort it by this, and it's going to show you the endpoints, how many total hits you've had, and how many of those were slower than the threshold. That should allow the people, the engineers and so on, to dig into where the budget is being spent.

B
They start with the highest number here and then look at what endpoints, so Sidekiq jobs or requests, are contributing to that high number. For now we've only built these fancy links into Kibana for Puma and Sidekiq, because most people are spending most of their budget there. Yeah, that's it.
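A minimal sketch of the 28-day error budget arithmetic being described, in Python purely for illustration; the SLO target and the example counts are assumptions, not figures from the call.

```python
# Sketch of the 28-day error budget breakdown shown on the dashboard.
# The 99.95% target and the example numbers are illustrative assumptions.
SLO_TARGET = 0.9995          # assumed availability / Apdex target
WINDOW_DAYS = 28             # budget window used on the dashboard

def budget_fraction_spent(total_operations: int, bad_operations: int) -> float:
    """Fraction of the error budget consumed by one source (e.g. Puma Apdex)."""
    allowed_bad = (1 - SLO_TARGET) * total_operations   # the budget, in operations
    return bad_operations / allowed_bad

# e.g. 200M requests in 28 days, 40k of them failed or slower than the threshold:
print(budget_fraction_spent(200_000_000, 40_000))  # 0.4 -> 40% of the budget spent
```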
C
When I see what it's for, I think the whole discussion is moot, because this is just for ranking and comparing things, and nobody cares what this number actually means.

B
For the query that results in this, here's the same kind of sum over time, and we have it in the longer range ratios as well.

C
I saw that you gave more concrete examples, which is great, but I only saw it like ten minutes before the call, and I started thinking about it, thinking: oh yeah, this is difficult. But now one of the three concrete examples you just showed on the screen, and I thought, okay, we can.

C
From my point of view, we can stop talking about that particular one, because if it's just a number to direct people, to point people in the right direction, then it doesn't really matter.

C
...harder about what that means, but I don't want to hijack this conversation. I just saw that one thing and I wanted to check that that was the thing we were also talking about on the other issue.

C
Yeah, because this is a ranking; this is really only about ranking and not about what the number means.
A
Also, while we're on that, I just wanted to say thank you to Andrew as well for helping to prep and to get to the point where we're able to do this, because it feels like there's a whole bunch of work that's happened underneath this. Being able to take the result, which is the error budgets, and show it to the PMs and the EMs is just excellent, so I just wanted to say thank you for that.

B
The thing is, it took me a while to understand where we were going with all this stuff. We started adding the feature category, that's the first thing we did, and that was pushing us in this obvious direction. That was by now a year and a half ago, and at that point I didn't yet understand what it was going to be useful for.
E
Yeah, I think what we said, when we were thinking about what the Scalability team was going to do, was: we've got a monolith, and whether we like it or not, the monolith is going to stay, so we need to figure out a way to have attribution. In other big companies, what they do is break it down into services, and then it's really easy to blame a team: this service is this team's.

E
No, no, it's really easy to do attribution; one of the things about microservices, or any sort of service architecture, is that attribution becomes easier. So the question is, how do we do this inside the monolith? And I think we're setting up some really cool things here. I don't know if you've seen that the database team are now going through a similar exercise with the tables, and actually, on that:

E
I think it would be really good for someone from the Scalability team to get involved, because they're still talking about team attribution, and I think they should be talking about feature category attribution of the tables, and, where possible, using the attribution framework that we've got on those entities, or something similar. I don't know how many tables we can do it on, but for, you know, Active...

A
...Record models, yeah, exactly. Let's feed that issue back into the doc, the one about the database table attribution.
C
It would be a shame if they reinvent some of this attribution stuff, yeah, if it's really trying to do the same thing.

B
Well, as I've seen now with something that Huang Ming was building to add the feature categories into Sentry, but we're on a lower version of Sentry that doesn't allow filtering for multiple categories at once, it might be handy to have the thing that we now have in metrics, the mapping that maps a feature category to a group, inside something in the application as well. Yeah, but...

E
Yeah, no more Sentry errors for you today, sorry, you've spent your Sentry budget. Cool, do you want to move on? I thought I...
E
Yeah, I was madly trying to find one of these series while we were going through the last thing, and I haven't been able to find one, but I'm still going to try to explain it. This might be a very poor demo, so, kind of the type that we are.

E
So what it was: Craig and I were talking this morning, and Craig had this issue. Actually, maybe I've found one here. Craig pointed out this issue where he was looking at the rate of certain Sidekiq jobs, and he knows that they get called very, very infrequently, so once an hour or once every six hours.

E
But if you did a rate on them, a rate on the job, it always came back as 0.0; not zero point something-something and then a little number, but flat zero. And that doesn't make any sense, because we know that these cron jobs run once every now and again. And so Craig...

E
One of the things that Craig's looking at doing is putting better monitoring on these low frequency but high importance jobs, like the stuck CI jobs worker, which hasn't been running for months, and because it's below the threshold that we allow for our monitoring, we basically ignore it. But it's critical, so we can't do that. So we were looking at it and we discovered a really interesting thing, and I thought it would be good to bring it up in the demo, but of course I can't find one of these jobs.

E
No, yeah, I tried to. Let's just see if we can... sidekiq jobs... buckets...

E
I see zeros, yeah. Here we go, so yeah, this is a perfect example. It's actually really interesting that it changes through the day as well; that's part of the mystery. So here we have... and we'll just pretend we were back this morning, like when I was talking to Craig, because that gives us a better result.

E
Wow, okay. So this job, this CI drop pipeline worker, it ran on... let's try zooming in on this period a bit. Until... what was that, the ninth?

E
Right, so we know that this job ran at 19:46, okay? And if we go and look at a rate on that job, right...
E
So when this process, the Sidekiq process, starts, it doesn't initialize that Sidekiq metric to zero. That particular series, for the CI drop pipeline worker in this case, just didn't exist, right? And the moment that it comes into existence is the moment at which we set it to one. So the rate over there is just the derivative on that number, and it never went from zero to one.

E
It just appeared into existence at one. And so now that we have containers that are starting up much more frequently than before, very often they go from absent to one, and they never increase.

E
We see this happening all the time with these low frequency jobs, so basically we always lose the first one. For a lot of jobs that doesn't really matter too much, because they're high frequency, you know, ten times a second or whatever; if you lose one, it doesn't matter. But for the low frequency jobs that don't happen very often, it makes a big difference. And so I said, well, maybe what we can do...
E
My first proposal, which is a horrible proposal but easy to do, was: when the job starts, we initialize it to zero. But lots of these jobs will run for less than 15 seconds, so we'll set it to zero, we'll set it to one, and then the scrape will happen, and it'll just have the same effect. So we spoke about...
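A minimal sketch of the pre-initialization idea being discussed, using the Python prometheus_client purely for illustration; GitLab's actual exporter code is Ruby, and the metric and worker names below are assumptions.

```python
# Illustration only: creating the labelled child at process start makes the
# series exist at 0, so the first real increment is a visible 0 -> 1 step that
# rate()/increase() can see, instead of a series that first appears already at 1.
from prometheus_client import Counter, start_http_server

sidekiq_jobs_total = Counter(
    'sidekiq_jobs_completed_total',   # assumed metric name
    'Completed Sidekiq jobs',
    ['worker'],
)

KNOWN_WORKERS = ['CiDropPipelineWorker', 'StuckCiJobsWorker']   # assumed list

def preinitialize_series() -> None:
    # Touching each labelled child registers it with an initial value of 0.
    for worker in KNOWN_WORKERS:
        sidekiq_jobs_total.labels(worker=worker)

if __name__ == '__main__':
    preinitialize_series()
    start_http_server(8000)   # expose /metrics for the Prometheus scrape
    # ...later, each completed job does:
    # sidekiq_jobs_total.labels(worker='CiDropPipelineWorker').inc()
```

The trade-off raised later in the call still applies: pre-creating a series for every worker label multiplies the number of series, which is why the conversation turns to estimating how much extra cardinality this would add.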
H
If you add one inside the rate... I would guess that null plus one is still null, but null plus zero is one.

E
No, I think... I'm pretty sure that the way it happens in Prometheus is that, whatever you do, it doesn't see the step up. The one place where Prometheus is different is if you have a reset on a counter, right? Then it will say: okay, it was five, and then the next time we scraped it, it was three, and then it will kind of find the equidistant middle point on the right.

E
And that's often why, if you have something that very clearly is integers, you know, whole numbers, and you do a rate on it, you'll get, like, an increase...

B
Well, yeah, for all of the jobs I could possibly run, that might be a good idea, because I had a similar problem once. It was for low frequency endpoints, so HTTP requests, and then I think it was Sean and me who said: okay, we'll just initialize all these metrics. And then we had a cardinality explosion of routes that are never going to be used on GitLab.com. So then Ben was angry at us and we did the middle ground.

C
My take is to think of this as a short running process that can't get scraped, and then the way to get metrics would be a push gateway.
E
Unfortunately not, because Pushgateway doesn't work very well as an aggregator. Pushgateway doesn't have the concept of updating state; it's not like StatsD, where you say "increment this counter". You say "this metric is one", and then the next time you run, you don't know what the old metric was. It's not like Redis.

E
If you want to send it an update or increments or anything like that, you can't; you only give it the value, and Pushgateway guards against that sort of thing very much. It's the opposite of that. So I don't think that there's an easy way to do it with Pushgateway.
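A small sketch of the replace-rather-than-accumulate behaviour being described, using the Python prometheus_client purely for illustration; the gateway address, job name and metric name are assumptions.

```python
# Each push replaces the previously stored value for this grouping key; the
# gateway never adds to it, which is why it can't act as an aggregator here.
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
runs_total = Counter('cron_job_runs_total', 'Runs of this cron job',
                     registry=registry)

runs_total.inc()   # a short-lived process only knows about its own single run

# The gateway now exposes cron_job_runs_total = 1. The next run pushes 1 again,
# overwriting the stored value, so a rate() over the scraped series still never
# sees an increase.
push_to_gateway('pushgateway.example.com:9091',
                job='stuck_ci_jobs_worker',   # assumed grouping key
                registry=registry)
```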
E
You know, because we've seen what can happen when these jobs are failing all the time, and it's a bad thing. So if we can get it down to only the jobs that run on a certain fleet, it might be okay, because eventually it'll get up to that number.

C
Another way of looking at it is that we are updating counters spread across many different series where they should be one series. So if we had a single counter that we were updating in, say, Redis, then you don't...

E
So I think the easiest solution will be to figure out how many more series it's going to be, and then, if it's a lot, it's time to break prometheus-app down into prometheus-app and prometheus-sidekiq.

E
That's the boring solution, and if it's not that many, we can just keep them all in prometheus-app. I suspect that it's much fewer series than we have in prometheus-db, which has got all the pg_stat_statements combinations, which is a lot of data.
C
If we're only updating these latencies, then I guess that would be the number of buckets times the number of workers.
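A back-of-the-envelope version of that estimate in Python; every number here is an illustrative assumption, not a figure from the call. A pre-initialized histogram contributes one series per bucket plus +Inf, _sum and _count, per worker label, per scraped process.

```python
# Rough series-count estimate for pre-initialising per-worker latency
# histograms. All of these numbers are illustrative assumptions.
workers = 500      # assumed number of Sidekiq worker classes
buckets = 12       # assumed explicit histogram buckets
pods = 60          # assumed number of Sidekiq processes being scraped

series_per_worker = (buckets + 1) + 2        # buckets, +Inf, _sum, _count
total_new_series = workers * series_per_worker * pods

print(total_new_series)   # 500 * 15 * 60 = 450,000 extra series in the worst case
```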
C
But then you have an increase because of the number of pods, but pods go out of existence, and then Prometheus should garbage collect them, or how does that work towards...

E
...cardinality? Yeah, well, it's fine, kind of, in the now, but where it gets really heavy, where it gets really painful, is when you do a rate over, like, six hours or something like that: it's still got to go to all of these different buckets. I think the simple solution is just to pre-initialize.

E
If we can... yeah, that keeps things simple, but we've just got to figure out the count. That's...
C
...properly, yeah. We also have some very low frequency Gitaly RPCs, but there we've used...
E
...don't have that particular problem, yeah. So it's kind of an interesting edge case, but in order for the stuff that Craig's looking at to really be done properly, we will probably have to consider doing that. I also think, you know, we've had three Prometheuses for a long time, so if we have to break that into four, it's not the end of the world.

E
It's maybe six seconds, because for a lot of them the work will probably be quite divisible, so you're not creating a lot more work by running it more frequently, if you know what I mean. But that will require engagement with teams, and it's kind of a one-by-one sort of thing, so we don't really want to do that. So, what we spoke about, it might even be in this query, let's take a look...
E
That's it exactly. But going back to your original question, we had quite a long discussion around this and we both got very excited about it, and it's a very nerdy discussion. So this is what we've got at the moment, and this is fundamentally wrong. Well, it's not fundamental; it's like the last level of maturity, and now we're getting to the next level of maturity.

E
So we have this as our one hour and our five minute rates, and then we have this as our six hour and our 30 minute rates. Okay. The first thing that we really need to do is break that into two alerts, because then you can start doing really nice things. Like, if you had a really bad morning and we had a whole bunch of stuff, and we know that we've spent our six-hour budget, you can silence the six-hour alert and still get the alert for the one hour.

E
We're just going to say that the minimum alerting threshold is 10 samples, and the way that we can do that is we take this clause over here and we move it into these two things, like this. Live coding, my favourite thing... not. And then we do that, and then this is obviously a second alert.

E
But here we say: that times 3,600. That'll give us effectively the number of samples, because the rate is per second and there are 3,600 seconds in an hour, and then we say that that needs to be greater than 10, or whatever.
E
...or whatever we choose as the magic minimum number of samples that we need in order to evaluate the service. And then we say pretty much the same thing over here, except here we've got it on the six hour rate, and we say that times six times 3,600 has to be more than ten samples. And then the last thing that we do is we set up the three day tier, you know, the third tier, which we don't have, and it's also something that we should really fix, and that will also say that over a three day period you have to have ten samples as well. And so then we get away from having a minimum rate.
E
You know, over six hours at a minimum rate of 0.1, that's thousands of... I don't know. Well, it's a thousand-something samples, which is actually very, very high. So we can break that down and still monitor the low frequency jobs.
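A minimal sketch of that minimum-sample gate in Python; the threshold of 10 and the window lengths come from the discussion, while the function name and the example rates are assumptions.

```python
# Convert a per-second rate into an approximate sample (operation) count for a
# given alerting window, and only evaluate the SLO when there is enough data.
MIN_SAMPLES = 10

def enough_samples(rate_per_second: float, window_seconds: int) -> bool:
    return rate_per_second * window_seconds >= MIN_SAMPLES

print(enough_samples(0.003, 3600))      # 1h window: ~10.8 samples -> True
print(enough_samples(0.0005, 3600))     # 1h window: ~1.8 samples  -> False
print(enough_samples(0.1, 3600))        # the old 0.1/s minimum rate: 360 samples over 1h -> True
```

This is the same arithmetic as the "times 3,600, greater than 10" clause being live-coded in the alert expression, just written out as plain Python.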
E
Yeah, and it's the same. So when I say "sample" I'm thinking of it from a sort of statistical point of view, but yes, each operation is one sample that we're including, and the reason we had that ops filter was just to filter out low sample rates, you know, where you get three things and it's not enough data to really build something up here. So let's just take this and see if it works.

E
Yeah, but basically then, over a longer period, we can start doing monitoring of the low frequency jobs. The other thing that's worth pointing out is that we'll never have the three day, you know, the long period monitoring, go to the SRE on call.

E
We'll have that go straight to an issue tracker, and we can use the feature category routing that we've got for that. So then, instead of us getting an alert about stuck CI jobs not working, it goes straight to... this is kind of the future future, this is a few steps ahead, right, but there's no reason why the SRE on call needs to deal with that. It should just go straight to the team that's responsible for that job.
E
Okay, cool, yeah. It's pretty easy to do, and I've been meaning to do it for a long time as well, but now this seems like a good time. Oh, there's one not even complicated, but slightly complicated, thing that we have to take into account, and that is that often, when things go really pear shaped, the long and the short windows both instantly drop. And so one of the things that we don't have in our Alertmanager config at the moment is we don't use...

E
I even forgot the name of it. Alertmanager's got a thing where you can silence one alert based on the existence of another alert, and so what we don't want to do is send PagerDuty pages to the SRE on call saying: your six hour and your one hour are both violating now, here's two pages.

E
We only want one. What are they called... suppression rules. We can set up a suppression rule in Alertmanager to say: if the one hour is firing, forget about the six hour, we don't care about that, we just tell the person. The only thing that I have a slight concern about is that if you get those suppression rules wrong, you could do really horrible things by accident, and so that's why I've always been a little bit cautious about when I'm going to roll those out. But yeah.
H
Personally, I would just accept that we're going to get double paged as an initial step. Speaking as an on-call, my pager often blows up with multiple alerts at the same time, so it's not pleasant by any means, but it's tolerable.

A
Was there anything else that anyone would like to demo or to show?