From YouTube: Scalability Team Demo - 2021-06-17
Description
No description was provided for this meeting.
A
And could you be the first one this morning?
B
Yes, I put a very nerdy topic on the agenda.
B
There was a discussion in our Slack channel last week where, in the context of error budgets, Bob pasted a PromQL formula and asked how to describe what it is calculating. He was probably trying to answer a question from somebody who wanted to understand error budgets, and I didn't understand it at all, and I started to get worried that we were doing something very fishy in our calculations. Then Bob created an issue, and then I thought: okay.
B
C
B
C
B
Well, Andrew, you're probably the one person who is most comfortable with this calculation, because I think you worked on it.
B
No, probably not, yeah. So the topic I put on the agenda is Riemann sums, and that's very, very nerdy, because Riemann is, I don't know, a 19th century German mathematician.
B
But the nice thing is that if you google "Riemann sum", then you get a Wikipedia page with nice pictures that help to understand what we're doing. So I think I have an excuse for being so nerdy and calling it the Riemann sum, because it guides you to a picture. But first I want to show what the question was about. So the formula was a sum over time of the one-hour failure rate over 28 days, this formula. Well, actually this one, in bigger letters.
B
So what does that mean? And the answer is a bit complicated. The answer is: it is the number of failures in the past 28 days, divided by 60, assuming that our recording rules create a rate metric every 60 seconds. If it was every five minutes, that would be 300 seconds, and it would be divided by 300.
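A sketch of that interpretation (assuming evenly spaced 60-second recording-rule samples; r(t_i) denotes the stored one-hour failure-rate values):

$$
\text{failures}_{28\mathrm{d}} \;\approx\; \sum_i r(t_i)\cdot 60\,\mathrm{s}
\quad\Longrightarrow\quad
\mathrm{sum\_over\_time}\big(\text{failure rate}\,[28\mathrm{d}]\big) \;=\; \sum_i r(t_i) \;\approx\; \frac{\text{failures}_{28\mathrm{d}}}{60}.
$$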
B
B
We want to see an increase over a long range, and the data we have are rates, and a rate is instantaneous, like how much things were increasing at a point in time, but it doesn't tell you immediately how much increase you had over a window of time. And what you need to do, if you want to go from a rate to the actual increase, is to take the area under the graph.
B
I don't know if this makes sense, but that's what integrals are in mathematics, just again very nerdy: an integral is taking a graph and computing the area under the graph.
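As a worked equation (a standard identity, not specific to our setup): the increase over a window [0, T] is the integral of the rate, and a Riemann sum over evenly spaced samples approximates it:

$$
\text{increase}(0,T) \;=\; \int_0^T r(t)\,dt \;\approx\; \sum_i r(t_i)\,\Delta t .
$$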
B
Now, what we're doing in our formula is: so this line would be the failure rate over one hour, and these dots are where we have values, and what I'm assuming is that the time between these dots is always the same. It's always one minute!
B
So if these are our failure rates, it would be the rate here times 60, for 60 seconds, and the rate here times 60, and the rate here times 60. Well, they do it on the other side, but let's ignore that for a moment. And you constantly do rate times 60 plus rate times 60 plus rate times 60, plus...
B
Does that make sense? Now, if you think of multiplication and addition, if you have a lot of pluses and every time you do times 60, then you can also first sum up the things and multiply by 60 once at the end; that is called distributivity.
B
So because we're assuming that all these things are evenly spaced, it's always times 60. So you can also say it's 60 times the sum of the rates, and that is then the area under the graph. So that's one part. And then why don't we write 60 times the sum over time of the failure rate? Because we take these increases, but we always divide by another increase. So the 60 is both above and below the dividing line; it's in the numerator and in the denominator, so it disappears.
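A minimal numeric sketch of that argument in Python, with hypothetical values and assuming evenly spaced 60-second samples:

```python
# Hypothetical per-second rate samples, one per minute, evenly spaced.
error_rates = [0.0, 0.25, 0.5, 0.0, 0.25]   # failures per second
total_rates = [4.0, 5.0, 4.5, 5.0, 4.75]    # requests per second
STEP = 60  # seconds between recording-rule samples

# Riemann sum: increase over the window = sum of (rate * step).
error_increase = sum(r * STEP for r in error_rates)
total_increase = sum(r * STEP for r in total_rates)

# Distributivity: the same thing as STEP * sum(rates).
assert error_increase == STEP * sum(error_rates)

# In the error-budget ratio the STEP factor cancels, which is why the
# recorded formula can drop the "* 60" entirely.
print(error_increase / total_increase)       # with the 60s
print(sum(error_rates) / sum(total_rates))   # without it: same value
```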
B
C
Can I just... this is so awesome, because this was all in my head when I was doing it and I didn't express it very well. But just on that point, yeah, one thing that made me feel confident about cancelling out the 60 on the top and the bottom, because that's how I thought about it, I was like, 60 on the top and 60 on the bottom, was that both of those configurations are in the same recording rule group, so they always run paired together.
C
B
And at the same time, so yeah. And I think, so one of the reasons, so the other thing I was curious about was: my first thought when I saw this expression was, why do we have a thing here that depends on the recording rates, the rate at which we generate data points? Like, why should the calculation depend on it?
C
It was, there was a time where I had a second expression, which was the number of samples, or the number of observations, that we had over that period, on the top and the bottom. But then I realized that that kind of messes things up. You know, we didn't need it. Firstly, we could cancel it out.
C
It made the query slower, and it just worked more elegantly if we just cancel it out on the top. So you can look at it, you can say over a 28 day period we saw, you know, whatever, 14,000, or however many samples, and therefore we can kind of infer from that that we're doing this once every 60 seconds; you know, we can go back from the number of samples.
C
It just worked out that we didn't need to do that. The other thing that's slightly different from a Riemann sum, as I pronounce it, is that there is overlap. So in that Wikipedia page that you showed, each of the bars is taking up a... so, do you want to share your screen that you had open a second ago? But, and yeah, so those bars are each kind of distinctly taking up a piece of the x-axis.
C
But actually, if you just go back to the ratio calculation that was up earlier, you'll see here that we are using a one hour rate over a... yes, yeah. And I wasn't sure if that was right, and I actually think I said in a call: luckily, we have someone who's very good at maths here in a demo call. But because I was...
C
Yeah, it was actually... but the thing about it is that, through experimentation... because I didn't go through the whole mathematical approach to proving it was correct, but I did lots of experimentation on it, on lots of different series, and it seems to work. This is also what we do with the upscaling on the Sidekiq metrics.
B
Yeah, and I actually didn't make progress on understanding this formula until I started ignoring the one hour. Because I think the one hour is important, because, well, the way I think of the one hour, and this is not really okay, mathematical me doesn't like it because it's too hand wavy, but for everybody else it's probably fine.
B
B
They would be much more jumpy, but we make the assumption implicitly that the rate stays more or less the same for 60 seconds, and if your rate is very jumpy, and yeah. So it's like, if something very local happened that was off and you get an incorrect rate, and then you multiply by 60 seconds, then you're multiplying that inaccuracy.
B
If you're lucky, it usually works out, but that's not the world we're in. And so, intuitively, I think these rates need to be smooth, and that's why one hour makes sense. But apart from that, they need to be smooth; I don't think they really change what the expression means, like, you can just think these are the rates.
D
B
Yeah, so you need to get points on the graph somehow, so that's... and you need some window, and one hour is a valid choice that, through experimentation, Andrew discovered works well. But if you want to try and work out what the one hour means for the approximation of the increase, then I think the math would get very complicated, and I just don't want to do it.
C
Yeah, and I was also worried about the jitter, because the evaluations are not happening exactly, you know, they're not back-to-back 60 seconds, so there's always going to be sort of... because Prometheus is kind of going in this loop, and it's got a schedule, and every 60 seconds-ish it will run that recording rule, but it's not exactly 60 seconds, right, depending on what the server is doing. And so there was also kind of some concern around that. There's...
C
Another reason why I think that this is actually quite an important discussion to have, and that is around the three-day rates, because I think we want the three-day rates, it's for...
C
Yeah, and especially for the low frequency Sidekiq jobs. And so we want the three day rates. We can't really practically evaluate the Sidekiq pods over three days, because we'll just melt Thanos and Sidekiq down into... yeah.
B
That's the other part of the story: if you want to know the increase and you had all the Prometheus data, you could just write the Prometheus query that takes 28 days worth of counter values and calculates an increase for you, but that melts Prometheus, and that's why we have these rates, which are lower in number, yeah.
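A rough back-of-envelope for why the raw query is so much heavier (illustrative figures only, assuming a 15-second scrape interval and per-pod raw series that the recording rule aggregates away):

$$
\underbrace{\tfrac{28\cdot 86400}{15}}_{\approx 1.6\times10^{5}\ \text{raw samples per series}} \times\, N_{\text{pods}}
\quad\text{vs.}\quad
\underbrace{\tfrac{28\cdot 86400}{60}}_{\approx 4\times10^{4}} \times\, 1\ \text{pre-aggregated series}.
$$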
C
Yeah, and actually in the days that we were on VMs we could pretty much get away with it. It would be a slow query, but now, with the number of pods, and the pods starting and stopping and living for just two minutes, we can't really; there's just too many series now, and it's actually got a lot worse. But so, with the three day one...
C
I think what we're going to do is, for everything, so not just for Sidekiq, but for the web and for API: all the three day ones will be constructed from lower sample rates, and then kind of... I think, in the code base, we call it upscaling. But this is where having you looking at this is super fortuitous right now, because what I found when I did it with Sidekiq is: it worked, you know, within...
C
I remember I put a graph on and I think Bob reviewed it, and maybe Craig Furman, and you could kind of see that it kind of tracked the real data. It wasn't perfect, but it was kind of a rounded version of the real data. But then for some, if I remember correctly, for some things, like for API or something, there were certain conditions where it didn't work very well at all, and I think what it was, again, this is kind of scratching at the back of my memory.
C
No, I think what it was was that when the failure rate... you know, for certain Sidekiq things, when there's no error, there's no observation. So, you know, if you don't have any 500s, then that series is absent; it's not zero. And for certain things that was a real big problem, and that's one of the reasons why I changed the recording rule to always give us a zero, and so I'll need to reevaluate that now and see if that's been fixed.
B
Right, I was actually wondering about this too, and it looked to me like, if you assume that you have a constant trickle of data points with one minute in between, and if some are missing and you take the sum, then I think you're effectively treating it as if the rate is zero. Because if the rate was zero, then they also contribute nothing to the sum, and if they're missing, they contribute nothing to the sum, yeah, so it's indistinguishable from rate zero, I would expect. There was, there...
C
...was, like, I was looking at the... because where this conversation really started off was with Ben, and Ben looking at the Prometheus servers melting down running these six-hour queries and saying, you know, why can't we just use average over time, and me saying, no, average over time really doesn't work, like, look, here's the observed data and here's what happens with average over time, and this doesn't match up. And then kind of figuring out the theory from the practice.
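A minimal Python sketch of one way the two estimates can diverge, using hypothetical numbers and assuming the error-rate series simply has no sample in minutes with zero errors (the absent-series issue mentioned earlier):

```python
# One-hour window, one sample expected per minute; minutes with zero errors
# produce no sample at all for the error-rate series (hypothetical values).
present_error_rates = [0.25, 0.5, 0.25]  # only 3 of the 60 expected samples exist
EXPECTED_SAMPLES = 60
STEP = 60  # seconds

# sum_over_time-style estimate: missing samples contribute 0,
# which matches "no errors happened in that minute".
increase_from_sum = sum(present_error_rates) * STEP                # ~60 failures

# avg_over_time-style estimate: averages only the samples that exist and
# extrapolates over the whole window, so gaps bias it high.
avg_rate = sum(present_error_rates) / len(present_error_rates)
increase_from_avg = avg_rate * EXPECTED_SAMPLES * STEP             # ~1200 failures

print(increase_from_sum, increase_from_avg)
```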
C
But then, with this approach, you know, I looked at it and I said, this is a better approach for kind of estimating what the six hour rate is, and for Sidekiq, where at the time it worked really well, but not for everything. And that's why in the metrics catalog we have a switch that says, I think it's called "upscale longer rate queries" or something like that.
C
It's upscale-longer-something-something, and it's only enabled for Sidekiq and for Postgres, because those are the two that we have. You know, for Postgres we're using all the Rails metrics and then putting them together, and I still need to figure out why we can't just apply that generally, because for some of them it was out by a few percent.
C
Yeah, because if we want to do the three-day one... because I don't think there's any metrics where we can just do three days using the raw data, it's just going to be too expensive, and so we have to fix this for the general case in order to use it for the three-day one. So it's something to think about.
B
Yeah, maybe we should just sit down and then you talk me through it, yeah, I'll...
B
C
B
Which is great, of course, yeah.
B
If I could try to summarize it: summing rates is okay as long as you know that the recording interval is a constant time in between, more or less constant, it's never going to be perfectly constant. And also, if you're dividing, if both the numerator and denominator all use the same recording rates, then it works out, because then the times 60 disappears everywhere.
C
B
Yeah, I also wanted to share this because I think all of us are sort of ambassadors for error budgets and these metrics. So I don't expect any... I hope nobody will mention Riemann sums outside of this call, because it would probably alienate the next person, but I wanted to share this to have a more intuitive understanding of what these numbers are, to somehow boost that for everyone. So thanks.
B
B
C
B
I also remember that you were talking about figuring out how to get the calculation right when you're mixing errors over time, like error increase and latency miss increase, and you have these different kinds of data, and you basically need to do a weighted sum. You can do that with...
B
Or you don't know the weights, and the moment you work with increases, you effectively get the weighted sum, but you don't have to figure out what the number of samples was, because who cares what the number of samples was? If you can calculate them, you...
C
...know that they're paired, yeah, yeah. But it actually may be something we should document somewhere, because there's probably some point in the future where people say, oh well, maybe we don't need these two things to be evaluated together, and...
B
That was my starting point: Bob showed me this formula, and I thought, why on earth are we not doing average over time, and do we have a formula that makes assumptions about a Prometheus config or a Thanos config? Like, yeah, can we do it without those assumptions? And so, no, we need the assumptions, but yeah, we should document them, yeah.
D
D
Which ones were there? Sorry, not prepared, yeah. Initializing everything for every Sidekiq job that could run on a cluster, rather than waiting until they run, because of these low rate Sidekiq jobs. And basically it goes from about 10,000 metrics at the moment to about 40,000, which is not really good, but it's also not out of line with the others.
C
Is it all around... it's not... there's some fairly well publicized documentation on sizing, you know, the number of series in a Prometheus; it's in the Prometheus docs. But, you know, the kind of simplest approach would be to look... I suspect that prometheus-db is probably under more duress, and so to look at where we are, you know, just relatively, look at the default, db and app instances, and see if it's...
D
C
D
D
VMs... how many Sidekiqs are still on VMs? Still seven of them, until we get GKE for that shard, just the last of the catch-all, which we desperately wanted to turn down. So we could ignore those actually, and they're not the problem, because there's only seven of them, so you're right, it's GKE, isn't it? Is there a metric for the number of metrics, even?
C
Yes, it's actually under... if you go to Prometheus GKE, I'll just share my screen quickly.
C
So, Prometheus GKE, and here we have, big surprise, almost all of our top 10 are Sidekiq things. So they have, under the status thing over here, they have a bunch of things: top 10 label names with high memory usage... path... that's not what I...
D
C
The number of series: what's that, two million? Okay, so it's a lot, it's actually a lot higher. And then just one more, just for a bit of... we could probably throw...
C
It's at five million, GKE's at five, and that's almost certainly because of the pod churn, you know, because of all the pods creating new series when they go in and out of existence. So, wow, okay, so my guesstimates were totally off. So prometheus-db is only at half a million, so it's cruising; prometheus-app is at 2 million.
B
C
B
Yeah, but just because they could have happened doesn't mean that we know they're not happening, because otherwise we wouldn't have these missing metrics. And it also doesn't mean that if we start pre-populating, this doesn't tip over the Prometheus server. But like you said, thirty thousand on five million, it will be very odd if that pushes it over the edge. Do we have saturation to tell us if that one with five million is in trouble?
C
No, we should, and I'll just quickly do it now. I looked at the default as well, and that's at 16 million... 1.6... oh, one point... my eyes... thank you. I think so, yeah, yeah.
C
C
B
Where do we feel the pain? Is it in Prometheus itself, or when we try to run queries across a lot of them?
C
I think there's memory, because it's got to keep certain things in memory as well, you know, the indexes to where other things are. I'm not actually totally sure about it.
C
B
C
C
So Craig and I spoke about exactly that topic in our one-to-one earlier. Do you want to expand on it a little bit, Jakub?
B
Oh, so it might be a natural way to divide up the Prometheuses, to have one... if there's this natural division there anyway.
C
So just to expand on the idea for others a little: you know, what we're going to have is a Redis, we're going to have a Sidekiq cluster per zone or per Kubernetes cluster, so that we can kind of have lots, and then each one will have its own Redis instance, so that we can effectively horizontally scale the Redis behind Sidekiq. But then also, what we could potentially do is have a Prometheus instance that just lives alongside, that's paired with that cluster and collects its metrics as part of that cluster.
C
B
Yeah, I think one of the reasons to do this might be that we just saw that the GKE Prometheus has five million series, and we said that's probably because of pod churn. So, like, the more of these... if it's mainly because of the pods, it might be nice to also just... if that is a source of churn and we can separate them, that might be... yeah, and...
B
Well, because we use Thanos, it's perfectly fine to have multiple Prometheuses, and this is why I say that zonal clusters are not my favorite topic, because I am a bit nervous about the application impact of having multiple Sidekiqs, of having Sidekiq talk through different Redises but the same Postgres, but that's a different topic. But I think in the case of Prometheus, I don't see a reason why it wouldn't work, because we already do this sort of stuff, and it's mainly about, operationally, is this something... so this is hard.
C
We still do it at a single Prometheus level, so, you know, we would get split-brain SLI alerts, and that's not a problem, because all of the, well, almost all, except for the seven that Craig mentioned, all of the jobs are being evaluated within one cluster, which is the regional Sidekiq cluster, and they're all in there, so they're all getting evaluated in one place.
C
So it's not a problem if we split it up like that, you know, jobs could run in one of three clusters, and we just have to do exactly like what we did with the rest of our SLIs and move them so that they're evaluated in Thanos at a higher level. And the reason I say it's low risk is because we've got all the tooling in place for that; you know, we've got aggregation sets. We could probably just set up another aggregation set for that exact... right, but...
C
E
So, I'm sorry, I cannot turn on the camera right now. For my understanding: isn't the problem with having Sidekiq on zonal clusters the fact that we take out a cluster at a time when we are upgrading? Which could mean, right, like, if we have a separate Redis per zone, that would possibly cause additional problems, like we would have to have a master or main Redis that would be able to compensate for the fact that we are taking out two zones at a time.
E
C
I mean, I wasn't aware of that argument, and why would we take out the entire cluster while we're upgrading it?
E
First of all, there was a challenge with node pool allocation, if I remember correctly. We used to do it in one go, and then after we had that, we realized it's actually a nice feature to be able to see any breakage prior to rolling out to the rest of the cluster, so we always have some capacity.
E
That's all up for re-discussion, right, it's not... we were purposeful there, but it's something to keep in mind when talking about these topics.
C
E
If we are talking about... look, if I understand correctly what you were talking about here, like, you were saying that we would have Redis set up per zonal cluster. My understanding is that if we take out the whole, you know, zonal cluster, whatever was stored in the Redis instance that was for that specific zonal cluster could cause issues elsewhere, right? Okay.
C
C
Being unavailable, right, yeah, but that shouldn't affect Redis too, I mean...
E
E
So yeah, the challenge here I'm talking about is the different versions that we would run, you know, depending on the cluster we are rolling the new versions out to, right? Like, if I remember correctly, one of the problems was that at any given point in time, one of these clusters could be running a different version of, well, GitLab, basically. So wouldn't that already cause...
E
Right, but if you have a separate Redis cluster, sorry, a separate Redis per zone, that would mean that the new version, or that there would be a version stored differently in Redis, so you might actually end up having a situation where Sidekiq pulls from Redis something that is in a different format.
B
E
B
Yeah, yeah, like a deploy, a deploy is slow. So we have, that's true, jobs submitted by old versions picked up by new versions, and the other way around, and it's a horrible mess. But we already...
C
B
C
Yeah, I'm just kind of curious: this sounds like something that people are talking about. Is there an issue, or is this being written down somewhere? Because it...
E
C
Because I would push back on that, I mean, you know, I don't necessarily agree with that, to be frank. You know, there's... yeah, I'll respond on the issue, I'll try and find it, or I'll ask Jarv about it, because I'm...
C
I don't think that... that doesn't connect for me at the moment, at least.
B
I think I see a general problem here, where there is a solution, which is this Redis per cluster, so Redis per Kubernetes cluster for Sidekiq, and different problems that people think this solution solves.
B
And then you get a disconnect, where I think on the delivery side, from what I've picked up from Jarv, there's this idea that within a zone you could have uniformity, like, you could drain an entire zone, upgrade GitLab and put traffic back in, and then you no longer have this mess of different versions of GitLab talking to the same Redis, which is a different problem from scaling, from having to process lots of Sidekiq jobs, which I think is...
C
B
So I suspect this sort of confusion is going on here, and I think that whenever we talk about this, we should also be clear about what problem we're trying to solve and not get attached to a solution. Like, the one thing that always makes me a little anxious is if I see people getting very attached to a certain solution because they think it will solve their problem, and in the end you end up with something that doesn't solve any of the problems correctly, or something complicated.
E
This sounds like a perfect opportunity to actually pair with delivery on a larger discussion. So maybe it's worth, first of all, starting up a general issue to talk about the problems here, and then pairing with delivery to see whether we need to do some architectural changes to how we deploy GitLab, and what else is necessary to add to our infrastructure.
C
Yeah, I mean, I can see the advantages also of, like, totally shutting down traffic to a cluster, but there's also disadvantages to it as well, right? And so it makes me wonder whether making that decision will make other things kind of more difficult in future, and also whether that means that we'll miss problems that customers might have, because customers aren't running, you know, multiple Kubernetes clusters.
E
C
You know, they don't have that luxury, right?
E
But at this point we also have to take care of what gitlab.com needs, right? Like, if we can't scale further, then there is not much to discuss there, right? So yeah, yeah.
C
E
E
E
A
We should give them, like... after the release is out, I think. They've got quite a lot going on, and if we do this before the release, I don't know if we'll have all of their focus on this particular topic. So maybe we give it a week or two before we put it on their agenda.
E
E
Drawing a blank; you should ping Jarv for that one, I remember.
E
No, that's the namespace one. I don't think so.
A
Well, in the interest of time, I see that Marin's got another item on the agenda, so shall we hop to that one?
E
Yeah, if you don't mind. This is more of a question, now that I see Craig is around. We see the recurrence of the cron worker not being able to archive the trace jobs again, so we are in danger of data loss, so to speak, again.
E
Craig, you mentioned in that comment that this is a pure infrastructure problem, but I want to talk about, you know, whether that's actually the case. Yes, I know the fact that we have a large Sidekiq pod churn, and this is becoming a theme of this call a bit, but at the same time I just kind of want to discuss whether, you know, the application is not able to do what it's supposed to do in this new environment; maybe the architecture needs to change as well.
E
D
Cool, so the very simplest, quickest summary I can give you: trace chunks get put into Redis, up to 128 KB each; those get moved into object storage; and then, when the job finishes, those objects get collated into a final artifact, which then gets put back into object storage.
D
D
Yeah, not that things went really, really horribly wrong in the first place... the problem that we have... so the app could be re-architected, I mean, if we could do more of those in parallel. I mean, the slowness with the trace, with downloading the traces from object storage, is that they're done serially. You know, we do one after the other after the other, and they take, you know, some number of milliseconds each.
D
What is it, probably, yeah, 50 to 100, and yeah, it's just that they're happening serially. It's not that object storage can't handle the throughput; it's that we're doing it in a single thread. So if we could bring those down multiple at once, we could reduce that by factors; you know, two or three times would probably be enough. You know, if we get those down to two or three minutes, we'd get a much better chance of those working.
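A minimal Python sketch of that idea (not the actual GitLab code; the helper name and return value are hypothetical), fetching chunks concurrently instead of one after the other:

```python
from concurrent.futures import ThreadPoolExecutor

def download_chunk(chunk_id: int) -> bytes:
    # Placeholder for the real object-storage GET; in practice each call
    # spends roughly 50-100 ms waiting on the network.
    return b""

def collate_trace(chunk_ids, workers=3):
    # Fetching two or three chunks at a time cuts the wall-clock time of an
    # I/O-bound loop by roughly that factor compared to a single thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(download_chunk, chunk_ids)  # results keep input order
    return b"".join(chunks)

print(len(collate_trace(range(10))))
```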
D
It's
not
wrong.
The
only
other
way
would
be
to
pre-collate
them
in
smaller
chunks,
but
then
you're
sort
of
you're
just
throwing
data
around
over
and
over
again,
and
I'm
not
sure
that
would
be
you'd
just
be
fighting.
You
know
putting
together
10
of
them
and
then
putting
them
back
into
object,
storage
and
then
at
the
end,
putting
chunks.
That
seems
a
bit
weird
to
me.
E
So
I'd
like
to
take
like
a
page
out
of
andrew's
book
in
this
case
and
and
maybe
like
discuss
whether
10
minute
job
in
in
sidekick
is
something
that
we
want
our
application
to
do,
and
maybe
we
can
set
a
goal
of
what
we
can
offer
as
infrastructure
and
ask
development
to
tell
us
what
they
can
do
within
that
period
of
time
and
push
back
if
that's
not
doable.
E
So
you
know
like
it
feels
to
me,
like
the
long
running
or
stable
chart
suggestion
that
you're
making.
It
goes
completely
against
the
the
kubernetes
philosophy
there
right
so
like.
I
would
like
to
challenge
the
situation
a
bit
if
possible
and
if
only,
if
not
really
possible,
without
a
major
re-architect
we
go
back
to
yeah.
I
don't
know,
I
don't
want
to
go
back
to
vms,
but
the
vm
behavior.
D
Yeah,
I
I
just
said
I
mean
I
agree
with
not
with
having
to
be
more
resilient,
but
10
minutes
feels
like
a
very
slow
small
number
for
a
pod
to
live
when
none
of
them
seem
to
live
more
than
10
minutes.
I
mean
we're
just
we're
just
never
getting
it's
not
like.
Some
of
them
die
quickly.
It's
none
of
them
seem
to
or
all
these
jobs
are
ending
up
on
pods
that
never.
G
F
D
C
C
Yeah, and the problem, I mean, I agree with Marin, like, we should start pushing towards saying, you know, you can't have it... we can't scale things where, if somebody does one push, we go off and do an hour's worth of compute on that. That's just not gonna work, that doesn't scale. But at the same time, a lot of these things are what they are, and so we're gonna have to have a lot of exceptions to that rule, because they're not going to be able to change them super quickly.
A
As well, because we've had the problem in the past; I've done manual intervention on this for months now, we've been doing this manual intervention, and now we thought we'd gotten somewhere, and we haven't, we're back to where we started. So this seems like a decent one to say, you know, you have to have a better long-term plan for this; in the short term we can do X, but we can only support that for a very short period of time.
A
B
Right, but we started this with the CI trace chunk drops, right? I had technical trouble, so I missed part of the explanation of what's going on, but is it a throughput problem? Like, why does it even matter that these things get terminated after 10 minutes? Do they get scheduled once per hour, and they expect to run for an hour, and they only get 10 minutes?
B
Or, which we could do, is a single job, processing the artifacts of a single job; is that taking more than 10 minutes and it's never completing, some...
D
B
E
E
Related to the project export/imports, I just want to share how that whole thing is going on that side. So we have, like you said, Craig, the same problem, basically, right now. The challenge there is that the feature itself is outdated architecturally, right, like, it needs a complete rework, because it was already not reliable on VMs.
E
So we are right now looking to create that stable shard that you're talking about, with a couple of pods that will be running there almost exclusively just for project imports. But we already know that's not going to work; we already know this is going to fail as soon as you have more than four or five parallel long-running migrations, you start queuing, and it just goes out of hand, right, and we don't have infinite capacity. So what is being done on the side... sorry, I know, just to finish this out.
E
What's being done on the side is: we are buying time for the team that is responsible for project import/export to actually redo the feature. So this way we can support them for a tiny period of time while they get ahead, but everything else is stopped, right, like, we're not taking new project imports, like, they're failing, it's known, and so on. So it's a bit of a mess. I don't want us to get there with this situation; so that's why I would like to start these discussions in parallel.
E
Like Rachel mentioned, right, we need the longer term plan and the stop gap; the issue that you're asking approval for can only last for a bit, like, it can't go on forever. You know, sorry.
C
On top of the operational and administrative overhead of running on Kubernetes... and, you know, the more we allow this, the more pain we're kind of kicking down the road, because, you know, there's no scaling on these things, there's, you know, everything we know that's wrong with it; we're basically just using Kubernetes as a deployment mechanism then.
A
E
Well, for this specific one, because it has the visibility in the stand up, I can make it into... not necessarily a rapid action; I don't think we need a rapid action for this, but we need an organized effort from multiple sides to get this thing done. So what I could do is, if you have a bit of a write-up of, like, what do we need to expect from development, what are we going to do as a stop gap, and so on...
E
I can go with that and explain what kind of effort we need, and maybe it doesn't turn, hopefully doesn't turn, into a rapid action, but more of a larger project that we need to collaborate with others on. And it could be... this is a scaling problem, right, so this is what we need to do as well, so we can pair with the team that is responsible for this to design a new... yeah, redesign this feature, basically.
C
E
That makes sense. One of the things I forgot about is engineering allocation; I still don't know how to use it. I think that's actually a really good approach there. So, theoretically, we as a team could take that on, together with whoever the engineers allocated there are, and drive it as one of the projects.
E
H
Sorry, Rachel, I didn't...
E
...know this was part of infradev, the engineering allocation. I only got...
A
...into the engineering side this week, yeah, it's still new to me; I'm also learning how to make it useful. But so, what I'll do is: Craig's last comment on this one issue is about it sitting firmly with infrastructure, so I'm going to draft a reply to that, but I will, yeah, I'll draft it and then send it back to Craig, just to check that my statements are right, and then we'll try to push it forward as a project using that engineering allocation.
B
E
If we can stop the recording, I can say the real thing. Otherwise: it's a way to actually get some architectural changes done outside of the regular scheduling that product does, so outside of the product process.
B
C
No, my understanding is that the engineering department has got an allocation that they can use for longer term strategic initiatives to fix things. So in my mind, infradev is quite tactical, and it's quite like, we need to fix this thing now and we're gonna scream a lot until it gets fixed. Engineering allocation is the next rung up, where there is a certain proportion of headcount
C
that's dedicated to addressing kind of strategic problems. Like, off the top of my head, I would imagine, you know, the object storage problems around file upload and all of that would be a great engineering allocation problem. It's not something that you can address really short term, like, you know, in the next three releases, and it needs devoted kind of engineering, and Christopher's managed to get an allocation of headcount that can work on those projects, and it moves around between teams.
C
F
It sits in between infradev and rapid actions, like, it's faster than infradev, but not as...
C
Not as fast as a rapid action? No, it's slower than infradev, it's slower than infradev and slower than rapid action. So it's the third tier.
E
It's a different optimization; it's an optimization to skip the whole prioritization process that goes across multiple milestones, where you need to go through a product manager continuously. This is more of: you have a dedicated set of people who can actually do the work and schedule the work themselves, to skip a couple of levels, right, yeah, and...
C
C
And the kind of status updates of that is now in the Tuesday infradev meeting, so infradev and engineering allocation happen together in that Tuesday evening meeting. And so in the first part of the meeting we talk about infradev issues, and then we talk about engineering allocation, and what you obviously find is there's a lot of overlap. So some of the stuff that's been raised as infradev, it's like, oh well...
C
Actually, that's going to be this engineering allocation that's underway. Or the other thing that might happen is, you know, we're seeing this common pattern over and over, you know, like, I'm going to use the file upload example again: we have all this different pain because of CarrierWave, but it's not something that, you know, one team is going to go...
E
So I have an item for tomorrow for the SaaS stand-up. So if anyone can let me know whether we'll have a tiny bit written up so I can actually share; otherwise I'll move the update to Monday and try to...
E
Yeah, and Craig, also, I really appreciate the write-up that you did in that issue, right, like, it was really super clear what is happening there, and I do appreciate your helpfulness in saying that this is an infrastructure problem. But as someone who's been in infrastructure for a while, I realize it's very rarely only an infrastructure problem.