From YouTube: Scalability Team Demo - 2020-12-17
Description
No description was provided for this meeting.
A
Hello, so I have the first item on the agenda. I already brought it up to Rachel in our one-on-one doc, and she mentioned there that we should talk about it, but perhaps it's better to talk about it with everybody. We have this long-standing issue where Gitaly Canary is alerting on us, and that's because our repositories are there and they're very busy. Every time we look into it, we find a different mole to whack.
B
I don't know from the isolation perspective, but it seems very hard to disentangle what behavior on Gitaly Canary is because it's running new code and what behavior is because it's using these high-traffic repositories, especially when we've got a bunch of alert silences.
C
How often do we actually find an actual Gitaly problem, and not an "our repo is too busy" problem?
D
Was it a canary installation? Was the code different on there when it was first separated?
C
I'm not sure, it could also be that, so I'm not really sure why we had a canary. Another issue here is the order of deploys and the tight coupling between the Gitaly version and the Rails version. I think at one point we had a rule where we said you should always deploy Gitaly before you deploy Rails, I think.
C
This is why it happened, now that I think back: when doing a deploy, upgrade Gitaly before you upgrade Rails, because in Rails we never account for the fact that maybe we're talking to an older version of Gitaly. Rails always assumes it's talking to the latest version of Gitaly or something newer.
C
So then when we wanted to have a canary Rails, we had a problem because that would talk to an old Gitaly. So we created a canary Gitaly that we can deploy ahead of canary Rails, just from a correctness and compatibility viewpoint. I think jarv would also know, I think he was involved in this. I think that's why it happened.
D
So when we say it was its own problem, do you mean that it was just causing alerts and noise and things to be looked at that were distracting from what the actual problems were, because the usage of it is so different?
C
Yeah, we constantly had people getting paged on the server where gitlab-org/gitlab was, and I know there was a discussion, and there was something slightly iffy about moving it, but it was some sort of pragmatic decision: we're probably better off putting it there and not making it everybody else's problem.
C
I think they shouldn't be, because, and I think we've had this discussion before somewhere, on a big Gitaly server with lots of different repositories you're sort of seeing an average across a lot of different things, but on Gitaly Canary, where it's just gitlab-org/gitlab, it reflects the performance of that one repo a lot, and that repo is used very heavily.
C
A little bit. I mean, we also have marquee customer Gitaly servers. In a way, Gitaly Canary is the marquee customer: it's the marquee Gitaly server for ourselves.
D
But on the marquee Gitaly servers, do they have the same SLIs?
D
Well, that's an interesting approach that we can take, because I feel like we're playing whack-a-mole forever. Well, not forever, but there's always going to be something else that comes up, and that's why I thought it was better to talk about this from a more fundamental perspective. Because if it's fundamentally different, then maybe we need to be treating it differently. Maybe these SLIs do need to be different, and I think finding out what the marquee customer situation is is good.
C
Because that's your question: are we unique? No, we're not unique. Well, what's unique about us is that it's a busy repo and we care if performance is bad, because for the most part we ignore it if other people's performance is bad. We're blind to it, or deaf to it; we don't hear it if their Git performance is bad, so we just ignore it. But if gitlab-org/gitlab CI is stuck, then people on Slack start clamoring, and then we do something about it.
C
Okay, that's a very crude way of putting it, but I think it is more or less true that we have a hard time observing what the user experience is at the individual level, and we look more at the system: how is this server doing, and is the server fine most of the time? And if there are 10,000 users on that server, that doesn't mean that all 10,000 of them are happy, and that's also what marquee customers are about.
C
Those are people where, if they're not happy, we actually have to do something about it, because there's a big contract behind it.
A
I mean, what about the fact that looking into this once in a while has made us improve things for everyone?
C
Yeah, no, we should make it better. I'm not saying we shouldn't, but we should probably treat this the same way we treat marquee performance.
C
No, I think that's an accident, because, well, also, if there's nothing on the canary server, if it doesn't get any traffic, then you get nothing from it. So: let's just give it one of the biggest chunks of traffic we have.
C
Yeah, but then, if it turns out that that works really well, we probably want to do that for our marquee customers too, so it could be sort of a staging or advance test area for what the marquee server should be like. But I mean, the question is again: why aren't alerts going off on marquee servers all the time, and why are things better there?
D
I think that's worthwhile investigating, because asking the question about the difference between this and a marquee server might give some interesting answers, because the idea behind it is the same: you want to have a certain type of usage isolated.
D
Well, for context, a marquee server is somewhere where specific customers have been identified to be moved onto specific sets of hardware. So, instead of having everyone all together, we've isolated certain customers, on request, onto specific hardware to try and isolate noisy neighbor problems.
D
So that's why we're talking about having our own repositories on a separate server, because it seems to follow the same pattern of isolating ourselves as a customer onto specific hardware. We just added the whole complexity of "let's also make this a canary installation", which is like a pre-release environment as well, which, yeah, probably muddies the waters a bit here.
B
I think an actually useful Gitaly canary would probably just be an arbitrary, regular Gitaly server, and, like I've said, this one should be like a marquee server, and then we have two sorts of dimensions to think about there. But I think if you're talking about pre-release code, then that probably makes more sense.
B
I don't know if this is even an important thing to the Gitaly team, but in terms of new deployments, I think, because they kind of have to be coupled to repos, because that's how Gitaly is deployed, isn't it, then it would make more sense for it to just be... because I think we're trying to move towards canary just being "five percent of traffic goes to canary" or whatever, right? The closest we could get is to deliberately pick an unremarkable one.
C
I think maybe we need to talk to people on the delivery team about this, because this whole aspect of what gets deployed when is kind of complicated, and that's really their focus. I'm not even sure why we ended up with this strict relation of deploying Gitaly before Rails; I don't know if that's still true. But if we had an arbitrary Gitaly server act as the canary server, then that would usually get non-canary Rails traffic.
C
And another possibility is that we're stuck with this being the canary Gitaly server for practical reasons, but we can still say we run it as if it's a marquee server: we apply the same standards, SLIs, alerting, or whatever is in place for marquee. We treat it the same, and maybe we already do, but right now we don't know.
D
Yeah, and I think finding out is important. I think finding those things out is one of the first steps to deciding what to do next, because, yeah, I don't want to just spend more time bashing the biggest problems as they seem to come up every week or two, or whatever the next different thing is.
B
And the other thing is, I think you saw there was FindCommit again.
C
Yeah, yeah, probably. It used to be that every request got its own fresh cat-file process and that was slow. And then we came up with this thing, but yeah, it's still not great.
C
Yeah, Gitaly and Git caching is an interesting problem, and that's one of the reasons why I think it's on our agenda to work on that.
C
I don't think it's a problem we ever really solved well. When we started the Gitaly project, I pushed back on having that in scope because I think just getting Gitaly working was hard enough, but the result is that it hasn't been explored much, and it's one of those things that spans across multiple components across the whole application, and we're maybe just not doing it right, or not in the right place.
B
Yeah, I think a while ago I created an MR that would enable this ref caching everywhere. The problem was, I wanted to use the MR to see what specs broke, to get an idea of whether this is feasible, but most of the spec failures were really just spec failures: in a real situation this wouldn't be a problem, but in the context of the spec, the spec is doing one thing, then immediately running a request and expecting to get a different result. In the specs, the request context is the whole spec, for simplicity. So basically I didn't pursue it any further.
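For illustration, a minimal Python sketch (hypothetical names, not GitLab's actual code) of why widening the cache scope breaks tests like this: when the memoization context spans a whole spec instead of a single request, a write followed by an immediate re-read sees the stale value.

```python
# Minimal sketch (hypothetical, not GitLab's implementation) of spec-scoped
# ref caching: the cache's lifetime is the whole example, so the spec's own
# mutation is not visible to the next read and the expectation fails.

class RequestStore:
    """Per-context memo store; in specs the context spans the whole example."""
    def __init__(self):
        self.data = {}

def cached_default_branch(store, repo, read_from_gitaly):
    # Memoize the first answer for this repo within the current context.
    return store.data.setdefault(("default_branch", repo), read_from_gitaly(repo))

def spec_like_scenario():
    branches = {"project": "master"}
    store = RequestStore()                        # shared across the whole spec
    read = lambda repo: branches[repo]

    assert cached_default_branch(store, "project", read) == "master"
    branches["project"] = "main"                  # the spec changes the repo...
    # ...and immediately re-reads in the same context: stale value, false failure.
    assert cached_default_branch(store, "project", read) == "master"
```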
D
I think I agree that there's a lot more we could potentially do to make the access faster, and I think it's a case of deciding if that's the right thing to do. I think starting with the path of finding out the difference between this and the marquee nodes, and determining the best use of canary, and then it might be that keeping the arrangement as it is, is actually the right thing to do, because then it does enable us to look at how we can improve the usage across the board.
D
We just have to decide if that's the best use of the time, given that it's mainly us that seem to do this, and I know that making improvements for us makes an improvement for everyone, because it's all the same thing underneath.
B
Yeah, I think zj mentioned that before as well: if you think about Postgres, obviously Postgres does some other stuff with buffers and so on to keep frequently used rows in memory. But basically, if we want our database query to be quick, then we cache the result on the Rails side, because the Rails side knows how it's using that and knows when it can invalidate it, and I think the same applies to Gitaly.
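A small Python sketch of the point being made (illustrative names only, not the GitLab Rails API): the caller caches the result because only the caller knows when the underlying data changed and the entry can be dropped.

```python
# Hypothetical sketch of caller-side caching with caller-owned invalidation,
# analogous to Rails caching a Gitaly answer: the caller, not the storage
# service, knows when the data it cached has been made stale.

class CommitCache:
    def __init__(self, gitaly_client):
        self.gitaly_client = gitaly_client
        self._cache = {}

    def find_commit(self, repo, ref):
        key = (repo, ref)
        if key not in self._cache:
            self._cache[key] = self.gitaly_client.find_commit(repo, ref)
        return self._cache[key]

    def push_received(self, repo):
        # The caller sees the push that changed the repo, so it can invalidate
        # precisely; the storage side would have to guess or serve stale data.
        self._cache = {k: v for k, v in self._cache.items() if k[0] != repo}
```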
C
Thanks for bringing that up, because I was talking about me pushing back on caching in Gitaly, but that's also one of the reasons for pushing back: it keeps whatever Gitaly says authoritative, and that's what's actually in the repo. Knowing when you can ignore what's in the repo and serve something stale instead is complex.
C
So it's a real problem, but yeah, I don't know. In the past I would have maybe said Create or Source Code owns this, because they own the Rails code that touches Git repositories, but I'm not really sure if that's practically true, because lots of other parts of the application care about stuff that is in repositories. So it might be a little bit like one of those Sidekiq inefficiencies things.
B
I think it is more clearly owned in the Rails application, because only a few teams... there's a Source Code team that essentially deals with how Rails accesses Git repositories. I know that other teams do that as well, but I think they would be the natural owners. They've got a lot on, of course, okay, yeah.
D
So I suppose, after we've done the investigations here about the SLIs and the marquee nodes, when it comes to talking about how to move forward, it's about getting Source Code involved with this and working with them to try and see what can be done next. But I think it also depends very much on what we find with the marquee nodes and the comparisons there.
C
Yeah, and there is another interesting angle here, which is not about jumping in and trying to fix the problem, and that's observability. It's very observable what Gitaly does, because you have these RPC calls and we log them, and we know how long they take. But what happens, and how often, when something in the application tries to look up a commit?
C
It could become an RPC, it could be an RPC call like that. There's stuff that the application tries to get out of the repository, and then there are the things that actually get sent and asked of Gitaly, but what's happening in the Rails app as we try to access a repository is maybe not as observable.
C
About the things that want commits, and what the patterns are there, and things that want the same commits: do they hit a cache in Rails, or does it become a Gitaly call?
C
What does the behavior look like? What are the patterns? What are the opportunities for optimizing, if we see that something gets repeated all the time? I suppose we can make the argument that it's not efficient enough by just pointing to the number of Gitaly calls and saying that needs to go down.
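As a rough illustration of the kind of Rails-side visibility being described (hypothetical names, not GitLab's instrumentation), a Python sketch that counts whether a commit lookup is answered from a caller-side cache or turns into a Gitaly RPC:

```python
# Hypothetical sketch: per-request counters for "served from cache in Rails"
# versus "became a Gitaly RPC", the distinction being discussed above.
from collections import Counter

class InstrumentedCommitLookup:
    def __init__(self, gitaly_client):
        self.gitaly_client = gitaly_client
        self.cache = {}
        self.stats = Counter()

    def find_commit(self, repo, ref):
        key = (repo, ref)
        if key in self.cache:
            self.stats["cache_hit"] += 1      # answered without leaving Rails
        else:
            self.stats["gitaly_rpc"] += 1     # turned into a Gitaly RPC
            self.cache[key] = self.gitaly_client.find_commit(repo, ref)
        return self.cache[key]
```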
C
Not intentionally, but now that you mention it, yes, that is maybe what I'm...
C
We used to have observability inside the Ruby code where Ruby method calls would get tracked, and we would dump crazy amounts of data into InfluxDB to see what the application is doing, and we had to stop that because we were generating too much data. That's why Bob is suggesting maybe we shouldn't go there again.
F
Can I ask one question? What is the current status of the tracing option? I saw that we use labkit-ruby for tracing and stuff, so is it deployed into production yet, and can we get the data from that? Or is that just in my local environment?
B
He's working on that, and I think that's a good point. It feels like we're punting a few things to that, but it would be nice to revisit some of these things once we have tracing enabled in production, and see if that lets us find out the answers to these things without going and building our own thing.
C
So tracing is in a strange state, because the work started one or two years ago, but deploying it into production stalled for a very long time. It is moving now, but it's still not there. That's why in local development you can see tracing, and it's like, hey, we have tracing, so why isn't it in production?
F
Yeah, actually, in the local environment I already realized that it is not very helpful for seeing how the code is behaving. So I think that if we enhance the observability in the tracing, we can have a really clear vision about how our application is doing in production. Maybe we can enable some sampling, about 1 or 0.5 percent, or even just for the long-running calls, and we can get the data from that. So maybe it's really helpful long term.
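For illustration, a minimal Python sketch of the sampling idea being suggested (rates and names are illustrative, not the actual labkit-ruby configuration): keep a small random share of traces, plus any request that turns out to be slow.

```python
# Illustrative sketch of probabilistic trace sampling: ~0.5% of all requests
# are kept, and long-running requests are always kept. Not real LabKit config.
import random

SAMPLE_RATE = 0.005          # roughly 0.5% of requests
SLOW_THRESHOLD_SECONDS = 10  # always keep long-running requests

def should_record_trace(duration_seconds=None):
    if duration_seconds is not None and duration_seconds >= SLOW_THRESHOLD_SECONDS:
        return True                       # keep slow outliers regardless of rate
    return random.random() < SAMPLE_RATE  # otherwise sample a small random share
```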
C
I agree, these arguments are much easier to make: if we want to argue that something can be improved and we have tracing data, that would be easier. Yeah.
G
So it's actually very close. There's a readiness review that's currently underway, and I'll... where should I paste it? Sorry, I don't have the doc open.
G
So if you look at the staging traces, there's a lot of stuff in there that I think we should clean up, and I've got some open MRs to clean up some of it. For example, most of it is health checks at the moment, because we have so many health check requests and they're all getting traced, and we should turn that off. And obviously we want to go to remote sampling and all these features, but as a sort of MVC...
G
We should have it in really soon. Depending on when that is, I think, hopefully, it'll make it before the production change lock at the end of the year; otherwise it'll be early January that we'll have tracing in production, and it will make a massive difference.
D
Where we got to was: we started talking about the fact that we still have alerts silenced on Gitaly Canary, and we had a conversation about what we could do that's slightly different there, and some investigations.
D
That's how distributed tracing came up, and then we brought up: well, we have distributed tracing locally, is this not in production yet? Because that might give us some more guidance.
G
Yeah, so, I mean, I had a call with jarv and marin yesterday about this, because the other problem is that we have affinity for certain projects for canary.
G
So if you go to gitlab-org/gitlab by default, if you don't have any other setting, it will opt you into canary, whereas if you go to, say, F-Droid or another open source project, it will by default opt out of canary. And so because we're driving canary web traffic to web canary, which then goes to the Gitaly Canary...
G
It has a knock-on effect, so we have silences on web alerts for canary, which are also very unhealthy. So there was a discussion, maybe on Monday, between marin and jarv and myself, and we spoke about either breaking up that opt-out scheme that we have on the web... but marin was very against that.
G
Basically, what I'm against is having the silences, and one of the things that we spoke about was, you know, basically the repository is slightly pathological, the main repository on canary, and it's always going to be slow, and because we don't get this kind of normal mixture of traffic...
We.
G
It's
very
problematic,
so
one
of
the
things
that
we're
talking
about
is
until
we
get
the
clone,
the
clone
sharing
or
the
the
pac
file
sharing
or
whatever
we're
calling
it.
One
of
the
things
that
we
should
consider
is
moving
gitlab,
org,
gitlab
to
prefect
and
and
replacing
it
on
on
the
gilly
canary
with
a
bunch
of
other
gitlab
traffic,
because
we're
we're
the
best
testers
of
gitlab.
G
However, it only makes sense on Praefect if we also have distributed reads, and that is not on at the moment. Distributed reads in Gitaly Cluster share the reads between multiple Praefect nodes, and so we might be able to reduce the load that we see on those nodes by sharing it between three nodes. But that's also a beta feature at the moment.
C
What we were wondering, and what you maybe know: how is gitlab-org/gitlab, or the Gitaly Canary, different from the marquee Gitaly servers? Because I presume those are also high-traffic repos with special needs. And do they have different SLIs? Why don't we get alerts from them all the time?
G
Actually, the marquee servers are very... the historical reason why we even have those marquee servers was because there was horrible Git performance back then, and it was getting worse and worse and worse, and our most valued customers were getting the angriest. So we were moving them onto very low utilization machines. So the marquee machines are actually very low utilization, and we never really completed that and moved more people onto them, because it was really an emergency. They're all overprovisioned.
G
Let me just find it for you. It's...
C
So my thinking is, the reason we don't get alerts about the marquee servers all the time is because they're overprovisioned, so yeah, they can just handle whatever we throw at them.
G
So yeah, that was the first option that I listed, and it's kind of an either/or between that and Praefect, or Gitaly Cluster, I think, are we calling it Gitaly Cluster now? The naming seems to be sort of changing.
C
I think Praefect is just a component; it's the routing component. Okay.
G
So here I wrote down all of those. It sounds like we had the same conversation, yeah. Thanks. But there's a note where we just kind of describe all the options, and that machine has got 30 cores at the moment, and if you go look through the metrics, it's totally CPU saturated, so it can take up to one and a half times the amount of time it's able to do work just to schedule the work on a CPU, which is bad.
G
And so we've already switched over once to compute-optimized, whatever the...
G
So I think that, on the back of that, and my time is short because I'm going on holiday on Friday, so if you want to drive that, or at least get Alberto to drive it, or drive it with Alberto, and get that machine scaled up, it's going to give everyone a much more peaceful Christmas or end-of-year time, right?
G
And so I think we should just do that, because it was kind of an either/or, and we've discovered that the "or" option is not going to happen in the next few weeks and will probably get blocked by the change lock.
G
Because, yeah, I mean, we could put 60 cores on it. I know it's not great, but the thing it's throttling on, as far as I can tell, is the CPU, so throw more CPU at it until we have a better solution, which is coming. It's not like we don't have answers here.
D
Yeah, you also would have missed the beginning of the call, and I've said it a couple of times, so I apologize to everyone who's listening, but it just feels like continuing to do one small change at a time isn't effective. We just keep doing it: every time the triage rotation comes around, someone finds a little something that we can do that makes a difference, makes a little dent, but we still have silenced alerts. That's the reason we were discussing it on this call, because it just hasn't...
D
On Gitaly Canary, there just has to be something different that we can do, so I'm happy to carry on trying to drive this. But I see that the issue that you've linked here, 12142, is specifically about moving it to Gitaly Cluster.
G
So if you look at the last issue that I linked, with the hash ref to the note, 12118, that's got all the options that we discussed in the call, and there's even an agenda link in there, I think, as well. Okay.
D
I'll follow the conversation on there and see what I can push forward. It just feels like there's something different; there must be a different way to try and get through this, because what we're doing at the moment is just repeated work, and yes, we're making the improvements, but it just doesn't feel like it's...
G
With marin and jarv, because we spoke about this a lot on Monday in a very animated call, but I think that, you know, jarv did actually make a change on Monday as well, which has made... are you taking into account the change that jarv made on Monday? Because it did actually make a bit of a difference.
D
Okay, so I suppose for next steps here, I think we should still write up on the issue, as Bob described above, the conversation that we've had here about the differences between the servers and the discussion about the marquee customers. I think it's good for everything that we've said here to be recorded on the issue, because it shouldn't just exist in this format. I'll also continue the conversation on the issue.
D
That Andrew has linked here, about what we can do with this in the shorter term. But I'm glad that we're just talking about this in a different way and seeing what else we could do here, while you're here.
G
No, it's the same, but marquee... I don't feel like marquee is particularly tied into this, because those machines have got like 20% utilization of disks, right, and they're just idling, and so they're very, very different from gitlab-org/gitlab if you look at them by any measure. We've got a dashboard called the Gitaly rebalancing dashboard, which is kind of a proxy attempt that I made to take all the different factors that we consider to be utilization and match them up, and on those, by every measure...
G
Yeah, and they're really bad by that labeling. The thing is, I think the way we have to frame it is that the ultimate fix for this is going to be pack files, that pack-file sharing or pack-file caching or whatever, because this will be the repo with the biggest impact from that fix.
C
It looks like it's a big chunk, but at this point, I don't know. On the one hand, I'm wildly excited about it; on the other hand, I'm a bit scared that I'm too excited. But I looked at...
G
I looked at kind of concurrent requests for clones as a kind of bad proxy for how effective it would be, so I was looking for how many concurrent requests were going through for different repositories, and gitlab-org/gitlab was like so high compared to everything.
C
Yeah, well, if we know that the incidents are about things where the pre-clone script is failing, or it is not offering enough relief, then yeah, that is exactly the problem this solves, and that's also why we came up with the idea, because of all the incidents about that. It's just that, in the larger scheme of things, there are all sorts of bad things.
B
This really wouldn't make sense in a couple of minutes. It was literally just if there was nothing else on the call.
G
I've got something that is totally unprepared, but maybe I'll just run it by people. Stop me if you think it's not the right thing.
G
You know all of the metrics that you see here. This dashboard is, at the moment, in the wrong order, but all of this is being aggregated by component and service type, and we have a set of aggregations.
G
Then we have something called node-level monitoring, and we've only turned that on for Gitaly, and what that does is it gives us a different aggregation set, right? So here we're saying, because each Gitaly node is like a single point of failure, we want to record our SLIs and what our error rate is on a per-node basis, so we have a different set of aggregations, and that gives us all these dashboards here. So we can look at...
C
We sort of manually have this already for Redis, right? Because we have three Redis services. And this is saying: Gitaly is maybe 50 services, but we don't want to present it as that. So yeah, yes.
G
Exactly, so with Redis the workloads are slightly different and we have different monitoring on them, but yeah. And then at the top level... let me just choose a different one, because this is definitely broken... we also aggregate the service level indicators for a service up to this top level, so we take all of the SLIs and we mix them in together and we get a top-level aggregation.
G
So there we've got three different aggregations, and then for each of those we have a different function in jsonnet. It's like: give us the service-level aggregation error charts, give us the component aggregation request chart, and so there's this matrix of these things. And because we're moving to Kubernetes now, and we have four different clusters that are almost running independently, and we're going to be deploying to them at different stages, we're actually not going to roll them out...
G
...at the same time; we can have different deployments in each of them. jarv very much wants us to also be monitoring things on a per-cluster basis. So he wants to be able to say how the web service is doing in the us-east1-c cluster as opposed to us-east1-b, and if that particular cluster is not meeting its SLOs, we'll actually get an alert that says us-east1-c is failing.
G
We also, you know, roll things up by canary, and so I just started thinking about that and all the new panels and dashboards and everything that we've got to create, and I started getting very sad, because there's lots and lots of stuff. So what I did was I took all of that stuff and abstracted it out into what I call an aggregation set. This is still a work in progress, but it's nearly finished, and it's reduced a lot of the code; there's so much stuff...
G
...that's going to be deleted from this. But effectively what it says is, each set of aggregations has got its own set of metrics behind it, and then, instead of saying "give us the charts for the service-level aggregation", you say "give us the request series and use this aggregation set", and then, when it needs to look up the five-minute burn rate or the one-hour burn rate, whichever one of the metrics it needs...
G
...we just plug that in, and it's kind of pluggable, and then we can go from having like 20 different charts that are all very similar down to four charts, you know, one for errors, one for requests, one for... and we just plug the appropriate aggregation set into that, and it will...
G
You know, we get rid of that, and when we want to add the new aggregation for cluster, or we want to start aggregating by feature category, we don't need to add another 20 charts, and, more than that, we avoid the overhead and the cognitive complexity of doing that.
G
So what we'll have to do then is just add another one of these definitions, which gives all the metrics that this aggregation set will use and how it's defined, and then we can start adding new ones, like the per-cluster one, without having this explosion of code that we've got at the moment. So, I don't know, it's nearly finished, and one of the things that's promising is most of it is deleting code and simplifying code. So I'm very happy about that.
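To illustrate the shape of the idea (the real implementation lives in the dashboard tooling, and all names below are illustrative), a Python-style sketch of an "aggregation set": one generic chart definition parameterized by the aggregation, instead of one near-identical function per aggregation.

```python
# Rough illustration of an aggregation set: the labels and recording-rule
# names per aggregation live in one place, and a single chart builder is
# reused for service-level, node-level, per-cluster, etc. Hypothetical names.
from dataclasses import dataclass

@dataclass
class AggregationSet:
    name: str
    labels: list          # labels the recording rules aggregate by
    burn_rate_metrics: dict  # window -> recording rule name

SERVICE_LEVEL = AggregationSet(
    name="service",
    labels=["env", "type"],
    burn_rate_metrics={"5m": "sli:request:rate_5m", "1h": "sli:request:rate_1h"},
)

NODE_LEVEL = AggregationSet(
    name="node",
    labels=["env", "type", "fqdn"],
    burn_rate_metrics={"5m": "sli_node:request:rate_5m", "1h": "sli_node:request:rate_1h"},
)

def request_rate_chart(aggregation_set, window="5m"):
    # One generic chart definition; plug in whichever aggregation set is needed.
    return {
        "title": f"Request rate ({aggregation_set.name})",
        "query": f'{aggregation_set.burn_rate_metrics[window]}{{type="gitaly"}}',
        "legend": ", ".join(aggregation_set.labels),
    }

# Adding a per-cluster aggregation later means defining one new AggregationSet,
# not duplicating twenty chart functions.
```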
G
No, so we still have the same number of dashboards, like one per service, but at the moment, right, we have a function effectively for that, and then a different function for that, and then a different one, and obviously...
G
...this is something I see as a contributor, but also, one of the things that it allows us to do is much more quickly say: well, we want to slice our service-level monitoring in a new way, and it's not like, okay, that's going to take me three weeks to update hundreds of dashboards. It's just: plug in, define the new aggregation set, what all the metrics are, and that's...
G
And then just plug it in, and you'll get all the graphs on the fly, and you define what the labels are for the aggregation sets in one place. So this is another thing: Ben really wants to get rid of the tier label, and instead of going through 100 places in the code, we can just remove the tier there, and then everywhere that we generate a recording rule using the tier label, we can get rid of it.
G
So yeah, it's not very exciting stuff and it's very low level, but I think it'll help us improve our monitoring going forward. So that's why I thought I'd bring it up.
D
Well, thanks for taking us through that, and thanks everyone for being on the call. I will upload this to YouTube Unfiltered later on. Andrew, with the shares that were on there, is there anything that shouldn't be made public?
G
No, there was nothing private, unless I can see something on my screen... no.
D
I think it's all good. Great, well, I'll share that there. Thanks to everyone for joining, have a good rest of your day.