From YouTube: 2023-02-16 Scalability team demo
Description
No description was provided for this meeting.
A
So, as an SRE, I get paged with apdex alerts. We use symptom-based alerting, so the alert is going to say "something is slow", right? And part of our job in Scalability, I guess, is diagnosing that type of thing, or helping diagnose it. So it's really useful to have latency attribution, so that we can not only see which requests were slow, but also where they actually spent their time. We've invested heavily in this type of, I guess you could call them, almost log-based metrics for the Rails app, where for any given request you can see: okay, the time spent in the database was this much, compare that to the overall request time, get a sense of what the expensive part of the request was, and then aggregate that information.
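As a rough illustration of the kind of per-request attribution this gives you (the field names and values below are invented for the example, not the actual production log schema), a minimal sketch in Go:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical shape of a per-request structured log entry; the real
// Rails log fields may be named differently.
type requestLog struct {
	DurationS   float64 `json:"duration_s"`    // total wall-clock time of the request
	DBDurationS float64 `json:"db_duration_s"` // time attributed to database calls
}

func main() {
	line := []byte(`{"duration_s": 2.4, "db_duration_s": 1.9}`)

	var r requestLog
	if err := json.Unmarshal(line, &r); err != nil {
		panic(err)
	}

	// Attribution: how much of this request was spent in the database?
	fmt.Printf("db share of request time: %.0f%%\n", r.DBDurationS/r.DurationS*100)
}
```

Aggregating that share across many log lines is what makes these "almost metrics".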
A
On many of our other services, which are not Rails, we don't have that much of that information; in some cases we don't have any of it. And Gitaly in particular has many different sources of latency. So let me share my screen and maybe talk through a Gitaly request to get an idea. An RPC comes into Gitaly, and there's a concurrency limiter at the beginning of the request cycle. This is per repo, so if there are a gazillion clones happening against a single repo, that concurrency limiter will kick in. It's configurable per RPC.
A
We've only enabled it on some of the long-running RPCs, but this concurrency limiter will put everyone else in a queue, and they're going to wait. So if we're getting alerts for slow requests, it's really useful to know: okay, did they just spend their time waiting for their turn in the queue? The other thing that Gitaly does during most of its RPCs is shell out to git commands, right? That's essentially an external call, and we want to know how much time was spent during that external call. So a while back we got these command stats.
A
Let's see if we have an example of those here... maybe not. But these basically say how much system and user time, that is, how much CPU time, was spent cumulatively in this request, across all the commands that we shelled out to. So that's pretty useful, but there were a few blind spots, and so this was a recent addition.
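A minimal sketch of the idea behind those command stats, accumulating user and system CPU time across every command a request shells out to; this is illustrative only and not Gitaly's actual implementation or field names:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runAndMeasure shells out to a command and returns how much CPU time it used.
func runAndMeasure(name string, args ...string) (user, system time.Duration, err error) {
	cmd := exec.Command(name, args...)
	if err = cmd.Run(); err != nil {
		return 0, 0, err
	}
	return cmd.ProcessState.UserTime(), cmd.ProcessState.SystemTime(), nil
}

func main() {
	var totalUser, totalSystem time.Duration

	// Pretend these are the git commands a single RPC shells out to.
	for _, args := range [][]string{
		{"git", "version"},
		{"git", "version"},
	} {
		u, s, err := runAndMeasure(args[0], args[1:]...)
		if err != nil {
			fmt.Println("command failed:", err)
			continue
		}
		totalUser += u
		totalSystem += s
	}

	// Cumulative CPU time for all shelled-out commands in this "request".
	fmt.Println("user:", totalUser, "system:", totalSystem)
}
```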
A
So the concurrency limit was one, and then we have a few other mechanisms. Spawn tokens are a way of protecting against excessive shelling out, and this is basically another queue: if there is contention, RPCs are going to have to wait, and being able to see that we were waiting is a useful addition.
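To make the "where did the time go" question concrete, a toy breakdown of a single slow RPC might look like this; the numbers and category names are invented for illustration, not taken from a real Gitaly log:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical timings for one slow RPC.
	total := 30 * time.Second
	concurrencyQueueWait := 22 * time.Second // waiting for the per-repo concurrency limiter
	spawnTokenWait := 3 * time.Second        // waiting for a spawn token before shelling out
	commandTime := 4 * time.Second           // CPU/wall time in git subprocesses

	unattributed := total - concurrencyQueueWait - spawnTokenWait - commandTime

	// With these numbers the alert is about queueing, not about git being slow.
	fmt.Println("queue wait:", concurrencyQueueWait)
	fmt.Println("spawn token wait:", spawnTokenWait)
	fmt.Println("git commands:", commandTime)
	fmt.Println("unattributed remainder:", unattributed)
}
```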
A
So let me maybe open a few of these and we can see what the logs look like. This one didn't have an example, but I had an example in one of these for sure... this one.

A
So cat-file, I'm not going to go into what it is, but it's a mechanism in Gitaly, and you can do different operations against this cat-file cache, and so we can see:
A
Okay, the overall duration spent in cat-file was one millisecond, and then we've got a breakdown for three different types of operations here: we've got "request object", we've got "read object", and we've got "flush". Based on these metrics alone you can already tell: okay, we're requesting a lot of objects and we're reading a lot of objects, but we're flushing not that often, so there's actually a batching mechanism involved.
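As a quick sketch of that interpretation (counter names and values are illustrative, not the real Gitaly log fields):

```go
package main

import "fmt"

func main() {
	// Illustrative cat-file cache counters for one request.
	requestObject := 120
	readObject := 120
	flush := 3

	// Many requests and reads per flush suggests batching is happening:
	// objects are requested and read in batches before each flush.
	fmt.Println("requests:", requestObject, "reads:", readObject, "flushes:", flush)
	fmt.Printf("objects read per flush: %.1f\n", float64(readObject)/float64(flush))
}
```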
A
Yeah, so I think these recent additions are going to be really useful for diagnosing alerts and other latency issues in Gitaly. I just wanted to highlight that and see if there are any questions or comments on it.
A
Yeah, the Gitaly logs use milliseconds, so I based it on that. But I agree, it would be nice to have consistency, and in fact it would be nice to also have consistency in field names across logs, which is pretty terrible right now; even just going from Rails to Workhorse, the path is in a field called URI, and things like that.
D
So, Bob, also on the milliseconds: I don't know why I think this, but I feel that for logs it's less important to have them all in the SI unit than it is with metrics. I don't really know why, but certainly all the gRPC logs that we've used forever have always used milliseconds, and it feels like the right thing; it's easy to grok.
A
Yeah, the thing I was going to say about this is that what's important to me is that it's consistent within a given log stream, so that you can compare the fields and chart them against each other. What's nice to be able to do is have a single chart with all of the duration fields summed.
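A minimal sketch of why a consistent unit and naming convention within one log stream matters: if every duration shares one unit and one suffix, you can sum them mechanically and chart them against the total. The field names here are assumptions for the example:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// One hypothetical structured log entry with consistently named duration fields.
	line := []byte(`{"duration_s": 2.4, "db_duration_s": 1.1, "gitaly_duration_s": 0.8, "redis_duration_s": 0.2}`)

	var entry map[string]any
	if err := json.Unmarshal(line, &entry); err != nil {
		panic(err)
	}

	var attributed float64
	for field, value := range entry {
		v, ok := value.(float64)
		if !ok {
			continue
		}
		// Sum every component duration (everything ending in "_duration_s").
		if strings.HasSuffix(field, "_duration_s") {
			attributed += v
		}
	}

	total, _ := entry["duration_s"].(float64)
	fmt.Printf("attributed: %.1fs of %.1fs total\n", attributed, total)
}
```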
D
But yeah, I think that point three is a good thing to talk about with Igor. Is there a way that you can communicate this information to, say, staff-level or support engineers? They might also find it super helpful for self-managed instances, where Gitaly is one of the big performance issues, particularly with weird file systems and such. So how would you... how would you?
A
Yeah, okay, so moving on to the second one. This is related, and it's going from Gitaly to the registry now, which has really sparse logs. To be fair, I don't feel like I need to go looking at registry logs very frequently; it's been a fairly stable service, and I guess I have not had the need that often. But we did have a recent incident around some database performance issues in the registry, and it was kind of hard to pinpoint, just because we don't have this information. The step of going from an alert to the external system that is most affected is tough, because what do you do?
A
Maybe you go metrics fishing and hope you find something. So I did open an issue to promote this for the registry as well; it's basically the same thing, doing for the registry what we have in other places. And there are sort of two questions that came out of this for me. One of them is the maturity model, which currently doesn't say much about what your logs should look like; it just says they need to be structured JSON logs.
A
That's it. So I'm curious what people think about giving a bit more guidance on what the logs could look like or what they should contain. You know, we talked about the Elastic Common Schema; this could be our own version that at least says this type of information needs to be present, and maybe gives some recommendations for units and field names for newly created logs, so that at least we have consistency going forward.
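For example, a minimal "required fields" baseline could be sketched as something like the following. This is purely illustrative, not an agreed GitLab schema; the field names and units would be exactly the things such a guideline needs to decide:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// baseLogEntry sketches the kind of fields a log-format guideline could require
// from every service, with consistent names and one agreed unit for durations.
type baseLogEntry struct {
	Time          time.Time `json:"time"`           // RFC 3339 timestamp
	Severity      string    `json:"severity"`       // e.g. "info", "error"
	CorrelationID string    `json:"correlation_id"` // propagated across services
	Message       string    `json:"message"`
	DurationS     float64   `json:"duration_s"` // same unit everywhere
}

func main() {
	entry := baseLogEntry{
		Time:          time.Now().UTC(),
		Severity:      "info",
		CorrelationID: "abc123",
		Message:       "request finished",
		DurationS:     0.042,
	}
	out, _ := json.Marshal(entry)
	fmt.Println(string(out))
}
```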
D
I always feel that it should be, not that it is, but it should be owned by Scalability, because it's all application performance. Possibly. But I don't think they've got any committers on it at the moment, or maintainers, for example.
D
LabKit Ruby, yeah. But certainly one of those two teams should become more involved, because, like you say, it's in a pretty bad state: it's Igor, Steve, myself, and I don't know who else, and obviously Bob on the Ruby one as well, and Matthias.
A
Yeah, and I guess I'm also not sure whether we have capacity to do this now, or how it weighs against some of the other priorities that we've got. So maybe this is also more of a long-term thing to keep in mind as we finish up the stuff that we're working on at the moment.
C
I don't know what it looks like currently in LabKit Go, but I liked what we did.
C
The
lab
kit,
Ruby,
actually
Matthias,
started
this
all.
He
moved.
We
had
a
bunch
of
loggers
and
stuff
that
lived
in
gitlab
rails
only
and
then
we
used
lab
kit
to
call
out
fields
to
put
in
those
log
messages.
So
every
time
we
use
the
logger
where
the
gitlab
app
logger,
that
is
a
Json
logger
or
an
import
logo,
or
anything
like
that.
C
We
would
call
out
to
the
const
context,
so
lab
kit
context
and
say
get
me
all
the
fields
to
add
to
the
logs
and
now
Matthias
a
while
back
has
moved
all
of
the
loggers,
the
structured
loggers
into
lab
kit
itself.
So
it
means
we've
got
the
Json
logger
there
now
and
this
merger
Quest
changed
part
of
the
behavior.
C
Here, instead of just putting in the correlation ID and making sure that is always there, it's going to include the entire context all of the time. So that means anything that uses the JSON logger, and everything in GitLab Rails, is going to have the current context information, always. We could add more fields here and expand on that, but I like that all of these messages will now have at least these fields consistently, and we could also start propagating them across services this way.
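A minimal sketch of the pattern being described (fields carried in a context and merged into every structured log line); this is not LabKit's actual implementation or API, just the general shape:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
)

type ctxKey struct{}

// withFields attaches log fields (e.g. correlation ID, project, user) to the context.
func withFields(ctx context.Context, fields map[string]any) context.Context {
	return context.WithValue(ctx, ctxKey{}, fields)
}

// logWithContext emits a JSON log line that always includes the context fields.
func logWithContext(ctx context.Context, message string) {
	entry := map[string]any{"message": message}
	if fields, ok := ctx.Value(ctxKey{}).(map[string]any); ok {
		for k, v := range fields {
			entry[k] = v
		}
	}
	line, _ := json.Marshal(entry)
	fmt.Println(string(line))
}

func main() {
	ctx := withFields(context.Background(), map[string]any{
		"correlation_id": "abc123",
		"project_path":   "group/project",
	})
	logWithContext(ctx, "import started") // includes the context fields automatically
}
```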
D
Yeah, just on the context propagation: that's something that's quite strong in tracing frameworks. They have that sort of weird bag of things that you can push along with your trace, and it's one of those things that's really good but also kind of terrifying, because people can put 8K messages in there and pass that around to Gitaly 50 times in a request. But in general it's a really useful thing.
D
But if you go and look, that's very much part of distributed tracing: that's context propagation.
C
Partly, though with mixed success. We've limited what goes into the context from GitLab Rails, at least, so that it's tied to a known thing. There's this thing called application context in GitLab Rails, and it wants things like IP addresses, or records that are a project or a user or whatever, and then it expands information from that. But it's not like you can put arbitrary stuff in: you need to add a field, and its type and everything, into the class before you can put stuff in. So that was an attempt to, yeah...
D
Then, for passing it between services, are we just putting it in a header or something like that?
C
Passing it between services, yes, I would do that with a header. From Rails to Workhorse we could use headers, like the Rack request headers. From Rails to Gitaly we could... I don't know what that's called... the metadata thing, yeah, okay.
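A minimal sketch of that header-based propagation between two HTTP services (the header name below is invented for the example; for gRPC calls such as Rails to Gitaly, the analogous mechanism would be request metadata rather than an HTTP header):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Illustrative header name, not necessarily what production uses.
const correlationHeader = "X-Request-Correlation-Id"

func main() {
	// Downstream service: read the propagated ID and use it in its own logs.
	downstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("downstream saw correlation id:", r.Header.Get(correlationHeader))
	}))
	defer downstream.Close()

	// Upstream service: attach the ID to the outgoing request.
	req, err := http.NewRequest(http.MethodGet, downstream.URL, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set(correlationHeader, "abc123")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}
```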
C
Apparently I've got the next thing as well. While we were looking into Prometheus and Thanos... Thanos is now its own service, with its own separate dashboard. Actually, that's nicer to show, let's start with that one.
D
One of the things that's worth stating is that previously we had this dashboard split across all the environments, like we do with all of our other services, so we had production and staging. But really that didn't make any sense, because Thanos operates as a single service, right?
D
It's not multiple services, one in each environment; there's one service. And breaking it up that way actually hides a lot of the information from you. So we had to jump through a whole lot of hoops to get our monitoring system to be able to monitor across all environments as a single thing.
A
The sidecar... the sidecar and the Thanos store.
D
Yeah, I think that's something we should definitely consider. Bob and I were also talking about doing the same thing with the logs, although I'm pretty skeptical at the moment that those logs are being collected properly, because I've been logging onto the Thanos router box to get logs that I can't find in any of the Elasticsearch logs at the moment.
C
For the ruler, I am proposing to add a new SLI, this one, which is supposed to measure how long a rule group takes compared to its interval. If the execution of all the rules in the group takes longer than the interval, that's going to count as an error on the SLI, and I wanted to add this one for Prometheus and for Thanos.
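The proposed SLI boils down to a simple comparison per rule group; a minimal sketch of that condition (Prometheus and Thanos expose the last evaluation duration and the configured interval per rule group as metrics, though the exact metric names are not asserted here):

```go
package main

import (
	"fmt"
	"time"
)

// ruleGroupExceedsInterval captures the proposed SLI: a rule group evaluation
// counts as an error when evaluating all rules in the group takes longer than
// the group's configured interval.
func ruleGroupExceedsInterval(lastEvalDuration, interval time.Duration) bool {
	return lastEvalDuration > interval
}

func main() {
	fmt.Println(ruleGroupExceedsInterval(75*time.Second, 60*time.Second)) // true  -> SLI error
	fmt.Println(ruleGroupExceedsInterval(20*time.Second, 60*time.Second)) // false -> OK
}
```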
C
So that's the one I'm working on now, and I want it because, for the feature category metrics, we use intermediate recording rules, but those are not recorded in the same group. I want to get rid of that indirection, because I think some of the errors, like an apdex ratio being bigger than one, bigger than 100%, and so on, are coming from that interaction.
E
The problem is, and we've talked about this before, that we're using a success ratio for one and an error ratio for the other, like if the...
D
I'll need to look through it, but the reason I'm thinking about this is: we've got this rule evaluation, right, and we've already got an error rate for it, which is when the rule gave a warning or an error when it was evaluated, and that's quite a nice error rate. But we don't have any sort of apdex or any sort of latency measurement or latency SLI for rule evaluation, and this really is a latency thing: it's saying it's taking too long. But I know that it's really hard to measure.
D
I think you have to build a custom metric for the apdex, because it's not a histogram. So normally for an apdex we...
D
But if we can, I just think that for someone a year down the line who's on call and gets an alert, it will be clearer to them that this is a slowness problem rather than an error, because it's coming out as a slowness SLI. But yes, from an affordance point of view, if you want to say...
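Since there is no histogram to derive an apdex from, one way to express it is a custom counter pair: total evaluations, and evaluations that finished within the threshold, with the ratio computed at query time. A minimal sketch using the Prometheus Go client; the metric names are illustrative assumptions, not an agreed design:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	evalTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "rule_group_evaluations_total",
		Help: "Total rule group evaluations observed.",
	})
	evalWithinInterval = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "rule_group_evaluations_within_interval_total",
		Help: "Rule group evaluations that finished within the group interval.",
	})
)

// observeEvaluation records one evaluation; the apdex-style ratio is then
// within_interval / total, computed at query time.
func observeEvaluation(duration, interval time.Duration) {
	evalTotal.Inc()
	if duration <= interval {
		evalWithinInterval.Inc()
	}
}

func main() {
	prometheus.MustRegister(evalTotal, evalWithinInterval)

	// Example observations.
	observeEvaluation(20*time.Second, 60*time.Second) // within interval
	observeEvaluation(75*time.Second, 60*time.Second) // too slow
}
```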
D
Bob put this on here, but I'm happy to talk about it. So Chance is doing this ownership exercise for all the services, and I think it's really important, especially as the new Reliability teams are forming. I think it'll help them get a real, solid idea of which of those teams owns which services, and also, for our teams on the platform side, what we own. And then Rachel asked a really good question. We said we're going to get the managers to choose which things their teams own, and Rachel said: okay, but
D
what does ownership actually mean? Which is a very good, but I think also very hard, question to answer, and one that could get fairly philosophical. What I suggested was that we just keep it very pragmatic for now: effectively, what it means is that when something gets labeled in the tracker with service::X, we know which team is going to triage that issue.
D
It
basically
false
you
know,
and
we
can
write
a
triage
Ops
tool
for
basically
assigning
the
team
based
on
that
and
it's
their
responsibility
to
keep
the
the
triage
on
those
issues,
and
that
doesn't
mean
that
they're
going
to
fix
everything
but
like
at
the
least
they
can
say,
won't
fix
or
we're
going
to
put
this
on
the
schedule.
We're
going
to
reassign
it
to
a
different
team.
But
they
they
are
the
the
kind
of
entry
into
that.
D
And
you
know,
via
that,
all
the
capacity
planning
issues
or
if
there's
any
sort
of
availability
issues
will
arrive
at
them,
because
they
are
that
first
line
of
of
contact
for
that
service.
So
that's
kind
of
it
and
then
kind
of
sort
of
more
sort
of
broadly,
as
a
future
thing
which
is
kind
of
might
be
controversial,
is
you
know
the
scalability
team's
done
a
lot
of
great
work
to
kind
of
push
error
budgets
left
in
services
like
web
and
API
and
sidekick
that
are
running
application
services?
D
And
maybe
it's
also
time
to
start
thinking
about
for
some
for,
like
a
limited
number
of
services,
also
having
some
sort
of
error
budgets.
But
on
the
infrastructure
side
so
like
the
one
that
I'm
thinking
of
most
is
to
try
and
prevent
the
Thanos
problem
in
future
right.
So
if
Thanos
is
kind
of
degrading
and
kind
of
getting
worse
and
worse
and
worse,
we'll
get
to
a
point
where
we
say.
Okay
like
this
is
something
that
needs
eyeballs
on
it
and
we
start
doing
the
same
sort
of
error.
D
Budget
reporting
as
we
do,
for
application
teams
on
the
application
side.
But
for
certain
infrastructure
services
like
I,
don't
think
it
would
make
sense
to
have
double
reporting
for,
like
the
web
service,
for
the
application
teams
and
for
the
infrastructure
teams
like
I,
think
we've
got
that
service
covered,
but
there's
a
whole
bunch
of
services
that
don't
have
the
same
kind
of
coverage,
and
maybe
it
makes
sense
to
start
doing
the
same
sort
of
accounting
for
for
those
services.
And
it's
very
much
like
a
very
rough
early
sort
of
idea.
C
Right now, we in Scalability have feature categories that apply to us and that we can use inside the application, and that can trickle down into SLIs and so on. Why can't we do that for all teams in Reliability, for example for Thanos?
D
The team, yeah, exactly, and that's the team ownership, that's the connection. But "feature category" I think is a reserved word at GitLab, because it very specifically means one thing; but yes, some categorization that we roll up, so we can look at it at a higher level. One of the things with the error budgeting was that we could roll it up to the level of teams and then to the level of directors.
D
Whatever we call it, naming's hard, but you're saying that Thanos and Prometheus are, say, "metrics", and CP is "exception management", and then in future, if those two things go to different teams, we don't change anything except for that attribution, in the same way that feature categories work, yeah.
D
"Service category" might be a good name for that. Yeah, I mean, it works very well on the product side to have those feature categories, and then teams subscribe to them. I don't know if you've followed, but there's a file in the handbook called stages.yml, and then things get moved around and we don't have to worry about rearranging, because we just follow the feature categories in that file.
D
Yeah, so do we do the feature categories now, as part of that service category thing, or do we just do the ownership now and then do that as a second stage? Probably better not to nerd-snipe it and just get it done, and then we can break.