From YouTube: Scalability Demo Call - 2021-05-06
B
So I'm not going to share my screen, you can all read the issue, but I am, I don't know, I was with Craig. I was discussing database transaction time, like real database transactions.
So yeah, I was discussing like time and SQL query timings with Craig and the merchant question. Then I noticed: hey, these things have a feature category label, and I was wondering if and how we should add those to the error budget. So we have thresholds set for queries, how fast they need to perform, and if an endpoint is performing slow queries, we could ding them for it.
D
So if we're using histogram buckets, we don't actually know, like, we know if they're satisfied or not satisfied by the timings, right, like you know if the timings are good or bad. But I was thinking another way we could do that is to have like a counter of database seconds consumed by this whatever, and say that you have a budget of, like the field in the logs, yeah. So you have a budget of x seconds per thing that you're doing, and we can just say if you're exceeding that budget or not.
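A minimal sketch of what that counter could look like, assuming Go with the Prometheus client library; the metric name, label, and budget mechanism here are hypothetical, not the real GitLab ones:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter of database seconds consumed, keyed by the
// feature category label mentioned above.
var dbSecondsConsumed = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "db_seconds_consumed_total", // assumed name
		Help: "Wall-clock seconds spent executing SQL queries.",
	},
	[]string{"feature_category"},
)

func init() {
	prometheus.MustRegister(dbSecondsConsumed)
}

// observeQuery adds one query's duration to its category's total.
// Whether a category is exceeding its budget of x seconds would then
// be a rule over the rate of this counter, outside the application.
func observeQuery(category string, start time.Time) {
	dbSecondsConsumed.WithLabelValues(category).Add(time.Since(start).Seconds())
}
```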
C
It sounds almost like this is orthogonal to error budgets, yes, like, if people have a feature that's fast, then we don't care what it does as long as it's fast. And if we want to say that the database is some sort of resource, and we want to give people a quota of how much they can use the database, that's a nice idea, but that doesn't sound like it's error budgets; that's more about stopping people from using the database too much.
E
Also, surely what you can do is you can have like a measurement of, I think, maybe this is what you were saying, so excuse me if I'm paraphrasing you, but it's basically a histogram of the sum total for the request, right, and then it's just a normal Apdex after that, so you get, that's what's going on, and.
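A sketch of that idea, assuming Go and Prometheus; the metric name and bucket boundaries are made up, with the buckets sitting at the thresholds a normal Apdex calculation would use:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical histogram of the summed database time per request.
// With buckets at an assumed "satisfied" (0.1s) and "tolerable" (0.4s)
// threshold, the usual Apdex formula applies directly:
//   apdex = (satisfied + tolerable/2) / total
var dbTimePerRequest = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "request_db_duration_seconds", // assumed name
	Help:    "Total time spent in the database while serving one request.",
	Buckets: []float64{0.1, 0.4},
})

func init() {
	prometheus.MustRegister(dbTimePerRequest)
}

// finishRequest observes the database time accumulated for one request.
func finishRequest(dbSeconds float64) {
	dbTimePerRequest.Observe(dbSeconds)
}
```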
E
Suggesting, yeah, yeah. And then one is kind of, the one where we measure each request, we actually attribute that to Patroni and not to a particular team, because it's a way of kind of judging the database. But this one would be not Patroni, it would be the actual, yeah, then it would be each category, yeah.
B
It is not so important, it's how we, I like the idea of using a separate thing. And Jakob, to reply to what you said, it's a way of like seeing if people are nice to resources, and I think of it, like I think Andrew mentioned that before too, it's a way of having mechanical sympathy things included in those budgets. Like, then mechanical sympathy becomes how nice you are, like if you're doing a thousand queries in your request, I think that would currently show. I don't know, do we have mechanical sympathy for the database?
C
I was just going to say that if something is slow, like, the users don't care how much something uses the database. So if something uses the database in a bad way, it will show up from the user perspective.
The system, by having, so we want to have the teams make good decisions about what to focus on, and we can have a simple, simpler model where we just say: this thing is slow too often of the time, so something's wrong. Or we can have a complex model where we say: you're using the database a lot, so maybe you're doing something wrong, and you should look at that. But the.
E
Yeah, I see where you, like, basically, you know, the whole thing with the service level monitoring is you're measuring the service quality from the user's point of view, and that's where error budgets come from, and maybe we shouldn't be confusing this with that. Is that another way of putting what you're saying?
C
Yeah, that's one way, and another way to look at it, and I don't have a good view of the bigger project, but that's the way I understand it: we're trying to make a big step in adoption of error budgets across the organization, and if we can get a good result from people, if we can have a simple system that gives useful signals and that changes.
A
The error budgets that we're putting in front of people at the moment, by and large a lot of the stage groups are seeing red bars. So whether or not we add the database information now, it would probably just make the red bars even redder, like it would make it worse. So I think that we've already given them quite a lot to focus on and quite a lot of things that they can improve, and it may be that in future, once things are more green, we can then decide, right.
B
I think it would be easier to track down, because the whole thing all comes down to what was good and what was bad, and you get points to score and points to lose, and if you've lost points on database time and on request time, then you know what to look for, and in other places, like: where am I executing too many queries? But we can start with what we have, for sure.
E
Just one other thing: in addition to that, which I think is a really, really important point, there will also be two things contributing to that error budget, right. Because if you imagine a request coming in, and there's one request that's really, really slow because it goes to an external thing, or it just spins, it's just "for i equals one to a million".
"For i equals one to a million." But on the one, if you look at it overall, they will get like 50, and on the other one they will get zero percent, right. And that's not really fair, because the user doesn't, in both cases it was a really slow request, but we're judging the one, because it was a database problem, by two things rather than one. Does that make sense?
C
You're diluting the strong signal that the user had a bad experience by having this conflicting signal that the database was fast, which the user doesn't care about.
E
Yeah, because if it was like external HTTP requests, and we don't have that covered by this, then you know, we ignore that, and so they only lose one point for that, because it wasn't the database, which is the thing that we're watching on this. I think it's better, as you said, to keep it simple.
B
But we do have SLIs that are not user-faced, like, they are user-facing, but they're not, like, Gitaly is one of them, like the RPC speed that currently gets attributed to the Gitaly feature category or whatever.
C
If Gitaly time, time spent in Gitaly, is part of the error budget right now, I think I would say it shouldn't be, for the same reasons.
F
With the topic, just as a brief tangent, I think it's maybe worth mentioning that if we do have consumers of database connections that really do go through a churn of, just making the numbers up, a thousand connection leases per incoming request.
That is a system-wide degradation class of events, and I don't think the chargeback should necessarily be, we're not really talking about chargebacks, and this doesn't strictly fall in the category of error budgets, but it does fall in the category of overall Apdex jeopardy for the system as a whole, not necessarily for the particular component that's consuming that high churn of lease events. Does that make sense?
Problem in general, yeah, exactly, it would exactly have a knock-on effect for all of the consumers that compete for that connection pool. Do we have a reason to suspect that that's happening now?
E
Certainly on Gitaly, with the Gitaly N+1s, we used to have like five, six thousand gRPC requests, and luckily, over time, with lots of work, we've got that down.
I was just looking through the mechanical sympathy alerts, and I'm not sure whether we're below the threshold or whether we don't have an alert on that, but I'm going to check after this, because then we could look in that channel and see what the worst-case N+1s are. Because I don't see it here, but I know that there's some that are like pretty bad. Gotcha, okay, yeah, cool, thanks.
C
So I put something on the agenda to look a little bit, to talk a little bit about the possible future project I'm working on, which is this thing with the gRPC bottleneck, and I'm not really sure what to talk about, because I didn't really know what audience to expect.
So I can talk a little bit about what we discovered, or give a chance to ask questions about what we discovered, and I can also talk a little bit about some of my ideas for how we can actually do something about it.
Okay, so, what we can do about it. Okay, I think I'm gonna briefly talk about the problem, because I see it's been mentioned in the infrastructure group call and it's getting more attention, and I'm still trying to figure out how to present the problem to people, and what's there, because there are different angles to it.
I think I made that less prominent in the way I presented the findings, but if you're coming from a technical angle, that might be more interesting: just how much memory we're wasting, or how many, which memory allocations we're wasting. And so I want to briefly talk about that, I think, just so that we're on the same page on a technical level.
So, like, the real data: if you want to have in-depth data on the findings, there's just one thread on issue 1041 where I collected data for the different scenarios that I summarized in the table, and the one I want to talk about, the interesting one, is the memory allocations. So I have.
This is a capture of memory allocations when, which one did I just click, that was the first one, so that is Gitaly and a regular GitLab when there are no cache hits, and then we're stalling on the CPU. Basically we're hitting the gRPC bottleneck here, and this is a 30-second profile, and you see here 100 gigabytes of allocations, and this is absolutely horrible, and I didn't really.
If you want to transfer data from one thing to the other in Unix, you need a buffer, and you say, at minimum you need a buffer, and you say to the kernel: please read some data into this buffer. Then you want to write it somewhere else, so you say to the kernel: please write the data in this buffer to this other thing, but you can reuse that buffer. You can do that 10,000 times in a loop and reuse one buffer, never allocate a new one. That's how it's supposed to be.
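In Go, that reuse-one-buffer loop is what io.CopyBuffer does; a minimal sketch of the pattern being described, with the 32 KB size matching the number mentioned later:

```go
package main

import (
	"io"
	"os"
)

// copyWithOneBuffer moves all data from src to dst while allocating a
// single 32 KB buffer, no matter how many gigabytes pass through it.
func copyWithOneBuffer(dst io.Writer, src io.Reader) (int64, error) {
	buf := make([]byte, 32*1024) // allocated once, reused every iteration
	return io.CopyBuffer(dst, src, buf)
}

func main() {
	// Example: stream stdin to stdout. The kernel fills buf, we write
	// buf back out, and the same buffer is reused until EOF.
	copyWithOneBuffer(os.Stdout, os.Stdin)
}
```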
C
But what happens with gRPC is that we read into a buffer, and then we want to send it out as a gRPC message, and then gRPC allocates memory two or three times for the data in that buffer, and then sends that out and throws the memory away again, in a loop. So ten thousand times you are allocating memory to hold a copy of the buffer and throwing it away, and from.
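For contrast, the allocation pattern being described looks schematically like this; a sketch of the shape of the problem, not the actual gRPC-Go internals:

```go
package toy

import "io"

// streamWastefully is the schematic version of the wasteful loop:
// every iteration copies the read buffer into freshly allocated memory
// for the outgoing message, which the garbage collector must reclaim.
func streamWastefully(dst io.Writer, src io.Reader) error {
	buf := make([]byte, 32*1024)
	for {
		n, err := src.Read(buf)
		if n > 0 {
			msg := make([]byte, n) // a new allocation per message
			copy(msg, buf[:n])     // a copy of data we already had
			if _, werr := dst.Write(msg); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```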
C
This is three gigabytes per second that you're allocating, and you only need to allocate 32 kilobytes once in the request, and then you can transfer as many gigabytes as you want. And you can see this if I scroll down the threads to, let's see. So the alternative, like the toy implementation, if I look at the memory profile there, it says seven megabytes, 7.5 megabytes, and a lot of this is done by the profiling system, because the profiler is running.
I just wanted to, I don't know if I highlighted this enough in the original presentation, just how crazy this is, and yeah. So that's one thing, and then the other thing I can talk about a little bit is sort of a walkthrough of how the toy version works.
If that's interesting. Okay, I see some nodding. Let's see, I wanted to do this, let's just do it in here, I'm already sharing this. So the easiest way to start, I think, is to look at the client.
So this is the program that emulates Workhorse, and this is the function where we handle PostUploadPack, which is the thing that transfers the bulk of the data. And what happens is that it calls this magical transport Call function, and it says it wants to do PostUploadPack. Now, in this toy we cannot say what repo we want to clone, it always clones the same repo, so in real life this would not just say I want to do.
This is some stuff in case the client compressed the inputs, but really all that happens here is that we take the HTTP request body and we copy it into the connection, so this allocates 32 kilobytes to do the copy, no matter how much data you're copying. Then we call CloseWrite on the connection, which signals to the server that no more data is coming, and then we copy the data from the connection back into the response body.
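A sketch of what that client handler could look like in Go; the function name and the way the connection arrives are assumptions about the toy's shape, not the real code:

```go
package toy

import (
	"io"
	"net"
	"net/http"
)

// handlePostUploadPack emulates what Workhorse does with the toy
// transport: stream the request body in, half-close, stream the reply
// out. conn is assumed to come from the transport's Call function
// (sketched below) with the "OK" handshake already done.
func handlePostUploadPack(w http.ResponseWriter, r *http.Request, conn *net.TCPConn) {
	defer conn.Close()

	// io.Copy allocates one 32 KB buffer internally, however much
	// data flows through it.
	if _, err := io.Copy(conn, r.Body); err != nil {
		return
	}

	// Half-close the write side: tells the server no more data is
	// coming, while we can still read the response.
	conn.CloseWrite()

	// Copy the server's response back into the HTTP response body.
	io.Copy(w, conn)
}
```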
C
This is how I wanted to talk you through it. So transport has a function, Call, and it takes as arguments a function that can create a connection, and the request, which is just some bytes.
So that connection, that thing, is passed in. So first, here, we get a connection, and then it calls this send frame thing and it sends the request, and it needs to obey a deadline when it does that. And then it reads one frame back on the connection, and it compares that to a magic string response, okay, which is literally the letters "OK", and if it sees that, then it gives the connection back to the caller.
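Put together, Call could look roughly like this; a sketch under the assumptions above, with sendFrame and recvFrame following further down:

```go
package toy

import (
	"bytes"
	"fmt"
	"net"
	"time"
)

// okFrame is the magic response: literally the letters "OK".
var okFrame = []byte("OK")

// Call dials a connection, sends the request as one frame, and waits
// for the server to accept. On "OK" it hands the raw connection back
// to the caller; from then on there are no layers in between.
func Call(dial func() (net.Conn, error), request []byte) (net.Conn, error) {
	conn, err := dial()
	if err != nil {
		return nil, err
	}

	deadline := time.Now().Add(10 * time.Second) // assumed deadline
	if err := sendFrame(conn, request, deadline); err != nil {
		conn.Close()
		return nil, err
	}

	resp, err := recvFrame(conn, deadline)
	if err != nil {
		conn.Close()
		return nil, err
	}
	if !bytes.Equal(resp, okFrame) {
		conn.Close()
		return nil, fmt.Errorf("server rejected request: %q", resp)
	}
	return conn, nil
}
```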
C
Still with me? So let's look for a moment at what send frame and receive frame are, so what does that do. I decided, like, let's say a frame is, I needed some sort of chunk of data, and I thought one megabyte is enough to fit like the request metadata for a gRPC call, like we have Gitaly feature flags and authentication metadata, correlation IDs and whatnot; hopefully that all fits in one megabyte.
So first we check it's not more than one megabyte. We set a deadline here, so we don't stall. Then we write the length of the frame, so say the frame is 12 bytes, in this case we'll first write the number 12 in binary on the connection, and then we write those bytes on the connection, and then we remove the deadline again.
So that's all that does, and the opposite reads the length header from the connection into a four-byte buffer, and then it allocates a new buffer for the frame we're receiving, and it reads that many bytes into the frame and returns it to the caller. So it's just exchanging a blob of bytes with a length prefix.
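A sketch of those two functions as described: a four-byte length prefix, a one-megabyte cap, and deadlines around the I/O. The big-endian encoding is an assumption:

```go
package toy

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"time"
)

const maxFrameSize = 1 << 20 // one megabyte per frame

// sendFrame writes len(data) as a 4-byte prefix, then the bytes,
// obeying the deadline and removing it again afterwards.
func sendFrame(conn net.Conn, data []byte, deadline time.Time) error {
	if len(data) > maxFrameSize {
		return fmt.Errorf("frame too large: %d bytes", len(data))
	}
	if err := conn.SetDeadline(deadline); err != nil {
		return err
	}
	defer conn.SetDeadline(time.Time{}) // remove the deadline again

	var header [4]byte
	binary.BigEndian.PutUint32(header[:], uint32(len(data)))
	if _, err := conn.Write(header[:]); err != nil {
		return err
	}
	_, err := conn.Write(data)
	return err
}

// recvFrame reads the 4-byte length header into a small buffer,
// allocates a new buffer of that size, fills it, and returns it.
func recvFrame(conn net.Conn, deadline time.Time) ([]byte, error) {
	if err := conn.SetDeadline(deadline); err != nil {
		return nil, err
	}
	defer conn.SetDeadline(time.Time{})

	var header [4]byte
	if _, err := io.ReadFull(conn, header[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(header[:])
	if n > maxFrameSize {
		return nil, fmt.Errorf("frame too large: %d bytes", n)
	}
	frame := make([]byte, n)
	if _, err := io.ReadFull(conn, frame); err != nil {
		return nil, err
	}
	return frame, nil
}
```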
C
So the server, it takes a listening address, and I put that in a global variable. So what does the RPC server do? It creates a TCP listener, just with the Go standard library, and it creates a transport server with a handle function, and it tells it to serve on that listener.
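That setup might look like this; the names are guesses, and Session and Serve are sketched right after:

```go
package toy

import "net"

// listenAddr stands in for the global listening address mentioned above.
var listenAddr = "localhost:8080"

// Server couples a listener with the handler that decides what to do
// with each incoming request frame.
type Server struct {
	Handler func(s *Session, request []byte) error
}

// runRPCServer creates a TCP listener with the standard library and
// serves the transport protocol on it.
func runRPCServer(handler func(*Session, []byte) error) error {
	l, err := net.Listen("tcp", listenAddr)
	if err != nil {
		return err
	}
	srv := &Server{Handler: handler}
	return srv.Serve(l)
}
```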
C
We create a git command, we say: standard input is the connection, standard output is the connection, run the command, and then the bytes get copied. Still with me?
So we have a connection, we haven't read anything yet, so the first thing we do is read a frame, and those are the request bytes. And then we create a server session object, which holds the connection and the deadline, and then we call the handler with that session object and the request, and if the handler returns any sort of error, we call reject on the session.
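The per-connection logic just described might look like this sketch, with recvFrame being the frame reader from above:

```go
package toy

import (
	"net"
	"time"
)

// Serve accepts connections and handles each one in its own goroutine.
func (srv *Server) Serve(l net.Listener) error {
	for {
		conn, err := l.Accept()
		if err != nil {
			return err
		}
		go srv.handleConn(conn)
	}
}

// handleConn reads the request frame, wraps the connection in a
// session, and lets the handler decide to accept or reject it.
func (srv *Server) handleConn(conn net.Conn) {
	deadline := time.Now().Add(10 * time.Second) // assumed

	// Nothing has been read yet: the first frame is the request bytes.
	request, err := recvFrame(conn, deadline)
	if err != nil {
		conn.Close()
		return
	}

	session := &Session{conn: conn, deadline: deadline}
	if err := srv.Handler(session, request); err != nil {
		// Any sort of error from the handler rejects the session.
		session.Reject()
	}
}
```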
C
So what is the session? The session holds the accepted connection, it remembers if it's been accepted, and it remembers the deadline. And so the server can choose to accept the connection, and when it does that, that is when it sends back the okay frame; like, the magic bytes "OK" get sent back when the handler code says: okay, I want to do this connection. So the handler also has a choice to say: I look at this request, I don't know what I'm supposed to do with this.
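And a sketch of that session type; the Accept and Reject methods are hypothetical names, and what Reject does beyond closing the connection is a guess:

```go
package toy

import (
	"net"
	"time"
)

// Session holds the accepted connection, whether it has been accepted,
// and the deadline, as described above.
type Session struct {
	conn     net.Conn
	deadline time.Time
	accepted bool
}

// Accept sends the magic "OK" frame back and hands the raw connection
// to the handler; after this there is nothing between handler and TCP.
func (s *Session) Accept() (net.Conn, error) {
	if err := sendFrame(s.conn, okFrame, s.deadline); err != nil {
		s.conn.Close()
		return nil, err
	}
	s.accepted = true
	return s.conn, nil
}

// Reject is called when the handler does not understand or does not
// want the request; here it simply closes the unaccepted connection.
func (s *Session) Reject() {
	if !s.accepted {
		s.conn.Close()
	}
}
```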
B
So the interesting bit is how you'd fit that, because the interesting bit is now where you had the case like: if I get this, I'm going to do this Go thing, and you want to fit that, you want to find a way to do that in Gitaly. Is that?
C
Yeah, well, there are two pieces to the plan, one that I didn't just show you, because I think it's not polite to everybody to take that much more time. But one is that this basic thing I showed you, you can enrich that so that it carries all the metadata of a gRPC call. So you could say: here's a method, here's a protobuf binary encoding of a request message, and here are the headers that include authentication, correlation ID and whatnot. So you can build this.
But the point is that in the middle of the gRPC call we then pull out the TCP connection and we use that instead of gRPC, because otherwise we're no better off than we were before. But the reason I want to wrap it in a sort of gRPC-ish layer is that we can use middlewares: we can get all our logging and Prometheus counters and authentication, that would all work exactly the same, correlation IDs, tracing.
The funny thing is that Gitaly already does something creative when Praefect connects to Gitaly. So Praefect can start a connection that is not gRPC, and Gitaly has a way to hook into the gRPC library and detect that the connection is not gRPC and treat it differently, and right now that is used only in Gitaly for something, it's called backchannel, and it's used specifically for Praefect stuff.
So we can generalize that and say: well, okay, right now you know one type of non-gRPC connection, which is a backchannel connection, but we now have a different type of non-gRPC connection for this new thing that I don't have a name for, and then we do that. So that way we could have it be part of Gitaly.
That layer is super inefficient, but it's yet another layer, and one thing I try to achieve with this toy thing, and I try to work into the design, is to end up in a situation where first there's an exchange between the client and the server about what we are going to do, and then all that stuff gets out of the way and you just get a connection. Like, I want zero layers in between, so that we have the maximum opportunity to do whatever, yeah.
So we can go as fast as possible. So I'm trying to avoid layers.
Well, thanks for letting me talk about code at you for 15 minutes.
A
Can I ask you a quick question before moving off the topic? Yeah, so, reading through your thoughts on how you're going to get there, and that whole issue where you've been writing up all of your notes: at what point are you going to, like, just write up the conclusion? Like, I'm trying to figure out how much further you're going to take it before you say, "but this is the conclusion", and will you get there before you go on leave?
C
I think I should be able to. I find it difficult to get this out of my head, because I have it all worked out in my head, but that's no good, because nobody can look in my head. So for me it's a challenge in somehow getting it out of my head, but the way I've been approaching it is just pulling different threads out of my head and starting an issue thread and writing that. But I think I'm running out of threads in my head.
A
Great, thank you, because then we'll be able to take that, the proposal and the idea, and figure out how we'll slot that into the workload that we have, and when it would be right to pick it up.
E
Okay, just before I do that: I thought I would write that mechanical sympathy alert for the SQL N+1s, because we don't have one. And it uses, it's not like single values, it uses like a p95 of how many requests a call makes to the database, and we've got some rather bad things: the worst seems to be 475 p95 per request, which is pretty terrible, and I'm looking forward to putting that in.
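A sketch of the instrumentation such a p95 could be derived from, assuming Go and Prometheus; the metric name and buckets are made up:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical histogram of SQL queries issued per request, from which
// a p95 like the 475 mentioned above could be computed, e.g. with
// histogram_quantile(0.95, rate(sql_queries_per_request_bucket[5m])).
var sqlQueriesPerRequest = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "sql_queries_per_request", // assumed name
	Help:    "Number of SQL queries issued while serving one request.",
	Buckets: prometheus.ExponentialBuckets(1, 2, 10), // 1, 2, 4, ... 512
})

func init() {
	prometheus.MustRegister(sqlQueriesPerRequest)
}

// recordQueryCount is called once per request with its query count.
func recordQueryCount(n int) {
	sqlQueriesPerRequest.Observe(float64(n))
}
```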
E
But it's something that's been bugging me as something that we need for a while, and what it is, is that we set the SLOs on our SLIs, and we aren't very good at reviewing how those SLIs are performing according to those SLOs, and that creates unhappiness, because some things always alert, because they're basically under. And then the other thing that happens is that some things are just so far off their normal value that when things go wrong, we don't trigger, because the SLOs were set.
Please remember to review the SLOs, and it can give you basically the last 28 days, because I think that's our standard now, and it says: over the last 28 days this SLI has, that, you know, this component has had this availability, and this is where the SLO is. And then you can see how far up or down it is, and so I just created a little dashboard for that.
So let's just, yeah, that doesn't sound like a great idea, Sean. Let me just try sh.
So, can anyone hear me? Yep? Yes, okay, good, okay, okay. So basically what it is, is it presents it as, like, how under or over the SLO the thing is, and so what we've got here is, there's no big surprise there, it's basically nine percent under the SLO, the shared runner queues, because of all the problems that we've had in the last few weeks.
The problem with that is, it's actually very difficult to make it any lower, because basically we're getting outside of the realm of where SLO mathematics works any longer if we make it any lower. So that's pretty much going to have to be a silence, unfortunately, until things get better. But then, with some of these other ones, we tend to sort of be sitting very, very close to the threshold, and maybe we should push them up, but I just thought it would be kind of interesting to build this little dashboard.
This Thanos compactor one I think I'm going to remove, because it almost seems to never run and we get very, very sporadic data, so I don't think it's worth even having in here; it's also an outlier. But kind of what I was imagining is we do a review of this and we basically adjust the SLOs according to where the data is, so that, you know, we're alerting on unusual behavior, we're not alerting on, like, services that are just running poorly and, you know, infrastructure doesn't have anything to do with it.
So what I want to do is take a look at this and then update the SLOs on a monthly basis, or maybe every two months, or however often, but there's two things that I think we have to do before.
We can do that, and the first is that I think we need to give each SLI its own SLO, because at the moment we set the SLO at the service level and the SLIs all have to be the same, and what we find is that they're actually quite different, and there's no real reason why they should all have the same SLO. And so that's one change that I think we should make before we do this.
B
Because I was looking into separating out API, like, GraphQL from the API, because they just look different. But my initial thought was to put the thresholds differently inside the objects for both SLIs, rather than setting a different SLO for.
E
For them. So sometimes you can do that, but it depends on the histograms, right, because, like, some histograms are like, you know, 10 seconds and 30 seconds, and you can't do fine-grained adjustments on those without changing the application. So sometimes you don't really have great things on that, but you know, that's. So if we take a look, this, by the way, clicks through.
So if you go take a look at, like, the Sidekiq one, it's a pretty good example, like the Sidekiq one, once it loads. So that's where it is at the moment, 99.5, and then, you know, we were 1.3 below that actually over the month. However, there's another part to this, and that is that we're only evaluating on the one-hour and the six-hour thresholds, and in order to make this fair, we should also be evaluating on, like, the three-day 10% threshold, to say that we, this, yeah, this one as well is also super sporadic, so I might come up with a way of filtering that out.