From YouTube: 2023-07-20 Scalability Demo
B
So we have switched over the read traffic, and the write traffic has been going on for one week, because we had to be careful around all the S1s and S2s. We also ran the migration script twice: the first time to migrate the bulk of it, the second time to validate the migration. So the write traffic was cut over yesterday and this morning, which is about 10 hours ago.
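A minimal sketch of what a two-pass migration like the one described (first pass copies the bulk of the keys, second pass validates) could look like, using redis-py; the endpoints, key pattern and DUMP-based comparison are assumptions for illustration, not the actual migration script.

```python
# Hypothetical two-pass migration: pass 1 copies keys, pass 2 validates them.
# Host names and the key pattern are placeholders.
import redis

src = redis.Redis(host="redis-cache-old.example.internal", port=6379)
dst = redis.Redis(host="redis-cluster-cache.example.internal", port=6379)

def migrate_bulk(pattern="*", batch=1000):
    """Pass 1: copy the bulk of the keys with DUMP/RESTORE, keeping TTLs."""
    for key in src.scan_iter(match=pattern, count=batch):
        payload = src.dump(key)
        if payload is None:              # key expired or was deleted mid-scan
            continue
        ttl_ms = src.pttl(key)
        dst.restore(key, ttl_ms if ttl_ms > 0 else 0, payload, replace=True)

def validate(pattern="*", batch=1000):
    """Pass 2: re-scan and report keys whose serialized values differ."""
    mismatched = []
    for key in src.scan_iter(match=pattern, count=batch):
        if src.dump(key) != dst.dump(key):
            mismatched.append(key)
    return mismatched

migrate_bulk()
print("keys still differing after validation pass:", len(validate()))
```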
B
It is no longer doing any work, at least any meaningful work. So this is the three days. Let me just refresh it. I'll just do two days. So this time was when the write traffic got cut over. You can see there's a big dip on Redis cache, and then here was when I cut over: I terminated the double-write and Redis cache just went to a very small number.
B
Yeah, so I won't be surprised to see a small number, but in general we have five shards running, and it has been very stable. I was keeping an eye on all the metrics for the Redis cluster cache, for Redis cache, and also web, API and websockets, exactly. They were fine yesterday and today; I'm just wrapping up some of the details.
B
The one thing that's interesting is that, right, so this is Redis network out. You can see that four or five of them are fairly similar, and then one of them is standing out, particularly more than the rest. But this changes with time, so I suspect this is probably just a hot key, or a key with a larger payload that is accessed more. I don't think it's particularly worrying right now; we have quite a bit of headroom on whatever it is, right.
B
Yeah, it's like 20 on the biggest one. Wait, no, no, this is where, this is cache, yeah. If we look at the bottom, there's this little cluster; they are around 20 at max, so we have time to do a deep dive on why there is that imbalance. We probably want a slightly longer data collection, so we can observe if there are any patterns; like maybe some of the keys are for some users in a particular time zone and they hit it a little bit more.
B
Yeah, well, that's about it right now. We're just going to wait and let it sit for a week and just observe before we do any sort of teardown. And given the availability concerns, it's very likely that I will wait till August, because it's already almost the end of July. We'll just wait a week or two and then we'll do the teardown later.
B
But right now it's working as expected, and I think Igor has merged it, so I've increased the severity for this from S4, which was because it wasn't really running, to S2, which is the standard for all of them. So this...
B
Some interesting things: well, I mean, cache is still quite easy per se. The next one is shared state, which is going to be fairly tricky, and I'll probably bring it up another time.
B
Like for shared state, it has pub/sub, and pub/sub works with Workhorse, so we've got Workhorse, like a Go library dependency, things like that to work with. And what else? There are exclusive leases, which is like a locking kind of thing. So if you do a SETNX, you can't do a double-write for SETNX, because the second write will step on the first and the values will be set differently. So there are these little intricacies that we have to iron out, but yeah.
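As a toy illustration of the SETNX problem described above (plain Python dicts standing in for the old and new Redis, with a made-up lease key), an unlucky interleaving of two double-writing clients leaves the two sides disagreeing about who won:

```python
# Toy model: two "Redis" instances as dicts; setnx returns True only if the key
# was absent. With double-writes, two clients can each win on a different side.
old_redis, new_redis = {}, {}

def setnx(store, key, value):
    if key not in store:
        store[key] = value
        return True
    return False

# Client A and client B both try to take the same exclusive lease, and each
# double-writes (old Redis first, then new). An unlucky interleaving:
setnx(old_redis, "lease:project:42", "client-A")   # A wins on the old Redis
setnx(old_redis, "lease:project:42", "client-B")   # B loses on the old Redis...
setnx(new_redis, "lease:project:42", "client-B")   # ...but B wins on the new one
setnx(new_redis, "lease:project:42", "client-A")   # A loses on the new one

print(old_redis["lease:project:42"], new_redis["lease:project:42"])
# -> client-A client-B : the two sides now disagree about who holds the lease.
```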
E
See, one thing I do want to mention on the Redis shared state that I found kind of interesting: I ran a tcpdump analysis a few days ago. Let me see if I can pull it up. So this is capturing traffic on the shared state secondary, and the reason for that is, when we looked at a CPU flame graph, we saw that a lot of the CPU time is actually spent on I/O, on writes in particular, and when we compared the ingress on the replicas...
E
Basically, that's an indicator of how much incoming replication traffic is happening, and you take that times two, and then you have an approximate share of how much of the writes happening on the primary are actually targeting replicas, as opposed to targeting clients, and it was like 50%. So a large chunk of the work that the shared state primary is doing is actually replication traffic.
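A back-of-the-envelope sketch of that estimate with made-up numbers, assuming the "times two" means two replicas receiving roughly the same replication stream:

```python
# Made-up figures to illustrate the estimate: if each replica ingests roughly
# the same stream, primary replication egress ≈ replica ingress × replica count.
replica_ingress_mb_s = 5.0      # measured incoming traffic on one secondary
replica_count = 2               # the "times two" above
primary_egress_mb_s = 20.0      # total network-out on the primary

replication_egress = replica_ingress_mb_s * replica_count
share = replication_egress / primary_egress_mb_s
print(f"~{share:.0%} of primary egress is replication, not client traffic")
# -> ~50% with these numbers, matching the figure mentioned in the meeting
```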
E
I wanted to get a better idea of what that looks like, and so this is capturing incoming traffic on a replica and then aggregating by the size of that incoming traffic, which gives us a sense of what the main contributors are. And what was kind of surprising to me was...
E
But what Redis does is, when it publishes, it always also publishes to the replication stream, and so all of the publish traffic must also go to the replicas. And that kind of means that it can be an optimization target: if we reduce publish payload sizes, or if we shift publish traffic to a different Redis, that can also reduce load, which was kind of a surprising result to me, and I just wanted to show that, Igor.
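For reference, a rough sketch of the kind of size aggregation described above: parse a plain-text tcpdump capture of traffic arriving on the replica and sum TCP payload bytes per source. The capture command, file name and per-source grouping are assumptions; attributing bytes to specific commands such as PUBLISH would require looking inside the RESP stream itself.

```python
# Aggregate a text tcpdump capture by source address, summing TCP payload sizes.
# Capture assumed to be produced with something like:
#   tcpdump -nn -i any 'port 6379' > replica-ingress.txt
import re
from collections import Counter

LINE = re.compile(r"IP (\S+?)\.\d+ > \S+: .*length (\d+)")

bytes_by_source = Counter()
with open("replica-ingress.txt") as capture:
    for line in capture:
        match = LINE.search(line)
        if match:
            source, length = match.group(1), int(match.group(2))
            bytes_by_source[source] += length

for source, total in bytes_by_source.most_common(10):
    print(f"{source:20s} {total / 1_000_000:8.1f} MB")
```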
C
Because it's effectively, it's like caching, right? Or, you know, it's saying we've had this response before, like, you know, keep your existing one. That's a good point.
C
Probably. I mean, I don't know how big that is; I mean, it's pretty big compared to the next one. That might be a little piece of work to actually...
C
Yeah, that's a big chunk of change. It might actually be worth migrating that to the...
C
It might have been that there wasn't a cache instance when we built that, or maybe that Workhorse only spoke to the primary, to the shared state Redis. It might have been that as well, and it was just an easy thing to do.
B
I'm not sure about that, because I was searching for what Workhorse does with Redis operations; it's mostly key watching, which is the runner queues, the pub/sub for runner queues, and also it does a GET on one key, which is also runner-related.
C
The first iteration was that Rails would do the, would be the, but the whole idea was that Workhorse would, and maybe it has never been done, which is kind of sad, because you could short-circuit that at the Workhorse level, right? Like, if you look, I think there was...
C
There was something in the original design doc about just short-circuiting that and doing the conditional response at Workhorse and not even bothering with Rails, and it might have been a "we'll do that in the next iteration", and six years later we haven't done it, which is kind of funny. But yeah, so maybe that is the case.
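A toy sketch of the short-circuit idea being discussed, not GitLab Workhorse's actual implementation (Workhorse is Go and uses pub/sub; the key name, polling loop and handler shape here are hypothetical): the proxy watches the runner queue key in Redis and only forwards the long-poll request to Rails when the value has changed.

```python
# Toy long-poll short-circuit: if the watched queue key is unchanged, answer
# 204 at the proxy layer instead of forwarding the request to Rails.
import time
import redis

r = redis.Redis(host="redis-shared-state.example.internal", port=6379)

def handle_runner_poll(runner_id, last_seen_value, timeout=50, forward=None):
    """Return ('no_change', None) or ('forward', rails_response)."""
    key = f"runner:build_queue:{runner_id}"      # hypothetical key name
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = r.get(key)
        if current != last_seen_value:
            # Something changed: only now pay the cost of a Rails request.
            return "forward", forward(runner_id) if forward else None
        time.sleep(1)                            # real code would use pub/sub
    return "no_change", None                     # proxy replies 204 itself
```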
F
All right, so for a little bit of context: in our Sidekiq service, we originally have these per-shard SLIs, which we kind of hacked in because we want to monitor each shard. We originally only had things like the shard catchall SLI and the shard database-throttled SLI, so these originally track both the queuing and the execution side of the performance.
F
So what we have done here is we actually now already have a separated execution SLI apdex and a queuing SLI apdex, and the way we did this differently is that previously, for our per-shard SLIs, we calculate from the histogram that we scrape from the Rails application, and then the apdex calculation, whether it's a zero or a one, happens in the recording rules aggregations, whereas for the new execution and queuing ones, the calculation happens in the Rails app itself.
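To illustrate the difference: with the histogram approach the zero-or-one decision is derived later, in the recording rules, from scraped bucket counts, whereas the new SLIs can emit plain success/total counters from the app. A minimal sketch of the in-app variant using prometheus_client, with made-up metric names, labels and threshold:

```python
# In-app apdex: the application itself decides 0 or 1 per job and increments
# plain counters, so no histogram buckets are needed in the recording rules.
from prometheus_client import Counter

EXECUTION_TOTAL = Counter(
    "sidekiq_execution_apdex_total", "Jobs measured", ["shard"])
EXECUTION_SUCCESS = Counter(
    "sidekiq_execution_apdex_success_total", "Jobs within threshold", ["shard"])

def record_execution(shard: str, duration_s: float, threshold_s: float = 10.0):
    EXECUTION_TOTAL.labels(shard=shard).inc()
    if duration_s <= threshold_s:            # the 0-or-1 decision happens here
        EXECUTION_SUCCESS.labels(shard=shard).inc()

# The apdex ratio is then just success_total / total in the aggregation layer.
record_execution("catchall", duration_s=3.2)
```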
F
Yep, so we have all these different shards within one component name, and then I've made the changes in the dashboards themselves. So instead, we have this shard template, so you can just select from the dropdown, and then it will apply the shard template to these panels. So now, instead of seeing all the shards, you can just see the related ones that we want to see, and then the saturation panels also respect this shard. So previously, this saturation panel was a global view of every shard.
F
So now we can see it here at a finer grain. Yeah, so I guess the next step is I will completely replace all these per-shard SLIs, so we will just have like two rows of SLIs here, and then you just filter by the shard template on top, yeah.
C
So the metric, if I understand correctly: the application is emitting like a zero or a one, potentially also like a half, depending on the threshold.
F
For Rails requests, I guess, and Sidekiq workers, they can have different urgencies, so I kind of made it a one-off for Sidekiq, yeah. It's just a simple class, yeah.
C
It's just, there's been some discussion in the past, I think particularly with Bob, around kind of abstracting that out, where you basically have a yield, and you can run something in there, and then you say what the satisfactory and the tolerated thresholds are. This is unrelated to this work; it's kind of semi-related, it's not, you know...
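A rough sketch of the kind of abstraction being referred to, assuming nothing about any eventual implementation: a helper that yields to a block, times it, and records a satisfactory/tolerated/unsatisfactory score against caller-supplied thresholds (the 0.5 case matching the "potentially also a half" mentioned earlier).

```python
# Hypothetical "wrap a block and record apdex" helper: the caller supplies the
# satisfactory and tolerated thresholds; the helper times the block and scores it.
import time
from contextlib import contextmanager

results = []   # stand-in for a real metrics backend

@contextmanager
def measure_apdex(operation, satisfactory_s, tolerated_s):
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if elapsed <= satisfactory_s:
            score = 1.0
        elif elapsed <= tolerated_s:
            score = 0.5          # the "half" case between the two thresholds
        else:
            score = 0.0
        results.append((operation, score))

with measure_apdex("project_export", satisfactory_s=1.0, tolerated_s=5.0):
    time.sleep(0.2)              # the work being measured

print(results)   # [('project_export', 1.0)]
```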
C
I think this is really great, and then we can kind of reuse that in places where we want to record apdex. But just doing that will make a huge difference to the number of metrics. And are we getting rid of the latency metrics altogether, or are they still there but just not being consumed?
F
Yeah, it's interesting. I'm trying to audit whether we can replace them with this zero-or-one apdex or not, whether we still have important places where we actually want to see the duration, or whether the logs are fine for us. Yeah.
C
The durations themselves, like, especially, I know a lot of software engineers in the stage groups kind of rely on those histograms, but really the most important thing there is to educate them on how inaccurate those histograms are. You know, they can be out by many dozens of seconds, right, but people take them as, like, absolute truth. It's like, "oh look, we saved a second", and it's like, you really don't know that. And so, like...
C
The thing is, if you encounter pushback from stage group teams who are saying, "oh well, how are we going to know", what you've got to do is teach them about using Elasticsearch and the logs instead, right? That's kind of the way that we have to help those teams. You know, for actual latency metrics it's much better just to use Elasticsearch.
F
Yep, I guess I will; it's part of the epic. We'll also make a write-up, documentation about this SLI, how it works, and then, yeah, what to do.
A
Yeah, it's going to be really nice being able to split it clearly between the queuing and the execution. It just provides so much more detail as to what's actually going on in there, rather than "the apdex is fine". Like, being able to split the two pieces out is great.
C
There are the shard detail dashboards as well; have you updated those? So if...
C
No, no.
C
To the top, sorry, and then just, yeah, to the top of the page, and then you'll see there's a, on the right-hand side, so you can go this way as well. Yeah, there's a shard detail one; it's the second to last one.
C
How many people use this? This is one of the places where the histograms are used, but we can get rid of those. I don't even know if anyone uses this dashboard, to be honest; it's very slow.
C
Yeah, I mean, there's some world where we could make that query into Elasticsearch and plot it in here in Grafana, it does support Elasticsearch as a data source, but it's probably easier just to push people through to Elasticsearch and give them a graph in there.
E
So I had another question as well. You showed the, or you touched on, the Sidekiq saturation panel, and...
E
That's now kind of broken out by components, and it's kind of a lot of information. So I guess for context: when I'm dealing with an incident, I usually try to scan the top row of a service to be able to tell at a glance whether there's anything obvious that looks out of place, and with this number of lines it's kind of hard to gauge. So I'm just wondering whether there's some sensible way to make the UX more friendly for that.
F
Yep, and another thing: I also discussed this with Bob last week. Because now we have two SLIs, if we were to aggregate both execution and queuing into the service aggregation, we would then have like double the RPS and double the error ratio compared to just the one that we have per shard.
F
So we currently don't have a way to indicate whether this SLI is only for RPS, or for apdex, or for the error ratio, whereas for queuing we only have the apdex based on duration, not an error rate, so yeah. This will be the next piece of work, also, to kind of support that.
C
Yeah, I'm just sort of, I mean, a lot of the reason why we had that second dashboard was to kind of avoid the compute, you know, because obviously a lot of those graphs are also really complicated in the way that they're constructed at the moment. So I'm just kind of worried that if we are adding in all that extra complexity about selecting on labels and stuff, whether it would, you know. And then obviously also, if you scroll to the top, we've got the range.
C
You know, the normal range, and then obviously, if you've got two, you can't really do, well, I don't know how you would do that with two values in there, for example. Or I don't know if I'm kind of missing what you were saying.
F
What they were saying is, so we have this in our metrics catalog, yeah, so we would have this aggregation flag. What this flag does is it will aggregate all of the apdex, error, and RPS.
C
All right, yeah, so we can't do the service aggregation on a per-attribute level. We can't do...
C
Apdex, but not the RPS. That's interesting, yeah; that shouldn't be too hard to change. Yeah, I mean, a lot of the RPS numbers are kind of a little bit out there. Like, if you think about web: when you think of the RPS of web, it's like one request comes from a user, but the way that we count it on there is really like one request into Workhorse, one request into Rails, and so it's really a relative measure rather than a, like, yeah.
E
Yeah, I mean, quite often if I want to see how much traffic a service is actually getting, I'll go to the dashboard, I'll skip the RPS line, and I'll go down to the SLIs, and then we have load balancer requests per second, Workhorse requests per second, Rails requests per second. We usually have those three, right, and then that gives a more realistic...
C
I mean, the other way we could do it is we could say that for a service, instead of aggregating all the RPS to get the service RPS, we say the RPS comes from this component, and so RPS is kind of different from errors and apdex. So that, you know, normally it would be the load balancer component that will give you the RPS for the service, because it would be much more realistic than what we have at the moment.
A
All right, well, we're at time. Thank you, everyone, for the demos, much appreciated. Hope you have a good rest of your day.