From YouTube: Scalability Team Demo - 2022-12-15
A: So, Igor explained it to me already, but I was quite surprised to see a saturation issue, which I'll dig up later: we often saturate the HPA max replicas for Sidekiq urgent-cpu-bound. And then I went looking, and I saw that if we're at max replicas, this thing would not fit on the node pool. The CPU request was set to something like 600 or 800, and we've only got 40 vCPUs, but we allow a hundred-something max replicas.
A: So apparently we're scheduling that stuff everywhere now, on the generic node pools as well as the dedicated node pool, which I think is okay. But now I'm wondering: how do you decide how to size things? Do you just size them based on what you think the workload is going to look like and then see afterwards whether it fits the node pool, or do you need to keep summing everything? Does anybody have any experience with that?
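A back-of-the-envelope version of the mismatch being described, as a minimal sketch; the 800 request, 100 max replicas, and 40 vCPUs are the approximate figures mentioned above, and the millicore unit is an assumption:

```python
# Rough capacity check for the mismatch described above. The request
# and replica numbers are the approximate figures from the discussion,
# and the millicore unit is an assumption.
cpu_request_millicores = 800   # per-pod CPU request ("600 or 800")
max_replicas = 100             # HPA maxReplicas ("a hundred something")
node_pool_vcpus = 40           # dedicated node pool capacity

worst_case = cpu_request_millicores / 1000 * max_replicas
print(f"worst-case demand: {worst_case:.0f} vCPUs vs {node_pool_vcpus} available")
# -> worst-case demand: 80 vCPUs vs 40 available: at max replicas the
#    workload cannot fit on the dedicated node pool.
```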
B: I do. Whenever I had a workload, I'd do the math to make sure; this was when we had dedicated node pools for everything. I don't think everyone does that. I think a lot of folks are focused only on sizing the pods and the pod count for the HPA, and just assume that the node pool is an infinite resource.
A: I was doing the thing that you were doing, but based on the node pool whose size I don't know anymore, because, yeah, 40 vCPUs was not accurate before and we could still run the workload. It was never saturated, because that node pool isn't used; it's just always scheduling on the generic nodes.
A: Then the goal is to do what, like Matt said, others are already doing: just get the sizing right for the workload you're expecting, use the max replicas to have a limit, and consider the node pool infinite.
B: To be clear, I don't think this is a good pattern. I disagree with the trade-offs that were made in that consolidation into generic node pools.
B: I think bulkheading is a super useful practice, and we've given that up by moving to hybrid workloads in the shared node pools. It does make sizing and capacity planning more difficult; I would say impractical, because we don't have a good view into all of the workloads that are currently sharing those node pools. And just as a closing note, I'll comment that the two resources that Kubernetes uses for scheduling, CPU and memory, are not even close to the only resources that can be contended at a machine level. This is one of several reasons that I thought having dedicated node pools provided wonderful isolation, so that you don't have problems like what we observed with... sorry, I'm going off on a tangent. Having isolation boundaries is useful.
B: Yeah, I believe so. We've got two resources that we're scheduling for, CPU and memory, and so we've got a node pool for CPU-oriented workloads and a separate one for memory-oriented workloads.
B: I think what it really comes down to is that we've got kind of a limited buffet of machine types, in other words VMs, and the ratio of gigabytes of memory per vCPU is different for the two types of VMs that we're using in these two generic node pools.
B: So it's really kind of saying: if we've got pods that want a relatively large amount of memory per vCPU of expected consumption, then we'd prefer to schedule them onto the nodes that have a larger ratio of memory to CPU.
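A toy sketch of that placement rule; the machine families and gigabytes-per-vCPU ratios below are assumptions based on common GCP shapes, not the actual pools under discussion:

```python
# Toy pool picker: match a pod's requested memory:CPU ratio to the node
# pool whose machine shape has the closest GB-per-vCPU ratio. The
# families and ratios are illustrative assumptions, not the real pools.
POOLS = {
    "cpu-oriented":    {"machine": "n2-highcpu",  "gb_per_vcpu": 1.0},
    "balanced":        {"machine": "n2-standard", "gb_per_vcpu": 4.0},
    "memory-oriented": {"machine": "n2-highmem",  "gb_per_vcpu": 8.0},
}

def pick_pool(mem_gb: float, vcpus: float) -> str:
    """Prefer the pool whose GB-per-vCPU best matches the pod's ratio."""
    ratio = mem_gb / vcpus
    return min(POOLS, key=lambda name: abs(POOLS[name]["gb_per_vcpu"] - ratio))

print(pick_pool(mem_gb=7.0, vcpus=1.0))  # memory-oriented
print(pick_pool(mem_gb=0.5, vcpus=1.0))  # cpu-oriented
```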
C: This is a complete change of subject and I don't need anyone to answer it now, but is there a reason that we don't ever use custom node... er, custom machine types in GCP?
B: That's a great question. I think some of the machine families have a fixed ratio and some of them don't, if you're talking about adjusting the ratio of memory to CPU.
D: Correct, you can mix and match. Yeah, I can answer for Redis: in Redis we use C2 instance types, because those have the fastest single-threaded CPU performance, and for C2, custom instance types are not supported.

Okay, that...
B: Speaking just for myself, I really didn't have the stamina to follow it, so I can't answer your question.
B
I'm
I'm
sure
that
there
is
I
just
don't
know
what
it
is.
I
I've
got
some
strong
opinions
about
it
and
I
didn't
have
the
energy
to
to
push
those
opinions,
and
so
I'm
kind
of
checked
out,
apart
from
keeping
tabs
on
what
the
current
state
is.
B: Sorry, yeah, I'm a little groggy this early. Yes, so I'm partway through. I thought this was kind of an interesting show-and-tell, and it's pretty quick.
B
Here
we
go
okay,
so
so
we've
done
a
lot
of
so
we've
done
a
lot
of
work
on.
B
Also
more
specifically,
several
months
ago,
the
we,
our
our
team,
discovered
that
rdb
backups
are
are
capable
of
triggering
CPU
saturation
on
on
redis
instances
that
have
the
the
save
your
active,
enabled
or
more
more
generally
redis
instances
where
the
BG
save
command
gets
run,
which
is
what
creates
these
rdb
dump
files.
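For context, a minimal sketch of exercising that same code path by hand with the redis-py client; the host and port are assumptions, and this should only ever be pointed at a disposable instance:

```python
# Minimal sketch: trigger the BGSAVE path described above and wait for
# the forked child to finish. Host/port are assumptions; never run this
# against a production instance.
import time
import redis

r = redis.Redis(host="localhost", port=6379)
r.bgsave()  # Redis forks a child process that writes the RDB dump file

# While the child exists, writes on the parent incur copy-on-write
# page faults (the overhead examined later in this demo).
while r.info("persistence")["rdb_bgsave_in_progress"]:
    time.sleep(0.1)
print("BGSAVE finished; the child exited, so copy-on-write pressure ends")
```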
B: So this is an example of a CPU profile captured from the primary Redis instance in the redis-persistent target, that's redis-03 currently, and this is from, I think, about a day or two ago; it was captured this week, so it's very fresh.
B: So we're going to go through three levels of filtering here, and I kind of wanted to narrate... actually, I'll just tab through them so you can see what the picture looks like. This capture is for just the redis-server process itself, that is to say, the redis-server process that actually handles traffic and the child process that it forks when the BGSAVE kicks off.
B: So there's really two processes, and this view represents all of the threads for those two processes. The next tab over filters to just the main thread of the primary Redis process, the parent process, and... I don't know if you can see the band present here; the band is still present. Is that visible? Okay.
B
Some
for
some
reason
when
I
take
a
screenshot
of
this?
It's
just
completely
a
flat
red
color,
so
I
wasn't
sure
if
that
was
coming
across
on
on
Zoom
anyway.
So
this
is
the
same.
This
is
the
same
point
in
time
that
we
saw
on
on
this
representation.
It's
just
filtering
to
the
main
thread,
and
the
important
point
from
this
from
this
view
of
the
graph
is
oh
by
the
way
the
sampling
frequency
is,
is
a
about
500
samples
per
second
and
we've
got
50.
B: Each second is drawn vertically here in 50 buckets, so each of these buckets represents... excuse me. So I guess the point here is that when we mouse over these cells and you see a count right here, you can multiply that mentally by 10 and get the percentage of CPU time that we have represented there.
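Spelled out, that bucket arithmetic looks like this (500 Hz sampling and 50 buckets per second are the figures just stated):

```python
# FlameScope-style bucket math from the narration above.
samples_per_second = 500   # profiler sampling frequency
buckets_per_second = 50    # each second is drawn as 50 vertical cells
samples_per_bucket = samples_per_second // buckets_per_second  # 10

def bucket_cpu_percent(sample_count: int) -> float:
    """Convert one cell's on-CPU sample count into a CPU-time percentage."""
    return sample_count / samples_per_bucket * 100  # i.e. count * 10

print(bucket_cpu_percent(7))   # 70.0  -> busy but not saturated
print(bucket_cpu_percent(10))  # 100.0 -> on-CPU for the whole 20 ms bucket
```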
B: So if we just kind of browse here, we're seeing that we're burning about 70 percent of CPU time prior to this burst, and then during this burst we're sitting at 100 percent CPU. So this is saturation. The question, then, is: what in the world are we doing during this saturation? I'm going to grab a bit of it here and we'll find out.
B: So let me make the font a little bit bigger. Is this easier to read? So, just at a glance, you can see kind of typical-looking Redis stack traces, where we've got processCommand doing various commands. We don't really care what they are; what we care about is what we're spending CPU time on. And just at a glance... you all know this, but for anyone else that isn't familiar with it, the color scheme here is: this reddish-pink color is... these are stacked.
B: It's calling its event loop, which is aeMain, which calls aeProcessEvents, which is handling requests from clients, and it's making a system call. That system call is the read syscall, and it turns out that's reading from a TCP socket, and we can infer that this is a TCP socket where it's receiving client requests, because the call path was readQueryFromClient. So that's just kind of a super quick tour of what we're looking at, in general, for these stack traces.
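For anyone who has not read that code, a minimal sketch of the same shape, an event loop whose request handling bottoms out in a read() on a client TCP socket; this is illustrative Python, not Redis source:

```python
# Sketch of the aeMain -> aeProcessEvents -> read() shape in miniature:
# a single-threaded event loop that accepts clients and reads requests.
import selectors
import socket

sel = selectors.DefaultSelector()
server = socket.create_server(("localhost", 0))
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

def process_events() -> None:
    """One pass of the loop body (the aeProcessEvents analogue)."""
    for key, _ in sel.select(timeout=1.0):
        if key.fileobj is server:
            conn, _ = server.accept()        # new client connection
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(4096)    # the read() on a client socket
            if not data:                     # client hung up
                sel.unregister(key.fileobj)
                key.fileobj.close()

while True:  # the aeMain analogue: loop forever handling events
    process_events()
```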
B: But the piece I really want to focus on here is when we do page faults. A page fault means the process is trying to access a page that's not ready for it, that the kernel has flagged as not ready for user space to consume, and so the kernel has to do something to make that page ready. And in this case, this is just one of many examples I kind of wanted to show.
B: So, page fault... I'm going to narrate a few of these stack frames. A page fault is kind of the generic kernel interface for the CPU, where the executing thread tries to access a virtual memory address, the kernel does the virtual-to-physical memory address lookup, and the page table says: nope, there's not a physical page that's ready yet. That triggers a page fault event, which the kernel handles, and that's the portion of the stack that we're looking at here.
B: So in general, that's what a page fault is, just as a general piece of vocabulary, a little bit of background. The important part of this stack trace is this do_wp_page, which is a distinctive code path in the kernel for handling copy-on-write.
B: When you see people say COW, that stands for copy-on-write; a better name for it would be copy-on-first-write. The idea is, when you fork a child process and you use particular flags that say: give the child a virtual copy of the parent's memory image, or some portion of the parent's memory image...
B
The
the
all
of
the
pages
that
haven't
you
know.
Initially,
all
of
the
pages
in
The
that
are
that
are
cloned
for
the
for
the
child
are
just
referencing,
the
same
physical
page
that
the
parent
is
referencing,
but
they
need
to
represent
different.
You
know
the
the
promise
is
that
those
pages
will
remain
intact,
even
if
the
parent
later
comes
by
and
and
changes
the
contents
of
those
pages.
So
this
is
why
we
have
copy
on
right.
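A minimal sketch of that mechanism, assuming Linux; the allocation size is arbitrary, and Python's own activity adds noise to the fault counts:

```python
# Sketch of copy-on-write cost after fork() (Linux assumed). The parent
# pays a minor page fault the first time it writes to each page shared
# with the forked child, analogous to redis-server during a BGSAVE.
import os
import resource
import time

data = bytearray(200 * 1024 * 1024)  # ~200 MB of parent memory

pid = os.fork()
if pid == 0:
    time.sleep(5)    # child just holds references to the shared pages
    os._exit(0)

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
for i in range(0, len(data), 4096):  # touch every page once
    data[i] = 1                      # first write triggers a COW fault
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
print(f"minor faults while writing: {after - before}")
os.waitpid(pid, 0)
```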
B
The
important
Point
here
is,
the
parent
is
the
one
that
incurs
the
burden,
because
the
parent
is
the
one
that
will
continue
modifying
these
pages.
So
so
what
I'm?
What
I'm
going
to
do
here
is
tell
flame
flame
scope
to
search
for
all
occurrences
of
gwp,
page.
B: We'll zoom back out, and we'll see that roughly 30 percent of the CPU time that we saw here... and remember, we're highlighting the point when the Redis main thread is completely CPU-saturated... 30 percent of that CPU time is spent doing copy-on-writes. And there's a bunch of code paths that are doing that, because basically, once we fork the child process, anything that modifies a memory map that was inherited by the child has to be protected.
B: So any pages that are really volatile are really likely to incur this penalty, and that includes things like Redis's client buffers and hot keys.
B: So what we're going to do next is... this is the third tab I wanted to show. Before...
A: ...before you move on, yeah: how did you get here? Like, now we can see that this copy-on-write behavior that the main thread needs to take care of is, yes, 30 percent of CPU time, but I don't see how you discovered that. How did you go from... this looks to me like a pretty normal Redis flame graph, with processCommand and so on. So how did you pick that one out?
B: Right. So I've done some work in the past with copy-on-write overhead, so I recognize this kernel code path. And the kind of high-level pattern here is, when we have, say... I'm making these numbers up: say you've got a gigabyte of memory. This isn't about Redis; this is just about kernel memory management. So you've got process... process one... sorry, PID 1 means something special.
B: You've got process A that has a gigabyte of memory, and it creates a child process. I say fork, but this applies equally to the clone syscall. So you have a child process, and you fork it in a way that preserves... that gives the child access to a point-in-time view of the memory image at the point when it forks.
B
As
long
as
neither
of
those
processes
modifies
any
of
the
pages
that
were
that
are
shared
between
them,
then
both
of
their
virtual
memory
address
maps
can
continue
pointing
to
the
same
physical
pages,
and
you
won't
get
any
of
these
page
folds,
but
the
moment
either
of
them
starts
to
modify
those
pages.
That's
the
that!
That's
the
that's
the
point
in
time
when
you
start
to
incur
the
the
cow
overhead.
B: So in our case... switching back to the context of Redis: we know that Redis has a large memory footprint and that many of those pages contain relatively stable keys, but a subset of those pages are going to be used for really hot keys that get modified all the time, and some of those pages will contain client buffers for input and output for clients.
B
It
doesn't
matter
what
the
page
was
used
for,
because
redis
is
using
je
Malik
for
for
its
for,
for
for
its
memory,
management
and
Je
Malik
is
perfectly
happy
to
you
know,
use
any
it's
not
gonna.
It's
not
gonna
make
an
effort
to
separate
hot
pages
from
you
know.
B
Hot
allocations
from
from
cold
allocations-
it-
that's
not
it's
it's,
that's
not
a
design
goal
for
it,
so
you're,
basically
going
to
have
hotkeys
and
client
buffers
will
be
spread
around
across
a
random
subset
of
pages,
and
so,
if
you're,
spending
I'm
making
these
numbers
up.
If
you
spend,
say
two
percent
of
your
memory
on
on
really
hot
on
on,
say,
two
percent
of
of
your
your
memories
is
sorry.
I
shouldn't
make
up
numbers.
B
Let
me
let
me
just
say
that
kind
of
kind
of
conceptually
you
could
have
a
small
percentage
of
keys
and
buffers
that
get
mutated
on
you
know
on
during
during
the
first
few
seconds
and
whatever
Pages
they
happen
to
reside
on,
are
going
to
incur
the
that
copy
on
right
overhead,
but
once
you've
kind
of
touched
and
made
a
first
touch
after
the
fork
of
of
all
of
most
of
the
hot
Pages.
B: ...this is pulling out only the stack traces from the main thread that made that do_wp_page kernel call frame, and you can see that they're extremely front-loaded, and that once we've dealt with the hot pages, it becomes rarer and rarer to have to pay that penalty. And I'm going to scroll over to the left, and you can see, a few minutes later... by the way, the headers here are...
B
This
is
this
is
giving
us
a
relative
offset
since
the
first
capture
and
yeah,
so
you
can
kind
of
think
of
these
as
seconds
so
230
seconds
from
some
point
in
time
is
roughly
when,
when
we
started
to
see
this
activity-
and
it
looks
like
it-
took
us
just
a
couple
minutes-
this
is
this
is
effectively
when
the
backup
ended
and
the
child
process
exited,
and
therefore
we
don't
pay
copy
them
right
overhead
anymore,
because
there
isn't
a
process,
that's
kind
of
present
serving
references
to
those
original
pages
yeah.
B
So
that's
that's
kind
of
what
I
wanted
to
show
that
this
is
the
dense
period
that
lasts
a
few
seconds
when
we're
paying
a
really
heavy
cost
for
copy
on
right
and
and
once
once
all
of
those
highly
volatile
keys
and
buffers
have
touched,
the
pages
that
they
reside
on.
B: ...for the first time, the penalty is much lighter. But the fact that we're, for a few seconds, consuming on average about 30 percent of our CPU time in a saturated state is going to have implicitly the same effect, where Redis can't keep up with its incoming request rate: response rate falls below request arrival rate, clients perceive this as slowness, and because most of our clients are Puma...
B
I.
Think
that
that
implicitly
means
that
each
Puma
worker
process
is
going
to
kind
of
rapidly
approach
its
maximum
number
of
threads.
So
if
it
was
previously
running
two
two
Puma
threads
and
it's
Max
is
set
to
four.
It's
going
to
bump
that
up
to
four
and
that's
going
to
manifest
as
the
new
threads
needing
to
open,
meaning,
naturally
meaning
to
open
new
connections
to
redis
and-
and
we
see
that,
as
do
I
still
have
this
open.
Maybe.
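A toy model of that feedback; the worker count is invented, and the two-to-four thread range is just the example above:

```python
# Toy model: when Redis slows down, each Puma worker grows from its
# current thread count toward its configured max, and every new thread
# opens another Redis connection. The worker count here is invented.
def extra_connections(workers: int, threads_now: int, threads_max: int) -> int:
    """Connections opened if every worker scales up to its max threads."""
    return workers * (threads_max - threads_now)

# 100 workers each growing from 2 to 4 threads -> a burst of 200 new
# connections, arriving exactly when the server is already saturated.
print(extra_connections(workers=100, threads_now=2, threads_max=4))  # 200
```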
B: Yes, and that's going to result in new incoming connections, which we can see here during every one of these events. So each one of these is an RDB backup, and we can see a spike in the incoming connection rates and the same spike in the open connection count. And I'm mentioning this mainly because, beyond these few seconds when we have heavy copy-on-write overhead, the following few seconds...
B
A
little
bit
of
overlap
also
contain
a
lot
of
overhead
for
accepting
incoming
connections
from
from
clients.
So
that
kind
of
extends
the
the
CPU
saturation
by
a
couple
more
seconds
than
what
we're
seeing
in
in
this
flame
graph.
But
altogether,
that's
still
only
representing,
maybe,
like
you
know,
five
or
six
seconds
of
CPU
saturation
on
the
main
thread,
I'm
going
to
switch
back
to
the
the
actually
this.
It's
easy.
It's
easy
to
see
here
as
well,
even
though
this
isn't
just
the
main
thread
you
can
see.
B: ...what's going on, and kind of identify two components that are driving this five-or-six-second period of CPU saturation, but also kind of highlight that even after we're no longer saturated, there might be some follow-on effects. From the client perspective, the impact to Apdex is larger than this time scale, and I'm not entirely sure...
B
If
that's
kind
of
an
artifact
of
the
way
our
metrics
are,
are
kind
of
lagging
indicators
or,
if
there's
something
else
going
on,
but
regardless
we
know
that
we
know
that
we
get
those
those
Apex
dips
and
kind
of
the
corresponding
throughput
dips.
B: Oh, that's a great question. I don't think so. I think that probably the heavier the traffic, the higher the portion. This particular profile was captured at 17:42 UTC two days ago, just for reference in time. So it was not at peak, but it was, you know, during...
B: That's a great question. Yeah, I think that's true. I think it would affect latency, but it would affect it on a time scale of microseconds, so, yeah, I don't think clients would notice. This only matters because there's a lot of them happening, so yeah.
A
I
think
the
the
one
thing
that
you
you
mentioned:
five
percent
of
fully
CPU
saturated,
but
the
effect
goes
on
longer.
I
think
that's
probably
because
it's
five
seconds
like
fully
saturated,
but
it's
still
pretty
busy
right
after
until.
B: Yes, yeah, I think that's plausible too. Yeah, I just... I try to be clear about things that I'm really confident about versus speculation, but I mean...