Description
As part of our investigation into a WAL archiving saturation incident (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6581) we got into an ad-hoc profiling session and a general introduction to CPU profiling.
Participants:
- Matt Smiley
- Igor Wiedler
- Alexander Sosna
- Biren Shah
B
To the void, thank you. Thank you, yeah, okay. So we've got some context here. We are currently running a sampling CPU profile for 120 seconds on a subset of the postgres processes. Specifically, there's one long-lived postgres process that handles archiving WAL files; it's the parent process that spawns the extremely short-lived WAL-G processes, which are what we really care about profiling.

By default, perf is going to recursively inherit any child processes spawned by the PID we're specifying here, which is how we get away with capturing a profile of the WAL-G processes that would otherwise be too short-lived to grab by process ID. Just for context, another, simpler way to do this would be to capture a profile of everything running on CPU on this host and then filter it as a post-processing step to include only the process IDs we care about.
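A minimal sketch of that capture step (the PID and sampling frequency here are illustrative, not the exact command from the session):

```
# Attach to the long-lived postgres archiver process; children spawned after
# attach (the short-lived wal-g invocations) are inherited by default.
perf record -g -F 99 -p <archiver_pid> -- sleep 120

# Alternative: profile the whole host and filter by PID in post-processing.
perf record -a -g -F 99 -- sleep 60
```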
B
But this is a more direct measurement of what we want, so that's why I'm starting with it. This is the first step of capturing the profile data. Everything else I'm going to talk about for the next couple of minutes is generic: by default, perf record writes out to a perf.data file, which is a raw binary capture file, and to get something useful out of it we run perf script. By the way, I should have started with this just for reference: in /usr/local/bin we've got a set of helper scripts, and generally you're going to want to run this one. It takes no arguments; you just run it (tab-complete, it's the "all CPUs" one), press enter, and it'll grab 60 seconds' worth of profile of all processes running on CPU, and it'll do all of the post-processing steps that I'm about to do manually now, just for reference.

So, talking in very general terms now: perf script is going to extract the contents of that perf.data file and give us effectively a textual output of each event that was captured as part of that profiling run. I usually name these files something that indicates the context, so in this case I'm going to name it for the postgres archiver.
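That step looks roughly like this (the output file name is just an example of the naming convention described; the --header flag includes the capture metadata mentioned below):

```
perf script --header > postgres-archiver-walg.perf-script.txt
```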
B
This will often emit warnings; usually you can afford to ignore them. The warnings are generally about how certain libraries on disk don't match the libraries actually linked into the running process, because the library has been upgraded since the last time the process started. So the process is using an older version of the library, and consequently we can't use the symbols available for the on-disk version of the library, because they don't match what's actually being used by the process. All of that is really just an FYI about why some of the symbolic names for functions may be missing from the profile.

Just so you can see what the profile looks like (this is not sensitive information, by the way): these are the headers, which just give context for the capture; this is a single profiling event; this is another single profiling event. You can see that it... oops, I accidentally scrolled my mouse, let me start over. So: the name of the process, the process ID, a timestamp (seconds since boot, with high precision), the event, cpu-clock, which is just the default event we were profiling with, and then a stack trace for this particular process at that moment in time.

WAL-G is unfortunately compiled without any useful debug symbols, which means we don't get symbolic names. All we get are the raw virtual addresses for the frames in the stack trace, which makes it hard to interpret, so we won't get a useful flame graph. But I'm still going to generate a flame graph so you can see what it looks like.
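Generating the flame graph from that output typically looks like the following (assuming Brendan Gregg's FlameGraph scripts are available; the helper script in /usr/local/bin presumably wraps the same steps):

```
stackcollapse-perf.pl postgres-archiver-walg.perf-script.txt > out.folded
flamegraph.pl out.folded > postgres-archiver-walg.svg
```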
B
SVG files are Scalable Vector Graphics files. They're graphic images, but unlike most image files they're actual text files, XML files, and they include some helpful JavaScript that lets you treat them as interactive graphics. So we can do mouse-overs; look at the bottom of the screen here. You can mouse over each frame and it'll tell you things like the sample count and the percentage of samples, and you can also do things like search for all frames containing a particular string, which is not super useful here since we don't have any useful function names, but in general it's going to be more helpful.

So the takeaway here is: we were profiling the postgres archiver process and any child processes it creates. This frame over here, representing 1.3 percent of the samples, was the postgres process, and 98.7 percent of the samples came from the WAL-G child processes that it spawned. So we know that most of the CPU time we're observing here came from WAL-G, not from the postgres parent process that was spawning them. Because that's the case, I'm gonna...
C
And this is CPU time it's sampling on.
B
Super important, yeah: this is just CPU time, not wall-clock time. What we really care about in this case is wall-clock time, and we're sampling CPU time, so bear that in mind as we look through this; it's not going to represent time spent on disk I/O or network I/O.

Okay. So what I wanted to use the same profiling data for is to load it into another tool, called FlameScope, which will give us a timeline of when these samples occurred. It's just a convenient way to visualize when we had on-CPU time.
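FlameScope is Netflix's open-source subsecond-offset heatmap viewer for perf profiles; loading the data is roughly the following (defaults from the tool's README, not specific to this session):

```
git clone https://github.com/Netflix/flamescope
cd flamescope
pip install -r requirements.txt
cp /path/to/postgres-archiver-walg.perf-script.txt examples/
python run.py    # then browse to http://localhost:5000 and pick the profile
```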
C
A flame graph doesn't tell you anything about the time dimension, and what we're about to see, what Matt is about to show in FlameScope, is actually showing that time dimension. I'll hand it over to you for further commentary, Matt.
B
Sure, yeah. So this is what Igor was just talking about: you can see we've got bursty behavior here, which is what we would expect from what we know about the processing model. By the way, when I was talking earlier about how little of the time was spent in the postgres process, the main reason I mentioned it is that we could potentially, as an in-between post-processing step, pull out only the samples that came from the WAL-G PIDs and throw away the samples that came from postgres. But because they represent such a small percentage of the samples anyway, I'm going to skip that and assume, rightly I think, that the large majority of what we're looking at here came from WAL-G, not from the postgres parent process that was spawning them. That's the whole reason I wanted to show the flame graph.

So what we see here is an oscillating pattern where we spend some significant time on CPU and then a lot of time not on CPU, and the time spent not on CPU could go to any of a few areas: one is disk I/O, reading the WAL files; another is network I/O, interacting with the API for uploading to the object storage bucket; and another is just not doing anything, waiting. So this is a little bit harder to explain; I feel like a whiteboard would do better for this. The operating model for WAL-G's background upload behavior... Igor already knows about this, because we talked about it yesterday during the code review, but I wanted to revisit it. Sorry, I'm making gestures, but you can't see them, and the gestures don't really tell you anything anyway. I guess the important piece here is... oh, maybe I could just use a text file for this, yeah.
B
Okay, so the sequence of events for any individual invocation of WAL-G... actually, let me take a step back. Postgres will run its configured archive command once per WAL file that needs to be archived. We know we've got a backlog of thousands of WAL files, so this is going to run thousands of times eventually, and we just saw that we generate about five WAL files per second, so that should be approximately the call rate for this archive command. Our archive command is a thin shell script wrapping an invocation of wal-g wal-push, which means we invoke a new WAL-G process about five times a second.
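For context, the shape of that configuration is roughly the following; the wrapper script name and paths are illustrative, not the actual production files:

```
# postgresql.conf (illustrative values)
archive_mode    = on
archive_command = '/usr/local/bin/wal-archive.sh %p'

# /usr/local/bin/wal-archive.sh : thin wrapper, invoked once per WAL segment
#!/bin/bash
exec wal-g wal-push "$1"
```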
B
If we're keeping up... am I remembering the numbers right? It was about five files per second that we were generating? Five, yeah, okay, great.

So going with that: to be able to keep up, we'd expect the average duration of each WAL-G invocation to need to be about a fifth of a second, about 200 milliseconds. So these should be very short-lived processes. But WAL-G has this interesting, kind of implicit, batching behavior (I'm going to make up the file names here) where it will implicitly say: okay, on my main thread I'm going to start archiving the specified WAL file. And as a background activity, I'm also going to look for the next several WAL files. That count is configurable, and we just increased it from 10 to 15 yesterday; that was the change issue Alexander made for us. I know I'm covering ground that you'll already have some pieces of; I just want to make sure we're all on the same page about the line diagram I'm about to draw.

If a helper takes slightly less time to complete its file, that thread will pick up another file before the main thread finishes its own, and that file can potentially go on for about the same amount of time, because all these files are roughly the same size and should be approximately the same amount of work to upload, right? So statistically we'd expect about half of these threads to finish before the main thread and half of them to finish after it, because they all get to start their work at about the same time and they all have about the same amount of work to do. When a thread finishes early, like in this example, it gets to grab a second file to work on; and whenever the main thread's file finishes, it says: okay, all my helper threads, you're not allowed to take any more work, but you are allowed to finish what you're already doing. Because they all tend to have about the same amount of work to start with, I'd expect each of these helper threads to do either one file or two files, never more, probably never more or fewer.

There may be some other influences that can affect that. I'm not as confident about the "fewer", but I think it would be really surprising if any of the threads got to do more. Sorry, when I say threads I mean goroutines, but "threads" is somehow more natural to talk about. Anyway, I digress.

The whole reason I'm mentioning this is that in this case (I'll skip drawing the rest of these), because we've now got 10 or 15 of these helper threads, it's very likely that at least one of them is going to get to start a second file, which means that our average duration for completing this command is going to be roughly twice as long as it takes to upload just the one file, because we'd expect at least one of those threads to have only just barely started its second file by the time the first file finishes. Does that reasoning hold water for all of you?
D
And when the archive command fires again in between, will it...?
B
Oh, that's a great question. The postgres archiver will only run one archive command at a time, so these invocations are serialized, and that's the other interesting part of the story. There's kind of an oscillation here: WAL-G has a lot of work to do when it uploads the one file it was asked to upload, plus an indeterminate number of additional files thanks to these helper threads. Once those helper threads have finished (let's say this one is 02), then later on, the next time the postgres archiver says "hey, I want you to archive this file", that WAL-G process is going to say: "oh, you know what, I've already uploaded this file", and it exits almost instantly. Does that make sense?
B
This is probably worth showing real quick, because it's a lot easier to see. On these nodes we have our BPF tools, so we can do a quick execsnoop on this. I want timestamps for reference, and I want the process name to be wal-g. What this is going to do (I'll press enter in a second) is attach this BPF tool to the execve syscall and, I think, one of its variants.
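The invocation is roughly the bcc execsnoop tool with a timestamp column and a name filter (the exact binary name varies by packaging, e.g. execsnoop-bpfcc on Debian/Ubuntu):

```
# Trace newly exec'd processes whose command name matches wal-g, with timestamps
execsnoop -t -n wal-g
```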
B
The point is that whenever a new process gets created, this BPF program captures the event, checks whether the new process's name matches what we've specified here, and prints out basically a log line so we can see it. I'm going to press enter now. It takes a couple of seconds to compile and attach, and now we're running, so I'll press enter again as soon as this batch finishes, just to give a visual break.

So, what we just saw: most of these invocations of WAL-G are very, very quick. You can see it's getting the sequential WAL file names here, and when I say quick, I mean they're finishing in a few milliseconds each, a few tens of milliseconds each. What's going on there is WAL-G saying: "oh, you're asking me to upload this file, but I can see from my scratch notes that I've already uploaded it, so I'm just going to exit immediately rather than doing any real work." So that's effectively a no-op run.

Eventually it gets to the end of the list of files it had already uploaded on a previous occasion, and it says: "oh, I haven't uploaded this file, okay, I'll actually do some work, and by the way, since I can see that I have a backlog, I'm going to launch my internal helper threads to proactively upload another big batch of files." That's why that invocation takes several seconds to run. In this case it took just under three seconds, and then the next N invocations again say "I see this file's already been uploaded, I'll just exit immediately." So this is the first layer of interesting oscillating patterns you'll see: most runs are very quick because they're no-ops, and one out of every N runs is rather slow because it's actually doing the work.
C
Right, and basically for each one of these groups or batches, the first item is the slow run. Actually, I guess it would be the next one, right? These are start times... yes, okay, yep. So B in this case is going to be the slow run, and all of the fast ones following it are the optimistically, concurrently uploaded WAL files that were uploaded by that first one, and are now sort of getting...
B
Yeah, so this one definitely uploaded B, but it also silently uploaded BF and C0 and C1, and so on, all the way up to D1, and the number of files in the batch that this one kind of secretly did for us is what we were trying to increase by raising the WAL-G upload concurrency setting.
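That knob is WAL-G's WALG_UPLOAD_CONCURRENCY setting; as a sketch (how it is actually wired into the archive command's environment in production is not shown here):

```
# Raised from 10 to 15 in the change discussed above
export WALG_UPLOAD_CONCURRENCY=15
```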
B
So if we just run this in a loop, refreshing every 10 seconds, you can see that these files are very short-lived. The execsnoop that we did was implicitly showing us the typical batch size. I should have counted them, but it was about 15, yeah. So effectively all 15 of those will have corresponding zero-byte files added to this directory, and then the subsequent runs of WAL-G will look in this directory before they do any work, see that the file name already exists, and know that that file was already successfully uploaded to the object storage bucket.
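The loop being described is just a periodic listing of that marker directory, something like the following (the directory path is a placeholder, not the real location):

```
watch -n 10 'ls -l <walg-marker-directory>'
```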
B
That's the mechanism by which it figures out that it can treat a run as a no-op, and just by watching files being created and removed in this directory we can see the pattern of that happening. So that was the last bit of show-and-tell I wanted to give. This is all non-obvious behavior; I don't even know if it's documented. I came across it a few months ago during our previous round of WAL-G throughput improvements.

So, tying that back to the FlameScope data: with that context, we can see these periods. By the way, the scale here is that each vertical line represents one second of data, and we captured 120 seconds' worth of data. So this time span is probably about half a second of CPU-intensive activity, and we don't know what it was actually doing on CPU during this time, because we don't have symbolic names for any of the functions.
B
This is part of what I wanted to see: when we're having these microbursts of CPU activity, how many CPUs are we using? The answer appears to be... if we mouse over, we can see how many samples there were in... sorry, this is also kind of non-obvious: for each point in time when we captured a profile, how many stacks did we grab? We only grab a stack if a thread is on CPU at that time, so this is effectively saying how many CPU cores WAL-G was using at that moment, and we can see that the scale here goes from 0 to 16.

We have 96 CPU cores available, and generally we're using a little more than half of them during the workday peak. I think it generally rises to about 50 to 60 percent usage, if I remember right, yeah.
B
So that means we can afford to burn 16 CPUs for short periods of time doing WAL archiving. Taking a step back, there were two things I wanted to see with this profile. One of them was to see how long these bursts were; I knew the bursts had to be happening, but I didn't know how long they lasted. We saw from the execsnoop that, ignoring the no-op runs we've got up here, the runs that actually did work were taking, I think, a little bit less than three seconds in the example we saw before we made our tuning change yesterday, and now, at least in this case, it took about three seconds, right?

Yeah, thank you, wall-clock time. So that means (and this view is representing CPU time) that if we take, for example, three columns here, since we've got one column per second, three columns would represent the total wall-clock time for a single WAL-G execution, and just at a glance it looks like much less than three seconds of that was spent on CPU, which means the rest of those three seconds was spent off CPU, probably doing either disk I/O or network I/O.
C
Yeah, and on a high level that also matches what we see when we look at the overall CPU utilization in top or pidstat; you were showing it earlier. So it's...
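A quick way to get that per-command view is pidstat's CPU report with a command-name filter, roughly (a sketch, not necessarily the exact invocation used in the session):

```
pidstat -u -C wal-g 1
```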
C
When I ran this yesterday, before we did our tuning, it was peaking, I think, up to four or five hundred percent, yeah. So that's... yes, we can see now it's bursting up to... I mean, it is very bursty, but it's bursting up to like 600 percent, so six cores, and we gave it 15 goroutines, right? Yes. So it's using roughly half of those goroutines on CPU, and that matches what we were just looking at, and sort of this idea that...
B
So all of this is tying into the overall question of: is it safe to further increase this parameter? The answer appears to be yes, we can moderately increase it again; that's my takeaway. I think this is the same way we're all framing it, but just to be explicit, in terms of machine resource usage we're looking at CPU usage and memory.
B
Memory usage is less of a concern, but memory thrashing could be a concern; I checked yesterday and it seems to be perfectly fine in that respect, so I'm going to gloss over it for now without rechecking. Disk and network I/O are the other two machine resources we could be concerned about. I am super not worried about disk I/O as a concern for this host; we're nowhere near capacity, we're using about half of our spec'd capacity on disk I/O, and there's no way WAL-G can burn through that. So that leaves network I/O, and as I recall it's harder to calculate our actual network usage, because disk I/O counts against it as well, but last I checked we were also nowhere near capacity on that, and I kind of don't see WAL-G being a significant risk there either. So those are the categories we could focus on, and so far we're fine.
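For reference, the kind of quick checks behind those capacity statements (standard sysstat tools; sketches, not the exact commands from the session):

```
# Per-device disk utilization and throughput
iostat -xz 1
# Per-interface network throughput
sar -n DEV 1
```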
A
Yeah, I'm not always right with my assumptions, but even if we were to take out all throttles and let WAL-G run at full power with as many threads as we want, I would now assume (not propose, assume) that disk I/O might even go down, because the fresh WAL files are always in memory, so you don't have to read them from disk anyway, because...
B
So I guess, okay, in terms of tuning, I think it's viable to bump that knob a little further if we want to. And we have a couple of other things we can do, not necessarily immediately.
B
We know that we have some kinds of events that can trigger a significant increase in WAL generation rates, for example when we do bulk data changes like migrating primary keys from four-byte integers to eight-byte integers. Anything that does large bulk data updates, and in particular anything that ends up rewriting historical blocks, is going to generate more WAL records than we usually get, and we're not at the end of needing to do that kind of maintenance. So I'm a little bit worried that we're going to have ongoing and future background migrations that nudge us over the edge, and in an even more mundane sense, folks doing feature development work are not thinking about WAL generation as a design requirement.

So, like Alexander said to start with, I think we have pretty strong evidence at this point that we are kind of continuously right at the edge of saturation for being able to archive these WAL files. Turning this knob further gives us, I think, effectively maybe ten percent; it's a small additional margin, and I'm kind of worried that that's not going to be enough to last us very long, maybe weeks, maybe months.
B
So I wanted to talk about some of the other possibilities in addition to what we're already talking about doing. At the start of this conversation we talked briefly about moving the CI tables to their own database cluster, and I think that's very much a game changer in terms of separating two large drivers of read and write activity. Once that's completed we'll probably be in a very different space, but I kind of don't feel like it's reasonable to just assume that it will take care of our problems. I don't know what the timeline is for that, and, well, I feel like I'm using too strong language here, but I think it's a significant risk to assume that that project will complete. We shouldn't.
C
By the way, since I think we've covered the interesting demo part, I'm going to stop the recording here, if that's all right. Yeah.