Hi there. I wanted to share an interesting profiling and optimization exercise that I went through, mostly just for fun, but also to try and see whether it's possible to get any useful information out of this Google Cloud Profiler service that we recently connected.
So this is how the default page looks. This is data from production for KAS, and it shows us the CPU profile for all zones and for all versions. Let's pick the most CPU-hungry profiles — all of them are here. We had 17,000 profiles over the last seven days; that's a lot. So what do we see here? The width is what's interesting.
This is the 100 percent, the root of it, and then — it's called a flame graph, you probably all know that — we can see what takes time, CPU time in this case, and we can quickly filter through this by ignoring most of this stuff, because it's just very small and insignificant.
So NewScanner allocates half of that, and the method above it accounts for about half of that again. So we have three methods that look suspicious, and something here: SplitN, something in strings, genSplit. Okay, this is a bit interesting — let's dig in. But I also found that there's another view that makes it even easier to spot the suspicious things, and they're here as well — the same stuff, just in different stack traces, where the same method is called through a different chain, of course.
So I've done the work already, so I will just quickly show you what I've done here. Let's check out this commit. Basically, we check out this revision and look at this package, and this is copied from Gitaly. This code just parses the reference discovery API response, which Gitaly streams over gRPC.
We use it to learn whether a repository has changed or not on a particular branch. So we know that FetchRefs calls ParseReferenceDiscovery, so the expensive methods are FetchRefs itself and ParseReferenceDiscovery, which it calls. FetchRefs lives here: it makes a call to Gitaly — this gRPC method — and then it just consumes the responses, accumulating the data into a byte slice, and then it passes that as a reader to the parse method, which parses it, processing the data line by line.
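Roughly, the shape of that code is something like this — a minimal sketch, where the stream type, function names, and line format are all illustrative stand-ins, not the actual KAS/Gitaly API:

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// chunkStream mimics (hypothetically) a gRPC stream that yields the
// reference discovery response in chunks.
type chunkStream struct {
	chunks [][]byte
}

func (s *chunkStream) Recv() ([]byte, error) {
	if len(s.chunks) == 0 {
		return nil, io.EOF
	}
	c := s.chunks[0]
	s.chunks = s.chunks[1:]
	return c, nil
}

// fetchRefs accumulates the streamed chunks into one byte slice and
// hands it to the parser as a reader, as described above.
func fetchRefs(s *chunkStream) ([]string, error) {
	var buf []byte
	for {
		c, err := s.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		buf = append(buf, c...)
	}
	return parseRefs(bytes.NewReader(buf))
}

// parseRefs processes the accumulated data line by line.
func parseRefs(r io.Reader) ([]string, error) {
	var refs []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		refs = append(refs, sc.Text())
	}
	return refs, sc.Err()
}

func main() {
	refs, _ := fetchRefs(&chunkStream{chunks: [][]byte{[]byte("a\nb"), []byte("c\n")}})
	fmt.Println(len(refs)) // 2 lines: "a" and "bc"
}
```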
This is how you test and measure things in Go: a benchmark is a test that measures allocations and times the code. So here we just create an artificial input and create a reader from it. We tell the benchmarking harness that we are interested in recording allocations, not only in timing the code. Then we reset the timer so as not to measure the setup code above, and then this is the standard pattern where you loop N times over the code.
The output shows how long each invocation takes, and this is how many times it allocates memory per invocation — it's just standard Go stuff that lets you measure.
Then, if we add this, we tell the harness — so IntelliJ GoLand in this case builds the binary of the benchmark — and we tell it to call this benchmark.
A
Then
we
pass
this
parameter
to
that
program.
That
golan
builds
with
this
benchmark
and
we
tell
it
to
record
the
memory
profile
into
a
file,
and
then
you
can
see
the
full
thing.
What
golem
does
here?
It
calls
go
to
with
test
and
compile
output,
maybe
not
compile
this
output,
so
it
builds
this
binary.
A
Yes,
this
is
the
binary
name
like
this
long
thing
saying
that
it
should
be
verbose
and
some
other
parameters
so
to
run
this
benchmark
with
this
rejects
and
then
just
basically
excludes
all
tests
and
then
that's
what
we
added
profile
and
record
the
profile
and
blah
blah
blah
and
that's
what
it
prints
so
to
open
the
profile
we
we
do.
These are the most allocating methods, and this matches pretty much what Google Cloud Profiler tells us — well, of course, this is the only thing that we ran. If we list the parse function to see the annotated code, we see this function — ParseReferenceDiscovery, the function that we call in the benchmark — and we see where the memory is allocated. And that's a lot, because it does a lot of iterations — how many?
257,000 iterations, to properly measure it and exclude anything else that happened to be running on your computer — and yeah, that's a lot. Then this allocates, and that allocates, and Split allocates — which we also saw here in the profiler. SplitN, again, allocates, of course, because it creates new slices — new backing arrays for the slices — and this also allocates strings, again and again, the same thing. And here it's the same, but in the outer scope, and this is the same outer scope. Okay, so this is how it is.
Yeah, we just changed it: instead of converting the bytes to a string and calling strings.SplitN, we call bytes.SplitN, and everything else stays the same. Okay, let's run this benchmark again.
Let's first check that. The next thing I decided to check: I don't need any of that — I only need the references. I don't want to collect all the refs and allocate that slice; I just want to iterate over the parsed data and call a callback. So I changed the signature of the function to take just the reader and a callback that consumes the data. We no longer collect anything here — no longer collect references or capabilities. We don't need that.
This code is from Gitaly, as I said — Gitaly needs that, but we don't. So I just removed it and simplified the code, and also replaced SplitN with the Cut method, which was added in Go 1.18. But first, let's look at bytes.Split.
bytes.Split uses genSplit — oh, by the way, we saw genSplit here, but that was for strings; with byte slices it's basically the same — and it allocates a slice of byte slices. It's like a two-dimensional array — basically an array of arrays — and this is memory that we don't need to allocate if we use Cut, because Cut just slices the input into two pieces.
Cut looks for a separator and returns what was before the separator, what is after the separator, and whether it was found or not. So you don't need to allocate any memory — you just slice the input where the separator is. That removes a single allocation, but the method was used in multiple places. So it's a new function in Go 1.18, which is quite useful and uses less memory.
I also found that moving this into a package-level variable doesn't change anything, actually — this allocation is inlined by the compiler somehow. Magic. So we do two things here: a callback, plus Cut instead of SplitN, and we remove all the stuff that we don't need — the capabilities, the collecting of references — and I think that's it.
Right — is that right? Yeah, I think so. Yes, okay, let's run the benchmark again. And yeah, this time I used the reader here correctly — the first time, in the first commit, I forgot to change it. And this is iteration three. Okay.
Memory is allocated — sorry, allocated — here and here. So NewScanner, and the scanner can allocate memory, and maybe a few other things, but mainly these two. Okay, the next thing we do is go to the scanner.
Okay, I don't know where exactly, but first we can just see that.
What we've done here is stop wasting memory by pooling the buffers, and that is done like this, using sync.Pool. We reuse — it's a free list, basically: a list of free buffers, and if no buffer is available, this function is called and it creates a new one. And we have helpers — we already use this for the 32-kilobyte buffers that we use for all I/O, and this one we now use for parsing — and then we just use that pool here, and yeah.
So what we've done here — this diff — is a little bit of unsafe magic.
So let's look at the code, not the diff. A byte slice is actually this thing in memory: a pointer to the backing array, the length of the slice, and the capacity of the slice. The pointer is eight bytes on a 64-bit machine, and the other two are also machine-sized words, so 24 bytes in total — three times eight. And we need a string, and a string is the same thing, except it doesn't have a capacity; it just has a length — so a pointer and a length.
So why wouldn't we reinterpret the memory? We can pretend that what's in memory is actually a string and not a slice. In C and C++ that would be a reinterpret cast, basically. We can do that in Go by going through unsafe.Pointer: we take the pointer to the slice variable — which points to those three words, and which the compiler thinks is a pointer to a byte slice — and via unsafe.Pointer cast it to a pointer to the header.
Then we construct a string header — which is what a string basically is — with a pointer to the same backing array, and we use the length as the length, obviously. We don't need the capacity: the length of a string can't change, it's immutable, so a capacity doesn't make sense there — it's not needed. Then this is the string, basically, and we now need to turn it into a string value so that we can pass it to a method. We do that by taking a pointer to what is now a string — to that header — via unsafe.
So, okay — the profile doesn't have this method, because it doesn't allocate any memory. How do we know what allocates here, then? In this map? That, probably, is the closure.
So it may be — wait, no, we know, right?
Anyway, you can see more if you want to look at the code — there is a merge request for it linked. Thank you.