GitLab Scalability Team Demos, 8 Dec 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Scalability Team Demo 2022-12-08

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

So today, I wanted to share um just some things: I've been um looking at for um container memory uh saturation, so this came out of a couple of uh capacity planning issues, the initial one being one where we said that um our web service was saturating for contained memory and I. Think Bob asked Matthias for the application performance team like hey. What's up with this and Matthias was like oh we're, working on it a couple of months later Bob says this is still saturated.

A

The Matthias was like shouldn't, be, like you know the metrics I'm looking at say it's fine, um so we discovered there's some significant differences there, um some of them, uh you know fine, so Bob was looking at container level. Metrics and Matthias was looking at process level metrics. So obviously um the container level metrics May um include memory. That's not included in in that particular process, um but we also discovered or I say we uh Matt pointed out showed us that um the current metric we are using is not uh suitable.

A

So C advisor essentially reports three different types: uh three different metrics for memory usage in a container, uh see advisor being the thing that takes the c groups, memory, data and reports it out um as Prometheus metrics. So we have usage, um that's everything so that that sounds like what you want, um but the problem is. It includes things that will be evicted before an out of memory; kill happens. So, for instance, um if you have a file back memory from like.

B

You know opening a file.

A

um That's inactive that can be that can be reclaimed by the OS before and out of memory kill happens. So, even if your usage goes up to 100, you might not necessarily get out of memory killed. Those those pages can just be reclaimed.

A

um So working sex, as defined by C advisor working set, is a bit of an overloaded term here, because there's a few different things that define it in different ways, uh working set is defined by C advisor is usage, minus inactive file, so minus that case I said.

A

But this sounds like what you want again like this sounds plausible, um that taking the total subtracting the thing that can be easily evicted in the case of memory pressure and calling it good, um but this does still include active file back pages and Matt pointed out that it you know in memory pressure cases, the OS can also reclaim those, even though they're active, so it will result in some. You know uh crashing like it will. Just you know. Presumably the the program that's running will immediately need to re. Get that memory back.

A

um It's not guaranteed, and so it also represents an over count, because that memory can be reclaimed by the OS and it isn't directly attributable to the application anyway. Necessarily so you, you can't necessarily say that this this shows this application is saturating. Our memory, just because working set size is very close to the memory limit for the container.

A

So what we want is RSS resident set size, um which is only the anonymous memory I, which is not the file back to memory um and uh swap, but we don't use swap so it's just Anonymous memory um that sounds good, so I have um a chart somewhere, which I should probably um load up, which shows that what happens if we switch these so sorry I should have prepped this a bit more.

A

um So at the moment we Define memory saturation in terms of working set size, which means it's essentially always an overestimate, um and what we want to do is switch it to RSS.

A

um So let me just share: that's. These charts are going to be a bit noisy, but um I've excluded uh the rails app here, just um because we know this works better for the rails app. This was this was about exploring what happens with other services, so the top chart will be working set, size and the bottom chart will be resident set size.

A

um Also note that, because we're doing this uh Max buy here, um that does make some of this stuff like harder, like not harder to understand, but you will need to drill down to understand. What's going on, um because you know every time you do an aggregation, you're, sort of throwing away a layer of uh a data that you could use, so you then have to undo the aggregations. In our case, that's actually pretty easy. So it's not a huge problem, but I just want to call it out.

C

So this means it'll show, for example, the worst case pod and to figure out which part it is and how many pods are affected by the issue. We'd have to kind of drill drill.

A

Into that exactly- and we have to do this in capacity planning quite quite often oh good- it turned out um I could probably do a day or two days. It's not a huge deal.

A

um So a good example of that um Igor is actually um I think it was for the go memory um metric for monitoring where we saw that step up because of the OS upgrades we did for the postgres servers, though, that meant that the uh Prometheus DB servers were more with the were the ones with the highest level of uh memory utilization um and then, when they stepped down, we didn't see a step down in the max, because something else I think the gitly ones had added c groups, but the max had gone up again, but for a different reason.

A

So when the first, the first, the cause for the first step, went away, you didn't see that reflected in the charts, because it's aggregating across everything. um So it's kind of hard to untangle that sometimes and say like what. What is reasonable um and I'm not saying we should do this at a more granular level, because I think we already have too much noise in these metrics, um but yeah, okay, fine! If.

D

A

Gonna timeout I can just show these charts, uh so this was from a couple of weeks ago, or no, it's not it's from over a month ago, wow, um so um most services uh sort of sorry one is here in both uh most Services look kind of similar, like I, said it's kind of noisy.

A

um What I want to look at is logging, which goes from um this. So you know eighty percent then down to 40 percent to this, which is minimum eighty percent very, very close to 100. So this is using residents debt size, which should be better.

A

um We investigated that in a separate issue, I say we again, obviously Matt um what this was down to was: um okay, fine, I, don't know why Thanos doesn't doesn't like me today: um lazy freed memory, so this is where um a program can say to the OS this. This block of memory I- might want that back, but I'm done with it for now, but I might want it back, so it can next time it allocates memory.

A

It can get that memory back without causing a page vault and that is included in RSS but not included in working set. So then we have um these elastic uh sorry fluent pods containers um that show this effect. Let me see if I can load anything in Thanos, um yeah, okay, you've.

C

Unshared your screen by the way, yeah yeah, sorry.

A

I was just I was just checking. I could do that. So um this is the ratio of working set to RSS, but it might be better to do something like.

D

A

um What is it uh tonight.

D

A

Oh right, there's two at the end, isn't there.

D

A

I'll just put the two metrics.

D

On the same chart,.

A

uh Site- it's probably a bad name too. Let's call it so we already have a type label now that we don't on these because they're, not the labeled ones,.

A

E

A

Right so we can see, the RSS immediately goes up to basically the container memory limit and stays there and working set stays all the way down here.

A

um So from looking at the individual containers uh mats- or this was due to Lazy free data which is included in the RSS metric, but not in the um working set, one which is a bit annoying. So.

A

What I thought this was was in Ruby. There is exactly one case um where we use that oops. It shouldn't be that that should be that there is exactly one case where Emma bias 3 is used, which is here.

A

um So this is in uh fiberpool stack free, so in Ruby a Fiverr is um well the name.

A

The name is meant to indicate it's lighter than a thread right like it's a it's a concurrency, primitive that the uh Ruby runtime manages, whereas Ruby uses OS breads for Threads um and when the stack for when the Fiber goes away, its stack gets freed and if you have M advisory available, it will use M advice, free um and so I was able to write a test program that just generated a lot of fiber stacks and then immediately threw them away um and I could see that lazy freed memory uh increased.

A

um Because of that. So that's the implausible. uh Last week, I found out. That was wrong because this code isn't in the version of Ruby, that is on the containers that we're running fluent D on. um So this was added Ruby 2.7 uh we're running Ruby 2.6. They do use J Malik, which can use M advice free, so I need to go back and check if that's the cause, um but either way that doesn't really fix.

A

This um I think what we need to fix this and what I'm working on once I can get a reproduction case is to get C advisor to be able to report the active and inactive and non-metrics from the C group directly, and then we can just sum those because that's the metric that we want is just the anonymous memory usage um I, think Matt said that um lazy freed memory ends up in active file, which is not really accurate, but is because it's not Anonymous essentially and because it's active, so it kind of ends up in this weird sounding bucket.

A

Because of that. So that's where I am with that um in lieu of updating, C advisor. Another option might be to try and fix this on the metric side. I'm not super convinced about the options here, but a couple we have are. We could say that some services or components use working set and some use uh resident set. So we could say that fluent D uses working set and everything else uses residency essentially um or we could say just take the lower of the two, because, both in both cases, the estimation error is an overestimate.

A

So we want the lowest overestimate um but I'm a bit worried with both of those that it gets too confusing. So I mentioned at the start about the drill down thing where you have the max across, like everything within this service every container.

A

If you have the max across a synthetic metric with everything in this service, then I think I'm a bit worried that actually investigating these issues will become even harder, um because uh at the moment, for instance, you can have a case where say in the logging service fluent d, one single fluent D container is that say, 80 memory usage and then that container goes away and then a pub sub beat container goes up to 81 and so on.

A

The chart that just looks like a fairly steady line, but it's actually two completely different things that are happening. um So if we also consider the possibility that it could be two different metrics um I'm worried that gets a bit too confusing, basically uh the current metric memory metrics. We have aren't super useful, though, because of this accounting issue, um there was something else I wanted to mention there, but I forgotten what it was.

D

A

Oh yeah, so go um so I mentioned Ruby uses that in the post free go also did this for a while and then stopped using it, because people kept complaining that the memory metrics reported were um confusing in this way.

A

um Sorry, the go runtime did do this and then stopped doing this um a while ago, um I created an issue to ask Ruby to stop doing it, but, like I, said in the container that we're actually running that we see this on Ruby isn't running a version that does this anyway. So it's not coming from the Ruby, uh the Ruby runtime directly anyway, so that wouldn't get us out of this problem.

C

Says your best guess that it's from Jay Malik.

A

At the moment, so um uh previously, when Matt looked at this, let me find the issue I. Think.

D

A

I thought he mentioned something about them, not being.

A

A good way to discover where that was coming from, but I can't find it so I might be conflating it with something else.

C

But yeah like I, think Matt and I did try and run a BPF trace on that at some point. Yeah.

A

I, remember that, but I can't find it in the issue, but yeah there was. Was it just like a stripped? uh Was that a problem that there was a strict binary or something or what was the issue yeah.

C

Anyway, it was coming from the movie Yes. It.

A

Was coming from the Ruby process, but um that Ruby binary, like that Ruby version doesn't use M advice free directly, so this is J Malik, because what.

C

Else is it gonna be um so well. My follow-up question was: uh if it is J Malik, could there be an option for Jay Malik to inhibit ORS.

A

So that is I believe I did make a note of this somewhere I think there is an option for Jamie Malek to uh uh disable that let me see if I can find it.

C

Yeah because you can customize a lot of J Malik via the environment, so this could be something that we could try out fairly fairly easily I suspect.

A

Yeah we build our own container for fluent D anyway, um like based.

C

On I mean yeah, it doesn't, it doesn't even have to be baked into the Container right like we can set environment variables in.

A

The humanities.

C

A

Just don't remember if it's a build time option or not um for Jay Malik um I. Think if you remember that being that, but I can't find my note with it. But yes, um so what I've been trying to do um and failing so far, is to reproduce this just using topical posts, so just have Prometheus the advisor and fluent D running, but the same. The same version of fluent D, the other versions, don't really matter too much um and c groups V1, because that's what we're using I haven't actually been able to reproduce it.

A

Yet, when I run it locally, the the two metrics track each other pretty closely. um So um I need to dig into that a bit more because I would I.

A

You know it's quite easy to write a reproduction case um directly, just using M advice, free but I'm a little bit concerned that if I can't reproduce the exact case that we're seeing and we don't know why, exactly why we're seeing it then I might not be looking at the right thing, because I've already been looking at the wrong things several times when I've been looking at this so um yeah, that's where I am with that um I'm just going to Peter out unless anybody else has got anything else to say.

A

Okay, Sylvester I guess.

E

B

E

So the merge Mr yesterday like caused some problems on. Thankfully you just uh studying. Clearly uh it's really weird, because it's uh because it's effective also like a step like uh the stack went to the and it is lightly because of the of the Mr, where we shifted the feature flag into the instrumentation layer like.

E

But the odd thing is that I found about while working on the Mr and it was it, showed up in one of the the pipeline runs and I caught it and I fixed it, and then, when I, when it was accidentally reintroduced the fitting tests that caught it originally didn't fire. So that was a strange part that I'm trying to replicate now uh I'm trying to understand how how it's, how it's, how it's not through the specs uh so so for that is reported right now we're just outside of this break.

E

The original am I into smaller pieces that are more just just a pure reflector and then fixing other things. uh I found I found one issue, yeah solution, so.

A

Just to clarify the issue here is that we introduced a feature flag check inside our Reddit instrumentation, but our feature flag checks, use redis and we were trying to catch the result of the feature five check to resolve that, but obviously that doesn't work. If you need to hit redis anyway, because you will end up hitting redis to find out what the feature flag is.

A

Instrumentation won't have the feature flag cached. Yet so it will.

E

A

The feature flag is and.

E

A

E

A

Because we're using kubernetes as far as I'm aware this didn't cause any downtime right like they should failed to deploy because the pods didn't become ready, yeah, um so yeah, but uh yeah, I, I, don't know I, don't know how we detect this because I think um in the specs. A lot of things. uh Certainly the feature the stub feature: Flags thing: I'm, not sure if it actually uses the real feature flag, um maybe Bob. Do you.

F

Know does it use the real thing like packing? No, no, it doesn't go to Red. This doesn't go too. It's just remember.

A

Yeah I think it just stops everything to True unless you stub it to false explicitly right. Yeah.

E

Yeah the the test, that coded was actually the feature flag specs. It's also it wasn't our spec. It was the feature flag, spec, so I think one of one of two of the features like spec would actually run in uh in rails rails cache, so they actually research values like uh there are a couple of specs that that test for the feature for the case, where there's a cashmere, so he actually looks at redis uh or rather radishes.

E

It looks up the active record, so those tests caught it, but they didn't catch it a second time around where we accidentally introduced it, because the the way, the way around getting the way around that to Breaking that recursive condition is to only run it if it's a multi-key multi-key lookup, because for us for flipper slippers, just a single key lookup. So if you run it only for multi-key, then we sort of get around that, but it was accidentally reintroduced when I added in the safe request store. Yes,.

A

Because we moved that outside of the place where we would know if it was a multi-key lookup didn't we, because the request store is yeah the level above where we call that, and we know that.

E

Yeah yeah, so that's a bit the the same test that caught it originally did catch it again, so I'm trying to replicate it locally and I I've got to fix. Let me just try quickly.

E

I found I I hear so this is one. This was one of the reasons why, when you run GDK rails locally doesn't catch anything the so I took it around and instantly it broke the local GDK locally. But the odd thing is that the test still pass so I'm trying to get a test to fill on on this comment, which is just a comment in between uh the merge and the revert.

A

Yeah, that is, that is frustrating if it failed before, but pass us now. um Yeah.

E

The other is to just get rid of it in the meantime, while I, try and figure out so I'll probably work on this comment to replicate the issue and then uh push like patch it up later.

E

A

We could, at the start of the request, somehow um put this feature flag in the request store um and then, if it's in the request store, the instrumentation uses it and if it's not, it doesn't but I. Don't like that, because the start of a request is kind of a fuzzy thing um in in the in the rails. App so um I think it's probably better to do a different approach.

E

Yeah I think for now I'm trying to think of a way to not add in so much, not that in so much uh metrics and lock such that we need the picture of like wasn't I think the primary go on the picture was to control how much uh extra long lines and metrics that we are pumping out so yeah uh yeah, I, I I've got the plan is to just add in to just add in the accounts of allowed a loud cross lot, but the counts of how many a lot.

E

Of course, lot commands are happening. I I I realized that if it's a cross-lock command with multi-keys, it's almost definitely going to be invalid. What's the chance of two keys being attaching to the same key slot? Is there are 16000 key slots, so it's very low to begin with and.

A

It's because we do have some cases where we use the curly braces, so those will always be right. uh Those.

E

Will always be correct, but for those those are not wrapped in the allowed, though I I found a.

A

Few ah okay I see what you mean. Yes, yeah.

E

So so those would pass the specs, because it's not wrapped yeah, but those that we wrapped within allow commands that, like I mean we could check on other implications, but if it's not using hashtags, it's likely going to going to be an invalid cross slot come on anyway. It's gonna, it's gonna, be a cross-lock command. So we could. We could just count the number of allowed of of cross lock commands that are wrapped in the allow block, and then we could put it within every single log.

E

Like everything like the request, every every block line that we lock for the request. You will just have a redis allow cross and then from there we can extrapolate. You can estimate fairly cheaply what? What is the amount of work we need like? What's the amount of cross requests happening right now without having to pump out extra long lines or electron metrics? And we could do this with this switch of like problem until we find a way to catch it. The specs.

F

Why don't we just remove the feature flag, check and use an amp variable.

E

Yeah that would work too, but I was thinking like doing like the trade-off like like. Why are we doing this? In the first place, to estimate how much uh cross slots accounts? Are there happening? So if you will do it without the future effects without having to use and bars I think, but it means you have to go into, they have to go into the chat. The helm, charts really pull and then push several comments. Not several comments uh get it into like yeah.

C

We can yeah, we can, we can override it, so it doesn't require a change to the helm chart itself, um but it does require a change to the uh gitlab Dash com yeah, but that that's a that's a single change, so I think that's a feasible work around.

E

E

F

Should work for this kind of cases I think we should think about that.

E

Yeah I mean we should get it to work instead of like betting it, but I I believe the test should check for any sort of recursive issue. So we prevent this all together. Then we start like outlining sounds like best practice, because I I checked the features like dogs, and there was no mention of like uh being caution, cautious about introducing at places that it could cause such problems. I did but then again this is a fairly hot path, so I don't think we.

F

We have done it in the past, like just avoided using feature Flags, because we know that they're not not as they're cheap, but not free.

E

E

All right yeah, so that's about it for the issue.

E

Yeah I've got nothing else for my side. If anyone else has anything.

F

I could show the occurrence-based SLA thing uh because that hasn't been brought up yet I'll fill in the agenda later and clean up. My screen a bit so I have something to share.

F

uh So let's go to, can you see my screen and am I on the issues page or am I in another browser window here I'm in another browseries yeah yeah? This is issues and where is the.

F

So here I've already Linked In Thanos gitlabcode.com's availability uh for the past. It's going to be a few days.

F

A little under one week and we're at uh 99.91- and this is counting um successful, aptex events and successful requests uh for the four- uh what we call primary services so API kit registry in web um yeah. We do that by just summing everything over time over the 30 days. uh Yeah, that's that's about it.

F

I was going hoping that this would be closer to 99.95, so we could use it for General SLA metrics, but that's not it so uh I'm hoping we can use it as an internal measure, because this is going to be closer to air budget for stage groups than the general slas to right. Now, let's see what those look like.

F

For the past week,.

F

Yeah, so that's about what I wanted to share.

C

So go ahead: John.

A

Thanks, um do we like.

C

A

Know the issue with this before was, you know, request volume, but you said the average is out over a month and it should be fine um I'm, assuming you're not going to pay for dedicated if you're, not if you don't have a reasonable amount of requests, but is there like a level below like a request volume below which this is just not going to work.

F

um I think it's always going to work better than the average overtime thing, but.

F

Yeah um I I, don't know, I, don't know um the the level below which it wouldn't work should be measured in a month. If you have less than foreign how much 100 000 requests a month, then something like that.

A

All right, uh eagle.

C

Yeah I was gonna: ask why why can't we use this metric I mean I I, don't like the number either, but it is, it is the true number right or it is. It is a truer number. So what what.

F

Are the constraints.

B

C

What are the constraints on adopting that? I guess is. Maybe the question the.

F

Because they're, not it's a number that we publish, and that is um mentioned sometimes if we have to in contracts and stuff so.

C

Okay, so we do have a contractual SLA I.

F

I don't know, but.

C

F

C

Would be good to know actually I don't know either.

F

I think I think sometimes we do mention it, but I I don't know, maybe Rachel. You know yeah.

B

They are um there's one contract that I know of that. Has it I, don't know if there's more than that, but the the SLA is written into some contracts.

C

Right and I'm guessing that SLA is, has a definition in the contract.

B

Yes, um I say it like that, because um I don't know the specific wording of it, but I do know that the last time that we were looking to change how the SLA was done, there were concerns about making sure that it was correctly reflected there um I think.

B

If we've come up with this, if we have one that is more true or more proper, then it's really a case of writing it up in English what that is and trying to get that adopted and taking it to Steve and saying this is what we would like to do on gitlab.com and how do and how do we move the process forward? That way, um I!

B

Don't think that we should be stuck to an SLA calculation if it's not the most correct or the most true version that we have and I think it's completely fine that these things evolve over time. um If, if we've found something that is better, then let's start the process of having that be the one that we adopt.

C

Right and we can also have multiple slas concurrently and say this- is contractual SLA V1 and we're now rolling out V2.

B

C

We can be be explicit.

B

And that all starts from an issue so creating the issue that says this is how we are calculating it now pros and cons of that. This is what we want. This is how we want to calculate it, going forward, pros and cons of that, and then we raise it with Steve and say right: how do we? How do we take this forward?.

F

A

F

Was just gonna, say: I think we already did go V1 to V2, didn't we Bob yeah yeah. We did and there's an issue open from like a year ago to do exactly this and now I was able to do it because Andrew needed it for dedicated.

F

So now we have actually something to report back on that issue and.

B

Specifically, staying on that issue as well like we're doing this for dedicated, so let's do this form for success as well.

F

Yeah, okay, I'll bring it up there and.

C

Then both so are you gonna kind of bring that up with Steve or what's the process? That's.

F

Writing it up. First, in that issue, where we are at now, what it.

C

F

I'm going to drop it in the channel and then we're going to see how we move it forward, um but I think Steve is going to want to see at least like a full month of data, so a number for a month and then putting both numbers next to each other. So, um let's start by writing up and then move it after we have numbers for an entire month.

C

Amazing, thank you.

B

And is there anything else anyone would like to demo today.

D

Maybe I can just show what I was uh doing.

B

D

So, just to give a bit of context, let me share my screen.

D

Yeah a couple of days ago, there was an issue in staging because this request, urgency label was missing for the error rate. um They request. Urgency is not that much relevant for the error read, but and that's because it was missing here, but we have a validation to guarantee that the same counter has the same label. So it crashed it. So we had this beautiful stack Trace in staging after enabling the feature flag.

D

Then another merge request has been rated to not this one yeah yeah. So this merge request fixes that adding the request, urgency and yeah there is a spec now to guarantee that they will have the same labels.

D

But I was thinking in a way that to avoid that because well, the the labels are being passed as a array of labels and I had the idea of changing and testing one by one, which would diminish our likelihood of calling the increment method and for getting an important label.

D

So this is how we I quickly drafted this yesterday. We would call it this way and yeah. The only change we that would be required would be the aptx class at the increment method to include the other uh label names and the same thing for the error rate.

D

It's a very tiny, tiny change, but at least for me it seems more um less. It hope run this way because we are forced to remember all the labels required for the specific metric.

F

Does the stand work for the other slis that have different labelings.

D

uh I was inspecting yesterday, the code and all of them they have uh the the request slas, they have the same labels. They just have different values and this work. um Well, maybe we don't need exactly this solution.

D

I was trying to think in other possible solutions to avoid uh this error in the future. Not just this one I I thought and.

B

D

F

I think being able to do it the way you've shown like with the the method, signature and getting it at uh yeah? Well, because it's Ruby but like getting it it that gets it more early like even earlier than we would. If we remembered which labels we initialized in SLI with.

D

Yeah, we still require this back there because well we'll be dynamically tied with language uh but yeah. It's not that easy to forget when you have it on on the method. Definition.

D

That's awesome.

B

D

B

For sharing that right now anything else, anyone would like to demo today.

B

All right well, thank you so much for the conversation, thanks for sharing all the things. um I hope you all have a good rest of your day.