From YouTube: Scalability Team Demo - 2022-02-24
B
Thanks. Yeah, I'm sort of stealing the show from the other people who work on the Redis upgrade, but what I wanted to talk about is not the Redis upgrade itself. Sorry: we just upgraded Redis Cache to 6.2, and "we" means Alejandro and Igor, not me. But I was interested in this because of latency spikes we were seeing on the Git service. I was curious whether those got better, and they still happened, I think, which is disappointing, but it's also interesting.
B
So I just wanted to quickly talk through the graphs I collected for that.
B
So now I should share my screen. I wrote a comment about this. Can you see my browser? Yeah? Okay, I wrote a comment about this on the issue where we talked about upgrading, and the problem really is this: it's really straightforward to see in the Rails request logs. If you filter for type git and you look for requests with a duration longer than one second, then you see these bursts, and this is a problem; this shouldn't be there like this.
B
The Git service is where the Git HTTP traffic goes, and all these requests need to do is look up a project and a user in the database and say: this user exists, the project exists, here's the Gitaly server, now go away. So that should take less than 100 milliseconds; it's slow enough as it is, but it shouldn't take more than a second. And the weird pattern we have is that when it takes longer than a second, it's usually because of Redis, because of Redis Cache.
B
I think, if I take this filter out, then you sort of see the same thing. Oh, they move a bit. Why did they move?
B
Yeah, so, yeah, now these are actually... if I exclude these, then we have way fewer, though there are still some left.
B
So that's funny, but there's this correlation with Redis Cache, and what we realized, what we were even aware of for a while, is that Redis Cache has eviction bursts. It's supposed to evict, because it's configured with a maxmemory limit, so whenever it hits the limit it needs to evict stuff. What we were hoping was that Redis 6.2 would push the bursts down and just evict at a more constant rate, because it contains code changes that should make it do that differently, and you can see the changes if you look at the eviction rate.
B
So, okay, let me also refresh this, because it's not from the last hour. What is interesting here: this is the eviction rate, a counter in Redis that we can track via Prometheus, and what's interesting is these bumps down here. I think these bumps are the effect of the new eviction code in 6.2, because if I go back in time 24 hours, the bumps don't exist; you only see the peaks. So something did change, but it's not working the way we want it to, because we want... yeah.
B
So it has to be that thing, and it...
D
...still has these bursts, so that's, yeah. But any memory pressure can induce an eviction burst; even the client connection buffers count against that budget. But that's not...
B
So this means that it does 500 microseconds of evictions, and then something else, and then 500 microseconds of evictions, and then something else. So why isn't it... why does it behave like this sometimes, and then sometimes it still gets to do this much work?
B
So then it's easy to understand that you will see bursts if you go over the limit by a lot, because it's only doing that. Right: the main thread is a single thread, it's in a loop, and it's only evicting until it reaches the goal. But now, every time it evicts a key, it looks at the current time, and if it's been in the eviction loop for 500 microseconds, it stops.
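To make that mechanism concrete, here is a toy C model of the time-bounded loop as described. This is a sketch under the assumptions above, not the actual Redis source; the memory state is simulated and all names are invented.

```c
/* Toy model (not the Redis source) of the time-bounded eviction loop:
 * evict until back under the memory limit, but stop after a 500 us
 * budget and reschedule, so the single main thread can do other work. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define EVICTION_TIME_LIMIT_US 500

static uint64_t monotonic_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Simulated memory state: pretend we are 100000 keys over the limit. */
static long keys_over_limit = 100000;

static void evict_one_key(void) { keys_over_limit--; }

/* Returns 1 if the eviction pass finished, 0 if it ran out of budget
 * and must be rescheduled ("0 seconds in the future" on the event loop). */
static int eviction_pass(void) {
    uint64_t start = monotonic_us();
    while (keys_over_limit > 0) {
        evict_one_key();
        /* The clock is checked after every key, as described above. */
        if (monotonic_us() - start >= EVICTION_TIME_LIMIT_US)
            return 0; /* budget spent: yield back to the event loop */
    }
    return 1;
}

int main(void) {
    int passes = 1;
    while (!eviction_pass())
        passes++; /* a real server would serve requests between passes */
    printf("finished after %d passes\n", passes);
    return 0;
}
```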
D
Well, I mean, I don't know what to say; I haven't looked at this new code. But it sounds to me like the next step would be to do some profiling during, or surrounding, one of these bursts. Particularly like the... oh gosh, Zoom, the Zoom widgets are in the way, I'm waiting for them to disappear. Oh gosh, Zoom. Anyway, the 10-minute window that includes one, like from, say, 15:27 to 15:36, would be sufficient.
D
I think just looking at the middle of the graph that you've got on the screen, yeah, that would teach us a lot. And it looks like, because of the time scale that we're looking at, we could afford to do, you know, like 500 samples per second; the sampling rate could be low, not super aggressive, yeah.
D
So I wouldn't feel bad about running a profile for multiple minutes at kind of a modest rate, and it can tell us more precisely the duration of the bursts. Then, if we want to, once we get our first kind of medium, kind of low sampling rate pass, we get a panorama of where the activity is centered. This is obviously going to be CPU-centric, so we'll see it in the profile.
D
We can predict roughly when the bursts are going to happen and then use a higher sampling rate closer to when a burst is going to happen, if we deem that we need to, which we may not.
B
Yeah, no, that would be... that would be interesting, because it would corroborate what the code looks to be doing, especially if you use this visualization that I don't know how to make, where you get the dots over time. Yes.
B
Yeah, because that should show the cluster of the...
B
I almost wonder if... either it has nothing else to do, so there are no requests coming in for some strange reason... I've wondered that too, but it doesn't sound plausible, does it? No. But how, how else can it chew through so many keys in 500...
B
...bursts, yeah. Or, and this is something I don't understand so well: if the 500-microsecond burst is up, the new eviction code will schedule a proc on the event loop in Redis.
I don't know exactly what that means, but it says: zero seconds in the future, do this, do another chunk, another 500 microseconds. So I think the idea is that that next chunk of 500 microseconds joins the back of a queue and other work happens first, but maybe there is something pathological where it's at the head of the queue, and even though stuff is supposed to happen in between these little 500-microsecond blocks, they're all back-to-back.
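A toy illustration of that queueing concern follows. This is not Redis's actual ae event loop; the queue, task names, and the two scheduling policies are invented for the sketch. A continuation that re-enters at the back of the queue lets other work interleave, while one that re-enters at the front runs its chunks back-to-back.

```c
/* Toy event loop comparing where a rescheduled eviction chunk lands. */
#include <stdio.h>
#include <string.h>

#define QCAP 64
typedef struct { const char *name; } task;

static task q[QCAP];
static int qlen = 0;

static void push_back(task t)  { q[qlen++] = t; }
static void push_front(task t) {
    memmove(q + 1, q, (size_t)qlen * sizeof(task));
    q[0] = t;
    qlen++;
}
static task pop_front(void) {
    task t = q[0];
    qlen--;
    memmove(q, q + 1, (size_t)qlen * sizeof(task));
    return t;
}

static void run(int evict_chunks, int front) {
    push_back((task){"evict (500 us chunk)"});
    push_back((task){"serve request A"});
    push_back((task){"serve request B"});
    while (qlen > 0) {
        task t = pop_front();
        printf("%s\n", t.name);
        if (strncmp(t.name, "evict", 5) == 0 && --evict_chunks > 0) {
            /* "0 seconds in the future": reschedule the next chunk. */
            if (front) push_front((task){"evict (500 us chunk)"});
            else       push_back((task){"evict (500 us chunk)"});
        }
    }
}

int main(void) {
    printf("-- continuation at back of queue --\n");
    run(3, 0);
    printf("-- continuation at front of queue (pathological) --\n");
    run(3, 1);
    return 0;
}
```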
D
Maybe, yeah. Well, the profile should show us, should answer some of these questions. I think you're asking exactly the right questions, by the way. Those are clear, great next steps, thanks.
B
We should figure out how to do this profile in a safe way, and whether we need high resolution or not. So this is happening right now, right? This happens continuously? Yes.
D
The end time is now, yeah. Yeah, that's great. So probably after this call I'll do it.
D
Yeah, so I want to catch this before the daily workload cycle starts to wane. But apart from that, I think we can gather more information today. Oh, would you share the link to the Thanos graph you were just on? Yeah.
B
I just went through the... well. We had an incident a while back, which was sort of a borderline incident, but our SLIs decided it was an incident on the Git service, and then we realized it was because of these evictions causing latency. So that was the first one, the log graph.
B
So we knew about that, and that's how we came up with, we rediscovered, that we wanted to upgrade to 6.2. And I saw Alejandro posting that he was working on that, so the other day I was just curious, like: hey, refresh, is it on 6.2? I actually had a Thanos graph with the versions open, so I just kept refreshing that until it reached 6.2 in production, and then I looked, and it didn't get better. So then I just looked at the log graph again.
B
Another thing: I mean, we have one tunable thing, which is these 500-microsecond windows; we could make them shorter or longer. But I don't have a good reason to think that would help.
B
It may be an outcome that we decide we need a longer or shorter window for these things, but I think we first need to gather data, like Matt was saying.
A
I had a question around the agenda item. We happened to notice it because you were looking at the graphs for other reasons, and it seemed that the spikes didn't get bad enough that anything actually alerted. I'm hesitant to put in excessive process around things like upgrades, but would it be helpful to... was there something that we weren't monitoring that we should have monitored? Should we have kept this open for longer to see it? Like you're...
B
...talking about a different problem now, because something went wrong during the upgrade, where we started doing background saves on Redis Cache.
B
Well, we do have a latency Apdex, and it wasn't complaining, so maybe it wasn't a problem.
B
Only now it's not loading; maybe I need to make it shorter.
B
Okay, well, but...
E
Let me check. So it was about the Thanos store SLI, and I think that, if you go to the production channel, the alerts are still active. I think.
B
Well, we know what this looks like. These are meant to be the four golden signals, and this is the Apdex we have. Oh, no data. This is measuring latency from the Rails client side, and there should be a line in there for the SLO.
B
So, to get back to Rachel's question: apparently we were not crossing that line.
D
Yeah, there's a... I just added it into the Zoom chat, there's a link. Since you're screen sharing, can you click it? It's a...
B
Yeah, so this is the upper... this is the first panel, literally, on the page, but zoomed in. And what are the lines again?
B
So I suppose one of these must be the one-hour window and the other the six-hour one.
C
...taken a long time. Like, over time, you need to have spent five percent of what you're allowed to spend, and then it needs to have been especially bad for half an hour for the alert to trigger.
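A minimal sketch of that two-condition rule in C, assuming an availability SLO. The numbers and the burn-rate threshold are made up for illustration; the real rules live in the alerting system, not in code like this.

```c
/* Toy check (illustrative numbers only) of the alert logic described
 * above: fire only if a meaningful share of the error budget is already
 * spent AND the recent half-hour window is still burning it fast. */
#include <stdio.h>
#include <stdbool.h>

int main(void) {
    double slo = 0.995;                  /* assumed availability target */
    double budget = 1.0 - slo;           /* allowed failure ratio */

    double spent_fraction = 0.07;        /* share of the period's budget used */
    double half_hour_error_ratio = 0.02; /* error ratio over the last 30 min */

    bool enough_spent = spent_fraction >= 0.05;                   /* "spent 5%" */
    bool especially_bad = half_hour_error_ratio / budget >= 2.0;  /* assumed burn-rate bar */

    printf("alert: %s\n", (enough_spent && especially_bad) ? "yes" : "no");
    return 0;
}
```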
A
It's possible, yeah, because it would have alerted at some point, and then we would have dealt with it because it had alerted. It's just that we got early warning because you were interested in something else.
C
But you do bring up something interesting, Rachel: how do we tighten thresholds? Like, our goal is to always keep tightening them if we can, and, like, now I'm looking at metrics for specific SLIs that I'm going to tighten after I'm done. But then for these ones that we haven't looked at in a while, how do we decide whether to tighten them or not?
B
Oh, this one's difficult, because we still have those latency... these eviction spikes. It's too bad that the page doesn't load, because I was actually looking at it, and this latency graph does look better after the upgrade, from what I remember. I don't know if other people saw that. So, after...
D
A few people have stepped up to deal with critical issues as they emerge. I know Michal has done some intermittent work on this, for example, but I don't think it has a new owner; at least, to the best of my knowledge, we don't have a new owner for the service, and I think there's general recognition that there's...
D
So, through an uninteresting sequence of coincidences, I ended up on a call yesterday with some of the DBREs to chat about a couple of points of concern related to us building our own Postgres packages. Currently in production we're running 10 Patroni nodes. Two of those Patroni nodes are running a custom Postgres build that we built, and there are some problems with that build that I'm going to gloss over for now. The other eight Patroni nodes are running the community build.
D
These hosts are still running Ubuntu 16.04, which is way past end of life, and upstream package providers are no longer building for it; the Postgres community, included, is no longer building packages for point releases. So we're stuck running Postgres 12.7 in production currently, and we want to get to 12.9 because it's got some improvements that we'd like to have, but there are no packages for 12.9 for Ubuntu 16.04. That's the backdrop.
D
Yes, yes, exactly. So we don't run our Omnibus builds on these nodes, and, I'm not privy to all the decisions that went into this, but apparently we decided that we were going to try to use the Debian package recipes for building the postgresql-12 package and its friends. Here's the catch.
D
There are a few problems with doing that, but the one that concerns me most right now is this. Most of the code that we run in Postgres is part of the Postgres core; that is to say, we run about 10 or so extensions to Postgres in production, and all but one of them are part of the Postgres core source code, so for those there's absolutely no risk of version incompatibility.
D
But there's one called pg_repack, whose job is literally to rewrite data files in the database in a more lock-friendly manner than the native behavior. That was built against the Postgres 12.4 headers. We are not running Postgres 12.4.
D
Now, that's a great question, and yes, that's a major version boundary concern. Even minor versions, though, can potentially change things. For example, if a struct added or lost a field, we'll now be looking at the wrong offset into that struct, because we used the wrong header file when building pg_repack.
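A toy illustration of that offset problem: these are invented struct layouts, not real Postgres headers. An "extension" compiled against the old layout reads the wrong field from memory laid out by the new one; in real life this is undefined behavior, which is exactly the danger.

```c
#include <stdio.h>

/* Layout the extension was compiled against (think: the 12.4 headers). */
struct tuple_v1 { int id; int len; };

/* Layout the running server actually uses (think: newer headers),
 * where a field was inserted in the middle. */
struct tuple_v2 { int id; int flags; int len; };

int main(void) {
    struct tuple_v2 server_tuple = { .id = 7, .flags = 1, .len = 4096 };

    /* The extension dereferences the server's memory using its stale
     * layout, so its "len" actually lands on the server's "flags". */
    struct tuple_v1 *ext_view = (struct tuple_v1 *)&server_tuple;
    printf("server len = %d, extension sees len = %d\n",
           server_tuple.len, ext_view->len);
    return 0;
}
```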
D
This is kind of freaking me out, because there are so many bad things that can happen when you have, effectively, you know, incorrect dereferences, when extensions have...
D
Extensions run in the same context as the server, at least the extensions that I've worked with. Just to be clear, I've only worked with in-core extensions, not these third-party extensions, which use a different framework, but they still end up compiling using headers from whatever Postgres server was, you know, used as part of the build process. I just checked this last night to be sure. So that's the thing that's kind of bothering me right now, to the best of my knowledge.
D
I think we do use pg_repack, irregularly. I don't think we've run it very often, but I'm not trying to stay in the loop on our database maintenance. So anyway, that was something I came across yesterday, and I thought...
D
I was hopeful that someone had already decided to address this, but it turned out, when I chatted in that meeting, that it wasn't on their radar, so I raised it as a concern.
D
I wasn't sure yesterday morning whether I was overreacting to this or not, but I'm pretty confident now that I'm not. I think this is a serious problem. I think this is a serious risk, and I think we're super duper lucky that there is only one extension in production that falls into this category of crossing version boundaries.
D
Well, I raised it first in a Slack conversation and then in that meeting I mentioned, yesterday morning. The meeting went long, and I think the folks on the call were pretty receptive to the concern, and when folks had to start dropping off the meeting, we agreed that we would switch to async. They suggested using the existing change issue for the async conversation to continue, and I followed that request. In hindsight, I feel like maybe a separate issue would have been better; I've got about two or three pages' worth of notes that I've added in the long comment thread on that issue.
A
I think it's one of the hard things about not being responsible for the database ourselves: all we can really do is make sure that the concern is raised and heard, and then leave it on their prioritization list to take care of. Which is frustrating at times, because we can't just do what we need to do to resolve it, but at the same time we need to leave it with them.
D
Yeah, I agree. Maybe I'll do that and link it to the change issue, just so there's a cleaner place to talk about that particular point.
D
Yeah, that sounds good. So I'll do that, and then I'll step away from it; there's obviously a bunch of other things I need to focus on as well. But this seemed like a big deal, and I shouldn't sugarcoat it: this is a huge problem. This can literally crash our servers or corrupt our data files, and the nature of the problem could be...
D
It really is kind of up in the air how big the impact can be, and each time we do a minor version bump the risk is reintroduced, because it's not about the one-line change that we've made to the Postgres source code. It's about all of the differences that have accumulated in the Postgres source code between 12.4, when the extension was built, and whatever version we're running now.
F
Yeah, that's all I had on that topic.
E
Yeah, sure. I think it went relatively well, except for what you found, which is that we had to disable the background saves. But we had a script that Igor developed, and that worked out pretty well. We just had to make an adjustment for the Sentinel nodes, because Redis Cache has external Sentinel hosts, so we had to modify the script to take that into account.
E
As a result, it's not an optimal script now. The way we do it now in the script is that you reconfigure the first Redis node and then you reconfigure the first Sentinel node, but there's no reason why it has to be that way; they're not correlated, the Redis node and the Sentinel node, but...
E
No, no, just a general process thing, just thinking of what could improve. But with regard to the background save problem: the thing that was a bit of a surprise is that this seems to have been a change in Redis 6.2, because we had the same settings as we had with 6.0.
E
It's just that it seems that now, if you don't have save settings, it puts the default settings in your configuration. So to keep the old behavior, you have to have a save setting with an empty string. Instead of having no setting, you have to have a setting with an empty string.
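In redis.conf terms, the workaround being described is the stock `save` directive with an empty string; a minimal sketch, assuming the goal is to keep RDB background saves disabled on 6.2:

```
# redis.conf: on Redis 6.2, omitting `save` entirely now falls back to the
# default periodic RDB snapshots, so disabling must be explicit:
save ""
```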
E
They changed it so that there's no automatic host-to-IP translation, so we had to adjust for that as well. So, yeah, those were two changes where we had to ping the Distribution team, so that they are aware that there might be external customers that are also affected by this. It's probably unlikely. It's probably...
B
I just wonder... Redis was sort of created by one person, and I forget his name, but his nickname is antirez or something, and he was the maintainer of Redis for a very long time. But he stepped down, I think, a year or so ago now. I wonder whether this is related to the maintainer stepping down, whether they have a different attitude to changes now, or not. But I'm speculating now.
E
Yeah, I mean, if you look at the changelog, for example, the change in behavior for background saves is just not mentioned as a breaking change there. You have to go to the bug fixes, and it says: oh, changed a behavior where no setting will give you the defaults. So if you were just reading the breaking changes section, you wouldn't have caught either of these changes, which is weird.
B
It's a huge change. I mean, Redis is part of this, I guess it's sort of part of the NoSQL hype, where databases could look fast because they don't store your data. I mean, the other famous example is MongoDB not flushing writes right to disk. But if you boot Redis, it doesn't save your data to disk, and I remember the Redis creator having a blog post where he says: by the way, my entire blog is on Redis, and saving is off.
B
So: I almost lost my blog because I restarted the Redis process that, it turned out, had been running somewhere. So it's just, I don't know, this very NoSQL thing, that it doesn't save by default. So that's quite a big change to call a bug fix. I mean, it's probably for the better for most people, but...
A
Well, thanks for the good conversation. I hope you all enjoy the rest of your days, and I'm looking forward to seeing you all again soon.