From YouTube: Redis rate limiting in k8s retrospective/walkthrough
Description
Walkthrough and retrospective of the work that the Scalability:Projections team did to migrate redis rate limiting from running on VMs to running in kubernetes.
A
Yeah, okay, so just a very brief recap of what we're going to do, for future us and anyone else who is interested: we're going to walk through the work that we did to migrate Redis rate limiting from Redis Sentinel on VMs to Kubernetes, walk through the issues that we hit, and generally discuss. So, yeah.
A
Do you want to start, Bob, or do you want me to start, or... go ahead? Yeah.
B
So the way I think I set up this project at the beginning was to just start with the difference.
C
So the way we tackled that was: we'll do Pre first, then we'll do staging, then we'll do production. Staging is the closest thing that we have that would resemble production in the way of doing things. Because we already had an instance to start from in Pre, we just needed to tweak it a little bit to not be a mixed Redis instance. By a mixed Redis instance I mean some Sentinels on VMs and other Sentinels in Kubernetes, with a replica that we'd fail over.
C
That
was
the
initial
plan
we
had,
but
we
decided
against
that
because
we
didn't
need
it
because
we
didn't
really
care
about
migrating.
The
data
that
only
lives
for
a
few
minutes
and
just
yeah
do
a
clean
switch
in
the
configuration
and
go
that
way.
So
the
first
thing
we
did
was
create
these.
C
These
pre
issues
here
so
update
the
red
escalator
begins
to
be
Standalone
and
then
start
using
it
from
the
from
the
configuration
where
we
did
that
we
got
to
see
a
bunch
of
things
regarding
or
existing
metrics.
So
that's
the
observability
issue
here.
C
Lots
of
things
were
like
not
not
working
because
they
were
relying
on
an
fqdn
label
to
be
present,
so
we
need
to
update
that
to
use
instance,
the
instance
label
instead,
which
is
based
on
IP,
address
Port
name
rather
than
domain
name
the
we
were
doing
that
in
parallel.
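For illustration, the kind of query change involved might look like the sketch below; the metric name and label values are placeholders, not taken from the actual dashboards discussed here.

```python
# Illustrative only: placeholder metric name and label values, assuming a
# standard redis_exporter target scraped by Prometheus.

# On the VMs the series carried a custom fqdn label, so selectors looked like:
OLD_SELECTOR = 'redis_connected_clients{fqdn="redis-ratelimiting-01.example.internal"}'

# In Kubernetes there is no stable FQDN on the target, so selectors switch to
# the instance label Prometheus attaches automatically (scrape target IP:port):
NEW_SELECTOR = 'redis_connected_clients{instance="10.0.12.34:9121"}'
```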
C
In
parallel,
while
we
were
spinning
up
the
well
we're
doing
the
same
work
for
staging
for
staging
spinning
up
a
new
cupboard
like
that,
we
need
to
start
with
a
brand
new
redis
instance.
Spinning
up
a
new
instance
meant
creating
the
designated
note
pool
in
terraform.
Stephanie
are
probably
going
to
be
better.
C
So
if
I
forget
something
interrupt,
spitting
up
the
the
new
now
preparing
the
new
note
pool
and
terraform,
adding
the
new
redis
instance
to
to
a
tanka
deployment
setup
everything
there
was
already
prepared
by
the
Frameworks
group
who
was
working
on
them
container
registry
cache
instance.
So
we
just
used
their
stuff
and
tweaked
it
where
we
needed
it
like
where
we
needed
slightly
different
configuration
and
so
on.
C
So
after
we
done
that,
we
could
just
switch
over
to
use
the
new
redis
instance
in
staging
and
there
we
noticed
some
problems
because
well
nobody
we
couldn't
get
this
right
from
the
first
time.
The
first
problem
we
we
know
this
was
a
different
database
name
and
different
secrets.
So
then
we
needed
to
correct
the
the
database
name
for
Reddit
Sentinel
to
be
this
magic
string,
my
master,
which
is
the
default
database
name
for
redis,
and
then
there
was
the
the
thing
with
Secrets
being
stored
in
gkms.
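For reference, a minimal redis-py sketch of where that master name is used; the hostnames are placeholders, and this is not the application's actual client code.

```python
# Minimal sketch, assuming a Sentinel reachable on the default port 26379.
from redis.sentinel import Sentinel

sentinel = Sentinel([("sentinel.example.internal", 26379)], socket_timeout=0.5)

# "mymaster" is the stock master group name in redis-sentinel.conf, which is
# the value the rate limiting configuration had to be corrected to use.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
master.ping()  # resolves the current master via Sentinel and talks to it
```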
A
There was also the issue here that we had not originally set up External Secrets correctly to start with. There's an External Secrets object in Kubernetes that was not set up correctly.
A
There
was
an
external
Secrets
object
that
one
is
slightly
farther
above,
where
it
says
rotate
British
rate
limited
secret
on
staging,
and
there
was
a
separate
Mr
that
we
had
to
create
for
that
that
once
we
did
that
it
was
there.
There
was
also
also.
A
Oh
no
I
haven't
even
touched.
The
load
balancer
chip
I,
was
also
noting
that
when
we
created
staging,
we
also
forgot
that
we
had
to
create
a
secret
in
Vault,
and
there
was
all
the
back
and
forth
of
where
these
secrets
lived,
because
they
moved
in
between
the
first
setup
of
these
things
in
kubernetes
and
this
one
so
like
for
any
future.
People
make
sure
you
have
all
of
your
secrets
in
place
before
you
do
this.
Otherwise
it
doesn't
work
and.
C
If
I
got
it
right,
we've
got
the
correct
helpers
now
in
the
tanka
deployments,
like
it's
a
helper
method
that
you
can
just
call.
This
is
what
the
secret
called
This
is
the
environment.
This
is
the
namespace
and
then
it
will
build
the
string
to
get
the
secret
from
the
right
place,
but
you
still
have
to
manually
add
it
to
Vault.
So.
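As a rough illustration only: the real helper is a function in the Tanka (Jsonnet) codebase, and the path layout below is an assumption, but the idea is roughly this.

```python
# Hypothetical sketch of what such a helper does; the actual Vault path
# layout is an assumption and may differ.
def vault_secret_path(secret_name: str, environment: str, namespace: str) -> str:
    """Build the path a workload uses to look up its secret in Vault."""
    return f"k8s/{environment}/{namespace}/{secret_name}"

# e.g. vault_secret_path("rate-limiting-redis", "gstg", "redis")
#   -> "k8s/gstg/redis/rate-limiting-redis"
# The secret still has to be written to that path in Vault by hand.
```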
C
So when we did the actual config change, we weren't super careful with staging, because, well, that's what staging is for, so we broke it and had to revert it when we made the configuration change in staging. First we broke it because we forgot about the secrets; we fixed that, and then we could access Redis from a console.
C
But
then,
when
we
proceeded
to
roll
out
to
everywhere
we
were
getting
500
and
that
was
because
the
clients,
so
the
rails
application
couldn't
access
the
the
redis
instance,
because
the
load
balancers
didn't
allow
them
to.
A
Sure. The way that we had set this up was that all of these are running load balancers within Kubernetes, and one of the issues that we ran into was that we have moved some of the things that we're using outside of the same region these Kubernetes clusters are in, specifically things like console, which has moved to a different region for DR purposes, and all of our Kubernetes load balancers were not set up to allow...
A
At
essentially
Global
access,
I'm
trying
to
actually
go
and
find
the
issue,
but
it
doesn't
really
matter
so
what
we
have
done
is
we
spent
some
time
trying
to
debug
exactly
there.
You
go
what
it
is
and
then
we
set
it
up
so
that
it
was
actually
Now
using
The
annotation
for
Global
access,
which
is.
A
Essentially
saying
that
these
we
can
access
these
load
balancers
from
anywhere
within
the
same
VPC
as
part
of
this
work,
we've
actually
made
that
a
default
across
reliability
as
well,
so
hopefully
nobody
else
loses
a
full
day
and
multiple
brain
cells
trying
to
figure
out
what
was
going
on
here.
So.
A
Yes,
it
is
it's
a
safe
default
and
it's
also
Now
the
default
for
redis
as
well.
We
we
made
that
the
default
as
part
of
rolling
this
out.
A
So
hopefully
we
have
saved
this
for
the
future,
but
yeah
it's
the
actual.
Annotation
is
internal
load.
Balancer
allow
Global
access.
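For reference, that annotation goes on the internal LoadBalancer Service. Sketched below as a plain dict with placeholder names; the annotation key is GKE's documented one, everything else is illustrative.

```python
# Sketch only: the Service name and namespace are placeholders.
service_metadata = {
    "name": "redis-ratelimiting",  # placeholder
    "namespace": "redis",          # placeholder
    "annotations": {
        # Lets clients anywhere in the same VPC (for example console nodes in
        # another region) reach the internal load balancer, instead of only
        # clients in the load balancer's own region.
        "networking.gke.io/internal-load-balancer-allow-global-access": "true",
    },
}
```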
C
Cool
so
there
we
had
it
running
in
staging
and
it
was
time
to
prepare
for
production.
We
prepared
a
Readiness
review,
that's
like
a
markdown
document
in
the
Readiness
review
project
and
we
we
could
reuse
a
lot
of
information
that
was
already
collected
for
the
the
wreckage
the
registry
cash
instance,
but
yeah
some
things
were
alone
our
own,
because
this
instance
already
had
quite
a
lot
of
traffic.
So
slight
differences
there
when
that
was
approved,
Stephanie
and
I
started
to
prepare
the
change.
C
So
the
change
issue
with
different
steps
to
roll
it
out
or
nigger's
recommendation.
We
step
back
a
little
bit
and
we
went
from
just
doing
Canary
everything
to
Canary
one
region
at
a
time
for
one
zone
at
a
time.
I,
don't
remember
one
zone
at
the
time.
It.
C
Yeah,
so
when
we
were
doing
that,
yes,
the.
C
So
during
the
rollout
we
did
Canary
and
then
let
it
sit
for
a
day
to
just
see
the
thing
with
traffic
the
connect
like
during
this
this
day,
we
would
effectively
effectively
have
a
split
plane,
split
brain
for
rate,
limiting,
because
traffic
going
to
True
Canary
would
not
count
towards
the
rate
limiters.
C
The
rate
limits,
that's
otherwise
counted
in
the
redness
instance
in
VMS,
but
we
decided
that
that
was
acceptable,
but
we
couldn't
have
that
for
more
than
a
day
with
more
traffic
than
just
Canary,
so
our
plan
for
rolling
out
was
doing
a
single
zone
first
and
then
immediately
proceeding
to
the
entire
the
entire
fleet.
C
That's when we decided: let's not do everything right now, let's do one extra zone before we do the whole thing, which is what we did. And then the three of us, Stephanie, Igor and myself, made the call that we would not proceed, because we were already close to 60% CPU utilization with two zones, and that's about as high as the current CPU utilization gets at peak time on the VMs. So we decided to stop there and investigate further what we do.
A
Yeah
and
just
another
quick
note,
the
difference
in
CPU
between
one
Zone
plus
Canary
and
two
zones
plus
Canary,
was
about
20.
So
we
were
running
at
like
40-ish
percent
with
one
zone
in
the
canary
and
then
when
we
went
to
two
zones
we
were
at
60.
The
logic
was
that
we
would
probably
be
close
to
80
with
three,
and
that
gave
us
very
little
Headroom,
especially
since
this
is
the
fastest
CPU
that
we
have,
that
gcp
has
and
redis
is
single
threaded.
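The back-of-the-envelope projection, assuming each additional zone adds roughly the same CPU increment as the previous one did:

```python
canary_plus_one_zone = 40   # percent CPU, observed
canary_plus_two_zones = 60  # percent CPU, observed

per_zone_increment = canary_plus_two_zones - canary_plus_one_zone    # ~20 points
projected_three_zones = canary_plus_two_zones + per_zone_increment   # ~80 percent
```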
C
Yeah
so
staying
there
would
mean
We've
effectively
moved
the
saturation
the
such
the
moment
of
saturation
closer,
and
we
don't
have
a
horizontal
scalable
horizontally
scalable
solution
for
that.
Yet
if
we
did,
then
this
would
have
been
just
a
case
of
adding
some
more
redis
and
then
we
could
continue.
But
that's
what
we're
going
to
be
working
on
next.
A
So
maybe
I
should
walk
through
briefly
the
CPU
increase
investigation,
which
is
long
so
after
we
got
through
this.
A
You
know
we
rolled
back
everything
in
production
and
left
us
with
this
running
and
staging
and
in
pre-production,
using
the
VMS
and
Matt
and
I
got
on
a
truly
epic
like
four
hour
long
Zoom,
where
we
walked
through
a
lot
of
flame
graphs
and
essentially
came
up
with
you
know.
Igor
during
the
actual
rollout
had
offered
a
thought
that
perhaps
this
was
related
to
redis
networking
and
things
of
that
nature.
A
Matt
and
I
fairly
conclusively
proved
that
that
was
the
case
in
that
someone
I
think
it
was
Philippe.
Actually
yep
also
confirmed
this
we're
using
Calico
in
our
kubernetes
and
because
of
its
hybrid
approach
to
control
traffic,
which
is
you
know,
iptables
and
a
bunch
of
other
things.
There
was
approximately
a
30
percent
increase
in
CPU
time
for
any
anything
that
was
yeah
like
25
to
30
during
incoming
packet
processing
and
packet
processing
as
a
whole.
This
explains,
the
you
know,
explains
the
general
overhead
of
CPU.
A
It's
also,
unfortunately,
something
that
often
comes
whenever
you
add
more
layers
of
abstraction
to
a
system.
They
often
don't
come
free.
In
this
case,
this
was
a
much
higher
hit
than
I
think
anyone
expected
and
the
reason
that
we
discovered
this
during
rate
limiting
and
not
during
some
of
our
previous
work.
Is
that
rate
limiting
is
very
connection.
Heavy
I,
don't
know
if
it
is
the.
C
Short
fast
calls
like
very
many
very
many
calls
that
yeah,
like
every
request,
makes
one.
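To make that traffic pattern concrete, here is a minimal fixed-window counter sketch; it is illustrative only, not the application's actual rate limiting code, and the host, key layout and limits are placeholders. It shows why rate limiting means one short Redis round trip on essentially every request, with counters that only live briefly.

```python
import time
import redis

# Placeholder host; the real instance is reached through the internal load balancer.
r = redis.Redis(host="redis-ratelimiting.example.internal", port=6379)

def allow(client_id: str, limit: int = 600, window_seconds: int = 60) -> bool:
    """Fixed-window rate limit check: one tiny Redis round trip per request."""
    window = int(time.time()) // window_seconds
    key = f"rl:{client_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)                    # count this request
    pipe.expire(key, window_seconds)  # counters only live for a short time
    count, _ = pipe.execute()
    return count <= limit
```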
A
Right
so
lots
and
lots
of
connections
which
meant
that
the
overhead
for
any,
like
this
connection
overhead,
was
hit
rate
limiting
worse
than
it
might
other
redis
instances
in
that.
If
we
were
doing
fewer
connections
that
were
slower,
it
would
not
have
been
quite
as
much
of
a
CPU
increase,
so
For
Better
or
For
Worse
picking.
This
was
a
good,
a
good
candidate
to
discover
this
exact
problem.
Good.
C
One thing that this also makes clear: we wouldn't be able to fit Redis cache or Redis persistent on Kubernetes just like that. We would...
A
I mean, essentially we're going to... We haven't actually completed this work yet, but you can see there that there's the revert back to using Redis rate limiting on VMs in staging and in Pre. We're going to be pivoting to work on how to do horizontal scaling, but I am, for future me and for anyone else who watches this...
C
I think that's also something we should call out: the rollout and failovers and stuff that we did wouldn't work easily on the VMs. We would need to do more fancy things with stopping Chef client, doing a change, starting Chef client on a single machine and then the other machine and then the other machine, while on Kubernetes that was just merging a merge request and then watching the graphs change, yep.
A
We can't get any... so, you know, as a whole it was... The other piece I think that is worth noting...
A
I actually think that worked in our favor, in that we discovered that we could make all these changes and no one noticed, but also we saw the highest traffic, and thus everyone was there watching it and being able to compare. Had we looked at a lower CPU time, it would have been harder to see the difference in the graphs. Not impossible, this is a pretty significant difference, but harder.
C
Yes,
we're
calling
out
that
we
we
needed
to
proceed
immediately.
So
the
fact
that
we
did
this
on
a
because
of
the
split
brain
issue
we
needed
to
proceed
immediately,
but
because
we
were
doing
this
on
a
on
a
high
traffic
moment.
The
problem
surfaced
when
we
were
working
on
it
as
Stephanie
mentioned.
If
we
hadn't
done
this,
but
had
done
this
like
like
in
the
downtime,
then
this
would
have
been
the
problem
of
the
on-call
I.
Think
because
I.
A
Yep. The only other piece that I think is again worth calling out is: I do think that Bob and I did a great job during this thing in handing over the work back and forth and being able to move faster, because there were two of us in two different time zones. So, like, Bob would do a bunch of work during his day, he would hand over things to me, I would update, I'd hand back to him, etc. And I...
C
Because of that, we were all sort of rolled in, right? Like, approving, from just the work getting the approvals before you came online, and then getting the approvals from the SRE on call, and so on. Everything was ready and we just needed to meet and then, yep, get it done, and then get it undone.
A
Definitely. Like, I think we've proved that we can run Redis in Kubernetes with this, just if we also had horizontal scaling. Yeah, cool. Any other last words before I stop this?