From YouTube: 2022-06-15 GitLab.com k8s migration EMEA/AMER
C: What ends up happening is that we end up triggering an alert, which is kind of the opposite effect.
C: I think our only way to accomplish, you know, figuring out what alerts we're going to run into is to practice this in staging. So I've already got an issue to, you know, perform the practice procedure. What I need to do is add the fact that we need to keep track of which alerts trigger, and figure out a good silence mechanism that is broad enough to cover a cluster being down, but not so broad that it ends up triggering the alert — and keep that for historical context.
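A minimal sketch of the kind of scoped silence being described, using Alertmanager's v2 silences API; the endpoint URL, the "cluster" label, and the label value are assumptions, not the team's actual tooling or label scheme.

    # Sketch: create a silence that covers one cluster being down, nothing broader.
    from datetime import datetime, timedelta, timezone
    import requests

    ALERTMANAGER = "https://alerts.example.com"  # hypothetical endpoint

    def silence_cluster(cluster: str, hours: int = 4) -> str:
        now = datetime.now(timezone.utc)
        payload = {
            # Match every alert carrying this cluster label and nothing else:
            # broad enough to cover the cluster being torn down, narrow enough
            # to leave the other clusters' alerting untouched.
            "matchers": [{"name": "cluster", "value": cluster, "isRegex": False}],
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(hours=hours)).isoformat(),
            "createdBy": "cluster-maintenance",
            "comment": f"Planned teardown/rebuild of {cluster}",
        }
        resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=payload, timeout=10)
        resp.raise_for_status()
        # Keep the ID so the silence can be expired early once the cluster is back.
        return resp.json()["silenceID"]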
C: No, we should not. In fact, the only alerts that I'm expecting to trigger are anything related to our monitoring stack that might be installed into a cluster, because those components will effectively either not be able to scrape, or they will not be reporting themselves upward to, like, Thanos, where all those metrics get gathered and stored.
C: When I went through trying to identify all the saturation levels — you know, we've done a pretty good job with our HPAs, such that they'll scale up normally, and I'll touch on this a little bit more in a second — but we should be okay with adding additional load.
A: Sorry, a question, John: did we have any issue in the past where something seemed to happen and we had, like, an overload of alerts?
C: Thank you. The other thing that I had identified was that our git HTTP node pool, which handles all our web-service git traffic, was dangerously close to saturation today — like, we were running at near 100% saturation during our peak workload times. Well, prior to today I had started working on this and I knocked out this issue, so the issue I filed to address this is now closed. Our saturation level was, I think, at 86% this morning; it's now down to 12%, so I'm super excited about that.
C: So with 12% saturation across all three clusters, we should be close to — if I'm doing the rough math in my head correctly — about 25% saturation across two clusters when one of the clusters gets taken out for that git HTTP pool.
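A rough check of that mental arithmetic — a sketch only, assuming the pool's load just redistributes evenly over the remaining clusters; the ~25% figure above leaves some extra headroom on top of this.

    # Rough redistribution estimate for a per-cluster node-pool saturation figure.
    # Ignores HPA behaviour and any uneven traffic split between zones.
    def saturation_after_losing_one(current_pct: float, clusters: int = 3) -> float:
        return current_pct * clusters / (clusters - 1)

    print(saturation_after_losing_one(12))  # 18.0 -> the ~25% quoted is a conservative round-up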
C: The way that our HAProxies are configured, we set the weight values on two clusters: the cluster that is inside of the same zone that that HAProxy node is in, and then Canary, which is a regional cluster. Our production clusters are normally set to a weight of 100, whereas Canary is set to a weight of five. This means roughly 4.76 percent of the traffic is going to land on the Canary cluster.
C: If we mark one of those clusters as in a maintenance period — or MAINT, in sort of a maintenance or drain state — Canary will start to take 100% of that traffic for any HAProxy node that's in that same zone. That's a very dangerous situation, and while, for the most part, we should be able to accept that amount of traffic in Canary, I don't want to risk it, because Canary serves a very specific purpose.
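A tiny sketch of the weight arithmetic being described, using the values quoted above; the second case is what makes the MAINT scenario dangerous, since the remaining weighted backend (Canary) absorbs everything for that zone's HAProxies.

    # Share of traffic each backend receives under HAProxy-style weighting.
    def shares(weights: dict) -> dict:
        total = sum(weights.values())
        return {name: 100.0 * w / total for name, w in weights.items()}

    print(shares({"zonal-cluster": 100, "canary": 5}))  # canary ~4.76%
    print(shares({"zonal-cluster": 0, "canary": 5}))    # zonal backend in MAINT: canary takes 100%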
C: That just doesn't make sense — it's introducing a very awkward consistency situation that is going to make for an interesting headache that I don't want to have. So I've got some ideas; I'll start sharing my screen again, because I created an issue. I just haven't really fleshed everything out just yet.
C: I want to introduce a set of tooling improvements. This is just kind of a proposal. One of these is that we have a set-server-state script that allows us to set the state of backends in HAProxy. This was used more when we were not in Kubernetes, but the tool still exists.
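For reference, the underlying mechanism such a script drives is HAProxy's runtime API "set server ... state" command. A minimal, hypothetical sketch — the socket path and the backend/server names are placeholders, not the real fleet's configuration.

    import socket

    def set_server_state(sock_path: str, backend: str, server: str, state: str) -> str:
        # state is one of the HAProxy runtime-API server states: ready, drain, maint.
        assert state in {"ready", "drain", "maint"}
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect(sock_path)
            s.sendall(f"set server {backend}/{server} state {state}\n".encode())
            return s.recv(4096).decode()

    # Example (placeholder names):
    # set_server_state("/run/haproxy/admin.sock", "web", "gke-us-east1-b", "maint")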
C: One would be to try to shut down backends in a specific zone. So instead of asking HAProxy to shut down cluster C across all HAProxies, target only the HAProxies that live in zone C and shut down the cluster for that zone.
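A rough sketch of what that zone-scoped variant could look like, reusing the set_server_state helper sketched above; the node inventory, zone names, and the assumption that each HAProxy's admin socket is reachable (in practice this would go over SSH or a stats TCP socket) are all placeholders.

    # Hypothetical inventory: HAProxy frontends and the zone each lives in.
    HAPROXY_NODES = {
        "fe-01": {"zone": "us-east1-b", "sock": "/run/haproxy/admin.sock"},
        "fe-02": {"zone": "us-east1-c", "sock": "/run/haproxy/admin.sock"},
    }

    def drain_cluster_in_zone(zone: str, backend: str, server: str) -> None:
        # Only touch the HAProxy nodes in the affected zone, so the other zones
        # keep serving their own zonal cluster and Canary stays at its tiny share.
        for name, node in HAPROXY_NODES.items():
            if node["zone"] != zone:
                continue
            set_server_state(node["sock"], backend, server, "maint")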
C
What
that
should
enable
us
to
do
is
preventing
the
situation
where
I
guess
I'm
going
out
of
order,
but
that
should
prevent
a
situation
where
all
the
traffic
ends
up
in
canary,
because
we've
shut
down
the
traffic
for
just
that
zone.
C
The
other
improvement
I
would
like
to
try
to
introduce
is
to
make
these
changes
slowly
right
now.
This
script
operates
against
pretty
much
all
the
fe
nodes
all
at
the
same
time,
and
that's
going
to
be
kind
of
dangerous
because
with
one
third
of
all
traffic
or
you
know
in
certain
like
web
requests,
we're
getting
93
million
requests
per
15
minutes.
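A sketch of the "go slowly" idea — applying the state change one frontend at a time with a pause in between, rather than hitting every node at once. It reuses the helpers sketched above; the batch order, pause duration, and drain-then-maint choice are made-up values, not the proposed tooling.

    import time

    def staged_drain(nodes: list, backend: str, server: str, pause_s: int = 60) -> None:
        for node in nodes:
            set_server_state(node["sock"], backend, server, "drain")
            # Give HAProxy time to bleed connections off this frontend before
            # moving on, instead of yanking a third of production traffic at once.
            time.sleep(pause_s)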
C: Okay. Right now — if I recall correctly, and this again is something I need to research — HAProxy is stood up in a way where it's checking the status of our servers, and if a server is in an up state, auto-deploy is happy: the prepare job is happy and allows the deploy to continue. If a server is in a down state, auto-deploy would be like, hey, something's wrong, look at this, I'm going to fail — and that way the deploy does not continue. If we put them in a maintenance state, if I remember correctly, the prepare job is okay with that, but that's something I need to recheck, because I can't remember off the top of my head what that looks like.
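A hedged sketch of the decision logic being described — this is not the actual delivery tooling, just the behaviour recalled above: UP lets the deploy proceed, DOWN fails it, and MAINT is treated as an intentional skip.

    def prepare_check(state: str) -> str:
        # state as HAProxy reports it, e.g. via the runtime API's "show servers state".
        if state == "UP":
            return "proceed"   # healthy, deploy continues
        if state == "MAINT":
            return "skip"      # deliberate maintenance, don't fail the pipeline
        return "fail"          # DOWN or anything unexpected: stop the deploy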
C
Okay,
so
alongside
that,
the
other
thing
I
want
to
make
sure
is
that
the
low
balancing
works
okay
in
that
situation,
because
of
our
weighting
and
our
backup
configurations
instead
of
aj
proxy
meaning,
because
we
have
say
I'm
operating
in
zone
c
again,
zones
b
and
d
will
be
marked
as
backup
servers,
backup
endpoints.
So
they
don't
see
traffic
for
the
most
part
ever
from
a
zone.
Ch
epoxy.
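A small illustration of the backend layout just described, from the point of view of one zone-C HAProxy; the names are placeholders, and the point is only that the local zonal cluster and Canary carry weight while the other zones sit as backups.

    # Conceptual view of a single zone-C HAProxy's backend for web/git traffic.
    ZONE_C_BACKEND = {
        "gke-zone-c": {"weight": 100, "backup": False},  # local zonal cluster
        "gke-canary": {"weight": 5,   "backup": False},  # regional canary (~4.76% of traffic)
        "gke-zone-b": {"weight": 100, "backup": True},   # only used if the non-backup servers are gone
        "gke-zone-d": {"weight": 100, "backup": True},
    }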
C
I
want
to
make
sure
that
low
balancing
still
operates
as
we
expect
it
to
when
we
start
tearing
things
down
for
a
given
cluster,
given
the
state
that
they're
in-
and
this
is
just
a
quirk
inside
of
h-
a
proxy-
I
want
to
validate
that
we're
not
going
to
see
anything.
That's
surprising!
C: Outside of these tooling improvements that I'm proposing, the next step in regards to this particular epic is actually starting the testing in staging: tear down the cluster in staging, rebuild it, and see where we are.
C: I think the only other things I have listed under this epic are the tooling updates I just proposed, performing the testing in staging, and then some future items — like, you know, trying to figure out what to do with the networking space in Kubernetes overall, which I don't feel like we need to address now. It's going to be something we need to do when we're ready to execute in production, so that we can rebuild clusters as necessary.
C: I think that's a precursor to actually formalizing what that looks like, and then also creating saturation alerts for the networking space, which I don't think we have. Tamland is doing that, but we don't have alerting for it — and maybe Tamland is sufficient, I don't know, because those are something that gets evaluated over time and we get, like, a...
C: Because, if I recall correctly, it runs on a Tuesday. But just to showcase the work so far — this was prior to the git nodes, yeah. You know, this is back from last week, but we were on an upward trend, and if I recall we were going to hit 70% saturation by July — and I didn't mark it here, but we were on target for being fully saturated come August/September-ish. But now we've done all the node pools except the one I just fixed today.
C: Okay, so there is no forecast of hitting either 100 or 80 percent, which is fantastic, and you see where the dots are now — down to, like, 40-ish, upwards of 50, during the load time. So that's fantastic, and this will come down even further next week, when Tamland runs, now that the git node pool work is done.
A: Very good, good result. Yes, thank you, John. Do you think we can have a little demo next week to see what our saturation level is after Tamland runs? Yeah.
C: Good question. So here's my thought on that, and here's what I plan to do to address this with the HAProxy tooling improvements. My goal with this is that Canary is still usable and auto-deploy is smart enough to skip deploying to a cluster that might be undergoing maintenance — which is a separate issue that I still need to create. That's a good reminder, because I need to figure out what to do with all the deploys during cluster outages.
D: Right — so we have the, it's called, like, the K8s allow-failure flag, that will... that will not attempt to deploy to that cluster.
D: Yeah, we talked about it quite a lot, because it was like: oh, it sounds largely bad — except for the case where you have a deliberate time like this, where you actually don't want to attempt to deploy to a known-down cluster. So we do have a way of skipping it. But what I was going to ask as a follow-up is: assuming we have a way to keep deployments running...
C: Yep. So part of my procedure, when we do this in staging, is going to add a step that indicates how to perform a — I guess, a true-up after the fact. So, you know, auto-deploy does this automatically, but if a cluster gets skipped, or if we rebuild the cluster, it's not going to have GitLab deployed to it at all, or it's going to have its own version.
C
We
have
the
ability
to
do
this
locally.
Our
tooling
operates
or
can
talk
to
our
clusters
outside
of
ci,
so
the
esta,
the
steps
would
be
you
know,
do
whatever
is
necessary
to
rebuild
the
cluster
after
it's
up
healthy,
ready,
ready
to
go,
perform
a
step
which
is
just
running
the
k,
control
command
like
we
do
in.
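A hedged sketch of what such a local "true-up" step could look like — re-applying the same release to just the rebuilt cluster from an operator's machine. The release name, chart, context, and values file are placeholders, not the actual GitLab.com delivery pipeline.

    import subprocess

    def true_up(kube_context: str, version: str) -> None:
        # Point helm at only the rebuilt cluster and re-apply the release the other
        # clusters are already running, so all clusters end up on the same version.
        subprocess.run(
            ["helm", "upgrade", "--install", "gitlab", "gitlab/gitlab",
             "--kube-context", kube_context,
             "--version", version,
             "-f", "values.yaml"],
            check=True,
        )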
B: Thank you. So the proxies on production are already out the door, and we are not accepting any traffic yet. I was checking the change request to accept traffic today, and there is a point we can basically discuss at this point. It says Friday, like...
B: I don't know, like, Monday? Because I'm thinking, actually, if we are going to execute this, to execute it before my duty as a release manager, which is coming after next week.
C: Effectively we're turning the cache invalid — or the cache mechanism is being revoked — because the infrastructure is currently in virtual machines.
B: We are basically going to update one version. It's currently 6 in one of the columns in the database — the markdown one, I don't know exactly which column it is — but from 6 it's going to be updated to 7, which is going to invalidate all of the markdown rows in the database. So I think this is why it's potentially going to have some impact.
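For context, this is the usual cached-markdown pattern: rendered HTML is stored alongside the version it was rendered with, and bumping the version makes every stored row a cache miss that has to be re-rendered. A conceptual sketch only — the column names and renderer here are assumptions, not GitLab's actual implementation.

    CACHE_VERSION = 7  # bumped from 6 by the change being discussed

    def render_markdown(text: str) -> str:
        # Stand-in for the real (expensive) markdown renderer.
        return "<p>" + text + "</p>"

    def rendered_html(row: dict) -> str:
        # Use the stored HTML only if it was rendered with the current version;
        # otherwise re-render, which is where the post-bump load comes from.
        if row.get("cached_markdown_version") == CACHE_VERSION:
            return row["description_html"]
        row["description_html"] = render_markdown(row["description"])
        row["cached_markdown_version"] = CACHE_VERSION
        return row["description_html"]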
D: Don't do it on Monday, because I think we're doing this database stuff at the weekend, and we will give someone a heart attack if we throw up a database problem on a Monday morning. But perhaps, like...
D: You think you're going ahead? Oh, good. Okay, I'll take next week then, but yeah... I guess the question is: if we have no way of testing this to know how long the impact is, we're going to have to do a best guess. There's no rollback, so we'd have to just wait it out. I guess, in the event this is horrible and it's much worse than we expect, is there anything we'd be able to do?
D: Let's do it on a day you're around, just in case something comes up later on, or it drags out and you get caught up. So don't do Friday.
C: So let's go ahead and schedule this work for Friday — meaning, you know, this week — and I can try to accomplish it. In the meantime, leading up to this, I'll try to see if there's any way we could measure what that potential impact will be. I'll do some research and see if I can't figure anything out, and if I do discover something, I'll either delay, or reach out and try to meet up so that everyone understands what that is.
D: If we don't manage to go on Friday, I would suggest maybe next Thursday could be another opportunity in your time; I thought Graham should be around with the post-release, so that might be another good day. Sure.
D: ...an approval, but, like, we'll approve that, right? But I think, in terms of giving people a suitable heads-up, it makes sense.
A: One thing also: I'm going to update some things a bit related to the template for the change request, since C1 and C2 changes are reported in the GitLab production calendar, so I'm going to add an extra check there. I'm going to submit an MR in a few minutes.
B: Thank you, Skarbek, actually, for taking this — either you or me are gonna execute it, so fingers crossed.
D: What I'm actually wondering is if we even need to do something with the denylist, because it's never been used, and I wonder if there is any SRE on call who knows it exists — because if they don't know it exists, they're never going to use it, in which case it probably doesn't matter if it works. So it might be worth actually checking in with, you know, Jarv, or, you know, reliability, basically. And actually, is this something...
D: ...for you, Skarbek: I've added it on — I think — issue 2253, where we talked quite extensively about this flag and whether we pull it out or keep it. But the reason we kept it is almost exactly this scenario you're going to be in: I possibly want to keep deploying, but I know I don't have the full set of clusters.
C: Yeah, so this flag is for the entire pipeline, specific to all of the GitLab.com deployment, and I really don't want to impact that, because I want to make sure that if a failure does occur in another cluster — like maybe it's overloaded, and so a deploy is struggling on that cluster — I want to be well aware of that, and if we hide this with this flag, we're preventing the release manager from discovering some sort of failure.
C: But I'll get an issue created — because I should have created one by now — to figure out what to do with that, and I'll link to this, as this is certainly an option; but I think it's too broad to be a sane option, in my opinion.
C: Cool. Michaela, is there anything that we want to discuss related to gitlab-sshd? We're kind of on hold at the moment, so I didn't know if there's anything that we wanted to discuss in here at all.
A: I can give an update. So the rollout of the proxy functionality is on hold because of the blocker that was identified: the blocker was that we would introduce a regression in the case we go ahead and release that. So there is a new feature coming in; to avoid that, a subset of customers are gonna have this regression, and the change request that they pinged you about in the channel today is to fill the data. After the merge request is in production and the migrations are run to fill the data in this new attribute coming in, I guess we will have a new updated date to run the change request for production enablement of the proxy functionality. But so far that one is on hold. Yeah, I asked the team — the responsible team — also to...
A: I got a ping right now from Sean; as soon as I finish this call I'm gonna check the ping — I guess it's the script — and I'm gonna let you know.