From YouTube: 2022-05-04 GitLab.com k8s migration EMEA/AMER
A
Good morning, everyone, and welcome. May the fourth be with you. Igor has not yet hopped in, so why don't we skip Igor? Henry, you've got the first item on today's agenda.
B
Yeah, I have to admit I'm still catching up a little bit from the last three weeks. Vlad actually did some work on this and also paired with Ahmad.
B
So please chime in once you have more details than I can add here, but here is what has happened on Camo proxy. Basically, Vlad did a lot of work on creating a Helm chart and working on the GitLab files to create a release. It's still in a branch, but it was also experimentally deployed to pre already, so we have Camo proxy pods running there, and pre was configured to use the Camo proxy in Kubernetes. I think we even tested this and could see some images. But I just looked at it, and right now it doesn't seem to work, so I'm not sure of the current state; maybe we're actually still experimenting on this one.
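For orientation, a minimal sketch of what such a Camo proxy Deployment could look like; this is not the actual chart from the branch, and the names, namespace, and image are assumptions (go-camo is a common Camo implementation):

```yaml
# Hypothetical sketch only; not the chart from Vlad's branch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: camoproxy            # assumed name
  namespace: camoproxy       # assumed namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: camoproxy
  template:
    metadata:
      labels:
        app: camoproxy
    spec:
      containers:
        - name: camoproxy
          image: cactus/go-camo:latest       # assumed image
          args: ["--listen=0.0.0.0:8080"]
          env:
            - name: GOCAMO_HMAC              # key used to verify signed proxy URLs
              valueFrom:
                secretKeyRef:
                  name: camoproxy            # assumed Secret name
                  key: hmac-key
          ports:
            - containerPort: 8080
```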
B
The other thing that we started to look into yesterday is enabling logging. I think Vlad incorporated some changes yesterday to enable pubsubbeat and the things missing on pre to get log files through, but I guess this still needs some work; fluentd still has to be configured. We need to check on this.
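As a rough illustration of the fluentd side that still needs configuring, here is a hypothetical snippet; the output plugin (fluent-plugin-gcloud-pubsub-custom), project, and topic names are all assumptions about how pre would ship logs into Pub/Sub for pubsubbeat to pick up:

```yaml
# Hypothetical sketch; not the actual pre configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-output        # assumed name
  namespace: logging          # assumed namespace
data:
  output.conf: |
    # Ship matched logs to a Pub/Sub topic; pubsubbeat consumes the
    # topic on the other side and forwards the events onward.
    <match kubernetes.**>
      @type gcloud_pubsub     # assumed output plugin
      project gitlab-pre      # assumed GCP project
      topic pubsub-camoproxy-inf-pre   # assumed topic name
      <buffer>
        flush_interval 5s
      </buffer>
    </match>
```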
B
As for what I think the next steps here are: we should really try to merge this MR that we have for development, just to get a state out on pre, because it shouldn't break anything there; make sure that it works on pre as expected; look further into logging and monitoring; and then figure out how we can handle load balancing. That will be the next challenge when we want to migrate this in gstg and gprd, where we already have an existing Camo proxy setup with a load balancer that DNS points at to proxy traffic to. We need to find a way to connect from Kubernetes to this LB, maybe, and then transition over. That's the next thing to figure out here.
B
Right now we have a Google load balancer for that, for the VMs. For Kubernetes, I'm still not sure what the best way to go would be: should we introduce something in Kubernetes, and then how can we transition over, or should we just use the existing load balancer and make it connect into Kubernetes? I don't know; there's something to research still. Did I forget something?
D
No, I think you covered it. One thing: I think right now Vlad is experimenting with the external load balancer, so creating a load balancer in GKE, and this goes into Kubernetes as well, if I remember correctly. So I think he's also experimenting with this.
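The GKE-provisioned option being experimented with could look roughly like the following; the names and the reserved IP are assumptions, and how the DNS cutover from the existing VM load balancer would work is exactly the open question above:

```yaml
# Hypothetical sketch: let GKE provision an external load balancer
# for the Camo proxy pods via a Service.
apiVersion: v1
kind: Service
metadata:
  name: camoproxy              # assumed name
  namespace: camoproxy         # assumed namespace
spec:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10 # assumed reserved static IP; a stable IP eases a DNS cutover
  selector:
    app: camoproxy
  ports:
    - port: 443
      targetPort: 8080
```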
A
Excellent. I don't have any further questions. Anyone else have any further questions on that?
A
Cool. Henry, let's talk about your proposal here.
B
Oh yeah, the order changed, I see. I just saw that we have an issue, I think from Graham, about problems with our pod disruption budget right now. This is issue 31, which I linked here, and looking into that, I figured out that we have a very old issue for increasing the stabilizationWindowSeconds, which I also linked here in the description in the document.
B
The thing is, what we have right now is that our deployments are all scaling up and down a lot, right? Our HPA is looking at our load; in all cases we use average CPU load, I think, to decide if we need to scale up and down, and our traffic is, you know, very bursty sometimes and changes a lot over time. So we are constantly scaling pods up and down, very often, in most of our deployments. And the one problem here is with the pod disruption budget and how we are scaling: it seems that we don't have enough pod disruption budget left to, you know, shut down more pods if we want to.
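For reference, a pod disruption budget along these lines is what constrains how many pods can be taken down voluntarily; this is a minimal sketch with assumed names and values, not our actual manifest:

```yaml
# Hypothetical sketch: a PodDisruptionBudget for the webservice pods.
# When churn from HPA scaling has already consumed the budget, further
# voluntary disruptions (evictions, node drains) get blocked.
apiVersion: policy/v1            # GA as of Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: gitlab-webservice        # assumed name
spec:
  maxUnavailable: 1              # assumed value; this is the knob in question
  selector:
    matchLabels:
      app: webservice
```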
B
The thing with our webservice pods especially is that they take very long to, you know, be created, I think a minute or something like that, and it's a very resource-hungry process just to spin them up. If they are then taken down again after five minutes or so because we scaled down, then we spent a minute just scaling something up and five minutes later we scaled it back down. So it's a lot of resource waste.
B
When this issue about the stabilization window was created, there was only beta support in Kubernetes for it and we couldn't easily deploy it. But now, with Kubernetes version 1.21, which we are running, I think it should be straightforward to set the setting, and by setting it to a value which is longer than the default five minutes, I think we can prevent scaling down too early, and thus we should stay more stable over, you know, traffic spikes and avoid a lot of these issues that we see with pod scaling. That maybe could also help with the pod disruption budget issue that Graham was looking into.
B
So I guess it would make sense to set it to some value for our webservice pods which is much longer than five minutes and then see how we go from there. The thing that we need to figure out is, of course, cost: because of this slower scale-down we maybe use a little bit more resources and spend more, but I think by not scaling up and down as often, we maybe even save resources. So I think it's worth looking into that again.
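Concretely, the proposal maps to the HPA `behavior` field; a sketch, where the 20-minute window is an example value rather than a decided number and the names and targets are assumptions:

```yaml
# Sketch of the proposed change; values are illustrative only.
apiVersion: autoscaling/v2beta2   # `behavior` is available here on Kubernetes 1.21
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-webservice         # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-webservice
  minReplicas: 10                 # assumed
  maxReplicas: 100                # assumed
  metrics:
    - type: Resource
      resource:
        name: cpu                 # "average CPU load" from the discussion
        target:
          type: Utilization
          averageUtilization: 70  # assumed
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 1200  # the default is 300, i.e. five minutes
```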
F
I have a question, as I'm reading the linked issue, which is one year old, so I want to check something. It states that we see a lot of 500 errors during scaling events. Does that mean that we start routing requests before the pods are ready, or is this happening when we tear pods down, so that basically we are closing connections that are still serving traffic? Or both?
B
I also think we did a lot of improvements over the last months and years to fix scaling up and down, and to have long enough windows and readiness checks.
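Those two failure modes map onto standard mitigations like the following sketch; the container, endpoint, and timings are assumptions, not our actual chart values:

```yaml
# Hypothetical sketch of the two mitigations: don't route traffic to a
# pod until it is ready, and delay shutdown so connections can drain.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-webservice            # assumed name
spec:
  selector:
    matchLabels:
      app: webservice
  template:
    metadata:
      labels:
        app: webservice
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: webservice
          image: example/webservice:latest   # placeholder image
          readinessProbe:
            httpGet:
              path: /-/readiness     # GitLab's readiness endpoint
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                # Keep serving while the endpoint is being removed
                # from the load balancer, then exit gracefully.
                command: ["sleep", "15"]
```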
A
All right. So with that, Amy's going to spin up a potentially new issue to address Henry's idea here, but that's not going to stop what Graham is currently working on, though.
E
No, I was actually going to suggest that maybe, Henry, you want to take over this issue from Graham and figure out what the best next steps might be.
B
Okay, I will look into the issue from Graham.
C
So the mental model that I have for that is: we're spending a lot of time booting up pods, and that's the main thing that we would save on. By having less flappiness, we kind of amortize that cost and therefore potentially lower the overall resource utilization. Does that match what other people think?
C
I have the next one. So we're finally getting to make some more progress on the rollout of host names. We were waiting on an omnibus change that has now landed. We don't have anything to demo yet, I don't think, but we're looking to get that onto pre by the end of the week, and then, once it's working on pre, we should also have the procedures in place to do it on staging. There's some interesting stuff that we're still discovering, in particular around how the chef-client and reconfigure interact, like which of those does what. Skarbek made a really interesting discovery yesterday. For most of our fleet, we run a reconfigure on every chef-client run; on Redis we don't, in order to protect us from surprise Redis restarts.
C
That means, however, that we're pinning a very old version of the package. It also means that stuff that usually gets done regularly by reconfigures is now very stale on those boxes, and the issue we ran into in this case was actually deprecations, because those only get processed... like, the file that the package installer looks at only gets written by reconfigure. That means, if you try to upgrade from an old package to a new package, it'll use the old settings unless you run a reconfigure before trying to install the new package. So there's some weird dependency-ordering stuff there. Huge kudos to Skarbek for figuring that out. Hopefully that's the biggest hurdle on this particular rollout; we'll see how the rest goes. Probably some more dragons to be discovered.
C
So that's the host names side of things, slow and steady. The other one is process-exporter; this is on the observability side of things for Redis. The Helm chart does ship with a redis exporter, and that does all of the polling on the Redis instance itself. However, we also want to have per-thread CPU statistics, because we want to differentiate between the main thread, the I/O threads, and the background thread. Our saturation metric is on the main thread, and the redis exporter doesn't give us that information, so we need to add the process-exporter. Luckily, the chart does support sidecars, so hopefully we'll have that exporter in place soon.
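As an illustration, a process-exporter sidecar along these lines would expose per-thread series (the namedprocess_namegroup_thread_* metrics), distinguishing the redis-server main thread from the io_thd_* and bio_* background threads; the image tag, names, and how the chart wires in sidecars are assumptions:

```yaml
# Hypothetical sketch; assumed names throughout.
# Sidecar container for the Redis pods:
- name: process-exporter
  image: ncabatoff/process-exporter:0.7.10   # assumed tag
  args:
    - --procfs=/proc
    - --config.path=/config/process-exporter.yml
  ports:
    - containerPort: 9256
      name: metrics
  volumeMounts:
    - name: process-exporter-config
      mountPath: /config
---
# /config/process-exporter.yml: match the redis-server process; with
# per-thread recording enabled (the default), each thread name gets
# its own namedprocess_namegroup_thread_cpu_seconds_total series.
process_names:
  - name: "{{.Comm}}"
    comm:
      - redis-server
```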
A
All right. So, gitlab-sshd: I'm struggling to figure out where certain issues lie, so I've pinged Sean again. As a quick reminder, we rolled back again because we were having issues in canary, so gitlab-sshd is not taking any traffic in canary or production. It was identified that we had a lot of errors coming out of canary, a very generic "context canceled" error message; we don't know where that is spawning from at the moment. So I'm trying to figure out if we have an issue to address that, and I don't see one, so I'm asking Sean about that. The other item was related to metrics, where a metric was simply renamed and that was not reflected in our dashboard.
A
I thought I saw an issue for this in the past, but I struggled to find it, so again, I'm still trying to figure out what that is. The third one was just generic load and performance testing of some kind. You know, we've rolled this back multiple times now; I feel like we should have been able to detect some of these issues in staging prior to rolling them out to production. So I'm trying to get us into a state where we're testing this a little bit more thoroughly in some way, shape, or form. Igor has picked up that work, so I'm eager to see what those results are, but I'm kind of going to enforce that we block migrating gitlab-sshd into production until at least those three issues have been addressed to some extent.
A
So at the moment it's a waiting game for us. I'm eager to get this into production because it's kind of an exciting project and it would benefit both us as well as self-managed users. I'm not trying to rush it; I'm just trying to express that I am eager for this to get rolled in. So I don't have any questions; it's kind of a waiting game, and I'm not driving these improvements.
A
These are kind of on the GitLab Shell team, and they've been preoccupied with another issue between our last attempt and this week, so there hasn't really been much movement in the first place.
C
During the last rollout attempt, the communication with that team wasn't so proactive. Once it failed and we started talking to them, that was fine and they were very responsive, but involving them earlier on, and actually having them join a call so we roll it out together, is something I'd like to see.
A
Okay. I know you've got a couple of CRs; I think you've got two, one for canary and one for production. I'll go through the canary one and update it so that they get notified, and then also comment on that issue. That way, they're aware that we're going to be doing that. I think that's an excellent idea and I fully support it.
A
Okay, well, in that case, everyone enjoy the rest of your day. I look forward to seeing you all next week.