From YouTube: 2022-06-29 GitLab.com k8s migration (EMEA/AMER)
A: The reason for this is that it would enable me to not completely disable canary across all HAProxies. We still want canary to be usable; therefore autodeploy still has a mechanism for which it has a backend to test against. If we completely disable canary, we effectively block Team Delivery's auto-deploy capabilities, and since we don't know the full scope of cluster outages, I don't want to take that type of negative impact to another team.
A: So, in order to be friendly to release managers, I've added the ability where we could disable canary for a target zone. That way, autodeploy still has the test. And secondly, because we'll probably be doing this across all nearly 30 front-end HAProxies, we'll do this slowly. That way, when we start disabling actual zones that are taking the brunt of customer traffic, say cluster B, for example, becomes disabled, it happens gradually.
A: So I simply added a small sleep. So, just to showcase, I'm going to use staging; that way I'm not sitting here banging on production if we do a get-server-state. Let's say we want to look at the SSH backends specifically; hypothetically, everything should be up. Let me see if there's any deployments going on... okay, so there is a deploy going on.
A: But it's past the Kubernetes part, so that's good. So in this case, and Jenny, just in case you're not familiar with this tool, all this is doing is reaching out to our Chef server and gathering a list of all the FE nodes that we have. So in staging, currently we've got two FE nodes that are of type CI. We've got two, or excuse me, three FE nodes that are just our generic FE nodes that intake most of GitLab's traffic.
A: We do have a few others: there's one for Pages, well, a set of them for Pages and a set of them for Registry. They do not run the SSH backend service; they don't intake that traffic, therefore they don't have an SSH backend. We are inherently querying those backends, they're simply not showing it, because there's no data to show. I'm targeting staging and I'm looking specifically for SSH. This is just a blanket regex, so it's looking for all SSH backends.
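
As a rough sketch of what a get-server-state query can boil down to under the hood (the node names, socket path, and filtering here are assumptions for illustration, not the team's actual tooling):

    # Hypothetical: ask each front end's HAProxy admin socket for server state,
    # then filter with a blanket "ssh" regex, as described above.
    for node in fe-01-sv-gstg fe-02-sv-gstg fe-03-sv-gstg; do
      echo "== ${node} =="
      ssh "$node" 'echo "show servers state" | sudo socat stdio /run/haproxy/admin.sock' \
        | grep -E 'ssh'
    done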
A: So there's a convenient set-server-state. My imagination is that what we'll do inside of our procedures for mucking around, when it comes time to take down the clusters, is disable our canary backend explicitly.
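
As a sketch of that explicit disable step, assuming HAProxy's runtime API is what the tool ultimately drives (the backend and server names below are placeholders, not the real ones):

    # Hypothetical: put the canary backend's server into maintenance on one
    # HAProxy node, and bring it back once the cluster work is finished.
    echo "set server canary_web/web-cny-01-sv-gstg state maint" \
      | sudo socat stdio /run/haproxy/admin.sock

    echo "set server canary_web/web-cny-01-sv-gstg state ready" \
      | sudo socat stdio /run/haproxy/admin.sock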
A: But if we do another get-server-state, you know, we're going to query all of our servers. The UX of this particular tool is not great, and I'm not really sure how to make it any better, because we're just using basic bash commands: we're grabbing the information for all the HAProxy nodes, and we're doing a sort and a count. So we see that one FE node says that this canary backend is in maintenance, and we see that two of them are up.
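
The sort-and-count that produces that "one in maintenance, two up" summary is plain bash along these lines (a sketch: field 18 of HAProxy's "show stat" CSV is the status column, everything else here is a placeholder):

    for node in fe-01-sv-gstg fe-02-sv-gstg fe-03-sv-gstg; do
      ssh "$node" 'echo "show stat" | sudo socat stdio /run/haproxy/admin.sock'
    done \
      | awk -F, '$1 ~ /canary/ { print $18 }' \
      | sort | uniq -c
    # Expected shape of output:
    #   1 MAINT
    #   2 UP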
A: Based on what I specified here, where I specified zone B, I can gather that the HAProxy node located in zone B is the one that's in maintenance mode, but that's not explicit here, which is kind of disappointing. But in my procedures I could always add another validation step, just to make sure that people fully understand what the state actually is. We could also probably look at metrics to determine that, hey, these sets of FE nodes, all located in zone B, are no longer taking traffic as well.
A: So now that canary is disabled, the next thing I would probably do is go into here, and, let's say we're again targeting cluster B, we would just go in here and say: hey, you, set yourself to maintenance. And, because I know this is staging, but, like, let's say we're doing this in production, for example, we'll want to slowly roll across all nearly 30 FE nodes; but in staging we're only going to hit one node.
A: We'll say yes to this, and we even output a little message saying: hey, you desired the change to be rolled out slowly, so we modify our knife command to operate one node at a time. There's a concurrency flag on knife, the -C flag, or the --concurrency flag; we simply default that to one in this particular case. And the actual knife command that it sends to the HAProxy administrative socket has a sleep injected in front of that command, so for every proxy node we sleep first and then we set the actual flag.
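
Put together, the slow roll looks roughly like this knife invocation; -C/--concurrency is knife ssh's real flag, while the search query, sleep length, and backend/server names are assumptions:

    # Hypothetical: one HAProxy node at a time, sleeping before each state change.
    knife ssh -C 1 'roles:gstg-base-lb-fe' \
      'sleep 30; echo "set server canary_web/web-cny-01-sv-gstg state maint" | sudo socat stdio /run/haproxy/admin.sock'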
A: Not really, you know, this tool doesn't send any audit logs anywhere. It doesn't send a notification anywhere that it's being used, so not really. Normally this is used inside of change requests, but, you know, there's nothing preventing me from using it here locally. You know, I'm sitting here targeting staging with this, you know, without any sort of issue at all.
A: Locally, maybe, like, script this locally... there's a flag that I could set. This is taking forever. There's a flag that we could set on this tool that effectively is like a force flag, where it doesn't ask me to hit enter to continue, and I could create another bash script that wraps around this tool that ensures that we can just copy and paste a single command.
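
The wrapper being described could be as small as this; everything here, including the tool name, its force flag, and the zone list, is a placeholder sketch rather than the real script:

    #!/usr/bin/env bash
    # Hypothetical wrapper: roll the canary disable across zones with one
    # copy-pasteable command, skipping the interactive "hit enter" prompt.
    set -euo pipefail

    for zone in us-east1-b us-east1-c us-east1-d; do
      ./bin/set-server-state --force --zone "$zone" gstg maint canary
      sleep 300   # let traffic drain from each zone before touching the next
    done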
A: Anyways, I'll get these nodes back up and running the way I need to; canary is the last one.
A: But that's what I wanted to showcase. That's what I've been working on lately, so...
A: As far as next steps go, the last thing that I wanted to make sure that we accomplish is making sure we don't block all the deploys. Right now we've got a situation where, if a cluster is down, our deployer will see that and will bail, and this is intentional, because we don't want any backends to be down, for the most part.
A: So I need to make sure that we'll be okay, like setting some sort of flag that ensures that we could still proceed during a maintenance procedure such as that. We already have an environment variable set today, or that I can set today, that would do the same, but it's very global.
A: It would bypass the check for all of our clusters, but I don't want that to occur; I want that to occur just for the target cluster that we're operating on. I think that'll just be a safer method of doing that.
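
A per-cluster version of that bypass could be a list-valued variable the deployer's health gate consults; this is only a sketch of the idea, with the variable and helper names invented, and cluster_is_healthy standing in for whatever check the deployer actually runs:

    # Hypothetical: skip the "cluster must be healthy" gate only for named clusters.
    SKIP_HEALTH_CHECK_CLUSTERS="${SKIP_HEALTH_CHECK_CLUSTERS:-}"   # e.g. "gstg-us-east1-b"

    cluster_check_skipped() {
      local cluster="$1"
      [[ " ${SKIP_HEALTH_CHECK_CLUSTERS} " == *" ${cluster} "* ]]
    }

    if ! cluster_is_healthy "$CLUSTER" && ! cluster_check_skipped "$CLUSTER"; then
      echo "cluster ${CLUSTER} is unhealthy and not exempted; bailing" >&2
      exit 1
    fi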
A: Yeah, all right, so discussion items. I already talked about number two; I went out of order, my apologies. Item number one, Git SSH: I wanted to bring up a chart, but I failed to do that ahead of time, so my apologies. Leading up to the gitlab-sshd enablement, we used to have an HPA that was leveraged heavily, but our EOCs started getting paged often enough to be concerned.
A: We used to leverage the HPA of the gitlab-sshd, or GitLab Shell, daemon quite heavily. It would scale all the time, but because we were having issues and we couldn't find the root cause, we just told the HPA to not do its job: we effectively set the minimum replica count to a value that was exceedingly high. What that led to was the inability for us to leverage the HPA, and now we're running a lot more pods than we used to nowadays.
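
For context, pinning an HPA that way is just a matter of raising its floor, and restoring it is the reverse; the namespace, HPA name, and numbers below are guesses rather than the production values:

    # Inspect the current HPA state (names are assumptions):
    kubectl -n gitlab get hpa gitlab-gitlab-shell

    # What was effectively done: stop scale-down by raising minReplicas very high.
    kubectl -n gitlab patch hpa gitlab-gitlab-shell --type merge \
      -p '{"spec":{"minReplicas":100}}'

    # What re-enabling the HPA would look like: drop the floor back down.
    kubectl -n gitlab patch hpa gitlab-gitlab-shell --type merge \
      -p '{"spec":{"minReplicas":2}}'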
A: So we used to run roughly 40 pods, spiking upwards of 50 pods, right?
C: I was just looking at... so there was, I think, an investigation going on from Igor and someone else from Source Code about doing some performance comparison, which I think is completed.
A: All right, I'll get an issue fired up to address that. It's low priority at this point; we're just wasting a lot of money, in my opinion. So yeah, I think it's low priority, but it shouldn't be an issue that takes a lengthy amount of time; it'd be one of those quick wins, in my opinion.
A: It would just be time consuming because of the need to perform the research and then make the changes slowly, such that we don't induce an outage. We want to, you know, kind of introduce changes slowly, maybe test them on one particular cluster before we roll it out to all of them, for example. And then this is only impacting production: the change that we made that I described, where we set the minimum replica count, we did not do that on staging, we only did this on production, because production was the only environment that was yelling at our EOCs about stuff. So I'll fire up an issue, and I guess I'll link this.
A: ...about that. Otherwise, okay, so we already talked about number two, so Jenny, let's talk about what's going on with stateful sets, because this...
B: ...is silly, yeah. Oh, I mean, you two both have the context behind it, but yeah, let me just go over it. So yeah, I was working on this issue to basically add a label to a stateful set, and it's not going great. I had this merge request before, just for the pre cluster, just to test things out. You know, we're trying to add a stage: main label; that's the only new thing, the change being:
B: you know, if there's a stage label, please add the stage label, which is main. And then once we run that... so at first, you know, the pipeline that starts running from this fails, because the diff says: hey,
B: this stateful set cannot take any more labels; basically, updates to fields other than these are forbidden. So basically we thought, okay, we're going to delete the stateful set manually, merge the merge request... well, delete the stateful set, orphaning the pods in that step, and then merge the merge request, and then the new stateful set theoretically should pick up the orphaned pods and restart them with the new label.
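
That orphan-then-adopt plan maps onto kubectl's cascade behaviour; a sketch, with the namespace, StatefulSet name, and label selector assumed rather than taken from the recording:

    # Delete only the StatefulSet object, leaving its pods running (orphaned):
    kubectl -n logging delete statefulset fluentd-archiver --cascade=orphan

    # After the relabelled StatefulSet is deployed by the merge request, it
    # should adopt any running pods whose labels match its selector:
    kubectl -n logging get statefulset fluentd-archiver
    kubectl -n logging get pods -l app=fluentd-archiver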
B: But then, when I actually went and did that step, we ran into an issue where it says the new STS couldn't pick up the new pods, because the operation against the pod could not be completed at the time, please try again... like, it's a very ambiguous error that happens. So then we manually went in and deleted them one by one; thankfully, in the pre cluster there were only two.
B: Now, what's even stranger is that, you know, when I took down the first pod, it actually didn't say ready one out of two when the new pod came up from the new stateful set; it said zero out of two, so that, like, made me worry again. But then we said, let's just go ahead and delete the second old pod, and then another one came up, and that's when it turned two out of two, which just might be a visual glitch. I'm not sure why that stateful set behaved that way, but yeah.
B: So basically that's the premise of what happened. I wanted to talk about... like, that's the downside, and because of that, if we're going to do this in the prod cluster, we're going to be doing this to 50 pods, only bringing them down one at a time, which takes around five minutes per pod. That made my change request have 260 minutes for the time to completion (roughly 50 pods at five minutes each), which is less than ideal, right? So yeah, that's why I wanted this discussion. Now, overnight,
B: Graham had suggested... well, he noticed two things, right. He says that the pre cluster is the only one on K8s 1.22, and that might be the reason why the stateful set is acting that way. But then I thought, you know, 1.22 is the latest one on the very stable, like, release line, which kind of doesn't make sense, like why a stateful set would behave so weirdly on a very stable release; but I can do more research into that. And then also that the fluentd archiver can take downtime.
B: So if it's going to take 260 minutes, then we should just delete the STS without orphaning the pods and just do a full redeploy, taking downtime. So yeah: thoughts, opinions?
A: I think Graham pointing out that the pre cluster is slightly newer is just pointing out the fact that maybe there's a change brought into 1.22 that may differ in the way stateful sets behave versus 1.21, which is what we run everywhere else.
A: So one thing I'm kind of curious about, because I know stateful sets like to work in a backwards format: you know, it's always building the first pod, and if there's a change that needs to be introduced, it's going to make that change to the latest pod. So, like, pre, for example, was running two pods, so you have pod zero and pod one.
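
Concretely (names assumed), those two pre pods are the fixed ordinals of the set, which is what makes the per-pod reasoning here possible:

    # StatefulSet pods keep stable ordinal names, so with replicas=2 you see:
    kubectl -n logging get pods -l app=fluentd-archiver
    # NAME                  READY   STATUS
    # fluentd-archiver-0    1/1     Running
    # fluentd-archiver-1    1/1     Running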
A: Not... because the error we got is very generic and doesn't lead us towards any sort of troubleshooting direction, I suspect there was probably a conflict with the name, and Kubernetes didn't know what to do because a pod was already named that. So I suspect that deleting the pods overall is going to be what's required, but it's just a curiosity that's, you know, itching my brain. So I don't know whether or not you want to test that out... test that theory out on ops; I'll leave that to you.
A: Otherwise, I think with the fluentd archiver, you know, fluentd uses that position file, so it knows where it stopped when it was last running. So, like Graham said, we can take downtime; it's just a matter of making sure we limit that downtime as much as possible, so we don't end up making fluentd suffer too much by taking too many resources as it catches up. We also don't want to drown wherever those logs go to, which I think, if I recall, is GCS.
A: So that's probably not an issue for us. So I see two options: if you want to try to figure out whether, you know, stateful sets behave differently on version 1.21, we could repeat the exact same experiment you did already, which I suspect is not going to change anything; or we could go about testing your new procedure, which is just simply deleting all of them and, you know, hitting that deploy button.
B: How about this, yeah: so in the ops cluster... so we have ops, staging, and prod left, and obviously we don't want to test on prod, but on the ops one, how about we test out...
B: Yeah, and then, if that experiment fails, I'm just going to delete all the pods, delete the new stateful set, and then I can redeploy with that pipeline, right, like run the deploy pipeline.
B: Okay, yeah, then I think I'll do that, and then, depending on what comes out of that, go forward with that on staging. I'm guessing it's going to be the whole... the, like, delete the whole thing and then redeploy.
B: But, you know, if we don't have to do that, it'll be ideal, just because 50 pods... it's a lot, yeah.
D: I know. I don't know why it takes around five minutes, but it is what it is, so yeah.