From YouTube: 2021-06-16 GitLab.com k8s migration EMEA
B
Luckily, so far we haven't created any degradation, but I'm actually kind of at a loss as to how to proceed, because I feel like I've made improvements but it's harder to actually see them visually. So I'll first go through what I changed so far, and then I'll show some charts as well.
So, firstly, this was cluster B that we're modifying, and it was already configured slightly differently than the rest of our clusters. The number one item is our HPA tuning, which was at a value of 2600 versus the...

B
I'm not showing anything yet, sorry, I just wanted to give a quick highlight. We also have a worker count difference: in this case it's configured to six versus the standard four on the rest of our clusters. So the first change I did was to modify...
B
We changed the HPA average CPU value. We tuned it closer to the number of workers that are running, so instead of 2600 (or was it 2300? I believe it was 2600) we bumped it up to 3000, with the intention that we would end up removing some pods from rotation, which did occur. We also saw Ruby thread contention bump up a little bit, which was expected, but the Apdex didn't change.
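As a rough illustration of the kind of change being described (resource names, API version, and replica bounds here are hypothetical, not the actual GitLab.com configuration), an HPA targeting average CPU per pod might look like this:

```yaml
# Illustrative only: raising the HPA's average CPU target from 2600m to 3000m
# lets each pod absorb more CPU before the autoscaler adds replicas, so some
# pods drop out of rotation.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: webservice            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webservice          # hypothetical name
  minReplicas: 2              # example bounds
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: AverageValue
          averageValue: 3000m # was 2600m before this change
```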
B
We do have these awkward periodic drops that happen quite frequently, but we can see that cluster B (the red line; I should have put the legend here) is actually higher than the rest of the clusters, which is a good thing, so we're actually doing pretty well. Memory saturation was still high, though, and that's exactly what I addressed next: take our existing limit of seven gigabytes and bump it up to ten gigabytes, which should put us at around seventy percent usage.
B
In this case we went from around eighty-five percent and dropped all the way down to fifty-five percent RAM usage. So I thought that was great, we're making progress, but we haven't made any progress yet towards lowering the number of pods and nodes being utilized. Pods a little bit, but technically not enough.
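A minimal sketch of the memory limit bump, assuming a plain container resources block; the request value shown is a placeholder, and only the 7Gi to 10Gi limit change comes from the discussion:

```yaml
# Illustrative container resources: with usage that previously sat around 85%
# of a 7Gi limit, a 10Gi limit puts the same usage nearer 55-70% of the limit.
resources:
  requests:
    memory: 6Gi     # placeholder request value
  limits:
    memory: 10Gi    # was 7Gi before this change
```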
B
So the next thing I did was modify the HPA again. In this case I'm trying to squeeze a bit more workload onto the existing number of pods, so we changed the target average utilization from 3000 and bumped it up to 4000. This is currently equal to the amount of CPU that we request, which is still lower than the number of workers that we run. And with that, I noticed the CPU saturation for the nodes that were running more than, say, one pod...
B
So a lot of these nodes that end up running one pod will eventually get cleaned up, but it appears the cluster autoscaler is kind of sluggish at that point, so I'm kind of disappointed by that. The one thing I did want to highlight is that the Ruby thread contention jumped right back up to just shy of 80 percent, so I don't really want to modify that any further. But our pod counts are a lot lower.
B
So here, excuse me, pod counts: yeah, we were running around 37 pods and we're down to around 30 as of right now. But our nodes jumped when we had a deploy, and they've just been slowly trickling down since. The time between these two events is about 15 minutes, so I don't think we'll actually see the usefulness of this experiment, because it takes so long for the nodes to spool down. In an effort to see...
B
...if I can jump-start that process: I was watching our memory saturation and it had jumped up to 80 percent again, so I thought, let's try to get that down again. Because that's going to cause a rotation of all the pods due to the configuration change, maybe that'll re-compact things; that way we potentially end up with no pods running on some of our nodes, and then maybe the cluster autoscaler will get rid of them pretty quickly. That was the last thing I did, about 20 minutes ago, and that's where I'm at now.
C
I don't like this, really. Having the HPA target and the requests at the same value means that we will constantly have pods going over and below it, and especially going over isn't really nice. On one hand, because I plan to have saturation metrics on this, every time we're oversaturated it doesn't look nice in the graphs, right? And that value is an average.
C
It's not a maximum, right? So I think we should at least have the trigger slightly higher than the average, so that we fit most of it below it. Maybe, I think that would be better. I mean, if we sometimes have pods going over, we can't do much about it, I think, because there's always a lot of variance in CPU usage. But I think we should not have it equal. And to the other point, about forcing nodes to go down:
C
I think the way we can move forward here is to see if we can fit pods onto our nodes as well as we can, right, leaving just a little bit of allocatable resources left over. If we fit, let's say, two or three pods on each node, then I think more efficiency can only be gained by maybe having bigger pods that exactly match the nodes. Maybe.
C
But with the 4000m requested CPU, I think we can fit three pods on each node, right?
B
That's a perfectly legitimate item that I was considering, because right now we're not seeing any more than two pods per node anyway. At 4800m requests, which puts the HPA target average value right around 80 percent of the request, that would give us 2.6 pods per node, which, you know, we're never going to see. So I think that would be a fine change that I might do next.
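Sketching the proposed follow-up (only the quoted 4800m and the 4000m HPA target come from the meeting), the idea is to raise the CPU request so the HPA's average target sits comfortably below it:

```yaml
# Illustrative: with a 4800m CPU request and a 4000m HPA average target,
# the target is roughly 80-85% of the request, so pods scale out before
# their average usage reaches what they have reserved on the node.
resources:
  requests:
    cpu: 4800m      # proposed bump, up from 4000m
```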
C
Yeah, but if you say we don't see more than two pods on our nodes anyway, maybe that's just how Kubernetes behaves: when we scale up we fill nodes with pods, and then later traffic drops and we remove pods from nodes, but we don't remove nodes very fast, right? It takes a long time, so we'll run for a long time with nodes that aren't really filled up. And if...
D
I think they're also considering the fact that spinning up a node is an expensive operation, because you cannot utilize it until it's completely provisioned. So maybe it's a kind of debouncing of spikes: you may have a spike of traffic, you increase the number of machines, then it goes down, maybe only because you're not able to serve the load, and then it goes up again, and you've just destroyed the machine and need to recreate it from scratch.
B
Yeah, so the 15 minutes that I see is just from our metrics. I'd have to go and look at our logs, which I haven't done yet, to see if maybe it's evaluating more often and I just haven't seen the results of those changes. I'll be looking into that during the rest of this experiment as well. So yeah, that's where I stand currently with this interesting experiment. I'm just glad I got the change request approved, because it's not one that's exactly standard policy, so I'm happy about that.
D
Can I make a very naive observation based on my understanding of what's happening? It sounds to me that, basically, we have these boxes and we want to fit footballs inside them, and there's a lot of empty, unused space, and the problem is that we have just one size of ball instead of having...
D
So if you look at all the marketing around Docker and Kubernetes, it's very much designed for having different sizes and types of workloads, so that you can just put everything together and they kind of naturally squeeze together, filling the empty spots, right? So you get better optimization. But we have just one size, because it's all just one single application, so all the pods have more or less the same amount of resources, same size, same CPU request, and so kind of...
B
Everything is filled with air, so they're compressible, but in terms of being able to squeeze another football in, you can't, because you just can't shove another item into that box; you know it's full. The limits are where we would end up seeing the compression of those footballs.
B
I just need to go back through the history to figure out why we switched to C2s, because I know that was a project in and of itself.
A
There was something we previously migrated that was also really under-provisioned, and we had trouble tuning that down. I'm wondering what we ended up with on that one, like how close we managed to get it in the end, because it wouldn't have been like-for-like, right, because we knew we were under-provisioned.
B
Henry is doing the tuning currently, so I'm not actually sure. But also keep in mind that when we first migrated all these services over, we didn't really spend a lot of time afterwards; this is the first migration where we're spending a lot of time afterwards tuning these values after the fact. We haven't done this for any other workload, and that's what Henry is currently doing.
C
I can talk about what I found, at least for the tuning, not really demoing much there. The basic thing is that I was looking into observability and which metrics are missing in Kubernetes, and I noticed that for a lot of the saturation metrics we had before in the VM fleet we now need to do things differently. We can measure saturation against set limits in Kubernetes, like CPU limits and memory limits, which is what we've done until now.
C
But for certain deployments, like api for instance, we don't have CPU limits set, so we can't measure anything there. The other option is measuring saturation against the requests we use to reserve space on the nodes, which is also what determines how Kubernetes schedules pods and tries to find a place for them. And by building a saturation metric based on CPU requests, I found that for a lot of our deployments...
C
...we are totally oversaturated, and that's especially true for Sidekiq and GitLab Shell, and also a little bit for registry. So we need to adjust there, and I guess it's a lot of work, because for each single shard we need to adjust the settings, then see how it reacts, and play with that. I mean, it's all working fine right now; we didn't run into any big problems.
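A rough sketch of what a request-based CPU saturation recording rule could look like, assuming cAdvisor and kube-state-metrics are scraped; the rule name is made up and the exact series names vary by kube-state-metrics version, so treat this as the shape of the query rather than the real metrics-catalog definition:

```yaml
# Illustrative Prometheus recording rule: CPU usage as a fraction of the CPU
# *requests* (not limits), per pod, so deployments without CPU limits can
# still get a saturation signal.
groups:
  - name: request-based-saturation
    rules:
      - record: pod:cpu_request_saturation:ratio
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
```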
C
So by adjusting those, we will maybe see more usage of the nodes for those deployments, but we should be safe from running into a situation where we're out of resources just because we planned wrong up front and set the wrong requests in Kubernetes.
C
That's what I'm working on currently. Well, I'm not working on it directly, I'm creating issues for it. I also already made two MRs for registry and GitLab Shell, but for Sidekiq that would be, I don't know, ten MRs to work on.
C
So that's just an issue I created with some suggestions on how to work on it. They're all put into this epic I created for this, which is linked here. And this is currently blocking my saturation metrics work a little bit, because if we don't fix this first, then the saturation metric, as soon as we enable it, will cause alerts, and we would need to silence all of those and fix it later. So yeah, we need to see how we handle this.
C
If you find time to fix this, that would be great, as there's still enough work on other observability issues for Kubernetes; I can work on those and this can wait a little bit.
A
Does
this
overlap
with
the
taints
work,
the
other
epic?
We
have.
C
I don't think so. Having better observability on where we're saturated would certainly help, I guess, but it shouldn't have a lot to do with tainting. That's just about determining which nodes certain pods run on, right, which we don't do for some of the kinds of pods we have, like most of the logging and monitoring things, I think.
B
The overlap comes in how we schedule our workloads. For the most part we've been trying to segregate our workloads onto their own node pools, and we've got some workloads that shouldn't be running where they do; primarily, the monitoring stack is running on certain node pools where we don't want it. I'm pretty sure that's the overall goal of that epic, but I'd have to re-review it.
C
I mean, one thing that would be helpful for that is having a saturation metric for how many requests we are already using on a node, right? I'm not sure if we have those anywhere; I wanted to look into that next. If you can see which node already has how many requests allocated, then it's much easier to see whether we could fit in more pods, or where we're using nodes inefficiently. I think that would be a helpful metric, also for looking into how we taint things, but they're not closely related.
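In the same spirit, a sketch of the per-node allocation metric being described, again with illustrative rule and series names that depend on the kube-state-metrics version in use:

```yaml
# Illustrative rule: how much of each node's allocatable CPU is already
# promised away as pod requests. High values mean no room to schedule more
# pods; persistently low values across many nodes point at poor bin-packing.
groups:
  - name: node-request-allocation
    rules:
      - record: node:cpu_requests_allocated:ratio
        expr: |
          sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
          /
          sum by (node) (kube_node_status_allocatable{resource="cpu"})
```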
A
And right now, are you making these changes to registry and GitLab Shell and Sidekiq? Is that... or are you talking...
C
About the saturation: I created those MRs because they were easy and fast, but I didn't execute them, because skymac was working on Kubernetes right now. They can be executed any time we feel it fits.
C
No, because for GitLab Shell and registry that can be done, it's not too much work, but for Sidekiq it's a lot of work, because for each single shard you need to find the right values, and most of them are tuned the wrong way for memory and CPU. It's just work to be done, but it would take a while to get it all finished.
B
If we could configure the alert disablement inside of our configuration, I think that would be a wise choice, just until we get the work completed. It's not worth our time to create an alert if it's not going to be useful, and it's just more work for the SRE team to create the silence and re-establish that silence until we complete that work.
C
Look at them, and if you can, disable the alerting just for the three deployments, registry, GitLab Shell, and Sidekiq; then we can still see how we're doing on api and the other services, right? That was the idea.
A
Okay, yeah, let's make sure something's moving forward: either we push on and get the saturation metrics so they can be used, or, if we aren't in a place where we can do that, let's be clear about it, park this stuff, and move on to some of the other observability work. Graham is making good progress on the web migration, so let's focus on doing what we need to improve observability for the web migration, so that it goes a bit easier.
A
Cool, okay. Sorry, Scott, I also skipped over your point. So let's talk about your theory on the Apdex drops.
B
I was pairing with Henry earlier this morning and we watched the production deploy go out, and I've got a theory. We're using the default rolling strategy for deployment, and the default strategy gives you a maxSurge of 25 percent. What that means is that if you have 100 pods, it'll add at most 25 pods during the replacement period. We also have maxUnavailable set to 25 percent.
B
That way we could keep the number of pods taking traffic as close as possible to the same before and after a deploy, but, last time I checked, this requires a Helm chart update to add that capability.
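For reference, these are the rolling-update knobs being discussed; the first block is the Kubernetes default, and the commented override is one possible direction (surge only, never dip below the desired count), not necessarily what the chart change would expose:

```yaml
# Default RollingUpdate behaviour for a Deployment: up to 25% extra pods may
# be created and up to 25% may be unavailable at once, which can briefly
# shrink the set of pods taking traffic during a deploy.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 25%

# One possible override (illustrative): always surge first and keep the full
# replica count serving traffic throughout the rollout.
#   rollingUpdate:
#     maxSurge: 25%
#     maxUnavailable: 0
```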
B
This is just a theory at the moment, obviously. I think the only way to really prove it is to dig into our logs, which will be a little time-consuming, because we have to cross-check the Kubernetes events against the events of when a pod stops and starts taking traffic throughout a deployment cycle, which is not an easy task. I plan on creating an issue for this; I just haven't yet because I've been working on this change request.
B
Oh, this is one of those things where I think anyone could benefit, so if we put it in the Helm chart, leave it there. This should not be a difficult addition to our Helm chart; we would follow the same commonality.
A
Cool, yeah, it sounds like a really good one to test out. That sounds like it'd be a really good thing to try.
C
That's good. I mean, that already becomes an issue if we tune api first in canary, right; usually it then shows very big Apdex drops, whereas in production we don't see it. So we can't use the same values for tuning, or we need to choose different values for the maximum surge in canary.
B
The only last thing I wanted to comment on, and this will probably impact Henry the most, is that yesterday we introduced a new shard. It's simply called imports, and the only worker on it is the repository import worker.
B
The reason for this new shard is that it's been identified that we sometimes kill a pod while an import is happening. For imports that take a lengthy period of time, the import might get killed, get picked up by a new Sidekiq worker, and then some other scaling event occurs and kills that pod too, so the job ends up having to be requeued again and again and again.
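Very roughly, a dedicated shard like this is just an extra Sidekiq pod definition with its own queue selection. The fragment below follows the general shape of the GitLab Helm chart's Sidekiq values, but the shard list and the queue name are placeholders, not the real worker-to-queue mapping:

```yaml
# Illustrative values-file fragment: an `imports` shard that only processes
# the repository import worker's queue, so long-running imports are no longer
# churned together with the busier general-purpose shards.
gitlab:
  sidekiq:
    pods:
      - name: catchall               # existing general-purpose shard (example)
      - name: imports
        queues: repository_import    # placeholder queue name
```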
C
But during deployments it's the same with the pods being replaced, right? Oh...
A
Nice, thanks for dealing with that one, Scott. Do you know, sort of, how long do you think it'll be before we know if we've got enough resources for that, or whether we'll have to...
B
We're going to try to complete that evaluation today. It's not entirely easy, just due to the nature of the way that work works, and release management is kind of busy these days. But aside from that, I'm hoping to complete it today.
B
No, this one's going pretty smoothly; it's just a matter of getting the work done.
A
Cool, okay, sounds good. And Henry, I should ask: do you need any help with any of the stuff you're working on?
A
Cool, sounds good, great. Was there anything else anyone wanted to discuss or demo?
A
Nope? Okay, awesome. Just a quick update: Graham is working on putting together a setup for nginx that will allow api and web to both run. He's also going to check that the setup he comes up with will support Pages as well, so all of the things will just plug into it.
A
He's going to work on that over the next few days, and then hopefully right after 14.0 he'll have something he can put on pre, and we can actually see what it looks like, and he'll do a side-by-side comparison with the existing setup. So hopefully the web stuff will start to become visible in the next week or so.
A
Super. All right, thank you very much, everyone; it was really great to see everything. I hope you all have a good rest of your day. Take care.