From YouTube: 2021-06-09 GitLab.com k8s migration EMEA
Description
Demo of API Service resource tuning
B: Kilograms of radio stuff on my shoulders and a four-hour hike, yeah. I can handle it, right.
A: Whatever. So, I wanted to give a quick overview of where we have been with the tuning of the API service. I guess I'll start from the very beginning.
A: The number of API pods was excruciatingly high, considering the number of workers that are running, compared to what was running on our virtual machines. Since the migration we've eliminated the NGINX issue, so we're now running pretty much at our minimum, last time I looked, and we've also reduced our default node counts. API pod counts are still relatively high, but they're less than what's shown in this "before" state.
A: Our latest attempt at this was targeting our canary stage as well as just cluster B in us-east1. The goal here was to compact the number of pods running, so that we better match what our virtual machines were doing. We were running four; this gives me 16 processes, Pumas, and right now we were usually seeing at most two, or excuse me, two times four, which is eight, so usually around eight.
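As a rough sketch of the process arithmetic being discussed here (the four Puma workers per pod and the 16-process target come from the figures quoted above; the pod counts are only illustrative):

```python
# Rough sketch of the process-count arithmetic discussed above: 4 Puma
# workers per pod, and a target of 16 processes per node to match what the
# virtual machines were running. Pod counts below are illustrative.
PUMA_WORKERS_PER_POD = 4
TARGET_PROCESSES_PER_NODE = 16  # roughly what a VM was running

for pods_per_node in (2, 3, 4):
    processes = pods_per_node * PUMA_WORKERS_PER_POD
    print(f"{pods_per_node} pods/node x {PUMA_WORKERS_PER_POD} workers = "
          f"{processes} Puma processes (VM target: {TARGET_PROCESSES_PER_NODE})")
```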
A: Our latest test was trying to shove in four times four, so 16 processes, and we see in the latest change, earlier today, there is one node, and then there were upwards of four where that was primarily deployed. We saw five for a very short period of time. This chart is showing the number of nodes running more than three of the API pods on a given node, and that very rarely changed.
D: A little bit, at least; we saw this at nighttime at low traffic. We saw a big improvement, right? It's just that, yeah, as soon as we get higher traffic, things break down again.
A: Yeah, ignore big spikes like this; this is probably a deploy or something. But, you know, seven nodes were running more than four pods at some point in time, and that pales in comparison to how many nodes we're running. Okay, this is the week view; let's turn that down a little bit. I mean, we're still running 24 nodes, which is, you know, three to five nodes less than our existing clusters.
A: I think what this really boils down to is that the method by which we are trying to schedule resources is significantly different on Kubernetes versus virtual machines. I outlined in a thread down here kind of the same information I'm indicating here, but our saturation is right at 75 percent.
A: If we look at our... which one of these is memory usage? You know, if I could label my tabs, that'd be great. Memory use for Kubernetes: right now we're at most at 30 percent memory saturation for these nodes, which is horribly inefficient.
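To make that imbalance concrete, a minimal sketch using the two figures quoted above (about 75 percent CPU saturation and at most 30 percent memory saturation); the node shape is purely an assumption for illustration:

```python
# The 75% CPU / ~30% memory figures come from the discussion above; the
# node shape (16 vCPU / 64 GB, a compute-optimized machine) is an assumption
# used only to show how much memory sits idle when CPU is the bottleneck.
node_cpu_cores = 16
node_memory_gb = 64
cpu_saturation = 0.75
memory_saturation = 0.30

idle_memory_gb = node_memory_gb * (1 - memory_saturation)
print(f"CPU in use:    {node_cpu_cores * cpu_saturation:.1f} of {node_cpu_cores} cores")
print(f"Memory in use: {node_memory_gb * memory_saturation:.1f} of {node_memory_gb} GB "
      f"({idle_memory_gb:.0f} GB idle per node)")
```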
A: So I'm curious what opinions others may have, because we have other options, such as modifying the number of workers that we're running inside of our pods; right now it's four. But if we change our instance type, we could trade the amount of CPUs we have against memory. Right now we're running... let me close that tab.
A: Four came about because I didn't know how to perform the necessary evaluation to determine how many workers should be put into a pod, and the theory was that we could just, you know, set it to four and we could run four pods. That gives us the same number, right?
E: I mean, it's surprising that it's four and we're still at that thread contention number of, you know, 80 percent, so something doesn't add up there. I just had one other question: do we know the state of pods before, like, can we tell if a pod is pre-accepting-requests, like pre-readiness?
A: Yes, we should be able to, because this timeline is usually pretty sure.
E: One thing that would be kind of interesting is to know what percentage of the fleet at any stage is initializing, because that is very CPU intensive. Anyone who runs the JDK knows that, and, you know, if we've got a hyperactive autoscaler, a lot of that CPU is possibly going on that. Or at least it'd be interesting to gauge what percentage it is.
D: I had a look at how long it takes for a pod to start up at all, and it seems like it takes around 60 seconds from container start to the Ruby workers accepting traffic, and then they start right away with the same latency as later. So they are not really starting slow and then dropping latencies or something.
D: But if you have to scale up a node, then this gets an additional 140 seconds or something like that. So 60 seconds for container startup, and if you put it on a new node, then I think it again takes something like 100 seconds or so.
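A back-of-the-envelope budget for the timings just mentioned (roughly 60 seconds for a container to start accepting traffic, plus on the order of 100 to 140 extra seconds when a new node has to come up first); only the rough figures from the discussion are used:

```python
# Cold-start budget using the rough timings mentioned above.
CONTAINER_START_S = 60        # container start -> Ruby workers accepting traffic
NODE_SCALE_UP_S = (100, 140)  # extra wait when the pod lands on a brand-new node

print(f"Pod on an existing node: ~{CONTAINER_START_S} s until ready")
print(f"Pod waiting on a new node: ~{CONTAINER_START_S + min(NODE_SCALE_UP_S)}"
      f"-{CONTAINER_START_S + max(NODE_SCALE_UP_S)} s until ready")
```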
D: This is based on both: I looked into event logs, I put this in the issue somewhere, event logs showing that the container was starting up, and also, in one case, that a node first needed to be set up to be able to schedule the pod. And then I looked into Kibana, and then I saw, okay, now I see Ruby starting up and a lot of logs about it setting up, then workers becoming ready, and then I see the first request coming in and being handled.
B: Because I was investigating an incident earlier this week, and, I mean, maybe this was specific to what happened in that case, but we saw that basically every pod is out from the balancer for two minutes. So we get 502s from the readiness probe for about two minutes, which is twice the amount you are telling me.
B: I mean, it was very specific to that one, because there was a question about whether we had an outage or something like that, so we tried to... there were a lot of errors, but then we broke it down by pod name, or hostname, I don't remember which. The point is that it was clearly two minutes: everything was out of the readiness probe for two minutes.
D: Let me find what I have written down so I can show you the numbers, because I don't have them all in my head. But it'll take a moment to find in the issue, because it's very long.
D: With a pod just being created, it takes around 56 seconds to be accepting traffic, but in the case of a node needing to be scheduled, it takes over 166 seconds for a pod to become ready. And if we deploy, then we often see heavy node scaling, right, like adding five more nodes, stuff like that, and so we have, I don't know, maybe 15 pods or something like that waiting for nodes first, then being scheduled, and then starting up. So, I don't know, that's where they are at the beginning.
B: So basically that was what we were looking at, because we were looking at errors at the workers' level, I think. I'm trying to find the details, because maybe it's off topic for this, but the point is that we had clearly two minutes of failure on the readiness probe for each one. I'll find it for you, just because maybe it's completely unrelated, so, just...
D: Just a data point, yeah. But coming back to the performance optimizing, I think if we very often scale up and down, then we of course have issues, and that relates to the size of the pods that we have, right? They are very slow to start up anyway, so we don't have good elasticity to respond to spikes, so having more, smaller pods...
D: Maybe then, I don't know, we wouldn't start up nodes as often; maybe it would be nicer to just respond to finer spikes. But, on the other hand, having small pods is very inefficient, so I'm not sure how to best tune this here, but playing around with workers, and maybe choosing different node sizes as was suggested, maybe that could help here.
E: Can I just ask, do we know, for a pod, like the state diagram: once it goes to ready, does it ever go back? Can it ever go back to not ready and then back to ready, or is it just that it's ready and then shut down, kind of thing? Can it switch between ready and not ready?
A: We send readiness probes, and should a readiness probe fail, the pod will be removed from the service, so it shouldn't accept any new traffic but will still process the existing traffic. And the readiness probes for the web service go through Puma.
A: So if something is failing inside of Puma for any reason at all, like its inability to reach the database as an example, we'll start failing that health check. And, of course, if we start the shutdown procedure, we send a SIGTERM, I believe, and that forces us to return, I believe, a 503, which will fail the readiness probe and pull the pod out of rotation.
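A minimal sketch of that behaviour, not GitLab's actual handler: a readiness endpoint that answers 200 while healthy and 503 once the process has received SIGTERM, so the pod is pulled out of rotation while in-flight requests finish. The port is made up.

```python
# Minimal sketch (not GitLab's code) of readiness flipping to 503 after SIGTERM.
import signal
from http.server import BaseHTTPRequestHandler, HTTPServer

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # from now on, readiness reports 503

signal.signal(signal.SIGTERM, handle_sigterm)

class Readiness(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 while healthy, 503 once we are draining
        self.send_response(503 if shutting_down else 200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Readiness).serve_forever()
```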
E: So I'm just quickly looking at the metrics that we've got. I don't know if this is interesting to anyone or just me, but where's my screen here... So this is for the API service. It's not very nice, we can make this better, but it's effectively what percentage of the pods are not ready, as a percentage of all pods, just kind of answering that question from earlier, like how much of the time are we initializing, and you can see it's up.
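One way a long-term metric like that could be pulled out of Prometheus; the metric and label names below (kube-state-metrics' kube_pod_status_ready and kube_pod_info, a pod-name prefix of gitlab-webservice-api) and the Prometheus URL are assumptions, not the actual setup:

```python
# Sketch of the "fraction of API pods not ready" idea as a Prometheus query.
# Metric names, the pod-name prefix and the Prometheus URL are assumptions.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder
QUERY = (
    'sum(kube_pod_status_ready{condition="false", pod=~"gitlab-webservice-api.*"})'
    ' / count(kube_pod_info{pod=~"gitlab-webservice-api.*"})'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print(f"Fraction of API pods not ready: {float(result[0]['value'][1]):.1%}")
```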
E: I'm gonna see if we can start tracking this, so that we can get a long-term metric that we can start optimizing on. Obviously I'd make it prettier than this, but then we can know, like, you know, there are a few things we can improve: if we can improve the startup time, we could improve it; if we could improve that scaler, and, you know, those things. But the other thing that I just wanted to show is one way of looking at how long the startup time is.
E: Sorry, that's not it. This is what I want.
E: If we do this until 14:15, so can you see what the... 14:20, let's just say, and we do one hour or even half an hour... that time there is the Workhorse startup time according to the readiness, because this is the readiness label, but then from 14:07:55 to 14:09:55...
E: What it's trying to do is stop things like slow-loris attacks, where people kind of feed you slowly and then you saturate your entire fleet, and there could be some delays, but I don't think it's nearly as serious as what you have on the wild internet. But yeah, I'm surprised; I would like to check how that works. So, Alessio, are you, like, 100 percent certain that the readiness check proxies to the Workhorse, sorry, to the Rails readiness check, in Workhorse?
B: Running it in development, basically, it's just proxying the requests upstream and just doing something about formatting. So unless there's something here that is doing the magic, this is just going through.
A: I need to interrupt you, because health checking works slightly differently in Kubernetes. Let me share my screen, if you don't mind. Yeah, sure. So, down here, inside the gitlab-workhorse container, the liveness and readiness checks are executing a script called healthcheck, and inside that script all we're doing is a printf of a GET and sending it to localhost. So we're not even pinging it.
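A paraphrase in Python of the check being described, not the actual chart script: write a bare GET to localhost over a raw TCP socket and treat any answer as healthy. The port (8181, assumed here to be Workhorse's listen port) is a guess.

```python
# Paraphrase (not the actual script) of the "printf a GET to localhost" check.
import socket

def tcp_get_check(host: str = "127.0.0.1", port: int = 8181) -> bool:
    """Send a bare HTTP GET over a raw socket; any reply counts as healthy."""
    try:
        with socket.create_connection((host, port), timeout=2) as sock:
            sock.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
            return bool(sock.recv(1))
    except OSError:
        return False

print("healthy" if tcp_get_check() else "unhealthy")
```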
B: I think it's good, it's better than what we have. If you remember, Angie, you were making that merge request in Workhorse for implementing a proper health check, so that Workhorse can have its own checks plus relying on...
E: So why don't we just put that in? Why is that not just coded in? This is a surprising way to do it, right? Like, why doesn't the health check in Workhorse just respond in that way? As in, when Workhorse is ready, it says it's ready, rather than us going to, like, the root. Another thing that's weird about that is that, in normal cases, that's getting proxied through to the gitlab.com root, right?
E: Obviously it just throws an error, or whatever /dev/tcp does, because that's kind of surprising as well, and then eventually, once it starts getting going, every other call of that will be making a full Rails request to the root, because those will get proxied through to Rails and be generating traffic on, you know, the dashboard.
E: But yeah, I mean, maybe the thing to do is just to have Workhorse return "I'm good" pretty much all the time, except maybe... when you... I don't know if Workhorse has a drain mode, does Workhorse drain itself? Because then the only time it would stop being ready is when you, like, SIGKILL it, and then it's like, I'm no longer good.
E: Instead of the funny health check, we can just use the really boring HTTP health check.
E: I thought that health check did the whole check that everything's good, you know, the dangerous type of health check where it's like, okay, I can talk to the database and, like, everything behind me also says it's good. Which, obviously, if you use that kind of health check and you have a hiccup, then, you know, you bring down your application.
B: Workhorse is still upstreaming the requests.
A: I've got you, Andrew, I just linked the merge request that you were asking about a second ago. So we kind of drifted off topic a little bit, but I guess...
A: So, back to the API and experimenting with tuning: I don't think we should move forward with trying to shove more pods onto our current nodes, because we are at our saturation limits. So I guess I'm looking for an answer to two questions. One: should we look for a different instance type? The dangerous part about that is that we're currently using C2s.
A: We don't have the ability to customize those any further, so if we want more CPUs, we also get a lot more memory, which means we're just wasting a lot more money in that situation as well.
A: The other option is to figure out what further tuning we could do with the number of workers, but I don't know how to properly evaluate that in any way, shape or form. So I don't know what to look for, unless I make that change and observe the behavior, but I would love to have more concrete answers, and maybe a person to talk to, to figure out what the behavior might be. Yeah.
E: So we did a very, very similar thing with, you know, the Sidekiq workers, and then when we switched to Puma as well, and, you know, Jarv and myself and Camille were all kind of involved in that, and maybe one of us needs to kind of sit down on it, because a lot of it was just collecting data, you know, matching up the same requests for the same endpoints and then looking at what it looks like, and then just kind of brute-forcing the data.
E: You know, not aggregating across a large number of different requests, but actually choosing, you know, this merge request, maybe a busy merge request: when we have a tuning like this, on average it takes this long to serve that page, and when it's tuned like that, it takes, you know, 15 percent longer, and then you can compare.
D: I think we also maybe should take a step back again and think about what problems we're trying to solve here, because if we try to get our nodes, our node pools, saturated as well as we can, right, then we need to tune in a way that has the HPA try to hit the CPU target. Maybe that brings us closer to node pool CPU saturation, right, which is what we are aiming for, and this necessarily brings us to a state where we have less headroom for spikes, right?
D: So if we have a deployment or certain CPU spikes, we don't have enough headroom to catch them, because we are very slow at scaling up, because Puma just takes at least a minute to be ready, and node scaling takes even longer. So this will always be a problem; we will never be able to saturate our node fleet in a nice way.
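For reference, the trade-off being described follows from the documented HPA scaling rule, desiredReplicas = ceil(currentReplicas x currentMetric / targetMetric); the numbers below are illustrative only:

```python
# The documented Kubernetes HPA scaling rule, with illustrative numbers:
# a higher CPU target means fewer replicas and therefore less headroom.
import math

def desired_replicas(current_replicas: int, current_cpu_pct: float, target_cpu_pct: float) -> int:
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)

current = 30  # illustrative replica count at 75% average CPU utilisation
for target in (60, 70, 80):
    print(f"target {target}% CPU -> {desired_replicas(current, 75, target)} replicas "
          f"(from {current})")
```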
A: We probably waited too long to roll back that change, but we did roll it back. So we know what value of CPU, with the current value of four workers per pod, induces too many problems on that particular pod, drops our Apdex and increases our error ratios, I guess. So I think we know that portion; it's just a matter of what we do next, to figure out how to go backwards in order to move forwards.
D: If we really put four pods on an API node, then we get into bigger problems. The thing was that we still had a lot of headroom, several cores on each API node, to catch traffic spikes, right, because we are allowed to go over requests if needed, and we had this headroom. But this is really bad, because we really suffer from big spikes all the time. So, I don't know, that's really, really bad.
D: What I would like to look into is maybe tuning further with workers and threads, because having more workers and fewer threads probably helps with contention, and we still have a lot of memory headroom anyway on the nodes, so that shouldn't be a big issue.
A: What I'll do moving forward, because switching the fleet type is going to require a little bit more research, and I think we can move faster if we play with worker counts: what I'll do next is revert our testing, because that's still in place, I haven't reverted it yet, and I'll start playing with worker counts. I think I'll just move the worker count up one at a time. You know, we'll have an odd number.
E: You've got three regions, right? Is it hard to change it to different values in different, sorry, different zones? Very easy, very easy. So why don't we have, like, three, four, and five, right, and we leave it like that for at least 24 hours, and then we go through ELK and we find the most common queries, and then we look at how they break down between those zones. And then, you know, we keep going.
E: So just a warning: you don't want to use metrics for this at all, because in metrics you're using histograms, sorry, and the reason is that in the metrics we're using histograms, and the histograms are buckets from, say, 0.1 seconds to 0.5 seconds, and the next bucket is 0.5 seconds to a second, right, and you can't tell, there's no resolution inside of that. So it's kind of like looking through frosted lenses and trying to see what's happening on the other side.
E: Because you just have these very coarse buckets that you're using for latency, and then the bigger buckets, you go from one second to 10 seconds, it's, you know, really, really blurry. So the first thing is: don't use metrics, don't use histograms for this at all; do it in Kibana.
E: What you want to do is try and find a bunch of very common URLs that are kind of chunky, maybe CPU-bound things, like merge request controller widget stuff. You just basically rank by the most common paths, not controllers but actual paths, right, because what you want is, like, a project that is getting called a lot.
E: So the variance in the latency will be very low, because it's always looking at the same resource, right? And then what you can do is break that up in Kibana by zone, or I think it's a region label, so I'll just call it region, and then take a look at the, sorry, the p50, the p75, the p95 for those, and you'll see that there'll be differences between them, and then we can optimize on that.
E: Does that make sense? And the reason why Kibana doesn't suffer from that is because, when you say the 95th percentile in Kibana, it's literally taking buckets of 100 and choosing the top five, right? So it's accurate, whereas the histograms are very much an estimate in the metrics.
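A small worked example of that point; the request durations are made up, and the bucket boundaries are only an assumption about what a typical latency histogram looks like:

```python
# Coarse histogram buckets vs. exact percentiles from raw samples (made-up data).
import random

random.seed(1)
samples = [random.uniform(0.2, 3.0) for _ in range(1000)]  # fake request durations (s)

# Exact p95: sort the raw durations and take the value 95% of the way through.
exact_p95 = sorted(samples)[int(0.95 * len(samples)) - 1]

# Histogram view: cumulative buckets with assumed boundaries.
buckets = [0.1, 0.5, 1.0, 2.5, 10.0]
counts = [sum(1 for s in samples if s <= b) for b in buckets]
p95_bucket = next(b for b, c in zip(buckets, counts) if c >= 0.95 * len(samples))

print(f"exact p95 from raw samples: {exact_p95:.2f} s")
print(f"the histogram can only say: p95 is somewhere below the {p95_bucket} s boundary")
```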
A: Probably, since that's the last one we deployed to anyway. Like, on clusters B and C we could set the workers to differing numbers, like five and six respectively, and see what happens, and then we could observe our CPU and memory utilization. I'll have to generate a fancy chart in Kibana to make sure that we're not negatively impacting things, or rather, we have the ability to compare the latencies of various routes inside of our logs.
A: So maybe what I'll do is modify the workers first, because I don't know what that's going to do just yet, but seeing what that does, I'll be able to better understand what CPU utilization is going to do. Then I can modify the CPU request appropriately, and I could probably do some sort of fancy math to do the same thing for another bump; that way I could do the worker and the CPU bump at the same time for the other cluster.
D: And by adding more threads we shouldn't see as much of an increase in memory, I would say, but we definitely need to also raise the requests to be able to guarantee enough for our pods there, and that will again influence how many pods fit onto nodes. So we need to also fine-tune and think about how we squeeze pods onto nodes then, but I think with three pods on a node that should work in most cases, because then we still have a lot of headroom on CPU and memory.
A: Yeah, I think if we could figure out a way to get ourselves down to, like Jared mentioned last week, if we get ourselves within a certain amount of the similar usage we had with our VMs, I think we did a pretty good job.
E: What do you think, as a guess? Where do you think the right number of nodes would be, like one and a half?
A: I would love to be running somewhere within 20 percent of the original 36 nodes we had, because we were under-provisioned, right? So if we could, you know, bump up that amount and still have the autoscaling available to us, so we could turn down during the weekends and nighttime, that'd be great.
A: We never really tuned Sidekiq very well in any way, shape or form; we didn't really do any research, we didn't sit here and concentrate as much as we are today with the API. So I think whatever learnings we glean from the API service, we could take to Sidekiq and really do some cost savings there as well. It's just...
D: You have to remember that at nighttime, at low-traffic times, we can automatically reduce the number of nodes drastically, even much more than we do now. I mean, we already hit the limit that we set ourselves, at 30 nodes or something, but we could go much lower. So we would save costs at least at low-traffic times, while we spend more at high-traffic times, but maybe that evens out a little bit.
C: A final bit on this, kind of, where we want to get to: I think the other bit that would be useful, it doesn't have to be an exact resource match, but it'd be good just to understand why it's different. So we already know we were under-provisioned; if there's another good reason why these things are going to behave differently, great, like, if we know that, we can just log that, right, and come back to this stuff later.
A: I think it might be worth the time and effort to write an interesting blog post about what we did to get here, because this is also a topic that I see often in the Kubernetes space, like, how do I tune these requests and such, and the only thing you find on the Googles is how to do the tuning, not how to observe the behavior of your application to decide what to set, how to set it, and get the result that you're looking for. So that could be an excellent blog post.
A: That would be a good end goal to have: to consolidate that massive thing into a blog post of some kind.
C: So should we move on to the next item, for retro?
A: Oh yeah, so, retro: I'm just looking for any further feedback from anyone. If no one chimes in on this by the end of my day today, I'm going to close it. I've created a few issues related to the stuff that we found, and I've started to go through the web epic that, Amy, you have stated is now ready to go, and I've started to populate some of the issues with a little bit more information to make sure that we don't forget certain things.
A: So hopefully we're in an okay state. I think we are; it's just a matter of finishing up writing all the other issues that we need to create. Awesome, thanks for doing that. And of course, you already saw the language changes to both our readiness templates as well as our delivery team epic language, so that we hopefully do a better job with certain things as well.
C: Great, sounds good. Henry, was there anything you wanted to go through on observability, or Andrew?
D: So the issue is that right now we don't have CPU saturation metrics for services in Kubernetes which don't have CPU limits, because our current kube CPU saturation is based on the limit that we set, and for things like the API we didn't set limits.
D: So what I'm currently working on is adding another saturation metric for CPU which is based on requests, because this is helpful anyway: the request is what we estimate we need to guarantee to a container, and if we come close to the request with CPU usage, then I think we need to be aware of that. But we can potentially go over requests, which is not nice for a saturation metric.
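A sketch of that request-based saturation, assuming a CPU request of 2600m (a figure that comes up later in this discussion); the usage numbers are invented, and the only point is that the ratio can exceed 100 percent because pods may burst over their request:

```python
# Request-based CPU saturation: usage divided by the request, not by a limit.
def cpu_saturation_vs_request(usage_millicores: float, request_millicores: float) -> float:
    return usage_millicores / request_millicores

for usage in (1800, 2600, 3200):  # invented usage values
    ratio = cpu_saturation_vs_request(usage, request_millicores=2600)
    print(f"usage {usage}m against a 2600m request -> {ratio:.0%} saturation")
```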
E: I was going to say, oh, do we treat 100 percent as the request?
D: It might be wise at least not to alert on it, but to see that we've reached a point where we miscalculated how much we need to guarantee to certain services, right. Right, if...
A: So if we set our saturation level, if we look at that saturation level today, we're at roughly 80 percent, because that's what the HPA does, and it's never going to change, it's always going to sit that way. But technically we're well under-utilizing Sidekiq as a whole, and our nodes are going to react accordingly because of our HPA and how it scales. So I don't...
D: There are the Workhorse and web service containers in the API pod, right? So we tune the HPA for the web service container to reach, I don't know, 2600 milli-CPUs right now, and this will of course also change how much traffic gets sent to Workhorse. So on Workhorse we will see CPU go up or down, but we are not setting any target for Workhorse, right? So we just don't watch it, and if we don't watch it, then we don't see what is happening with Workhorse.
D: Say we change the CPU target for the web service; I would like to see some kind of metric that shows me, oh, I see Workhorse getting close to its requests unexpectedly because we fiddled with the web service container. That would be a helpful metric, but it's hard to express as a saturation, because we allow it, you know, to go over the limit there, which is not nice. Yeah, we could...
D: ...do it, but this is super hard, because you need to figure out how many web service pods are on this node and how much capacity we have left, and then calculate, out of this, for each of the pods, you know, if there are three pods, then they all share a third of this leftover capacity of the node. I don't know if you can calculate this and believe it.
E: But surely the ultimate thing that would happen in Prometheus is at the node level. So we can figure out what the total requests are, and then we can also say, look, the node is at, like, 95 percent of all CPU gone, and in total it should have been at, say, 16, so it's badly packed, it's at, like, 66 percent, and we know we're actually at 100 percent, then, yeah. Maybe we just need the node metrics; maybe the node metrics are the right thing to use there.
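A sketch of that node-level view; all numbers are illustrative (an assumed 16-core node and three web service pods with 2600m requests), the point being the gap between requested, used, and allocatable CPU:

```python
# Node-level packing view: requested vs. actually used vs. allocatable CPU.
node_allocatable_m = 16000           # assumed 16-core node, in millicores
pod_requests_m = [2600, 2600, 2600]  # three webservice pods (illustrative)
pod_usage_m = [3100, 2900, 3300]     # actual usage, bursting over the requests

requested = sum(pod_requests_m)
used = sum(pod_usage_m)
print(f"requested: {requested}m ({requested / node_allocatable_m:.0%} of the node)")
print(f"used:      {used}m ({used / node_allocatable_m:.0%} of the node)")
print(f"headroom left on the node: {node_allocatable_m - used}m")
```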
A: Yeah, there was a blog post that Cochi had found where it was discovered that with CPU limits it wasn't just a throttle: say 4000 millicores was your limit; if you ran over 4000, you weren't just limited to 4000, the entire process would greatly slow down to squeeze itself under that 4000 limit, and that had a lot of negative implications. I don't know if anything's changed related to that, but that's kind of why we're not setting CPU limits at the moment. Right, okay.
D: Rather, I'm trying to use requests as a limit and then make a saturation out of that, and even if we then sometimes go over it, we have a saturation which goes to 120 percent, which makes the graph look a little bit strange, but you still get some value out of that, right?
E: Well, actually, the way that saturation metrics are designed is that they always get wrapped in a clamp_max of one, so no matter where it goes to, it'll always flat-line at that point, just because otherwise it messes up all the graphs and all the downstream data; it's designed to always be between zero and one. So if it's at a hundred and twenty percent, it shows as a hundred percent.
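The clamping behaviour described there, shown in a couple of lines of Python for illustration:

```python
# A saturation reading is capped at 1.0, so a computed 120% flat-lines at 100%.
def clamp_max(value: float, ceiling: float = 1.0) -> float:
    return min(value, ceiling)

for raw in (0.6, 0.95, 1.2):
    print(f"raw {raw:.0%} -> reported {clamp_max(raw):.0%}")
```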
E: The other thing you could try with that, Henry, is, if you do have little spikes that spike up, instead of using, like, a rate over five minutes, you use a rate over, like, an hour or something like that, like we do with this; there's one Sidekiq saturation metric that we do that with, and so it kind of climbs up very slowly. But if it's kind of pinned there all the time, then you get that; if you just have brief spikes, then they just kind of round out.
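A quick illustration of why the longer window helps; the series is made-up data with one brief spike:

```python
# Same spiky series averaged over a short vs. a long window: brief spikes
# mostly disappear in the long window, while a value pinned high would not.
series = [0.3] * 50 + [1.2] * 3 + [0.3] * 50  # made-up data, one brief spike

def windowed_avg(data, window):
    return [sum(data[max(0, i - window + 1): i + 1]) / min(window, i + 1)
            for i in range(len(data))]

print(f"peak with a short (5-sample) window: {max(windowed_avg(series, 5)):.2f}")
print(f"peak with a long (60-sample) window: {max(windowed_avg(series, 60)):.2f}")
```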
C: Awesome, sounds good. So is there anything else we need to go through today?
C: Awesome, okay! Well, thanks so much for the demos and discussion, and yeah, good luck with the tuning and the saturation metrics. All right, take care, everyone, bye.