From YouTube: Firedrill Git Service has saturated its HPA
Description
Members of the infrastructure team review how to determine whether the Horizontal Pod Autoscaler has reached its capacity limit, and what to look for when it has.
A: Hey, I just did the initial run of customer staging on the new PoC environment, and it got all the way down to trying to pull the code in. Wow. So that was...
C: All right, so since we have at least a few people here, I'll go ahead and get started. So, as noted in... well, first off, announcements: Thanos is ready. Enjoy, have fun. Okay, so, the agenda: fire drill. Let's do it. There's a Google Doc linked in the issue.
C: So this one's a fun one, because we're in the middle of trying to improve this process. So in the meantime, let's try to poke holes in our current process.
B: No, I would... yeah, it would be interesting to see if there were flat lines anywhere. That would be... You are right. For our joiners: we are starting to walk through the fire drill, and I got volunteered to start blindly looking through dashboards to answer the question of: if we had hit the saturation point, what would we see? It's interesting; I would note that this quota seems to be blank, and that would be a potentially useful thing.
C: Well, I'll give you a hint: no, it's not okay. What that particular table is supposed to show is the CPU-specific quota resources. So for each pod, there'll be a listing of: hey, this pod is at 50% of its CPU resource allocation, for example. No.
C: So far, I'm not sure you are on the right track. The active ReplicaSet is currently the only thing we have in our dashboards that says: this is the count of pods currently running in this environment. In this particular case, you're concentrating your efforts on us-east1-b, so only one of our three clusters, which is not the best desired state, but this does give you our pod counts.
B: Right, I'm going to put on my really dumb manager hat and say: I know I should be looking at the GKE stuff in the GitLab on ops for our config there. I see Craig raising his hand, so maybe you want to...
B: Yeah, right now I'm on GitLab chat, but we would... I mean, we'd want to filter on the right service here, where I thought that one would be.
B: Yeah, so we'd want to highlight that guy, and if this again was the ceiling, then the 90 to 100% range would probably not be good, indeed. But thank you, Craig, for pointing that out. Then the other question, I think you were looking for, Scarborough: we could also look at the configuration in the GitLab deployments config, yep, and find the actual HPA config there, which I do not know off the top of my head. I'm going to go ahead and reveal that I thought I was going around pretty thoroughly to try to find that.
C: I have two questions regarding that. If we are at 100% saturation (this is a very easy question, but kind of dumb), what technically does that mean?
D: I would think that would be one cluster, like you'd have to check each zonal cluster independently. I didn't notice on that dashboard if it was... that dashboard is not broken down by cluster.
E: It depends on... well, at the moment, yes. I'm just thinking that the second we're talking about zonal-scoped clusters, one of them being saturated would actually be a problem. But that's so independent; don't get too distracted, right.
A: I'm going to say: if your load distribution is equitable and working like it should, and one of your clusters is at 100%, the other ones are going to be very close, if not at 100% also. Now, that's not to say there aren't corner cases where that could not be the case, but that's the ideal, I think. If it's not, then the problem isn't necessarily running out of space; it's: why are we either accumulating traffic in one zone, or routing traffic to favor one over the others?
C: So, as part of... let's continue the exercise. Dave's shied away from doing this, but where do we want to go if we want to bump up our maximum allowed pods?
C: That's precisely where we are. So, Craig, I don't know if you want to share your screen, just so we could do a quick overview as to what values are actually configured, because, as Dave quickly showed, we see a lot of values repeated a few times, so I think it might be worth just discussing. At least... yeah, thank you. I just wanted to try.
C: So what Craig is showing is that we've got our HPA configuration for each of these. The minReplicas is obviously the minimum number of replicas that we want to be running, and the max is the maximum number of pods we want to run. It's important to note that what you see for these values is per cluster.
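The shape of the configuration being described is the standard Kubernetes HorizontalPodAutoscaler spec. A minimal sketch, with an illustrative target and made-up replica counts rather than the actual production values:

```yaml
# Hypothetical HPA manifest; minReplicas/maxReplicas are the fields
# discussed above, and each cluster applies its own copy of these values.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-webservice        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-webservice
  minReplicas: 2                 # fewest pods kept running, per cluster
  maxReplicas: 10                # the ceiling; "saturated" means the HPA
                                 # is pinned at this count
  targetCPUUtilizationPercentage: 75
```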
C: So if we want to change this, we need to be aware of a couple of things. One is that the change is going to take place across all clusters, and then we also need to make sure we have enough resources available to us. Right above this, you see where we set our resource limits and requests.
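For context, a requests/limits block in a pod spec looks roughly like the following; the numbers are invented for illustration (only the four-gigabyte figure echoes the per-pod memory mentioned later in the call):

```yaml
# Hypothetical container resources block; the real values live in the
# deployment configuration being shown on screen.
resources:
  requests:
    cpu: "1"       # what the scheduler reserves per pod on a node
    memory: 4Gi
  limits:
    cpu: "2"       # hard cap enforced at runtime
    memory: 4Gi
```

Raising maxReplicas without checking these numbers against node capacity risks pods that the scheduler can never place.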
C: It would be wise to make sure that the node capacity we are using for this workload is sufficient to hold the number of pods that we need to run. So, Craig, I don't know if you want to delve into this, but if you want to pull up our Terraform repo, we could then look at what kind of nodes we're running, what size they are, and guesstimate how many pods we should be able to fit on these nodes. If you'd like to... sure.
C: Web service. All right, check out either the... I don't know what file you have open.
C: Just do a search for web service only, or, excuse me, git, because it's...
E: Web service was good. There we go.
E: So in Terraform we have, I'm assuming, in the variables for gprd... okay, git https. We are running on custom 16-20480s, which is 16 CPU... yeah, 16 CPU and 20 gigs of RAM.
E: But I don't know how many of them we have; that's just a machine type. Where do we define the node count for...
E: So there we go: a max node count of 50.
E: So we have 15 up to 50 nodes per zone, and each of those has 20 gigs of RAM. 20 gigs... gigs? Sorry, yes, gigs. So it's a thousand gigs of RAM per zone, and we allow four gigs per machine... Sidekiq? Oh, that's websockets.
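As a back-of-the-envelope check on the figures just read out: 50 nodes per zone at 20 GiB each gives 1,000 GiB of RAM per zone; if the four-gig figure is the per-pod memory request, that works out to on the order of 250 pods per zone at the ceiling, before accounting for system pods and other overhead.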
E: We don't have a request default level for the web service...
E: This we'd have to go look at in the actual deployed configuration.
C: It's just more like we look for... it's a node selector. Okay, so in this particular case I'm looking at Sidekiq, just because I have it on my screen. We have a value called nodeSelector: we look for a key of type, and we look for a value called sidekiq in this particular case, and we should see...
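The selector being described is a plain nodeSelector on the pod spec. A minimal sketch, using the key and value quoted above (the surrounding structure is standard Kubernetes, not copied from the actual config):

```yaml
# Hypothetical pod spec fragment: the scheduler will only place these
# pods on nodes labeled type=sidekiq, i.e. the dedicated node pool.
spec:
  nodeSelector:
    type: sidekiq
```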
C: Which is, you know, not the most efficient way of running Kubernetes at the moment, but in order to minimize the amount of change that we're introducing as we migrate things over, we're just creating new node pools to avoid the noisy-neighbor effect of our own workloads. I want to revisit that in the future, but that's a future task. I'm trying to finish up an OKR before I get to that process.
A: It tries to quantify that stuff into a common measurement between your available resources and what you're going to run, and compares them and says: hey, it looks like you're going to be over. Sometimes it's fine, sometimes it's not, but breaking things down into these common... you know, for VMware it is very simple to break it down to memory and CPU and compare them, and it sounds like we're doing this all by hand.
B: Well, and I thought... I mean, the other thing would be: it's been a while since I interacted with vCenter and vSphere, but I thought there were warnings on API interactions, and particularly in the UI, anytime you did stuff in the UI. But let's not talk about Kubernetes UIs for now. An API interaction would throw a warning back at you and say you're overprovisioned, or you could configure the API to do so. That's what I thought VMware had for that kind of stuff.
D: Nodes are... that's something more in the realm of Helm, because Helm does make some attempt to render its values files and... right, I mean, correct me if I'm wrong here and this is garbage, but Helm tries to do a pre and post, and generates a diff of current state versus new state, so Helm would be best positioned to maybe try to start teasing out that intelligence.
C: Yeah, to explain a little bit further: prior to us performing a deployment, we run a diff, and we validate that the changes we expect to occur show up inside of that diff as part of our merge review procedure for auto-deploys.
C: We have this little checky thing that says: hey, give me all the diffs, and if we find a configuration change that was not meant to be there, we bail out of the auto-deployment, just to make sure that we're not mixing a config deploy with an auto-deploy. To go into the details of what Cameron is trying to get towards: you were correct, Craig, in that Helm would probably be best positioned to start trying to determine whether or not we may potentially be overprovisioning ourselves.
C: It sounds like, like Dave said, we probably need some sort of tool that is able to coalesce all that information together. So maybe this would be a great improvement to the services thing that we've created. Do we still use that? I know we use it for creating dashboards and alerts, but I don't know if we still have our UI in front of it.
C: All the pieces are there. Yeah, kubectl will tell you when things are wrong: we have the event log, and if a pod is unable to be scheduled, we get the reason, and it could be that a node is out of a specific resource. It'll tell you that we've got 30 nodes available and we can't match any because, say, CPU is out of... or out of CPU availability, rather. So, kubectl.
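For illustration, the kind of scheduler event being referred to looks roughly like this when pulled as YAML (the pod name and node count are invented; the message format follows kube-scheduler's standard FailedScheduling wording):

```yaml
# Hypothetical output fragment from `kubectl get events -o yaml`,
# trimmed to the fields that matter for this diagnosis.
apiVersion: v1
kind: Event
type: Warning
reason: FailedScheduling
involvedObject:
  kind: Pod
  name: gitlab-sidekiq-abc123    # illustrative pod name
message: "0/30 nodes are available: 30 Insufficient cpu."
```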
C: I think the ultimate goal here would be: we set up the necessary alerts, so we know that we're getting saturated beforehand, and we know how to tackle that prior to it becoming a problem.
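A sketch of what such an alert could look like, assuming kube-state-metrics is exporting HPA metrics; the metric names follow kube-state-metrics conventions, while the threshold, duration, and labels are invented:

```yaml
# Hypothetical Prometheus alerting rule: fire while an HPA is running
# at (or near) its configured maximum, i.e. before hard saturation.
groups:
  - name: hpa-saturation
    rules:
      - alert: HPANearMaxReplicas
        expr: kube_hpa_status_current_replicas / kube_hpa_spec_max_replicas > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.hpa }} is above 90% of its max replicas"
```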
C: I understand we blew through that fire drill relatively quickly, so I think, for the remaining time, we should just ask questions and keep whatever conversation we want going. I don't have anything else that I want to cover specifically right now.
D: For the purposes of focus and brevity, should we cut the recording?