From YouTube: 2021-08-18 GitLab.com k8s migration EMEA
C
Well, let's get it going.
E
Morning — you doing fine? Yeah, yeah, good, good. So, exciting demo today. Let's go back over to you.
C
Yeah, so it took us three attempts so far. We haven't rolled back yet, but we are taking traffic inside of the canary stage on Kubernetes, so I'm just going to show a few — not really anything exciting — dashboards.
C
We can see that, you know, our Apdex is wiggly, but it always is — no biggie. If I go back to seven days, which would take a while to load, it always looks a little okay. The granularity changes, so maybe not seven days, maybe two days.
C
Thank you to Henry for helping me with the MR to review that. So we are now seeing saturation just shy of 80%.
C
Yeah, I feel like we should have a median, or maybe the average instead of the max, but whatever. As you can see, we did fix our metrics. This is when we first implemented everything and the metric just kind of dropped down to zero. I wasn't watching this chart; I was paying attention to only these things up here.
C
I did see the memory saturation, but when I was doing the migration I was like, we could toy with that, I'll revisit it later. But yeah, so our metrics are now fixed, so we have the data necessary. Interesting point about metrics: let me see if I can go back to our main stage, because we did make a mistake yesterday and I accidentally induced an outage because of it when I was attempting to fix the metrics — it might have been this period of time.
C
We have a very similar query to our API, but what was missed was that it wasn't looking for Workhorse specific to the web fleet. So during this time in which Apdex plummeted, what was happening was that we were gathering metrics for Workhorse across all of our services, inside both Kubernetes and our virtual machines. So instead of only taking into account roughly five thousand — well, I guess nearly ten thousand requests, between seven and a half and eleven thousand — we were taking into account over 14,000 requests, because those were, of course, across everything. Oops — I don't know why we don't see the RPS change here, but if we go to staging, I think this is where I went, oh yeah, I'm doing something wrong in our query. I don't know why that same thing didn't show up in production when that change was rolled out, but that highlights an issue with the way we deploy our rules for Prometheus, because I can't create something that targets only production or only staging.
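[A hedged sketch of the kind of fix being described: scoping a Workhorse Apdex-style recording rule to the web fleet with a label selector, rather than aggregating every service that runs Workhorse. The label names (`type`), the 1s bucket, and the rule name are illustrative assumptions, not GitLab's exact production rules.]

```yaml
# Illustrative Prometheus recording rule (assumed labels/names):
# restrict the Apdex ratio to Workhorse metrics from the web fleet
# only, instead of every fleet that exposes Workhorse metrics.
groups:
  - name: web-workhorse-apdex
    rules:
      - record: gitlab:workhorse_http_request_apdex:ratio_rate5m
        expr: >
          sum(rate(gitlab_workhorse_http_request_duration_seconds_bucket{
            le="1", type="web"}[5m]))
          /
          sum(rate(gitlab_workhorse_http_request_duration_seconds_count{
            type="web"}[5m]))
```

[Without a selector like `type="web"`, every fleet's Workhorse requests land in the denominator — which would match the ~14,000 versus ~7,500–11,000 request counts described above.]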
A
A question regarding the testing thing: do you mean how we can test the metrics — so, the dashboard or the metrics themselves?
C
So this is a chart showing all of our canary traffic for web, excluding the health checks on the web exporter, because those are always being hit, and we can see the majority of our traffic is Kubernetes. In fact, if I re-enable this filter, we see some traffic going to our virtual machines, but if you look at what that traffic is — we just recently had a deploy, so all you see here is us starting and stopping the services on our virtual machines. So that's good!
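[A minimal sketch of the kind of query behind such a chart, assuming Workhorse's request counter with GitLab-style `stage`/`type` labels; the exact label names and health-check routes are assumptions.]

```yaml
# Illustrative recording rule (assumed labels and routes): canary web
# RPS with the health-check endpoints that load balancers poll
# constantly filtered out, so real user traffic is visible.
groups:
  - name: canary-web-traffic
    rules:
      - record: gitlab:canary_web_requests:rate5m
        expr: >
          sum by (cluster) (
            rate(gitlab_workhorse_http_requests_total{
              type="web", stage="cny",
              route!~"^/-/(health|readiness|liveness)"}[5m])
          )
```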
C
Which goes nicely into our topic of discussion. We have one readiness review left over from the infrastructure side of things. I mistakenly assigned one of the reviews to someone who was on vacation, so I got two volunteers to help with that. One person has already completed the review; another person is doing that today. So if nothing comes out of that review, that's super exciting. I think our target would be early next week for the next time we shoot for performing the migration.
E
Yeah, that makes sense. I was chatting to Graham this morning; I think, with the release coming up, it fits in well with the timing anyway. So yeah, let's aim for next week.
C
The change request, he's still working on. I'm trying to figure out a way that we can reduce the severity — not severity, but the change level, I guess — because currently it's labeled as a C2, but we could do this as a C3. That way we're not being a blocker for auto-deploys, because it's going to be a multi-day thing, so we don't want it to be a very high C-level change anyway.
E
Cool, yeah, that sounds good. That sounds good, super. So one thing that was, I guess, a question mark — maybe it's too early to tell — but one thing that was a question mark around removing NGINX was whether we would see performance impacts. Is canary an environment where we can get an indication of whether we have impacted performance, or do we need to wait for the main stage for that?
D
I had the same question, because when I looked last time, when we enabled canary, the Apdex for instance looked at least as good as before last week — maybe even better. But now, if I look at it, it's not really a big difference. So I guess we also need to look into logs, right, Skarbek, for really figuring out if maybe some percentile of latencies got better or worse — and we should be able to filter this for canary, I think, to see this.
E
I mean, we can certainly look at it on the main stage. I was just curious because, as far as it stands, I think that was the only real question mark left from the testing Graham had done — whether this... there probably isn't, but there may be some small change. So we can look at that next week; that's fine.
C
Well, that's a good question. I'm just not sure how to find that information, because from a technical standpoint you removed an object that measures things — we removed NGINX — so because we don't have that to measure, because it's gone, I don't know how to say, hey, we improved or degraded performance. I guess we already have Apdex for canary, so we haven't negatively impacted that as far as I could tell. But our true method of determining whether or not we made any sort of improvement would be to look at HAProxy and look at the response times from our back end.
E
That's right. I'm not expecting it to be faster; the main big thing is stability, right? So we should see fewer odd things happening, which is good. Cool — well, we can monitor that next week. That sounds good.
D
I mentioned it in the EMEA reliability discussion.
E
Cool, okay, great. So yeah, let's just keep repeating that, so that as many people as possible know where we're up to. Super exciting stuff.
C
What we had was a lot of OOM-kill events occurring at that same time. So looking at the memory profile during the deploy, we were at our limit. This is one gigabyte, and we were pretty much at our limit, so any pod that was at that limit was going to get killed. But what I also saw was some saturation at the node level. So this is the amount of memory free, and then later Matt Smiley came in behind me and showed me a different chart — somewhere in here.
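[For reference, a minimal sketch of what such a per-pod ceiling looks like in a chart's values, assuming a 1 GiB limit like the one discussed; the request numbers are illustrative, not the production values.]

```yaml
# Illustrative container resources (assumed values): the limit is the
# per-pod ceiling, so a pod sitting at 1Gi is an OOM-kill candidate
# even while the node itself still reports free memory.
resources:
  requests:
    cpu: 500m
    memory: 512Mi   # what the scheduler reserves on the node
  limits:
    memory: 1Gi     # hard ceiling enforced by the kernel OOM killer
```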
C
That's not entirely contradicting what I was saying, but we still had memory available — just not down to zero; we had like three and a half gigs left on those server nodes. So I'm kind of surprised that we had some OOM kills in general, but we did have quite a few.
C
We had 149 events where things were being killed — and of course I can't pull this chart up anymore, because those logs have been rotated out as time has passed — but it's a mixture of either specific client events being killed or the actual process that manages the pod itself, and when the process that manages the pod gets killed, that entire pod gets removed from rotation.
C
If
we
lose
enough
capacity,
we're
not
going
to
be
able
to
serve
those
claim
requests
because
we'll
have
saturated
those
pods
in
various
ways.
We've
seen
in
a
previous
incident
where,
when
the
api
went
down
for
a
period
of
time,
gitlab
shell
ran
up
its
cpu
usage
off
the
charts.
C
In
this
particular
case,
our
memory
usage
went
way
too
high.
So
I
think
get
lab.
Shell
is
one
of
those
workloads
where
we
need
to
figure
out
a
better
way
to
tune
it
because
during
our
normal
usage
of
this
workload,
oh
no,
I
closed
my
tab
during
our
normal
usage
of
this
workload,
we're
greatly
over
provisioned
we're
only
writing
between
two
and
three
pods
on
these
nodes
and
if
you
go
somewhere.
C
This is what I get for not preparing. Let's see — we've got nearly six gigs of RAM that we could allocate, but we're sitting here using less than two. This blue line is how much we use; we're not using very much RAM at all, so a lot of the resources on these nodes are just not being used at all. But during times in which we are suffering, we'll use all of it, and the node itself will start to suffer.
C
Okay, so here's the number of clients that were connected, and the average number of processes, which to an extent correlates to the number of connected clients to GitLab Shell. We know that when we get to a high number, we start to saturate our pods.
E
Did we ever get anything, or did we ever ask the GitLab Shell team about the architecture? Because I feel like this changed at some point in the last six months or so and started causing us more of these — maybe not these sorts of problems, but it started to use up more memory and stuff. Did we ever ask them about that?
C
I asked about GitLab Shell and I didn't receive, like, a solid answer, but it sounds like the answer is technically no — they pass the information directly to Gitaly as necessary, and vice versa.
C
But the one thing I did put a feature request in for was to see if we can't ask both Gitaly and GitLab Shell to log the amount of data being transferred to and from the client. This is a common thing that we log for HTTP requests, but we don't do the same thing for this particular service.
A
These things interact with Gitaly with gRPC calls, right? Yeah, so I remember a conversation I had with Jacob when I was working on some improvement about cloning stuff. I think we were discussing this — now I remember, it was a long time ago — but basically what he told me is that the gRPC architecture is well designed for short and small data transfers.
A
But when you start moving chunks of data, it just becomes a memory hog, because basically it just allocates buffer after buffer, converts things from the internal structure to the wire structure, then serializes them and sends them over the wire. And when you receive, it's the same thing, right: you have to take the packet and convert it back into the memory structure, and things like that. So maybe it can be related to this, because it really depends on which kind of operation we are doing.
C
Yeah, and one of the things Matt Smiley noted was that when the OOM killer came into play, a lot of the time it was killing processes that weren't using a lot of memory, which is kind of problematic, because that means we are serving a lot of requests but there's not much the node can do to help keep itself stable and healthy.
C
So
it
I
know
gitlab
show,
is
in
the
process
of
reworking
how
gitlab
show
operates
entirely.
So
I
don't
want
to
delve
too
deeply
into
you
know,
trying
to
figure
out
what
we
could
do
better
for
that
service
at
the
moment,
because
I
feel
like
in
the
future
that's
going
to
change
significantly
with
them,
introducing
a
demon
versus
the
current
method.
A
We
can
still
counter
your
first
point,
which
is:
maybe
we
are
focusing
on
something
that
is.
This
is
not
affecting
this,
and
now
that
we
are
in
a
major
architecture
refactoring.
Maybe
it's
a
good
time
to
think
also
about
this,
because
maybe
we
just
changed
stuff
and
then
their
underlying
problem
is
still
there.
D
I just had a look at the memory usage of GitLab Shell during that — I think it was on the 12th, right? Yeah. And it looks like a very strange pattern: over, I don't know, two or three hours we had very low usage on this cluster B, I think, that I'm looking at, and also it's very spiky, right? It's going up and down, doubling the amount of usage in between. So I think the pattern of resource usage is very erratic in general for GitLab Shell, right?
C
If our pods were failing during a deployment — which was the case in this particular situation — we were removing 25% of the capacity, thereby allowing any new pods, and any existing pods that had yet to be rotated out during the deploy, to suffer more. And I think because of that we started hitting our memory limits, things were getting killed, and that was just a cascading failure until things got to a stabilization point.
C
So I modified our deployment strategy — we already use this for our web deployments — where, instead of allowing 25% of pods to be taken out of rotation, that value is now zero. So instead of scaling down and up at the same time, we only scale upwards before we start tearing down old pods, and those new pods have to be up, ready, and taking traffic before any older pods get taken out of rotation.
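[A minimal sketch of that strategy change on a Kubernetes Deployment; the surge percentage is an illustrative assumption, not the production value.]

```yaml
# Illustrative Deployment strategy: with maxUnavailable set to 0, the
# deploy surges new pods first and only tears old ones down once the
# replacements are Ready, so capacity never drops below 100% mid-deploy.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # extra pods allowed above the desired count
      maxUnavailable: 0   # never take existing pods out of rotation early
```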
C
We
see
that
we
scaled
up
new
pods
and
then
we
were
doing
a
little
wobbly
effect
during
the
deploy,
and
then
we
went
back
down
to
where
we
were
so.
We
started
at
45
prior
to
the
deploy
we
finished
with
45
pods
and
then
our
hp
is
doing
its
job.
As
you
know,
load
happens.
So
that's
precisely
what
I
want
to
see.
C
In this particular case, when the CPU usage goes up because of load, it will add new pods, and vice versa: when CPU usage goes below a certain threshold, it will remove pods.
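[A hedged sketch of such a CPU-driven autoscaler; the target name, replica counts, and utilization threshold are illustrative assumptions, not the production configuration.]

```yaml
# Illustrative HorizontalPodAutoscaler: adds pods when average CPU
# utilization exceeds the target and removes them when it falls back.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-shell
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-shell
  minReplicas: 2
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```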
C
This
is
the
one
thing
that
we
don't
have
with
virtual
machines,
so
we're
not
able
to
scale
up
our
workload
to
handle
client
requests
instead
of
having
to
wait
for
an
infrastructure
engineer
to
be
like
hey,
our
cpu
usage
is
too
high.
Let's
add
three
nodes
and
they'd
spend
the
next
two
days
trying
to
figure
out
how
to
accomplish
that
task.
E
Back
to
your
point
that
you
made
alessio
what
were
you
talking
about
the
get
up
shell
architecture,
as
opposed
to
our
infra.
A
Yeah,
I
was
talking
about
the
gitlab
shell
architecture,
so
you
said,
if
the,
if
you're
really
rethinking
it,
maybe
it's
worth
just
to
put
also
this
question
on
their
plate
so
that
when
they
are
thinking
about
what
to
do,
they
may
consider
also
this
type
of,
because
I
just
very
very
briefly
just
as
an
idea
right.
So
if,
as
also
as
henry
pointed
out,
the
behavior
memory
behavior
is
erratic.
A
So
probably
we
need
something
which
is
more
within
the
business
logic
that
can
really,
let's
say,
made
a
guess
or
an
estimation
about
the
incoming
request.
How
much
memory
will
it
take?
And
probably
this
is
the
type
of
project
the
type
of
software
where
we
need
to
have
tighter
memory
management
within
the
process
itself.
So
let's
say
something
like
this:
we
get
a
new
client
and
we
know
that
in
we
have
a
memory
limit
because
we're
running
kubernetes,
so
we
need
to
bring
this
information
down
in
the
process.
A
So
we
know
that
we
have
an
upper
limit
of
say
one
gigabyte
in
the
spot
right,
so
the
process
know
how
much
request
is
serving
and
roughly
the
memory
that
is
using.
So
if
there
is
a
gisely
call
that
can
kind
of
estimate
the
amount
of
memory
needed
for
the
incoming
requests,
it
will
make
an
informed
decision
about
serving
it
or
just
putting
in
a
queue,
because
you
don't
want
if
this
is
serving
more
than
one
request.
C
Something to keep in mind is the way GitLab Shell currently works: right now, when you make an SSH connection, GitLab Shell spins up a brand-new process for each new connection. That in itself has its own memory overhead, whereas in the future architecture GitLab Shell becomes its own daemon. It runs the necessary TCP stack to accept an SSH connection, and it's not spinning up a new process; instead it will be a single process that can then manage those connections as necessary, and it'll have its own handle on memory, because it's managing all of the connections instead of a single process that is disconnected from all of the others. So the architecture will be better; it's just a matter of what future impediments or situations we need to be aware of whenever we turn that on. I don't know what the timeline is.
C
I
know
our
helm
chart
now
supports
switching
over
to
the
new
shell
demon,
but
I
don't
know
where
that
team
is
at
with,
but
actually
rolling
that
out.
E
Can we find out? Because I think that would be a great one for us — to plan stuff, but also to know how much we should worry about this stuff or the metrics and things. Even if it's just a comment on this issue we have — especially since we can link them to this video — to see if there's any kind of practical stuff.
C
I'll figure — I think, yeah, I'll do just that, actually.
E
Super, thank you, awesome.
E
Does that give you what you need, Skarbek, in terms of overviewing this thing? Is there anything you need from our side today to help here?
C
For this issue, no. At this point I've kind of completed my investigation; I pulled in a few other people who were curious about this, and they've kind of completed what they wanted to see out of it. If anything, Jacob may have more questions, but I'm going to go ahead and proceed to close this issue. At this point in time, I've already completed an initial corrective action that I think was the most beneficial — I just showed the chart — so I think we're in a good state.
E
Awesome, cool, sounds good — great. And then goals — we've kind of touched on them already: the readiness review and change requests. Is there anything else, and do we need to give anything, provide anything, or help with anything?
C
No — I commented on the CR that Graham has started, to hopefully make it a little easier on ourselves, and then I've got one person that's doing the review — er, performing a readiness review. So those are the two things I know of that are going to be holding us up, but I think we're getting near ready.
E
Nice. We can probably see it — I'm sure it's in the emails — but on the main web rollout, what's the kind of rollout plan? Are we going cluster by cluster, or how are we going to get this out?
C
The current plan that Graham has put together — it looks like we start by introducing the new configuration one cluster at a time. Here, I'll roll through this real quick, since I have the issue up. So we've got a Chef change, and all it does is add, for one cluster, the back end of the web endpoint to that back end, and we do that per cluster.
C
So at this point one set of HAProxy nodes will send three percent of the traffic to that cluster, and we'll get to that point at the end of, I guess, day one — wait an hour... wait one day, yeah. So day one, we do just the initial Chef change across our front-end nodes; after that, we just play with the weights until we reach 100% of the traffic being in Kubernetes.
C
That's the initial plan. This is a little bit intrusive, because it involves stopping Chef on our fe nodes, executing the merge request, and then running Chef. I guess you don't have to do it targeted, because we're doing this one cluster at a time, but I think we can make this easier, and then we could reduce the change level.
E
Okay, awesome. Thanks so much for demoing, Skarbek, and yeah — exciting, great progress. Exciting to see things on canary, and looking forward to the next step.