Description
Part of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/143
I'm going to attempt to share my entire desktop. Someone let me know if that is just unreasonable, because I do have a relatively large screen, but I'll try to bump the font size. Can I get a thumbs up from jarv, or a thumbs down if it's horrible? It looks good? Alright.
So the work in progress for enabling auto-deploy of our Kubernetes workloads is still ongoing. There are a few puzzle pieces that we're closing in on, but at least we now have the ability to trigger an upgrade.
So all we need to know is the environment, whether this is a dry run, and the specific image that we want to upgrade our Sidekiq image to. In a future iteration we'll also be upgrading our Helm chart, once some other work that we're waiting on is in place.
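For illustration only, those three inputs could look something like the following in a manual CI job; the job name, variable names, and script are hypothetical, not the actual pipeline definition:

```yaml
# Hypothetical sketch of the trigger inputs described above.
trigger-sidekiq-upgrade:
  stage: deploy
  when: manual
  variables:
    ENVIRONMENT: "pre"       # which environment to upgrade
    DRY_RUN: "false"         # render the diff only, or actually apply it
    SIDEKIQ_IMAGE_TAG: ""    # the specific image tag to upgrade to
  script:
    # illustrative wrapper; the real tooling is not shown in this discussion
    - ./bin/upgrade-sidekiq --environment "$ENVIRONMENT" --dry-run "$DRY_RUN" --image-tag "$SIDEKIQ_IMAGE_TAG"
```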
To keep the scope limited, I'm limiting this to Sidekiq only, because that's what we're concentrating on. As for the Registry: if we get auto-deploy working properly in general, meaning that we get our Helm image values updated with the appropriate information, we'll be able to auto-deploy the Registry at the same time we enable auto-deploy of our Helm charts.
Yeah, so as we can see in our diff, the Sidekiq container is going from... we're running a horrendously old version for some reason, probably EE, and we're going to update that to 12.10, which is what I specified. Oh, and I saw something else: the dependencies container is getting the same image name, and that's the only thing in the diff, so I'm confident that, theoretically, this will work precisely the way we want it to.
If I could remove that, I could reveal our values, since we're not going to show anything special, but you can see we're pulling in yesterday's image. We're operating on the pre environment and we set dry run to false. Dry run being false is very important, because that controls a few things. One is the building of the proper pipeline: we want to make sure that we pull credentials that are not read-only, so we're using our writable service key instead of our read-only service key. But our diff shows the same thing: we're going to update our image.
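As a rough illustration of that behaviour — the variable names here are made up, and this is not the actual deploy script:

```yaml
# Hypothetical sketch: a dry run sticks to read-only credentials, a real run
# switches to the writable service key so the upgrade can actually be applied.
deploy:
  script:
    - |
      if [ "$DRY_RUN" = "true" ]; then
        export SERVICE_KEY="$READ_ONLY_SERVICE_KEY"
      else
        export SERVICE_KEY="$WRITABLE_SERVICE_KEY"
      fi
```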
That's the only change for both the Sidekiq and the dependencies containers, and we can see down here that the upgrade actually started, and now we're waiting for that deployment to be ready. I could sit here and do a watch with kubectl on the Sidekiq deployment, because this is pre and I have access to this cluster locally.
I mean, it does make this a bit simpler, because then you don't have to figure out the timestamp format, or the pipeline check, which is a different format than the image, so that would make this a little bit simpler. Yes, I'm a little bit worried that this image we're going to test with next might be, you know, more recent than what's on pre, because it was yesterday's, but we may have...
Instead of having some awkward default, our Helm file is configured to query the cluster to figure out what version of the Sidekiq image is running. That way, if we ever need to make a configuration change that does not include an image tag environment variable, it will automatically populate our Helm template with the appropriate Sidekiq image. That way, we don't accidentally change the image value unnecessarily.
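A minimal sketch of that idea, assuming it is implemented with Helm's built-in `lookup` function; the deployment name, namespace handling, and value names are guesses, not the actual chart:

```yaml
# Hypothetical: fall back to the image currently running in the cluster when
# no image tag is supplied, so config-only changes don't touch the image.
{{- $current := lookup "apps/v1" "Deployment" .Release.Namespace "gitlab-sidekiq" }}
{{- $runningImage := "" }}
{{- if $current }}
{{- $runningImage = (index $current.spec.template.spec.containers 0).image }}
{{- end }}
image: {{ .Values.sidekiq.image | default $runningImage }}
```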
Okay, well, the production cluster... well, all clusters except for pre are now updated with a version of the image. So we've got a few things we need to look into. One is improvements to validating that our images are being built properly, because the last two that should have been built are not available for whatever reason. So we need to figure out why that is, and maybe add an improvement to our pipeline to check the registry that the image is available. And then on the side...
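That registry check could be as simple as something like the following; the job name and image path are placeholders, and `docker manifest inspect` is only one of several tools that could do this:

```yaml
# Hypothetical pre-deploy check that the requested image tag actually exists.
check-image-exists:
  stage: check
  script:
    - docker manifest inspect "registry.example.com/gitlab/gitlab-sidekiq-ee:${SIDEKIQ_IMAGE_TAG}" > /dev/null
```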
The important part is that we had a configuration change made in Chef that was not reflected in Kubernetes. This is something that I'm documenting today; I've written up a document that I'm about to submit in a merge request. But we don't have any way of saying: hey, dear person making this configuration change, this impacts Kubernetes, please make sure that we get a pipeline configured to apply that change to our Kubernetes environments as well.
We've enabled the vertical pod autoscaler. This is not something we're going to turn on fully; in other words, we're not going to enable VPA to adjust the pod resource requests and limits automatically, because that's not something you want to do with HPA enabled, but it does give us some recommendations on what the VPA thinks is good.
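For reference, a recommendation-only VPA looks roughly like this; the object and target names are illustrative, not our actual manifests:

```yaml
# Hypothetical VPA in recommendation-only mode: it reports suggested requests
# but never modifies the pods, so it doesn't fight with the HPA.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: sidekiq
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-sidekiq
  updatePolicy:
    updateMode: "Off"
```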
What are good values for limits and requests? I think we'll start with staging. Staging is only interesting for sidekiq-export, because that's where we have load generated. I don't think there's anything here that is too surprising. These terms — lower bound, target, uncapped target, and upper bound — I defined them up here.
I don't think I 100% understand what they are, but this is kind of a high-level explanation. Note that these are not limits but requests, so this isn't recommending what the limits should be, but the requests, and the request is the amount that's reserved for the pod when it gets provisioned on the cluster. So I guess what we want to compare is the requests that we have now versus what it is recommending. You can see that these are the current requests for CPU and memory.
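For context, the fields we're reading come from the VPA status and have this shape; the numbers below are made up purely to show the structure:

```yaml
# Shape of a VPA recommendation (status section); values are illustrative.
status:
  recommendation:
    containerRecommendations:
      - containerName: sidekiq
        lowerBound:     { cpu: 50m,  memory: 150Mi }
        target:         { cpu: 100m, memory: 250Mi }
        uncappedTarget: { cpu: 100m, memory: 250Mi }
        upperBound:     { cpu: 500m, memory: 1Gi }
```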
Interestingly, for Sidekiq, you can see that it actually thinks we should have much higher requests. I think this makes sense. Skarbek, maybe you even chimed in here, but I think what it's telling us is that our requests are too low, and that kind of makes sense, because they are pretty low as they stand now, right?
So I think it's even more interesting for the Registry. The problem we're trying to solve, what this issue is about, is the fact that we're getting a lot of evicted pods in production, and it appears that the node is just running out of memory. It's not that we're hitting the memory limit for the pod; we're hitting the memory limit for the node.
Yeah, yeah. So just imagine you start adding a bunch of idle pods, and then they start receiving traffic and the memory grows on each of them, and then boom, you run out of memory on the node. Yeah, so I think we just increase this slowly. We start with maybe a CPU request of 100 and a memory request of maybe 150, and do it in increments.
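Assuming those numbers mean 100 millicores and 150 MiB (the units weren't stated), the first increment would look something like this in the chart values:

```yaml
# Hypothetical first increment for the registry pods; units are assumptions.
resources:
  requests:
    cpu: 100m
    memory: 150Mi
```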
So I kind of want to take a small step first, just to make sure there isn't some assumption that we're missing here. I'd like to take a baby step first, then look at the behavior of the cluster, and then decide what to do next, because what we want to avoid is not having the node capacity, since it takes time to scale up, and ending up with something that affects the service in production.
Just a quick review of what's left in the epic. I think pretty much everything is actively being worked on, and this one has a doc. Yeah, so everything is assigned right now, so I think we're in pretty good shape there. Also, with regard to Postgres, given the nature of project export I just don't think it's putting a lot of load on it; it's bursty, so I think we'll probably be okay.
But it is really easy for things to creep into the scope. So how about this: on Monday we roll out a change to production so that we can start taking the traffic, and then have a task until Wednesday to manually bump Sidekiq if we don't have this done by then. When I say this, I mean the...
Anyway, what we could do is just set the jobs that do the triggers to allow_failure: true. That way we're not blocking the pipeline, but we could, outside of the deploy, investigate why Sidekiq is failing to deploy. That alone will get us some information as well, and we can make improvements in parallel to rolling this stuff out to production.
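A sketch of that, with a hypothetical job name and downstream project path:

```yaml
# Hypothetical: the trigger job may fail without blocking the rest of the pipeline.
trigger-sidekiq-deploy:
  stage: deploy
  allow_failure: true
  trigger:
    project: my-group/k8s-deploy-tooling
    strategy: depend
```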
We have alerting for it, right? So we can leave a change issue or something open, and then just have the SRE on call be aware: hey, this is where you go when you see this alert, and this is what you need to do to scale down so that production can continue as normal. And I kind of expect that over the full week we won't have a problem, but I'm not psychic.