B
First, I was reviewing the data from the Project X program in production, and I think I'm pretty happy with the results. We also looked at data from staging, since we were, or are, running on VMs and pods there as well, with a slightly more consistent workload. But I think what we see is the same or better performance.
B
The better performance I'm attributing to NFS. I didn't get into it too much, but I think that's most likely what it is, because we have this outstanding issue where we're writing to the uploads directory for project exports, because of CarrierWave. So that's that; I don't think there's much else to discuss there. Scarback, do you want to take the next one?
C
Yeah, what I discovered was that there's a disconnect between the rules that we generate for Prometheus versus the current mixin used by the community, which we're sourcing for our dashboards. So I found a few rules that we were missing, related to disk metrics, and I added those back in, but thus far there's a blocker that is preventing me from applying these rules across all Prometheus instances.
C
Throughout all environments. So right now it's still broken; I've got a merge request that hopefully addresses this issue, which I've assigned to Andrew. That's completely unrelated to this; it's just that there's something wrong with a set of rules and Prometheus is crashing, Prometheus is not happy, so I'm trying to fix that. It's a work in progress.
C
And that merge request is associated with an issue that I've been doing some investigation on, to figure that out.
A
What I'm concerned about here is that it's different when we have VMs versus autoscaling, where we could have a user start up 100 export jobs for projects that are 100 gigabytes in size, and they could cause our autoscaler to kick in continuously for a long period of time and starve out everything else that we have in the queue. Not starve out.
A
Sorry, block everything else that we have in the queue, and instead of having a very short burst, we would have a very long delay with a very long tail of actual proper jobs that would succeed. And we don't have any limit, any control, beyond giving those three retries, or however many retries we have, I don't know exactly how many, which will not be sufficient, because we'd be in a state of degradation for a while, maybe even down.
B
This wasn't that reason, but yeah, there are limits for per-project exports. Now, of course, you can create lots of projects to get around it; that's how the load tester works. I guess, Marin, the question might be: are we more susceptible to abuse in Kubernetes than on VMs, particularly with the autoscaler?
C
We can certainly spin up a dedicated node pool which has a different set of node CPU and memory requirements, and then we could apply a deployment to that node pool, potentially. Jason Plum had an idea; he commented on the issue where I'm chasing this exact question about what to do with resources, and he suggested using the vertical pod autoscaler to see what recommendations it may have, because we can run it in a mode where it doesn't actually apply anything to the deployments, it just provides us with information.
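
(For illustration only: the vertical pod autoscaler can indeed run in a recommendation-only mode, where its updater never touches the deployment and you simply read back its suggested requests. A minimal sketch of doing that with the Python Kubernetes client; the namespace and VPA name below are assumptions, not taken from the meeting.)

```python
from kubernetes import client, config

# Sketch: read recommendations from a VerticalPodAutoscaler that runs with
# updateMode "Off", i.e. it only observes and never mutates the deployment.
config.load_kube_config()
api = client.CustomObjectsApi()

vpa = api.get_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="gitlab",            # assumed namespace
    plural="verticalpodautoscalers",
    name="gitlab-sidekiq",         # assumed VPA object name
)

recs = vpa.get("status", {}).get("recommendation", {}).get("containerRecommendations", [])
for rec in recs:
    # "target" holds the suggested CPU/memory requests for each container
    print(rec["containerName"], rec["target"])
```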
C
I would like to enable that for the next time we perform an experiment in production. I just haven't had any time this week to chase that down, as I've been working on some of the auto-deploy stuff right now, so hopefully next week that's something I could dig into more, if that's an option we want to go down. Or we could try two of these at the same time: set up another node pool and see what it takes to apply Sidekiq dedicated to a separate node pool.
B
Right, I think from a cost point of view it's probably not a problem; I mean, I think we're talking about at most one node. If we isolate, is it possible to have workloads prefer one node but also allow them to run on additional nodes if there's a need, or is it isolated within the node pool?
A
I would rather focus on optimizing, controlling the blast radius, because that actually has potentially more of a money impact than, you know, nodes booting up and spending some money, or rather spending some cycles, because being down actually costs us more money, or being in a disrupted state costs us more money. And then once we have a grasp on that, we can focus on how we optimize the cost as well.
B
One thing that might restrict us: the chart doesn't support it. I did look into upping the concurrency of Sidekiq, you know, just running Sidekiq with a higher concurrency, and it's not clear to me; I know that this creates multiple threads to work on jobs, and I thought there would be some concurrent job processing, but for project export it was still one job at a time, not more than that. So I'm not sure how to get more concurrency for project export within a pod.
D
I was imagining it like this; like, let's talk about it. The way I imagined it is: you know, we've got that open issue about the Sidekiq dry run, and instead of it just printing out, sorry, printing out a bunch of stuff in JSON or whatever it is, it actually prints out the exact command, like sidekiq blah blah blah. And then there's some shell script, whatever you guys use in Kubernetes land, I guess it's just bin/sh, and it runs sidekiq-cluster.
D
You know, with the dollar queue selector or whatever. It gets back the command, and then it literally does an exec on that, and so at the end of that boot sequence, all that's left is Sidekiq. There are no intermediaries, you know, because the exec will replace bin/sh with the Sidekiq process. So there's nothing in between in that process tree, right; it's just Sidekiq and Kubernetes.
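
(A rough sketch of that boot sequence, purely to illustrate the exec-replaces-the-wrapper idea being described; the dry-run flag, path, and queue selector variable here are assumptions, not the actual chart entrypoint.)

```python
import os
import shlex
import subprocess

# Hypothetical wrapper: ask sidekiq-cluster what it would run (dry run),
# then exec that exact command so this wrapper process is replaced and
# Sidekiq is the only thing left in the container's process tree.
queue_selector = os.environ.get("SIDEKIQ_QUEUE_SELECTOR", "*")

dry_run = subprocess.run(
    ["bin/sidekiq-cluster", "--dryrun", queue_selector],  # assumed invocation
    capture_output=True, text=True, check=True,
)

cmd = shlex.split(dry_run.stdout.strip())
os.execvp(cmd[0], cmd)  # no intermediary: exec replaces this process
```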
D
Okay, yeah, you don't need to do that inside, like, Kubernetes land, because then it gives us the mapping functionality and stuff, but it doesn't get in the way. Because, obviously, if we run sidekiq-cluster as process one, with Sidekiq inside, there are just more things that can kind of go wrong, you know.
A
The thing is, when we're talking about GitLab.com and the work we are doing here, we get to utilize whatever we build almost immediately. We have auto-deploy built in charts, or it is being worked on in charts and in omnibus, which means the timeframes we are talking about here are shorter. So instead of having it officially released in 12.10, we could have it, I don't know, next week, for example. So that's the timeframe we need to talk about when we're talking about .com, okay?
C
If
we
moved
a
production
and
we're
running
16
pods-
and
we
kick
in
at
a
deploy
complete
within
25
minutes,
I'm
hoping
to
test
on
staging
what
it
looks
like
because
the
staging
nodes
are
a
lot
larger
than
the
nodes
in
pre.
So
we
might
be
able
to
see
a
difference
and
how
long
it
takes
an
employee.
So
currently
we
time
out
after
5
minutes
that
may
not
be
sufficient
for
the
length
of
time
it
takes
sidekick
to
get
up
and
ready
and
running
and
running
and
ready.
C
This was related to the fact that it takes so long for the pod, this is poorly titled, this is how long it takes for the pod to transition from running to ready. We're discovering it takes between one and two minutes in certain cases, but, like in the previous issue, we've seen it take over five minutes and such. So there's an issue inside of charts to figure out what we need to do in that particular situation.
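
(As a side note, one way to put numbers on that running-to-ready window is to read the pod status directly. A quick measurement sketch with the Python Kubernetes client; the namespace and label selector are made up for illustration.)

```python
from kubernetes import client, config

# Sketch: for each Sidekiq pod, report how long it took to go from the
# container starting (running) to the pod's Ready condition flipping true.
config.load_kube_config()
core = client.CoreV1Api()

for pod in core.list_namespaced_pod("gitlab", label_selector="app=sidekiq").items:
    ready = next((c.last_transition_time for c in (pod.status.conditions or [])
                  if c.type == "Ready" and c.status == "True"), None)
    started = min((s.state.running.started_at for s in (pod.status.container_statuses or [])
                   if s.state and s.state.running), default=None)
    if ready and started:
        print(f"{pod.metadata.name}: {(ready - started).total_seconds():.0f}s running -> Ready")
```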
C
We've got Robert working on a part of it related to the release tools. I picked up some work, also related to release tools, about waiting for stuff. That way we could start moving the triggers between omnibus and release tools, and then it's a matter of supplementing our current deployer with the necessary tooling such that it can reach out to the Kubernetes workloads and perform a trigger in that regard. So it's a work in progress.