From YouTube: 2021-06-02 GitLab.com k8s migration EMEA
A
Okay, I'll... go back, but he might be.
C
So we have this fine-tuning... my cat is annoying. We have this fine-tuning of the API on GKE issue, and what we did so far is try to decrease the number of minimum replicas, because we already hit the floor most of the time, and also increase the HPA target, right?
C
So in the first try we raised it a little bit too high, which caused a slight apdex drop because we used too much CPU, but now we have tried it with a less ambitious number, which is a target value of 28,800 requests, and in general this looks fine.
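
For context on why raising the target helps: the horizontal pod autoscaler sizes the fleet from the ratio of the observed metric to the configured target, so a higher target average value means fewer desired replicas. A minimal sketch of that calculation; the 28,800 figure is the target mentioned above, and every other number is invented for illustration.

```python
import math

def hpa_desired_replicas(current_replicas: int, observed_avg: float, target_avg: float) -> int:
    """Kubernetes HPA rule of thumb: desired = ceil(current * observed / target)."""
    return math.ceil(current_replicas * (observed_avg / target_avg))

# Invented example: 20 pods currently averaging 30,000 on the scaling metric.
print(hpa_desired_replicas(20, 30_000, 24_000))  # stricter target -> 25 replicas
print(hpa_desired_replicas(20, 30_000, 28_800))  # relaxed target  -> 21 replicas
```
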
C
So it should be clear, then, what to do and when in the next steps. How much detail should I go into? Because we are now talking about things we already discussed before in the same group.
A
Yeah, we can just link that in. Graham will have to watch two videos, I think, but that's totally fine. Are there any live dashboards or other interesting things?
C
Yes. So the interesting thing is, first of all, looking at the apdex: of course, since we did the change, you see this spike here. This is when we deployed the change to canary, and each time we deploy to canary we get an apdex drop spike, and you see the apdex didn't change after that. So it's nearly the same. With the higher value we saw a decrease in apdex, so this time we seem to be good.
C
If you look at the Ruby thread contention, we started the change around here, and you see that thread contention, the red line, got a little bit higher, but not as much as in our first try. So we are also fine there. Also, I checked back over the last 30 days, and when we were running on VMs our Ruby thread contention was even higher than we have had it most of the time in Kubernetes.
C
So we still have some headroom there, but I think we are okay here. So I think we are really fine to try this on one production cluster now. We can also see how the node count came down a little bit in canary; this is the red line.
C
We started the change here, but we didn't get much lower than one day before, which is around here, so it's hard to see whether this has a big effect on canary, especially because we just have a few nodes and we are not at a low-traffic time right now. I expect this to go down tonight. The same goes for pod count: there's not much of a difference to see, but a little bit at least. Then, let's zoom in to where we did the change.
C
Lowering requests doesn't mean that we use more CPU per pod, right? It's just that we can't scale higher if we spike, and so it depends very much on the patterns of the spikes that we see on our individual containers. I think that's the thing which is hard to predict. Looking at the CPU, I don't see it spiking too much.
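
To illustrate that trade-off: the CPU request only determines how tightly pods are packed onto a node, so a smaller request means more pods per node and less room for each of them to burst into when they all spike at once. A rough sketch where the allocatable capacity and both request sizes are hypothetical.

```python
ALLOCATABLE = 15_000  # hypothetical schedulable CPU on one node, in millicores

def packing(cpu_request: int) -> tuple[int, float]:
    """Pods that fit on the node, and the rough per-pod CPU ceiling if every
    pod on the node spikes at the same time (node CPU shared evenly)."""
    pods = ALLOCATABLE // cpu_request
    return pods, ALLOCATABLE / pods

for request in (4_600, 3_000):  # illustrative request sizes only
    pods, ceiling = packing(request)
    print(f"request={request}m -> {pods} pods/node, ~{ceiling:.0f}m each during a shared spike")
```
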
B
Maybe we should... if Ruby contention is... I'm just thinking long term, because we had mentioned this in our other meeting. But I wonder if Ruby thread contention is the driver, or maybe the first driver, for where we start to see issues. Maybe we could leverage that as a method of scaling, as our custom metric. Yeah, we could make sure that we don't ever exceed 80 percent Ruby thread contention.
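
A tiny sketch of how that idea could look as a scaling signal; the 80 percent ceiling is the number mentioned above, while the metric samples and pod names are entirely made up.

```python
# Hypothetical per-pod Ruby thread contention readings (fraction of time threads
# spend waiting); only the 0.80 ceiling comes from the discussion above.
samples = {"api-pod-a": 0.71, "api-pod-b": 0.88, "api-pod-c": 0.83}
CEILING = 0.80

avg = sum(samples.values()) / len(samples)
if avg > CEILING:
    # In practice this average would be fed to the autoscaler as a custom metric,
    # instead of or alongside CPU, so the fleet grows before contention bites.
    print(f"average contention {avg:.0%} exceeds {CEILING:.0%}; scale out")
```
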
C
This is nearly doubling the number of nodes, and I think this is causing a lot of CPU and, I don't know, maybe warm-up time, so this really seems to be an issue.
B
Do you think we may be prematurely making our nodes ready? Or pods ready, rather.
C
Talking about workers, there was an interesting remark from Matthias in the issue, where he mentioned that instead of increasing the number of workers we could increase the number of Puma threads, because that would get around the lock contention and still give us memory reuse.
C
I'm not the expert here; I am also not sure. I just know that we have this minimum of one and maximum of four threads for Puma right now in GKE, but I don't know how Ruby or Rails decides that we now use, I don't know, two or three threads and four or five workers, how this is mixed, right? So I wanted to look into that, but I don't know it yet.
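
For background on how the two knobs combine: each Puma worker is a separate process with its own GVL, and the threads inside a worker share that lock, so a pod's request concurrency is roughly workers times threads. A toy sketch; the one-to-four thread range is the setting described above, and the worker counts are assumptions.

```python
MIN_THREADS, MAX_THREADS = 1, 4  # the per-worker thread range described above

def pod_concurrency(workers: int, threads: int = MAX_THREADS) -> int:
    """Rough upper bound on simultaneous requests per pod: one slot per thread."""
    return workers * threads

for workers in (2, 4):  # hypothetical worker counts per pod
    print(f"{workers} workers x {MAX_THREADS} threads -> {pod_concurrency(workers)} request slots per pod")
```
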
A
What are the risk areas? Like, if that's a bad idea, what would be the impact?
C
I mean, it's just how you distribute load across different CPUs, right, and that's what you're juggling with when you change things there. So it's kind of a tuning thing. I don't see...
C
I guess we could just leverage Matthias Kepler chiming in on this issue, because I think he's very much into the details of this anyway, and he's very interested in it, because he was asking questions about what we are doing with tuning in GKE right now; he wanted to understand it better. Maybe he can even just give us advice on how this juggling of threads and workers happens internally. So we can just make use of his interest. Yeah, I will ask him in the issue.
A
Great, okay. So it looks like we're making good progress today. Is there any additional stuff we want to go through on tuning?
A
The issue, as we've got it written at the moment, has quite a long list of runbooks in it. Is that achievable? Is that how we want to tackle this stuff?
B
The work that I'm doing is mostly just updating existing documentation, because a lot of it is out of date. I am going to ask for feedback when I have a merge request ready, but I'm still touching various bits and pieces of documentation, so I'll ask for feedback when we get to that point.
B
I do want to see if I can get some feedback as to what runbooks we should add that might be specific to the API. I'm trying to shy away from creating "hey, this is Kubernetes, and this is API", because there are going to be a lot of troubleshooting steps that you could take with Kubernetes in general, not specific to the API, so I'm trying to avoid that part. But if we are missing something that's specific to Kubernetes that we don't have anywhere, that's something we definitely need to address in some way, shape or form.
A
Okay, sounds good. Please keep the description updated with whatever you decide that issue should focus on.
A
Awesome. And are there other pieces to the API service? We've got runbooks, we've got the tuning to wrap up, we've got the retro issue, but otherwise, are we feeling that the API is complete?
C
I have a question about production changes. You know, when we do production changes via Chef now, we often need to create change request issues, depending on the impact, the possible impact that it could have, right? And I'm asking myself, if we do this via Kubernetes now, how do we evaluate the possible impact of changing configurations? Because in most cases it works differently from what we often do with Chef, where we just, you know, stop the Chef client somewhere, try it on just one node and things like that, and I'm not sure how we should deal with that.
C
Configuration changes much more often, so it's more about thinking about a policy for how to work with that.
C
It's about change request issues: when we do changes in production with Chef, we often create one if we think there's a high impact.
D
I think it's really the judgment of the... I mean, we need to trust the judgment of the SRE. We typically use change requests for changes that involve a lot of distinct steps and manual actions, and where there's continual monitoring required and possibly a handover necessary. So for these types of changes, I would say, yeah, maybe it would be better. We're kind of using this issue as a change request, because it has links to monitoring, but perhaps it would be better if we did a change issue.
D
I dislike the bureaucracy as well. The good news is that as soon as Igor finishes the review, I have an update to Woodhouse that allows you to fill out a change issue similar to how we fill out an incident; it lets you fill in all the fields and everything in an interactive form. So it does make it a little bit easier, not that much easier, but at least you don't have to manually edit the description and all that jazz, which you have to do now.
B
These are very small, concise changes, and we're testing them on individual clusters. We don't know what's going to happen, but we are focusing our efforts on specific things, we're focusing them on specific clusters, and, thanks to Andrew's wonderful work, we have amazing metrics to peel through to determine relatively quickly whether or not we're doing bad things.
D
Well, I would say this fine-tuning issue is sort of becoming a change issue, right? It has links to monitoring, it has a timeline. Why don't we just promote it to a change issue, just so we can at least reference it? We could do that. I don't feel super strongly about it; if you think it's better just to keep doing what we've been doing, that's fine with me.
D
Yeah, we could just have one change issue for the changes that we're making, but that's basically the same issue that we have now. The main thing, the one place where I think we could have done better, was on the change to increase the target average utilization, which caused the apdex drop. I don't think there was a very good handover; the on-call probably wasn't aware of that change, and, I don't know, maybe a change issue would have helped. Maybe not, you know, it's hard to say.
C
Wasn't it the case on our VM fleet that we also had very high thread contention all the time, when I look back 30 days?
D
I would argue the API VMs were saturated; we were way under-provisioned, and we realized after we did the Kubernetes migration how bad things were, right? I think maybe thread contention is a really good signal, and maybe we should have been paying more attention to it, because apdex is really low for, you know, our thresholds.
E
Yeah, and like I said to you on Monday, I think, there's another one of those that's firing constantly for the Sidekiq urgent-other fleet, and I'm kind of ignoring it at the moment. There's an issue, and we know what the problem is, but it's scheduled for something like two releases into the future, and maybe that's something we should go and take a look at; it might be something we need to action sooner than that.
D
Yeah, I think it wouldn't hurt to at-mention the SRE on call for these changes, just so that they're aware, especially if we do another change at the end of the week, because I think it was kind of bad last weekend: the on-call was getting these canary apdex alerts, and I don't think it was hard to narrow it down to the change that was made on Thursday.
D
Okay, well, I don't know. Maybe we should just keep on doing what we're doing, but try to let the on-call know as much as possible what changes we're making.
D
I think the SRE on-call Slack alias is probably the best thing we can do. Okay.
B
Cool. But Amy, going all the way back to your original question of whether we think we're done: I've got one cleanup issue left in this epic. When we first created a lot of our infrastructure and configurations, we kind of over-optimized, because there was an assumption that we would get rid of NGINX. Because of that, we've got a few IP addresses that are reserved in all of our clusters.
B
That's sucking down a few dollars per month, and our Kubernetes objects aren't even using those IP addresses, so we've got a little bit of cleanup that we can perform inside our configurations to make sure that we are more concise about what we have configured in our environments. I would love to knock that out before we close this epic.
B
That makes sense, and hopefully it's one of those quick wins. I would want to test this in pre to make sure that we don't accidentally lose the API from NGINX.
A
Amazing, great work, exciting stuff. Awesome. Andrew, was there anything you wanted to demo? I mean, no pressure, I know you're busy on loads of other stuff.
E
Yeah, I don't have a lot that I can really add this week, unless you're interested in knowing about Postgres saturation metrics, which I can give you lots of insight into now. But I don't know if that's the right audience. I'm really sorry about that.
E
I am hoping to have this all done by the end of the week, because I just want to get it off my plate; it kind of came up unexpectedly and I'd like to get it off as quickly as I can. So hopefully, next week... you know, I've put on that daily stand-up that I'll have it done by the end of the week, so hopefully I can clear it off and be done. Yeah, thank you.
A
Okay, cool. What I was going to say is that I might check in on your status before I go on. What I've been trying to work on this afternoon, and it's certainly not finished yet, so I'll keep pulling it together, is to capture the ideas that everyone had and work out how we can rank these things and what's really interesting.
A
So this is kind of the combined view. The numbers should really be weighted differently, but I've just added them all up. There are some things we might want to prioritize for the web migration for now and look at some things later, but what I did want to highlight is that all the observability stuff is ranking really highly.
A
So I think that's a really good sign, and I think we should, you know, make sure that we are working with you, Andrew, so that we can actually get all of this stuff done. It's going to be super valuable for the web migration, and it might help us move through some of this stuff.
A
Is it helpful seeing all of our pain points laid out in this way? Or are there other things that you'd like to add to this, to this approach, I mean?
A
Great, great. And I think, from the discussion this morning, we were kind of saying that it was interesting how a lot of this stuff actually circles around maybe three or four bigger problems, which is good to know, so we can work out how we actually fit those in over the coming months.
A
Awesome. So yeah, I'll pull all that stuff out, and then we can actually take a look through it and work it out. We still don't have issues for everything; I think that's totally fine, I don't want to spend too much time doing admin. So I'll pull that stuff together and share an issue out for the next round of reviews.
C
Yeah, I just added this right at the end, but I did a very cheap API CPU request calculator in a spreadsheet, to see how much we could fit on a node, and I can just show it quickly; it's this one, yeah. For it I just looked at one of the API nodes, at what kinds of containers we have running there.
C
So maybe I'm off here, because it looks different on other nodes, but I think it's more or less the same. These are the non-API-related containers running on each of the nodes, and this is the total allocatable node capacity we have. If we then take it that a webservice pod asks for 4,600 in CPU requests, we come up with three pods per node that we can fit in there, and it's very hard to fit more; we would need to reduce the CPU requests on the webservice pod drastically, or find room somewhere else, to fit one more pod on a node.
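
A rough reconstruction of that spreadsheet arithmetic; only the 4,600 CPU-request figure for the webservice pod comes from the discussion, while the allocatable capacity and the non-API (DaemonSet) overhead are placeholders to show the shape of the calculation.

```python
# All figures in CPU millicores. Only the 4,600m webservice request is from the
# meeting; the allocatable capacity and non-API overhead are placeholders.
ALLOCATABLE = 15_890          # e.g. a 16-vCPU GKE node after system reservations
NON_API_REQUESTS = 1_200      # sum of requests for the non-API containers on the node
WEBSERVICE_REQUEST = 4_600    # CPU request of one webservice pod

usable = ALLOCATABLE - NON_API_REQUESTS
pods_per_node = usable // WEBSERVICE_REQUEST
leftover = usable - pods_per_node * WEBSERVICE_REQUEST

print(f"{pods_per_node} webservice pods per node, {leftover}m to spare")
# -> 3 pods per node with these numbers; a fourth only fits if the request drops a lot.
```
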
C
Yeah, we can just adjust the numbers here and see how it fits, and look into what other nodes look like. It was a very quick look at one node only, but it looks tight however you adjust the numbers.
E
A thousand requests' worth of mining, Henry. A thousand requests' worth of Bitcoin mining, yeah.
A
Thanks for sharing that, Henry, super. Does anyone else have anything they want to go through today? No? Okay. Well, thank you for the demos and discussions, and great work on the continued tuning. Looking forward to us getting the API service over the line, so nice work. All right, enjoy the rest of your Wednesday; speak soon.