From YouTube: 2021-06-02 Delivery team weekly APAC/EMEA
A
Cool, so it's just the three of us today; Alessio is on public holiday. Before we dive into the agenda items, I'm just curious: I just saw Jarv put in an API tuning meeting for later on. Graham, is that meeting too late for you? It's in a few hours.
A
I think Jarv is probably trying to be ambitious and get absolutely everyone on. So I was going to say, judging from the stuff we've got on the agenda from Henry, we should go through that together now, and I'll suggest to Jarv that he pushes his stuff into the EMEA k8s demo, which we have later today.
B
I think we are in research mode right now anyway, so there's not much to decide, maybe just what we need to research. I put this on the agenda today especially because I wanted to involve Graham in this discussion more; yesterday he was missing out while we discussed some things, and I wanted to get him into the loop.
A
Yeah, that's great, go dive in. Let's go straight in.
B
Yeah, and I think we should try to be more async on these discussions anyway, right? If we discuss this too often in spontaneous meetings, it's hard to follow. Yes, we need to somehow get this into issues.
C
Yeah, look, I definitely just want to make it clear: I don't have a problem if you guys sync up or whatever; I'm definitely all for that. The only thing is, I wasn't really able to make much progress today, and whereas before we were handing over a bit more, I'm now so far behind on this specific issue that I'm probably not much help at the moment. But yeah, let's spend the time to catch up, sure.
B
Yeah, so just as a summary: what we did last week was increase the HPA CPU target on canary and one of the zonal clusters for the API to 3200 millicores, which is eighty percent of four CPUs, and this led to problems because we saw slight Apdex drops in general.
B
Then later we noticed that we also ran into a lot of Ruby thread contention, which is probably because we...
B
...use more CPUs, so workers need to wait more often and longer to get the Ruby GVL.
B
And also on canary we saw that we often had Apdex drops correlated to canary deploys. But later it turned out that we see these anyway, regardless of turning the CPU target up or down. So it seems we have an issue with canary deployments, which maybe is a little bit related to API tuning. Then we discussed yesterday a little bit of the reasons why we are stuck, and one idea was that maybe we have an issue because the API is very slow at starting up, right?
B
API pods take a while until they can spin up, and we often scale up and down, so one idea was to find a way to not scale as often, like having a longer cooldown phase. That seems problematic because our GKE Kubernetes version doesn't support it right now; there's a beta feature, but that would need chart adjustments to jump to the beta API to exploit something like this.
B
So this is more of a long-term thing, but it could help generally, I think. We need to research how much time our pods spend starting up, and whether they are still slow even after that, maybe because they need to warm up some caches; I don't know, maybe they're generally a little bit slow at the beginning.
B
So this is where we can fine-tune a little bit and come to a better usage of resources and maybe a lower number of pods. But what we can do after that is still a little bit under discussion, and it would be cool if you have some ideas there. Graham, yes?
C
So I agree with everything you said, and it's worth noting, actually, thinking about this more, I'm also thinking about tech debt for this month and observability in particular. I hit a similar kind of thing when I did the console work, the console investigation: from an observability standpoint, we still can't establish this; it's hard for us. Maybe we should create a new issue for this, but I think it might be covered under Andrew's observability epic: creating a timeline of when a deployment starts.
C
We say we want new pods; we run out of resources, so a scaling event has to happen, like new pods need to be scaled up. This is when Kubernetes gets told, the cluster autoscaler determines that it needs to scale new nodes, those new nodes come online, those new pods go onto those nodes, you know, that full story, and then winding it backwards: oh yes, I can scale down now, and it scales those things down. So what is interesting is...
C
I didn't really think about it until this issue came up, but it's entirely possible. So, interestingly, how the autoscaler works is that it looks for pods that can't be scheduled, right? The cluster autoscaler is watching Kubernetes and it's like, "hey, I see a pod that's saying it cannot be scheduled, there's not enough CPU." So there's two parts: there's the pod autoscaler...
C
So
how
many
pods
am
I
going
to
scale
up
or
down
and
then
the
cluster
auto
scaler
which
responds
to
that
right?
So
there's
two
scalars
going
on,
but
it's
interesting
because
the
cluster
auto
scaler
will
just
go.
Oh
okay,
so
you
know:
there's
there's
not
enough
nodes
for
these
workloads.
I
need
to
spin
up
new
nodes
and
that
can
take
time.
C
So I would bookmark that as five minutes for a node to come online if it's happening on demand. And it's interesting what you're saying with deployments, because it makes sense: if everything is tuned fine and we've got just enough to cover, and then a deployment starts spinning up new pods and pushing... and it's not just the deployment spinning up new pods; at deployment time we're probably putting extra pressure on those pods, because other pods are going away, connections have been dropped, things have been shuffled around. We put pressure on that.
C
Therefore the pod autoscaler naturally thinks, "oh wow, these pods are under more pressure, I need to put new pods in." If that takes new nodes, it takes five-plus minutes for a new node to come up, by which time the spike has already come and gone. So by the time we have new nodes ready, we're already over the hump; the pod autoscaler then goes, "oh, I can scale down pods," it scales down pods, and then you're scaling down nodes again. The lag between when you determine you need new nodes and when the nodes come online can be five-plus minutes long, and that's huge. That's a huge amount of time.
B
I'm just showing you the node scaling here; the red line is that scenario, right. You see it constantly during the daytime, when we do deployments and also have more traffic: we're constantly scaling nodes up and down. So that maybe really is exactly what you're saying, and the same should be true for pods. Yeah, let's go on.
B
With the... yeah, this is it: each time they do a canary deployment, you see these Apdex drops.
B
Yeah, but the other issue is how to make better use of resources. This is the second one: how we can use more CPUs without running into this thing where we run out of resources before we've scaled up again. Which is coming to the same conclusion, right: we need to be faster at scaling up, or not do the scaling as often as we do, but then we would need more resources most...
C
...of the time, right. So there's a few different options we have here. I wanted to spend some time on it today but wasn't able to fully investigate, but I do believe in getting the cluster autoscaler to not be as aggressive; using those settings to say, you know, "if I've just scaled recently, please don't scale," like don't do as much work as often, something like a stabilization window, right.
C
So there are options in the configuration of the scaler itself. We may, and I'm not sure, we may have access to some of those configurations through GKE; we can reach out and confirm with Google. As for how you actually do it, I think if it's kubelet configuration, they've got documentation on how to change the kubelet configuration, so we might be able to override it. It sounds like, though, long term, from the quick look I had today, the new autoscaling API...
C
Oh sorry, yeah, the API version, the autoscaling/v2beta2 one I think, which has those settings exposed directly in the manifest, probably has the best chance of getting us something that is easy to configure and manage and workable.
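[For reference: a minimal sketch of what an autoscaling/v2beta2 HorizontalPodAutoscaler with a scale-down stabilization behavior might look like, expressed here as a Python dict mirroring the manifest. The deployment name, replica bounds, 70% CPU target, and 600-second window are illustrative assumptions, not values the team agreed on.]

```python
# Illustrative sketch only: an autoscaling/v2beta2 HPA manifest expressed as a
# Python dict (it could be dumped with yaml.safe_dump and applied manually).
# Names, replica bounds, CPU target and window length are placeholders.
hpa_v2beta2 = {
    "apiVersion": "autoscaling/v2beta2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "webservice"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "webservice",
        },
        "minReplicas": 10,
        "maxReplicas": 100,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                # e.g. 70% of the pod's CPU request, i.e. 2800m of 4 CPUs.
                "target": {"type": "Utilization", "averageUtilization": 70},
            },
        }],
        # The part the older API versions cannot express: a longer cooldown
        # before scaling back down after a spike.
        "behavior": {
            "scaleDown": {"stabilizationWindowSeconds": 600},
        },
    },
}

if __name__ == "__main__":
    import json
    print(json.dumps(hpa_v2beta2, indent=2))
```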
C
The chart work, from the look I had today, is not too bad. It's certainly a little bit of work, but I think how we could do it is: we could make a Helm flag, basically, that says "enable v2 autoscaling," and then in the autoscaling Kubernetes object itself we can do some conditionals. So if that flag is enabled, use v2beta2 and then expose another Helm option. So I do think we can do some chart work where we can expose...
C
We
can
make
those
objects
configurable
based
off
what
you
pass
in,
which
could
be
I
if
we
really
wanted
to
go
down
that
path.
I
definitely
think
that
would
be
the
cleanest
way
because
then
also
in
the
future,
that
api
will
get
promoted
to
like
a
proper
version
and
then
we'll
just
get
the
we
can
take
away
that
conditional,
but
we
certainly
can
do
conditional
objects
in
the
in
the
chart
based
off.
You
know
what
you
tell
it
to
do,
so
I
think
that
might
be
a
way
we
can
get
this.
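[A rough sketch of that conditional idea, written in Python rather than Helm templating just to show the shape of the logic: a single flag chooses between the current autoscaling/v1 object and a v2beta2 object carrying the behavior block. The flag name and all values are hypothetical.]

```python
# Hypothetical sketch of the proposed chart conditional, as a plain Python
# function instead of a Helm template. Flag name and defaults are made up.
def render_hpa(enable_v2beta2_autoscaling: bool, cpu_target: int = 70) -> dict:
    """Build the HPA manifest, switching API version on a single flag."""
    base = {
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": "webservice"},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": "webservice",
            },
            "minReplicas": 10,
            "maxReplicas": 100,
        },
    }
    if enable_v2beta2_autoscaling:
        base["apiVersion"] = "autoscaling/v2beta2"
        base["spec"]["metrics"] = [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": cpu_target},
            },
        }]
        # Extra knob only available on the newer API version.
        base["spec"]["behavior"] = {"scaleDown": {"stabilizationWindowSeconds": 600}}
    else:
        base["apiVersion"] = "autoscaling/v1"
        base["spec"]["targetCPUUtilizationPercentage"] = cpu_target
    return base
```

Once the API is promoted to a stable version, the conditional (and the flag) could simply be removed, which is the point made above.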
C
The other thing is, there are a few different ways you can trick the autoscaler into keeping nodes ready even if they're not being used. You can set up a dummy deployment that's one pod big, the size of your whole node, but then you set its priority class to the lowest possible, so it'll always get evicted. So it's kind of like...
C
...if you have 10 nodes, then you have this deployment that takes up one node, so it goes by itself onto the 11th node, and then when the autoscaler deploys new pods, it goes, "oh well, I need to evict something, I'll evict this dummy deployment," because you set the priority really low. That way you're always one pod ahead of the autoscaler. It's a hack, it's pretty ugly, but it's a possibility.
C
I
think,
there's
other
things
you
can't
there's
other
tricks
people
have
can
do,
but
I
guess
that's
the
question
I
mean
it
sounds
like
we're.
If
we're
confident
that
having
extra
capacity
always
around
would
help,
we
can
try
and
figure
out
the
best
way.
So
we've
got
two
options.
We
need
to
try
and
do
the
chart
work
to
you
know
to
to
see.
If
that,
like
helps
at
all
a
quick
question,
are
we
seeing
this
in
staging
and
pre
or
just
production?
This
kind
of
this
hit.
B
So in pre it looks pretty stable, and staging the same. We just have...
C
...a flat line, yeah. So it's a shame we can't, you know, reproduce it and test things there. It really seems like we need to, even if we set the cluster autoscaler to not scale down as much, because really, over the day, you kind of want this nice curve, right: as traffic goes up, we scale up, but we stay there for a while, and then when traffic goes down, we scale off.
B
Try to figure out how much we need, right, by setting the HPA CPU target just the right way so that we keep some CPU buffer.
B
Yeah, but that, as you said, requires some chart work and testing, so...
A
The balance is between short term and longer term: we know longer term we want to do more with custom metrics and have more scaling, but assuming that's just a few months away, I'm thinking we'd probably want to do that after the web migration. So what do we need to achieve now to conclude the API tuning?
B
After this meeting I will set a new CPU target, which is not 80 but 70 this time, hoping that it's low enough to not get us into trouble because of traffic spikes, while still showing us that we go down in the usage and number of pods. This is...
B
We'll just do this in canary this time, so I think the impact would be very low and we can easily watch it; I think it's fine to do it. I mean, we didn't have a big Apdex drop; we only alerted for canary anyway, not for the one cluster where we also did it in production. We had lower latencies, but not below the SLO target. Okay.
A
Okay, good, sounds good, because yeah, it'd be good to know. We're clearly not going to get the perfect solution because, like we were talking about, I mean we covered it, I think it was in the APAC demo last week, wasn't it, Graham, where Andrew was talking about the pressure we're putting on Thanos from starting and...
A
...really frequently. There isn't a good solution right now; we'd have to do some stuff with KEDA metrics, but we should just get it so it's good enough for now, and then we can move on, and after the web migration we'll have to review the whole thing.
C
So, yep, I'll just add a quick note. We've talked about two things, right: pod autoscaling, getting the pods right, and then also that lag in cluster autoscaling. So I double-checked the official docs for the cluster autoscaler, and it does point out that if you want to over-provision, this is how you do it. It's basically what I talked about; it's got very clear steps outlined there: you create an over-provisioning priority class...
C
You
create
a
dummy
pause
container
with
that,
with
with
your
number
of
replicas
equal
to
the
number
of
nodes,
you
want
to
keep
warm
and
then
it
walks
that,
through
I
put
the
link
in
the
meeting
notes
there
and
it
walks
through
how
to
set
all
that
up.
We
could
conceivably
put
all
those
resources
that
are
needed
into
the
gitlab
extras
release
and
now
kate's
workloads.
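[As a rough illustration of the over-provisioning recipe described here (low-priority pause pods that the real workload evicts), a sketch of the two objects involved, again as Python dicts mirroring the manifests. Names, the replica count, and the resource sizes are placeholders, not the values we would actually use.]

```python
# Sketch of the cluster-autoscaler over-provisioning pattern described above,
# expressed as Python dicts mirroring the two Kubernetes manifests involved.
# Names, replica count and resource sizes are placeholders.

# 1. A priority class lower than anything else, so these pods are evicted first.
overprovisioning_priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "overprovisioning"},
    "value": -1,                 # below the default priority of 0
    "globalDefault": False,
    "description": "Placeholder pods that keep spare nodes warm.",
}

# 2. A deployment of 'pause' pods sized so each one roughly fills a node;
#    replicas == number of nodes to keep warm.
overprovisioning_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "overprovisioning"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"run": "overprovisioning"}},
        "template": {
            "metadata": {"labels": {"run": "overprovisioning"}},
            "spec": {
                "priorityClassName": "overprovisioning",
                "containers": [{
                    "name": "reserve-resources",
                    "image": "registry.k8s.io/pause",
                    # Request roughly a node's worth of CPU/memory (placeholder).
                    "resources": {"requests": {"cpu": "3500m", "memory": "12Gi"}},
                }],
            },
        },
    },
}
```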
C
So we could add some manifests in an MR today and probably set that up easily. What that would do is let us keep, let's say, three nodes warm. That won't fix every piece of the problem, but it should, in theory, give us enough headroom so that whenever we are scaling pods or doing deploys, at least we've got extra nodes around. If that makes sense: we'd always have at least one to three extra nodes ready to go that we don't have to wait for, and maybe that will help with the Apdex drop at the start of deploys.
A
Henry, would you mind writing this up in the issue, so that we do turn these conversations async?
A
Thanks, great, cool, thanks for bringing that up. I just wanted to take a couple of minutes, so we have some time at the end as well, but I was just curious about release management training: Henry, you've just come out of it; Graham, you'll be going in soon. Any preference from either of you, in terms of, I guess, learning and then also kind of reference? Like, do you have preferences over docs or short videos, or what would be the easiest way to access this sort of information?
C
So I think, for me, what I like best is docs. I think docs are good, but then also access to someone for questions; like, yeah, docs plus a very clear feedback loop. And maybe even, often what I'd like to do is go through the docs and then try to schedule a meeting with someone as a follow-up of, okay...
C
I don't mind videos, but to me, as a learner, I learn the same from videos or documentation; both are non-interactive to me. It's that interactive element when I've got questions. I don't mind asking questions even asynchronously, but often, you know, it's great to hear someone explain it straight away, and you can...
B
Yeah, same here basically, but I prefer reading docs over watching videos, because that just works a little bit better for me. But it depends; there are some videos where you have a demo of something, which makes it really clear how something works.
B
A demo, yes; but if it's just talking about something which you could also read, then I would prefer reading. Okay, and also, as I mentioned yesterday, the documentation that we have for release management is very good, but also very detailed, and there's a lot of documentation.
B
It's
problematic
at
the
beginning,
because
for
most
of
those
different
docs
that
we
have,
you
need
a
certain
amount
of
contacts
already
to
fully
understand
them.
Yeah
and
it's
hard
to
understand
those
pieces
without
having
the
context
already
and
that's
why
I
came
up
with
this
drawing
this
picture
because
without
having
this
mental
model,
already
it's
hard
to
follow
the
documentation
in
some.
A
Cool, okay, great. And then, are there any particularly difficult pieces or particularly unknown parts? It's also fine if you just say "the whole thing," but are there any specific pieces that are particularly, I guess, hard to wrap your head around?
B
If,
even
if
you
talk
about
this,
it's
super,
that's
exactly
the
point
of
the
big
picture,
because
then
you
can
at
least
see
I'm
coming
from
here
and
going
to
there
with
this
pipeline,
triggering
something
without
seeing
this,
I
I
wasn't
able
to
follow
it,
then,
when
it.
C
This could just be me being completely new to all of this, but I find sometimes, and possibly this has just happened as everything's been developed, things have been renamed or different terms have been used, but sometimes, and maybe it's because these things actually are different, I think I see different terms used for the same thing. A classic example...
C
...would be, just looking in the upcoming-release channel right now: I see GitLab ChatOps says "new coordinated pipeline" blah blah blah, and I'm like, what's a coordinated pipeline, is that an auto-deploy pipeline? So to me, even just a small thing like saying "new auto-deploy pipeline" matters. If we're going to call something auto-deploy, which is great, we've got a very clear name for it, then using that name consistently everywhere is, I think, really important.
A
Cool, okay, great stuff, thanks for that. So we'll get there; we've got a lot of work to do, I think, to make these easy to use. So please, if you have questions or you see opportunities for us to make them clearer, please ask or point them out, or, you know, update the docs; we'll also try and do the same as we bring people online and train them up. Awesome. So we've just got a few minutes left at the end. So Henry... oh, hang on, stop recording.