From YouTube: Infrastructure sync for Code Suggestions accelerated GA
A: For code suggestions, I'll start. I have the first item: I want to take a look at the current state of the infrastructure, what's running in the cluster, and make sure there aren't any problems. The first thing we see is this "can't scale up nodes" error. I think this is for the GPU instances, so I don't think it's anything to worry about.
A: I wanted to check out the model gateway to see where we're at. I don't think we ever had a problem with the model gateway, and we increased the number of replicas, so I don't think there are any problems here. I think once we have requests and limits set for this service, it'll probably scale down, because right now we statically set the number of replicas for the service. But I don't see any problem.
C: Yeah, I guess on that as well. The only part is, we were looking at and analyzing Triton, and we definitely need to optimize that. The only way, I think (I've got to put it in the corrective actions) is dynamic batching. That's something we would probably look into next week, and it would also help with the load on the Triton side. That would be nice. Okay.
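The dynamic batching mentioned here is a built-in Triton Inference Server feature, configured per model. A minimal sketch of the relevant `config.pbtxt` fragment (batch sizes and queue delay are placeholder values, not measured ones):

```protobuf
# config.pbtxt (fragment): enables Triton's dynamic batcher
dynamic_batching {
  # batch sizes the scheduler should prefer to build (placeholders)
  preferred_batch_size: [ 4, 8 ]
  # how long a request may wait in the queue to be batched
  max_queue_delay_microseconds: 100
}
```

Tuning `max_queue_delay_microseconds` trades a little per-request latency for larger batches and better GPU utilization.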
A: Super. I also have number two: I wanted to go over what the high-priority reliability or infrastructure tasks are for the next two weeks. Really, these are probably high priority for this week. The first one is to get the K8s manifests into CI. I'm going to try to set a goal to have this done in staging by the end of today.
A: I think Devin is able to help now, so that will help. I'm also doing less disaster recovery today and more of this, so I can probably help as well. As far as temporarily interrupting the service: do we think that's a big issue? If it's just for a few seconds, I assume it isn't? As in, it's beta, and we expect blips of downtime. Do you agree with that?
A: I'm hoping that we're talking about no more than seconds of service interruption, but I'll impress on Devin that if he does do it today (I'm not sure if it's going to happen) he should make an announcement and let people know what's going on. We'll probably just do staging today, and then we can talk about production tomorrow.
C: Right. And with users in production, if it's just not working for one second, we don't really notice, yeah.
A: And then infrastructure as code: I'm going to start by importing the project into our Terraform pipeline. That will allow us to start selectively importing the clusters and other resources in that project into CI. So this is separate from moving the K8s manifests into CI, yeah. I'll do this as well.
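The import step described here could look like the following sketch, assuming a GKE cluster managed with the `google` Terraform provider (the resource address, project, and cluster name are hypothetical):

```sh
# Bring an existing cluster under Terraform management without recreating it
terraform import google_container_cluster.ai_assist \
  projects/example-project/locations/us-central1/clusters/ai-assist

# Then confirm the plan is a no-op before wiring it into CI
terraform plan
```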
A: The third item is requests and limits for the model gateway. Andres, did you have something for that?
A: We've adopted a couple of different conventions. One is that we just set the request the same as the limit, so that you basically have a generous request value and a limit that's very close to it. But this really depends on the workload, whether its usage is spiky or not. I don't know if this one is spiky; I don't think it is. For the model gateway, I assume it's fairly stable. But what have you seen so far?
B: Slowly, but I'm trying to wrap it up this week.
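The request-equals-limit convention described above maps to a pod spec along these lines (a minimal sketch; the values are placeholders, not the model gateway's actual sizing):

```yaml
# Setting requests equal to limits gives the pod the Guaranteed QoS class,
# which suits a stable, non-spiky workload.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 2Gi
```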
A: Okay, and then the last item is resolving the Prometheus recording rules. Bob, since you're here, do you have any status update on this? I know you've been following it a little bit.
D: There are two ways around that. One is doing all the recording in Thanos, which now gets all the metrics through Thanos Receive. That's something I'm trying out right now, because it ties into other work I'm doing. We had a blocker there, and I don't know yet how big it is; I intend to look into that today. The alternative approach we can take is to deploy a select set of rules to this Prometheus server, and that's probably the more boring solution.
D: We need to have the manifests and so on in CI first, I think, so as soon as that's done, I think we should explore both options in parallel. The other thing that needs to be addressed is the labeling of the Kubernetes-type metrics: they need the type label and so on applied. I think Nick added a comment with some details to one of those issues. I haven't looked into that yet, but that will make our saturation metrics show up on the dashboards and get the service into capacity planning.
D: Yeah, label_type, so then we can create the recording rules. That does it, yeah.
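A recording rule carrying that type label might look like the following sketch (the group, rule, metric, and job names are assumptions, not the actual rules):

```yaml
groups:
  - name: code_suggestions_slis
    rules:
      - record: sli:request_rate:rate5m
        expr: sum(rate(http_requests_total{job="model-gateway"}[5m]))
        labels:
          type: model-gateway   # the label the saturation dashboards key on
```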
D: I'm focusing on the SLIs. I hope to get, like, a call on the Thanos thing and…
A: Okay, sounds good. Monterey, you have number three?
C: Sorry, yeah. So we do have a request. Based on how we changed the settings from having them globally on to default, and with the migration and everything, we have a whole lot of users who had enabled it and now need to re-enable it. So we want to know, I think it's: can we pull all the user IDs from the last 30 days for whoever made authentication requests for code suggestions?
A: Possible. We would have to pull the logs out of object storage into BigQuery and then do a query that way. I can maybe look into this. Do we have the query that we need to run in order to get the user IDs, like, looking at Elasticsearch? Do we know? Can we get the last seven days already, so we can take that and look back 30 days?
C: Yeah, I can send that to you as well, or I can ask John to look into that. Yeah, I'll post this on our code suggestions Slack channel and then I'll tag you in there.
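Once the logs are in BigQuery, the 30-day pull could be a query along these lines (the dataset, table, and field names are assumptions, not the real schema):

```sql
-- Distinct users who made authentication requests for code suggestions
-- in the last 30 days (all identifiers here are placeholders).
SELECT DISTINCT jsonPayload.user_id
FROM `example-project.logs.code_suggestions_auth`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);
```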
C: I believe it is as urgent as possible, based on the fact that there are users who've been disabled. Well, they don't know they're disabled, and we're not even sending an announcement to them; we're just doing it on the back end.
C: Are we going to send them… I believe, I mean, I think this is just a guess, that's just me, but yeah, we are. I believe we took the last seven days and re-enabled through the Rails console. Oh, I don't think that's the right way to do it, but I'm…
A: Before we do this, because it's going to take a little bit of work, could you just point me to the Slack thread or issue or something where this decision was made? Because it doesn't feel right to me. I feel like if people want to use this, they should just enable it themselves.
C: I am also, I don't know how to put it, for lack of a better word, very shocked by all of these decisions happening while I'm sleeping.
A: Okay, if you could point me to where…
A: I'll dig into it a bit more. As far as the timeline on us receiving more traffic: we have the Web IDE coming up soon, right? And that's going to happen this week?
C: As of this morning, we are incrementally rolling out. I'll have to check where we are, whether we're at a hundred percent.
A: I'm trying to think: okay, if we have X number of users, maybe a very small percentage of them will turn on the setting. Because even after this is enabled, they still have to turn on the setting to enable the code suggestions integration, right?
A: Okay, so now, when they use the Web IDE, code suggestions will be enabled by default. In other words, we'll have X number of users using the Web IDE, they're all typing code, and they're all going to be sending prompts to code suggestions. It's going to increase load quite a bit if there are a lot of people using the Web IDE, right?
A: Okay, sounds good. I think that'll help a lot if we can enable it incrementally.