From YouTube: 2020-11-12 GitLab.com k8s migration EMEA
A
Techno party, it's... yeah.
D
Going through the highlights. First issue, we already have it on the agenda, so I'm going to skip that one. Unified structured logging. This is... oh, okay.
D
No problem. I think we decided, skarbek, that the unified logging is not a blocker anymore for git ssh. Do you agree with that?
D
And
possibly
not
a
blocker
for
the
front
end
either.
I
think
the
the
main
thing
we
would
like
to
do
here
is
to
split
out
some
of
the
different
log
files
into
different
indexes,
for
example,
right
now
for
virtual
machines,
we
have
puma
split
out
from
rails
logs
right
now,
they're
all
mixed
together,
but
I'm
not
sure
if
this
is
really
a
blocker.
I
think
I'm
gonna
take
it
or
marin.
If
you
want
to
since
you're
already
doing
it,
we
can.
I
think
we
can
take
it
off
the
highlights
yeah.
E
We can take it off the highlights, but it might actually be good to just leave it be, because I see jason actually answering that things are scheduled for 13.7.
D
That sounds good. Next is deciding on environment, tier, type, stage, shard, service, etc., deployment: all these labels for kubernetes infrastructure. andrew has been, you know, working on this actively and...
A
What we said on the call yesterday was that on pushes, which are really quite slow on gitlab.com (you know, we're talking about 200 milliseconds here), is that worth us losing the data that we get?
D
Yeah, yeah. I don't know... like, for example, I assume we're seeing the same issue with git https that we're seeing here with git ssh.
A
Yeah, quite possibly, but sort of my point is that it's like... git operations, yeah, people have more tolerance for them, if you want. Yeah.
E
We talked about maybe lowering it to a value that is a percentage of traffic that is not going to make a huge impact on everyone else, but will still provide us enough data. If that is the 33 percent that we have right now, we can leave it; if that is 10, we can go 10. But it's up to you.
D
Yeah, yeah, yeah, I guess it doesn't matter. I guess you're right: in the grand scheme of things, it's not a big impact to people. It just really bothers me that we don't understand why.
A
Well, yeah, I guess so. I actually just took your point as mostly being about not writing out more, but actually, when I read the rest of it, it is true. But, skarbek, it looked like... maybe the problem is because we are not routing to the internal api and we're getting queuing, because we're routing to the git service, and, you know, stan pointed out above that the problem is queuing on the request, right? So that's where we're spending all that time.
A
It's not like the execution of the request is taking less time; it's the queuing before it executes. And so that means that we've got too few sidekiq workers... or not sidekiq, sorry, puma workers, wherever those requests are getting routed to. And we spoke about two things: the first one was that the puma workers, whatever pool that is, obviously need to be bigger. But the other thing is that it might just be better to route that to the internal api.
E
Okay, so I wrote it wrongly then, andrew, so it might be worth correcting it in the issue, because I said...
B
Back... I don't know.
A
It's the back end call from there, jason. So it's from gitlab shell to... to authorize, you know, the internal... yeah.
B
Which is customizable. So maybe, since we're on pause with the migration, jarv and I had discussed spinning up the canary variety of this, because I failed to do that. So what we could do is configure the canary endpoint as necessary, with the customization to the internal api endpoint, and see if we see a performance differentiation between those.
A
Another thing to note is that, if that queuing's happening for the internal authorize requests, then it's probably also happening for the other ones, and it just sounds like the pool of puma workers isn't big enough, yeah. Is that the same size?
A
What you're running is just workhorse, and, you know, you're no longer actually running puma for the git http authentications, right? You're just running... so you almost don't need to deploy puma in that case.
D
Yeah, okay, hot off the press: I just submitted this issue now. This is something I noticed, I think yesterday, you know, or the day before yesterday, which is that we don't currently support zero downtime deployments for the nginx ingress controller. What this means is that, if there's any reason why we have to cycle those pods, it's quite disruptive.
D
The connections are terminated immediately and, interestingly, I just did a quick check with strace: when you delete an nginx controller pod, it gets a SIGKILL, which is a bit surprising, because I thought the default was a SIGTERM, not SIGKILL. But in any case, there are a bunch of blog posts about this behavior, and basically you need to add a pre-stop script to the nginx controller, to let it do a graceful shutdown, and we'll have to also add the, like...
D
You
know
the
termination,
when
the
the
termination
window
for
kubernetes
kind
of
say,
like
wait
longer
for
the
connections
or
wait
wait
longer
before
killing
the
pod
wait
longer
for
the
process
to
to
quit.
So
this
is
pretty
high
priority.
I
think
for
us,
I'm
gonna
say
it's
a
blocker.
D
For
now,
because
we
definitely
don't
want
to
move
the
front
end
before
doing
this,
it's
not
so
great
for
git
https,
though
it's
very
it's
pretty
rare
that
we're
cycling,
nginx
controller
pods,
it
does
happen
occasionally
like
we
did
an
nginx
controller.
It
was
we
took
a
chart
upgrade
today
and
that
cycled
the
controller
pods,
which
you
know
caused
a
little
bit
of
errors,
forget
https.
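The fix D describes is the one those blog posts recommend: a preStop hook so nginx can drain connections, plus a longer termination grace period so kubernetes doesn't kill the process early. A minimal sketch, with illustrative values rather than production settings:

```yaml
# nginx ingress controller pod spec fragment (sketch).
spec:
  # give long-lived connections time to drain before the pod is killed
  terminationGracePeriodSeconds: 300
  containers:
    - name: controller
      lifecycle:
        preStop:
          exec:
            # stop taking new connections, then shut nginx down gracefully
            command: ["/bin/sh", "-c", "sleep 15 && nginx -s quit"]
```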
D
We also took the chart upgrade today which added the pages config, so that was nice. So we have the pages bucket now officially configured, even though we don't have pages turned on, so I think we're going to be ready for this when it lands.
D
Okay, on to the demo. I just want to... I invited jason just to answer any questions that we have about the MR, the split traffic. jason, could you just give us kind of a brief introduction and kind of an overview of where we went with this?
F
Sure. So there's two factors that come into play, but the first one is having a method to actually split traffic at the ingress. To do that, we have to be able to deploy, effectively, multiple deployments that provide us with named fleets, and the way that we're doing that is: out of the box, you have no declaration of deployments, and it will basically create one default out of the normal chart properties.
F
If I look at the kubernetes setup that we have, I've got a small number of pods for api, a small number of pods for internal api, and two for default. And if we look at the ingresses, I have one ingress for slash api and one ingress for slash, but I do not have one that directs any traffic to internal api.
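A sketch of the deployments map jason is describing, paraphrased from the chart's webservice deployments feature; the exact keys should be checked against the chart docs:

```yaml
# values.yaml fragment (sketch): three named webservice fleets.
gitlab:
  webservice:
    deployments:
      default:
        ingress:
          path: /        # catch-all ingress
      api:
        ingress:
          path: /api     # /api traffic gets its own fleet
      internal-api:
        ingress:
          path: ~        # no ingress; reachable only via its Service
```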
D
Looks great, really excited about this. I have a few questions... one... well, actually, I think, skarbek, you started, you had some questions on the agenda, so go ahead first.
F
Effectively, the way this is set up, if you do not supply an actual deployments map, it will just deploy what you already had. The only difference is that you'll now have a deployment and pods with "name dash default", so it's an automatic change for anybody who already has one in play; it's just the full pod names change. And let me... shoot.
B
Now, my second question was being able to name deployments, just to provide some extra customization in case we need some sort of contextual information quickly. But it looks like you're adding the necessary data we need, at least to the pod names; I would imagine we'll see the same on the deployment end as well, probably.
F
One thing that I don't have yet implemented, but I'm going to sneak into this MR: for a long time we've basically been stuck on, specifically, targetAverageValue for the metrics.
F
That entire yaml array section of the metrics definition would be replaceable.
A
But if we... I know, like, this is jumping way ahead, so forgive me, but if we wanted to use prometheus metrics in there, there's still extra steps, right? There's other things that you need before you can actually be driving that off, like, prometheus metrics, right?
F
Right. So basing it off of prometheus metrics, such as queue depth: that's something that actually requires further setup within the cluster. It's not something that's specific to this chart; it's just that you need to have this additional functionality available to your kubernetes environment, and that's the use of custom metrics.
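A sketch of where a replaceable metrics block could go: an HPA driven by a Prometheus-derived metric such as queue depth instead of targetAverageValue on CPU. The metric and object names are hypothetical, and it assumes an adapter (for example prometheus-adapter) is installed in the cluster to serve the custom metrics API:

```yaml
# HPA sketch (autoscaling/v2beta2) scaling on a per-pod custom metric.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-webservice-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-webservice-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: puma_queued_requests   # hypothetical, served by prometheus-adapter
        target:
          type: AverageValue
          averageValue: "5"            # scale up past ~5 queued requests per pod
```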
A
Yeah, and it's going to be a long time until we're doing that, but I think it'll be...
D
Yeah, I have a few. So one is nodeSelector, extra pod labels, and nginx annotations. nodeSelector... was that on the original list or not? I think it might not have been, but I... it may not have.
D
You're awesome, thank you, that's great. The next two kind of tie together. You know, right now we have HAProxy, which is doing this routing; we're turning this around a bit, right, because now we're going to have a single ingress which is going to route traffic, instead of having multiple ingresses. Right now we route by not only path but by header and by cookie. Is this now... a cookie doesn't really matter.
D
This
is
for
canary,
which
we
run
in
a
separate
namespace
anyway,
but
but
what
about
header-
and
this
is
important
for
websocket
traffic.
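Header-based routing is something ingress-nginx supports natively through its canary annotations; a sketch, with illustrative host and service names:

```yaml
# Ingress fragment (sketch): send requests carrying a marker header to canary.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: gitlab-webservice-canary
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "X-GitLab-Canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "true"
spec:
  rules:
    - host: gitlab.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: gitlab-webservice-cny
              servicePort: 8181
```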
D
Okay, I guess the one issue, andrew, that is relevant for scalability work: right now we do rate limiting in HAProxy for only api traffic. If we start routing at the nginx ingress, we're going to lose that rate limiting capability. That's probably okay, right, because we're on the cusp of using application rate limiting, right?
D
No, that's not the problem; we're going to have HAProxy. The problem is that this is replacing the api and web split: it's now happening down at nginx. So what it means is that traffic will first come into HAProxy, and then all traffic will just go... there'll just be one back end, which will be nginx ingress, and then from there it will split between web and api.
D
We
can
still
add
the
header
at
aha
proxy
it'll,
just
be
set
for
all
types
of
requests,
web
api,
etc.
A
So
I
think
it's
okay
craig's,
if
craig
miskel's
doing
the
the
the
work
at
the
moment
on
hi
proxy.
As
I
expect
he
is
then
he's
adding
those
white
lists
to
the
web
and
to
api,
and
to
I
mean
I
don't
know
about
git,
but
certainly
web
and
api,
and
and
at
the
moment
we
don't
have
anything
on
the
web,
which
is
which
is
pretty
weird.
So
he's
adding
he's
adding
that.
D
Yeah, I guess.
A
Yeah, what we don't want... we don't want rate limiting per se, jason. We want an ACL. We want, like, a list of IPs, you know, basically a bunch of CIDRs that say, you know, this...
A
This
network,
this
network,
when
you
see
one
of
these
networks,
add
this
header,
so
it's
not
actually
doing
rate
limiting
it's
actually
just
including
a
header
which
is
a
like
a
bypass
that
will
get
passed
on
to
the
application,
and
it
just
says
to
the
application:
don't
do
any
rate
limiting
this
is
you
know
special
customer
x,
or
this
is
the
gitlab
runner
and
we
don't
allow.
D
For now we're okay, because we're not getting rid of HAProxy, but we're removing the logic, a lot of logic, out of HAProxy, which I think is good for us long term, assuming we keep with the nginx controller, which maybe we will. So I don't think that's a problem, but we will need to make sure that we can route based on header.
D
Request paths: I was just taking a look at our HAProxy config. I think what we have... like, there's nothing...
C
Obviously I haven't prepared a thing, but what I was doing last week was...
A
We've got the saturation framework that we use to measure, like, all different types of utilization and saturation, and we added two new things into that: the container memory, which is basically how close a container is to reaching its limit for memory, and then the same thing for cpu. And what's really nice about these metrics is that they have our existing service and shard and stage labels on them.
A
So
you
know
if
we
deploy
something
into
staging
and
it
goes
awry,
then
you
know
we'll
we'll
be
able
to
see
it
and
and
isolate
it
very
quickly.
Obviously,
for
the
people
on
this
call,
I
think
it'll
be
more
obvious
that
you
know
a
pod.
You
know
at
the
moment
we're
kind
of
going
on
pod
names.
A
You know, "this is a canary git shell... gitlab shell container". And so I've been trying to get it so that, you know... we've already got these existing labels, type, shard and stage, which we use for lots of things, and I've been trying to massage the metrics that we get from kubelets and from cAdvisor and from kube-state-metrics into the taxonomy that we use for our labels.
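A sketch of the kind of massaging being described: a recording rule that divides the cAdvisor working-set metric by the kube-state-metrics limit, then joins on kube_pod_labels to pull type/stage/shard onto the result. It assumes the pods actually carry those kubernetes labels; the rule name and exact metric names would need checking against the Prometheus setup:

```yaml
# Prometheus recording rule (sketch): container memory saturation,
# re-labelled from pod identity onto the service taxonomy.
groups:
  - name: kube-container-saturation
    rules:
      - record: gitlab:container_memory_utilization:ratio
        expr: >
          (
            container_memory_working_set_bytes{container!="", container!="POD"}
            / on (namespace, pod, container)
            kube_pod_container_resource_limits_memory_bytes
          )
          * on (namespace, pod) group_left (label_type, label_stage, label_shard)
          kube_pod_labels
```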
A
You
know
the
way
we
divide
things
up,
so
we've
got
to
get
service,
we've
got
a
web
service,
we've
got
an
api
service
and
we
want
those
metrics
to
to
to
be
the
same
way.
So
if
I
go
just
take
a
look
at
this
and.
A
Yeah, okay, that's... that's the... yeah, I'm actually trying to find... yeah. Well, we can look at it here as well, because it's repeated in... sure, yeah, perfect, yeah, exactly, thank you. So there you can kind of see it's really, really busy at the moment. Probably too busy. One of the things I was thinking of doing is actually not having this level of detail, because it's too much, but what we could possibly do is just put, like, quantiles in here. So we say, like, this is the...
A
You
know
1999
quantile.
This
is
the
50
year
the
median
and
we
can
have
like
a
sort
of
cloud
of
of
where
the
spread
of
these
values
is.
I
don't
know
how
people
feel
about
that
or
if
they
actually
want
to
see
these
individual
parts,
but
it
is
also
worth
pointing
out
like
that
that
bug
in
sidekick,
in
the
background
migrations,
there
was
something
spinning
that
had
been
spinning
for
three
weeks
and
you
know
as
soon
as
we
saw
this
on
the
on
the
sidekick
dashboard.
A
You know, I was drawn to it, and then we discovered the bug, and we could dig it up, where at the moment, you know, no one's really, as far as I know, going into the kubernetes metrics and looking... whereas this kind of drew that out a little bit, I think. So that's kind of the first thing. And then the second thing that I'm doing is, obviously, at the moment: we've got these dashboards that we generate, and we've got lots of useful stuff on here, and for vms...
A
You know, you can just open this up and you can go to node metrics, and you can immediately see, like, a whole bunch of information about the machines that are running this piece of the fleet, right? And what I really want is: I want exactly the same things for kubernetes, without people having to kind of navigate through namespaces and nodes and, you know, all the other stuff. And so I just want it to be, like, the same kind of straightforward sort of thing.
A
We've got the metrics catalog, and I've just got a small amount of configuration in there, which is saying that for the git service we're running two deployments, one called git, one called shell, and it's got one container, called gitlab shell; and then we have a second service called web service, and it's got these two containers that are running inside that. And with that, what you can do is then...
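The metrics catalog itself lives as jsonnet in the runbooks repo, so the following is only a paraphrase of the shape of that configuration, with hypothetical field names:

```yaml
# Hypothetical paraphrase of the metrics-catalog entries described above.
services:
  git:
    kubeDeployments:
      git:   { containers: [gitlab-shell] }
      shell: { containers: [gitlab-shell] }
  web:
    kubeDeployments:
      webservice: { containers: [webservice, gitlab-workhorse] }
```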
A
It'll generate some dashboards, and these are obviously very much a work in progress, and I have some questions to ask of you guys. But this gives you, you know, broken down by... you know, the taxonomy that we use. So you can go here and you can change this to canary, and it will give you the canary nodes, and there's no sort of navigating around with namespaces or anything like that. So that's quite different with canary.
C
Oops.
A
You know, we just have the same dashboards. And this was kind of interesting: I don't know if mailroom's got, like, a leak in it or something like that, but it seems to have these big drop-offs every now and again. Sorry, I didn't show... I was showing the container overview, but we also have this deployment overview, which is just a kind of aggregated view of the deployment.
A
So
you
can
see
here
it's
broken
down
by
the
three
clusters,
and
I
don't
know
if
that's
something
that
we
want
or
if
we
want
to
just
aggregate
all
together
or
you
know,
let
me
know
whatever's
whatever's
best
for
people,
but
you
you
can
kind
of
see
the
registry.
There
was
some
kind
of
you
know
again.
A
You
know
so
we
can.
We
can
build
these
dashboards
with
a
very
small
amount
of
configuration
and
we
kind
of
get
all
the
other
stuff,
so
yeah
registry,
during
that
deploy,
there
was
like
a
massive
drop-off,
so
I'm
guessing
that
the
registry
also
has
leaks
of
some
sort
that
that's
that
we
need
to
take
a
look
at
perhaps
but
anyway.
A
Obviously
this
is
very
basic
at
the
moment,
like
I
really
like
to
have
things
on
here
like
evictions
and
you
know
a
whole
bunch
of
different
failures,
but
all
tied
into
the
way
that
we
think
of
the
application
you
know
with
the
stages
and
the
shards
and
the
and
the
services
that
we've
got
so
I've
kind
of,
I
think,
I've
kind
of
cracked
the
the
way
to
do
this,
and
you
know
we've
just
got
the
deployments
going
as
well.
A
So that's great. So the first question I have is whether this needs to be its own dashboard. Obviously there'll be more things on here, but we can either have this as its own dashboard, or we can have it built into, you know, the main...
A
The only thing that's sort of pushing me towards a separate dashboard is that these dashboards are getting a bit big and a bit slow. So I was thinking we could do some nice linking where, you know, right up at the top it's like "go to the kubernetes..." you know, and have, like, some good navigation between the different dashboards. But I'm keen to get other people's opinions on that.
E
Can I ask just one or two questions, andrew? And completely ignore it if you want. But, like, I'm looking... like, you scroll down on this dashboard? Actually, I don't know about others, but, like, I spend most of the time looking at that and not going into detail there for every one of these.
A
So, yeah, yeah, exactly, because the service level stuff is what you get on this page, and, you know, that's like: is this affecting users? And, yeah, you're quite right, like, that's why we don't have, like, cpu and memory and stuff on these, because it's about, like, user experience rather than, you know, the machine. So that kind of sounds like you want to go with a separate page as well. I...
E
I mean, I'm just seeing how it looks to me; I'm not saying...
A
No, no, I think that's reasonable, I think that's reasonable. The other thing that I was wondering about was how would people feel about having the deployments have the same name for canary and main stage, because it would make my life so much easier. Obviously, they're in different namespaces.
A
But obviously, you know, we've got stuff like this over here... where is it... this, over here. This is actually wrong for canary, so I could figure that out. You know, in canary I think it's gitlab-cny... shell, and, you know, obviously the other one is not gitlab-main, so it's not like gitlab, dash, stage, dash, something; they're different things, and it makes it kind of difficult. And also, if they have the same name, those two deployments, it's easier to do...
A
You know, comparisons between them, you know, with computers. And so you can say, just like: look at all the things between the gitlab-cny namespace and the gitlab namespace, and kind of do like-for-like comparisons if you want, and use that for some sort of health checks. So I don't know if people feel super strongly about that, but it would make my life a whole lot easier if they had the same name.
A
Okay, so are those... I need to take a look at what those are, but have they got hard-coded into them, like, gitlab-cny? Like, are we doing alerting at...
A
Yes... okay, I'll review that, and if I can nudge it, then would you be okay with us kind of moving over to matching names, matching deployment names?
F
What we could do is go through and ensure we have fully functioning name overrides in place, so that you could customize them a little more. But as of right now, it's...
F
There would be a longer discussion, and that would definitely be a breaking change, which would be a major version of the chart. Yeah.
A
Yeah, like, what I would ultimately like... obviously I've got my head in the prometheus metrics for this. What I'd really ultimately like to end up with is that we're not using, like, regular expression matches to kind of select metrics anywhere. It's like, you know, the deployment is this, and, you know, we're not doing, like, "question mark cny" and all sorts of hacky things like that, because, you know, it just kind of helps build more structure around the metrics.
A
We could... so, I mean, that's a very good... we could label... So I actually brought this up with jarv, yeah. We could totally do that. So I asked for it; I asked for the... our type label, which is effectively a badly named service label, on the deployments, and, you know, that would be super helpful already.
A
We could just use our own label and then match on that, but we'd need to add it to the deployments. That might be a much better option.
A
So we could just use our own label for that. Maybe that's a better way to do it, and then we just leave the deployments as they are.
A
Yeah, I mean... okay, so here we've got... see, what I notice is we don't have, like... I would quite like type on here, specifically, you know, as its own label, so...
A
So, you know, it would be super helpful, for me at least, if we could have a... so this... there's what you were talking about. I mean, even that is kind of tricky, because it doesn't quite match. You know, we've got a specific label called stage, and it's main... main and cny, and then we need to map that, with, you know, not very intelligent tools like grafana and grafonnet, into main stages.
A
Presumably label release is gitlab, and then cny is gitlab-cny, so that, you know, it's very difficult to kind of map if we don't use exactly the same labels. And so, yeah, I think, you know, if we could get the type label on here, like a label_type, in the same way that we've got these ones, that would be super helpful as a starting point.
F
We do have a global deployment annotations... let me double check whether we have labels...
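If the chart's global settings do cover labels as well as annotations, attaching the taxonomy could look something like this sketch; the pod labels key is the part being double-checked, so treat it as an assumption:

```yaml
# values.yaml fragment (sketch): surface taxonomy labels on pods so they
# show up in kube_pod_labels. global.deployment.annotations is the knob
# mentioned above; the pod labels key is assumed and needs verifying.
global:
  deployment:
    annotations: {}
  pod:
    labels:
      type: git
      stage: main
      shard: default
```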
F
Right. So if this is indeed a production blocker, throw a production blocker label on it, and that will automatically give it severity 2 or severity 1, based on whether it's needed, like, now or next week. But those are the labels that you're going to want, to be able to get this set all the way to the top.
A
Yeah, like, my take is that when we've had, like, issues with anything that's running in kubernetes, people are like "it's kubernetes", and then they don't have, like, a clear thing that's like: look, everything in kubernetes is fine; this is a problem with the application. And I think, you know, before we carry on rolling it out too much, we need to have, like, all of those things in place, so...
E
Well, yeah: let's slap that label on, then. We would want to do things faster if possible.
D
I added a couple comments here, andrew. One is that I think this kind of belongs maybe in what you have as a service overview dashboard, which is the number of replicas and the number of nodes, because we've had incidents where, like, okay, saturation is increasing, but it was due to a scale-down event, which is something new in kubernetes that we don't have with vms.
D
Yeah, I mean, there isn't... the HPA is pretty dumb right now: it's just looking at average cpu utilization across all the pods, and it's hardcoded to... well, it depends on the service. But yeah, we would have to... but I think for now, I mean, just seeing the number of replicas and the number of nodes would be helpful. The number of nodes, I guess, would be at the service... and you...
D
I
don't
know
like
the
number
of
replicas
I
almost
kind
of
want
to
see
that
on
the
on
like
the
service
dashboard,
that's
that's
fine,
but
the
number
of
nodes
I
mean
that
maybe
is
maybe
on
a
different
dashboard
than
I
don't
know
when
you
were
talking
about
aggregating
the
metrics,
because
it's
so
no
noisy,
I
think
like
having
them
broken
down
by
zone,
is
going
to
be
extremely
helpful.
A
So, zone... so I've been using... the first version that I did was cluster, so cluster kind of... is it the same? It is, yeah, there's a one-to-one mapping between the two. Yeah, cool. Okay, so for each of those clusters... sorry, for each of those zones, we just want to show the median, the aggregate, the average...
D
So yeah, I think that would be a good start. And so you're not thinking there'll be a selector for zone at the top, but rather just the panels will have...
A
You know, the ones that ship with kubernetes, those grafana dashboards, and people can always go to those if they want, like, a seriously deep dive, you know, whatever those dashboards have on them. But, like... so it's never going to kind of... because the one thing about the way that I'm doing it is that, obviously, I have to build these recording rules for every metric that I use, to get the type labels on, and it's fairly expensive. So we don't want to have, like, everything in there.
A
Right
and
so
do
we
want
to
have
yeah.
I
was
going
to
say:
do
we
want
to
have
like
a
like
a
dash,
a
general
like
node
pool
health
dashboard
and
then
kind
of
from
the
service
overview?
You
can
say
like
show
me
like
the
node
pool
for
this
service.
Sorry,
it's
not
many
too
many.
It's
it's
one
service
is
in
a
node
pool,
but
it's
only
ever
in
one
node
pool
right.
A
It's not... that's not strictly true in every case at the moment, is it? Yeah, it's not strictly true in everything, yeah. So that's fine, as long as we can have, you know... so it won't be, like, directly attached; it'll more be, like, you know, "this is the node pool that this service is running in". And what about... will all the components for a service, like what we call git at the moment, which is gitlab shell and webservice...