From YouTube: 2022-04-06 GitLab.com k8s migration EMEA/AMER
A: Good morning, everyone, and welcome to the April 6th Kubernetes demo meeting. No demos on the agenda today, but we've got two discussion items. Bob, you've got the first one, related to NAT saturation. Do you want to kick us off?
B: Yeah, so that's something that popped up this week in our saturation metrics. We count how many ports we have available in total on a NAT gateway, and that's calculated from the number of IP addresses times sixty-something thousand, to know how many ports we have available. And we're running out.
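A minimal sketch of that capacity arithmetic, assuming a hypothetical gateway with four external IPs and a per-VM minimum port allocation; the 64,512 figure matches Cloud NAT's usable port range of 1024 to 65535 per address, while the other numbers are invented:

```python
# Hypothetical Cloud NAT capacity math; IP count and per-VM allocation are
# assumptions for illustration, not our actual configuration.
NAT_IPS = 4                # external IPs attached to the gateway (assumed)
PORTS_PER_IP = 64_512      # usable ports per IP: 65,536 minus the first 1,024
MIN_PORTS_PER_VM = 1_024   # assumed per-VM minimum port allocation

total_ports = NAT_IPS * PORTS_PER_IP
print(f"{total_ports} ports total -> room for {total_ports // MIN_PORTS_PER_VM} "
      f"VMs at {MIN_PORTS_PER_VM} ports each")
```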
B: The projected date is somewhere in June at 80% confidence, but there are some outliers that already reach above 90%. So I don't know how urgently we need to look into it, because I also don't really know the side effects, what actually happens when we're saturated.
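The June projection is presumably a linear extrapolation of the utilization trend, in the spirit of Prometheus's predict_linear; a rough sketch with made-up samples:

```python
# Linear forecast of port utilization; the samples below are synthetic.
import numpy as np

days = np.array([0, 7, 14, 21, 28])              # observation times (days)
used = np.array([0.55, 0.60, 0.66, 0.71, 0.77])  # fraction of ports in use

slope, intercept = np.polyfit(days, used, 1)     # fit used ~ slope*t + intercept
days_to_full = (1.0 - intercept) / slope
print(f"~{days_to_full:.0f} days from t=0 until 100% utilization at this trend")
```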
A: This was an issue after we completed one of our service migrations into Kubernetes, where we increased the number of both pods and nodes sitting behind our NAT gateway. The issues were difficult to troubleshoot at first because we didn't have the appropriate monitoring in place. We've since improved that, so in the future, if we're approaching saturation on that device, we should have the ability to alert the EOC: hey, something is about to go wrong.
A: I did drop a comment that highlights what starts to happen if we don't address this. Auto-deploy becomes blocked, because every single node needs to pull a new image, and since they're all reaching out, they're all using a port to get out and ask for the Docker container they need. And that has a compounding effect: if those are blocked, Sidekiq, which has webhooks or needs to reach out externally for any reason, starts to get blocked, and Sidekiq is smart enough to retry.
C: I mean, this topic has been discussed for a long time, right. Initially Craig Furman was working on Cloud NAT, and also on strategies for how to deal with this, and we even got an extended IP block from Google just for us to assign. Normally they don't give out a block of IP addresses for a single customer, but we got it from them.
C: Just for that reason, to be able to expand our Cloud NAT, I think, and also to give contiguous IP addresses to our customers for allowlisting, so they can say: okay, GitLab traffic is coming from these known IP addresses. And I think one of the biggest problems here is that for each connection going to the same destination, the same (destination IP, destination port) tuple, we need to open a new port on the Cloud NAT gateway to be able to distinguish those connections.
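In other words, a NAT mapping must be unique per (NAT IP, NAT port, destination IP, destination port), so concurrent flows to one destination each burn a source port, while flows to different destinations can share one. A toy model of that allocation pressure (a sketch, not how Cloud NAT is actually implemented):

```python
# Toy source-port accounting: concurrent flows to the same (dst_ip, dst_port)
# each need their own NAT source port; distinct destinations can reuse ports.
def ports_per_destination(flows):
    usage = {}
    for dst in flows:
        usage[dst] = usage.get(dst, 0) + 1
    return usage

same_dst = [("203.0.113.10", 443)] * 500                         # 500 pods, one registry
spread   = [(f"198.51.100.{i % 250}", 443) for i in range(500)]  # 500 pods, 250 hosts

print(max(ports_per_destination(same_dst).values()))  # 500 ports on one destination
print(max(ports_per_destination(spread).values()))    # only 2 per destination
```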
C: And the big issue is if a lot of, I don't know, runners for instance, a lot of our pods try to reach the same IP and port, like trying to pull an image from our Docker registry.
C: Then we need to have a lot of different ports open on our Cloud NAT gateway, and this is what's really saturating things. So one option is to add more IP addresses to the Cloud NAT gateway to get more capacity, and the other thing we could work on is preventing a lot of pods from trying to reach exactly the same address and port.
C: At the same time, I think the reservation for a NAT port is something like two minutes; it stays reserved for the connection, and I don't know exactly when it's released again. But if we were able to somehow split the traffic that's currently going to just one single destination address and port...
C: dev.registry.gitlab.org, I think, right, to pull the image. We have a lot of connections from a lot of pods open to the same destination IP and port, and that leads the Cloud NAT gateway to use a lot of the ports we have available just to distinguish between the different pods that are all connected to the same address. If they reached out to different addresses, they could share the same source port on the Cloud NAT gateway.
D: This is only for traffic reaching out of the cluster, correct? Right, so it really depends on what the problem is, but even if it's just the registry for downloading images, having a proxy or a mirror inside the cluster would help here, because then you have one connection going outside and everything else is connected to the internal registry.
A: And hypothetically, we could add a step to our deployment procedure that says: hey, now that this image is built, push it to that registry, and then we're never making an outbound connection. We're just saying: hey, talk to this other registry that's local to you. And theoretically our images will download faster, so, hypothetically, deploys will be significantly faster as well.
B: Yes, but consider the time and the resources to do that. So if we need a short-term fix, you mentioned adding some IPs, and then we can probably engage Igor to help with that. But for the long-term solution, and working out the plan for it, I think Delivery would be better placed to handle that. And do you think, because you mentioned adding different gateways per cluster, is that something that's possible, if we communicate the IPs outward to customers so they can allow us? Yeah.
B: And then it would be good to link here what we're going to do in the long term.
A: So, speaking of long term, let's talk about rebuilding clusters. There are many reasons why we want to rebuild clusters. Some of it is due to limitations that we're going to run into; one of those is the NAT saturation, for example. Another one is that the IP address space for our zonal clusters was not set up in the most optimal fashion.
A: So eventually, if we migrate any more workloads, we run the risk of running out of IP addresses for nodes. We'll try to scale up nodes, but we won't have an IP address for them, so Kubernetes will just say: sorry. And then there are other things: Calico has caused some pain, and GKE is phasing Calico out in favor of something called Dataplane V2, which is not something we can just switch to.
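For a sense of where those node-IP ceilings come from, here is some back-of-the-envelope GKE sizing; the CIDR sizes are invented, not our real layout, but the four reserved addresses per subnet and the /24 of pod IPs per node are GCP/GKE defaults:

```python
# Hypothetical GKE IP sizing: the node ceiling is whichever range runs out first.
node_subnet_prefix  = 24   # /24 node subnet: 256 addresses, 4 reserved by GCP
pod_range_prefix    = 16   # /16 secondary range for pod IPs
per_node_pod_prefix = 24   # GKE carves a /24 of pod IPs per node by default

max_nodes_by_subnet    = 2 ** (32 - node_subnet_prefix) - 4             # 252
max_nodes_by_pod_range = 2 ** (per_node_pod_prefix - pod_range_prefix)  # 256

print(min(max_nodes_by_subnet, max_nodes_by_pod_range))  # autoscaler stops here
```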
A: I think, if we needed to in an emergency situation, we could bring down an existing cluster entirely, and then we could build a new, replacement cluster. And, you know, Igor just highlighted the fact that this is still iffy, because we do have other problems that we need to take into consideration.
A: If we want to bring a cluster down entirely, we have to make sure that our other zones have the capacity to take the traffic load from it. The other thing we need to be cognizant of is that we're going to drive up our networking bill temporarily while that one cluster is down, because we're going to have a lot of traffic coming through our front door, HAProxy, and that traffic is going to cross zonal boundaries as it reaches out to the other clusters in our environment.
A: And thirdly, HAProxy is not very well configured, so when you drain a single cluster you end up sending a lot of traffic to Canary, which is not very well tuned. So from a technical standpoint, it would probably be wise to figure out what to do with Canary and HAProxy before we start bringing clusters down.
A: That in itself is kind of a problem, because now CI doesn't have any way to connect to it, and doesn't have any way to say: oh, this environment is for this cluster, even though it's the exact same as another cluster. And that's kind of problematic.
A: Yeah, and we could handle that in two ways: we could build it into our GKE module if we wanted to, or we could assign it as something we pass in when we build the cluster initially, when we consume the module. So we have options for that. It's not a huge blocker; it's just something that we need to be aware of.
A: So if anyone has ideas on networking, I'm all ears; I'll gladly listen to any ideas anyone has, because I do not currently know what to do with this.
A: And then the next item, which I think is going to be the most important, is what to do with all of our static IP addresses. For many of our services, we ask for a static IP that is stored within Terraform, and then we ask for that IP address to get applied via our Kubernetes configurations, and that's scattered across the three k8s-workloads repositories in some way, shape, or form.
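The Kubernetes half of that pattern looks roughly like the sketch below; the names, namespace, and address are hypothetical stand-ins for what Terraform actually reserves:

```python
# Sketch: a LoadBalancer Service pinned to a pre-reserved static address.
# All names and the IP are hypothetical.
from kubernetes import client

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="gitlab-webservice", namespace="gitlab"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        load_balancer_ip="203.0.113.20",  # the address reserved in Terraform
        selector={"app": "webservice"},
        ports=[client.V1ServicePort(port=443, target_port=8181)],
    ),
)
```

Rebuilding a cluster means re-applying every one of these pins, which is why their being scattered across repositories matters.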
A: I imagine if we upgraded the proxy, that would probably solve that situation, because I would imagine this is a long-standing bug that has been fixed by now. Does anyone else have any concerns or thoughts?
B: Regarding running out of IP addresses: have we validated the saturation metric that we recently added for that?
A: So I don't think it's an immediate concern, because the number of nodes that we run doesn't change as sporadically as, say, the NAT device does; it's not as spiky, I guess I could say. But the concern is there, and eventually we will need to rebuild clusters, so it is something that I do want to make sure we keep in mind when we think about this project.
C: I don't know if this is maybe out of scope, because we already have enough issues to tackle here. But what I would really like to see is that we mix our different workloads on the same cluster, just for better resource utilization. Because what we're doing right now is, for each of our workloads we have one dedicated node pool; almost its own cluster, I'd say.
C: So it would be cool to have just a generic number of generically named node pools and generically named clusters, and then just try to assign where we run our workloads in a way so that they're mixed. But this would be a really big change to how we run our clusters and workloads right now, so this is really another big challenge, I think.
D: This is important, Henry, I really... So, we had this conversation in the past, and it may become something that we act upon in the future, which is rethinking the product deployment based on feature category instead of service, so that you run CI everything, Package everything, and so on, and each one has its own deployment.
D: So if you think about it this way, the workload would be substantially different from what we're doing today, because then you have metrics in terms of feature categories, and you no longer have front-end versus API. Maybe Sidekiq can still be different, but that's just a detail. So in that direction, having beefier node pools and clusters that are easy to rebuild and are kind of general purpose is easier, because then, by namespacing, you can just mix the workloads together.
E: I guess the main concern that comes to my mind is isolation. Cgroups aren't perfect, and I especially don't know if we're specifying hard limits on all of them currently; I suspect that we don't, and that means we become a lot more susceptible to crosstalk.
A: I think this is a very large project, because there are a lot of facets to it, and that's one of them: trying to figure out how best to create our node pools. Similar to Sidekiq and the routing mechanism we created to spread that workload as much as possible, we're effectively doing the same thing at the Kubernetes level. And the one thing I'm kind of concerned about is that our metrics are very tied to the assumption that all the pods we care about run on one dedicated node pool.
A: So, for example, the API pods run on an API node pool. That gives us an easy way to look at our node-level metrics. But if we were to start intermingling our workloads between node pools, that becomes complicated, and I don't think we have a solution for that; I think we need to figure out a way to do that.
A: I'd completely enjoy this aspect of the project, because I think it would help us out in various ways. I do think we run too many nodes; looking at a few nodes, you can tell they're quite underutilized as is, and if we could pack them down a little bit further, that helps us through various different cost mechanisms: not only the cost of having an extra node running, but also the logs associated with it, the metrics associated with it, and the networking cost of the actual IP address that node is using.
E
In
terms
of
metrics,
I
guess
it
it
means
we
need
to
rely
more
heavily
on,
like
the
c
advisor
container
level,
metrics,
which
I
don't
know,
what
exactly
we're
using
on
the
kubernetes
side
right
now,
but
basically
looking
looking
more
at
deployments
and
containers
and
looking
less
at
node
level
metrics.
E: The other thing that comes to my mind is priority levels at a process level, if we want to increase utilization.
E: In order to still survive bursts, where we kind of need that burst capacity, we'd want to be able to sacrifice some lower-priority workload and say: oh, this Sidekiq thing is not as important as this web pod, so it's not going to get CPU for the next five seconds while we spin up these pods to serve this bursty workload, right. So I don't know what the built-in Kubernetes story is for that kind of stuff, but I think that's what's needed if we want to increase our utilization without destroying our SLOs.
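The built-in story being asked about here is mostly PriorityClass plus preemption; a minimal sketch of a low class for background work, with hypothetical names and values:

```python
# Sketch of a Kubernetes PriorityClass for best-effort background work;
# the name and value are invented for illustration.
from kubernetes import client

low_priority = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="batch-low"),
    value=1000,  # lower than whatever class the web workload uses
    preemption_policy="PreemptLowerPriority",
    description="Best-effort background work; sacrificed first under pressure",
)
# Pods opt in via spec.priority_class_name = "batch-low".
```

Strictly speaking, PriorityClass governs scheduling, preemption, and eviction order rather than momentary CPU shares; the "no CPU for five seconds" behavior would come from requests and limits.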
C: Yeah, but if we pack tighter, I think we could still easily define our requests in a way that leaves room, right. But right now, even tuning our requests as well as we can, we still leave, I don't know, one third of our nodes unused, because we can't pack things together better: either memory is underutilized or CPU is underutilized.
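That stranding effect is easy to see with invented numbers: requests that exhaust one dimension leave the other idle.

```python
# Toy bin-packing example of one dimension stranding the other (numbers invented).
NODE_CPU, NODE_MEM = 16.0, 64.0         # node shape: cores, GiB

pods = [(2.0, 2.0)] * 8                 # CPU-heavy pods: 2 cores / 2 GiB each
cpu_used = sum(cpu for cpu, _ in pods)  # 16 cores -> CPU is full
mem_used = sum(mem for _, mem in pods)  # 16 GiB  -> memory only 25% used

print(f"CPU {cpu_used / NODE_CPU:.0%} used, memory {mem_used / NODE_MEM:.0%} used")
# Three quarters of the node's memory is stranded; mixing in memory-heavy pods
# from another workload would reclaim it.
```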
E: So if we go slowly, and kind of decrease, decrease, decrease, increase, increase, increase, and, I guess, plan ahead of time, basically ensuring that we have one third of headroom on each of the other two clusters, then combining that with moving slowly should, I think, be fairly safe. So I'm not seeing any strict blockers; it's more of a mess-around-and-find-out kind of situation.
A: Okay. I think the point that you raise is important, though. I think we could at least predict that ahead of time, because we could look at our max pod counts, we could look at our max node counts and such, and determine whether or not we've got that extra capacity. So we could look at it from a configuration standpoint to see if we could survive this type of exercise.
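That pre-check is just arithmetic once the utilization numbers are in hand; a sketch with made-up figures, assuming three identically sized zonal clusters:

```python
# Can the remaining clusters absorb a drained one? Utilization numbers invented.
clusters = {"zone-b": 0.55, "zone-c": 0.60, "zone-d": 0.58}
drained = "zone-b"

moved = clusters[drained] / (len(clusters) - 1)  # drained load split across the rest
for name, load in clusters.items():
    if name != drained:
        projected = load + moved
        print(f"{name}: {projected:.0%} projected,",
              "OK" if projected < 0.90 else "too hot")
```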
A: So I could work on that next, I guess, and then maybe in parallel try to figure out what else we need to potentially prioritize from this particular issue in regards to configurations and such. And obviously we want to test this in staging first, because my biggest fear is that there's going to be some runbook...
E: We can control the ingress on that, right? So once it's brought up, we can kind of avoid traffic being sent there. Sidekiq is on the regional cluster anyway, right, so I'm guessing that's where we're starting.
A: So how does this sound: potentially, I could craft up an issue for investigating what kind of capacity we have in our other zonal clusters, just to make sure that we're not going to drain ourselves, or, you know, kill ourselves. And then maybe we could spin up a change request that goes through the exercise of taking down a cluster nice and cleanly, and bringing a new cluster online nice and cleanly and bootstrapping it.
A: Because I think step one for us should be: let's make sure bootstrapping a cluster works, to the extent that the application will work on it. So I'm thinking just a straight replacement, no tooling improvements made whatsoever, just making sure we can follow our runbooks, as step one. Because without that, it doesn't matter what improvements we make to our tooling to enable us to have a secondary cluster in whatever zone, because we won't have the ability to bring one online.
A: So that might be a good starting point for this epic, I guess.
A: Because that enables us to test our runbooks to start with. If that works, okay; or if we need to make improvements, we can do that at that step, at that point in time. And then I think maybe we could start looking toward making whatever tooling improvements we need, so that we can run two clusters at a time.
C: Good, especially because one of the important aspects here is being able to route traffic away from a cluster and route it back, with everything still working, right. Because this currently isn't possible, and it's really a bummer: we've likely needed to do something like that before, and it caused some outages. So just being able to do this in an emergency would already be great, and that would also help with being able to create a new cluster in this way.
E: Yeah, and what I like about the rebuild-in-place approach is that it disambiguates general issues in the bootstrapping procedure from issues introduced by having a differing cluster name. So once we've worked through any bootstrapping issues and we're confident in that part, we can then focus on the separate piece without having to investigate that difference every time.
A: Pretty seriously, okay. So I will start to formulate this epic to circle around that initial aspect of rebuilding in place, in some way, shape, or form, as safely as possible. I really don't know how to do that safely off the top of my head, because I know there are just so many moving pieces inside of our infrastructure. It's gonna be, it's gonna be fun.
E: It's the drop-down page. So yeah, just wanted to quickly highlight that I've restructured this Redis-on-Kubernetes epic.
E: Previously, it was a huge pile of random issues, slightly grouped, and I've tried to give it a phase-based structure, so that we can work towards specific goals and have something a bit more tangible, not only to work towards, but also to show, and to measure our progress against. And since there are a few folks on this call who have worked on this project, I wanted to highlight it and see if there's any feedback on that.
C: I think it's not having HPA integrated, for instance, but it's a start, and we will continue on that. And it's living in the gitlab-helmfiles repository for now, from where we can deploy it, and we can consider later if we want to integrate this into our GitLab charts, maybe. So I wanted to just list...
C: It's not in Omnibus and not in our charts, so it's nothing that we deliver with GitLab right now. So if we wanted to put this into our Helm charts, then we would also need to adjust Omnibus, I guess, to deliver a Camo proxy with it. We have it in our documentation.