From YouTube: 2020-09-24 GitLab.com k8s migration APAC
A
Should we start going through blockers?
B
Yeah, we can do that. I actually just went through the list before this meeting myself.
B
Cool, so Graham just joined; let's get started.
B
Going through blockers first: I went through the full issue list, just to make sure it lines up with the highlights and the things that we should be tracking. It looks good. I either closed or unlabeled a couple of issues that are no longer actually blockers.
B
This list is pretty much all we have right now, which is the build logs, mapping services, proxy request buffering, which is sort of related to mapping services since we need to change that per service, and then the cross-AZ network traffic. The focus right now is on the cross-AZ network traffic; mapping services is a problem, but it's not an immediate problem.
B
We're going to do Git SSH after Git HTTPS; it's not a blocker for that, but as soon as we finish Git SSH we should move on to doing web and API. There are no known blockers for web and API other than build logs, but that's in progress.
A
How would you feel about me raising the priority on mapping services?
B
That's cool, yeah. I'm curious what direction we're going to go with that. My current thinking is that it is important for us to split out at least Git HTTPS from the rest of the HTTPS traffic, given how different those workloads are.
B
Yeah, I think the alternative would be that all HTTPS traffic goes to the web service pods, which would mix web and API. Sure, those are very similar workloads, but Git HTTPS is very different.
A
I remember the problems we used to have back in the day before the split. I just don't think it's acceptable at gitlab.com scale to mix those things together; I think we'll encounter so many odd problems just because of it. I'll raise it to P2. I think we need to figure out what we are going to do there, and if the charts are not going to provide us that, we need to find a different way, including forking the chart if necessary.
A
I mean, how much do we need to raise the complexity of what we run to work around a product deficiency? I don't feel comfortable with that, given the complexity of what we are running here. We are running hybrid infrastructure, now with multiple clusters, with layer on top of layer on top of layer. To me it feels a bit overwhelming, and it could be because I don't know much, but I don't know.
B
Okay, well, that's pretty much it for blockers.
A
The proxy request buffering one: you said it really depends on mapping services. Can we tie them together?
B
Yeah, I was going to move it underneath; I think we can move it underneath. And to be honest, it's not 100% clear whether this is even an issue, because we disable it globally and we already have a CDN in front, and we think that might be sufficient, so we could just leave it disabled globally. So I'll move it up under mapping services.
B
Okay, project Pages: are we still looking at six to eight months?
A
I raised that this Monday as a concern, and the update from the Multi-Large working group is that we set ourselves a deadline of being able to stand up individual instances using a tool by July next year. I know, it's...
A
I know, I know, but it will be a forcing function already, even if it's only July, given that the projection the teams made specifically for Pages was that they'll be able to roll everything out and finish in May. That ain't gonna work, right?
A
It won't work, so they'll need to pull in the timeline a bit. I think this week, next week and the week after you'll see more discussions, first about what and how, and then I think the priority will have to change. So I think it'll be pulled in; it's not going to be that long.
A
I hope people are aware that it's too much, or too little rather, for what we need to achieve.
A
A couple of things, Graham. One thing first is that we have been blocked with the migration on gitlab.com, right? Pages is a crucial service, and you know the story on gitlab.com.
A
That's one. Number two is that we have self-managed customers that are refusing to use Omnibus, because they spent a lot of time migrating their services to Kubernetes; they don't want to have a one-off tool in VMs, and Pages is not supported in the Helm charts at all. It can't be, right, because of those weird dependencies. And then number three is that we are working on Multi-Large, which means multiple gitlab.com-like instances, set up with day-two operations, basically. And in order to do that effectively and efficiently, we can't replicate gitlab.com.
A
We can't run hybrid infra easily, right? So the idea is: well, if we untangle Pages and ensure that you can use Kubernetes properly, and ensure that we have tooling around it, that sets you up for success.
A
Yeah, it has the NFS dependency across multiple services, so there was a suggestion for us to introduce NFS into Kubernetes, and, well, Jared knows how I reacted. Jared was much more calm and said, well, these are the technical reasons. I just went out and said this is not something that we should be doing at all, or thinking about, not even entertaining the thought. Mixing 30-year-old technology with five-year-old technology feels...
B
Cool, on to the demo. I wanted to give a quick tour of where we are and where we're going for the multi-cluster kind of shape. First of all, the thing that came up the last time we talked was the naming, and trying to keep the names not crazy.
B
So what I settled on was: we have the regional cluster, which is going to be the environment name plus gitlab-gke. We're just keeping the same name there, because changing the name requires rebuilding the cluster. And the zonal clusters will just be the environment plus the zone. So for pre-prod it looks like this. Of course, this isn't set in stone; we haven't done anything we can't revert, so we're doing the PoC on pre-prod.
B
It looks like pre-gitlab-gke is going to be the regional cluster, and then we have pre-us-east1-b, -c and -d, and that keeps it nice and short; it keeps our names short, and everything else derives from the cluster name. I wanted to include the full zone, including the region, because we could eventually add additional regions here, and that'll be important for the zonal configuration. I also did a bit of refactoring.
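A rough sketch of the naming scheme described here, written as Terraform locals; the us-east1 region and the exact names are assumptions inferred from the discussion, not the actual code:

```hcl
locals {
  environment = "pre"
  region      = "us-east1"

  # The regional cluster keeps its existing name; renaming forces a rebuild.
  regional_cluster = "${local.environment}-gitlab-gke" # "pre-gitlab-gke"

  # Zonal clusters are just environment + zone.
  zonal_clusters = [
    for suffix in ["b", "c", "d"] :
    "${local.environment}-${local.region}-${suffix}"
  ]
  # => ["pre-us-east1-b", "pre-us-east1-c", "pre-us-east1-d"]
}
```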
B
Instead of having everything tucked into main.tf, we now have gke.tf, which has the regional cluster, and then we have gke-zonal.tf, which just has the zonal clusters. gke-zonal.tf, which is what I'm showing here, is where we have some repeated module calls. Basically, we have two modules: we have the module for all the IP reservations, so we pre-allocate the internal IP addresses for Kubernetes, and then we have the GKE module itself, which creates the cluster.
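A minimal sketch of what those repeated per-zone calls might look like; the module sources, names and variables are assumptions, not the real repo layout:

```hcl
# Hypothetical shape of gke-zonal.tf: a pair of module calls repeated per zone.
module "gke-reservations-us-east1-b" {
  source      = "../modules/gke-reservations" # pre-allocates internal IPs/DNS
  environment = "pre"
  zone        = "us-east1-b"
}

module "gke-us-east1-b" {
  source     = "../modules/gke" # creates the zonal cluster itself
  name       = "pre-us-east1-b"
  zone       = "us-east1-b"
  subnetwork = "pre-gke-us-east1-b" # each cluster has its own subnetwork
}

# ...the same two calls repeated for us-east1-c and us-east1-d.
```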
B
The maintenance policy here will be different for each cluster. I haven't made that determination yet, but we'll eventually probably have alternating maintenance times for the different clusters. You can see it's basically us-east1-b, us-east1-c and us-east1-d. It's not great; with the next version of Terraform we might be able to do this in a loop, but keep in mind also that each of these has a different subnetwork.
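A sketch of the loop alluded to here: Terraform 0.13, new at the time of this meeting, allows for_each on modules, which would collapse the per-zone repetition while keeping the per-zone differences (like subnetworks) explicit in a map. Names are illustrative assumptions:

```hcl
locals {
  zones = {
    "us-east1-b" = { subnetwork = "pre-gke-us-east1-b" }
    "us-east1-c" = { subnetwork = "pre-gke-us-east1-c" }
    "us-east1-d" = { subnetwork = "pre-gke-us-east1-d" }
  }
}

module "gke_zonal" {
  for_each   = local.zones        # one module instance per zone
  source     = "../modules/gke"
  name       = "pre-${each.key}"
  zone       = each.key
  subnetwork = each.value.subnetwork
}
```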
B
The reservations module is only doing IP and DNS reservations that are GKE-cluster specific. Initially I was thinking we'd just have a generic module that's not GKE specific, but if you think about it, this is fairly unique.
B
It's a unique thing, because we're actually calling this module three times, so using it outside of GKE doesn't really make sense. That's why it's namespaced to GKE, and it just creates a bunch of IPs; that's all it does right now. Those are needed for monitoring, for the ingresses for Git and registry, and all this. Yep.
C
That's fine; I think this is all good. One thing, actually: now that we have external-dns deployed and available in all the clusters, we could get rid of the reserved IPs and just use external-dns to create the internal DNS entries and use those. Especially for monitoring, for example, we actually don't need fixed IPs at all, because it's all exposed through external-dns and those are the entries we use. So we may even be able to simplify that model quite a lot.
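A sketch of the pattern suggested here: rather than reserving a fixed internal IP in Terraform, let external-dns publish a record for the Service. The hostname and the monitoring service are illustrative assumptions; the two annotations are external-dns's standard hostname annotation and GKE's internal load balancer annotation:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  annotations:
    # external-dns creates this record pointing at the LB IP it gets assigned
    external-dns.alpha.kubernetes.io/hostname: prometheus.pre-us-east1-b.gitlab.internal
    cloud.google.com/load-balancer-type: "Internal" # internal GCP load balancer
spec:
  type: LoadBalancer
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
```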
B
Yeah, registry and Git, because those are tied to ingresses, so yeah, that's a good point. I didn't move over the external-dns configuration because I wasn't actually sure how to do that yet; currently this is defined in main.tf.
B
Well, it's in here somewhere; there is configuration for the external-dns, and I still need to make that compatible with the zonal clusters. Cool, that's it for Terraform, and then...
A
I have a question for you, Jeff; it's more of a stylistic thing. Is it possible to be explicit with the Terraform file names? Right now you have gke-zonal and gke. Can we call them gke-zonal and gke-regional? I think we need to be explicit here; it's going to be better.
C
I've looked at this a lot quite recently, so I can very quickly run you through it. When we define the maintenance period, two things are going to happen. Master upgrades will always happen first, in which case we won't be able to do a deploy during that period. However, I've now got a little bit of automation that adds certain annotations, and that could be extended for anything, if we want to set a flag to stop deploys or whatever we want to do.
C
We
now
do
have
control
over
when
those
we
we,
we
have
no
an
entry
point
for
understanding
when
those
master
upgrades
happen
and
it's
going
to
be
in
the
matter
of
minutes.
They
spin
up
new
masters
in
the
background
and
then
when
they
actually
send
us
the
alert
the
upgrades
happening
they
just
swapped
them.
I
I
actually
thought
they
took
down
the
masters
and
what
have
you
so?
The
control
plane
outage
is
like
it's
only
a
couple
of
minutes.
C
What
I've
seen
like
two
minutes
or
something
the
node
upgrades,
is
obviously
they
like
cycle
through
all
the
nodes,
and
that
is
more
disruptive.
However,
that
problem
is
it
it's
it's
more
impactful
to
potentially
to
our
workloads,
but
we
should
be
able.
We
really
should
be
handling
it
and
I
think
we
can
handle
it.
We
do
handle
it.
We
could
make
it
faster
by
doing
better
pod
disruption
policies,
but
every
upgrade
I've
done
so
far
and
even
watching
it
doing
auto
upgrades,
it's
been
very
safe.
It
just
takes
a
long
time.
C
We
can
still
stagger
the
upgrade
windows
and
I
actually
have
a
big
discussion
about
how
we
want
to
stagger
it
and
stuff,
because
the
choices,
if
we
because
the
outage
windows
have
google
mandate,
that
they've
got
to
be
so
big
like
eight
hours
or
something
the
more.
We
stagger
them.
It's
like
we're,
basically,
the
whole
week.
C
We're
just
got
one
cluster
constantly
in
an
upgrade
window,
or
we
just
box
them
all
together
during
us
sunday
evening,
australia,
monday
morning,
where
I've
got
it
at
the
moment,
which
is
the
quietest
period
of
the
week,
and
we
just
try
and
crack
them
through
all
together
and
in
theory,
we
we
are
doing
multiple
at
once,
but
each
of
those
node
pools
should
be
rotating
independently
and
the
pod
disruption
budget
should
be
making
for
making
sure
we
don't
have
major
problems.
So
I'm
I'm
happy
to
do
it
whatever
we
want.
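A sketch of the "box them all together" option as a recurring GKE maintenance window in the cluster resource; the times are illustrative, and the maintenance_policy/recurring_window block follows the google provider's schema:

```hcl
resource "google_container_cluster" "zonal" {
  name               = "pre-us-east1-b"
  location           = "us-east1-b"
  initial_node_count = 1

  maintenance_policy {
    recurring_window {
      start_time = "2020-09-27T22:00:00Z" # US Sunday evening
      end_time   = "2020-09-28T06:00:00Z" # Google requires 4-8 hour blocks
      recurrence = "FREQ=WEEKLY;BYDAY=SU" # RFC 5545 RRULE syntax
    }
  }
}
```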
A
Yeah, okay, that makes sense; thanks for explaining that. Then my remark is invalid and it doesn't matter. Cool. Where is that issue, Graham? Could you link me to it? Because I might have missed it.
C
We may even do one cluster at some point and then every other cluster all overlapped at the same time, so we have one kind of canary in case the upgrades go wrong. But yeah, Google have very specific requirements; it's not like you can do only one hour, you've got to do four- to eight-hour blocks, and they can do it at any point within them. So it's really tricky.
C
We still have some tweaking we can do on pod disruption budgets. I feel like we can improve it; the upgrades at the moment are a little painful, because it kills a hundred pods and every pod takes like two minutes to kill, so it just sits there grinding through it. So I feel like we can.
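A sketch of the kind of tweaking mentioned here: a PodDisruptionBudget that lets a node drain evict several pods in parallel instead of grinding through them one at a time. The numbers and labels are illustrative assumptions (policy/v1beta1 was the PDB API version in this era):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: gitlab-webservice
spec:
  maxUnavailable: "10%" # tune how many pods an upgrade may take down at once
  selector:
    matchLabels:
      app: webservice
```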
A
It's not only cloud native. I think that question has been asked before for our regular setup, and we never finished answering it. I think I asked you about it; I couldn't find it, I was trying to find where it was discussed, but I remember there was a discussion of what is reasonable.
A
What is a reasonable timeout? And you were explaining to me how, because of the way we are rolling VMs and so on, we allowed a large window.
B
Yeah, just by the fact that the Omnibus takes a few minutes to install, we wait much longer. I kind of wonder how much of this is also a monolith problem, where realistically Git HTTPS has very little dependency on Rails, right? It's just using it for authorization.
A
I think that's a question we need to ask. So I would ask you to raise an issue for us to discuss this: what is the actual limit, right, and what kind of expectations do we need to set for our users explicitly now, not just what we are doing now by accident.
B
Yeah, I think the first thing we need to do is somehow figure out how long these connections are on average for Git HTTPS, and I'm not sure if we can do that with metrics; I need to check. But that would be interesting to see.
C
I think registry is the bigger one. If I pull a Docker image from GitLab's registry over my home connection, which, to be fair, is halfway around the world and sucks, it could take me 20 minutes. If I get 18 minutes into that and you roll a pod, it would just chop the whole thing off and fail. I don't know whether we have to see, like, does Docker support HTTP retries, or do we need to set some headers or something, but...
B
We're independently deploying registry separate from the monolith, so it's not as much of an issue, because we aren't rolling those pods as frequently.
B
Only for config updates that affect registry. So I think this helps a lot, and it would help for Git HTTPS if we did the same, but divorcing Git HTTPS from the monolith is going to be an uphill battle, so we'll probably need to come with a lot of evidence. Cool, while Mary's typing that...
B
You can take a look, I don't know. Graham, I saw you commented on the Helmfiles MR, and you're okay with the current approach of having the separate environments for...?
C
I think that makes sense. You would know better than me: if they're literally going to be identical, we could probably get away without separate environments, but I suspect there's going to be some point where they're not going to be identical, in which case helmfile environments are the only kind of mechanism we have to cleanly make the differentiation.
B
So yeah, so now...
C
It's not necessarily a bad thing either, because I think with environments we can also actually force it to certain kube contexts, which is another thing we should probably look at doing as well. If you do, like, helmfile with environment us-east1-b, it will automatically always use the context for that cluster, so as to avoid...
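A sketch of that suggestion: tie each helmfile environment to its cluster's kube context so a deploy can't hit the wrong cluster. The context names are assumed, and whether kubeContext can be templated this way is an assumption too; it could instead be passed per invocation:

```yaml
environments:
  pre-us-east1-b:
    values:
      - pre.yaml.gotmpl
      - pre-us-east1-b.yaml.gotmpl

helmDefaults:
  kubeContext: "{{ .Environment.Name }}" # quoted so the file stays valid YAML
```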
B
Yeah, exactly. You probably saw this: I decided to put the cluster and region in the environment file itself, and just put them explicitly. One place where this is repeated is that we also set cluster and region as environment variables in CI, and that is how helmfile connects to the right cluster, because this happens before it reads environments.yaml.
B
I think setting this in two places is the right thing to do for now, and later we can change CI so that it reads this file and gets the cluster and region from it instead.
B
Yeah, maybe that's an option, but...
B
One thing I learned. What I initially was thinking of doing was: if you go to the GitLab canary values, you can see here that we inherit canary from gprd. That doesn't really work, though, if gprd.yaml.gotmpl is an invalid YAML file, and it becomes an invalid YAML file when you have these variable definitions. Do you know what I'm saying, or not?
B
Yeah, for example, values.yaml.gotmpl is a valid Go template, but it's not a valid YAML file. So if you inherit something that has variable definitions, this will fail, because it's reading it as YAML.
B
I was looking and couldn't figure it out. But anyway, my first thought was that we could just have all the zonal ones do a readFile on the base regional YAML, but that didn't really work. So instead...
B
So instead, what we're doing is this: in the helmfile.yaml, not for Thanos here, you can see that it actually first loads this env prefix, which is a value that's set per environment; for all of these zonal clusters the prefix is set to just pre. So first it loads pre.yaml.gotmpl, right, and then it loads the environment name .yaml.gotmpl via Go template. If the environment name is pre, what happens is it loads pre and then pre again, and it's fine, it's the same file, but it's a little bit janky, yeah.
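A sketch of the two-step values loading walked through here: every zonal environment carries an env_prefix pointing at the shared base file, and the release loads the base file first and the per-environment file second. The environment, release and file names are assumptions; for the plain pre environment both lines resolve to the same file, hence the "loads pre twice" jankiness:

```yaml
environments:
  pre-us-east1-b:
    values:
      - env_prefix: pre # all pre-prod clusters share the "pre" base file

releases:
  - name: gitlab
    chart: gitlab/gitlab
    values:
      - "{{ .Environment.Values.env_prefix }}.yaml.gotmpl" # base: pre.yaml.gotmpl
      - "{{ .Environment.Name }}.yaml.gotmpl" # override: pre-us-east1-b.yaml.gotmpl
```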
B
We could probably make this a bit nicer by being more explicit; maybe make the environment name for the regional cluster be something like pre-regional, and then have a base environment named pre. But the thing is, I don't want to overload environments too much, and I...
B
Exactly, yeah, that's exactly what we're doing, but it works okay for now, so yeah.
B
Yeah, helmfile.yaml can't be a Go template, right? It would be nice if it could, because then you could get rid of this really janky syntax here; you could make a variable definition instead of embedding it. Because what I found was that helmfile.yaml right now needs to be valid YAML, so you can't put variable... yeah, exactly.
C
You can't put variable declarations, yes, and that's why we have to, like... yeah, you've got to quote it that way, yeah.
A
There is one thing, but I only have two minutes. Okay, let me... because there's some... what is it, confidential information?