From YouTube: 2021-08-19 GitLab.com k8s migration APAC
C
How are you doing? I'm good, how are you? I'm stuck deep, deep in Kubernetes land for once, so I've probably got lots of questions for Graham, but I'll let you guys have your demo first.
A
Yeah, this is going to be an interesting demo. I'm hoping we'll get our view of the web fleet on canary. We will, but Graham's also been having a think about future stuff, so it should be interesting.
C
Oh, I haven't... I didn't realize. I will look at them, thanks.
A
I know you're super busy, so I just don't want to... I think it was kind of sent to you and we sort of sat back.
C
Yeah, okay, I might do that, just because... yeah.
D
Cool, so everyone can see the dashboards. I'll quickly change this to the last six hours. So let's look at canary in production and see how canary's doing. Is that loaded already? I should move this window. Yep, so that's loaded. Overall, looking at the last six hours (canary's been running for about 18 hours now, I'd say, maybe a bit less), we can see that there are no huge abnormalities. I won't blow this out too much, just because the longer the range, the longer all the dashboards take to load.
D
I've updated the ticket to point all this out as well, but we have not seen any real change in terms of, say, 5xx responses from HAProxy. Let's have a quick look at web itself.
D
So it's about 13:00, about there, that the changeover actually happened, I believe. Double-checking the date... no, it was about there.
C
Okay, so that sustained period of drop-off, that sort of flat section: there are two spikes and then the drop-off was after the...
D
Yeah, it's not unreasonable for us to get dips that low, and we've seen dips that low before. I'm definitely interested in giving it a few more days, let me qualify it by that, but I'm not seeing this as a huge difference.
D
If that makes sense: I think there might be a slight difference, I'm not entirely sure, but I'm pretty confident it's similar. We'll need to run it for a few more days and probably look at it overall; with all that time falling over the weekend, that's probably not making things any clearer either. During APAC hours it definitely looks fine as well.
D
So you can see here there's kind of an interesting story with saturation for the memory component especially. We had a false start where we set the memory limit on the containers too small, six gig instead of eight gig, it turns out. I think we were modelling it off API originally, and it turns out the memory usage is slightly different from API, so we were actually hitting out-of-memory errors here and pods were getting killed.
D
So I'm not entirely sure. I think that might be an application issue, but it's interesting nonetheless.
C
I'm certain it's almost certainly the application, right. The problem is tracking it down, because we're probably not logging it, because it's crashing, so you've actually got to look from the outside and try to find that trend, which is pretty difficult.
D
Yeah, and the thing is too, we cycle pods during deploys; we're definitely recycling pods pretty consistently throughout a 24-hour period as well. That's not a big factor, but it also means we rarely get pods that live for a long amount of time.
D
Believe it or not, I can actually look at how many of these were actual OOMs. Say out of these three, I think only one of them actually was. That being said, I still think that's not great, right; we probably don't want that even so. But yeah, can we...
C
Yeah, if you want that resolution you really ought to go into Prometheus, because I think we've got a minimum interval of one minute there. But sure, okay.
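As a rough illustration of pulling that finer resolution straight from Prometheus, the sketch below queries the standard Prometheus HTTP API for OOM-killed containers. The Prometheus URL, the namespace label, and the kube-state-metrics metric name are assumptions for a typical setup, not details confirmed in this call.

```python
# Hedged sketch: query Prometheus directly at a finer step than the
# dashboard allows. Assumes kube-state-metrics is installed and that the
# URL and labels below exist; adjust both for the real environment.
import time
import requests

PROM_URL = "https://prometheus.example.com"  # placeholder

query = (
    'kube_pod_container_status_last_terminated_reason'
    '{reason="OOMKilled", namespace="gitlab"}'
)

end = time.time()
start = end - 6 * 3600  # the same six-hour window as the dashboard

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": query, "start": start, "end": end, "step": "30s"},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    pod = series["metric"].get("pod", "unknown")
    flagged = sum(1 for _, value in series["values"] if value == "1")
    print(f"{pod}: {flagged} samples flagged OOMKilled")
```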
D
Yeah, no, I might have a look at that, it's good to know. I might look at that because it's definitely interesting. I'm not sure; personally I'm of the opinion it's probably not enough to hold this back from going to production, unless anyone has any thoughts. But it's certainly interesting.
C
Graham, the other thing that I would look at if I were you: go and look in the Rails logs, find that pod and the rough time period, and then basically order requests by the memory stat we log. That might be a really good clue. The numbers there, some of them are shocking; it doesn't surprise me at all that, if we're getting stricter on memory, one of those ones will tip you over.
C
And then sort of narrow it down to where you're seeing the OOMs. Well, actually, sorry, I said earlier we're probably not going to log the OOM itself, but you might get some idea by looking at the near misses. We'll obviously have logged the near misses, so that might be something to look at.
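A minimal sketch of that suggestion, ranking requests from the pod's structured Rails logs by a memory field to surface the near misses. The log file name and the field name used here (mem_total_bytes) are placeholders assumed for illustration, not names taken from the call.

```python
# Hedged sketch: rank requests in a Rails JSON log by a memory field to
# find the heavy requests ("near misses") around the OOM kill. The file
# path and field names are assumptions; swap in the real ones.
import json

LOG_FILE = "production_json.log"    # hypothetical export of the pod's logs
MEMORY_FIELD = "mem_total_bytes"    # assumed memory instrumentation field

heavy = []
with open(LOG_FILE) as fh:
    for line in fh:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        mem = entry.get(MEMORY_FIELD)
        if mem is not None:
            heavy.append((mem, entry.get("controller"), entry.get("path")))

# Print the twenty heaviest requests in the window.
for mem, controller, path in sorted(heavy, reverse=True)[:20]:
    print(f"{mem:>14}  {controller}  {path}")
```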
D
We do see spikes in disk space usage as well, but I'm wondering if that's related to the open issue we have where we leak files on large file uploads or something; I think we occasionally leak them. But once again, it's not enough to cause anything to run out of disk space.
D
If we look at, I mean, if we look at the apdex per pod... oh, whoops.
D
Cool, so once again, this is over 24 hours. If I change it to two days, there definitely was a change around here, but I'm hoping that's just within the normal parameters for a week; I guess we'll see over the next few days. The other metrics I looked at seemed, once again, relatively okay. That's a quick look at Puma and Workhorse. Some of these metrics now obviously are not broken, but their usefulness is diminished.
D
Things like per fully qualified domain name: obviously now everything just falls under one value, so you lose all of that granularity, which makes sense because we only have one backend. In fact, these will change over to a different set I think we have specifically for Kubernetes services, but the reason I haven't done that yet is that if someone has a problem in production, which is running on VMs, they need to be able to use these now.
C
There's a whole bunch of stuff in the dashboards where it says provisioning and then it's got kubernetes true and VMs true, or something along those lines, and when the thing's finished, when you set VMs to false, there's a whole bunch of panels that will disappear automatically. Not that per-FQDN one in particular; that won't, because that's one of the significant-label things, so you'll find that elsewhere. But definitely there's a whole bunch of node-level stats, and also I don't see the Kubernetes panel at the moment. Yeah, I beg your pardon, we do have it there. Okay, and effectively those node metrics will disappear, the row above. Gotcha.
E
Graham, yesterday I discovered the HAProxy dashboard.
E
But if you specifically select the right backends and so on, and I will place the link in chat, then it's much faster and more selective.
D
I will actually do a bit more analysis on the logs. I did a quick analysis on just the 500 errors, but I might dig through like that; you've definitely inspired me with that memory field. I might try and look through all the fields we have and see if I can find any differences on some of those other things.
D
So if we look at the last two days, we seem to have a weird multi-hour gap here, or a one-hour gap, in metrics, which is a little bit interesting. But besides that, the change came in around 19:00, somewhere around here... even further back. Oh sorry, it should be somewhere around here, shouldn't it.
D
Around there. So this is probably not enough resolution, but I'll try pulling this back to seven days.
E
I think yesterday, when I looked at the seven-day resolution, it looked like we have a little bit higher latency, especially on the p95, but overall and on average it doesn't seem to be a big difference.
E
Because we are not using Unix sockets anymore, right? Maybe that's the expected latency slowdown, and also maybe we just have differences in how pods are reacting to load.
D
Sure, yeah, that is a good point. Workhorse and Puma used to talk via a Unix socket, and NGINX, they would all talk to each other via a Unix socket on the same node, I'm pretty sure, which was obviously quite quick, but now they could potentially be bouncing all the way over the network. We don't have NGINX, fortunately, but at least I know Puma and Rails will be talking over the loopback interface, which could conceivably be slightly slower.
C
But those latencies you're not going to spot in Prometheus, because you're not talking about many, many seconds; we're talking five milliseconds max, right. So with the resolution you've got, you're not going to spot that.
E
I think you could maybe check the readiness checks here, the frontend check_http checks. I'm wondering if that one maybe would be the very fast request where we would see a difference.
E
In the top values there's the frontend drop-down, and instead of "all" you could select, I think, check_http or check_https, I'm not sure, but maybe that would give us the very fast checks.
E
Yeah, I'm not entirely sure what gets checked there, but at least they look like they should be the fastest responding, right. So if there are latency differences, then we should see them there, but I'm not sure if that is really true.
D
I guess the question as well is, and it's definitely possible we've introduced a little bit more latency, what is the threshold at which we have to consider it a problem? I guess that's the next question. Obviously it would be great if it were the same, but it might be slightly higher.
A
If we're still within SLOs, that's fine, right; we can tune this stuff later. As long as we're planning a gradual rollout through production, I think that's totally fine.
C
Graham, the one thing that I do think is important is that you also compare queuing time. That's something else you can get from the logs, because there's no reason at all that queuing time should be bigger. If it is, then we've got our Kubernetes config wrong.
C
So I would look at queuing time, but I'd also look at execution time, so separate those two components out, because the reasons for each of them are different. Matt Smiley pointed out the other day that some of the concurrency settings on Sidekiq were very high, and that's going to lead to things slowing down as well.
C
I don't know if that's the same with this Puma work, but it was like 15 or something like that, which is too high, because it's all effectively contending on a single core.
D
Yeah, no, good to know. That's definitely...
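As a sketch of that comparison, the snippet below reads structured web logs from the two fleets and reports the p95 of queuing time and execution time separately. The file names and the field names (queue_duration_s, duration_s) are assumptions for illustration; the point is only that the two components get measured and compared independently.

```python
# Hedged sketch: compare queuing time and execution time between the VM
# and Kubernetes canary fleets as separate components. Field and file
# names are placeholders.
import json
import statistics

def load_components(path):
    queue, execution = [], []
    with open(path) as fh:
        for line in fh:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            if "queue_duration_s" in entry and "duration_s" in entry:
                queue.append(entry["queue_duration_s"])
                execution.append(entry["duration_s"])
    return queue, execution

def p95(values):
    # Need at least two samples for quantiles().
    return statistics.quantiles(values, n=100)[94] if len(values) >= 2 else float("nan")

for fleet, path in [("vm", "web_vm.log"), ("k8s-canary", "web_canary.log")]:
    queue, execution = load_components(path)
    print(f"{fleet}: queue p95 = {p95(queue):.4f}s, execution p95 = {p95(execution):.4f}s")
```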
A
I think for these things, Graham, if you just do a side-by-side. Let's try and get some traffic shifted to production next week after the release, but as long as we've got a kind of "we know this thing slowed down a bit" or "we're seeing a bit more of this or a bit less of that", then we can monitor that through the rollout.
D
Sure, so what I'm hearing is: we should definitely do a bit more investigation, which I'll do first thing tomorrow morning, but we're not seeing any obvious signals, at least at the moment, that would stop us from at least trying to go to production sometime next week.
D
Okay, so I think we're basically at the point now where I'm happy with the state of the infrastructure readiness review. We've had Cameron review it, and someone else reviewed it; Stan reviewed the one from development. The security one's now been done as well, so we should be... there's a couple of suggestions there, and I'll double check, but they're probably more like documentation notes than actual issues, so I think that's probably done, or at least I can make sure it's wrapped up tomorrow. I feel like we've got critical mass around that now, so I think all of the readiness reviews should be done by the end of the week, kind of thing. Awesome.
A
Good stuff, good stuff. And then what else have we got that we need to do? When are we thinking we'll be ready to start moving over to production, assuming all the investigation things go smoothly?
D
Yep. So assuming we don't find a reason that we shouldn't go to production, by close of business tomorrow I will probably have the change request for production done, and then the next question will really be how slow or fast we want to go. I've got about 90% of the change request done; I just need to tweak some numbers in terms of how quickly we go. Do we want to do 10% every hour? Do we want to do 10% every day?
A
Oh yeah, we'll work with Henry, I guess, to work out those times. Probably something in between; it's probably whatever you want it to be, to be helpful.
D
I mean, for API I think it was around 24 hours; by the time the 24-hour period had passed, we'd gone from zero to 100%, I think. So maybe we try and work off that. Obviously we need to be aware that if we do that, then deploys and stuff are going on during that time.
D
So I do need to do a slight tweak to the change request to make sure we're not blocking deploys. And I might even split it: I had a chunk of the change request that I might split up into a different change request that I can action tomorrow, and that's just adding the backends into HAProxy with no weight. So it's getting everything ready.
D
It's spinning up the Kubernetes side, all of the prep; everything's there, but with no weight. Then the actual change request that we do, say, over 24 hours will just be changing weights: slowly changing the weights, step by step. I think that would be the best solution, so fingers crossed I can try and action that tomorrow, depending on splitting that out.
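For illustration, here is a sketch of what the weight-only change request boils down to: a schedule of HAProxy runtime "set weight" commands ramping the Kubernetes backend up over 24 hours. The backend and server names, the step size, and the start time are placeholders, not values from the actual change request.

```python
# Hedged sketch: print a 24-hour ramp of HAProxy weight changes from the
# VM fleet to the Kubernetes fleet. Names and timings are placeholders;
# the commands are only printed here, never executed.
from datetime import datetime, timedelta

BACKEND = "web"                       # hypothetical backend name
K8S_SERVER = "web-k8s"                # hypothetical k8s server entry
VM_SERVER = "web-vm"                  # hypothetical VM server entry
STEPS = 10                            # 10% per step
START = datetime(2021, 8, 23, 9, 0)   # placeholder start time
INTERVAL = timedelta(hours=24) / STEPS

for step in range(1, STEPS + 1):
    k8s_weight = step * 100 // STEPS
    when = START + (step - 1) * INTERVAL
    print(f"{when:%a %H:%M}  set weight {BACKEND}/{K8S_SERVER} {k8s_weight}%")
    print(f"{when:%a %H:%M}  set weight {BACKEND}/{VM_SERVER} {100 - k8s_weight}%")
```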
D
So if all things go well, possibly we could do it next Monday. If not, and I'm going to be very realistic here...
D
Maybe I get some feedback on the CR and I need to redo it, so maybe my Tuesday, unless someone else in a different time zone wants to do it, although once again I think we're probably going to be doing it over a longer, 24-hour cycle anyway.
E
I think the plan was to get this out next week before the Friday, right? Because Friday is the 27th and I think normally we would release security releases on the 28th, so we wanted to try to get this in before that, to see if you can make this happen. If you can get the change request down to just changing weights, that would be great, because then it would be very comfortable to fiddle around with it or to roll...
D
Yeah, that was the feedback I got from John as well, and it makes total sense. It wasn't until I sat down and rethought it all out that I was like, okay, yeah, we can do this. We're lucky we did the work on the cookbook; we wouldn't have been able to do this without that.
D
So we can just leverage that now, and as I said, I'll try and split it out and actually get that prep work done tomorrow. I should be able to grab whoever's on call in APAC, it's a quiet time of day, and just get that done, and then it literally will just be changing weights. That's all we need to do; everything else is in place.
D
And I guess it is worth noting, with our canary setup as well: technically speaking, I don't know what it is exactly, I'm trying to think, it's a weight of three versus... let's say it's one percent. Probably a fraction of one percent of GitLab.com's real traffic is going to canary, and it's now going to Kubernetes as well.
D
So it's not just, I mean, we're all familiar with this with canary, but we have got real user traffic on there, which is good, because that makes me more confident already as well.
A
Great, yeah, that's good, excellent, great stuff, cool. Well, please, let's keep things moving; Henry's around this morning and Skarbek will be around later, so if there are things we could be continuing through today to get this stuff ready for next week, that would be good.
D
I've created an issue in the GitLab chart to talk about getting the upstream chart to use v2beta2, basically the new version of the horizontal pod autoscaling API, on the objects that we ship with the chart. The short answer is that it'll give us greater flexibility and control over the scaling. In particular, we can do things like scale up very quickly but scale down slowly, to stop that kind of thrashing, going up and down in pods, that we have happening all the time. We can say: you can look at metrics over five minutes and scale up quickly, but you can't scale down until you've seen, say, an hour of lower traffic, or things like that. We'll see if Distribution can pick that up, or we can pick it up for them, so I just want people to be aware of that.
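To make the scale-up-fast, scale-down-slow idea concrete, here is a hedged sketch of an autoscaling/v2beta2 HorizontalPodAutoscaler with a behavior block, created via the Kubernetes Python client. The deployment name, namespace, thresholds, and windows are placeholders, not what would actually ship in the chart.

```python
# Hedged sketch of an autoscaling/v2beta2 HPA: react quickly to load but
# wait out an hour of lower traffic before scaling down. All names and
# numbers are placeholders.
from kubernetes import client, config

hpa = {
    "apiVersion": "autoscaling/v2beta2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "webservice", "namespace": "gitlab"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "webservice"},
        "minReplicas": 10,
        "maxReplicas": 100,
        "metrics": [{
            "type": "Resource",
            "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": 60}},
        }],
        "behavior": {
            # Scale up quickly: allow doubling every minute, no stabilization delay.
            "scaleUp": {
                "stabilizationWindowSeconds": 0,
                "policies": [{"type": "Percent", "value": 100, "periodSeconds": 60}],
            },
            # Scale down slowly: only after an hour of lower demand, 10% per 5 minutes.
            "scaleDown": {
                "stabilizationWindowSeconds": 3600,
                "policies": [{"type": "Percent", "value": 10, "periodSeconds": 300}],
            },
        },
    },
}

config.load_kube_config()
client.AutoscalingV2beta2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="gitlab", body=hpa
)
```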
D
So, one of the topics we want to try and cover after the web migration, alongside the Pages migration I guess, is addressing some of the technical debt and bits and pieces we have. I've chosen to focus at this point on the gitlab-com repo. The biggest reason for doing that is that with the migration of web, we now have all of Reliability looking at that repo a lot more closely and trying to utilize it a lot more, with people outside Delivery basically contributing to it.
D
We also have a lot more developers within GitLab who are trying to roll stuff out and contributing to it as well. It has just grown organically over time to have some ugliness and some bits and pieces which we've known for a while need work, so I don't need to talk too much more about that. In terms of actual steps, I've started to nail down what I would like to do as the first real, concrete steps to fix some of these things.
D
First thing, and I've got issues, as you can see, for all of those: we've got getting rid of helm-git, which is basically blocking us. It's a Helm plugin that's loosely maintained. It doesn't do much for us; well, it does a little bit, but what it does is very simple. The problem is it's blocking us from upgrading Helm and from upgrading the plugin.
D
So I've outlined that issue there, and people can have a look and discuss how we get rid of it or replace it with something. It sounds like, Andrew, you had success with git subtrees. The thing is, we have another git repo, the chart, and we just want it kind of locally in that repo somehow.
C
It's nothing like git submodules, which were an awful thing; it's much better. What I really like is that all your changes are on one merge request, right, so you're not trying to vendor in something and work in two repos. You do your changes locally, and then when you're ready you push your merge request back and it's just got the changes.
C
The commit list looks a bit weird on the other project, because it's got all the commits, even ones that don't have anything to do with it, but you can squash them down. I think it's going to be a big productivity boost, because you're not trying to work in two places, push that change there, test it here; you're just doing it locally.
D
For a while there, and I think Jarv can speak more to this, we were running off a branch when we had some intermediate work, which we try not to do as much as possible, but a solution where we could conceivably work off a branch if we needed to, especially in an emergency, is useful.
C
Yeah, and what we're doing with this is we've got all these different merge requests that we're pushing back, but because it's all git, we can look at what the total difference is and then work on them individually. The other thing is that we're not trying to wait for every change to get merged upstream before we can move ahead with our stuff, otherwise it's just going to be...
D
Yeah, fair enough. So there's a topic about that and how we can provide some safety for it as well, because if we start locally vendoring, we need to make sure people don't change the local copy, although it sounds like git subtree might solve that. So once we action that small, discrete item, we can then do the upgrade of all the software components we haven't upgraded in months, if not years. The next step from there will be:
D
I want to remove as much of the Chef syncing as we can. I think it's just more confusion now, because you can't see the values, you just see that they're being pulled from Chef, and it's really difficult to debug, and honestly we just don't have stuff running in Chef as much as we used to. Then the final piece in this first wave will be getting visibility back into our Helm deployments.
D
At the moment, all we get is Helm saying that the upgrade has started and then, 30 minutes later, saying it completed, with no visibility into what's going on. So there's an issue there to track that and how we can approach getting some visibility back.
D
I do obviously have larger issues I'd like to do in terms of tackling that repo as well, but I feel like these ones are really good to start with to just get the ball rolling, and then I'll continue to try and break off some of the technical debt work into these discrete issues and get them onto the board, so we can action them or others can action them.
A
Oh yeah, thanks for that, Graham. Just to catch the rest of you up: Graham and I chatted a little bit about this yesterday. We've got a load of tech debt there, and we obviously don't want to have to stop a whole project and spend six months paying it all down, so we're going to start having small pieces like this that we can just trickle in alongside our other work. For now, Graham, I've not got any of these on the board.
A
I think some of them we want to see if we can break down a bit smaller and fit in around the web stuff, but for all of this, please keep opening issues, anyone who's got suggestions for how we can break off small pieces of tech debt and actually get value without needing to overhaul the entire thing, basically.
A
Awesome. Do you want to go through number three, Graham?
D
Yeah, I'm a little bit wary just because I'm keeping an eye on time, and it is a big topic, but I will try and briefly summarize the important points. With our current four-cluster setup, zonal clusters do not have a high-availability Kubernetes master, and we have upgrade windows for our Kubernetes clusters.
D
Basically during APAC work hours. It's bitten me twice now, and it probably will bite me more: when the masters go down for upgrade, which can be half an hour at a time, if I'm running a deploy during that time it will fail, because it can't talk to the master. It can't change the number of pods; basically it just can't deploy. It's not a big issue, and it's semi-annoying, because we have zonal clusters for a very important reason, which is cost.
D
If we have regional clusters, then pods can talk to any pod in any zone and we get charged for that cross-zone traffic. So zonal clusters make sense, because everything talks to itself in the one zone and we don't get charged for that traffic. Google gives us two options: the regional cluster, where the traffic is free to go across zones as much as it wants and you get highly available masters, which they say they recommend for production use; or zonal clusters, with the downside that your masters can be unavailable during upgrades and obviously during real outages.
D
I don't think this is a burning issue that needs to be tackled straight away, it's just something to note. The other interesting point, once again going back to people outside, Reliability and other people looking at our architecture, is that it is confusing; it's different, right. The upstream Helm chart, the GitLab Helm chart, they do a great amount of testing in a single Kubernetes cluster and then they're like, okay, so it should all work for you, and then I'm like: no.
D
I need to go into Terraform and create load balancers and do internal DNS and wire this load balancer up, and we've got to do one for each zonal cluster, and it can't just talk to itself by the cluster IP because it's not the same cluster, and then one component lives in this cluster and one component lives here. So I'm kind of like, I'm okay to keep zonal clusters.
D
It'd be great if Google could just give us highly available masters on zonal clusters, so part of this issue is that I would like to squeeze Google a little bit on giving us a solution. But I'm wondering if we should just stay on zonal clusters and make them identical, a small copy of everything, so everything talks to itself inside the zonal cluster, and we just duplicate that three times and then have HAProxy arbitrating traffic through the front.
D
The huge downside to that is, say, the web pod talking to the API pod in one zone can talk via the cluster IP, no external load balancer needed, but if API in one zone goes down, we've got no failover to another zone. So there are definitely pros and cons with this approach. It's really a problem that the Kubernetes layer, the platform layer, is supposed to solve, and I'm actually really disappointed Google don't have a good answer for it.
D
Maybe this problem is just too big for now, and in fact Kubernetes upstream have added features, in alpha state in Kubernetes 1.22, to solve this problem for everyone. So maybe this problem will just go away naturally in a year. But it is worth pointing out, and it's just me in the APAC time zone, in the same window that the cluster auto-upgrades happen, so I get dinged on that. It would help people's understanding and align our architecture better with what upstream expects if we could move to everything running in the same cluster, even if we duplicate that across multiple clusters: canary and everything just goes in there, everything goes, and they're just cut-and-paste identical.
C
Just with canary, right, are there never going to be global things that we're going to want to test and roll out, CoreDNS or service discovery or something like that, where we want to test the change in a canary state? I guess we could have most things partitioned, but there might be some things where it's easier just to have a separate cluster for canary. Do you follow?
D
Configuration, yeah, possibly. No, that's certainly another option as well.
C
But generally I agree that having a single cluster and using a native Kubernetes feature to keep everything within the zones would be much better.
D
I think they call it Anthos Service Mesh; they've got some stuff that can kind of do it. I think GKE does talk about...
D
Oh yeah, right, so that's topology keys on services.
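For readers following along, this is roughly what "topology keys on services" means: a Service that prefers endpoints in the caller's own zone before falling back to anything else. It is a hedged sketch with placeholder names; the topologyKeys field was an alpha/beta feature of this Kubernetes era, since deprecated in favour of the 1.22 topology-aware hints mentioned a moment ago, so treat it as illustrative only.

```python
# Hedged sketch: a Service using topologyKeys to prefer same-zone
# endpoints, falling back to same-region and then to any endpoint.
# Names and ports are placeholders; the field was alpha/beta at the time
# and has since been replaced by topology-aware hints.
from kubernetes import client, config

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "gitlab-webservice", "namespace": "gitlab"},
    "spec": {
        "selector": {"app": "webservice"},
        "ports": [{"port": 8181, "targetPort": 8181}],
        "topologyKeys": [
            "topology.kubernetes.io/zone",
            "topology.kubernetes.io/region",
            "*",
        ],
    },
}

config.load_kube_config()
client.CoreV1Api().create_namespaced_service(namespace="gitlab", body=service)
```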
D
Jarv did raise a good point that obviously, if we somehow solved this problem and went back to a single cluster, there is an all-our-eggs-in-one-basket risk if the cluster crashes or fails an upgrade. But that being said, we could move to more of a blue/green cluster kind of setup, where we still have two clusters, they're both regional, and they're there for reliability, not just because we want to split them across different zones.
D
Yeah, I'm still on the fence about a lot of this. I'm just trying to think of a better way to clean this up, so that people can just contribute to the upstream chart, it's all wired through the Kubernetes internal services, they all know how it works, and then we just duplicate that, without the "oh, now I need to figure out what Terraform I have to do to glue this all together".
A
Yeah, and I appreciate you bringing this up, Graham. It sounds big enough that it's the sort of thing that would need to be an OKR, and I think it would need to be tied to us having to unlock something, versus just it being hard to use, because it's going to be such a huge change, super risky and stuff. So let's keep it in mind, but I don't think we're going to prioritize this; we've got enough scaling-related things that will come in. So I think it'll be a good one to see what things look like a little bit further down the road.
B
I had some comments on this I just wanted to raise. Some of them, I think, Graham already covered. I have my reasons for preferring multi-cluster, just for the blast-radius isolation, but on the other hand the bandwidth issue is really not much of an issue except between HAProxy and the cluster. The main driver was cost: we had to do multi-cluster because of cost, because of bandwidth. Now, without NGINX, the only cross-zone traffic is between HAProxy and the cluster.
B
Maybe, if Terraform confusion is the main driving factor, could we just simplify this with jsonnet? Have some guard rails: we could have one jsonnet definition that generates the Terraform for the different clusters, where you define your node pools in one place, and that would maybe make it clearer. But personally, maybe I'm too close to this, I'm just not feeling the pain with Terraform. I don't know.
C
Jeff, we're generating variables, but we're not generating Terraform. Are you talking about generating the Terraform itself?
B
No, I mean in the project itself we just generate the Kubernetes cluster Terraform config with the jsonnet definition. It says, okay, these are the workloads, and then from that we could generate the cluster config in some way. That would make it more straightforward; maybe it's less straightforward, I don't know. But I don't know where the pain points are with Kubernetes, because I haven't felt them.
D
Yeah, you're right, that is a really good point; I didn't even consider that, but that is a really good point. We can do it one better: all of our GKE clusters come with the Config Connector, so you can actually just write, they've got CRDs and components for a lot of the GCE resources we need, and if we're writing JSON it could just be stuff that, yeah, you're right, we could probably make tooling that makes this better.
D
I didn't even think of that, and that actually would make it so simple. You could basically define a service, say we need this exposed, and then it would create the DNS record, or we already have that, yeah.
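As a very rough sketch of that Config Connector idea, the snippet below declares a DNS record as a Kubernetes custom resource instead of wiring it up in Terraform. The group, kind, and spec fields are my best guess at the Config Connector DNSRecordSet schema, and every name is a placeholder; check the Config Connector reference before relying on it.

```python
# Hedged sketch: declare a DNS record through the Config Connector rather
# than Terraform. Schema details and names are assumptions; verify against
# the Config Connector documentation.
from kubernetes import client, config

record = {
    "apiVersion": "dns.cnrm.cloud.google.com/v1beta1",
    "kind": "DNSRecordSet",
    "metadata": {"name": "gitlab-internal-web", "namespace": "gitlab"},
    "spec": {
        "name": "web.int.example.com.",            # placeholder FQDN
        "type": "A",
        "ttl": 300,
        "rrdatas": ["10.0.0.10"],                  # placeholder internal address
        "managedZoneRef": {"name": "internal-zone"},
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="dns.cnrm.cloud.google.com",
    version="v1beta1",
    namespace="gitlab",
    plural="dnsrecordsets",
    body=record,
)
```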
D
To wrap this up: it sounds like it's not an immediate problem. We can circle back to it in six months or a year and see if it's still a problem then, or if things are better or worse. Maybe it's also, I mean, the big component I've had recently has been KAS, and I put that in the regional cluster.
D
So maybe the different way to look at this problem is that we should be moving everything to zonal where we can. Maybe the reason Sidekiq's in the regional cluster is because it doesn't need to, it just talks to Redis, right, and maybe this was more just a bad call I made on that component. So I see that pain point more there, but for the other components it's not as big a pain point. But you're right, when we replace HAProxy there's...
A
Awesome, great. And then I just want to give a little bit of an overview of what I expect we'll be looking at in the next few months. So Pages is going to be the next one, which gets us to the end of the stateless service migration. Skarbek has already started putting together issues for that, so that will kick off pretty much as soon as we're done hands-on with web, so possibly even next week. And then alongside that we have two coming up, and I'm not sure which one will land first, but it'll either be Redis or Praefect.
A
So Praefect, from what we hear, should just migrate. We can certainly attempt that, and we'll need to test it and see how it looks, but there's nothing actually blocking that work at the moment. And then the other one is Redis. We have a shared OKR with platform, it's right across platform, so we're sharing it with Scalability, which is to work out a longer-term Redis scaling plan.
A
We were expecting that this quarter would be the investigation, working out which cluster makes sense and how we would actually go about doing this. But there are some new things coming in as of yesterday that may mean we need to push this forward, we need to scale Redis, so it's quite possible that, if we're thinking Kubernetes is the answer to how we scale Redis, we may want to adjust the order and perhaps do Redis before we try Praefect.
A
So those are certainly the next three that I would expect to be coming in.
B
Thanks, Amy. I had just a couple of comments. One is about secrets refactoring. I don't know, this could belong in Reliability and not Delivery, maybe.
A
Yeah, it might be better there, but I would like... right, yeah, that's a super good point. So yes, secrets is almost certainly going to drop out of one of these other projects, but it may end up being a Reliability project. By all means, if you want to create an epic, then go for it now, but I think secrets will almost become a required piece for at least one of these three.
B
Yeah, and then the second point is just our migration, post-migration, story. This might need to come before secrets, because I'm just really looking forward to getting Puma secrets out of GKMS and the deployer machine. That will be like the last one; actually, deployer and console are the last two VMs.
B
Once we migrate Pages, that will be the last thing running Puma and requiring things like the database password and the Redis password. As soon as we decommission those hosts, then two of our major secrets, the Redis password and the database password, can go into Google Secrets Manager or whatever solution, and we won't need to use GKMS for them anymore. So that would be great. But do you think that won't fall on Delivery, like how we do the migrations, or is that also...
A
I think it will. What we know at the moment is that at some point Registry are going to need these; we've kicked it down the road a little bit because at the moment they don't need them, so we don't know exactly what that looks like. But once we've gone through web there's no real reason not to sort these out. One thing actually related to post-deployment migrations: our other big challenge, our other big focus through Q3, is around deployments and rollbacks.
A
Now, the single biggest problem with rollbacks is that post-deployment migrations block pretty much most packages from being rolled back. So there's a separate angle there as well, where we're actually going to review, and Myra has started thinking about it already, what we actually do with post-deployment migrations: how do they fit into deployments and therefore rollbacks? So yeah, I think post-deployment migrations will certainly be a Delivery thing.
D
Some quick points: I already have an epic for the GSM migration. It's probably a bit unloved, as in it needs a bit of an update. I'm more than happy to, not necessarily do the work, but help: I know exactly all of the options we have available, what needs to be done, which VMs need to change, the whole scope. I know it all, so I'm happy to help scaffold that out so it could be actioned by anyone. And in terms of the console migration:
D
we were going to try and do a console readiness review as part of the web migration, and I think we think it might not be ready. Turning the console servers over to running in Kubernetes is like a two-line code change, because the Helm chart already supports it. We've got all of the bits there to basically just spin up console pods running in Kubernetes extremely quickly, and we could even run them alongside the VMs.
B
The console boxes are used for developer Rails access, so they have all the secrets to connect to the database; that's what I want to focus on. The other boxes I don't think have as many secrets, like they don't need the Redis password or the database password, so although it'd be easy, I'd rank it kind of lower priority. I'd like to...
A
Awesome, great stuff. Thanks, everyone. Is there anything else we need to cover today?
A
Nope? Super, all right. Well, good luck, Graham, with the next steps. Let us know how we can help, and I hope everyone has a great day. We'll see you soon.