From YouTube: 2020-12-03 GitLab.com k8s migration APAC
Description
Discussing progress and next steps for the GitLab.com Kubernetes migration
B
No, it's not. I just went for a run on the mountain, so I'm still in all my active gear and I haven't gone for a shower yet, so... exactly.
A
How's things going for you? Sorry, how's it going for you?
B
Oh, good, yeah, good thanks. Yeah, just busy, busy, busy like everyone, I guess, but yeah. It's all going well. And yourself?
A
We've got this flat in Bow, and this whole cladding thing. I don't know if you're aware of what's going on with cladding everywhere.
B
There was a call last night, and like every time we have one of these calls they go on for hours and hours and hours and everyone's asking all these questions, and last night it was the shortest one ever. It finished within an hour, which was amazing, because everyone wanted to get to the pub. So I was like, I don't know if you should all be going to the pub on the same night, but...
B
All very happy to be leaving lockdown.
A
So Jarv will be joining us shortly. Do you want to run through what you've done with the dashboards, Andrew? Because that's the bit we can cover without Jarv.
B
Oh, okay. I haven't prepared anything, because I guess that's the point of these things. So, yeah, what I've done with them is... let me share my screen.
B
Ah yeah, I've upgraded my window management thing and it's gone totally haywire. Oh well. So I don't know if I've demoed this anywhere, but I'll give it mostly for your benefit, Graeme, so please ask any questions. I don't know if you know all of the service overview dashboards that we have for each service.
B
We've obviously got web, git, api, et cetera, et cetera, and these are all generated from jsonnet using Grafonnet. So what I've done is I've added this little descriptor, which maps out the layout very briefly, just enough that we need for monitoring. It's not a full descriptor, but it explains, as much as we need for monitoring, how we're deploying things in Kubernetes.
B
So, in this case, this is the git service that we've got here, and we're saying it's got two deployments: one called gitlab-shell (hi Jarv) and then one called git-https. And there's a little bit of technical debt around this at the moment, because we actually match that up with a tag, not the name, and the name of these deployments is in many cases something like gitlab-webservice or webservice.
B
Yeah, and the reason is that it's very difficult, because there are multiple deployments in multiple services that are called webservice, for example. So the first thing we thought was, well, let's rename those, but that's obviously more challenging. So in the meantime, what we're doing is matching on a tag that we call deployment, and that actually goes on the pod, I think, if I remember correctly.
B
Yeah, yeah, that's right. So the label is called deployment, but mailroom, the last time I checked (this might have been fixed since I did this), certainly mailroom didn't actually have that label, and so when you go to the mailroom dashboard it's not really working yet. It'll be much nicer when we can just go on the name, but that'll take time, and it'll be less repetitive, but yeah.
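For reference, the kind of mapping being described might look roughly like the sketch below. This is a hypothetical illustration written as YAML for readability; the real descriptor lives in the jsonnet that generates the dashboards, and every field name here is invented.

```yaml
# Hypothetical sketch of a per-service Kubernetes descriptor, not the real jsonnet.
# It records just enough for monitoring: which deployments make up the service, and
# which pod label ("deployment") the dashboards currently match on instead of the
# Deployment name, since the names themselves are not yet consistent.
service: git
kubernetes:
  deployments:
    gitlab-shell:
      podLabelSelector:
        deployment: gitlab-shell   # label set on the pods, used in metric selectors
    git-https:
      podLabelSelector:
        deployment: git-https
```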
B
So anyway, for now it's really just a toehold for more things. But here we've got the git dashboard, and I've actually just pushed a merge request so we can start linking directly to these dashboards from the alerts, which at the moment you don't get to them, which has always bugged me, so I'm finally fixing that. So now we've got, over here, this Kubernetes overview, and if you open that up...
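As an aside, linking an alert straight to its dashboard is typically done through alert annotations. A minimal hypothetical Prometheus rule is sketched below; the alert name, expression, metric name, and URL are all illustrative rather than GitLab's actual rules.

```yaml
# Hypothetical alerting rule showing a dashboard link carried in the annotations,
# so whatever renders the alert (Slack, PagerDuty, etc.) can surface the link.
groups:
  - name: service-alerts
    rules:
      - alert: GitServiceApdexDegraded
        expr: gitlab_service_apdex:ratio{type="git"} < 0.995   # illustrative metric
        for: 10m
        labels:
          severity: s3
        annotations:
          summary: "Apdex for the git service is degraded"
          grafana_dashboard_link: "https://dashboards.example.com/d/git-kube/git-kubernetes-overview"
```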
B
It's actually at this level, it's aggregating everything to the cluster, so you can see we've got these three clusters, and at the moment it's just CPU, memory and network. The next thing that we'd like to bring online is the HPA stuff, but we need more labels for that, because of the same reasons, and I've seen (I haven't been following it closely) there's a lot of activity on an issue around that.
B
I think Jarv's looking at that, but we need more labels before we can do that. But...
C
Sorry, something else we're interested in is maybe limits and requests, like the total limits and requests for a particular service.
B
Yeah, yeah, we can totally do that. In fact, it might have just got lost in that first release, because there were so many things that were changing. Let me just show you what else we've got here. So basically (let me close all this stuff), if something's got Kubernetes deployments, we now have this little thing up at the top here that gives you... it's got a little...
B
I think it's called the wheel of dharma, but it's basically the Kubernetes logo emoji, and if I click on those two there's a container detail and a deployment detail. So we go into the container detail.
B
I'm surprised... I started doing it and I've... oh, I know why, I'll tell you why afterwards. So this is the container detail, and again it's very unfinished, but it's a toehold, right. What I was doing originally was plotting all the containers on here, but I just found that it's too much. It's too much information, there are like 100 series, and so instead we've got this kind of pretty quantile graph.
B
So it's like, you know, 99% of the containers are below 41%, and we've got that for memory as well. These little spikes at the bottom are obviously as a new container ramps up and comes online, so we get, you know, five or ten percent or less than that. And so that kind of gives you... here you can see the difference between the p99 and the p95.
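A quantile panel like the one described is usually backed by an expression along these lines. This is a hedged sketch, not the actual GitLab recording rules: the metric names come from cAdvisor and kube-state-metrics, but the join and in particular the `deployment` pod label (discussed above) are assumptions that may not hold in every setup.

```yaml
# Hypothetical recording rule: p99 of per-container CPU usage relative to its limit.
groups:
  - name: container-cpu-quantiles
    rules:
      - record: deployment:container_cpu_utilisation:q99
        expr: |
          quantile by (cluster, deployment) (
            0.99,
              rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[5m])
            / on (namespace, pod, container) group_left ()
              kube_pod_container_resource_limits{resource="cpu", unit="core"}
          )
```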
B
That probably tells me that there's one gitlab-shell container that's kind of in a mess at the moment. Now, one of the other things that I really want to add soon is a way to navigate from this (it'll probably be collapsed rows on this dashboard) down to the full, sipping-from-the-fire-hose detail, so you can see which container it is that's pinned up at the top. We don't have that yet, but it'll come, and, you know, there's so much more.
B
It's
kind
of
like
almost
an
endless
amount
of
work
with
this
stuff.
When
I
start
looking
at
it,
it's
just
like
whoa.
That's
why
I
keep
saying
it
it's
the
beginning,
and
then
this
is
at
the
moment.
This
deployment
detail
is
exactly
the
same
information
you
see
on
the
on
the
service
overview
page
and
it's
just.
B
That's right, so each row is one of the deployments. The first row is the git-https deployment, and the columns are CPU, memory and network at the moment, and then the next row down is the gitlab-shell deployment. We don't have the requests and limits, but we do have the... oh yeah, that's right. So we...
B
Oh, I know why this is: it's because it splits across three Prometheuses. I need to work this out, but it's probably because mostly saturation metrics are recorded in one Prometheus and this one is actually across three. But you can see here, this is based on the limits, so basically, memory-wise we're sitting at about 50%.
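The "memory against limits" number being described usually comes from an expression like the following. Again a hedged sketch using standard cAdvisor and kube-state-metrics series, not the actual recording rule; the `deployment` label is the pod label discussed earlier.

```yaml
# Hypothetical memory saturation rule: working set as a fraction of the container limits.
groups:
  - name: container-memory-saturation
    rules:
      - record: deployment:container_memory_saturation:ratio
        expr: |
            sum by (cluster, deployment) (container_memory_working_set_bytes{container!="", container!="POD"})
          /
            sum by (cluster, deployment) (kube_pod_container_resource_limits{resource="memory", unit="byte"})
```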
B
If this goes up to 100%, theoretically we would get alerts, but I'm guessing that because we don't have that recording rule, we're not going to get the alerts. I need to take a look into that. And what we've done there with this kube container memory, we can do exactly the same thing with the CPU as well; it's just a matter of getting it done. And then we've actually got the HPA one already, but the problem with the HPA...
B
So basically, if we're at the limit of the number of instances that the HPA can scale to, and we're at that limit for 25 minutes... sorry, we're over 90...
B
So if we're above 95% for more than 25 minutes, we will generate an alert. The problem at the moment is that we don't have the labels on the HPA.
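The alert being described, an HPA pinned near its maximum replica count for a sustained period, can be expressed with kube-state-metrics' HPA series. This is a hedged sketch with illustrative thresholds and severity, not the actual rule.

```yaml
# Hypothetical alert: the HPA has sat above 95% of its configured maximum for 25 minutes,
# meaning it effectively cannot scale out any further.
groups:
  - name: hpa-saturation
    rules:
      - alert: HPAScaledToMaximumForTooLong
        expr: |
            kube_horizontalpodautoscaler_status_current_replicas
          /
            kube_horizontalpodautoscaler_spec_max_replicas
          > 0.95
        for: 25m
        labels:
          severity: s3
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} has been above 95% of its max replicas for 25 minutes"
```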
B
Stuff's in flux, but I think we're making progress, and you can see here, this is one of the side effects of what we hack together when we don't have the correct labelling. So you'll start seeing these really nasty regular expressions and captures and stuff, and yeah, the reason is the lack of labels. So, you know, when we're done, hopefully there'll be no more nasty labels, like nasty regular expressions, in the label matchers. Cool, is that explained?
C
Beyond poking at the console or the CLI, we don't really have much. The only other thing I'll add, and I'm not even sure if this is relevant, but maybe when we're thinking of the drill-down screens: I'm wondering if a visualization of the number of pods we have versus the pod disruption budgets and things would be useful.
B
That is a useful thing, I guess. I can Google that and it'll be in the docs?
C
Yeah, yeah, it's just in the standard Kubernetes documentation. And the other reason I point it out is because we actually have broken pod disruption budgets at the moment across...
C
So it's not important... well, it's less likely to be important to an SRE diagnosing an issue, but when we're doing things like speed of deployments, upgrades, cycling nodes, things can potentially get stuck on a bad disruption budget and things like that.
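For anyone unfamiliar, a PodDisruptionBudget is a small object like the one below. The failure mode being described happens when `minAvailable` (or a value fixed in a chart) is equal to or higher than the number of pods actually running, so no voluntary eviction is ever allowed and node drains stall. This is a generic illustration, not the GitLab chart's actual PDB.

```yaml
# Hypothetical PodDisruptionBudget. If minAvailable equals the current replica count
# (for example the HPA has scaled the deployment down to 2), drains cannot evict
# any of these pods and upgrades stall.
apiVersion: policy/v1beta1      # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: nginx-ingress-controller
  namespace: gitlab
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx-ingress
      component: controller
```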
C
It stalled out for about an hour, and I realized, now I can't shut down, because I can't terminate this pod, because your pod disruption budget says that I cannot actually allow any pods to go down, and you've got pods on me that I cannot terminate. It's not a big issue, and I guess I bring it up because it affects me doing this upgrade work more than anything. But it did...
C
We need to make sure we're probably just tracking that as a simple metric somewhere, so we can spot these issues ahead of time.
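kube-state-metrics already exposes PodDisruptionBudget status, so "tracking that as a simple metric" could be a rule along these lines. Again a hedged sketch rather than anything that exists today; the threshold and severity are made up.

```yaml
# Hypothetical alert: a PDB has allowed zero disruptions for a long time,
# which is exactly the state that blocks node drains and cluster upgrades.
groups:
  - name: pdb-health
    rules:
      - alert: PodDisruptionBudgetBlockingEvictions
        expr: kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
        for: 30m
        labels:
          severity: s4
        annotations:
          summary: "PDB {{ $labels.poddisruptionbudget }} in {{ $labels.namespace }} is not allowing any evictions"
```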
C
Yeah, so it's the ingress-nginx one, and I wasn't sure (we're getting off topic here) if it was because it was during the load period when the horizontal pod autoscaler was going on or something, but it was telling me: your pod disruption budget for ingress-nginx is two and there are only two pods running. Maybe it was the canary deployment or something, and it just would flat out refuse to move. I'll...
C
Does it stall and wait? Well, kind of. So Kubernetes will go: I'm trying to drain the node, but to drain the node I delete this pod, and I cannot delete this pod because it violates your pod disruption budget. And then GCP will go: I can't drain the node and therefore I can't terminate it, so I'm just going to stop and sit here and do nothing.
C
And I think, as I said, I'm pretty sure the situation that I got caught out by was probably because the horizontal pod autoscaler had just scaled things down so much, and the pod disruption budget is fixed to two in the GitLab chart, or it was fixed to two, and yeah. So I'm pretty sure it's just a misconfiguration issue or something like that.
D
So we left yesterday with some changes that we were going to make. I'm still catching up, but it sounds like the first change was to change the health check for the Kubernetes cluster. Currently in HAProxy we're using /-/readiness for checking the health of the cluster, and we realized yesterday that this was...
D
This is not a good way to check the health, because what we see is, when we terminate pods, some requests go through to a pod that's in the process of being terminated, which will then fail the readiness check, and that could potentially bring the cluster out of service completely. And when we look at the graphs, when we look at Prometheus, we do see that clusters are occasionally being marked as down, so this is not good. My suggestion was to change it from readiness to health.
D
This is a very old health check endpoint that we used to use; I think it was the first one we created. There are three of them: readiness, health and liveness. That was rolled out on staging, and then from what I saw on the issue it started to be rolled out on production, and then we saw the problem on canary, so we rolled it back. So I need to figure out what's going on with that, we...
D
I can do a little demo here of just the logs. I wanted to take a look at this on staging now that...
B
It wasn't very good, but I don't know if they've... if...
B
...still be able to get... so what, what...
D
To create a visualization I have to select, I don't know, like area or something... whatever, line, yeah. Then I have to select the... this is, like, infuriating, and...
D
Right, so this is a story about: we have one pod and we're seeing a bunch of requests. What I wanted to do was just show when we start seeing 502s from this pod and when the health check starts failing. To do that, we're going to have to split this out by filters, so we'll do a split chart, starting with a filter, and the first filter is...
D
I had it over here already, so we'll just look for info/refs, and then the other one will look for readiness.
D
Okay, so we have info/refs at the top, readiness at the bottom, and then let's do a split series by status.
D
This is staging, so there isn't as much going on here. The first thing I see is that we have some 401s, and then we get a little 502 at the end, and this is really what we want to avoid. This is a user-facing 502, and when we see this on prod it looks even a bit worse. Now, I don't know whether we have made any of the nginx configuration changes in staging yet, I haven't fully caught up. Yes...
C
But
we
have
to
change
those
values
of
the
like
how
long
it
keeps
the
connection
open.
D
Okay, so this line used to be longer, so maybe this is good. Let's try to change the scale here, if we can.
D
So we're going along, we're processing info/refs, and then we get a little blip of a 502. This is when the readiness check starts returning a 503.
D
This means that Puma has received a SIGTERM, and then we're in the downtime... not the downtime, the grace period, the blackout window. So that started at 8:03, and then we get a 502.
D
What we don't really expect is to see any of these readiness check failures at all, because what we hope happens is that Kubernetes switches a pod to terminating, and as soon as it does that, we stop routing requests to it. What we see instead is readiness checks going down into the pod and returning a 503, even though the pod is in the process of being terminated.
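The gap being described, where a pod keeps receiving traffic briefly after it has been told to terminate, is commonly papered over by keeping the process serving for a few seconds after the endpoint is withdrawn, typically with a preStop sleep and a generous grace period. Below is a generic sketch of that pattern; it is a hypothetical deployment fragment with made-up port numbers, not the GitLab chart's actual configuration.

```yaml
# Hypothetical pod spec fragment: give endpoint controllers and load balancers time
# to notice the terminating pod before the application stops accepting requests.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: webservice
      readinessProbe:
        httpGet:
          path: /-/readiness
          port: 8080
        periodSeconds: 5
        failureThreshold: 2
      lifecycle:
        preStop:
          exec:
            # Sleep briefly so routing is updated before SIGTERM reaches the process.
            command: ["/bin/sh", "-c", "sleep 15"]
```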
B
Jarv, just to help me understand this a little bit more, would you mind changing the time in the date histogram to start time? It'd be interesting, for those 503s that we see at the top there, what time they started. So in the date histogram, you can see "field", next column down, and where it says json.time, that's the time that the log happened, which was at the end of the request. Sometimes the start time is much...
B
I
thought
workhorse
did,
but
maybe
it
doesn't,
we
should
put
it
on
if
it
doesn't,
because
it's
super
helpful
for
this
kind
of
stuff,
because
then
you
know
if
that
request
was
stalled
for
like
five
minutes
or
whether
it
was
dripping
in
like
during
the
shutdown
phase
right.
But
if
we
don't
have
it
now,
we
don't
have
it.
I
mean.
B
I
think
rail
I
mean
giddily
definitely
has
it
so,
but
I
I
find
it
super
useful
for
this
kind
of
stuff
but
check
on.
Let's
put
it
into
we
should
we
can
just
put
it
into
log
kit
now
and
then
everything
will
get
it,
which
is
yeah.
C
I'm still trying to wrap my head around this. My understanding, which could be wrong, is that between when a pod is marked terminating, when the signals are sent to it, and when it gets removed from the service, there are no guarantees in the ordering. So I don't see how we're ever going to avoid at least seeing a little bit of 503 on the readiness check.
D
Yeah, so the next thing to check would be to look at the kubelet logs to see what exactly happened.
C
Or even kube-proxy as well.
C
Because what we're saying, correct me if I'm wrong, is that for those little dots at the end, the 500s at the end, the pod is literally actually gone by then, and then...
B
So, just help me understand: do we have some sort of switch in Workhorse, so that once it gets a SIGTERM or whatever, it will switch over and give 503s for the readiness check? Is that right? Correct.
B
We can look at the user agent; we can go to the Discover section and actually look at what the user agent on those readiness requests is, if that helps.
C
Yeah, I'm just trying to mentally map how, because there's nginx in the middle of this, right? You've got the standard Kubernetes service, but then you've got nginx, and it sounds like even things higher up than the nginx proxy are trying to call readiness. Is that correct? Is that HAProxy?
D
Yeah, we can take a look at the user agent. We did have, like I mentioned at the beginning, HAProxy using this readiness endpoint as well, and I think that was a mistake, right. But one thing I'd like to show here, since we've already made this configuration change on staging: the pod started stopping at 8:03 and we were successfully sending 200s all the way up to 8:07.
D
I'm worried that we also may be filtering it, because we are excluding a lot of logs to save money. Fair enough. Is it a container log? Because I think all container logs are being excluded.
C
Anyway, we should try and see what we can glean from that, because that should have some messages about when the endpoints are updated, like "I'm updating my node to remove that pod from being routed to". But that's only for new connections; for existing connections there are always iptables rules that say, if it's data for an existing connection, I will always route that data.
D
And when I did this testing, I definitely saw the endpoint being removed. I was looking at the service, and the IP address for the pod was being removed from the service endpoints, but we were still able to send messages through the nginx ingress. And what we observed yesterday was that if we just did an nginx reload, then the messages would stop, yeah. So it's like nginx was holding on to connections to the pod somehow, even though it was removed from the service endpoints.
C
I called it out, and when you were talking about nginx reloading it reminded me: the default configuration is that it takes the endpoints and syncs them into the nginx pods' configuration using Lua, and then every...
D
Even
with
it
turned
on,
he
said,
nginx
doesn't
actually
reload,
there's
some
other
code
in
the
load
balancer
that
updates
those
endpoints.
But
I
I
didn't
look
into
it,
but
I,
I
suspect
that
there's
something
happening
on
nginx
when
we
don't
have
this
direct
service.
Endpoint
enabled
that
that
that
prevents
this
from
happening.
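For context, the "direct service endpoint" behaviour being discussed sounds like ingress-nginx's service-upstream mode, which proxies to the Service's ClusterIP instead of the individual pod endpoints. If so, it is switched on per Ingress with an annotation like the one below; this is a generic example with made-up names and ports, not GitLab's actual Ingress objects.

```yaml
# Hypothetical Ingress using ingress-nginx's service-upstream mode:
# nginx proxies to the Service ClusterIP and lets kube-proxy pick the pod,
# instead of load-balancing across pod IPs itself.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitlab-webservice
  namespace: gitlab
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  rules:
    - host: gitlab.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitlab-webservice-default
                port:
                  number: 8181
```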
C
Yeah, exactly, that's what I'm trying to say. I think what happens is it will reload, yeah.
C
Nginx
it
will
go.
It
will
see
that
the
endpoints
to
change
and
reload
the
configuration
which
will
probably
drop
the
connections-
that's
bad.
If
you
have
a
lot
of
pods
like
like
thousands
of
pods,
because
you,
like
you
know
every
time
the
pod
changes
you're
reloading
nginx
constantly,
but
I
think
for
us
it
might,
or
all
I'm
wondering
is
I
put
in
that
comment-
is:
have
we
traded
one
set
of
problems
for
another?
By
with
this
annotation?
Have
we
fixed
one
set
of
problems
but
introduced
another.
D
To get attention on the nginx upgrade, I would really like to rule it out. It's just very difficult for us to test. Maybe I can come up with a way to test this and reproduce it outside of our chart.
D
Yeah, that's a possibility. You know, I did try just changing the version myself and it didn't work because of configuration problems, but we would have to, yeah, create our own config map and...
C
Deploy it separately, yeah. You can do a whole other Helm release of it that could even be sitting in a different namespace, if you really wanted to keep it separate, but just pointing across to the right service. And then testing that, and then you could just change the IP: I'm going to point to the chart's nginx, or I'm going to point to the other one.
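Standing up a second ingress-nginx release for testing, as suggested here, mostly comes down to giving it its own namespace and ingress class so it doesn't fight with the existing controller. A hedged sketch of Helm values for that follows; the release name, namespace, and the assumption that the upstream ingress-nginx chart is used are all illustrative.

```yaml
# Hypothetical values for a second, throwaway ingress-nginx release, used only to test a
# newer controller version alongside the existing one.
# e.g. helm install nginx-test ingress-nginx/ingress-nginx -n nginx-test -f these-values.yaml
controller:
  ingressClass: nginx-test        # only Ingresses with this class are picked up
  replicaCount: 2
  electionID: ingress-controller-leader-nginx-test
  service:
    type: LoadBalancer            # gets its own IP, so test traffic can be pointed at it manually
```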
C
Maybe there's another setting or annotation where it will do that, that we've missed, or something.
D
Yeah, I was looking; I wasn't able to find anything. Okay, okay, well, that's sort of where we are today. What I'm probably going to do is look to see what happened with the other health check, and yeah, we'll see, but it sounds like we may need to put some pressure on to get the nginx ingress upgrade done sooner rather than later.
A
I suppose the only other thing that might be useful for you to know about, Graeme, is the update on traffic splitting. I guess that's relevant to everyone.
D
Yeah, the update on traffic splitting is that it looks like it's moving along. The last update I saw from Skarbek was that he's pretty happy, and the MR has moved to maintainer review, so I think we're in good shape there. I'd like to try it myself; I haven't been doing any of the testing myself yet, so maybe I'll have time to do that today.
A
Awesome, and yeah, those are pretty much our big ones. Like Pages, we're still waiting on; I think all the others are in progress.
D
The
nginx,
the
nginx
issue,
the
like
high
availability,
no
downtime
upgrades
for
engine
x,
there's
one,
mr
that
I'm
that
should
be
able
to
get
reviewed,
hopefully
in
the
next
day
or
two,
and
I'm
still
still
haven't
rolled
that
out
to
production.
This
is
ensuring
that
nginx
drains
connections
before
recycle
pods,
but
that's
running
in
staging.
It's
been
running
staging
for
a
few
days
now.
So
I
think
we
can
pull
that
out.
A
Cool,
okay,
cool
anything
else,
anything
you
want
to
run
through
graeme.
A
Cool
one
thing
we'll
cover
in
the
european
type
demo
later
is
chat
about
helm
three
and
how
we
prioritize
that
so
skelbex
put
an
update
on
the
issue
with
what's
left
to
do
so.
Hopefully
it's
just
a
case
of
scheduling.
C
Yeah, I'm wondering now... the regional cluster is the regional cluster, but with the zonal clusters, especially if we do it during a quiet period: could we just drain the whole cluster, do a helm delete of the Helm 2 release and then just do a helm install of the Helm 3 release, and then add it back in, rather than trying...
D
I don't know. We could maybe turn it down slowly and let it scale up. Another option is that we just create new clusters.
D
But maybe that's another option, and then we can just switch over to them. The issue there is that we'd have to come up with new names, which would suck, because I'm pretty happy with our short names right now, so I don't think I'd want to change them.
C
I guess I'll read the notes at some point. Oh, and I'd like to understand how bad Skarbek thinks the upgrade process is, because at the end of the day, Helm data is just stored in secret objects in Kubernetes, you know, so you can just delete them by hand and make them go away, or like, yeah, there's nothing...
A
Cool
okay,
so
yeah
we'll
see
if
we
can
make
a
plan
on
that
one
and
schedule
that
in
like
seems
like
a
good
time
to
try
and
do
it
awesome.
Is
there
anything
else
anyone
wants
to
cover
nope
nope.
C
No, just a quick note from me: I'm actually on PTO for the next three weeks. I think I've still got some bits and pieces I'll wrap up tomorrow and stuff, but yeah. So unfortunately I won't be around to help, but I'll be back on the week of the 28th or the...
B
I hope it's quiet and uneventful here.
A
Well,
enjoy
it
we'll
get
we'll
put
an
update
on
the
helm
issue
today
in
case
you
want
to
catch.