From YouTube: 2020-12-03 GitLab.com k8s migration EMEA
A: I will upload the video from this morning as soon as I can. There's nothing too current in it; we spent quite a bit of time reviewing Andrew's dashboards. He gives a demo of the dashboards he's been working on, so it's an interesting video to watch to see those. And then we had a bit of a chat about the 502 errors, and I expect the latest on that one will end up on the issue. We also had a brief chat about Helm, which we can wrap up in this call. So there's nothing missed, but it is a worthwhile video.
A: Also, Graham is finishing up; today is officially his last day. He's back just after Christmas (he's working the week between Christmas and New Year), but otherwise he's out for three weeks. He said he'd check in on the Helm issue before he finishes, but if there's anything else we need from him, get it over to him sooner rather than later.
A: Interesting one today, the incident, because one of the first indicators was the deployment Apdex. Not an alert, but an indicator, a warning.
A: Yeah, but there was nothing when we first looked into it. I was like, no, it looks mostly fine, there's just a little bit of latency, so it's kind of...
A: They come periodically through the day, but this was the first one where I was like, yeah, interesting.
A: Yeah, oh okay! So let's get started, let's go through the blockers. First up is: implement a pattern to allow traffic splitting.
B: I have a long, long thread about the testing work that I've completed, but for the most part it's working okay. I think there are just a few environment variables that can be plugged in, and then making sure that the images change appropriately when auto-deploy runs. I think those are the last few items on my list of things that need to be checked up on.
D: Maybe we should start writing up what the migration plan is going to look like. I think we discussed that we wanted to do WebSockets/Action Cable first, but I think we'll probably have to migrate both, right, because we're going to have a path for Action Cable and then everything else.
D: So perhaps we just wrap this up with: we're now going to be creating a new service for WebSockets and Action Cable, and we'll create a readiness review for that which includes the traffic split.
D: I think we'll do it together. I know that Marin really wants them to drive it, so I'll talk to them first and get their input, but I did talk to Marin earlier this week and he said he's fine if we have one readiness review covering both the infrastructure work to do the split and Action Cable.
A: Cool, okay, sounds good. So the next one we've got is the structured logging. Geoff, do you want to give us a quick update?
D: Yeah, evidently it's done and merged, and then it broke production, so we reverted it. Even with a bulletproof canary, there's a lot of special consideration we're going to need to make here, because we're now wrapping all of the log messages in JSON. This caused all sorts of problems with Elasticsearch, which suddenly sees these wrapped messages and wasn't happy about it because of the field mappings.
D: I also had a comment: I noticed that we're pulling this thing from GitHub, from the community; this is a community contribution. This is item number seven, or B7.
D: Okay, so are we sure we're okay with depending on GitHub? What happens, for example, if this thing isn't able to build, or GitHub has an outage? Will our CNG pipeline fail? Probably, right?
D: It's so specific to us. Yeah, I'll bring it up with distribution to see what they say; maybe they already planned to have this person maintain it for us, which is free work. I'm just worried about what we'll have to do. For one, when we wrap these messages we may want to differentiate between structured logs and unstructured logs, maybe even having two keys in the JSON wrapper to say: this is JSON, this is not JSON, or unstructured.
C: It would just be flattened? I mean, if they've done it, then we'll just have to deal with it in fluentd, right? Yeah.
D: It still goes through fluentd, and we can do whatever massaging we want in fluentd; it's just expensive, and I would prefer it if we didn't have to. Actually, I think you're right: if it's unstructured, maybe we should just have a bare unstructured key, or...
D: This part? No, this is the actual log file, yeah. So it looks like that, okay, but this extra bit, for example this top-level log key, seems unnecessary. I completely agree.
C: What about... I don't see a pod ID or container identifier in there.
D: Let's talk to distribution to see what they say. Ideally, what I'd like is... well, I think this component field is useful, so maybe we keep that. Why do we need an extra date? Maybe that's unnecessary, I don't know; or maybe it would be handy to have. But what is this date anyway? Is it when the wrapper consumed the log message and wrote it out? I don't know if that's...
D: And they're very close, but obviously a little bit delayed; the date is a little bit after the time, yeah. So I don't think that's useful. So: component, keep. Sub-component, I don't know. Level, probably not. File, maybe not. And then...
D: Well, I...
D: We do it in fluentd and have to look, that's...
D: Okay, I'm not sure exactly where this is done, but I'll follow up with distribution. I think we want to make sure that we don't have to unwrap.
C: The two things that are deal-breakers to me: I think that it should just be JSON, and not a JSON-encoded string, because... someone on Twitter was laughing the other day about what percentage of the world's compute capacity is spent encoding and decoding JSON.
C: And this is a classic case of that; I'm sure we don't need to do that. The other one, which is just about reducing bugs: for raw unstructured logs we have msg as the message, and for structured logs... we don't want to check whether message is an object or a string, because that'll just lead to more bugs, right? People going, oh, I didn't realize it could sometimes be a string.
C: You know, I looked at a thousand of them and they're all objects. So having that, to me, seems wrong. It would be much better if log.msg was the unstructured log and log-dot-all-the-things was structured. Should I write this down? Because I know I'm not being very clear.
D: This log wrapping happens right now; this is not something that was added, so I think we'll have to look at what exactly the logger is doing. But I think it's actually just taking the raw message and stuffing it into this message field, and sometimes this message field is a raw string and other times it's a JSON object. So I think you can ignore the first level of escaped JSON; that's not the problem. It's the second level here, under message, that you would have to unwrap again in fluentd.
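
A minimal sketch of the split being proposed here, assuming hypothetical key names; the real schema still has to be agreed with distribution. Structured payloads stay as real JSON under `log`, and raw text gets its own `log.msg` key, so consumers never have to type-check a `message` field:

```python
import json

def wrap(raw_line: str) -> dict:
    """Wrap one log line so structured and unstructured payloads land
    under different keys instead of overloading a single `message` field."""
    try:
        payload = json.loads(raw_line)
        if isinstance(payload, dict):
            # Structured: keep it as a JSON object, never a JSON-encoded string.
            return {"log": payload}
    except json.JSONDecodeError:
        pass
    # Unstructured: raw text under a dedicated key.
    return {"log": {"msg": raw_line}}
```

With this shape, fluentd never has to unwrap escaped JSON a second time, which is the expensive massaging mentioned above.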
A: Sure, let's skip down to D, no-downtime deploys. Uh-huh.
D: Yes, for the NGINX ingress controller I think I'm just waiting on a review for this. Scarbeck, I saw your comments; did you see my response?
D: Okay, I do think we want this change. Your comment was: why does the grace period matter, because the pod gets terminated immediately, right? But the change that we merged a couple of days ago is this pre-stop hook that runs in a loop and basically ensures that NGINX doesn't terminate until all active connections are finished. That's why we need to extend the grace period: I think this may even take longer than 60 seconds if we have someone cloning a very large repository.
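
The merged change is a shell pre-stop hook in the ingress controller; here is a rough Python rendering of the same loop, assuming a hypothetical local stub_status endpoint, to show why the grace period has to outlast the longest in-flight request:

```python
import re
import time
import urllib.request

# Hypothetical local NGINX stub_status endpoint; the real controller
# exposes its status on an internal port.
STATUS_URL = "http://127.0.0.1:10246/nginx_status"

def wait_for_drain(poll_seconds: int = 1) -> None:
    """Loop until NGINX reports no active connections besides this probe.

    The pod must not terminate while requests (e.g. a clone of a very
    large repository) are in flight, so terminationGracePeriodSeconds
    has to exceed however long this loop can run.
    """
    while True:
        with urllib.request.urlopen(STATUS_URL) as resp:
            body = resp.read().decode()
        # stub_status's first line looks like: "Active connections: 3"
        active = int(re.search(r"Active connections:\s*(\d+)", body).group(1))
        if active <= 1:  # only the status request itself remains
            return
        time.sleep(poll_seconds)
```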
A: Cool, sounds good. And then, just to loop back, Andrew, on the dashboards and things: is there anything you need? We've got the recording from earlier where we got the walkthrough. Is there any action you need us to take for you?
C: I haven't been paying close attention, but the labels on the deployments are the next thing that I need. I'm aware that there's been a lot of discussion on that with Jason, and I haven't been paying super close attention, but once we can get the same sort of labels onto the HPA (sorry, onto the deployment; sorry, onto the HPA is what I meant to say), then we can start adding a whole lot more information with that. So that's kind of what I'm blocked on there.
C: When I looked at the charts, the HPA does have a custom-labels thing, but I don't know how easy it is to put things on there.
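
For illustration, a sketch of propagating the deployment's labels onto the HPA with the official Kubernetes Python client; the names and label values here are placeholders, and in practice this would likely go through the chart's custom-labels value rather than a manual patch:

```python
from kubernetes import client, config

def label_hpa(name: str, namespace: str, labels: dict) -> None:
    """Patch the deployment's identifying labels onto its HPA so
    dashboards can join metrics across both objects."""
    config.load_kube_config()
    autoscaling = client.AutoscalingV1Api()
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(
        name, namespace, {"metadata": {"labels": labels}}
    )

# Placeholder names and labels; the real label set is the one under
# discussion with Jason.
label_hpa("gitlab-webservice", "gitlab", {"stage": "main", "type": "webservice"})
```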
D: I was hoping to get to these already, but yesterday was a complete waste because of all the investigation we did into this performance issue, these errors on Git HTTPS.
A: As for Pages, I'll see if I can find out; there's nothing on that epic, so I'll go off and get us an update for that one. But it's progressing along, hopefully.
A: So one thing I would like to cover, to add to the agenda, is Helm 3. It would be great to be able to upgrade before we do the next service.
B: So the last time we tried to perform an upgrade, we ran into a blocker with immutable fields, where Kubernetes would not let us perform the upgrade, because there are certain things that you cannot change in deployments, and services for that matter. It doesn't look like we attempted to address this.
B: It looks like we just documented the situation and informed our users that it's easier to delete the object, perform the upgrade, and recreate the object, which in itself will create a disruption in traffic. So what are our options if this continues to be an issue? I don't know if there have been changes to the way Helm operates in newer versions that make this any easier.
B
But
if
this
does
continue
to
be
in
a
situation,
testing
will
tell
us-
and
we
could
you
know,
as
gray
mentioned,
we
could,
you
know
completely
write
up
a
cluster
and
replace
it.
B: Either of those options will require us to do some tooling upgrades, so that we can try to keep from blocking auto-deploys and patches and such. We'll need to make sure our tooling is able to handle both Helm 3 and Helm 2 at the same time, and we'll also need to thoroughly test the upgrade path to determine the best method to go about upgrading things.
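
A sketch of what "handle both at the same time" could look like in deploy tooling, assuming helm2 and helm3 binaries installed side by side under distinct names; this is illustrative, not the team's actual tooling:

```python
import subprocess

# Releases that have already moved to Helm 3; everything else still goes
# through Helm 2. Both the set and the release name are placeholders.
MIGRATED = {"gitlab-websockets"}

def upgrade(release: str, chart: str) -> None:
    binary = "helm3" if release in MIGRATED else "helm2"
    # Immutable-field errors (e.g. a Deployment selector) will still fail
    # here; those objects have to be deleted and recreated, which is the
    # traffic disruption described above, so schedule such upgrades
    # alongside taking traffic off the cluster.
    subprocess.run([binary, "upgrade", release, chart], check=True)
```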
A: So how do we want to go about it? What do we need to do to get to a plan where we know the steps we'd want to take and how we want to handle those things? Then we can work out scheduling.
B: It's been since April-ish that I did this, so it's been a long time and I don't remember everything. If we thoroughly vet the testing, we'll have a better idea of what we need to accomplish, as far as what changes need to be made to our tooling and whether we want to destroy clusters or just take traffic off the clusters.
A: Cool, okay. Let's find a time to do that quickly, so that we can actually see where we are with this one. Okay.
D: I think we should probably catch up, now that Scarbeck is here, on what the next steps are for Git HTTPS and the cluster problems we're having.
D: So this is funny, because this was my recollection as well, but then I went to our docs, and it's totally the opposite: it does not verify the database or other services. So I was like, okay, health must be the most lightweight check that we have, right? But I guess it doesn't perform under load, which makes me very suspicious.
C: Oh, I mean... no, I suppose you have to go to Puma for the health check. You can't just terminate it at Workhorse, because if Puma is really sick, then we want to take the cluster out, so we always want to... Although, actually, are we absolutely sure that the health checks are proxied to the backend in Workhorse? Because...
C: I don't know, because I know that they've got their own routes in Workhorse, which...
D: I don't know whether health goes to Rails, but given that we saw Puma crashing when we switched to this health endpoint, I assume it does. Okay, well, I'm a bit lost, because I have the exact same memory that you have, which is that we got off of this health check because it was too heavy, and it...
D: Well, I think we have to start... I did some siege load tests on staging against these, and it does seem like liveness is the lightest of the three.
C: So one of the things you could do is use those chaos endpoints. You could put a little script together that sends a request to the chaos sleep endpoint, say sleep for 30 seconds or however long, then send a SIGKILL to Workhorse, and then the next thing that comes along should be a 502, right?
C: And the one after that again should terminate at Workhorse and not go all the way to the backend, right? Because after those seconds it pulls up the drawbridge: nothing else is coming through. But if we could kind of...
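
A sketch of the test being described, using GitLab's chaos endpoints; the host, token, and process lookup are placeholders, and the expected status codes are the hypothesis under test rather than known behavior:

```python
import subprocess
import threading
import urllib.error
import urllib.request

BASE = "https://staging.example.com"  # placeholder host
# GitLab's chaos sleep endpoint; the token value is a placeholder.
CHAOS = BASE + "/-/chaos/sleep?duration_s=30&token=CHANGEME"

# Hold a slow request open so a connection is mid-flight when Workhorse dies.
threading.Thread(target=lambda: urllib.request.urlopen(CHAOS), daemon=True).start()

# SIGKILL Workhorse on the node under test (placeholder process lookup;
# this part has to run on the node itself).
pid = subprocess.check_output(["pgrep", "-o", "gitlab-workhorse"]).decode().strip()
subprocess.run(["kill", "-KILL", pid], check=True)

# Hypothesis: the next request 502s, and the one after that is cut off
# at the edge instead of reaching the backend.
for _ in range(2):
    try:
        urllib.request.urlopen(BASE)
        print("200")
    except urllib.error.HTTPError as e:
        print(e.code)
```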
D: But the requests that go through Workhorse will return a 503 for the readiness check, while everything else will be successful.
D: The problem is that we're using the readiness endpoint as the health check for HAProxy, which means that the whole cluster is being dropped out of the backend as soon as we terminate a pod, because some of these requests are actually going to terminated pods and we're seeing 503s.
D: I don't want to repeat... you know, I don't want to create another incident. We need to somehow put this thing under load and test it.
C: So what you're kind of saying, or maybe the most resilient way, is if the health check from HAProxy to the cluster is totally...
D: Yeah, but the thing is that right now we have our training wheels, which is the Git fleet set up as backup servers. I need something that tells me that the cluster is unhealthy, at least until we take those Git servers out of the HAProxy backend, because I want to be able to use them as a fallback. So I need something that tells...
D: Yeah, but okay, so we know that when you send the signal to terminate a pod, there will be a small window of time when requests will still go to that pod; it's just eventually consistent, right? So the readiness check protects us against that, because the readiness check returns a 503 while we're still able to process requests, and we have this configurable blackout window, which is set to two minutes.
D: So I think it's okay that some requests are leaking into these terminating pods, because they're mostly successful. The problem is that the health check returns a 503 during this time. So we need to come up with another check, yeah, one that's always going to return...
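
One possible shape for that other check, sketched as a hypothetical sidecar endpoint for HAProxy that treats a draining pod as still up and only fails when the pod genuinely cannot serve; this is an assumption about what such a check could look like, not an existing endpoint:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

DRAINING = False  # set when SIGTERM arrives; the pod can still serve
BROKEN = False    # set when Puma genuinely cannot take requests

class LBCheck(BaseHTTPRequestHandler):
    """Hypothetical HAProxy-facing check, separate from the k8s readiness probe.

    Readiness returns 503 for the whole two-minute blackout window while a
    pod drains; if HAProxy watches that, one terminating pod can drop the
    entire cluster out of the backend. This check only fails when broken.
    """

    def do_GET(self):
        if self.path == "/lb-check":
            self.send_response(503 if BROKEN else 200)  # draining still counts as up
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8090), LBCheck).serve_forever()
```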
C
So,
like
I'm,
just
like
here's
a
failure
scenario
which
so
say
you
have
one
pod
which
is
broken
for
some
reason,
and
it
is,
you
know,
pumas
in
a
broken
state
and
the
request
is
getting
workhorse
is
accepting
but
puma's
not
accepting
it's
dead.
The
sockets
died
and
you
know
after
30
seconds,
it's
timing
out
with
a
503
or,
however,
you
know
after
a
higher
amount
of
time,
and
a
series
of
requests
come
in
from
h.a
proxy
through
nginx
and
they
hit
this
and
they
all
come
back
with
a
503.
C: At that point, because of one pod out of n, HAProxy removes that entire cluster, one of the three clusters, from the backend.
D: No, I don't think you're misunderstanding. I think our assumption was that this window was very short, like by the time... and...
D: Yeah, so if you have a single pod, I guess our assumption was that HAProxy has to fail three checks, so given that you're going to hit random pods for those three checks, that wouldn't happen. But you're right, maybe that's not a chance we should be taking.
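
Back-of-the-envelope numbers for that assumption: if each of HAProxy's three consecutive checks lands on a uniformly random pod, the chance that all three hit the one broken pod out of n is (1/n)^3. Keepalive pinning checks to the same connection breaks the uniform-randomness assumption, which is the risk raised next:

```python
# P(three consecutive health checks all hit the single broken pod),
# assuming each check lands on a uniformly random pod out of n.
for n in (3, 10, 30):
    print(f"n={n:2d} pods -> (1/n)^3 = {(1 / n) ** 3:.6f}")
# n=3 -> 0.037037, n=10 -> 0.001000, n=30 -> 0.000037
```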
D: Yeah, I see your point; I think you're right. So let's see what our options are there, and if we don't have an option then maybe we should just remove the health check altogether.
C: Yeah, but also, that situation could occur during a rollout, like a deployment, where there might be more than one machine shutting down, and then, just because there's a high prevalence of 503s coming back, you might get a situation where you get three in a row.
D: Yeah, but in the sunny-day scenario I was hoping that we'd never see any errors, because we have this blackout window period, right, which should be long enough that Kubernetes doesn't route any traffic to a terminating pod after two minutes. But apparently it does. And this was the problem we found with the NGINX controller yesterday, where it appears that NGINX keeps these connections open to pods that are in the terminating state, and the only way around it is to reload NGINX.
D: It has two options; we use the option where it just sends traffic to the service endpoint, which is one IP address and one port, and the service endpoint does the routing to the pods.
D: To the service IP. The problem that we think we discovered yesterday was that, because of keepalive, even though a node was removed from the service, the existing connections are still active, and we would still get requests going to these terminating pods. So we adjusted the keepalive, and from what I've seen so far it doesn't help as much as we thought it would. We have to play with that more, but...
D
I
don't
know
man
it
feels
like.
Maybe
we
need
to
get
rid
of
engine
x.
B: Jarv, I don't know what you have tested today, but I think maybe we should concentrate more on this keepalive stuff. I also found another option that we could set in our ingress controller to...
B: Okay, during a short conversation with Matt Smiley yesterday in our DNA meeting, I was talking about how we adjust the number of keepalive connections, or the requests that go through them.
D: Yeah, I think it's really easy to see this on staging, because we have enough traffic from the traffic generator that we can kind of see what's happening. What we see right now is not a lot of 502s, but we do see a lot of 200s after the termination signal is sent and the readiness check is failing.
C
So
my
guess
about
it
being
an
nginx
plus
feature
seems
to
be
correct.
Are
you
serious?
I
I
I
might
be.
I
mean
it's.
The
first
thing
I
found,
but
it's
definitely
a
plus
feature
what
I
found
the
active
health
checks.
They
call
it
we'll
see.
Maybe
it's
a
different
thing.
B
So,
let's
keep
working
on
the
keep
alive
stuff,
let's
tune
that
more.
Maybe
I
could
do
some
math
to
figure
out
better
appropriate
values
for
keep
live
settings
and
maybe
also
try
out
that
other
configuration
where
it'll
retry
a
502.
D: What do you think about being as severe as possible with the keepalive on staging? Let's just remove keepalive altogether and see if that actually makes this window either go away entirely or get really short, and then we can work up instead of working down.
B: I like that idea. Yeah, let's try.
D: I don't know, but it's such low traffic right now that it's not our biggest problem, and maybe once we have the WebSocket traffic segmented off into this other service, we can worry about it. Yeah.
C: No social-distance holiday.