From YouTube: Consul discussion gitlab-org/gitlab#271575
Description
Skarbek and Jason Plum investigate an event in which Consul misbehaved on some queries, and discuss options for continuing the investigation and possible solutions to the issue.
Reference: https://gitlab.com/gitlab-org/gitlab/-/issues/271575
A: You have a series of error messages. You can get an actual failure to connect, right: you connect and get "no route to host", or a connect timeout. Then you get the "no response", which is where literally nothing came back across the pipe. And then there's effectively a third, which is "no answer was given".
B: So it looks like, in your latest message, no route to host, and that could be a few things, right? That would require us to dig through logs to find a correlation with a scaling event or a pod rotation event. The other one that it seems you want to concentrate on is...
A: Right. Based on what I saw (noting that I watched and read the notes from the call this morning, yay double speed), what I'm seeing is that, on occasion, Consul has a message that the client disconnected.
A: So, the message I have from 10 a.m. Eastern this morning points out that there are the two types, and the one that we really care about is the "no response from nameservers list". There was an instance where we were getting more than one actual error, but we were swallowing it. We've since fixed that, so now we get a "no response from DNS servers" when we try to look up the DNS server that we're supposed to be asking. So that problem is out of the way; we can identify that now.
A: That means that we connected, and then there was no response in time, and it cut the connection, because it was like: somebody should have responded by now.
A: There are very rare "no route to host" errors; those appear, based on their sparsity, to be either pod scaling or node scaling. Almost always, what I'm seeing is that the "no route to host" errors come from trying to reach out and hit the actual DNS server, like kube-dns, and it goes poof, it goes away. I need to verify this, but that appears to be what I'm seeing: I'm not seeing it get a "no route to host" connecting to a service IP; I'm seeing it get a pod IP.
A: Something a little funny for their kube-dns autoscaler? I don't think that's us, because we would have a consistent IP address, and that is indeed where all of the messages that we're going to talk about here in a second come from: one locked service address.
A: Okay, so almost all the time it's like, poof, something went away. And, as a general rule of thumb, a service, so long as the service is present, will have a service IP. No route to host? You'll get a route to that host. What you won't get is a connection: the connection will time out trying to connect. You'll still get a connect error, but it won't be "no route to host".
A: So, on my hunch that the "no response" means there was actually a timeout somewhere, I went looking in the logs. I took the same base selector and time window, looked for anything with "timeout", and then realized: oh look, there's a very nice, clean message written to standard out, "not responding within TCP timeout". So I'm going...
A: ...through, you know, the last day with this message, and I just kept cutting it down to the smallest window, to show replication of the events. What we end up seeing is that every time we get the message "no response from nameserver list", you can see the underlying net-dns trying to connect: the timeout expires, it tries again, tries again. And I'm like, bingo, yeah. Now at least we have a point of correlation that we can try to use.
A: Right, so that's the finicky part here, because the way we're doing it, our GitLab database load balancing resolver only pulls the first answer. Now, in theory, that should be the service IP address.
A
There
are
ways
with
cube
dns
to
make
a
request
and
get
all
of
the
ip
addresses
of
all
the
services
currently
backing
the
pod,
and
then
you
can
walk
down
it.
We've
seen
this
in
some
experience
in
dealing
with
nginx
yeah
right,
you
get
the
end
points
back
instead
of
a
particular
individual,
ip
or
the
service.
I'd
be
doing
the
proxy
result.
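For illustration, a minimal sketch of that endpoints-style lookup, assuming a hypothetical headless Service: with clusterIP set to None, cluster DNS answers with every ready backing pod's IP instead of a single virtual service IP, giving a list you can walk down. The name and selector below are placeholders, not the actual chart's.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: consul-dns-headless   # hypothetical name
spec:
  clusterIP: None             # headless: a DNS A query returns all ready pod IPs
  selector:
    app: consul               # placeholder selector
  ports:
    - name: dns-tcp
      port: 8600              # Consul's standard DNS port
      protocol: TCP
    - name: dns-udp
      port: 8600
      protocol: UDP
```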
A: Now, one of the things that's more expected from their documentation and their chart, from what I can see, is that they expect you to shim or stub the Consul domain into the actual kube-dns or CoreDNS that's implemented in your cluster. You basically convince it to send anything that's .consul to the Consul service IP address. Now, there's an upside in that regard, in that even if the TTL is set to one second, if you get 10 requests in one second, Consul's not trying to answer that request for all ten. Let's just say, as an example, we scale plus two nodes, and both of those nodes will happily fit 20 webservice pods. They won't at our scale of nodes, but as an example: there's a sudden scaling event, the pods will come up, and they will probably come up before the local DaemonSet Consul has actually joined the Consul cluster, right? Neither of them are members yet.
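The stub-domain setup described at the start of this turn follows the pattern in Consul's Kubernetes DNS documentation: a kube-dns ConfigMap entry that forwards everything under the consul domain to the Consul DNS service. The address below is a placeholder for that service's ClusterIP.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  # 10.0.0.10 stands in for the Consul DNS service's ClusterIP
  stubDomains: |
    {"consul": ["10.0.0.10"]}
```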
A: The 10 pods come up, their dependencies run pretty quick, and then they all try to fire at once, and they're going to be within a second or two of each other, generally. Then you've got the problem of, boom, who do I get? Do the requests to the service IP actually end up being spread across the other eight members of the cluster that are active and members, or is one of them just getting hammered all of a sudden, right?
B: If you don't have it already, here's a link to this morning's agenda; I'm just going to reuse it. I just added your question as the very first item, so that we can revisit it.
B: Alright, so, using the query that you had posted in the issue: clusters C and D haven't suffered anything in the last five hours. Cluster B has a sporadic set of events that happened at 14:00 UTC, again at 14:30, and then at 16:30, so, like, roughly 10 minutes ago. I'm tempted to make this slightly easier and just target one of those time windows.
B: Yep, and I did confirm that the IP address showing in the log search that you posted is the correct cluster IP address utilized by the Consul DNS service.
B: Its periodSeconds is set to ten, so it runs every ten seconds, and its timeout is one second. So if this command fails to run within a period of one second, we'll fail the readiness probe. We only failed the probe once, and, assuming our timestamps are all in a kosher state, it looks like we could have a pod in a failing state receive a request, the request is not met, and then the next readiness probe is sent to the pod and fails.
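A sketch of the probe shape under discussion, assuming the exec-curl style used by older consul-helm charts (the exact command in the deployed chart version may differ):

```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -ec
      - |
        # passes only if the local agent can name a cluster leader
        curl http://127.0.0.1:8500/v1/status/leader 2>/dev/null | grep -E '".+"'
  periodSeconds: 10    # runs every ten seconds
  timeoutSeconds: 1    # the one-second window discussed above
```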
B: Well, the timeout for the curl request is one second, so, like, I don't know why we're failing our readiness probe, but that's a very short window of opportunity.
A: Because if a message refutation comes in and then, blip, it goes off: is that what results in whatever is causing the one-second curl timeout to not pass? Or are we seeing a CPU spike while it's processing some sort of event or other network traffic? There's got to be something correlative that we can bind to why this is happening. I know that we see these.
B: Refuting, so, still excluding info logs (I don't know what's in the info logs, but excluding them) and counting the word "refuting": 5,600 times... wait, is that the right line number? Yeah: 5,625 times, "refuting a suspect message from" some machine. And this varies; around that time it was specifically a VM, but I also see this for GKE nodes that are participating as well.
A: I just popped open the Consul chart. There's a laughable comment in here for the readiness probe, because we do this curl exec, and then we turn around and, like, hit localhost on the specific port or whatever else.
A: There's a note that says: when our HTTP status endpoints support the proper status codes, we should switch to that; this is temporary.
B: So, we're running a really old version of the Helm chart. I don't know what version you're looking at, but I feel like that comment has probably existed for quite a while. If it means anything, we're on version 0.20.0, I believe.
B: I did capture that somewhere, not in this issue, but I did capture that somewhere already. So we know that nothing happened to the node itself; it's not a node scaling event.
B: We know the pod has not been rescheduled, so maybe the next thing to look at is metrics for that node, to see if we're suffering, maybe, high CPU or memory usage on the node this is operating on, and maybe also for this pod; maybe this pod was suffering from some unfortunate event. So, where in our logs is the node that we're running on... this is the target node that we want to look at.
B: Let me double-check that, but I'm pretty sure that's where we are.
A: In that particular case, we should be able to replicate that exact pattern on a regular basis, and watch for pods that are having problems responding within the one second.
B: Potentially. That's primarily what I wonder about, the disk utilization. But also note that right around 16:30 our metrics kind of disappeared for a second on this node; some of the metrics disappeared, for some reason.
A: Hopefully, or not: guarantees on the QoS for Consul's memory and CPU, as opposed to best-effort.
A: Right, and I don't know that we can adjust it to a smaller interval, but unquestionably, an average per-core load of 120 percent...
B: This node is handling gitlab-shell.
B: On this particular node pool? No. Like, we'll have fluentd and elastic, or, like, pubsub, but as far as what we want to run on this node pool, it's just going to be gitlab-shell.
B: We want shell... 233e34f8-dvd5s... and what's the key combo to show me pods?
B: Yeah, so, as you see, there's only gitlab-shell. You know, here's our pod that we're investigating. So: only gitlab-shell, a bunch of stuff that Kubernetes or GKE provides us, followed by our logging infrastructure and the node-exporter that we manage. 45 restarts for that node-exporter; that's great.
A: Yeah, I definitely, definitely want to do some investigation into what's generating these massive spikes in average per-core load. Like, it's good that we're using it, right, don't get me wrong. But when you can just kind of trace the average line and you're below 75, yet you're regularly spiking over a hundred, yep, there's a problem here.
B: Okay, so look at that chart for me; expand your view a little bit.
A: Massive spike: there's 100 percent, you shoot just over 100, you spike at 16:35:30, and then, you said...
A: So the likelihood is that we may be seeing it not actually doing a full swing, right? We could be going from 30 to 200 and back, but over that 30-second sample, it, you know, didn't crest all the way up there.
B: Seems weird. You might find this interesting; I'm gonna paste a link for this. This is our workload on the node.
B: So, here, you know, "no response from nameservers list": this happened at 14:01, so, similar behavior.
B: Network activity was on the drop at that point, in fact, but, you know, still reasonable. Let me go back to this.
A: No, okay. Can you...
A: ...check the environment to see any hosts that are exposed through env, specifically if anything happens to show up as an entry for Consul.

A: Yeah, you could just kubectl exec env.
A: Honestly, my expectation is we won't need more than a couple hundred meg, like 100, like 200 megabytes of memory; more seems probably not needed. That's an easy one to find out: just grab all the Consul pods and see what their lifetime memory is, boom. And I would say we probably don't need to guarantee more than, say, 300 or 400 milli, like 300m, on CPU.
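Sketched as chart values, using the figures floated here (placeholders pending the lifetime-memory check). Setting requests equal to limits is what moves the pods from BestEffort to Guaranteed QoS:

```yaml
resources:
  requests:
    cpu: 300m        # figure floated above; verify against observed usage
    memory: 200Mi
  limits:
    cpu: 300m        # requests == limits across containers => Guaranteed QoS
    memory: 200Mi
```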
A: And then you start having long entries that, like, say: I tried that one, then I tried that one, and then I tried that one. That would be somewhat feasible if, one, we could ensure that not all traffic is being directed to whatever sits at the top of the list, and two, if our GitLab database load balancing resolver actually returned more than one entry, because it doesn't.
B: This is a bad idea, but: instead of having our pods reach off of the node in order to reach Consul, is there a hostPort configuration that we could leverage? Then we could somehow configure all of our pods to talk to localhost on some port that we know Consul should be running on, since we run Consul on all of our nodes anyways, currently. That would eliminate a network hop.
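A sketch of that hostPort idea; 8600 is Consul's standard DNS port, and the port names mirror the dns-tcp/dns-udp convention mentioned later in the call, but treat this as illustrative rather than the chart's actual spec:

```yaml
# Container ports on the Consul client DaemonSet (illustrative)
ports:
  - name: dns-tcp
    containerPort: 8600
    hostPort: 8600    # bound on each node's own IP, reachable from local pods
    protocol: TCP
  - name: dns-udp
    containerPort: 8600
    hostPort: 8600
    protocol: UDP
```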
B: However, if we set the priority class to a certain value, and we set the resources as we should (because currently we don't), theoretically Consul should bounce back. And since we poll Consul, every 60 seconds I believe it is, we'll put pressure from that one node onto the primary database for the one minute in which all of those pods failed the request. So hopefully, if something like that happened, it shouldn't last all that long, assuming Consul was able to spawn back up correctly.
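A minimal sketch of the priority-class half of that; the name and value are hypothetical, and the DaemonSet's pod spec would reference it via priorityClassName:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: consul-client-critical   # hypothetical name
value: 1000000                   # higher than workload pods, so Consul is evicted last
globalDefault: false
description: Keep Consul client agents running under node pressure.
```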
A: So that's the trade-off, in that the target ports are known, and we know the ports according to the DaemonSet. They expose them named dns-something: basically one is TCP, one is UDP, the same thing. So if I look at the DaemonSet...
A: So our Consul is out of date, chart-wise, but that means an application update, which might mean a compatibility update versus what's in Omnibus. Are we using the Omnibus for our Consul cluster?
A: The change to make the service become a NodePort instead is not that hard, but there's some dancing to do on that one, because you really don't want to try and expose port 53 on everybody's node unless you have to, and there are a lot of touchbacks on that one. I still kind of want to try it.
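For illustration, the NodePort variant would look roughly like this. Part of the dancing is that Kubernetes only allocates nodePort values in the 30000-32767 range by default, so you cannot simply claim port 53 on the node this way:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: consul-dns              # illustrative name
spec:
  type: NodePort
  selector:
    app: consul                 # placeholder selector
  ports:
    - name: dns-udp
      port: 53
      targetPort: dns-udp
      nodePort: 30053           # hypothetical; default allowed range is 30000-32767
      protocol: UDP
```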
A: Instead of having... figuring out, I should say: I want to try to figure out whether, instead of having us ask where Consul is and then ask Consul, it makes sense to route the request through the system DNS, through kube-dns or CoreDNS, in a stub domain. Because, if anything else, at least then, when you get 10 requests all of a sudden, it's not: hey, here's ten requests, Consul, I hope you return, right? Like, yeah.
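The CoreDNS equivalent of that stub domain, again following the pattern in Consul's documentation; the forward address is a placeholder for the Consul DNS ClusterIP, and the short cache is what absorbs a burst of near-simultaneous requests:

```yaml
# Extra stanza for the CoreDNS Corefile (existing stanzas elided);
# 10.0.0.10 stands in for the Consul DNS service's ClusterIP.
Corefile: |
  consul:53 {
      errors
      cache 30
      forward . 10.0.0.10
  }
```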
B: Maybe refine what I'm writing! I'm writing down action items.
A: So I'd have to go digging into the load balancer, the GitLab database load balancer code base, again, to see what parameters we're passing to the search, and whether the timeout configurability is there. I know that the gem's default is fine; I have to see if our default is that, or whether there's any timeout configurability beyond what's there. We can pass it a different timeout; we just have to make it possible.
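For context, a sketch of GitLab's database load balancing service-discovery settings as documented (the nameserver and record values are placeholders); whether a resolver timeout can be threaded through here was exactly the open question:

```yaml
# database.yml (sketch)
load_balancing:
  discover:
    nameserver: localhost                # placeholder: where to send DNS queries
    port: 8600                           # Consul's DNS port
    record: db-replica.service.consul    # placeholder record to look up
    record_type: A
    interval: 60                         # matches the 60-second polling mentioned earlier
```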
B: Right, okay. Is there anything else that we want to touch on? Otherwise, I think we're ready to end the call.