From YouTube: 2020-10-15 GitLab.com k8s migration EMEA
Description
Discussing the Git HTTPS migration
A: Cool, so hello everyone, welcome to another demo, another Thursday. Let's kick off with a little view of the blockers. The build logs work is making good progress: they've got it running in production, they're fixing up the last issue, and then they're planning to increase up to 25% of traffic, so that one hopefully will be progressing nicely. So this second one, the issue's been split out, so we've got these two issues; no great progress on either since our last demo. Are either of these blocking us yet, or are we still good?
B: To mix that traffic in with the Git HTTPS traffic.
B: To get an idea of what the time frame is for this: is it going to be next week, or the week after?
A: Cool, and then Prometheus metrics.
C: We decided last meeting that we're just going to move forward without this. I just want to keep it here, because I want it known that we don't have metrics coming from the shell itself.
A: Right, yep, cool, okay, that makes sense. And then the other one we had on there with the cross-AZ stuff, which I've removed; we're unblocked on that, which is awesome. And then the Pages stuff is in progress, so hopefully we'll be partially unblocked on that pretty soon as well. So good progress, I think.
B: Yeah, sure, I can give an overview. There's not a whole lot to demo here; we're still in the middle of the change issue, and we had a small blocker this morning. But let me first give an overview of what we have.
B: So we have the Git HTTPS backend, and in each backend there is the zonal cluster.
B: So a single zonal cluster, and this depends on the zone of the load balancer. So there's just one; for example, we'll just use us-east1-d as an example, and then we have a weight of 10.
B: Yeah, so the weight indicates how much traffic is going to each of these servers.
B: What you'll see is a single server for the zonal cluster with a weight of 10, then you'll have eight virtual machines with a weight of 10, and then you'll have the regional cluster for canary with a weight of five. The way you calculate the percentage of traffic is basically: you add up the weights, and then you see what percentage of the total weight goes to that server.
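A minimal sketch of the weight arithmetic described above; the backend names are illustrative, while the weights match the example given in the demo (one zonal-cluster backend at 10, eight VMs at 10 each, canary at 5).

```python
# Weight-based traffic split, as described in the demo. Backend names are
# made up for illustration; the weights are the example values given above.
weights = {
    "zonal-cluster": 10,
    **{f"git-vm-{i:02d}": 10 for i in range(1, 9)},  # eight VMs, weight 10 each
    "canary-regional-cluster": 5,
}

total = sum(weights.values())
for backend, weight in weights.items():
    share = 100 * weight / total
    print(f"{backend:25s} weight={weight:3d} -> {share:5.1f}% of traffic")

# Raising only the zonal cluster's weight (10 -> 25 -> 50 -> ...) shifts a
# growing share of traffic onto Kubernetes without touching the VM weights.
```

With these example weights the zonal cluster receives roughly 10% of requests, which lines up with the figure mentioned later in the demo.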
B: So maybe we'll go from 10 to 100 in steps, starting to give it more weight, and then eventually, once we have the zonal cluster taking over the majority of traffic and we're more confident in it, we will just remove the VMs altogether.
B: So that's kind of an overview of what we're doing. On the estimated pod count in this zone: for the readiness review, we did some rough calculations in order to have a minimum number of pods.
B: The min pods is set to 30, which is a change that we made this morning. My first thought was that we would do 50, to have the full minimum, but I realized that's going to blow out the number of nodes; it basically blows out the node count to 30 VMs across the three zones. And since we aren't yet putting all the traffic on the Kubernetes cluster, I want to start with 30 to see how it behaves, and then we'll increase it to 50 if we need to.
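A rough sketch of the node-count math behind that decision; the pods-per-node figure is an assumption for illustration (it isn't stated in the demo), while the min-pod values and the three-zone layout come from the discussion above.

```python
import math

# Estimate how many nodes a given min-pod floor implies across the zones.
# pods_per_node is a hypothetical packing density chosen for illustration.
zones = 3
pods_per_node = 5

for min_pods_per_zone in (30, 50):
    nodes_per_zone = math.ceil(min_pods_per_zone / pods_per_node)
    total_nodes = zones * nodes_per_zone
    print(f"min pods/zone = {min_pods_per_zone}: "
          f"~{nodes_per_zone} nodes per zone, ~{total_nodes} nodes total")
```

Under that assumed density, a floor of 50 pods per zone works out to roughly 30 nodes across the three zones, which is the blow-out described above.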
A: It does. I wonder, if we don't have the size right, if it's too small, what will the impact be?
B: The reason why I'm setting this floor is just so that when we shift the traffic initially we don't run the risk of overwhelming it. But we will be doing a gradual transition anyway, so I would say once we do the full transition, what we'll probably want to do is back it off to a floor that will allow us to scale up and down with traffic.
B: So I put some links here of what I'm paying attention to. We have the logs and the Git overview dashboard; this is sort of the basic stuff that we're looking at.
B: What I'm doing here is saying: I'm looking at type equals git, I'm looking only for things that are coming from Kubernetes, I don't want to look at the canary stage right now, and I'm just filtering out readiness checks. So those are the logs, and then we have the Git overview. The Git overview doesn't help us too much, just because with this dashboard it's not possible to differentiate Kubernetes and non-Kubernetes traffic.
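The filter just described could look roughly like the following Elasticsearch bool query; the field names (json.type, json.kubernetes_pod, json.stage, json.uri) are assumptions for illustration and may not match the actual Kibana fields.

```python
# Hedged sketch of the log filter described above, as an Elasticsearch bool
# query. All field names are assumed for illustration.
log_filter = {
    "bool": {
        "filter": [
            {"term": {"json.type": "git"}},                # type equals git
            {"exists": {"field": "json.kubernetes_pod"}},  # Kubernetes traffic only
        ],
        "must_not": [
            {"term": {"json.stage": "cny"}},                 # exclude the canary stage
            {"match_phrase": {"json.uri": "/-/readiness"}},  # drop readiness checks
        ],
    }
}
```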
B: So I think this overview will give us: okay, are we meeting our SLOs overall as we shift traffic over? But I'm looking at other things as well.
B: Most of my focus is on latency, looking at the 50th and 95th percentile of latency for both Workhorse and Rails. So here we have a direct comparison of virtual machines versus Kubernetes; Kubernetes is on the top, virtual machines are on the bottom, so this is doing a query basically for not-Kubernetes and Kubernetes. This is Workhorse percentiles, the 50th and 95th percentile. You can see that it looks like the VMs are doing much worse than Kubernetes so far; on the 95th percentile we're seeing all these little spikes that go up. I know we believe that the Git fleet is under-provisioned right now, and this could be related to that, but we aren't sure. It'll be interesting to see what happens to these two graphs as we start moving more traffic over.
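Purely to illustrate what the p50/p95 panels compare, here is a small sketch; the sample latencies are synthetic, and in the real dashboards these percentiles come from Prometheus histograms rather than raw samples like this.

```python
import numpy as np

# Synthetic request durations standing in for the two fleets; the shapes are
# invented purely to show how the p50/p95 comparison is computed.
rng = np.random.default_rng(0)
vm_latencies_s = rng.lognormal(mean=-1.5, sigma=0.8, size=10_000)
k8s_latencies_s = rng.lognormal(mean=-1.7, sigma=0.5, size=10_000)

for name, samples in (("VMs", vm_latencies_s), ("Kubernetes", k8s_latencies_s)):
    p50, p95 = np.percentile(samples, [50, 95])
    print(f"{name:10s} p50={p50:.3f}s  p95={p95:.3f}s")
```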
B: Right now we only have 10 percent of traffic in the Kubernetes cluster, so I wouldn't put too much stock into it. Another thing that sort of conflates this, and Skarbek, I'm curious what your thoughts on this are, is that right now on the Git VMs we mix both SSH and Git HTTPS traffic, while on the Kubernetes cluster we're only doing Git HTTPS. So it might be interesting for us to do a more apples-to-apples comparison: we could remove some of the VMs from Git SSH traffic and just have VMs doing only Git HTTPS versus Kubernetes doing only Git HTTPS.
B: Yeah, so I'm thinking about that right now. I think the transition to Kubernetes for Git HTTPS is going to go on until Monday or Tuesday, because I just want to be super slow and careful with it. Here's a similar graph for Rails, and again it's interesting, we see this same spike. Now, Rails is only showing Git HTTPS because for Git SSH... well, that's not true, actually; only the authorization requests go through Rails, but I would say most of these requests are coming from Git HTTPS.
B: It's down around 0.15 seconds consistently for the 50th percentile; it jumps around a bit, but I'll be interested to see what these graphs look like as we start to shift more traffic over.
B: So, as I mentioned, there was a blocker.
B: The blocker I found was that we don't have the metrics yet in Thanos. Our RPS, you can see here, started to go down as soon as I started to shift traffic into the Kubernetes cluster, and that's just because we have Prometheus running in the clusters, but Thanos, which is our aggregator for metrics and feeds the dashboards, wasn't configured yet for the zonal cluster. That's fixed now, so hopefully on the next run we will see the RPS stay constant.
B: This was a bit concerning: load balancer component error rates. I need to dig into this a little bit more. This started to creep up as I started to move traffic over, and oddly it seemed like the errors were coming not from Kubernetes but from the virtual machines. So I don't know what exactly is going on there. I'm going to do a smaller jump next time, because I had already reverted after this when I started to see it climb.
B: I just undid it and set the weight back to what it was before. We never got an alert, because we were below our SLO for this, but next time I'm going to maybe do a smaller jump to see if we're going to see errors again.
B: That is pretty much it. Yeah, you can see here we were seeing mostly 500 errors, and they were actually coming from the virtual machines. When I dug into the 500 errors it wasn't clear to me yet what's causing them, so I just need to look into that a bit more. And yeah, so I'm waiting for this query... there we go. This looks like it's canary; let's take out canary.
B: Yeah, so still no metrics in Thanos, so we have to wait. I just did a Chef MR to add the zonal cluster endpoints to the Thanos configuration, so once that's done, you should start seeing metrics.
C: Okay, with the addition of the zonal clusters and the addition of the new Thanos-related work that's been coming down from the observability team, I have a concern that our existing metrics and our logging mechanisms are not going to keep up with the changes that we have yet to place under runbooks, and I have this fear that eventually we're going to have some form of outage and we're not going to have a quick and easy way to decipher where the problem might lie.
C: Due to this fear, I kind of want to see us try to focus on tech debt in the future. I don't know what's coming up next after we finish what we have going with Sidekiq and after Jarv finishes up the Git HTTPS stuff; I don't know what's really next, but I would love to see some improvements made to our dashboards. We also have the Helm upgrade, which has been stuck forever.
C: I would love to see us get off the Helm 2 version, because it's kind of crappy. I don't know how to steer us in the direction of focusing on tech debt; I know we've got an epic that captures this data, but, you know, it's kind of the opposite of our OKR. I'm wondering what we need to do to ensure confidence in our runbooks and our monitoring, now that we've got multiple clusters.
A: Fully sold. I think it's a good point for us to review this stuff, like with the multi-cluster work; having that unblocks the Helm upgrade.
B: Yeah, I think what would help for me, Skarbek, since you have a fresher set of eyes on this stuff, is maybe for you to give us an idea of what you want to see. I was talking to Amy about this in our one-on-one this week, and I think, as soon as this migration is complete, we're going to improve both the architecture overview and the documentation and the runbooks. Some ideas that I've had:
B: One is: how do you configure your workstation to troubleshoot each zonal cluster with kubectl? That would be the first thing. Next is the endpoints for Prometheus; those are kind of the same, you just substitute out the cluster name, but we can write that down somewhere.
C: I think our next part would be dashboards and metrics. We don't have any sort of saturation metric or alerting for our clusters at all. If, for whatever reason, one of our clusters starts to take more traffic, you know, something might be wrong.
C: Google's routing traffic, for example; I'm just thinking off the top of my head. We're going to have the cluster in us-east1-b spin up a crap ton of pods and also spin up some nodes to match it, and we're not going to understand why. It would be good to have some sort of saturation alerting, or just some sort of dashboard to look at, to figure out how well distributed our traffic is. That way, in times of dire need or outages, we have the ability to say: oh look, something is wrong with the cluster sitting in us-east1-d, stuff like that.
C: Similar to that, some of our Kubernetes dashboards don't work, so I'm not sure what information they would display, but we also don't have information on, say, a node that is going haywire in comparison to other nodes.
B: Yeah, it was my understanding that we're going to just deprecate the mixin dashboards and fold what we want into the general dashboards. That's a little bit more work, but first we need to figure out what we need and what we want to see. So I don't know, maybe we can work with the observability team on this.
C: I created a few issues and associated them with the tech debt epic that we have created in our backlog.
C: A few additional things that I would like to see: that one revert that you had yesterday, late your time. I created an issue about that, because I thought it was kind of odd that we had a spike in nodes in one zone and not the others, and I'd like to figure out if we could address that in some way, shape or form.
B: You mean the 50, setting the min pods to 50, and that created... yeah. No, there was nothing unusual there. I applied it in two zones and then I saw that we spiked up to ten, so I didn't apply it to the third zone, and that's when I reverted it.
B: And I think, I mean, I wanted to increase the maximum, because our max node autoscale limit was 10, so I didn't want to be running at the max.
B: So what I have, then, is: we add a region filter to the general dashboards, we take a look at the pod info dashboard and incorporate that into the general dashboard, and then we take a look at whatever information we need about nodes and incorporate that into the general dashboard as well.
B: I mean, we do have, I think, an epic that Skarbek created a long time ago that's called technical debt.
A: Yeah, we do have a technical debt one, I mean.
B: And maybe I wouldn't even necessarily call it technical debt, just because a lot of it is going to be documentation, although some of it is the monitoring stuff.
A: Yeah, I think that makes a lot of sense, so we could just pull in anything that we think of, like review this dashboard or write this document. We've got a few things on the board already, so I'm kind of assuming we will focus on getting all the Git stuff, Git SSH and Git HTTPS, completed. So if there are issues outstanding for those, it would be great to get them onto the Delivery board.
A: But following that, we've got the issue around ensuring we can put zonal clusters into maintenance. I hope that's possible already.
A: This one's a bit more of a placeholder, but: document how to debug the multi-cluster setup, which I think ties in a bit more with what you were saying, Skarbek.
A: A slightly bigger thing, but anyway, we can start there. Jarv, do you think it'd be worth doing a retrospective readiness review for the multi-cluster?
B: Well, we never did a readiness review specific to multi-cluster, but yeah, maybe a retrospective on how it went.
B: Yeah, then let's just jump to documentation. I think the challenge there is just figuring out where to put it; we have too many markdown documents as it is in the runbooks docs directory under uncategorized.
B: So I think maybe we should try to take another look at all those documents and consolidate them a bit as part of this. I just feel weird, like adding another document is getting to be extreme at this point.
A: But I do think a readiness review would also still be useful, because it definitely gets people to review it. And I'd say, even though there's lots of documentation, people expect to be able to find a readiness review, and they live in a certain place, so I think it could be a nice entry point into other documentation.
B: But to me, readiness reviews are for the moment; they're not something that we maintain, we don't go back and update them. I'd rather just do documentation for the architecture of the multi-cluster setup and troubleshooting, and also keep it up to date. Okay.
A: I mean, yeah, that also sounds like a great thing to do. Do you think you'll get the same level of review? Because the difference to me is that documentation is kind of telling everyone, here's how it is, whereas a readiness review is a little bit more about inviting critique: here's how we have it set up.
C: I hate logging right now. Jarv showed his screen for a split second and we saw nothing but new lines, because of the way we're tailing our logs and the state of our networks, and I realize the distribution team has this in their backlog to work on. Is there any way we could try to figure out how to...?
A: It's really put me to work these last few weeks, but if you have a suggestion, no problem at all, whether it's contributing to stuff or if there's something in progress. If there is an issue that captures what you'd like, feel free to raise it and we'll see what we can do with it.
A: Awesome, and I'd say we will also be able to do the Helm upgrade alongside all this stuff.
A: We should definitely review what's on the issue and see where we're at. Do you think it'll be painful?
A: Okay, we'll review that stuff. But cool, shall we take a quick look at the board and see what we've got?
A: So at the moment we've got... So the ones related to... well, actually, this enable Action Cable issue, Jarv, is that one that we actually want to keep in progress, or do you want to separate it out and come back to it?
A: Cool, okay, so let's go back. You've got your readiness review stuff and the investigating logging stuff catch-all going on, and Jarv, you've got investigating the memory profile for pods, you've got an intermittent error, and then zonal deployments, and you've got the Workhorse stuff.
A: Is there anything else? Actually, I think that's all we've got. So do we have any other issues from Git HTTPS or Git SSH that we want to pull into this board?
C: Yeah, everything inside of the epic that I'm working on should be on this board. If it's missing the label I'll go back through and add it.
A: Awesome. And if there's anything on there that you're not yet working on, feel free to stick the ready label on it, so that it's here and you can just pull it in when it's ready.
C: Right, I've got one question for Jarv that's not necessarily part of this meeting. We can do it in the meeting if we want, and Jarv, if you could hang back for five minutes, I'd appreciate it.
C: Yeah, Kubernetes doesn't even show up in here, so I don't even know what's inside of that index.
B: I think what I would do is probably delete the indexes, and we can see if they are recreated, but I think they're just junk. I don't know what created them, though; that's really strange. And if you see logs going into pubsub-shell in gprd now, do you see your Kubernetes logs?
B: Okay, so that's definitely a problem. I'll need to take a look, because we should at least have something, right?
B: Yeah, so do you see logs for staging?
C: But for pre I did see logs, like I expected to, because that's where I've been doing most of my testing so far.
C: ...together, because I have no clue what to look at. Marin, we ended the meeting now; I just asked Jarv for some assistance on something, so I don't know if...
B: Yeah, but first of all, since you want to know where to look for this stuff: it's in the fluentd-elasticsearch chart files. There's this values file, and then what we do is...
B: We have these index names, and then this name corresponds to, it creates, a glob. It's actually *gitlab_gitlab-workhorse*; we put the star before and after, and it looks in /var/log/containers/ for that name, and then it goes into this index.
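To make the glob concrete, here's a small illustration; the example filename is invented, though it follows the usual /var/log/containers/<pod>_<namespace>_<container>-<id>.log naming used on Kubernetes nodes.

```python
import fnmatch

# The glob described above: a star before and after the name, anchored under
# /var/log/containers/. Files matching it get routed to the workhorse index.
pattern = "/var/log/containers/*gitlab_gitlab-workhorse*"

# Hypothetical container log file for a workhorse pod (made up for this example).
example = (
    "/var/log/containers/"
    "gitlab-workhorse-abc123_gitlab_gitlab-workhorse-0123456789abcdef.log"
)

print(fnmatch.fnmatch(example, pattern))  # True
```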
B: Yeah, I think I probably merged mine first, and this came in after, or something. Okay.
C: All right, I'll get a request in to fix this, then. So when I fix this, do I need to manually delete these indexes? Is that going to cause an issue at all?
B: You probably could. I mean, I think they'll eventually be deleted on their own; you could check with the observability folks, but I think it doesn't hurt to delete them either. Okay.
B: Cool. Marin, since you're joining a bit late, just to give you a quick update on Git HTTPS: we currently have around 10% of the traffic going into the Kubernetes cluster. We saw that there was a problem with Thanos getting the metrics in for the zonal clusters; that's fixed now, and I'm going to be slowly ramping up the traffic a little bit more today, but the expectation is that this is not going to complete tomorrow.
B: Okay, the way it works now is that we have both the cluster and all of the virtual machines in the HTTPS backend, and then we just keep increasing the weight of the GKE cluster slowly over time, in intervals, and that allows us to slowly move traffic over.
B: Compared to the VM fleet, on the VMs we're seeing that the 95th percentile is very spiky; on Kubernetes it's very flat, and it's good. So that sounds good, but it's also perplexing, although I do think that the Git fleet is a tad under-provisioned.
B: The VMs are sort of doing both Git HTTPS and Git SSH, so we were thinking about taking some Git VMs and just putting Git HTTPS on them, to have a point of comparison. This could be something we could do.
D: I mean, I support doing this a bit more carefully, because there's a huge possibility of an outage. What kind of confidence do you need to get in order to, for example, jump from 10% of traffic to 25% and then to 50%, instead of 10, 15, 20? Apart from that spikiness that you're not sure about, what else?
B: And also, I mean, I think once we have the metrics going to Thanos properly; I was losing some visibility because of that and was just looking at logs, so we'll see. My target was to have 50% of the traffic in today; it looks like that might happen tomorrow, and then we'll let that sit over the weekend and go to 100% early next week.
D: I have a suggestion, you don't have to take it, so you don't feel isolated in this, and that means both of you, you and... well, both Johns.
D: Like chat things through with multiple people, just to see different perspectives, and you also kind of roll them into the rollout, so maybe someone tells you something that you're not seeing, or maybe someone gives you more confidence or less confidence.
D: I don't expect you to roll out all of the traffic, that's up to you, but at the same time it would be nice to also involve more than just the two of you, so you can get a few more questions.
B: So, fun fact: we're deploying Workhorse from master right now to the Kubernetes cluster, which is different from what we're deploying to the VMs.
C: I thought we fixed that with that chart change. What happened?
B: I don't know what happened; I don't think this was ever working. When we tag CNG, it picks up the latest version of Workhorse. So we're using the tagged CNG image, but it's actually the latest version of Workhorse that was available at the time CNG was tagged.
D: There is an issue, they have created one, and there is a desire to fix it. It's more about the order of it: do we stop the world now to fix this, or take on some risk? We decided to talk about what the risk is, and we're deciding to take a bit of risk, but then the two corrective actions as we roll things out are that, and also shutting off traffic between clusters.
B: Yeah, I would say almost certainly, yeah. I mean...
C: I guess I could look at the latest auto-deploy and see if the version of Shell changed.
B: Just SSH to a VM, go to your container, run the binary with the version flag and see if it's the same.
C: I guess I'll find that out, and I'll create a new issue.
D: Look at that, look at that, it is, wow. Okay, we did rewrite it in Go anyway, so the VM is not going to be a good test; you need to test the image.
D: You need to pull an image, pull a tag of the image, and...
D: Cool, all right, thanks Jarv for the update, good progress. I do like that we are being a bit careful there as well. So thanks for doing that.