From YouTube: 2021-01-28 GitLab.com k8s migration EMEA
C
Cool, so welcome to today's demo. Who has something they would like to...
D
Demo? I do. Are we jumping right into it, or is there anything else we're going to discuss?
D
I'm basically going to repeat what I did this morning, though I have a little bit more information, and I also emailed, or you probably saw that I emailed, Google about this problem as well. So, for a little bit of background: there's an issue currently with our websockets deployment.
D
We removed the nginx ingress, so we're connecting right to the service endpoint, and we're seeing problems when pods are cycled; we're seeing errors. This is very reminiscent of an earlier problem, and this is something that Jason was also helping us out with, where we saw an issue with the nginx ingress where connections were being held open and we were seeing errors as pods were cycled. We tweaked some parameters in the nginx ingress and that problem went away.
D
Then, after that, we were seeing problems when nginx controller pods were being cycled. So what we did is we threw a lot of resources at the nginx ingress so that the pods never get cycled.
D
What we've done since then is that, instead of running a GCP internal LB in front of nginx, now we've taken out nginx and we're running a GCP internal LB right in front of our nodes, the nodes that are servicing websockets, so in front of our Rails nodes. What I'm going to show is what happens when a pod is removed.
D
God, there we go, okay. We are hitting this endpoint here. This is the service endpoint for the GitLab webservice websockets service. This is a Google internal load balancer, and I'm hitting just a README in a project. This is on pre-prod. In the lower right hand corner you can see the requests coming in, and in the upper right hand corner you can see I have k9s running, so the pod is xghtk, and you can see in the lower right hand corner these...
D
I can get the endpoints, which shows that for this node we have xghtk as the endpoint for the webservice websocket. So everything looks good; we're able to talk to the websocket service. This isn't a websockets connection, but it doesn't matter, because I can demonstrate this by doing just regular HTTP requests. I'm going to edit the service, I think... hold on.
E
Can you check your Zoom version? I'm just curious. Sure, it is 5.3.1. Oh.
D
Yeah, this really sucks, because... I don't know, I'll try to leave and come back.
D
For now, right, try not to touch too many things here. So I'm going to make a change that's going to result in a new pod coming up. Can you see that a new pod is being brought up? Oh, okay! Maybe if I don't touch anything it'll keep working. So this takes a little bit of time, but what we hope to see is that traffic will shift from this xghtk pod to the nu1ztw pod and we won't have any 500s, but that's not going to...
E
And what boot-up time are we talking about these days? About a minute or two? Okay.
D
So please, like, if you can still...
E
Two seems really long. Why... I think it stopped, Jarv. Again, I don't see anything moving. Wow.
D
Basically, if you were to do kubectl get endpoints, you would see that both endpoints are listed. This new one isn't healthy yet, but it will be as soon as it comes up.
D
And hopefully you still see we're servicing requests. You see some of the new pod's logs; we're starting to see some activity from it. So, first thing to notice here, hopefully this is still moving: xghtk is now terminating, but look at that, it's still receiving traffic. That's strange.
D
But that doesn't come into play here, because as soon as the termination starts, right, the kube-proxy should stop traffic. Right now we have zero out of two containers ready, and we're still receiving traffic. Now, we're still processing requests because we're in this period of time where Puma... this is the blackout window where...
F
So let's understand what takes a pod out of a service, because there seems to be this view that if it's terminating it should just be removed.
D
Yeah, so the Google load balancer has attached to it... let me, I'm going to stop my screen share again, I'll be right back.
D
So the GCP TCP load balancer has the nodes attached to it, so those nodes are always healthy, and the requests are going to one of the nodes. Then you're dealing with the kube-proxy, right, and then that gets directed to whatever pod is healthy. Right now, if you see my screen, we have this pod that's still in the terminated state and still receiving traffic, and this has been like over a minute now.
D
So I don't know. I sent an email to Google, and I'll be interested to see what they come back with about this.
D
I don't know, I guess it's possible that we have connection reuse, but for over two minutes? Is that really a possibility?
F
We have to figure out what's going on, what is being held open, and effectively you have a sticky connection through the IP stack from node A to pod Z. Because that's what's happening: traffic is still being routed there, and there's not something that says "get off of me, go away", so it just keeps getting routed there.
D
Yeah, I mean, I can do that now. I just hate the screen sharing stuff, because now, as soon as I click anywhere, I've probably lost my screen share.
C
Do you need any help on the external dependency issue?
B
Jarv and I had a conversation about that yesterday; we're going to try to iterate on the solution. I didn't get a chance to do this yesterday, due to other things, but I'm going to try to work on a first iteration of that today, and then I may wait until an improvement comes into GitLab to make it better in the future. But we'll see how that goes over the course of time.
B
Hopefully I should be able to complete that within a day; it's just a matter of getting that reviewed, if all goes well. Sounds good. If anything, I could use help on, Jason: you've seen me putting in thousands of merge requests into the Helm chart. They don't have a priority label; some of them are labeled as production requests, because we use those Helm charts. It'd be great if we could just get those reviewed, so that at some point, when we get a chance, we can upgrade our own Helm chart that we're consuming that way.
B
At this point I think I've got all the charts in that we control. I think the one question that remains is whether or not we want to do the nginx chart, because I think that requires a lot of touching, because we forked it.
B
The others that we inherit, like postgres and Redis and Grafana: I think, based on the conversation that you and I had very quickly yesterday, that we'll probably not touch those, which I think is fine. I need to note that on the issue just so that we have that documented, though, and I have not done that, so I'll make sure I do that after this call.
F
Great. The items that we have not forked, and therefore do not control the source of: we can propose MRs upstream, but adding labels is unfortunately not sufficient; we'd have to fork. So sorry, Andrew. You don't use those pods in production anyway, so at least that kind of helps; we can get started on that road, but not just yet. As to the large number of MRs: yes, we have seen them, we know that they're there, and we're trying to get through all of the reviews.
D
Okay, so I think we've confirmed then, at least from what I see doing a curl in a loop. I'm not sure if you can see my screen or not, but it looks like that does flip over immediately. So it sounds to me, then, like the errors that were seen on websockets probably are held-open connections, and we just don't handle that gracefully.
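The curl-in-a-loop check described above can be approximated with a short script. This is a minimal sketch: a throwaway local HTTP server stands in for the service endpoint (an assumption; the real check hits the GCP internal LB), and anything that isn't a 200 is counted as an error.

```python
import http.server
import threading
import urllib.request

# Throwaway backend standing in for the websockets service endpoint.
server = http.server.HTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# Poll in a loop, the way the demo polls with curl, counting failures.
errors = 0
for _ in range(20):
    try:
        with urllib.request.urlopen(url) as resp:
            if resp.status != 200:
                errors += 1
    except OSError:
        errors += 1

print(errors)  # 0 while the backend stays healthy
server.shutdown()
```

During a rollout, a non-zero count here would correspond to the 500s seen when traffic is routed to a terminating pod.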
B
One: is there something that we could leverage inside of, say, HAProxy, that could try to recognize that a pod is... no, there's no way to do that; HAProxy probably has no information about pods.
G
Just on that point: having spent like eight years working on websockets... I don't know Action Cable, but I very much doubt that they haven't built Action Cable with that in mind. Like, you get a signal in JavaScript that the connection's gone, and if your proxy and your intermediate proxies are all set up...
G
...fine, the browser will get an event that says the connection's down, and every implementation that I've seen of this will immediately try to reconnect, so there's no need to signal to the client to start trying again. You know, in the time frame that you do that, it might actually connect back to the same one. So the simplest thing to do is just to retry.
F
Yeah, you're 100% correct about the client side. My concern isn't the client side; my concern is that the server side needs to say "go away". While we're in the grace period of termination, while we're in the blackout seconds for the Rails application, we need to be telling active connections: stop it, go away, so that they know to reconnect. Once the stop action has occurred and we're into that window after SIGTERM has been issued, that's when we should start saying: I'm going to finish your requests, but you need to go away.
F
That's what the server needs to say, because while the client is still connected and neither party has dropped the TCP socket, it's being held open by the kernel on the endpoint. We actually have to cause that to terminate. If we do it with some configuration of kube-proxy, we're affecting everything; we do not want to do that.
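The server-side behavior being argued for here, evicting held-open websocket-style connections once SIGTERM arrives while letting short requests finish, could be sketched like this. The connection objects and the websocket/plain-HTTP split are hypothetical stand-ins, not GitLab's actual implementation.

```python
import signal
import threading

class GracefulShutdown:
    """Sketch: on SIGTERM, enter the blackout window, close long-lived
    (websocket-style) connections so their clients reconnect elsewhere,
    and refuse new work while draining."""

    def __init__(self):
        self.draining = threading.Event()
        self.websocket_conns = []          # held-open connections to evict

    def install(self):
        # Register the handler; must run in the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.draining.set()
        # Tell every held-open websocket to go away; ordinary in-flight
        # requests are simply allowed to complete.
        for conn in list(self.websocket_conns):
            conn.close()

    def accept(self, conn, is_websocket):
        if self.draining.is_set():
            return False                   # refuse new work while draining
        if is_websocket:
            self.websocket_conns.append(conn)
        return True
```

The key asymmetry is that only the long-lived connections get an explicit close; everything else drains naturally within the grace period.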
G
You know, because even if you told the client to disconnect, a good percentage of your clients are going to be malfunctioning or going through a tunnel at that exact moment, and they're not going to do what you tell them to do, and you shouldn't trust them to do that anyway. So the best thing to do is... you don't need a drain, especially with Action Cable, because Action Cable is broadcast; it doesn't have individual queues for individual clients.
D
...about it, Andrew: there's another element here that you may not have had in other implementations, which is having Workhorse in between, where these connections are proxied through Workhorse. So is that why we're seeing a bit of ugliness in our error rates? Basically, Rails dies before Workhorse, and then Workhorse gets a 503 Service Unavailable and returns a 503 Service Unavailable, because Rails just goes away.
G
Yeah, I mean, that would be a very good thing to test. We should have some sort of integration test that makes sure that Workhorse is handling that correctly. But certainly nginx and other front-side proxies, or reverse proxies at least, manage that perfectly fine. So if they can do it, then Workhorse can do it too.
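The failure mode being described (the reverse proxy's upstream vanishes, so the proxy answers 5xx itself) is easy to reproduce locally. A minimal sketch, with a deliberately dead upstream standing in for Rails; Workhorse's real behavior may differ:

```python
import http.server
import threading
import urllib.error
import urllib.request

# Nothing listens on port 1, so the "Rails" upstream is already gone.
UPSTREAM = "http://127.0.0.1:1"

class Proxy(http.server.BaseHTTPRequestHandler):
    """Toy reverse proxy: forward GETs, turn upstream failure into a 503."""

    def do_GET(self):
        try:
            with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                self.send_response(resp.status)
                self.end_headers()
        except OSError:                    # connection refused: upstream died
            self.send_response(503)
            self.end_headers()

    def log_message(self, *args):          # keep the demo quiet
        pass

proxy = http.server.HTTPServer(("127.0.0.1", 0), Proxy)
threading.Thread(target=proxy.serve_forever, daemon=True).start()

status = None
try:
    urllib.request.urlopen(f"http://127.0.0.1:{proxy.server_port}/")
except urllib.error.HTTPError as exc:
    status = exc.code
proxy.shutdown()
print(status)  # 503
```

This is the 503 the client sees whenever a request lands on the proxy in the window after the application behind it has exited.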
G
I mean, what you could do is, with like GDK or something like that, or just locally running processes: set up an open websocket connection from the client to the back end and then to Action Cable, and then do a SIGTERM on Puma and see. I mean, I would imagine that Workhorse would just drop it; there's not that many things that it can do.
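That local experiment can be sketched without GDK at all. Here a throwaway TCP server is a stand-in for Puma (an assumption: any process that accepts a connection and then receives SIGTERM shows the client-side effect; the port number is arbitrary):

```python
import signal
import socket
import subprocess
import sys
import time

# Throwaway server standing in for Puma.
server = subprocess.Popen([sys.executable, "-c", (
    "import socket, time\n"
    "s = socket.socket()\n"
    "s.bind(('127.0.0.1', 8901)); s.listen(1)\n"
    "c, _ = s.accept()\n"
    "time.sleep(60)\n")])

# Connect the way a websocket client would, retrying until the server is up.
conn = None
for _ in range(50):
    try:
        conn = socket.create_connection(("127.0.0.1", 8901))
        break
    except OSError:
        time.sleep(0.1)

server.send_signal(signal.SIGTERM)   # what a terminating pod receives
server.wait()

# The kernel closes the dead process's sockets, so the held-open
# connection reads EOF (or a reset): either way, the client learns the
# connection is gone, which is its cue to reconnect.
try:
    dropped = conn.recv(1) == b""
except ConnectionResetError:
    dropped = True
conn.close()
print(dropped)
```

The interesting part is the last read: nothing application-level told the client to go away; the drop is purely the kernel cleaning up after the dead process.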
F
That's not the case; there is one, two, or five at most. We need to be aware of that fact. So while the answer can be "just terminate the thing", anybody who's not operating in that individualized process behavior is going to be heavily affected by that choice. So when I say we need to tell the application, I'm saying: during the blackout window, we need to identify that this is an active connection that is from Action Cable, shut it off, and then finish answering anything else.
G
Yeah, that's a really good point; I haven't considered that from a charts point of view. Do you think there's any appetite for making that by default a separate workload? Because there are other reasons why not having your websockets coming into your main web application is a very good idea. Or do you think that's just too much extra complexity?
F
It's not that we can't separate it; it's a matter of resource versus scale, right? At our scale it makes hundreds of percent of sense to do it the right way, in the way we're doing it. For somebody who's at a thousand users, or even two, we're generating what's possibly an extra one to two CPU and a couple of gigs of RAM, right?
F
Thinking in the cloud native sense: if we go to the Omnibus side, I know of thousand-user systems that are just really big VMs, and we don't spawn Action Cable for this and Rails for this and Rails for that, right? We don't have the Rails applications segregated into individual API routes. We just don't. It's not impossible, and it's something we can look into; it's that it's not reasonable for smaller instances, and we have to think about not only us at millions and millions of users, but the people with 15 users.
D
What do we do in the meantime? I mean, we need to investigate this, but what do we do in the meantime about our SLIs for error rates on the websocket service?
D
That's fine, but we're still going to see errors. The issue is that we see these errors, but we just get reconnects, and everything, as far as we understand, is hunky-dory on the client side.
G
Jarv, sorry, I was concentrating on several things when you were talking earlier, but like this over here... is my screen sharing now? Yeah, yours is working. I'm jealous. Yeah, it won't last for long. So, this over here isn't related to... this is websockets, right? This over here, this load balancer spike: it doesn't seem to be related to a deployment, does it, at 14:00 UTC?
G
I thought a lot of them were... that's that other problem with the client header that was being sent through by like one IP address, where he was sending a lot. Are you aware of that incident? Yeah.
D
...not a problem anymore. We can look at Sentry to see if these are 500 errors, but if you look at it, the Workhorse error rate is much lower, right? So it looks like this is the load balancer complaining, not actual errors, yeah.
G
Just while I'm here, before I go away: this is a classic example, right in front of us, of why I like to have websockets separated. So here's our base load at like 20 requests per second, and then for some reason it's just spiking up. And we've hardly been using this stuff yet, so wait until we're really using this, and this is spiking to five times the load.
D
Yeah, I know. I'm very happy we decided to create a new service for this, and also to have its own node pool. I don't know... I see a deployment that happened about 10 or 20 minutes ago, so maybe the spike is due to that. I'll dig into it a bit more.
G
No, but they don't get 500s, Jason, because with websockets it connects, it does the upgrade, and that's the moment that you get a 200 back. When that gets disconnected, it's not a status 500; it's just a socket disconnect. The status code is 200 on those. So that's just something that's worth being clear about and clarifying: when you get disconnected without expectation on a websocket, it's not treated as an error by HAProxy or by nginx or anything else.
G
For a reason, yeah. We did exponential back-off on those, or randomized retry, just to kind of spread it a little bit. But I'm sure, actually, Action Cable is a pretty widely used thing; I would be surprised if it didn't have some of those things in.
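The randomized retry mentioned here is usually exponential backoff with jitter. A minimal sketch; the base and cap values are illustrative, not anything Action Cable ships:

```python
import random

def reconnect_delays(attempts, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: each retry waits a random
    amount up to an exponentially growing, capped ceiling, so a fleet of
    disconnected clients doesn't reconnect as one thundering herd."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0.0, ceiling))
    return delays

print(all(0.0 <= d <= 30.0 for d in reconnect_delays(10)))  # True
```

The jitter is the point: without it, every client that was dropped by the same terminating pod would retry at the same instant, producing exactly the kind of load-balancer spike discussed above.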
D
I think, Jason, I like your suggestion the most, which is to reduce this grace period. It feels like it's a safe thing to do for websockets anyway, and we can try decreasing it and see if that helps with the error spike.
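The grace-period knob being discussed lives on the pod spec. A minimal sketch of what decreasing it looks like; the field names are standard Kubernetes, but the values and container name are illustrative, not the actual chart's:

```yaml
# Pod template fragment: values are illustrative, not the deployed settings.
spec:
  terminationGracePeriodSeconds: 30   # lower this so terminating pods linger less
  containers:
    - name: websockets
      lifecycle:
        preStop:
          exec:
            # Brief pause so endpoint removal can propagate before SIGTERM.
            command: ["sleep", "5"]
```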
G
Just one more thing on that thought: I saw that Heinrich, or somebody, mentioned that you can send commands through the websockets.
G
That's something Action Cable can do, and I think they're not doing it yet, and I think it's a very bad idea to do that. One of the reasons would be, if somebody sends a command and they're busy creating a merge request and they've issued that through the websocket: first of all, we don't get any logs, our metrics don't work anymore, and then the third reason is we can no longer do the short terminate, because you don't know if that request is still ongoing.
F
I'd agree with that, and that's the danger. Websockets are really cool from a technical standpoint, but if they're used in the wrong way, you lack visibility, you interrupt controls, and the design has to be done right; and I can tell you doing that right is not easy, having done websockets in the past myself. It's TCP...
G
...on top of HTTP, and you've got TCP and HTTP again and then TCP on top. But also, any reason that they say "oh, it'll be faster", or anything like that: that's totally no longer true because of streaming, like HTTP/2 and HTTP/3, so you're no longer setting up a connection for every request or anything like that. So if they say "oh, but it'll be so much faster", you can say: well, it's so much faster with HTTP/2 or 3 anyway.
C
Cool, okay, great. So the only other bit I wanted to run through, especially whilst we have you here, Jason, is just a quick look at our blockers.
C
Hopefully you can see the blockers. So I saw, Jason, you got pinged on this, removing the duplicate messages. Do you know what might be involved, or when we might likely be able to get the charts change that Skarbek mentioned?
F
I don't know about this charts change, and I'll be honest, I hadn't had a chance to go look at this before just a few minutes ago. My days for the last two have been very busy with interviews, so I haven't had a lot of spare time. I can answer that question in about an hour. Okay.
D
I just wanted to say thank you to Jason for joining; really appreciate it. It was very nice to have you here. Yeah, the regional dashboards are really...
D
Yeah, and something I mentioned to Andrew, and might be interesting for Skarbek as well, is that when we start moving the API, we can overload this region label to compare the VM SLIs versus the Kubernetes SLIs. Because for VMs we set the region label to us-east; it's just set the same everywhere. For the Kubernetes clusters we have it set to us-east1 b, c, and d.
G
But certainly here, exactly what you said: this SLI comes from the HAProxy, which is on VMs, and so it's us-east, and then the three other ones are us-east1 b, c, and d. The one thing that's really weird, at least... I was looking at this, and it didn't seem right.
G
I don't know... the three are identical, Jarv. Like, really: there's tiny differences, but I was expecting more variety between the three clusters, and they have the same dips and the same spikes. It's very suspicious to me. That does seem suspicious, yeah. I mean, there are differences, they're not identical, but...
D
Yeah, I don't know if you read GitHub's latest blog post on how they do deploys, but this is sort of where I think we're eventually going to go. They said they started out with the canary stage at two percent, and they realized, okay, that isn't enough; they needed like ten percent. And I think for us it's going to be 33.
D
We'll have one cluster go first, and then we can measure, using the region label; we can kind of see what the SLIs are for that cluster and then move forward. That would be great.
D
...been a long day. It has been a long day, yeah, that's true. But we'll be able to move forward so much quicker; we'll just say, forget it, we'll just move forward, and everything will be fine.
D
...it in delivery, I'll check it out, but I was supposed to... yeah.