From YouTube: 2020-07-10 GitLab.com k8s migration EMEA
Description
Delivery team weekly demo of the GitLab.com Kubernetes migration
A
Job, she won't give us a demo.
C
So the problem we have right now in staging and production is that we've lost the certificate authority key. The certificate authority key was generated years ago by a former employee, and that former employee does not know where it is. So what we did is turn off TLS verification in staging and production, and that's the current state of things.
C
So what I think we probably want to do is regenerate these keys and re-enable TLS, and this will allow us to get Consul running in the Kubernetes cluster. I think we could maybe even do it without doing those steps first, but it's probably something we should fix anyway. I put some instructions here, because I just ran through this for pre-prod, on how to generate the keys: generate the server keys.
C
The client keys, too. I put notes here about how you can choose how long the expiration is: by default it's one year, which is not very long, so maybe we should increase that. We probably also want to come up with the domain, and Consul also has this concept of a data center, which is currently set to us-east1 for pre-prod and staging, and to something different for production; it seems like it's just a random string. I don't know if we want to change it.
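Presumably the instructions boil down to Consul's built-in TLS tooling; a minimal sketch, assuming the us-east1 datacenter name mentioned above, with the longer expirations as illustrative placeholders for raising the one-year default:

```shell
# Recreate the certificate authority (and keep consul-agent-ca-key.pem safe this time)
consul tls ca create -days=3650

# Server certificates for the us-east1 datacenter, with a longer-than-default expiry
consul tls cert create -server -dc=us-east1 -days=1825

# Client certificates for the agents
consul tls cert create -client -dc=us-east1 -days=1825
```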
C
So we're using the HashiCorp Consul Helm chart. It's actually pretty simple, because we're just using what they provide out of the box. Once you have it correctly configured, you can see that these are the pods for Consul. You might be thinking, wow, we have a lot of replicas here, and why is that? Well, if I use the wide option here, you can see that it just so happens we have a lot of nodes, because we have a lot of node pools.
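What's being shown is presumably the standard kubectl view; a sketch, where the consul namespace is an assumption:

```shell
# "The wide option": list the Consul pods along with the node each one landed on
kubectl get pods -n consul -o wide

# One client agent per node, so the replica count tracks the node count
kubectl get nodes
```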
C
We don't care so much about pre-prod, because we don't really rely on Consul for anything in pre-prod at the moment, since we don't have Patroni in pre-prod. What we'll see, I can pull up the UI for Consul, and actually I think it'd be really nice to get the Consul UI, like, accessible from the outside.
C
I don't know, yeah, I was thinking we would just put it behind IAP, but that's also, that's a good point, the danger. I don't think we want everyone to log in and have access to Consul. So we have to figure that out. You could have, like, a group for SREs for it, or, I don't know, yeah, but anyway.
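In the meantime, reaching the UI presumably means a tunnel rather than a public ingress; a sketch, where the namespace and service name are assumptions about what the chart creates:

```shell
# Reach the Consul UI without exposing it publicly
kubectl port-forward -n consul svc/consul-ui 8500:80
# then browse to http://localhost:8500/ui
```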
C
So here's what it looks like after this; this is what you see. We can connect to Consul. We're actually, like, a minor version behind on Consul.
C
So we may want to consider upgrading at some point, but I don't know if we want to do that now. These are the services that are running. I don't know why we have these services defined, honestly; I guess we maybe have a service that is looking at, like, this check for Unicorn. These are VMs, by the way; this isn't for Kubernetes. Where you can see Kubernetes is if I go to nodes: now you see all of these GKE nodes here that are reporting in to the Consul server.
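The same membership view is available from the CLI; a sketch, assuming you can exec into one of the agents (the pod name is a placeholder):

```shell
# Every GKE node runs a client agent, so each one shows up as a member
kubectl exec -n consul consul-server-0 -- consul members
```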
C
I'll show this again, so yeah. So here are the replicas, and you can see, yeah, we have one for every node in every node pool. The reason why there are so many is because we have all of these node pools, which consist of a lot of nodes. I would say we're a bit over-provisioned because of that, yeah.
C
I don't know, like, maybe... I can't really say. I would think, like, maybe internal API and public API, maybe you'd want those to be isolated into separate node pools, but maybe not, I don't know. I think one thing I'd like to discuss is moving registry into its own node pool, so it isn't on the default node pool, because it's a memory hog, and I think it would be nice to isolate it.
B
You know, the first time we actually have an incident where we don't automatically point towards the Kubernetes cluster as the point of failure, then we can start talking about us feeling comfortable about running this. We know it a bit better now, but it was way too easy for people to just point to the cluster when things were actually clearly broken somewhere else. That means we are not in a comfort zone, right? We are not at the point of just being able to claim that, you know, and that's...
D
I have a couple of questions that come from the grand vision of using it. As I understood it, we are using this only in production, really, as service discovery for Patroni. So that is the only reason why we have Consul installed, which covers maybe less than 10% of Consul's features. It's...
D
Maybe the reasoning is that, in a production deployment, Consul, when it's replicated and is a distributed system, is a good place for storing this kind of information, which is transient but kind of defines what the status is, and which is external to the main GitLab product, because it's something that we need to operate the infrastructure. So this basically was my question: whether we are already using it for something like that, or it's just...
F
We don't use it for that, but, like, the one thing about that is, obviously, we've also got etcd in a certain part of the application, and we could kind of use the Kubernetes abstraction for this, and I think the examples you have kind of show that that's not a great fit. Well, maybe it doesn't work because, obviously, deployments wouldn't work like that.
D
We have Consul, which is outside of Kubernetes, so it's a tool that we may use regardless of whether we are in the VM world or in the Kubernetes world. One of the best practices that I know about etcd is that if you need it for your application's purposes, it's better not to use the one provided with Kubernetes; you should have your own cluster, because the etcd in Kubernetes is for Kubernetes, right? You should not mess with that one. So this was just thinking out loud, because we have Consul.
C
Just to keep in mind, we're not doing a Consul agent per pod; we're doing a Consul agent per node. So for things like storing key-value data, I mean, it would be at the whole environment level; it wouldn't be per pod. But yeah, I mean, we could run the Consul agent differently, we could run it as a sidecar or something, but...
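The per-node layout is the chart's client mode; a minimal sketch of that setting, assuming the hashicorp/consul chart discussed above, with the release and namespace names as placeholders:

```shell
# Clients run as a DaemonSet, i.e. one agent per node rather than one per pod
helm upgrade consul hashicorp/consul \
  --namespace consul \
  --set global.name=consul \
  --set client.enabled=true
```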
F
Can I just go back one second to the earlier point? One of the other really interesting things, right, is that if we did use Consul for, like, global state, we could also use the consul exporter to take a subset of the keys and put them into Prometheus, and that would be very useful for things like not alerting on low operation rates when we drain canary, because that state, you know, the canary drain, could be kept in Consul, and then we could automatically include that in our alerting.
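A sketch of that wiring, assuming the prometheus/consul_exporter; the canary/drained key is purely hypothetical:

```shell
# Hypothetical: record the drain state in Consul's KV store
consul kv put canary/drained true

# Expose a subset of the KV store as Prometheus metrics
consul_exporter --consul.server=localhost:8500 --kv.prefix=canary/
```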
C
Yeah, so that's easy; it doesn't seem like it's a blocker anymore. It shouldn't be a blocker now. I've already unmounted the cache mount across the entire production fleet. We were looking at 90th percentiles of downloads, and it didn't seem to be affected at all. I mean, of course, it's a very spiky metric, but it looks good: no additional load on the underlying servers. So I think we could just chalk this up as a win, and you don't have to worry about it for GitLab.com there.
C
That issue isn't closed yet, because, you know, there's some discussion about what we should do for self-managed cloud native, and I was proposing that we turn this feature flag on by default, so that nobody uses the disk cache. yacht-club wasn't too happy about that and responded that we probably don't want to mess up people, which I agree with, but this would only affect new installations. You could just turn it off for cloud native, so that no one uses the disk cache for cloud native.
C
For cloud native, the charts are completely outside of everything right now, so, yeah, I don't know. I think I would prefer to have a solution where the default just works for cloud native, because currently, in my opinion, it doesn't work for cloud native, or it's not great, because you have these files on disk that are going to build up over time.
B
The investigative work is being wrapped up, and there are a couple of proposals out there for how to tackle the problem. I can't share any details right now about what's going to happen next, but I can tell you that next week there is going to be some movement on who is going to drive this further and how, and that will give us a clearer horizon on when we're going to be unblocked on web and API. This is not to say that Pages is going to instantly become cloud native.
C
For next steps: it sounds like we're going to finish the urgent-cpu-bound work, and I guess we need to decide what comes next, like, do we try to take more queues off of the catch-all and move them, or do we move on? As soon as the Consul work is done, all the other blockers are finished, so we could start working on Git HTTPS and Git SSH, or the WebSockets work. I think those are our options.
C
Yeah, I was thinking maybe we use the tags, because I think having a bunch of names is going to be a bit overwhelming to manage. So maybe we just use the tag syntax we have in the queue selector and just start tagging things. But I think what we need to do first is probably create a VM; we need to move them to VMs first, so that we can monitor them, right? We need to have a staging area, where we move them from the catch-all to the staging area.
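A heavily hypothetical sketch of the tag idea; the needs_migration tag is invented for illustration, and the flag reflects the experimental queue selector sidekiq-cluster shipped around this time:

```shell
# Illustrative only: start a Sidekiq shard for queues carrying a given tag
sidekiq-cluster --experimental-queue-selector "tags=needs_migration"
```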
B
Was yours that one, Dan? Perfect, yeah. Geoff, is it possible then, because what I'm hearing right now is that if we are going down that route, where we are going to be isolating them in the end and doing all of that, it's basically a sifting-through-months type of work, sorry skarbek, I know it doesn't sound exciting. Is it?
C
Here, also, it would be really nice to... this is the first time we're going to have an ingress for a Kubernetes service other than registry. It would be nice to be able to have canary, but you can't really do canary for a WebSocket. We can't really do... I mean, we can do canary for HTTP GET, but...
C
Yeah, we just don't have that capability now, that's all I'm saying. Maybe we can, so...
D
My point is that when canary was one week ahead of production, this made sense, but now we're talking about hours. So shifting a percentage of production traffic to canary is suddenly something we should do regardless of the cookie, because it helps us understand the status of something that we want to deploy in one hour, and...
F
I think we should do that, but you have to have, like... the lesson we've been learning over the last few weeks is: if we do that with web, then the WebSockets have to be sticky to the same version, and when you talk to GraphQL, it also has to go back to canary, right? And we don't have the tooling to do that at the moment, so it's basically scattergun, and that's leading to lots of problems.
D
That's because we are routing on path and not sticking to user sessions or things like that. If we were saying something like: 10 percent of users, sticky on sessions, go to canary regardless of the path they are asking for, this would allow us to make sure that if you get the canary front-end, then your GraphQL queries, for instance, would also be routed to canary, because it's your connection, it's your session, right? Yeah.
F
It's complicated, but as long as everything's going to the same stage, that's what I want, right? And, you know, obviously, stickiness to a stage is fine, like, if you put a cookie in; but then, obviously, if you drain canary, then everyone goes to the main stage. The main thing for me is that we can't have something where we kind of have half going to one place and half to another, yeah.
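For reference, a per-request version of that stickiness already exists at the edge; GitLab.com routes requests that carry the gitlab_canary cookie to the canary stage, which a sketch like this exercises:

```shell
# Opt a single request into the canary stage via the routing cookie
curl --cookie "gitlab_canary=true" https://gitlab.com/api/v4/version

# Without the cookie, the request is served by the main stage
curl https://gitlab.com/api/v4/version
```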
C
Alessio, I think the point of canary in this context was infrastructure, and in this case we want to, like, deploy the new infrastructure in the canary stage, and it would sit there for, like, a week or two, right? So I think that's the main benefit we'll see from having something we can put into canary first: just to ensure that it's working with production traffic for a long period of time before we promote it to production.