From YouTube: 2020-07-30 GitLab.com k8s migration APAC
Description
Demo of the webservice pod on pre/staging. Discussion of Vault and the 1.16 Kubernetes upgrade.
B: So, we don't have [inaudible] online at the moment, I don't think, but we'll just kick off and run through the blockers. So we are still waiting; the first one is nearly ready, on the live tracers issue.
B: That's the second one; support for the Dependency Proxy is still around on that one.
B: This is the one at the moment where we're working on the catch-all shards: we're going to separate out the queues on catch-all and do as much of that as we can. That's all in progress at the moment, and it will unblock the next stage. And then Jakob's working on removing NFS, well, removing enough of the NFS dependency on Pages for us to progress there.
B: Actually, that's moving along nicely. Any update on that from your side, Marin?
B: I think it was progressing as expected, so no issues there. And then we've got a new one. Is it definitely a blocker at the moment, the logging work, or is it about to be, right?
E: Yeah, so for logging we're not blocked right now. I'm a little bit worried that when we start taking production traffic for Git we're just going to be flooded with crappy logs, and we don't have a good filtering mechanism, so I would say possibly a blocker. I mean, it sounds like, well, I don't know whether Jason is speaking for the Distribution team, but he seems willing to incorporate this.
E: It's a contribution that has a sidecar that wraps the logs in JSON and then indicates which log file each log line comes from. That will give us the flexibility to do the filtering we need to do. So I would say it's still a blocker for now, but it's definitely not preventing us from getting started with the git https and websockets stuff in production; it may prevent us from finishing it.
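(A minimal sketch of the wrapping described here; the field names and payload are assumptions, not the actual sidecar's schema:)

    {
      "file": "/var/log/gitlab/workhorse.log",
      "message": "{\"level\":\"info\",\"method\":\"GET\",\"uri\":\"/users/sign_in\"}"
    }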
B: Cool, okay. So, give us a demo.
E: Stuff is working in staging, well, it's working now, I guess. Let me just share my screen and we'll go ahead and see if logging is working. I don't know, I just merged the logging changes, so I'm not even sure if it's working, but we can look.
E
So,
just
a
little
bit
of
background,
we
enabled
git
https
in
the
well
actually
both
websockets
and
did
https
in
the
kubernetes
cluster
for
pre-product
staging.
So
when
you
do
a
git
clone
or
any
git
operations
on
staging
using
https,
you
should
be.
Those
workloads
are
being
serviced
by
the
web
service.
Part.
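(For example, a clone over HTTPS against staging now lands on the Kubernetes webservice pods; the project path here is hypothetical:)

    git clone https://staging.gitlab.com/gitlab-org/gitlab.git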
E: Yeah, I mean, we have this new feature called Action Cable, which we discussed; it hasn't been turned on yet, and it uses websockets. We have an existing feature, the interactive terminal, which uses websockets, and maybe you've seen that before. You can actually pull that up on our production cluster if you go to the k8s-workloads/gitlab-com project on ops.
E: It works; you can actually use it. I was showing Amy this in our one-on-one. When you click that terminal icon on ops, you can pull up a pod and have an interactive terminal.
D: No, I know that part works, but, yes, I don't know how it works in staging and production. Right? Like, ops is very much simpler.
E: Sure, yeah. I don't think anyone has, well, maybe someone does; I don't know. It's very low traffic, let's put it that way, like very, very low. So I'm looking at the non-production Elasticsearch cluster, and so far I don't see any Kubernetes logs.
E: Let me go ahead and refresh the index mappings.
E: I feel like this interface changed a bit since the last time I used it. I need to go to...
E: I imagine what we'll do is start off with canary, and then we'll probably just add the GKE load balancer as a single backend in HAProxy, so that it gets a small percentage of the production traffic. We'll observe it from there, and then eventually we'll shift over all traffic. We can take a look at, like...
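(A minimal sketch of the HAProxy shape being described; the backend name, server names and addresses are made up for illustration:)

    backend https_git
        # existing VM fleet keeps most of the traffic
        server git-01 10.0.0.11:443 check weight 100
        server git-02 10.0.0.12:443 check weight 100
        # GKE load balancer added as one low-weight server,
        # so it receives a small share of production traffic
        server gke-lb 10.0.1.100:443 check weight 5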
E: Oops. So you see this is a little bit different, because there are multiple containers running: we have both the webservice and gitlab-workhorse.
E: So if we select gitlab-workhorse, you can see that, wow, we get a lot of logs. Well, not that much, because it's staging, but these are the Workhorse logs, and we're seeing https GET requests here, along with the readiness probes.
E: You can see that we're tailing the logs in the pod, which go to standard out, so we're seeing log lines from multiple log files. This is one of the things that I hope this issue will resolve: that we'll be able to determine which log line comes from which log file. We're most interested in the production JSON log, but there are still unstructured logs in here, we have the auth log, there's just a lot of stuff to sort out.
E: Yeah, I think this is the reason why we have this blocker. What's going to happen is that each of these log lines will be wrapped in JSON, so we'll be able to say this log line came from auth.log, and then we can either drop it or, right, not drop it, but not send it to Elasticsearch, or send it to a special index.
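(A sketch of the routing this would enable, assuming a Fluentd-style shipper and a "file" field added by the wrapper; the tag and field name are both assumptions:)

    <filter kubernetes.workhorse>
      @type grep
      <exclude>
        key file
        pattern /auth\.log$/   # e.g. keep auth.log lines out of the main index
      </exclude>
    </filter>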
E: Yeah, so I think there is an issue; I'll take an action to find it. I think possibly we could add to structured logging a field that indicates where the log comes from. That would be the best option, possibly.
D: Find it and ping me on that.
E: Sure. I'm not sure why the logs aren't coming through, but let's just take a look at Elasticsearch. So if we do k...
E: Where is the... here it is. So it's looking for this path for Workhorse logs, so what I should be able to do is just see if that path exists. It does. This is the pod that it's looking at, so you can see that the logs are here, and then this should be forwarded over to Elasticsearch. I'm not sure why I can't see them in Elasticsearch yet; I'll take a look at that this morning, but I would say logging should be working.
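(The check described can be done along these lines; the namespace, pod name and log path are placeholders for illustration:)

    kubectl -n gitlab exec <webservice-pod> -c gitlab-workhorse -- ls /var/log/gitlab
    kubectl -n gitlab logs <webservice-pod> -c gitlab-workhorse --tail=20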
E: Yeah, no, I'm sure it's fine. I mean, this is brand new, so I'm sure it's something else, but I'll take a look. Cool, that's it for me. Graham, do you have anything? Or let me look at the agenda.
A
You
want
to
yeah
so
give
a
five
minute.
Vault
update
the
short
answer.
Is
we
have
a
production,
a
quote?
Unquote
a
production
and
a
non-production
vault
instance
both
up
and
running
they're
vpc,
peered
they're
in
their
own
special
gitlab
approach.
Sorry,
google
projects
in
their
own
isolated
zones
they're
both
peered
with
their
respective
networks,
so
non-productions
paired
with
things
like
pre
and
staging
production,
is
only
paired
with
production.
A
I've
done
all
the
ci
jobs,
all
the
the
boiler
plate
and
all
the
bolting
together
so
that
merge
requests
to
start
basically
using
vault
could
be
added,
so
my
kind
of
in
terms
of
the
project
management
side.
The
things
I
have
left
to
do
before
my
definition
of
done
for
this
stage
of
vault
is:
I
need
to
hook
it
up
to
prometheus
and
I'm
about
90
of
the
way
there.
A
I
just
need
to
have
a
quick
chat
with
some
some
people
from
the
monitoring
team
just
to
wrap
my
head
around
how
some
of
the
monitoring
stuff
works,
so
I'll
get
prometheus
running
there
I'll
run
up
some
some.
You
know
monitoring
rules,
I
guess
making
sure
vlog
goes
down.
We
know
about
it
and
then
I
will
do
run
books
so
how
you
know
if
you
get
an
alert
about
volts
or
whatever,
how?
A
What
do
you
do
and
then
more
or
less
I'm
considering
the
first
part
of
vault
done,
and
I
will
do
a
readiness
review,
which
is
purely
just
to
sign
off
of
the
architecture
of
vault
and
the
setup
I
have,
and
you
know
how
it
works
and
the
run
books
and
the
monitoring
get
consensus
from
the
team
that
you
know
we're
happy
with
that.
There's
nothing
outstanding,
there's
nothing
else.
A: It's dead simple to use, but integrating it with all of our services, CI jobs, helmfile, Chef, is still an unknown quantity at this stage, and that kind of work will be parceled off into different epics, like how do we use this with Chef, or how do we use this with Kubernetes. So more or less, once I get the readiness review done, I'm pretty much happy for people to start using it.
A
The
you
know,
there's
a
git
labs,
configs
bucket,
where
we
stick
some
random
configuration
data
for
like
cloudflare
exporter
and
some
other
kubernetes
services.
We.
A: They're not used by Chef, they're just all by themselves, and there are about 10 files in there with a few lines each, and they're only used at CI-job time to create those secrets. I think that's a prime candidate for just switching over, because in the helmfile example you simply replace that whole GKMS step, those really long, awful lines we have at the moment to grab the bucket contents and decrypt them.
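(A sketch of the swap being described; the bucket, keyring and secret paths are hypothetical:)

    # before: fetch the encrypted file from GCS and decrypt it with GKMS
    gsutil cat gs://gitlab-configs/cloudflare-exporter.yaml.enc |
      gcloud kms decrypt --location=global --keyring=gitlab-secrets \
        --key=gitlab-configs --ciphertext-file=- --plaintext-file=-

    # after: a single read from Vault
    vault kv get -field=values k8s/cloudflare-exporter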
D: Just thinking out loud, Graham, before I disconnect, so just humor me here: if we think about maybe putting our staging cluster in there, just to, well, test it in this hybrid, weird environment, would it work, given the Chef dependency?
D: Okay, oh, that sucks, because I was hoping that before we start putting some of this really public-facing traffic on it, right, like the majority of the traffic we have, we could do something with Vault as well, because I'm afraid this is going to be a big change as well.
A
Yeah,
I
think
the
key
is
is
as
soon
as
I
kind
of
as
I
said
done
that
first
part
and
that
that's
ready
to
go
is
we
tackle
the
chef
problem?
But
if
I've
looked
at
the
work
you
did
job
with
gkms
and
the
current
setup.
I
think
we
just
create
another.
We
extend
that
ruby
code
because
you,
you
wrote
it
in
a
pretty
good
fashion,
where
it's
like
extensible
right,
so
it
can't
be
just
extend
another
class
there,
which
basically
just
calls
vault
get
key
or
whatever
it
is
instead
of
gkms,
but.
E: It's just another backend for the gitlab-secrets cookbook; right now it has chef-vault and GKMS, so it's another shim that you could just add. I would say, I don't know if it's worth spending too much time on this, because when we move all of the front end over, the only things left using Chef secrets will be Postgres and Redis and Gitaly, and, yeah, maybe.
E
Maybe
it's
worth
transitioning
that
to
vault
as
well,
but
yeah
I
mean,
maybe
maybe
it
would
be
worth
just
time
boxing
to
see
how
tricky
this
will
be.
I
I
guess
like
I.
I
also
don't
want
to
give
each
right
now
we
have
the
problem
where
every
single
node
in
an
environment
has
the
keys
to
the
kingdom
for
that
environment
and
I'd
like
to
go
about
it.
The
right
way.
E
This
time,
where
we
limit
access
to
secrets,
it's
a
bit
tricky
with
omnibus,
because
omnibus
is
like
grouped
all
together
and
there
aren't
a
lot
of
secrets
that
can
be
separated,
but
like,
for
example,
I
think
giddily
is
a
good
example.
Getaly
doesn't
need
access
to
the
postgres
database,
but
it
still
has
access
to
the
postgres
secrets,
and
this
is
something
that
I'd
like
to
try
to
segment.
If
we
can.
A: One thing to mention: there was a ticket floating around, and I'll see if I can find it again, about making Omnibus, so the gitlab.rb file, have native Vault integration. In the gitlab.rb you would write something like vault:/ and a path, and then, if you run a Vault agent on the machine, the Vault agent talks to the Google metadata server to get the machine role and gets the policy and everything, and therefore we don't solve it in Chef.
D: I'll tell you right now: it ain't gonna be as simple as extending the cookbook, so it's gonna take way more time.
E: Cool. What's the latest on TLS for Consul? Is that blocked?
A: ...for US and Europe, because that's the lowest point of the week, and I'm extremely nervous; I'm really scared about causing an outage. So I want to try and at least do it at a time when it's not going to impact people. I'm still going to do a little bit more testing and stuff over the next few days.
E: Okay, so I guess the point where we might run into problems is the point where we turn on TLS verification, right? I think up until that point it doesn't matter that the certificate is completely wrong; without TLS verification set to true, I think things will work. So the tricky part, then, is turning that flag on on the master Consul, yeah.
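(The Consul agent settings in question look roughly like this; the file paths are placeholders, and the exact set of options used here is not confirmed by the discussion:)

    {
      "verify_incoming": true,
      "verify_outgoing": true,
      "verify_server_hostname": true,
      "ca_file": "/etc/consul/ca.pem",
      "cert_file": "/etc/consul/consul.pem",
      "key_file": "/etc/consul/consul-key.pem"
    }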
E: It's just that it was kind of, you know, we use Consul so it can dynamically change when we have failovers. I guess that's an option if you're feeling super nervous, or at least maybe we should keep that in our pocket as a way to quickly recover if...
A: Maybe. I'll look into that, and I think definitely as a rollback; I might put it in as just the first rollback step, to stop the bleeding.
A: As long as I turn the nodes over very slowly, one at a time, and I'm really making sure every single node that's alive is connecting again, I think we're probably okay. The good news is it should either work or not; it's not like it'll work for 30 seconds and then stop. It shouldn't.
E: And how will it work? When you turn on TLS verification on the server, do you do it not on the master Consul but on the other Consul servers first? Because as soon as you turn on TLS verification, that's going to restart Consul, and then clients will try to re-initiate a connection and it won't work, right?
A: Which will essentially initiate a failover, so I will be able to at least determine if all the other nodes can connect. I think that's a good sign. Before I do the leader: as long as all the other non-leaders still work, then I think that's okay, and then I can just target it and roll it out to the different nodes.
E: Yeah, what I'm unsure about is whether turning verify on and doing a reload causes connections to drop and reconnect; I don't think it does, it might not. So if it doesn't, then those old connections will be fine, and when they reconnect they'll use TLS. But also be careful, because I think our Chef cookbook might do a hard restart on all config changes; I'm not sure if it does a reload.
E: It does a reload. Okay, that's good, so maybe this is fine, then. If the reload doesn't force connections to reconnect, then it should really just be a matter of turning on TLS verification everywhere and then forcing a reconnect, or just waiting, yeah. But I guess before we close out the change request we should probably do a hard restart of Consul, just to make sure. Yeah, cool.
E: If you need another look at the change request, just ping me.
A: Yeah, I appreciate it. I'll definitely go back and have a closer look and see what options we have, just better things to do in case of failure, and then I might update it to make sure we're covering what we talked about.
A: One is that all of our monitoring broke. Basically, all the dashboards stopped working, because right down at the bottom of the release notes, not even under the "please look at this" section, just one of the lines down the bottom, was that they dropped the Prometheus labels that we were using: container_name becomes container, pod_name becomes pod. So everything we had just kind of broke. And the thing is, this has been running in staging for weeks, and the dashboards were broken for weeks in staging, and...
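(The rename means queries and dashboards built on the old cAdvisor labels return nothing; the query itself is just an illustration:)

    # before Kubernetes 1.16:
    sum(rate(container_cpu_usage_seconds_total{container_name!="POD"}[5m])) by (pod_name)

    # from 1.16 on, the same query needs the renamed labels:
    sum(rate(container_cpu_usage_seconds_total{container!="POD"}[5m])) by (pod)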
E: Right, and no one noticed. These are the mixin dashboards, right, the kubernetes-mixin dashboards, which are always in a half state of brokenness regardless? I think we're the only ones looking at them. I'd like to just deprecate those dashboards altogether and write our own; it's not very...
A: I've already fixed them and put them in already, but yeah, I'm just surprised I didn't notice it sooner. The only other thing is there was a pod in the kube-system namespace, basically a GKE pod that they provide us that just sends some events to Stackdriver, that was crash-looping.
A
The
only
thing
I
could
find
on
it
is
people
complaining
to
google
about
it
and
google's
saying
that
it's
fixed
in
a
newer
version
of
six
1.16
of
kubernetes
that
came
out
three
days
ago,
so
that
pod
is
still
crash
looping.
Once
again
it
was
in
staging
and
crash
slipping
for
weeks.
It
doesn't
serve
any
real
purpose
for
us
because
we
use
a
different
login
structure,
but
I
am
going
to
look
at
like
upgrading
and
fixing
that
problem
some
point
in
the
future
and
it's.
E: Yeah, we discussed it early on and we weren't really sure, so we kind of went with the safer approach, but maybe you're right. I think auto-upgrading the nodes... and when you upgrade a node, that will actually bring down the node, right, and bring it back...
A: Up again, yeah; it does the whole thing, spins up a new one.
E: So for Sidekiq it's definitely safe, and I think it should be safe for everything. I'm always a little bit more worried about git ssh and git https, because you have these very long-lived connections; sometimes these clones take a lot longer than your typical web request.
B: Is there anything we need to do around that stuff then, Graham? Like, should we just check things are still working, or is there any work we need to schedule?
A: Yeah, so I'm pretty confident that besides the niggly bits I found, everything else should be fine. We've got pretty good monitoring coverage, we're getting no user alerts and no customer issues, so I'm pretty happy with that. Okay, I've already fixed the monitoring, so the monitoring is fixed. As I said, there's this other pod problem, but that's not a major issue in any way, and in fact it looked like Google dragged their feet on it, so I'm not really too worried about it.
C
E
Are
the
are
the
the
kubernetes
mixing
dashboards,
which
is
like
the
grab
bag
of
all
the
miscellaneous
stuff?
Those
are
all
broken,
then,
as
well
or
or
most
of
them
are.
I
haven't.
I
use.
I
use
a
couple
of
them
sometimes
like
like
when
I'm
looking
at
nodes.
For
example,
let's
take
a
look.