From YouTube: 2021-04-01 GitLab.com k8s migration EMEA
Description
No description was provided for this meeting.
A
Starbuck's got some; I don't know if he's going to demo it, but he's going to give us a run-through. Hopefully we can use this recording to help Henry catch back up again next week. So he's going to give a bit of an overview of where we are and what comes next.
B
So what happened after the incident, after I left?
A
Yeah, so Europe managed to pick and has kicked off the packaging in parallel, so fingers crossed we should have everything ready in an hour.
D
Otherwise, yeah, I didn't really have anything to demo, so I thought I'd share a little bit of where we are and what we came across this week, just to show how difficult it is to do a migration of a service.
D
Rollout-wise I was not prepared for this meeting, I'm sorry.
D
So the first time we attempted to enable staging, we appeared successful initially, but we were getting a lot of QA failures.
D
With traffic to staging we ended up getting a lot of error messages, so it was immediately pulled. Where's that thread... yeah, this is where Jarv helped me out.
D
We discovered that we were missing a configuration item inside of our GitLab configuration. This was something that I missed despite the fact that I had done an audit to go through the configuration. I missed an item, unfortunately, that required an update to our Helm charts, and then, as I made the necessary updates to our Helm charts, I introduced two bugs into them, so I had to fix both of those in order to get us past this point.
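For reference, a minimal sketch of the kind of pre-rollout check that can catch this class of miss; the release name, chart path, and values file below are placeholders, not the actual ones used here:

    # Render the chart locally with the target values to inspect what would actually ship.
    helm template gitlab ./gitlab -f values-staging.yaml > rendered.yaml

    # With the helm-diff plugin installed, compare the pending upgrade against the live release.
    helm diff upgrade gitlab ./gitlab -f values-staging.yaml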
D
Rack
attack
was
showing
up
a
lot,
but
after
I
dig
further
into
this-
and
I
don't
know
if
I
documented
this
properly-
I
did
so.
I
discovered,
after
a
larger
situation
that
we
are
hitting
rack
attack
quite
frequently
we're
on
a
constant
basis,
but
we
just
had
a
spike
when
the
api
was
first
enabled
in
kubernetes.
D
So we could see the drop-off from our virtual machines and then the same amount picked up in Kubernetes. The difference, though, was that I noticed traffic was coming from all over the place when we were on our virtual machines, but our API was only seeing traffic from the load balancers when it started taking traffic in Kubernetes.
D
So Graeme helped me with this. Here's another proof where I was sending a curl request: you can see the requests coming in, and we only see our load balancers.
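For reference, a minimal sketch of that kind of check; the hostname and Omnibus-style log path are illustrative, not necessarily the exact ones shown in the meeting:

    # Send a request through the edge, then check which client IP Rails recorded.
    curl -sI https://staging.gitlab.com/api/v4/projects > /dev/null

    # If only load-balancer addresses show up here, the real client IP is not being
    # forwarded or trusted on the new ingress path.
    grep -o '"remote_ip":"[^"]*"' /var/log/gitlab/gitlab-rails/production_json.log | tail -n 5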
D
We ran into an issue where Geo is pounding Sentry with errors; it was determined that this is just due to a data situation. I reached out to the Geo team for some assistance in evaluating this particular error, and we just have bad data on staging for Geo. I also found a fun error where we're trying to take an exclusive lease, but apparently there's one already in execution.
D
Apparently we have some sort of a code situation. Robert helped me out with this: we are reusing a set of code that's specific to Sidekiq jobs and that also handles specific user requests to the API. If you were to make a change to your user in some way, shape, or form, it touches the same code.
D
So this is an error message that we see quite often, but we can safely ignore it, which I'm not really too thrilled about, so I'm opening an issue to hopefully get that looked into and remediated. I'll talk about this one in a second. What was this new error? I forget what this was... go away.
D
There's a foreign key problem; this is probably another situation, I deemed, where you have bad data again in staging, so I'm ignoring that. And the metrics, as you can see from this chart: we just lost our metrics when traffic was fully shifted over. This was also fixed by Graeme. So Graeme is, you know, you're...
D
But,
as
you
can
see,
the
traffic
just
kind
of
our
metrics
just
dropped
off
when
we
shifted
traffic
over
to
kubernetes,
so
graeme
again
fixed
this.
Oh,
I
spun
up
an
issue
and
grain
fixed
that
as
well.
So
now
we
are
seeing
trash
or
metrics
in
our
api
now,
which
is
wonderful.
So
if
I
go
to
the
last
say
three
hours,
we
have
our
aptx
data
in
our
low
bouncer
abdex
and
error
ratios,
which
we
were
missing
when
we
first
transitioned
over
so
so
now.
D
We don't have metrics in our Grafana at all, yet we're capturing metrics somehow, even though metrics isn't enabled in our deployment. So I need to look into that. I don't know if this is a blocker for us going into production, but it's something that I want to investigate a little bit. Graeme has already helped me out greatly with this, but I want to make sure that if Consul falls over, we don't start pounding the primary database unnecessarily, because that's our fallback option. Hopefully with the Postgres 12 migration...
D
You know, things will be a lot faster, or, you know, not as saturated, but until then we run the risk of sending 144 pods directly to only the primary database instead of sending read queries to any of the secondaries. Since we don't know how bad of an impact this is, I want to look into this a little bit today, or at least figure out some of the issues that I'm discovering with Consul, see if we can remedy some of those, and gather a list of things that we need to work on, Jarv.
F
There are quite a number of Consul agents, but Kubernetes isn't adding an agent per pod. We have an agent per node; it's a DaemonSet, so we have Consul running per node, and then we make DNS queries to the service endpoint from each pod. So compared to our VM configuration, where we have an agent per VM, I'm sure we've increased the number of agents, but right now I don't believe the problem is between the agent and the server.
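For illustration, a rough sketch of the lookup path being described; the Consul DNS endpoint and service name are placeholders, not the actual setup:

    # From a pod, resolve a service through the Consul DNS interface provided by the
    # per-node agents (8600 is Consul's default DNS port; names below are illustrative).
    dig @"$CONSUL_DNS_ADDR" -p 8600 +short db-replica.service.consul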
E
Yeah, okay, I understood. So the problem that I had... I mean, it was a long time ago, when Consul was very young, but it was exactly the same. We had a Docker installation, so no Kubernetes, and we had something like 30 machines, so not that many, and we started with, I think, a three-master installation. So basically there were no agents; they were just reaching out to one of the three masters, so it was something like 1 to 10.
E
So this was the ratio between machines, so clients, and the number of servers, and basically it kept hanging. The DNS just kept not responding, and so all the applications were failing, because they were not able to figure out where to reach out for things. Then we moved to one agent per machine (but again, this is not Kubernetes) and it was able to handle the load.
F
Okay, yeah, that's something worth exploring. I think, you know, we don't think it's between the client and the server, because we would see the same problem on virtual machines if it was; the virtual machines and Kubernetes are using the same Consul servers. But if we're just simply overloading the DNS interface on the client, then maybe what we can do is... they're not...
F
Yeah, yeah, yeah. That's why I think it's the connection between the application and the agent that's probably getting overloaded, which would make sense, right, that it's happening on Kubernetes. Maybe what we could do here is reproduce this on staging by just blowing out the number of replicas temporarily, just as an experiment, and see if we see the same problem.
F
You know, I mean, that's a really simple thing for us to do without any traffic: if we create the same number of pods on staging as we have on production, even just for a little bit, and we start seeing these errors, that would be a good clue.
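A minimal sketch of that experiment; the namespace, Deployment name, and replica counts are placeholders, and if an HPA manages the Deployment it would need to be paused first:

    # Temporarily scale the staging API pods up to a production-like count.
    kubectl -n gitlab scale deployment gitlab-webservice-api --replicas=144

    # Watch for Consul/DNS lookup errors while scaled up, then scale back down.
    kubectl -n gitlab scale deployment gitlab-webservice-api --replicas=2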
D
I think that's a good idea. We could probably demo that right now, if we could, potentially.
F
And Starbuck, did you look at the resources? You said that, you know, we are doing a good job at monitoring the resource utilization for Consul. Have you looked at that recently, and does it look like we are at the limit? Because maybe this could be as simple as: we don't have enough CPU or memory.
D
I haven't looked into it yet. Graeme also posed that same question, so I...
D
So I started looking into that this morning. I just happened to come across the fact that we're missing a lot of stuff for Consul. It's like we deployed it, but then we didn't care about it; it didn't go through a readiness-review type of situation.
F
We have logging for the Consul server, but for the Consul clients we don't have logging on the virtual machines or Kubernetes; for Kubernetes it's just in Stackdriver, I guess, and...
D
I'm slowly gathering a list of all the things I'm finding that we're missing. Jarv, you had the idea of doing something to the deployment while we do queries. What would be the best way to go with this? Because I could do an nslookup, for example, shove that behind a while loop, and maybe play with the deployment or something. Is that what we want to try to test out?
F
Yeah, I can give you the example of just using dig to do the DNS query on the shell. I was thinking of just doing this in a loop over the course of a day to see if we see the same issue. Okay, yeah, just to see, because I don't know how long it's going to take for us to reproduce this problem on a given pod, across, like, on the application. I was thinking of even doing this on a real pod.
F
In the background, I mean, not in a super tight loop, but something that would at least tell us if this is happening outside the application, like maybe one request every couple of seconds. Yeah, yeah. But I like this idea of expanding staging first. If we could just see this on staging, then you wouldn't have to mess with prod at all.
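A rough sketch of that kind of low-rate background loop, assuming a placeholder Consul DNS endpoint and service name:

    # Log any lookups that fail or come back empty, roughly one query every couple of
    # seconds, so we can tell whether failures happen outside the application as well.
    while true; do
      if ! out=$(dig +short +time=2 +tries=1 db-replica.service.consul @"$CONSUL_DNS_ADDR" -p 8600) || [ -z "$out" ]; then
        echo "$(date -u +%FT%TZ) consul lookup failed or empty" >> /tmp/consul-dns-check.log
      fi
      sleep 2
    done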
F
Yeah, you could do it for the git service. I think we're seeing it there pretty frequently, and if it's something that we see even when we're not taking any traffic, we'll see it on staging. That would at least clue us in to whether this is a problem of overloading the client, which I think is a pretty good theory.
E
So I don't know if we have logging about when this happened, if, I don't know, the primary was changed. I don't think... I mean, a database failover probably would be something that we are aware of, but if, for instance, we had some change in the content of the Consul services around when it got...
D
Yeah, George, let's just do a merge request into gitlab.com and bump up the replica count. Like, just shoving this behind a while loop, I'm not seeing any failures, but...
F
This is... you're doing this on staging? Or... yeah, staging, staging.
D
Easy. And I guess what we would look for at that point is just a log message inside of Rails saying it can't get a response from Consul appropriately.
F
I think, yeah, if you start on this today, just drop me a message. I can follow up on it Friday and Monday evening.
F
I mean, I think, granted, we didn't have rate limiting enabled back then, I think, before we took out nginx, but I'm pretty sure that this was working. We were servicing with nginx for a while, and I was definitely looking at IP addresses. So this thing kind of confuses me as to why we didn't see it there.
F
My preference, I think, is to keep nginx, just to kind of keep as much the same as possible, but it's also a pain to remove it later. So I don't...
A
Which one's safer: removing it now as part of the move, or changing several things, which will give us a...
F
Part of the motivation for removing nginx from git https was that we had this ancient version of the nginx controller, and we were like, you know, we probably don't need nginx anyway, so let's just remove it; there was no proxy request buffering or anything, everything was zeroed out in the config. The API, like, I don't know, it's a little bit different.
A
I
I
think
we
should
keep
it
just
for
a
like
for,
like,
like,
I
think,
there's
a
lot
of
moving
pieces
already.
Unless
we
have
like
it's
going
to
make
something
considerably
more
difficult,
then
I
I
think
it
is
a
good
contender
for
the
post
migration
de
epic.
B
Many hops between all the different services. A question for me would be: do we have a replacement we could do right now that would be easier to do after we migrate, so removing that layer... oh my god, I'm not making any sense. Give me one sec to organize my thoughts. So we have nginx right now on our fleet, and we are talking about possibly removing nginx as a layer when we migrate to Kubernetes, right?
B
But the question I want to ask here is: is there a way for us to remove nginx now, or replace it with something else, that would make it easier for us to remove it later, after we've migrated to Kubernetes?
B
Well, the question here is: if we remove nginx, what do we expect to take that over? Like, what will take over nginx's role? We expect HAProxy to be that, right?
B
But again, I think Amy has a point here, which is the one-to-one, and that is what we should keep in mind, yep. But what I'm trying to add is that what the one-to-one is is up to us to decide.
F
Yeah, it sounds like you're saying, you know, if we're going to remove nginx, we should feel confident about doing it on VMs, which would allow us to change one thing at a time. Right, I think that's sort of... I think that's reasonable, and we could even do this on just canary; we could do it on a subset of VMs.
F
So we would need to look at that again as well, but I think it's a good idea: if we're going to do this, let's disable it on VMs first, instead of after we move.
D
What I'll do is try to bump up the priority of our existing nginx issue, add some notes from this meeting into that one, and then create a new issue to evaluate potentially removing nginx, maybe starting with staging first, obviously, but seeing what changes we need, because I know it's going to involve a few Chef modifications in order to accomplish that work, and we'll see what happens on that front.
B
But keep in mind, this is also because Consul is delaying us anyway, right? So let's try, if we somehow magically can, to parallelize this and do the testing in parallel with the Consul work, to make sure that we can get those things converging to the same point, so that once we come to that same point we can continue with the API migration uninterrupted.
E
I don't know if you mentioned this, I just got a bit distracted by my son, but if I remember correctly, on Omnibus packages nginx has special configuration regarding buffering of requests on the APIs for artifact uploads and things like that.
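Roughly the kind of directive being referred to; this is a sketch of the pattern, not the exact Omnibus template, and the location regex and upstream name are illustrative:

    # Disable request buffering so large artifact uploads stream straight through
    # to Workhorse instead of being spooled to disk by nginx first.
    location ~ /api/v4/jobs/[0-9]+/artifacts {
        proxy_request_buffering off;
        proxy_pass http://gitlab-workhorse;
    }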
B
Jarv mentioned that, but the point there was we want to see whether we can move that layer to an existing service that we already have and remove that addition, right, if possible, which would make it much easier for us to just do the API migration without that additional layer on top, where we know the VMs would work in the same way. That's the general idea behind it. I'm not saying it's easy, but it's the general idea.
D
Okay, aside from me needing to backfill some issues with details and create some new issues, is there anything else that we want to discuss in this meeting?
D
What I would like to do is go ahead and proceed to get something running in canary. That way I can do a final analysis to make sure our metrics, logging, and configuration look okay in production, but I would prioritize what we just discussed over moving into production. So maybe we take traffic in canary, but that's before we go to our main stage, is how I'm thinking right now.
A
Thank you very much, everyone. Speak to you soon.