From YouTube: 2021-04-26 Multi Large Working Group Weekly
A
Okay, good morning, good afternoon, good evening, everyone. Today is April 26th. This is the Multi Large working group weekly sync, so let's get started with the agenda. The first item is from me: service discovery sometimes fails inside of Kubernetes. That issue is pretty close to being resolved. The solution was identified, verification in staging was promising, and it will be rolled out to production. Amy, do you want to verbalize your comment?
B
Yeah, so just to repeat that: we're aiming to roll this out to production this week. As it stands right now, I think we need it fully in production to see the errors disappear. Hopefully this will solve them, but if it doesn't, it's still a good change because it simplifies the architecture. So we're very confident that we want to roll this out, and hopefully it will also fix this problem.
B
We believe so. We've had a couple of incidents over the last few months where it's difficult to say for certain, but yeah, we suspect this issue caused them. That's really why we don't want to push the API service out to production until we fix this: given the unknowns and the increased amount of traffic we'll be pushing through, it feels like if it is causing a small number of incidents now, that would become a larger number of incidents.
B
I believe it's always DNS. What we suspect it's related to is that, with pods coming in and out, things can't talk to each other. So we're going to try to lock those together.
B
I think it's always a DNS failure, but I think it's simply related to things starting to talk to each other and then a pod rotating out, which is normal in Kubernetes but not ideal in this case. So the simple fix we're going for is to keep things within their node pools, which hopefully means that if one thing is up and it starts talking to a second thing, they're either both still online or they both rotate out together.
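The fix being described here, pinning cooperating workloads to the same node pool so they rotate out together, might look roughly like the following. This is a minimal sketch assuming GKE-style node pool labels; the names and image are placeholders, not the actual gitlab.com manifests.

```yaml
# Pin a Deployment to one node pool so its pods come and go together
# with the peers they talk to. The label key assumes GKE
# ("cloud.google.com/gke-nodepool"); adjust for other providers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a                # hypothetical service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: pool-1   # same pool as its peers
      containers:
        - name: service-a
          image: registry.example.com/service-a:latest   # placeholder
```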
F
Think of the problem kind of like a shift change, right? If your shift changes don't align with when your managers change shift, somebody just loses a message. So it is DNS, but it's basically because somebody went to talk to somebody who left five minutes ago for lunch, and the new manager doesn't come in for an hour.
B
Yes, thank you. So, thanks to Distribution, the last known blocker that we had, the notification secret only being populated during configuration, is now resolved, and we are now working towards getting some API service traffic running into Kubernetes on canary.
B
Now, unfortunately, we found a difference that hasn't been trivial to replicate. We think we have a fix that we're working on, but we've gotten dug into request buffering here. The short version at the moment is: we are working on it, we think we know what we need to do, and we don't need any extra help.
B
We don't believe so, and I hope it'll stay that way, but if it does come up, there's potentially a question around the investigation. Stan has pointed out there's possibly an inconsistency in omnibus request buffering, so that's probably the only question mark at the moment. If that does turn out to be the case, we may need some development help just to confirm whether that's expected and why we have it like that. What we're focusing on right now is just trying to replicate the same setup.
B
Presumably
it
seems
to
work
so
we're
going
to
try
and
get
the
same
setup
running
on
kubernetes,
so
we
can
push
forward.
So
I
will
I'll,
probably
just
give
an
async
update
on
this
in
the
site
channel
once
we
hear
more
on
this,
but
for
now
that's
kind
of
where
we're
focusing
our
time.
A
Cool, so if you need development help, it looks like that would be the Distribution team. Is that correct, Stan and Jason?
F
It largely comes down to the idiosyncrasies between how nginx functions on a VM versus how it functions inside of the nginx ingress controller. The omnibus can fully orchestrate the content on a VM; that's not the case in the nginx ingress controller. So it's going to come down to how we configure the ingresses in the right way, get the right configurations in place, and test them.
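For context, this kind of per-service tuning is normally expressed through ingress-nginx annotations rather than edits to the rendered nginx.conf. A hedged sketch follows; the annotations shown are standard ingress-nginx ones, but the host, names, and port are placeholders, and the actual settings in question would come out of the investigation.

```yaml
# Request buffering tuned per Ingress via ingress-nginx annotations,
# instead of direct nginx.conf edits that bypass config validation.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webservice                     # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"  # stream bodies
    nginx.ingress.kubernetes.io/proxy-body-size: "0"            # no body-size cap
spec:
  rules:
    - host: gitlab.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webservice
                port:
                  number: 8080
```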
F
Some
of
those
some
of
the
workarounds
to
to
make
direct
to
edits
also
mean
zero
testing,
which,
if
you
bork
it,
then
it
is
instantaneously
breaking
all
of
nginx,
because
the
configuration
is
invalid.
So
we
have
to
figure
out
if
there's
a
good
way
to
do
it,
what
we
can
improve
if
we
can
do
anything
upstream.
A
Okay, let's move on to number four.
G
Yeah, I guess I can voice this. Pages currently doesn't depend on the NFS in production: we're not serving from it and we are not updating content on it. There are some issues, or rather production changes, to actually get rid of NFS, first on staging and then on production, and first on the Pages servers and then on the Sidekiq servers, plus one change about flipping the config flag in production.
A
Yeah, thank you, Vlad. I will work with Branch to see if we can get some SRE help here to hook up with you, so stay tuned. Right now I know that the SRE team is largely helping with the PG12 upgrade, but I will try to find someone to engage with you so that we can keep this moving.
A
Okay then, what's happening next? Over to you, Amy.
B
Thank you. So yeah, just to reiterate, really: our plan is that once we've got the API service up and running, we'll be shifting over to working out how to migrate the web nodes into Kubernetes.
A
Yeah, discussion topics. Jason, you're typing, but do you want to just verbalize?
F
Yeah, so I've seen a couple of issues recently regarding disabling the Redis key watcher for some of the Workhorse instances.
F
Specifically, they seem to be related to production and the load on Redis, because there are some Workhorse instances that don't need to be doing this functionality. Do we know exactly what the priorities on this are, so that I can make sure my team gets involved? There's some funniness with how we do it in omnibus versus how we would do it in the charts, and I just want to make sure we end up on the same page.
F
Yes, we do have issues; I'm literally trying to find them. All of a sudden my brain went, oh, I should ask about that.
E
I'm asking because, yes, we can definitely disable that, but it also seems like we're going to be running different Workhorses in production. So the question is whether we'll break other functions at some point because we disabled it now, when it's too late. Can we somehow optimize that so it's simply not a problem?
F
Right, and the proposed method right now is literally: if you say not to use the key watcher, that disables all of the Redis configuration.
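For illustration, the kind of split being discussed might look like the values sketch below: separate Workhorse-fronted deployments for web and API traffic, with the key watcher left on only where CI long polling needs it. The key names here are illustrative of the proposal, not confirmed chart settings.

```yaml
# Hypothetical chart values: per-deployment control of the Workhorse
# Redis key watcher, instead of one switch that drops all Redis config.
gitlab:
  webservice:
    deployments:
      web:
        workhorse:
          keywatcher: true     # web traffic serves CI long polling
      api:
        workhorse:
          keywatcher: false    # API-only pods skip watching Redis keys
```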
E
I mean, this is one of the ways, but first of all it feels pretty risky, to be honest, since this is a requirement of Workhorse right now. So maybe I can help with ideas for how we could optimize that, but I think from my perspective I would rather aim at fixing Workhorse than removing or disabling Redis.
A
Okay, let's follow up on the issues to see which team can actually fix those Workhorse issues. I know there is one nominal team owning Workhorse right now, but they probably cannot fix everything about it, so we'll find the owners, the team who can work on each individual issue. So, Jason, please put the issues here, and I will look into them to see where they fit.
C
Yeah, there's an open question of when we collectively think we should start working on the next service to migrate over, which I believe, after the API, is Gitaly. We've had some anecdotal feedback from customers that it hasn't necessarily performed well or been optimized to run in Kubernetes, so I'm kind of wondering.
C
Given
that
knowledge
and
given
that
we
think
it's
the
next
service
to
sort
of
shift,
when's
the
right
time
to
start
spending
some
more
time
here
with
the
giddily
road
map
and
distribution
and
and.
F
We have customers with, you know, 500 users that have no problem, because the way their workflow patterns hit the Gitaly instance doesn't overload it. We have customers who have a hundred users and 1,500 CI jobs, which just destroys Gitaly, and it doesn't matter whether it's on a VM or in a container in Kubernetes: it just smashes the poor thing to bits.
F
So
when
we
go
to
move
us
into
kubernetes
for
italy,
we
have
to
know
what
those
workflows
look
like
how
to
control
the
load
in
gita
lee.
How
to
make
sure
we
understand
what
patterns
say.
I
need
to
have
gillies
that
have
more
room
vertically
or
horizontally
and
how
to
handle
those
things
right
now
we
have
no
definition
of
any
of.
B
Are there any kind of deadlines that we're working towards with this? Obviously I know we want to do it and we'd like to do it as soon as possible, but is there any kind of business date that we're also aiming to hit?
C
I don't think so, specifically. The reason I'm asking is so we don't get blocked on the migration, right? This could take some time to resolve, and it'll probably require Gitaly, infra, and Distribution all working together here in some way, shape, or form. The more teams you get involved, the longer it tends to take to fix things, so I'm trying to think ahead here so we don't get blocked for too long. And then also...
C
It's the last stateful service of GitLab, I would say, that runs in VMs, so from a customer standpoint and an architecture standpoint it would be nice to have it all running in Kubernetes. Then you could have your Redis and your database in managed services like RDS and ElastiCache running in AWS, and everything else runs happily in Kubernetes. That would be a great state to get into.
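As a rough illustration of that hybrid end state, the Helm chart can already point at external stateful services. A minimal values sketch, with placeholder hostnames and secret names:

```yaml
# Hybrid layout: stateless components in Kubernetes, state handed off
# to managed services. Hostnames and secret names are placeholders.
postgresql:
  install: false                 # use RDS instead of in-cluster PostgreSQL
redis:
  install: false                 # use ElastiCache instead of in-cluster Redis
global:
  psql:
    host: gitlab-db.example.us-east-1.rds.amazonaws.com
    password:
      secret: gitlab-postgres    # pre-created Kubernetes Secret
      key: password
  redis:
    host: gitlab-redis.example.cache.amazonaws.com
```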
C
I guess the indirect business impact would be moving the cloud-native installation maturity. Josh, is that kind of what you're thinking too? Yeah, I think the maturity of the category is one, but also just the customer impact: we're considering recommending the hybrid architecture for folks over the pure Helm chart install, and one reason is that we don't recommend running Gitaly in Kubernetes right now. So that's what we would recommend customers do instead of the pure Kubernetes install.
B
So, from a roadmap point of view, assuming everyone's slightly busy, what's the first sensible time for this? I'm going to assume there are some bits at the beginning where people need to sit down and bash their heads against it for a bit, so we can work out what we might actually need to solve. Is that the sort of thing we can schedule in?
C
Yeah,
I
think,
to
that
end,
we'll
have
to
likely
do
some
fact-finding
and
investigation
spikes
to
really
get
a
handle
on
what's
actually
happening
here.
We
could
wait
for
infer
to
sort
of
be
ready
to
pick
it
up,
and
then
we
could
run
some
like
a
small
shard
in
kubernetes
and
that's
one
way
of
doing
it
or
we
can
try
and
staging,
but
we,
I
think
you
know
jason's
comment.
We
kind
of
see
this
when
we
get
more
load,
so
we
might,
I
don't
know
if
I'm
trying
to
use
the
processing
environment.
C
Let
me
just
open
up
the
questionnaire,
because
it's
I
think,
the
next
next
stage,
and
so
I
think
we
likely
would
involve
the
giddily
team
mark,
I
think
you're
on
maybe
infra,
although
if
we
can
figure
out
a
way
to
generate
load
and
generate
the
failure
scenarios
without
infrared,
we
could
potentially
do
it
without
a
euro
map
being
impacted,
yeah.
D
You know, we don't have this slotted in currently, so we'd probably be looking at July at the earliest. Amy, correct me if I'm wrong there. But I was wondering if maybe Gitaly could pick up some of the testing in staging, where they can help discover some of the gaps.
H
I mean, yeah, I think it's something we should look into. I'm currently trying to figure out prioritization over the next few months. I know that we're doing some performance work right now and trying to get that scheduled in, so I'm very happy to talk about what we might be able to help with ahead of time.
H
That
would
smooth
the
transition.
If
that's
something
we're
looking
to
do
right
now,
we're
unclear
what
won't
work.
So
it's
sort
of
one
of
those
things
where
we're
gonna
have
to
try
it
and
then
sort
of
you
know
peel
the
onion.
We're
gonna
have
to
figure
out
what's
going
on,
so
I
don't
know
if
we
even
have
proven
it
doesn't
work.
I
think
it's
one
of
those
things.
That's
right
now,
somewhat
hypothetical
and
we're
in
the
process
of
trying
to
investigate
you
know
what
will
happen
when
we
actually
load
it
up.
C
Yeah, I think the next step is getting that test environment set up and then actually driving load to it and seeing what happens.
C
With, you know, resource limits set and things like that, and then just seeing what happens.
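A sketch of what that experiment might look like in chart values: give Gitaly explicit requests and limits so the load test shows how it behaves at the ceiling instead of freely consuming a whole VM. The numbers are illustrative only, not a recommendation.

```yaml
# Gitaly with explicit resource bounds for a load-testing experiment.
# Figures are placeholders; the point is to observe behavior at the limit.
gitlab:
  gitaly:
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 16Gi           # watch for OOM kills under CI-heavy load
```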
C
What do we think as far as who can pick this one up? That might define the backlog.
A
Is that the building of a test harness? That's what I'm thinking of.
C
I think you are. We have a test harness for GitLab where we can drive traffic as part of our performance testing, the framework that Grant has done. I'm not totally sure, but I think we should already be able to drive Gitaly traffic with it, I would think.
H
I mean, I can start asking questions; I just want to make sure I ask the right ones. So if there's anything specific people want to get answers to, I'll happily contact Grant. I know the reference architectures are working with Gitaly Cluster right now, so if that's what we need, then we have that; if we need something in addition, or there are specific features we need to make sure function correctly in that reference architecture...
C
So yeah, we haven't tried this yet, I think, with Grant's performance tool, to see what happens. So the easiest next step might be to see if we can get someone from Quality, maybe, to run it and just see whether it blows up or not.
C
So that's my next step. I think the decision point here is whether someone from Quality or an engineer runs it; this is the GPT, the GitLab Performance Tool, and I think in general Claudio makes the most sense, being the most familiar with it. Then the question is whether Gitaly or Distribution does it.
C
There's the Distribution point here, and we can take this async if we want to, but I think that's the next step: figure out who wants to run this test case in the test scenario and see what happens.
A
Okay, we are almost at time, and that's all the topics we have.