GitLab Scalability Team, 12 Aug 2020

From YouTube: Scalability: Decouple Puma from Pages NFS

Description

A chat about https://gitlab.com/groups/gitlab-org/-/epics/3980

A

Thanks. So this is a meeting where we're talking about how we're doing on the epic to decouple Puma from the Pages NFS mount. Thanks for doing all the organizational stuff on setting this up, Bob. I made a little agenda, and I suppose we can get started with it. Or are there any questions before we start with the agenda?

B

A

Yeah, sorry, I'm not used to running meetings like this or being recorded. It feels a little weird, but I felt like I had to say something. Yeah, I mean...

B

We can keep it fairly informal, I think.

A

Yeah, I agree, but the reason I want this to be recorded is that this meeting originated within the scalability team, and the scalability team is helping the release team, when the release team owns all this stuff, and...

B

There's nobody here from the release team.

A

Exactly. So, at the very least, I want people from the release team to be able to look at this. But I thought that, because there's a narrowly defined thing that the scalability team is working on right now, it made sense to just talk within the scalability team, and Igor, about this narrow thing, and that is the part...

A

The thing where we look at the code in the Rails code base that touches the Pages NFS drive and try to move it all into Sidekiq.

B

So some background on that: that's because we want to move the Puma workers into Kubernetes, and that...

A

Can't touch NFS.

B

But we still have more stuff touching NFS in Sidekiq, and we will have a single bunch of VMs touching NFS. That's the goal. Yeah.

A

Yeah, exactly, we're trying to shrink the number of VMs that need NFS. And there's a separate, large project going on in Pages to make sure that Pages can work without NFS at all, but that is much bigger in scope. Well...

B

You did a proof of concept for that, didn't you?

A

Yeah, well, I'm not the only one who did a proof of concept for that. The people working on this still need to decide which way to...

C

A

And then there needs to be a bunch of work done in Pages to make that work, a bunch of work in Rails, and all the time we need to think about what the migration would look like. But before we even migrate, we need to get it in production for some sites and see if that works.

A

So that's a big project, much bigger in scope than what we're talking about today. One of the reasons I wanted to talk today is that this thing got handed at first just to me, and I tried to make sense of it, and I thought: okay, let's just set up an epic and organize the work. Then Bob kindly started helping me, and now Bob has probably done as much work on this as I have, and I just wanted to reflect for a moment.

A

My question for you, Bob, is: if you look at this epic and you look at what we're doing, do you think this is going to work?

B

So if the epic is complete, then I think so. The only one that is a little bit of a worry to me is the issue...

B

Titled "Find all code that uses Settings.pages.path", because I don't know what that might be right now. And I just saw that we have the pages update worker, and now we have the new pages update configuration worker that writes to NFS when Pages are updated from a project, like when a domain is added, or when domains change, or when something else in the config changes. Are we keeping tabs on what is touching NFS somehow?

A

So, not yet. What I think happened is that Vladimir, who knows this code well, made a list of things and said: well, this is definitely stuff we'll have to change. Then I looked at that and did some grep on the code base and thought: yeah, okay, that looks like stuff we have to change. So we made some issues for that.

A

Then we stuffed those in the epic, but we don't know if that is everything, and it's better to have some sort of way to check that we're not using NFS before we just unmount the volume. So yeah, we don't know what's going to come out of that.

A

Well, in a way we do know, because we have the thing in Sentry now. One thing we can do, that I haven't done yet but that one of us can do, is look at that Sentry thing, go through all of them, see where each one is coming from and whether it's already covered by an existing issue in the epic, and if not, create an issue. Yeah.

B

Well, up to now, quickly scrolling through, I found one, so there's going to be some. But I think the ones that touch the configuration are already going to clear up a lot, because the biggest...

A

Yeah, that's the thing. The other approach we can take is to postpone this a little bit and try to get a couple more things in, because then a lot of these should disappear.

A

But I guess it depends on whether we can keep ourselves busy enough. If we're busy enough working on known issues, then we can just do the known issues and see the list of Sentry things dry up, and if we're running out of known issues, we can look at the Sentry issues and say: hey, there's a new one. And if you already see a new one right now, Bob, then I think it makes a lot of sense that you create an issue for that.

A

Now that you found it already.

B

I'll do that after this call, but yeah, there's going to be some.

A

Go ahead.

C

Another thing may be worth considering. Maybe the Sentry stuff already covers everything, but just as an extra sanity check, we could also use some tracing on the syscall level, or file system level, to check that none of those write calls are happening anymore.

A

So yeah, we could. I think I have a fairly high degree of confidence in... well, okay, it's a good question. When I think about it, I think that unless the code does something insane, the thing we built should catch everything. But of course that has the start of the sentence "unless the code does something insane", which I cannot 100 percent rule out. That sounds...

C

A

Yeah, so what could the insane thing be? One thing I think we know for sure is that the directory where Pages get stored is configurable, and if anything does not obey that configuration, it would have broken for somebody somewhere who uses a funny directory. Of course it could be that nobody uses a funny directory and we would never have heard about it, but I think Omnibus probably uses a funny directory.

A

We can check this, but I actually remember from a long, long time ago, seven years ago, that GitLab was full of things which were configurable, but we were so scared: we didn't know if all the configurable directories worked. So we actually had this big banner in the installation guide where we said: put GitLab in /home/git/gitlab, and if not, then it may not work and it's your fault, because things might look like they're configurable, but maybe they use /home/git/gitlab.

A

So just don't even change the user, change nothing. That was sort of the state of things seven years ago, and then when we did Omnibus we had to say: no, we're going to put a whole lot of things in different places, and that has to work. So we got better at that over time.

A

So I'm assuming that nothing does something silly like: well, let's say the current working directory is the GitLab directory, and then we go to shared/pages, and that must be the Pages directory. I think anything that does that would be completely and utterly broken.

A

Another way we could have something insane is because of the shared directory. I don't know if the shared directory is accessible through config, but something might assume that there is a shared directory and that Pages is located relative to it, so some code could derive the path to the Pages directory via some other route. But this sounds ridiculous when I say it out loud, so it...

C

Seems unlikely.
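
The distinction A describes, between code that obeys the configured Pages directory and code that assumes a location relative to the working directory, can be illustrated with a minimal Ruby sketch. `PagesSettings`, both helper names, and the example paths are hypothetical and not taken from the GitLab code base; only the idea of a configurable directory comes from the discussion.

```ruby
# Hypothetical stand-in for the application settings object.
PagesSettings = Struct.new(:path)

# Obeys the configuration, so it works for any install layout.
def configured_pages_dir(settings)
  settings.path
end

# The "silly" variant: assumes Pages lives under <cwd>/shared/pages.
def assumed_pages_dir(cwd)
  File.join(cwd, 'shared', 'pages')
end

settings = PagesSettings.new('/mnt/pages-nfs') # assumed custom location
cwd = '/home/git/gitlab'

puts configured_pages_dir(settings)
puts assumed_pages_dir(cwd)
# The two only agree when the install happens to use the default layout.
puts configured_pages_dir(settings) == assumed_pages_dir(cwd)
```

Any code following the second pattern would bypass a check that only watches the configured path, which is why the speakers treat it as the unlikely but not impossible case.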

A

Yeah, exactly. So I'm not really sure. What you're saying makes sense; we can do a tracing check like that.

B

Should we do that at the end, though? Like, yeah, let's...

A

Get rid of the exceptions because that's.

B

Easy and then before we unmount, we check like yes, I.

A

Think that's the right thing here. I think that could be a part of the change management issue, the real thing where the NFS drive goes away. That will be a task for SREs, and from what I understand we'll have, where the...

C

A

SRE who is working on this will create a detailed process with sanity checks, right? And so one of the sanity checks would be: strace the thing and look for this, or something like that. Sounds good, yeah. Okay, so let's say we can defer that until then, because we think it's highly likely that we'll find nothing, because of the current approach. Yeah. Wow, a lot of talking. Thanks for typing, Bob.
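
As a rough illustration of the strace sanity check mentioned here, the following Ruby sketch filters captured strace output for system calls whose path argument lies under the Pages directory. The directory value, the helper name `pages_touches`, and the sample trace lines are assumptions made for the example; a real check would use the instance's configured path and real trace output.

```ruby
# Assumed Pages directory; the real value would come from configuration.
PAGES_DIR = '/var/opt/gitlab/gitlab-rails/shared/pages'

# Return the trace lines that mention a quoted path inside pages_dir.
def pages_touches(trace_lines, pages_dir)
  prefix = pages_dir.chomp('/') + '/'
  trace_lines.select do |line|
    line.scan(/"([^"]*)"/).flatten.any? do |path|
      path == pages_dir || path.start_with?(prefix)
    end
  end
end

# Made-up sample lines in the usual strace output shape.
SAMPLE = <<~TRACE.lines(chomp: true)
  openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 5
  openat(AT_FDCWD, "/var/opt/gitlab/gitlab-rails/shared/pages/group/project/config.json", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  stat("/var/opt/gitlab/gitlab-rails/shared/pages-old/x", 0x7ffd) = -1 ENOENT (No such file or directory)
TRACE

hits = pages_touches(SAMPLE, PAGES_DIR)
puts hits
```

Matching on the directory plus a trailing slash avoids false positives from sibling directories that merely share the prefix, such as the `pages-old` line in the sample.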

A

I'm not good at switching between talking and typing.

B

Me neither, that's why I'm buying it.

A

Okay, so that's good. We're pushing back a little bit on going through the Sentry list, but we'll probably have to. I threw feature flags on the agenda because I noticed that Bob made a change using feature flags, and I wanted to talk about how we feel about the risks we're taking, because I have to admit that my first thought was just: I'm going to make a change that looks good and not use feature flags.

A

But maybe that is because of my bad history of cowboy coding in the early days of GitLab, and I shouldn't do that. What do you think, Bob?

B

I figured that if I add feature flags and try this out, it'll be easier to roll out, because you can do it more confidently. That's why I'm now also getting rid of the write when we don't need to write when changing configuration, because we throw a lot of errors in Sentry, because you started tracking those errors, which is a good thing. And if we now move that into Sidekiq and flip the switch, then we might get alerts like: hey, lots of errors here, and then you...

A

B

Figure out why that was. And now we know, so let's get rid of it, so nobody else needs to look into it at the time, because not everybody will know what we're doing.
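
The rollout pattern B describes, where a feature flag decides whether the configuration update runs on the old inline path or is handed to a background job, might look roughly like the sketch below. The `Flags` class, the flag name `:async_pages_config_update`, and the array standing in for a job queue are all hypothetical; this is not GitLab's feature flag API.

```ruby
# Hypothetical in-memory flag store, standing in for a real flag system.
class Flags
  def initialize
    @enabled = {}
  end

  def enable(name)
    @enabled[name] = true
  end

  def disable(name)
    @enabled.delete(name)
  end

  def enabled?(name)
    @enabled.key?(name)
  end
end

# Returns :async when the update was handed to the (simulated) job queue,
# :inline when it ran on the old code path.
def update_pages_config(project_id, flags:, queue:, inline:)
  if flags.enabled?(:async_pages_config_update)
    queue << project_id
    :async
  else
    inline.call(project_id)
    :inline
  end
end

flags = Flags.new
queue = []
inline_calls = []
inline = ->(id) { inline_calls << id }

puts update_pages_config(1, flags: flags, queue: queue, inline: inline)
flags.enable(:async_pages_config_update)
puts update_pages_config(2, flags: flags, queue: queue, inline: inline)
```

Because the old path stays in place until the flag is removed, flipping the flag off again is enough to back out if the new path produces unexpected alerts.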

A

What would happen with those Sidekiq jobs? They...

B

A

B

Well, I wanted to have the benefit of Sidekiq jobs, that they are retried if something goes wrong, the disk access is slow, I don't know what. I just count on the retry and then you get it done, but because...

A

So the way the code used to work, just for context...

B

Yeah, either for the sake.

A

I don't know if Igor knows this context, and I don't know if the viewers know this context. There's a service class in Rails that is responsible for keeping the config.json file of each Pages site up to date, and if this file changed, it also writes a new nonce, or like a token, to some other file that Pages then picks up, depending on how Pages is configured, because Pages can also be configured to not use the configuration files, but we have to support them.

A

So there's the thing that does the configuration files, and when I started working on this, I realized that that thing swallows all errors: if there's an error, it gets rescued, and then the error string gets returned as a return value from the function, and nobody looks at the return value of the function. So it's rescue nil, basically; the whole thing has a big rescue nil. So we ran into that, and then actually Vladimir said...

A

You know, because I was just rewriting the thing and saying: well, this is rescue nil, so let's make it look like rescue nil and not like it does something with the error, because it doesn't. And Vladimir said:

A

Well, maybe we should track these things. And I thought: hey, that's a good idea. And then we got a torrent of errors, because it turns out that, I think, every time you update the settings of a project, you submit a form, and that form contains a value for the Pages access level, and then in the project update service we decide: oh, something changed about config.json, let's rewrite it. And then, if the project has no Pages associated, the file system says: well,

A

I don't know where to put this config.json, ENOENT, and then you get an error. So we call this code way more often than we thought, and it's erroring out all the time.
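
The error-swallowing shape A describes, and the effect of starting to track the rescued errors, can be sketched in plain Ruby as follows. The method names and the `tracked` array (standing in for an error tracker such as Sentry) are hypothetical, and the sketch simplifies the real service to a single file write; it only shows why a project directory that does not exist produces ENOENT that nobody used to see.

```ruby
require 'tmpdir'
require 'json'

# Old shape: every error is rescued and turned into a return value
# that callers never inspect.
def update_config_swallowing(dir, config)
  File.write(File.join(dir, 'config.json'), JSON.generate(config))
  :ok
rescue StandardError => e
  e.message
end

# Same operation, but the error class is recorded before being swallowed.
def update_config_tracking(dir, config, tracked)
  File.write(File.join(dir, 'config.json'), JSON.generate(config))
  :ok
rescue StandardError => e
  tracked << e.class
  e.message
end

Dir.mktmpdir do |root|
  missing = File.join(root, 'project-without-pages') # never created
  tracked = []
  update_config_swallowing(missing, { 'domains' => [] }) # result ignored
  update_config_tracking(missing, { 'domains' => [] }, tracked)
  p tracked
end
```

In the first variant the failure is invisible because the caller discards the returned string; the second makes each failure countable, which matches the "torrent of errors" that appeared once tracking was added.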

A

So that's the context, and I think what you're telling me now, Bob, is that when you put this thing that errors out all the time in Sidekiq, you decided to re-raise the error.

B

Not just that. If we didn't do anything with the error, like you mentioned, we would be scheduling Sidekiq jobs a lot more than we thought we would, and they would do nothing. Like, they would report...

A

They would get an ENOENT and just be done.

B

So, but because I was reading that, I said: yeah, okay, lots of things can go wrong with this disk access. So let's do the proper retry thing and have Sidekiq retry three times if it can't write the file or whatever, and let's raise that exception.

B

But that would mean that if we schedule this job that isn't supposed to do anything, then instead of scheduling one Sidekiq job and then failing, it would schedule the job three times, retrying. Yeah.
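
To make the cost of re-raising concrete, here is a plain-Ruby stand-in for a job runner with a retry limit. It is not Sidekiq itself, and `run_with_retries` is a hypothetical helper; the sketch assumes that a limit of three means three retries after the first attempt, so a job that always fails runs four times instead of once.

```ruby
# Plain-Ruby stand-in for a job runner with a retry limit.
# Returns the number of attempts made.
def run_with_retries(max_retries)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    # Re-run the begin block until the retry budget is used up.
    retry if attempts <= max_retries
  end
  attempts
end

# A job that can never succeed, like writing config for a project
# that has no Pages directory.
always_missing = -> { raise Errno::ENOENT, 'config.json' }

puts run_with_retries(0, &always_missing)
puts run_with_retries(3, &always_missing)
```

The sketch ignores backoff between attempts; it only counts them, which is the quantity the speakers are weighing.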

A

Yeah, I hear where you're going, so I'm going to be devil's advocate for a moment. Should we really be retrying those jobs?

B

I think so, because otherwise the configuration would get out, like...

A

But the whole thing looks weird, right? It's erroring out left and right, but it was working, and now we're saying: well, we want to retry if we have these errors, but we weren't retrying before.

B

Well, it's also that the job gets run much more often than we thought it would. So every click, like, if the Pages thing gets out of sync...

A

Yeah, but I'm saying the naive thing to do is just to say: well, this thing does rescue nil, it swallows errors. That is a different problem, and we have a Sidekiq job that tries once, and if it didn't work out, then it didn't work out, and that's as bad or as good as it was before we put it in. Exactly.

B

But we'd still schedule one job every time. We sometimes...

A

Yeah, no, sure, that part I'm not arguing with, but I'm wondering about taking code that was ignoring the exceptions and saying: let's stop ignoring the exceptions and let's do retries.

A

I mean, I agree that it seems right, but is it strictly necessary for us to be doing that right now? And if that choice has consequences that create more work, should we maybe go back and not do that? But...

C

A

I'm talking about the retries, and not about the thing that you say, let's not try to write configuration for things that don't have Pages. That's just, like you say, the idea of leaving the place better than you found it, that...

B

A

B

If we've left the place better than we found it, I don't think it's extra work to do the retries, because we would not be scheduling the jobs that would fail. We can still report the exception and do nothing with it. Yeah, okay, right.

A

B

We're returning it to the worker, and the worker is raising it, if I remember correctly, because it's been since yesterday that I've done that. So...

A

C

Yeah, I would agree that having retries does have infrastructure-level implications, because there's potentially more load on NFS.

B

But the retries have backoff. The thing that would have more infrastructure-level consequences is scheduling the jobs that are going to fail anyway, whether we retry them or not. That's the main thing to avoid, and I think, like, yeah, but...

C

I guess if we're currently making those futile calls at a certain rate, then if we enable retries, in aggregate the rate will now be three or four times that, which might be fine. But yes...

A

Yes. I don't know how much value we should put on my guess, but my guess is this is not the biggest issue, because...

B

A

B

Noticed before from the.

A

B

A

I'm thinking about the NFS perspective. The first thing it does is read the current config.json, and then it compares that to the expected configuration, and if those are the same, the thing returns.

A

So it's at that first step, where it reads the existing config.json, that it errors out, because it tries to read a file that does not exist. So the pressure we're putting on the NFS server there is a bunch of attribute lookups. It's not great, right, because we're trying to open a file that doesn't exist, but it's one file, and out of all the things we can be doing to the NFS server, I think this is not the most dramatic.

B

Well, the bigger impact is the extra job scheduling, I think.

A

Yeah, and the thing is that I do agree with Bob. On the one hand, we're now tripling, right? There was a certain load on the system, and a lot of it fails, and now we're getting retries.

A

So we're tripling that, but then what percentage of projects have Pages associated with them? Because it might be like five percent or something; we should get that number. So on the one hand we're tripling something, but on the other hand we're taking 95 percent of it away. So at the end, we...
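
The trade-off A is reasoning about can be put into numbers with a small Ruby calculation. Everything here is illustrative: the update volume is invented, the five percent share is the guess from the discussion rather than a measured figure, and the model assumes that jobs for projects with Pages succeed on the first attempt and that each retry counts as one extra attempt.

```ruby
# Count job attempts for a batch of project settings updates.
# pages_percent: share of projects that actually have Pages (integer percent).
# retries: retries after the first failed attempt.
# guard: when true, no job is scheduled for projects without Pages.
def job_attempts(updates:, pages_percent:, retries:, guard:)
  with_pages = updates * pages_percent / 100
  without_pages = updates - with_pages
  futile = guard ? 0 : without_pages * (1 + retries)
  with_pages + futile
end

updates = 100_000 # assumed volume, for illustration only

# One attempt per update, no retries, no guard.
puts job_attempts(updates: updates, pages_percent: 5, retries: 0, guard: false)
# Retries enabled while futile jobs are still scheduled.
puts job_attempts(updates: updates, pages_percent: 5, retries: 3, guard: false)
# Retries enabled, but jobs for projects without Pages are skipped.
puts job_attempts(updates: updates, pages_percent: 5, retries: 3, guard: true)
```

Under these assumptions the guard matters far more than the retry count: skipping the futile jobs removes 95 percent of the baseline, while retries only multiply whatever futile work remains.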

C

Oh, so we're also fixing the issue where we're... yeah.

C

The goal is to retry actual problems, not broken... then it sounds perfectly fine to me. Yeah, and feature flags: yes, please.

A

Yeah, no, okay, that's fair enough. I'm still working...

A

I'm still trying to do one of these Sidekiq things myself, and I'm still struggling with it because I don't know how to test it, but I think I'll just bother Bob with that outside of this call, because I don't need to...

B

Yeah, okay, sure.

A

That's just because I'm not an experienced Rails developer, and Bob is a maintainer on the Rails repo, the GitLab Rails repo. Okay, so let's do use feature flags. And one thing I wanted to say that's related to this, and it came up a little bit yesterday, but I want to restate it, is that leaving the place better than we found it is good, but we need to be very careful about rabbit holes, and...

B

If we want to make something a rabbit hole right now, there's plenty, and Vladimir knows. He sees something and then, like, yeah. So I'm pushing those away, and I'm going to link issues in code and stuff, but that's about it.

A

Yeah, because if we never try to leave the place better than we found it, then globally that is a bad attitude, because then your code just keeps getting worse and worse. But if we are too eager to take on risk to improve things, then we derail this epic or we create new problems, and that's also bad. But...

B

The risk is like the project taking a bit longer. The risk is not stuff going down or stuff breaking.

A

Yeah, as long as it's that, and as long as the "bit longer" is manageable, then I am all for it. Every time I see something where I think "this doesn't make sense", and either nobody noticed or nobody felt they had the time to do something about it.

A

It makes me sad, and then I tend to be somebody who tries to do something about it anyway.

A

So I don't think that is a bad habit, which is maybe kind of weird, because now I'm just saying I don't think it's a bad habit because I do it. It might still be a bad habit, but no, I think it's a good habit. We just need to be careful about rabbit holes, and you just said you agree with that, so we're on the same page.

B

Yeah, thanks. I'll check with you sometimes if I think something might become a rabbit hole, but...

A

You know, I think this is a very healthy attitude, and not just because I am, quote unquote, leading this project, but because of the rubber duck effect: talking to somebody else can help you realize that maybe this is not such a good idea after all.

B

A

So I'll do the same with you. I'm...

B

I'm happy that we're both working on this thing now. I'm going to take some time, maybe tomorrow, to fill up the backlog a little bit, because I'm off next week. I don't know what the time frame is for this, like...

A

I finished yesterday.

B

I guess it needs to be finished yesterday.

A

Because it's urgent.

B

A

Yeah, no, it's not super clear to me either.

A

A

No, I don't know, but I don't expect this to be very big, and if you can do some digging in Sentry and find things where you think "oh wow, this is getting very big", it's better to know earlier than later. Yeah.

B

Okay, I'll take something, because the two issues that are basically the same, with the configuration update... well, I hope that the two merge requests that are open for that now, I hope that they are...

A

Yeah, there were two issues about configuration update that looked very similar, and I was doubting whether I should even create two issues or one. Well...

B

If you had merged them into one, then probably it would be merged already, like the first one, because...

A

B

Yeah, but that's no problem, because I could reuse the work from the first issue, so it should... okay.

A

Yeah, okay, sounds good. And again, thanks to both of you for taking an interest and doing work on this. Is there anything else we want to discuss before we end the meeting?

A

That's a no. Okay, I'll...

B

Stop the recording.

A