From YouTube: 2020-09-24 GitLab.com k8s migration EMEA
B
How is it double booked?
C
Hi Martin, yeah, scarbeck, Jeff. These days the computer is actually the only social thing you need. Apparently it's so bad, yeah, screw 2020, seriously.
B
All right, so, skype, we went through the blockers this morning; there's nothing new. Basically, we just cleaned up the issues that are no longer relevant, no longer blockers, or ones that we've worked around, and then we reduced the list to the things that we're tracking, which basically is build traces (or build logs), the service mapping, and then the cross-AZ network traffic.
B
Yeah, yeah, so basically everything is fine; we'll just continue with the work to get the k8s workloads compatible with multi-cluster, which is going to be a challenge, but Graeme and I talked through it a little bit. He seems like he's okay with the changes so far.
B
So what we want to do is just kind of assess the impact of interrupted connections whenever we cycle pods, and we were just discussing how, you know, we're going to be cycling pods a lot, and it's quite a bit bigger an impact than it is on virtual machines, because there we have the safety of draining HAProxy, plus installing the Omnibus, before we interrupt those connections by sending the SIGINT to Puma. So right now those connections are being interrupted and we only have a 10-second window.
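As a minimal sketch of the trade-off being described, assuming a worker process that only gets a fixed grace window after SIGTERM to drain in-flight streams before exiting; the names and the 10-second constant are illustrative, not taken from the GitLab charts or from Puma/Workhorse:

```python
import signal
import sys
import threading
import time

# Illustrative only: DRAIN_DEADLINE_SECONDS stands in for the ~10 second
# window mentioned above; in_flight stands in for long-lived streams such
# as git-over-HTTPS clones.
DRAIN_DEADLINE_SECONDS = 10
in_flight = 0
lock = threading.Lock()
shutting_down = threading.Event()

def handle_sigterm(signum, frame):
    # Stop accepting new work, then wait for in-flight streams to finish,
    # but only up to the drain deadline.
    shutting_down.set()
    deadline = time.monotonic() + DRAIN_DEADLINE_SECONDS
    while time.monotonic() < deadline:
        with lock:
            if in_flight == 0:
                break
        time.sleep(0.5)
    # Anything still open at this point is interrupted, which is the impact
    # being assessed for long clones.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```

On the VM fleet, by contrast, the HAProxy drain plus the Omnibus install effectively gives a much longer window, which is why the comparison later in the call matters.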
B
If you look at all of the projects, and how long we stream data from Gitaly for git over HTTPS specifically, the 99th percentile hovers around six seconds. So actually, I think 10 seconds is probably fine for the majority of projects. If you look at just gitlab, though, the story changes significantly: if you're looking at the gitlab namespace, you can see that the 99th percentile jumps up to like 40 seconds.
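As an illustration of where a percentile like that comes from, a sketch like the one below can pull a 99th-percentile stream duration out of a Prometheus histogram over the HTTP API; the server URL, metric name, and labels here are assumptions for the example, not the actual series used on GitLab.com:

```python
import requests

# A sketch only: PROM_URL, the metric name, and the labels below are
# placeholders, not the real series used on GitLab.com.
PROM_URL = "http://prometheus.example.internal:9090"
QUERY = (
    "histogram_quantile(0.99, "
    "sum by (le) (rate(gitaly_stream_duration_seconds_bucket"
    '{grpc_method="PostUploadPack"}[5m])))'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    # Each result carries a [timestamp, "<value>"] pair; the value is the
    # p99 stream duration in seconds for the matching label set.
    print(result["metric"], result["value"][1])
```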
B
If you look at www.gitlab.com, it's terrible, like, we have... and of course, you know, anyone who's cloned www-gitlab-com knows. So I think what was happening was that we were kind of hurting ourselves a bit with Kubernetes on canary, which could explain why we're seeing a small number of errors.
A
So during cluster upgrades, this will significantly slow those operations down.
B
One item on this issue is for me to get the exact window of time we have on VMs, so that'll be a good comparison point; then we at least know we're no worse than what we have on VMs. I think VMs is around two to three minutes. It might be as long as five minutes, but I have to check.
C
The reason why I was saying five minutes is mostly because, you remember, when we started we were playing with the number of pods for the first Sidekiq shards that we were migrating, and we were setting ourselves too low, which created some, well, unnecessary churn. We can overshoot this, see the impact, and then lower it; that's easier than just creating a lot of noise about whether we are causing a problem or not, or whatever is happening there. You know.
B
We also found, at least with HAProxy, remember we added this hard-stop-after option to kill HAProxy processes, because inevitably there's always a connection that's just hanging on forever with git SSH, not git HTTPS. So I think even if we track the number of connections, it's possible it wouldn't... that's true, it wouldn't matter.
B
Also, I think this is specifically that the blackout period is specifically for Puma, and these HTTPS connections aren't going through Puma. Puma is interacting with Workhorse to do the authorization, and then Workhorse is going gRPC directly to Gitaly. So our liveness check is going through to Puma, but really Puma isn't the one that's keeping these connections open forever; it's Workhorse that does that. Does that make sense?
B
Yeah, so I don't think the blackout period would know if there's an active connection for Workhorse. I think to do that we would need something like a Workhorse liveness check or something, right? That would, yeah.
B
Well, we'll see.
A
Slightly related to this: the merge request that you had, I tried to take some feedback that we got from Jason and add a test and such, but I pulled that nearest.
A
Today I opened up a merge request against your merge request, but the test does not work yet, and I can't understand why.
B
Yeah, I mean, currently we're working off the gitlab conference for now, and then we'll get this other thing reviewed as quick as possible.
A
Okay. Next, do you all want to see what I'm looking for during the evaluations of Sidekiq queues?
A
Sure, it's nothing, you said, at least. Okay, I didn't really prepare anything, so I'm just gonna kind of walk through the things I look at, but I've got a list of everything that I do look at inside of this very long issue at this point, where I'm migrating queues from catch-all to catch-nfs; under the evaluation section is basically everything that I'll be clicking on here. So one of the first things that we look at is a query that Andrew helped me put together.
A
We have a tool that he created that's installed and running on all the Sidekiq catch-all and catch-nfs fleet servers. So this is what things look like when we're actively using NFS. In this case, this was the mailers queue last week, when we identified that it was using NFS unnecessarily; that has since been fixed. So these days there's nothing running on catch-nfs, but what we want to see is basically nothing; we want it to say zero.
A
All we're doing is creating a comparison of whether a job was running versus whether it actually, actively used NFS, and this is so that we've captured enough data to be confident about whether we're actively using NFS, because if we see a little spike it doesn't necessarily mean we used it; it just means we probably tried to read some data or something. But we want to try to hit all possible scenarios.
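A rough sketch of that comparison, assuming two per-worker tallies, one of executed jobs (from Sidekiq metrics) and one of observed NFS file accesses (from the tracing tool on the catch-nfs fleet); the worker names and counts are invented for illustration:

```python
# Invented numbers for illustration: counts of executed jobs per worker
# (from Sidekiq metrics) versus observed NFS file accesses per worker
# (from the tracing tool on the catch-nfs fleet).
jobs_executed = {"MailersWorker": 12000, "RepositoryImportWorker": 40}
nfs_accesses = {"MailersWorker": 950, "RepositoryImportWorker": 0}

for worker, executed in sorted(jobs_executed.items()):
    accesses = nfs_accesses.get(worker, 0)
    if executed == 0:
        verdict = "not enough executions to draw a conclusion"
    elif accesses == 0:
        verdict = "ran without touching NFS"
    else:
        verdict = f"actively used NFS ({accesses} accesses over {executed} jobs)"
    print(f"{worker}: {verdict}")
```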
A
So this is what catch-nfs looks like today. There's always something that the operating system is doing, so these I usually ignore, but for an idea of where we are today for our catch-all fleet, we still have quite a few pieces that are being accessed, so, soon.
A
So there's a few in here that kind of look a little iffy. Here's one as an example: we're trying to write to, or we're trying to read from, a directory called srv gitlab shared town.
A
So this is one that I know I could ignore, because if we look at all the events it only says 139, but I know there's more than that, because I've seen it in other areas. We used to see all the catch-all virtual machines in here, and this has been happening since April, so stuff like this I know I can ignore, because we haven't been using that NFS mount since April of this year. But I did open up an issue, because we definitely have something that's not working correctly for this particular worker; in this case, it's the repository import worker.
A
So it's still checking for something that we used to have mounted, but we no longer have this mounted, so this will always fail, because, you know, it's not going to exist for these servers nor for the pods. So I opened up an issue to start that discussion, and there's a few people chatting in to figure out what needs to be done with that.
C
Can you link that issue in the demo doc, just so I have it as a reference?
A
And then there is another one that caught my attention, if I can find it quickly. So this is another repository import worker. Oh.
A
I'd created an issue for this one... nope, it's not this one. Maybe that's why I didn't show an issue.
A
Maybe it was this one. This one doesn't look familiar, though. Okay, so this one's showing the same behavior, where it's trying to access an entire repository, but I know this was never possible inside of our catch-all fleet, because we don't mount this directory inside of our Sidekiq nodes, and if we actually go down further, we see that the root... the error message is spawning from our Gitaly nodes.
A
So stuff like this I also ignore completely, but outside of that, the majority of all these errors are just the standard errors that we get across all Sidekiq servers. You know, there's always some sort of execution that expired, or some "unable to access some other URL" that's related to someone's job reaching out to do some work, or, you know, failure to import stuff or failure to export stuff.
A
All right, well, let's move on! Is there anything else, Jarv, that you want to talk about on blockers, or should I just watch the recording for that?
B
Yeah, I don't think there are any; there's nothing there that's new information for you, but yeah, feel free to watch the recording.
A
Slash looking for feedback: so batch three was defined, and it's actually inside of Kubernetes at this moment in time. Batch four I have defined as basically anything that was in batch three but I wasn't getting enough data for and decided to push out, so that way I could just get rid of batch three and move those into Kubernetes as quickly as possible.
A
So I'm sorry my font size is not large enough, but here I've selected, whoa, one, two, three, four, five, six, seven, eight, nine, ten, eleven queues slated for batch four. I've already created merge requests for them, waiting on reviews, but the execution rate for all of these is very low, like less than one per second across all of these, and some of these don't even show up; I can't even get them to show up in our charts, which means they probably haven't executed in the last 24 hours.
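A sketch of that 24-hour check, assuming the queue execution counters live in Prometheus; the server URL, the counter name, and the queue names are placeholders, not the real ones:

```python
import requests

# Placeholders throughout: the Prometheus URL, the counter name, and the
# queue names are stand-ins for the check described above.
PROM_URL = "http://prometheus.example.internal:9090"
BATCH_FOUR_QUEUES = ["example_queue_one", "example_queue_two"]

query = (
    "sum by (queue) ("
    "increase(sidekiq_jobs_completion_seconds_count{queue=~\""
    + "|".join(BATCH_FOUR_QUEUES)
    + "\"}[24h]))"
)
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
seen = {r["metric"]["queue"]: float(r["value"][1]) for r in resp.json()["data"]["result"]}
for q in BATCH_FOUR_QUEUES:
    # A queue missing from the result (or at zero) has not executed in the
    # last 24 hours, matching the "doesn't even show up in the charts" case.
    print(q, seen.get(q, 0.0), "executions in the last 24h")
```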
A
My goal here for batch four is to move them off of the catch-all fleet onto the catch-nfs fleet, so that we can use our tooling to capture as much data as possible. With the execution rate being so low, I don't suspect that this will have any sort of negative impact, because the number of workers on catch-nfs is a lot less than catch-all.
A
So that's my goal with batch four, and all of these were just the ones that were pushed out at our last demo meeting. These cron jobs were slated for the following batch because we didn't have any information about those; they're just marked as "yes". I noted somewhere in some issue that I did do the necessary research, and I referred back to the research we've done: cron jobs only add something to the queue, they don't actually perform the actual work; there's a different job that will pick up the work and do that.
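A minimal sketch of that split, assuming a cron-style job that only enqueues and a separate worker that performs the actual (potentially NFS-touching) work; the class and queue names are made up for illustration:

```python
import queue

# Made-up class and variable names illustrating the cron/worker split:
# the cron-style job only enqueues, a separate worker does the real work.
work_queue = queue.Queue()

class ScheduleCleanupCronJob:
    def perform(self):
        # Only schedules work: one queue entry per project, nothing touches
        # the filesystem (or NFS) here.
        for project_id in (1, 2, 3):
            work_queue.put(project_id)

class CleanupWorker:
    def perform(self, project_id):
        # The actual work, and any file or shared-storage access, happens
        # in this separate job.
        print(f"cleaning up project {project_id}")

ScheduleCleanupCronJob().perform()
worker = CleanupWorker()
while not work_queue.empty():
    worker.perform(work_queue.get())
```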
C
Shouldn't deal with NFS at all.
A
That's the information that I was provided, so because of that, I'm performing that investigation to some extent by moving it to catch-all or catch-nfs (I hate these names), moving it to catch-nfs and seeing if it accesses NFS. Okay, so, you know, we've done batches one, two and three already; batch four is being defined.
A
My next goal, because I suspect batch four will take a while (I'll probably want to leave it operating on catch-nfs for maybe a week, just to make sure the jobs were executed enough that we have high confidence that we are or are not using our shared storage), but after that, what's blocked is everything I've already created issues for. I figured out what we would do; Amy got me the epics that are associated with some of these.
A
Well, yes, moving all these over just to figure out what data is being written or read, and from where, and adding that to these issues that I've created, to help bolster the need to have someone investigate these as much as possible. Like, at this moment in time, I don't know what to do with these issues except for adding information through discovery.
C
Yeah, and I think that on its own wouldn't be helpful to developers as well; maybe they know, but you would also need to spend a lot of time to give them the context of what this actually means, because, looking at it, apart from Pages and CI, the other ones that are left, all of them are with 100% new teams.
C
When I say new, I mean like new, like six to eight, nine months new folks, so they were probably just utilizing the same concepts that already existed. So unless we can provide them with, like, "hey, we know for sure this is writing things down and you are using something," you need to kind of go between these two things and find out what's happening, right? Yeah.
C
Like, if you start collecting this data immediately right after batch three, you should see some changes coming up at the same time, meaning you'll be collecting batch four data, and CI job traces are in their final stages; right, like there is a merge request open right now that's adding some functionality.
C
Four, okay, and they might provide you some noise, so it's gonna make it more difficult for you to figure out which one of those is actually creating noise. But at the same time, you said these other ones are so low in traffic that, you know, you're just gonna be waiting anyway, so if they start creating too much noise, you definitely then know you have some data, right? Like, you can remove it, and we can then parallelize that work with the stage groups, and then you leave batch three with, well, barely any traffic anyway, so yeah, you kind of parallelize it anyway; but you can actually roll that out pretty much today.
C
Scans and the requirements management thingy at the top; that's the top.
A
I'm perfectly happy with doing that, because I know batch 4 is going to last a little while longer than necessary. So if we do find something writing to NFS, I could gather that data as quickly as possible, and I could pull that off of this batch and push it into the future, and at that point we'd already have infradev working on that issue with the information I provided.
C
As long as you're actually not only sitting and waiting for Sidekiq and looking at Sidekiq numbers, like how is it reaching NFS; I mean, I'm not Amy here, but I would really like to see Jarv get some help on the cross-AZ multi-cluster blah blah thing. Sorry, Jeff.
C
It needs a better name. And the reason for that is not only because Jarv is a release manager now; it's more that if we can parallelize some of this work, it's better for all of us, because I think the next thing that I want to see is web and API. I don't want to wait for Sidekiq to finish for us to start the next step, given that CI jobs are in the final stages of them starting to roll it out.
C
All right! Well, thanks for sharing. I feel sufficiently confident that we are doing great, so thanks.