From YouTube: 2020-09-04 GitLab.com k8s migration EMEA
A: We don't really have much to demo, but I guess we wanted to see each other. So let's chat.
B: Maybe we can just do highlights and let…
B: So let's start with that. The only blocker we have right now is the cross-AZ traffic. Other than that, there was a new blocker that we just discovered this morning, which is… I'll put that on the issue; it's just another charts issue.
A: I probably, I…
D: My opinion, but if you look at the… like, the Rails ones are obviously critical. The Sentry ones, I was looking at them the other day, and there was a lot of random stuff in there, you know, and none of it particularly helpful.
B: Okay, well, I have an MR for the fix. We just need to get it merged, so maybe it's like a minor blocker, or just something we noted, yeah, randomly. Let's…
A: Let's, let's lower the severity of this, sure, to a 4, and the priority.
B: You already… so I guess, other than those, well, one and a quarter of an issue, other than those issues, we just have the performance problems that we saw yesterday when we started to take traffic, like… before the discussion, your…
B: Yeah, it sounds like that. We're gonna start taking a percentage of traffic to canary, to canary Git, before we do the next migration.
A: My only concern with that was: are we exposing our users to more outages? That's why I was asking how many of these incidents we had with Git HTTPS in the recent past, and I guess it's really hard to find that, because it's not really easy to track those things at the moment.
D: That one… was there one today? No, no, there was one, sorry, just trying to find it. Cameron basically pinged on that issue, scalability#470; Cameron placed it right at the bottom of that. Here, I'll place it in, give me a second: it's production, production#2461.
D: No, no, that was just… sorry, I thought you were asking how many Gitaly, sorry, Git canary issues there were in general, or are you talking about this one specifically?
A: For this one: is this caused by the work that Jarv was doing, or not?
D: No, it wasn't. It's just that the problem is… there are several things, right. The first thing is that the majority of the traffic that goes to canary at the moment is effectively going to Praefect, and so anything that happens to Praefect happens to that. And then the Gitaly node behind Praefect is so overloaded, just because of the pathological nature of the www-gitlab-com repo.
D: It also gets really slow, and so, you know, basically we've got that silenced on the alerts now pretty much non-stop, because it doesn't adhere to any of our alerts. You know, a lot of the time we just kind of deal with it, but it fires a lot, right.
D: Yes, it's difficult to… yeah, but if we didn't have it all pointing at the… you know, if we didn't have those particularly bad repos going there, it would be a fairer comparison, because obviously what we're doing at the moment is taking the worst repo and sending it to canary, which is good in a way. But it's also really not great if you're trying to compare the Kubernetes…
D: Yeah, but it's still going to the same back end, so it's still gonna… you know, the problem is, it's going through Praefect and then it's hitting a node that's very, very overloaded and trying to do hundreds of clones constantly. So if you compare it to the main stage, because it's got that directed traffic, it's going to be worse, like a lot worse, I mean.
A: I think it's a bit unfair to also say, well, let's take out www-gitlab-com, right; like, that's regular request routing from…
D: And then it's much fairer. At the moment we're kind of saying we're putting the hardest repository to clone onto canary, and then every time something goes wrong, people are gonna… you know what it's been like with the registry, right? It's like, oh, it's Kubernetes, right, and actually it's often not. And this alerts all the time as it is, and so it's going to be, like, oh, quickly turn off Kubernetes.
D: I think, whereas if the traffic loads are roughly the same, you know, to some degree, then you can kind of compare like with like, and the latencies will be more similar, and you can compare latencies. At the moment it's like, well, you know, the latencies on this are like five times as much, but actually every repo that's going there is really huge, so, yeah.
D: So the other thing that's nice about it is that, once we make it randomized, we can take the silences off the alerts. Because that's the other thing that I've been feeling quite uncomfortable about: we're rolling this out, but we've actually silenced alerts for the stage that we're rolling it out on, which is not cool. And so with this we can, theoretically, you know, first randomize the traffic, then take the silences off the alerts, and then roll it out without any silences on them.
A: Oh, Jarv might have connection issues.
D: Jarv, Marin was asking if you wanted to do it where we randomize the traffic first and then do the rollouts, or if you wanted… if you had a different plan.
B: Oh yeah, that's the plan, of course. Like, we're gonna do… I'm almost done with taking a percentage of traffic to VMs. We even increased the capacity on the VMs, so that now we have three, because I'm a little bit worried, because we do the rolling upgrades. So I only want to take out one node at a time: instead of going from, like, you know, 100 to 50, we go from 100 to 66 percent, so yeah. So that's going to be the first thing we do, so I don't anticipate us to do…
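A minimal sketch of the capacity arithmetic behind keeping three VMs, assuming one node is drained at a time during a rolling upgrade:

```python
def remaining_capacity(total_nodes: int, drained: int = 1) -> float:
    """Fraction of VM capacity still serving traffic while `drained`
    nodes are out for a rolling upgrade."""
    return (total_nodes - drained) / total_nodes

# With two VMs, draining one leaves 50% of capacity;
# with three VMs, the same drain leaves roughly 66%.
print(remaining_capacity(2))  # 0.5
print(remaining_capacity(3))  # 0.666...
```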
A: …enough, though, Jarv, then, because what we want to do is then have a bit of a longer sitting time, soaking time, so that we can actually see whether there are any alerts going to canary depending on the time of the day, right, so that we have a good baseline for comparison.
A: If you do it today, if you roll this out today, we can't go on Monday and change that around again, right. We have to pause this a bit, the migration.
B: I'm not actually working so good, so…
B: Fine, and when we do this migration, we're also splitting Kubernetes and VMs first, so we'll do 50 on VMs, 50 on Kubernetes, and then we'll go 100. So, but yeah, it sounds like Tuesday is our time frame.
D: So, are we going to turn those silences off today?
A: Okay, let's, let's make sure that we inform them what we are trying to do. I know there is going to be some pushback on doing this on Friday, but, last I checked, Friday is a working day in general. So as long as there are tools to revert back, meaning drain the canary, right, we should be good, right? Am I mistaken here?
A: Okay, let's give that tool into their hands and deal with it. Yeah, my connection's…
B: …awful, my connection is awful, sorry. Yeah, well, we'll be able to drain canary, or set it to maintenance, if we have to, no problem.
A: Cool, and let's leave it like that, and then come Monday I would like to be able to enable canary if things are okay, if something happened over the weekend… Yeah, I work Monday, so I'm happy to enable it on Monday. Okay, but yes, sure, like, one day when we are feeling confident. Okay, let's do it that way. Jarv, this means, basically, I don't think we'll be able to roll this out on Kubernetes before Wednesday, and that is fine, because we want to have a more stable base.
C: It was later determined that we weren't necessarily blocked, so we were able to push the first batch of the queues over to Kubernetes. The only problem we ran into was that there was a network policy that was preventing the container registry from being accessed by Sidekiq. This was resolved quickly because it created an incident.
C: I think, Jarv, you were going to spin up an issue and look into that further later. I did not do any further investigation. I think it's just the way our Helm chart works: it's looking at the internal endpoint for the service to reach the container registry, instead of going out and then circling back through HAProxy into the same cluster that made the request.
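A minimal sketch of how one could inspect which NetworkPolicies restrict Sidekiq egress in a cluster, using the Kubernetes Python client; the namespace and the `app: sidekiq` label are assumptions for illustration, not the values from the incident:

```python
from kubernetes import client, config

# Assumes local kubeconfig access; namespace and pod label are illustrative guesses.
config.load_kube_config()
net = client.NetworkingV1Api()

for policy in net.list_namespaced_network_policy(namespace="gitlab").items:
    spec = policy.spec
    selector = (spec.pod_selector.match_labels or {}) if spec.pod_selector else {}
    # Look for policies that select Sidekiq pods and constrain their egress.
    if "Egress" in (spec.policy_types or []) and selector.get("app") == "sidekiq":
        print(policy.metadata.name, "egress rules:", spec.egress)
```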
C: So now I'm working on batch two. I have a select number of queues; there's a hundred and change in this workload. How many was it, 107 queues I selected yesterday? I did a quick check, and we are doing a lot of NFS reads and opens, accessing data inside of NFS, which is not great; it's not what we want. So, the rest of yesterday and this morning, I was trying to get the tool that Andrew wrote that hooks into eBPF and the kernel to grab some NFS metric data.
C: So now we should be able to get the classes that Sidekiq is executing that are requesting NFS access of some form. So, for the rest of today, along with other work, I'm going to drum up the appropriate queries to make sure I get the right class names, figure out which queues they're part of, and then remove those that touch NFS from the batch.
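A minimal sketch of that filtering step, assuming we already have the set of worker classes the eBPF tool saw touching NFS and a queue-to-worker mapping; both inputs below are made-up examples, not real data:

```python
# Worker classes observed doing NFS reads/opens (hypothetical eBPF tool output).
nfs_workers = {"RepositoryImportWorker", "ProjectExportWorker"}

# Hypothetical mapping of Sidekiq queue name to the worker classes it serves.
queue_workers = {
    "repository_import": {"RepositoryImportWorker"},
    "project_export": {"ProjectExportWorker"},
    "mailers": {"ActionMailer::MailDeliveryJob"},
}

# Keep only queues whose workers were never seen touching NFS;
# those are the ones safe to leave in batch two.
batch_two = [queue for queue, workers in queue_workers.items()
             if not workers & nfs_workers]
print(batch_two)  # ['mailers']
```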
A: Better to just reduce it to only the ones that we are certain about, so that we can… because we want to group the maybes and the no's together, or, you know, because we cannot expect to have this. But up to you; it doesn't really matter, as long as we get to the bottom of them. Yep. Can you now show me how I look in Grafana to see any differences between what is in Kubernetes, when it comes to the queues, and what is not in this chart, so not in Kubernetes?
C: Well, logs are going to be the best place. That's how we discovered the problem with the container registry when batch one was completed.
C: But you are correct: we don't have any visibility inside of Grafana at this moment to determine where a queue lives.
A: If we did that… because I don't want to demonize what is in Kubernetes versus what is in VMs, but it would help investigations quite a lot if we at least knew.
C: We do at least have a method where we could search for the kubernetes field, that something exists there, and then we'll have all the workers and logs associated with Kubernetes that are running inside of the Kubernetes workload. And then, you know, there's the… I don't know why catchall is showing… oh yeah, catchall, because that's the name of it. Sorry, I'm mixing up catchall and the other one in my mind, but this will at least help us determine from our logs where things are running.
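A minimal sketch of that kind of search with the Elasticsearch Python client, splitting Sidekiq log entries on whether a Kubernetes field exists; the endpoint, index pattern, and field names are assumptions for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://logs.example.com:9200")  # placeholder endpoint

# Entries carrying Kubernetes metadata came from pods; a must_not on the same
# field would give the VM side. Index and field names are guesses.
in_kubernetes = es.search(
    index="pubsub-sidekiq-inf-gprd-*",
    query={"exists": {"field": "json.kubernetes.pod.name"}},
    size=0,
    aggs={"workers": {"terms": {"field": "json.class.keyword", "size": 50}}},
)
print(in_kubernetes["aggregations"]["workers"]["buckets"])
```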
C: I would agree that we probably should add some more visibility inside of Grafana, so that we could easily determine where the failures are, whether they're inside of Kubernetes or inside of our VM infrastructure. I'll create an issue to address that. Thanks. And, Andrew, maybe I could work with you to figure out…
D: Yeah, this actually ties in. There's this thing that we really need, and it's in the observability backlog; I'll just highlight it. At the moment I would say we just use the label fqdn, but obviously now, with all of the stuff moving over, fqdn is null for all of the Kubernetes stuff. And so we need… we can't use instance, because it's just an IP and no one knows what the hell the IP is.
D: But we need a name that works in Kubernetes and in, like, VM land, and so what we can do is just create one that's like a concatenation of them, and so some will have the first part and some will have the second. But really what we need to do is fix that issue and rename fqdn or something; there's a whole epic on it now, I'll find it.
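A minimal sketch of the concatenation idea, just to make the fallback concrete; the label names fqdn and pod are assumptions about what the series carry, and in practice this would more likely live in a Prometheus recording rule than in application code:

```python
def node_name(labels: dict) -> str:
    """One identifier that works for both VM and Kubernetes series:
    VM series carry fqdn, Kubernetes series carry a pod name, so
    concatenating the two (one is always empty) yields a usable name."""
    return f"{labels.get('fqdn', '')}{labels.get('pod', '')}"

print(node_name({"fqdn": "web-01.example.com", "pod": ""}))       # VM series
print(node_name({"fqdn": "", "pod": "sidekiq-catchall-abc12"}))   # Kubernetes series
```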
D: Yeah, the most important one is that one. Like, the rest of them… you know, like, Ben really wants to rename environment to env, which I'd like as well, because of the shortened things, but the one that's important is the naming of the…
A: Great, but I'm not willing to take on another epic, just no! That's an observability…
D: Anyway, I just wanted to say I was looking at those metrics that are coming back, and, like, this is all on me because it's my script, but when I look at the results I'm not super filled with optimism that it's working as I hoped, because I just put in the query that I was using, and, you know, maybe with time it'll get better. But what I'm seeing there is lots of JobWrapper, NoteWorker… I mean, do you think that does NFS?
C: Also, keep in mind that I didn't have Sidekiq configured correctly, because I had concurrency set to, like, 15 or something.
D: 9:30, and what time is it now? How long ago was that?
D: Diff files is definitely going to be… yeah, yeah. The problem is that if there are very short jobs that run, and the eBPF thing goes from the kernel to userland kind of afterwards, you know, the logs take a long time. You know, because I'm kind of relying on synchronicity between the logs getting written, me reading the logs, and then me getting the information from the kernel out to userland, and so it might be that everything's just off kilter and I'm reading…
A: Yeah, I think we don't have anything else to discuss, I think. Oh, actually, let me ask Jarv, Skarbek: anything else that you would like to discuss before we wrap up?
A: Perfect, awesome. Well, thanks so much, and see you next week, Tuesday, I think, some of you.