From YouTube: 2020-04-03 Sidekiq Migration demo
C
So this is the Grafana dashboard that I'd created just as a sort of temporary thing as we were rolling this out. What it shows you are the Kubernetes pods on top and then the VMs at the bottom. The bottom is blank for each of these panels because we don't have any VMs processing project export jobs. But when we were running this in parallel, it was kind of interesting to compare how many jobs are being processed by the pods versus how many jobs are being processed by the VMs, as well as errors and stuff.
C
I would say this is typical of the errors that we've been seeing. I kind of want to dig into these a bit more, I just haven't had time, but these were kind of the typical errors you would see both on the VMs and the pods, and this is kind of typical right now. Sometimes we get a big spike of project exports, and then we have a big backlog, and then the Apdex drops because there's queue latency. This has been happening maybe once per day or so.
C
We get a huge spike of people running project exports, and every time I've looked into it, it's someone who's just running exports across every project in their namespace. We do have rate limiting for this on a single project, but not on the namespace. So it could be something that we might consider to prevent this sort of thing.
C
Well, sorry, I was just going on about people who have these interesting usage patterns. Typically once per day or so we see someone backing up every project in a given namespace, and we don't have any specific rate limiting at the namespace level; it's just at the project level.
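As a rough sketch of what a namespace-level limit could look like, to complement the existing per-project one, assuming a simple counter store; the key format, limit, and window below are hypothetical and not GitLab's actual implementation:

```python
import time

# Hypothetical namespace-level rate limit for project exports, modelled on
# the per-project limit mentioned above. Limit and window are made up.
NAMESPACE_EXPORT_LIMIT = 30   # max exports per namespace per window
WINDOW_SECONDS = 3600         # 1-hour window

class ExportRateLimiter:
    """Counts export requests per namespace in fixed time windows."""

    def __init__(self, store):
        # `store` is any dict-like counter backend (a Redis wrapper in
        # practice); a plain dict is enough for this sketch.
        self.store = store

    def _key(self, namespace_id):
        window = int(time.time() // WINDOW_SECONDS)
        return f"project_export:namespace:{namespace_id}:{window}"

    def allowed(self, namespace_id):
        key = self._key(namespace_id)
        count = self.store.get(key, 0) + 1
        self.store[key] = count
        return count <= NAMESPACE_EXPORT_LIMIT

limiter = ExportRateLimiter(store={})
print(limiter.allowed(namespace_id=42))  # True until the limit is hit
```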
E
Like, 90% of jobs get exported at 06:00 UTC every day, because that's when daily CI jobs run, and lots of daily CI jobs kick off a project export, and those queue up for like an hour. So if you look at the average every day, we have lots of jobs that take a very long time to get scheduled. So I wouldn't be too stressed about how long it takes to schedule those jobs.
E
I mean, we've moved project export to be what's called throttled, so we're not judging it on queueing time, because we are protecting the fleet from too many exports; we'd rather have each export take longer and protect the fleet. So I would say, because it's a throttled job, you can scale it down, and we're not judging it on how long it takes to spin up a new pod.
E
Maybe the thing to do is to just try it, and then we can collect some data as to how long it takes before, and then do it for a day or something, and on that day take a look and see what the average difference in time is, because a lot of those jobs queue for quite a long time. Yeah.
C
I mean, if you look at the average, then you're right, it's probably not going to make any difference. But we have these nodes running anyway; we have one per availability zone. It's not like we're spending money by having four pods in reserve, and once we move to queue groups, we're probably going to have more pods in reserve, because there'll be more jobs running. Okay.
E
We can do exactly the same thing, because all that does is go to a common dashboard library that builds it up according to a selector, a node selector that you give it. What we should do is add the exact same row there, so that when you're looking at the priority detail for, you know, whatever, project exports, you get that little panel that you can open up, and it's common to all of them; you get it in the queue detail too.
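To illustrate the idea of a common library building the same row from a node selector, here is a rough sketch; the function name, metric names, and label matchers are hypothetical, not the actual dashboard library's:

```python
# Hypothetical sketch of a shared helper that builds the same Kubernetes row
# for any dashboard, given only a pod/node selector. Metric names and labels
# are illustrative, not the real library's.
def kubernetes_row(selector):
    """Return panel definitions (title + PromQL query) for a given selector."""
    return [
        {
            "title": "Pod CPU",
            "query": f"sum(rate(container_cpu_usage_seconds_total{{{selector}}}[5m]))",
        },
        {
            "title": "Pod memory",
            "query": f"sum(container_memory_working_set_bytes{{{selector}}})",
        },
        {
            "title": "Pod restarts",
            "query": f"sum(increase(kube_pod_container_status_restarts_total{{{selector}}}[1h]))",
        },
    ]

# The same row can be dropped into the queue-detail or priority-detail
# dashboards just by passing a different selector:
print(kubernetes_row('pod=~"gitlab-sidekiq-export.*"'))
```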
C
I think we have to decide what the future of the Kubernetes dashboards is, right, because we're using the mixin and it's fairly opinionated about how things are laid out. We can either just take what it gives us and extend it, or we could revamp it. And we haven't really decided yet.
C
You know, it depends. CarrierWave still writes to the uploads temp directory, so there are cases where it creates files there, which is NFS mounted. But the temporary file that's created when Sidekiq goes to Gitaly and grabs the whole project, and then writes it to disk before uploading it to object storage, that is on local disk, on both the VMs and obviously in Kubernetes as well.
E
So this is actually a problem that we've had for a while, but it hasn't been that bad. You might have noticed that on some of the dashboards we have the sort of key metrics along the top, and on some of them there are two lines, so you might have two Apdexes. The reason for that is that the metrics may be split across two Prometheus servers, especially for some services.
E
Like, for web and pages, some of the metrics live on the default Prometheus, and then some of the metrics are on Prometheus-app, and we don't have a way of rolling those up to get a single view, and so we actually have these two metrics. But we only have that for... I mean.
E
So basically, what's happening is kind of strange, because it seems to be okay, but it's not. What's happening here is we've got a service called Patroni, and the way that we measure the latency of the Patroni service: Postgres doesn't give us an Apdex score itself, so we use the application and we look at how long simple queries are taking. We say:
E
Well, you know, we want 99% of SQL queries to take less than a second, and that's how we generate the Apdex score, and then we aggregate that up to a service level, and then we come up with this value over here, which is for the service.
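As a small sketch of the Apdex-style calculation described here, the fraction of SQL queries under a one-second threshold rolled up to a service level, with illustrative numbers only:

```python
# Sketch of the Apdex-style score described above: the fraction of queries
# completing under a threshold, then rolled up to a service-level value.
# Threshold and sample durations are illustrative.
THRESHOLD_SECONDS = 1.0

def apdex(durations, threshold=THRESHOLD_SECONDS):
    """Fraction of queries faster than the threshold (0.0 - 1.0)."""
    if not durations:
        return 1.0
    return sum(1 for d in durations if d < threshold) / len(durations)

# Per-component scores, weighted by query volume, rolled up to one
# service-level number (the value shown on the dashboard).
samples = {
    "web": [0.02, 0.15, 0.4, 1.3],
    "sidekiq": [0.8, 2.5, 0.1],
}
total = sum(len(v) for v in samples.values())
service_apdex = sum(apdex(v) * len(v) for v in samples.values()) / total
print(f"service apdex: {service_apdex:.3f}")
```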
But now we've taken some of our metrics, sorry, some of our Rails application, and we are running it in Kubernetes in order to service the project export jobs.
E
Those metrics aren't going to the same Prometheus server; they're going to, I forget its name, but I just call it Prometheus-k8s, it's the gprd k8s one, whatever it is. So it's going to a different server, and we have recording rules which take those and aggregate them to a service level. But effectively we've got a split brain, where Prometheus-app has its view of the world, and it's got a lot of traffic in it.
E
And it's saying: I'm aggregating all the data that I know about, in this case the Rails SQL timings, and this is what I know, and this is how good it is. But then you've also got the Kubernetes Prometheus, and it's aggregating a much, much smaller set of data, because it's basically just the SQL queries that come from project export jobs, and it's using that as its aggregation. And so we have alerts around that that are basically firing because of the split brain.
E
We have two aggregations, one over a small set of data and one over the medium set, and two things can happen. We get the split brain, where the data is not that good because we're only looking at project export instead of the entire Rails application. But the other thing is that there's an example of where we compare things for project exports.
E
We compare jobs getting enqueued to jobs getting executed, but now the job enqueue is happening on one Prometheus server and the job execution is happening on a different Prometheus server, and the two don't know about each other. So the alerts are firing because, in the universe of the alerts, the enqueue has been recorded in the one Prometheus server; it doesn't know that the jobs are getting executed, because that execution is being recorded in a different Prometheus server.
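A toy illustration of that split brain: if the enqueue counter lives in one Prometheus and the execution counter in another, the ratio computed inside either server looks broken even though the combined picture is healthy. The metric names and numbers below are made up.

```python
# Toy model of the split brain described above: each "server" only sees its
# own counters, so a ratio computed within one server is misleading even
# though the combined view is fine. All values are made up.
prometheus_app = {"jobs_enqueued_total": 1000, "jobs_executed_total": 0}
prometheus_k8s = {"jobs_enqueued_total": 0, "jobs_executed_total": 990}

def completion_ratio(server):
    """Executed/enqueued ratio as one Prometheus would compute it locally."""
    enqueued = server["jobs_enqueued_total"]
    return server["jobs_executed_total"] / enqueued if enqueued else 0.0

# From prometheus_app alone, nothing ever completes, so an alert fires:
print(completion_ratio(prometheus_app))   # 0.0 -> looks like a stuck queue

# The true picture needs both servers' data combined:
combined = {k: prometheus_app[k] + prometheus_k8s[k] for k in prometheus_app}
print(completion_ratio(combined))         # 0.99 -> actually healthy
```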
E
You aggregate your metrics to the service level, right? So you're saying: I want 99.9 percent of the requests coming into the service to be successful, or to have a latency of X, and if they don't, then I'm going to raise an alert. It's symptom based: the symptom is that the service is not functioning well. And that is all based on aggregating your metrics up to a service level, rather than doing it at a single metric value.