From YouTube: Scalability Team Demo 2022-10-13
A
So yeah, what we're talking about in the longer thread there is moving some stuff around. The only thing that's still left from the pipeline on Ops is loading data from Thanos. In the discussion we ended up keeping the entire page generation there, because that also pushes to the Pushgateway, which triggers the issues in the capacity planning project on gitlab.com. So, in the end, we'd have a scheduled pipeline that starts on gitlab.com and triggers the timeline pages generation on Ops. That job pushes to the Pushgateway, and then it calls back into a new pipeline on gitlab.com with the information to publish the pages on gitlab.com and to update the images on the capacity planning issues that have been created from its run.
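For reference, the cross-instance hand-off described above can be done with GitLab's pipeline trigger API. A minimal sketch in Python, where the Ops instance URL, project ID, trigger token and variable name are hypothetical placeholders rather than the team's actual configuration:

```python
# Sketch: trigger a downstream pipeline on another GitLab instance and pass it
# context about the current run. URL, project ID and variable names are
# illustrative placeholders, not the real setup.
import os
import requests

OPS_API = "https://ops.example.com/api/v4"       # assumed Ops instance URL
PROJECT_ID = 1234                                 # hypothetical downstream project
TRIGGER_TOKEN = os.environ["OPS_TRIGGER_TOKEN"]   # pipeline trigger token

resp = requests.post(
    f"{OPS_API}/projects/{PROJECT_ID}/trigger/pipeline",
    data={
        "token": TRIGGER_TOKEN,
        "ref": "main",
        # variables are forwarded to the downstream pipeline
        "variables[UPSTREAM_PIPELINE_ID]": os.environ.get("CI_PIPELINE_ID", ""),
    },
    timeout=30,
)
resp.raise_for_status()
print("Triggered downstream pipeline:", resp.json()["web_url"])
```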
B
Yeah, so it's... it's kind of... yeah, it's confusing how it works. But how big of a problem is it that it's confusing how it works? I guess that's our question.
A
We're going to make it more robust, because now we've got a scheduled pipeline on Ops and a scheduled pipeline on gitlab.com that counts on... wait, what have we got. We've got a scheduled pipeline on Ops, and that triggers a pipeline on gitlab.com that used to require the last pipeline on the main branch to be... well, not anymore; I think I fixed that, but it finds the last artifact. So it doesn't have pipeline information from the pipeline on Ops; it just finds the latest artifact for the generate-pages job, gets that, and publishes it. At the same time we also have a job on gitlab.com, in the scalability repository, that counts on the previous job on gitlab.com in the timeline project having finished and having published the images. That's right, isn't it, Sean?
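The "find the latest artifact" step maps to GitLab's job artifacts API, which serves the artifacts from the latest successful pipeline on a ref for a named job. A rough sketch; the project path and job name below are placeholders:

```python
# Sketch: fetch the artifacts archive from the latest successful pipeline on a
# ref for a named job, instead of relying on upstream pipeline information.
# Project path and job name are illustrative only.
import os
import requests
from urllib.parse import quote

API = "https://gitlab.com/api/v4"
project = quote("group/generate-pages-project", safe="")   # hypothetical project path
job_name = "generate-pages"                                 # hypothetical job name

resp = requests.get(
    f"{API}/projects/{project}/jobs/artifacts/main/download",
    params={"job": job_name},
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    timeout=60,
)
resp.raise_for_status()
with open("artifacts.zip", "wb") as f:
    f.write(resp.content)
print("Downloaded", len(resp.content), "bytes of artifacts")
```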
B
Yeah, so, well, I mean, that doesn't really matter too much, because it just says, you know, if I can find images, I will put them in the issues; that job is just going through the issues. So let me take a step back. We submit an MR to gitlab.com that triggers a pipeline on gitlab.com, but then we'll also trigger a pipeline on Ops, because...
B
Yeah, so that's because jobs that require Thanos data can only run on Ops, so we run that on Ops. Then, when we merge the MR, we run a pipeline on gitlab.com that doesn't really do a great deal, and we run a pipeline on Ops that will again use Thanos.
B
Then on Ops we have the scheduled pipelines that populate data and build the pages, and when the pages build is finished, that triggers a pipeline on gitlab.com but also pushes metrics to the Pushgateway. When the metrics are pushed to the Pushgateway, that can trigger alerts through Alertmanager. Those alerts then go into the GitLab issue management, sorry, alert management feature, which then auto-creates an incident, which goes on the board in the capacity planning project.
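The "pushes metrics to the Pushgateway" step looks roughly like this with the prometheus_client library. The gateway address, metric name and labels below are made up for illustration, not the team's actual series:

```python
# Sketch: push a forecast metric to a Prometheus Pushgateway so that alerting
# rules on the Prometheus side can pick it up. Gateway address, metric name
# and labels are hypothetical.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
g = Gauge(
    "capacity_forecast_violation_days",          # made-up metric name
    "Days until the forecast crosses the saturation threshold",
    ["service", "component"],
    registry=registry,
)
g.labels(service="patroni-registry", component="pg_btree_bloat").set(42)

# One push per pipeline run; Alertmanager then routes any resulting alerts
# into GitLab's alert management, which auto-creates incidents.
push_to_gateway("pushgateway.example.internal:9091",
                job="capacity-planning", registry=registry)
```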
B
So that's the part where we're dogfooding something, and I think it's valuable to dogfood it; you know, we had some good discussions with the Monitor team around this. But I'm not sure that alerts, just Prometheus alerts like that, are the right way to manage creating these issues. That's the part I'm mainly concerned with, because we did have issues with the alerts going away when we limited the number of resources we reported on. So if something was in the top 50 but wasn't in the top 15...
B
We also get a new alert any time we change the underlying queries that we're using. So recently we changed from an outer average to an outer max for Gitaly. For Gitaly total disk space we were doing an average, not across Gitaly nodes but, as it turned out, across canary and main; and canary's Gitaly disk usage is much, much lower than main's, so the effect of that was to say our disk usage for Gitaly is way below the target.
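To illustrate the avg-versus-max point: with an outer avg, a mostly idle canary node drags the aggregate down, while an outer max reports the busiest node. A sketch against the Prometheus/Thanos HTTP API; the metrics are standard node_exporter series, but the endpoint and the label selectors (type, mountpoint) are assumptions about how this environment is labelled:

```python
# Sketch: compare the outer aggregation (avg vs max) for Gitaly disk
# utilisation. Label selectors and the Thanos URL are assumptions.
import requests

THANOS = "https://thanos.example.internal"   # assumed query endpoint

utilisation = (
    '1 - node_filesystem_avail_bytes{type="gitaly", mountpoint="/var/opt/gitlab"}'
    ' / node_filesystem_size_bytes{type="gitaly", mountpoint="/var/opt/gitlab"}'
)

for outer in ("avg", "max"):
    query = f"{outer}({utilisation})"
    r = requests.get(f"{THANOS}/api/v1/query", params={"query": query}, timeout=30)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"outer {outer}: {value:.1%} disk used")
```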
B
...the alert, because that doesn't get pushed into the gateway; but if we change the things that we're aggregating on in the alert itself, that will create a new alert. And what was the other thing that would create a new alert? Oh, if we changed... no, we fixed that one; before, we had one where it would come up each time there was a new page. But basically what I'm saying is that we can't fix these retroactively; every time you find something, that will create a new set of alerts.
C
So just keep Prometheus alerting entirely and just have... okay.
B
So if we don't push to Prometheus, we can do everything except populating the cache on Ops. Is that right, Bob? If we don't... sorry.
B
As long as we keep remembering how it works. What was the other thing I was going to say there, though? Oh yeah, the duplicate issues, I think, are an issue for everybody who works on this, because it's really confusing, it does take a reasonable amount of work, and it also kind of spams the channel.
B
You know, you get a new issue created... Oh, this was one other thing I meant to mention about the Alertmanager.
B
Yesterday an issue got created for... well, yesterday some issues got created for different components. So, if you go to capacity planning, there was an issue created for patroni-registry pg_btree bloat. I don't know why that triggered yesterday; it has a start time of the 14th of September, but we only got the alert through yesterday.
B
So this issue was created... oh, in fact, this was created 13 minutes ago, but the alert start time was a month ago. So what happened?
B
So let's go to the alert activity feed; so we logged the alert 14 minutes ago, yeah. So apparently we got this alert from Prometheus 14 minutes ago, and then, if I do...
B
Yeah, okay, threshold: so there's confidence type 80 and confidence type mean, and threshold hard and threshold 100. So if we go back here, did one of these series show up on the 14th of September? Not as far as I can see; they were already there, yeah. So I've got no idea why this was created at this point, and I'm pretty sure it's a duplicate. No, I'm sure; if we go back... yeah. So what was it? pg_btree bloat...
A
Duplicate there.
B
There we go; so yeah, it's a duplicate, and we've got no idea why it was just created. Debugging that... okay, it's actually not a GitLab application bug; is it a problem with alerting? Yeah, there are too many things in the middle, is basically what I'm saying, so trying to figure out even which part was the problem is quite tricky. In the past we have been able to debug these, but then we just come across more confusing ones, because we've solved the ones that we know how to debug.
E
That's why I think there's a lot of value in making it simpler. I think we really want to bring the rest of the team in to help us with this, but we're going to end up explaining to everyone how to debug these problems repeatedly, for each person that joins; and then, if they're not part of the rotation for a quarter, you have to sort of re-explain it, or write it down again, how to debug these things.
A
Okay, so we need to add issue creation; that's one step that we've excluded for now and would add to the plan: create issues from...
B
Well, yeah, because I think I mentioned in a comment yesterday, the other thing we could do is have the issue creator look up existing issues, right, so it doesn't create duplicates. So if it finds an existing issue, it can just update it: if the forecast violation date has come forward... no, you can't use forward and backwards for time, can you... if the forecast violation date is closer than before, we can bring the due date in, or post a comment or something saying, hey.
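A rough sketch of what such an issue creator could do with the GitLab issues API: look for an existing open issue for the component, and either create one or pull the due date in and leave a comment. The project ID, label and title convention are placeholders:

```python
# Sketch: avoid duplicate capacity planning issues by looking up an existing
# open issue first, then updating its due date / commenting instead of
# creating a new one. Project ID, title format and label are illustrative.
import os
import requests

API = "https://gitlab.com/api/v4"
PROJECT_ID = 5678                                  # hypothetical project
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

def upsert_issue(component: str, forecast_violation_date: str) -> None:
    # Look for an open issue that already tracks this component.
    existing = requests.get(
        f"{API}/projects/{PROJECT_ID}/issues",
        headers=HEADERS,
        params={"state": "opened", "search": component, "labels": "capacity-planning"},
        timeout=30,
    ).json()

    if not existing:
        requests.post(
            f"{API}/projects/{PROJECT_ID}/issues",
            headers=HEADERS,
            data={
                "title": f"Capacity warning: {component}",
                "due_date": forecast_violation_date,
                "labels": "capacity-planning",
            },
            timeout=30,
        ).raise_for_status()
        return

    issue = existing[0]
    if forecast_violation_date < (issue.get("due_date") or "9999-12-31"):
        # Forecast violation date moved closer: bring the due date in and say so.
        requests.put(
            f"{API}/projects/{PROJECT_ID}/issues/{issue['iid']}",
            headers=HEADERS,
            data={"due_date": forecast_violation_date},
            timeout=30,
        ).raise_for_status()
        requests.post(
            f"{API}/projects/{PROJECT_ID}/issues/{issue['iid']}/notes",
            headers=HEADERS,
            data={"body": f"Forecast violation date moved up to {forecast_violation_date}."},
            timeout=30,
        ).raise_for_status()
```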
A
And then that's part of the manual job that we're doing now: looking at the graph, hey, there's no more violation. Or the thing that you're solving right now, where something disappeared from the page, meaning it's not in the top 15 anymore, so just close it, because it'll come back when it is in the top 15 again. So all of that would be resolved. Could be.
B
The other day, Bob, it just took me a while to realize that you'd convinced me, so yeah.
E
It feels like this takes up people's time at the moment, so I think it's important to make these changes and get started with it, because every week that we delay is another week of having to go through this process of trying to figure out why we're getting alerts or not getting alerts. So I'm not... is this big enough to be a project?
E
I think let's create the epic and raise the issues, but if either of you has space to start on it, let's start as soon as we can. I think it's just going to save time and effort in the long run, and if we delay and wait, we're just going to get frustrated with the fact that we know how to fix this and we aren't scheduling it for work.
B
Okay, thank you. So I guess... okay.
A
I can also create the epic. Yeah, yeah, I'll do that. Should I create the issues in the timeline project, like the other ones? They can still be there, since that other one is on the gl-infra side, so...
D
No, I can share the initial exploration for the Redis Cluster PoCs. I'm quite curious... I took a short stab at it with Tanka, to do a Kubernetes deployment, partly because you can actually do a full deployment on minikube locally. Compared to that, I think VM-based is a lot harder, at least for the backend engineers, because we don't have that much access, and I think there's a lot more provisioning required to actually do a local run. But yeah.

You could run a full six- to seven-node cluster in a Multipass VM and then connect to it using a Rails console, and then you could just keep writing to it via the GitLab Redis cache module, and you can actually go into each master and see where the keys go. And yes, that's just the local setup, but I think I'll probably need help, or to pair with an SRE, to be able to take this further down the line.
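For the "see where the keys go" part, the cluster itself can report which hash slot (and therefore which shard) a key maps to. A small sketch using redis-py against one node of a local test cluster; the address and key names are placeholders:

```python
# Sketch: ask a Redis Cluster node which hash slot some keys map to, and list
# which master owns which slot range. Host/port and key names are placeholders
# for a local minikube / Multipass test cluster.
import redis

node = redis.Redis(host="127.0.0.1", port=7000, decode_responses=True)

for key in ("cache:gitlab:project:1", "cache:gitlab:project:2", "session:abc"):
    slot = node.execute_command("CLUSTER", "KEYSLOT", key)
    print(f"{key!r} -> slot {slot}")

# CLUSTER NODES shows which master serves which slot range.
print(node.execute_command("CLUSTER", "NODES"))
```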
A
Related to prioritization: redis-cache has just shown up on the capacity planning reports for primary CPU, for when, December? Somewhere around 80% confidence. So that's, like, not...
E
Well, in terms of prioritization, for the Redis Cluster work we need to get the Redis rate-limiting and Redis registry ones into production. I think they're both on staging now, and I think the readiness review epic has already had a round of feedback for the registry one, and some of those items might be relevant for the rate-limiting instance as well. And then the plan was to crack on with Redis Cluster.
E
The question is just about proceeding, as I wrote on the issue, on VMs, because I think it was less risky and there's more visibility into what's going on there while we're still figuring out what Redis Cluster is and does. And Matt's got availability to help with that as well.
A
Didn't you try out the chart, the Bitnami chart, for Redis Cluster?
D
Yeah, I tried the Bitnami chart. It comes with a post-deployment hook job that runs to help add nodes, but it's not that clean when you do scaling down. Then again, I don't think we would scale down much; I would guess we mostly scale up and then keep it, and we rotate to do sort of maintenance. Scaling down, I'm not sure, practically, whether we do that much.
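One way to sanity-check the chart's scale-up and scale-down hooks is to confirm afterwards that the cluster still reports a healthy state and full slot coverage. A minimal sketch; the node address is a placeholder, and the handling covers redis-py versions that return CLUSTER INFO either raw or pre-parsed:

```python
# Sketch: after scaling a Redis Cluster (e.g. via the Bitnami chart's hooks),
# verify that all 16384 hash slots are assigned and the cluster state is ok.
import redis

node = redis.Redis(host="127.0.0.1", port=7000, decode_responses=True)
raw = node.execute_command("CLUSTER", "INFO")

# Normalise to a dict of strings whether redis-py parsed the reply or not.
if isinstance(raw, dict):
    info = {k: str(v) for k, v in raw.items()}
else:
    info = dict(line.split(":", 1) for line in raw.splitlines() if ":" in line)

assert info["cluster_state"].strip() == "ok", info["cluster_state"]
assert int(info["cluster_slots_assigned"]) == 16384, info["cluster_slots_assigned"]
print("cluster healthy:", info["cluster_known_nodes"], "known nodes")
```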
D
But if we do, then I think VMs might be better, for more control. I've discussed this before in our coffee chat: VMs give us way more control and more confidence in doing something that we're doing for the first time, because the team in general doesn't have that much experience with Redis Cluster slots compared to Sentinel, so I think we'll probably roll out more confidently on VMs.
D
Yeah, something was mentioned about how we have to prepare for CPU spikes when we do rebalancing, when we add in a new node, or even when restarting or rebalancing. That might be something that is hard to project, at least without a proper load test. I contemplated doing a load test in a VM, but I don't think that's actually going to be at all accurate.
A
We need to... I don't know what the rebalancing process looks like, and if it's a CPU-heavy thing and we try to do that on an already busy cluster, then yeah, that's something we should figure out. And, let's see, I thought it was easier to do on VMs.
E
What can you see when you have one... like, there were so many questions that we had at the start of all this, and I think that getting these two instances in answers some of those questions, and it might answer enough of them that we do Redis Cluster on Kubernetes straight away. But I think the consensus when we last talked about it was to at least start with VMs; having the visibility would be a massive help for all of the pieces that we don't know.
A
One extra thing there is that we'll probably need to do some kind of migration, turning the existing instances we have, so persistent and the cache, into instances in Redis Cluster, capital C. That's going to be easier to do if it's not a mixed deployment as well.
A
Sean built that because we wanted to do Redis Cluster at some point, but then we said it's not that urgent; let's just make sure that nobody else adds new things that aren't compatible. Okay.
C
I mean, if that were the case, then that would give us even more confidence, but I'm guessing the answer is probably no.
D
Well, yeah, I sort of got into Omnibus to hack it, like, hack in and amend the config so that the application can use Redis Cluster. So I think for customers to do that, they'd have to modify Omnibus, yeah.
C
Yeah, so that makes it even less likely. So the application itself is capable of working with Redis Cluster, but Omnibus isn't; you can configure it, but not through Omnibus. Is that the current status?
D
Yeah, so that's all from me in terms of Redis Cluster. I think the next step is to create more issues about the criteria for deciding whether we go Kubernetes or VMs, and to prioritize it, and then wait on the two existing Kubernetes ones, the rate-limiting and the registry, yeah.
A
Sylvester, for all of your tests, have you looked at GitLab Sandbox, where you can have your own GCP project to do stuff on VMs, do some stuff on a GKE cluster, and so on? I've played with some of the Tanka stuff there, and that was easier than using minikube locally.
D
Oh, okay! I might try that out, to have something there so I can run practical, a bit more realistic tests; like, we could check out how bad it will be going from a three-shard cluster to a four-shard cluster, things like that, at least to get some estimates so that we don't get caught off guard.
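For the "how bad is going from three shards to four" question, one cheap estimate is to run a small write/read probe against the test cluster while the reshard is in progress and watch latency and transient errors. A sketch using redis-py's cluster client; the startup node address is a placeholder:

```python
# Sketch: a crude load probe to run while adding a shard / rebalancing a test
# cluster, to get a feel for latency spikes and transient errors.
import time
import statistics
from redis.cluster import RedisCluster

rc = RedisCluster(host="127.0.0.1", port=7000, decode_responses=True)

latencies, errors = [], 0
deadline = time.time() + 120          # probe for two minutes while resharding
i = 0
while time.time() < deadline:
    start = time.perf_counter()
    try:
        rc.set(f"probe:{i}", "x", ex=300)
        rc.get(f"probe:{i}")
    except Exception:
        errors += 1
    latencies.append((time.perf_counter() - start) * 1000)
    i += 1
    time.sleep(0.01)

latencies.sort()
p99 = latencies[int(len(latencies) * 0.99) - 1]
print(f"{len(latencies)} ops, {errors} errors, "
      f"median {statistics.median(latencies):.2f} ms, p99 {p99:.2f} ms")
```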
E
Cool. Well, thank you so much for taking us through those things; hope you enjoy the rest of your day.