From YouTube: 2022-03-09 GitLab.com k8s migration EMEA/AMER
A
B
Let's see, we've got two items on the agenda. Henry, how about you go first?
C
If you don't mind? Yeah, yeah, sure. So I don't have anything ready to demo yet, but I wanted to give a short update on where we are with Redis right now. I merged most of the things that you created in your branch, Skarbek, to master. There will be one more MR coming, and this gives us the scripting to be able to test the switch to using DNS names on the VMs, and that should work on parameterized clusters and nodes and hosts.
C
So we can use this later, in the same form, for doing the real switchover, maybe with some rework. What I'm doing now is testing exactly this procedure, to see if I spot any flaws or improvements that I can make there, and then see if we can get to a state where we maybe want to try this in pre, even — but we need to see how comfortable we feel with that. So that's where we are with this right now.
C
A
C
Yeah, so that's what I wanted to report, if there are no other questions.
A
C
I want to test this procedure in our sandbox cluster first, to be really sure that it works right, to find all the possible flaws, and then basically know what could go wrong. Once we feel comfortable with that — and I hope there's not much missing for that — we should see how we can do the same in pre, right: applying those scripts on pre, which should switch over the…
A
C
Yeah, so this is testing whether our automation so far is sane enough and working, so that we can use it in production. Also, what we need to test is to try running this with some load — like something which is doing requests — so we see if we fail, or how much we fail.
E
C
While doing the switchover — because it involves doing failovers and then switching configurations, resetting Sentinels — we need to see how many requests we might lose in between, maybe. This is what we need to test here.
B
That's effectively where I dropped off. Like, one of my notes, and one issue that's assigned to me, was: I would like to get the benchmark process of that environment running again, and perform the failover as the benchmark is running, to see, you know, what's the failure rate of those requests. So that all makes sense.
C
A
C
Yeah, let's do that. So, Skarbek, I'm officially taking this over from you. Okay.
C
B
I don't have a formalized discussion, so I'm just going to try to explain what I've been looking at recently, so that all of us can be aware, and then maybe, if anyone has ideas or thoughts, brain-dump them on me. I'll share my screen, because I don't really have a good way to present this without fumbling over my own words.
B
So, a little while ago we started to see more and more issues where, when production is ready to be deployed, or when someone is doing a feature flag change, we run a simple Prometheus query to determine whether or not our services are healthy. And you can see from the screenshots that Amy provided: the main stage of Sidekiq was not healthy during one run, and apparently two minutes later someone ran the same command and Sidekiq isn't there at all, and our git stage went unhealthy.
B
So I started looking into this, because this is rather concerning — we don't want this to happen. Obviously we do have active work happening on our metrics from the Scalability side of things, so, you know, maybe something regressed and made things worse, or what have you. So I started looking into this, and one of the things we noticed was that our metrics had holes in them, which…
C
B
Good. And this is a metric — so this is the actual measure that we use. It's called the GitLab deployment health service metric, and we've got one for apdex and one for errors. We mash those two together, and if we get a value of one, things are healthy; if we get a value of zero, things are not healthy. And we've got this very complicated…
B
I drew up a chart that does this. Prometheus doesn't provide an easy mechanism to do this, so there's actually some complicated logic — I'm going to save that for outside this conversation — but effectively we're looking at the service ratio for one hour, for five minutes, for six hours and for thirty minutes, and we combine them too. And if we're good, we actually have a value of zero, which we flip to one.
B
If we have a value of one, we flip it to zero, and vice versa, to determine whether a given service is healthy or not. The problem is that these metrics — so the GitLab service errors ratios over six hours and one hour — are quite full of holes: for some reason there's just no metric showing up. If we decompose what those metrics consist of, we do see them dip for various services; they do occasionally dip to a value of zero and back up to one.
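A minimal sketch of the kind of combined rule being described, assuming hypothetical input names for the per-window apdex and error ratios (the real GitLab recording rules are generated and named differently): each input is taken to be 1 when its window is violated and 0 when it is fine, so combining, clamping and flipping gives 1 = healthy, 0 = unhealthy.

```yaml
# Hypothetical sketch only: metric names, labels and windows are assumptions
# based on the description above, not the actual generated GitLab rules.
groups:
  - name: deployment_health_sketch
    rules:
      - record: deployment_health:service:sketch
        # Each *_violated input is assumed to be 1 when the apdex/error ratio
        # for that window is out of budget, 0 otherwise, and all inputs are
        # assumed to carry identical label sets so the additions match.
        # Clamping and flipping yields 1 = healthy, 0 = unhealthy.
        # Note: if any input has a gap, the whole result goes missing, which
        # is exactly the symptom described in the meeting.
        expr: |
          1 - clamp_max(
              service_apdex_violated:ratio_1h
            + service_apdex_violated:ratio_6h
            + service_errors_violated:ratio_5m
            + service_errors_violated:ratio_30m
            , 1)
```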
B
So when looking into this, I think I found somewhere that, well, metrics are missing, so I thought one of the things I could do was: let's just get rid of the metrics that are missing. And Andrew is against doing this, because this picks up errors that are occurring over the course of the lifetime of a given service — even though it's a six-hour window, it's looking at the errors over the course of history.
B
So if errors are increasing over the course of six hours, there could be some other inherent issue that just doesn't show up within a short time span of an hour. So it's wise to keep this metric in place: we're trying to capture everything, all-encompassing, in a given deployment package.
B
What we found in our investigation was, obviously, that, you know, the metrics are full of holes, but we're also getting timeouts — metrics are simply not being… somewhere in here… goodness, where is it?
B
B
So if I just do this: one metric is simply called the completion seconds bucket. This is what feeds into the apdex, for example. And this is already taking a really long time to render, given the amount of metrics that we produce — so much so that, you know, Thanos is crashing on my machine — but we got over 8,000 series returned for just this one metric alone. So the cardinality for this one metric is extraordinarily high.
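To make the cardinality point concrete, one way to see it is to count the series behind the histogram. The metric name below is an assumption inferred from "completion seconds bucket"; the counting rule itself is generic.

```yaml
# Hypothetical helper rule for tracking cardinality; the metric name is an
# assumption based on the discussion above.
groups:
  - name: cardinality_sketch
    rules:
      - record: sidekiq_jobs_completion_seconds_bucket:series_count
        # Counts the distinct time series behind the histogram. Each worker,
        # queue, "le" bucket and pod label combination multiplies the series
        # count, which is how a single metric ends up with thousands of
        # series across the fleet.
        expr: count(sidekiq_jobs_completion_seconds_bucket)
```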
B
Yesterday, when I was looking at this issue, at the time I was looking at it we had over 570 pods running in production. So for this one metric alone, for every single pod, we've got 500 times 8,000 series — that's quite a lot, and I think that's the problem we're suffering from when we try to query Prometheus and try to build our metrics.
B
B
I didn't screenshot it — that's why I can't find it — but the thing is just this: Prometheus has built-in timeouts, and we're simply timing out querying this extravagantly large amount of metrics overall.
B
And this is it: rule evaluation failed, query timeout. And this is happening for all of our Prometheus instances inside of Kubernetes in production today. So this is not really a small problem — this is kind of a large problem, because it's going to impact quite a few things. This bubbles all the way back up to our reporting: the metrics that we use for determining how healthy services are over lengthy periods of time, like looking at a month overview, for example.
B
B
B
I think I calculated that Sidekiq outputs over 10,000 metrics per pod, which is excruciatingly large. So, like I mentioned before, Scalability is already working on metrics as a whole, for various other aspects, but one of the things that we're considering is lowering the number of buckets that we use to capture the completion times for various workers, as well as the buckets that we use for capturing database-related metrics for Redis and Postgres. So shrinking metrics will certainly help us.
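The bucket reduction itself happens where the histograms are defined, in the application. Purely as an illustration of the same "fewer series" idea on the scrape side, here is a hedged sketch of dropping a subset of histogram buckets via metric relabelling; the job name, metric name and bucket boundaries are all assumptions, and dropping buckets reduces the resolution of histogram_quantile over what remains.

```yaml
# Hypothetical scrape-config fragment; names and bucket values are assumed.
scrape_configs:
  - job_name: gitlab-sidekiq
    metric_relabel_configs:
      # Drop a few fine-grained "le" buckets of one histogram at ingestion
      # time. Every dropped bucket removes one series per pod per label set.
      - source_labels: [__name__, le]
        regex: sidekiq_jobs_completion_seconds_bucket;(0\.01|0\.05|0\.1|0\.25)
        action: drop
```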
B
I guess the question I have for anyone on this call — because I don't know how to look into this appropriately — is: are there other things that we could potentially look at doing that don't involve touching metrics? Because that's already being worked on, and hopefully we can be successful in that realm. But we also have Prometheus, where we've got one Prometheus per cluster.
B
I wonder if we need to split out these Prometheus deployments into many, and then have those Prometheuses target specific items. So we might have a Prometheus dedicated to Sidekiq, or a Prometheus dedicated to specific shards of Sidekiq — or, maybe more simplistically, we have one Prometheus dedicated to GitLab and then another Prometheus dedicated to everything else.
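A minimal sketch of what a Sidekiq-only Prometheus could look like with Kubernetes service discovery. The namespace and pod label are assumptions for illustration; the real fleet is configured differently.

```yaml
# Hypothetical config for a Prometheus that scrapes only Sidekiq pods.
scrape_configs:
  - job_name: sidekiq-only
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [gitlab]          # assumed namespace
    relabel_configs:
      # Keep only pods labelled app=sidekiq; all other discovered targets are
      # dropped, so this server carries none of the web/api/git series.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: sidekiq
        action: keep
```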
B
I'm looking at other ways that we might want to start thinking about to alleviate this problem, because I don't see it going away. In fact, I see it getting worse over the course of time, because GitLab is only going to grow. Prometheus has already been a source of contention: you know, we occasionally run into problems with WAL files, and we need to completely trash and restart Prometheus and delete all of its data, so we have to rely on the redundant node to provide us those metrics so that we don't lose any.
B
I don't know — open question. I don't know if I covered the topic very well, but, you know: feedback or questions?
A
B
So that's the thing that I'm looking into right now. This is a blocker for me, because I don't know of another good way to fix the problem for team Delivery — I think our metrics are where we source all of this information.
B
So if we try to build some mechanism inside of release tools that says "ignore the metric if it's empty", that is a very dangerous thing to be doing. If we tell release tools "hey, this metric is missing, even though it should be here", we're not fixing the problem — we're just exposing a problem that we already know very much is a problem today.
B
B
C
I think this is a really complex topic. Yes, for one, I mean, it needs to be fixed at the root, because the whole of engineering is depending on working metrics, right? So we need to find a solution to overloading our Prometheus or Thanos instances with too many metrics or too-high cardinality.
C
One approach I saw mentioned in the issue was that there's an epic for splitting up Prometheus and Grafana, as you mentioned — maybe having specific Prometheuses for different components, or something like that. So there's an epic for this already, but I guess there's some work to be done as an immediate solution, or a patch just for us.
C
The only thing I could think of is that we try to come up with some recording rule which tries to, you know, piece together the metrics that are mostly working and leaves out what has gaps, maybe. But then we wouldn't have coverage for Sidekiq — though maybe we could find a replacement query or recording rule for this one which makes it, you know, not as heavy, and thus working.
C
So this would be patching around the current problem: we would have our own recorded metric for saying "this service is healthy", which could lean on what we have already where it's working, and replace the metrics with gaps with something that's just, you know, not as heavy, maybe. But this would also be some work, and then it's just a patch around a recurring problem. Right, yeah.
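A hedged sketch of what such a Delivery-only patch rule could look like, leaning on short-window ratios that still populate and simply omitting the gap-prone 1h/6h inputs. All names and windows here are assumptions, consistent with the earlier sketch; the real rules are generated, as discussed below.

```yaml
# Hypothetical Delivery-specific patch; input names and windows are assumed.
groups:
  - name: delivery_health_patch_sketch
    rules:
      - record: delivery:deployment_health:service:patched
        # Same flip as the existing rule, but built only from the
        # short-window inputs that are assumed not to show gaps.
        expr: |
          1 - clamp_max(
              service_apdex_violated:ratio_5m
            + service_errors_violated:ratio_30m
            , 1)
```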
A
So I guess, like, in terms of — I mean, thanks for sharing that effort. Are we in agreement that that's, like, the proper solution?
B
I don't know if it's the proper solution; it's one idea that we have that could probably help us out here. Like, our Prometheus right now is just overloaded: you know, it's consuming an enormous amount of RAM, it's consuming an enormous amount of CPU. For some of our zonal clusters we have nodes dedicated to running only Prometheus and effectively nothing else.
B
But there's a lot of work to make that happen. Like, that's not going to be an "oh, let's just deploy it and we're done". That's not a one-day thing: that's going to be a multi-week, multi-month project to get it going, reconfigure things, make sure we didn't break anything, etc. It's not a quick solution — that's my concern.
B
A
Yeah, that makes sense. Okay. And the idea you suggested, Henry — what do you reckon in terms of, like, rough size for that?
C
It's hard to say — I mean, that just came to my mind — but this would just involve getting deep into the rule set that we have for Prometheus and figuring out how we can set up our own recording rule, which would then need to leave out some parts which are not working really reliably, and trying to find replacement queries for those which are working, and patching this together.
C
So that would be — I mean, just working on those recording rules. We'd need some knowledge about how this all fits together and works, because this is some complex generated JSON stuff that we're dealing with. But maybe we can just make an easier manual patch around this instead of trying to, you know, auto-generate a lot of stuff here — that could work. That would be a good discussion, maybe with people like Andrew or maybe Igor, or whoever else is deep into our recording rules.
A
I would probably prefer, like — if it's — it sounds like weeks to months for us to do that, I would say.
C
I don't think it's weeks and months — it's just spending a few days, maybe, looking deeply into our recording rules. This is just working on some kind of code, our Prometheus expressions there, probably, and I think that would be work where we can just say: okay, we try for three days to get this fixed this way, and if not, we just stop here. But it shouldn't take much, much longer, I think.
B
C
B
Like, I don't know what else uses these metrics. I didn't know that feature flag changes were using these metrics — I thought it was just our own Delivery tooling. Like, I thought these metrics were built specifically for Delivery and no one else, so learning that this is used elsewhere has me a bit nervous. If other people are relying on these metrics, then whatever solution we choose with Henry's idea, we need to make sure that we clearly advertise that, hey, these metrics may not be correct because of x, y and z…
B
…for the moment, until we have a better permanent solution. And then, if we implement said solution, I guess a further facet of this idea is: I want to make sure that we don't leave that patch in place for long, because I don't want to be giving people incorrect information, and I want to make sure that we're going to at least get towards the final solution in the long run — which would probably involve that epic that Henry mentioned earlier.
C
I don't think we need to patch the current recording rule — we can just leave it in place, so it keeps working.
C
So we just need to make sure that whoever wants to use it knows that this is kind of incomplete and is patching around a problem. But I guess, if you name it accordingly, not too many people would try to use it, right? "GitLab deployment…"
E
C
…rely on recording rules already. So everything you're relying on is based on recording rules, mostly, and there are several layers of abstraction and JSON generating these — and I think Andrew is the only person who really understands what it is. But yeah.
A
Okay. For now — given that — it's great we have an option. I would rather not have to patch something specifically for Delivery if we can avoid it. Let me go off and see what we can do, because it sounds like this one is a cross-department…
A
…effort — something will come out of it. But let me see what we can actually pull together and figure out, so that we can get an idea of what the real fix that goes in for this is going to be, and when we can maybe expect that, or contribute to that. And then, once we know a bit more, maybe we make a plan for whether we want to actually — like you say, Henry — time-box a few days and see if we can actually put something up for Delivery.
B
So one of the items that Bob proposed reduces the number of metrics we have by a certain amount. I don't know how to evaluate how significant that proposal is until after it's done, unfortunately — and this is just due to my lack of knowledge of Prometheus in this particular case. I would love to see—
B
If we accept that proposal, I would love to see what kind of impact it has, because if we can reduce any metrics that we have, I think that'll be beneficial to us. So I'd love to see us try to see that through first, if possible, and…
A
Yeah, I absolutely agree. And I think on this one, Skarbek, when you say "we" — all right, let's all remember that's Infrastructure. So, like, Bob is actively working on this proposal for reducing — it's in the improvements for error budgets — reducing the number of metrics we're capturing.
A
I see Steve's commenting on the epic Henry shared, on 623, about the options there. So there are a lot of people talking about this and sort of thinking about this. So I think that's — let's see how we can coordinate a bit of a plan to actually fix the problem.
C
That's one question — and how much is this blocking us? I guess it's annoying, because we see these in our production health checks, right, but that's…
A
A
E
C
We could go to the dashboard for the service that we're looking at right at this moment and see if it's looking healthy there — that would be a workaround for now, to be sure: okay, it's still okay. Because there we don't mix all of this together; we have, you know, just the standard apdex dashboards over a shorter time range, and those are working normally. And I think the really bad thing is just if you want to automate, right?
A
So, about one deploy a day — so it might be worth, like: does somebody want to actually just put together a small thing, then, so that a release manager has, like, a quick-step guide of: you run this thing; if you're not seeing all of these metrics, go here, look for this; double-check before you hit the button.
B
I could do that — awesome, I'll work on that, because that's a good thing to have.
A
And then that kind of gets us through this immediate problem. But yeah, like you say, Henry, we won't be able to switch to automated promotions — but it sounds like, from Skarbek's kind of initial overview, that's actually the least of our problems on this metrics one right now. So hopefully we'll actually, like, see a project come together quite quickly.
A
Awesome — thanks for bringing that one up, Skarbek, and for the investigation — actually, like, super, super great. It was a much more interesting investigation than I expected from the issue when I opened it.
B
Okay, well, that's all I had. Any other questions, comments or fun topics anyone wants to bring up?
A
I think probably two fun topics. So, one: I don't know if you've all met Liam — Liam has just joined Platform. It was the week before last, right, Liam? I've lost track of time. So—
D
Yeah, it's been nearly two weeks now — it's flown by, yeah. I don't think I've met the majority of the people on the call, actually; I've seen your names lots, of course, around Slack and on GitLab. Yeah, I joined from the Manage stage, which I was working with for about three and a half years, and I felt as if my time there was coming to an end, and it was fun to come over and see what was going on on the infrastructure side of the business.
D
So I've joined Scalability, working alongside Rachel to spin up a second team there. And so, yeah, I'm excited by the challenges that exist over on the infrastructure side, and there's lots to learn, lots to do and, yeah, lots of opportunities for growth for me and for the company.
A
Yeah, awesome — good to have you over here, Liam. And for those of you on Redis: Liam's going to be sort of starting out getting kind of caught up on the Redis project, so you'll get to know Henry probably first, and then maybe Akamada should come off release management in a couple of weeks or so, and you might be able to get more involved there as well.