From YouTube: 2020-07-31 GitLab.com K8s migration EMEA
B: Yeah, sure thing. So, no update on enabling live traces.
B: Just checking to see — yeah, nothing new here. Support for the dependency proxy: I think this is done. I think this was...
B: You have an MR for it — okay. NFS dependency on Pages: Jakub is working on this, so hopefully that will happen soon. And then I don't think there's been any significant update for mapping services to cloud native, but we're not completely blocked — we're just going to be blocked very soon, as soon as we finish the Git HTTPS migration. And "state-of-the-art logging", which is a funny title, but that's just what it happened to be called when it was submitted. This is — it sounds like we...
B: We may use the sidecar approach, and I think I owe Mare an issue in the GitLab tracker for trying to see if we can fix this at the application level by annotating logs with the source. I think this is going to be problematic, though, just because we also have the unstructured logs, like production.log and application.log, all that stuff. I'm not sure there's much we can do within the app. But — any comments on those blockers?
B: My plan for next week is to have canary running by the end of next week — to have canary running in Kubernetes, as well as some of the production traffic. What I'll do is use the Kubernetes cluster as, like, a single server in the list of servers, so that we take maybe a single-digit percentage of traffic to the cluster. So we'll start with canary and then move to that; we'll probably let it sit like that for a while, and then maybe the following week we'll do the full migration and finish staging. So far, I mean, there weren't really any issues.
B: I am a little bit concerned about log volume, because a bit different from what we have now is that we get all of the logs going to the Rails index — all of the logs.
B: Well, just the Workhorse logs for the Workhorse index, but for the Rails index we're getting, like, production.log, application.log — all of these are going to it, yeah. I guess there isn't a whole lot to see.
B: Maybe this — well, so maybe this sidecar should be a blocker for us going to production, if we're worried about it. I think we're going to increase the volume by quite a bit and also have a lot of junk in the Rails index that you don't want to look at — yeah, and there's no point in sending it.
E: Yeah, I think we talked about it a bit yesterday. I just don't remember what I said, because it was yesterday. So I mean, I think it should be a blocker, and actually I think I said that in the end, after you explained how it is — or not, right? I don't know.
B: Yes, yes — yeah, I guess. Okay, so we're going to make that a blocker, which means we may be delayed on getting production migrated because of it. But I'll update that issue to let the Distribution team know that it's a blocker, and we'll probably have to tackle it next week.
B: Well, we're not running blind, but we're kind of overdosing on stimulus, you know — it's too much, because we're getting all these unstructured logs as well. It's just — it's really good, yeah.
D: I mean, if we could just turn off the — like, if it was urgent that we got this into production and we wanted to avoid the blocker, could we not just turn off the production Rails log, and then everything else would kind of be okay? But then — it's only that one, and no one uses that, so, like, we don't want it.
B: Okay, let's — yeah, let's table it and we can discuss, but obviously it's a high-priority thing if it's going to block us.
B: This is the dashboard — the pod info dashboard — which is just a copy of the existing dashboard we have for Sidekiq and Mailroom. I guess the thing to note here is that for staging I'm using the default requests and limits, so we have a limit of one and a half CPU and...
B: So you can see that we're spiking up over a core on some of these, so I'm not sure — maybe we're going to have to adjust the limits a bit. For memory, it looks like we're also kind of spiking up, going up to three gigabytes — let me just check, it's set to two. So I think we need to do some investigation here to see whether maybe our limits are too low.
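For illustration only — these are not the GitLab chart's actual values — here is roughly what the default requests and limits being discussed could look like, expressed with the Kubernetes Go API types. Only the 1.5 CPU limit and the roughly 3 GB memory spikes come from the discussion above; the request figures are hypothetical placeholders.

```go
// Minimal sketch of pod resource requests/limits, assuming values like the
// ones mentioned above; not the actual gitlab-org/charts defaults.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	res := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("1"),   // hypothetical request
			corev1.ResourceMemory: resource.MustParse("2Gi"), // hypothetical request ("set to two")
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("1500m"), // the 1.5-core limit mentioned above
			corev1.ResourceMemory: resource.MustParse("3Gi"),   // pods were seen spiking toward ~3 GB
		},
	}
	fmt.Printf("limits: cpu=%s memory=%s\n", res.Limits.Cpu(), res.Limits.Memory())
}
```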
B: So, next: there's a new issue for kind of summarizing where we are in the migration after a year, and one thing I put together today is this dashboard called the Kubernetes migration overview/efficiency dashboard.
B: What I did is I totaled up the number of cores and the amount of memory that we were running on VMs prior to the migration. This has remained fairly stable — I don't think we've added... well, I'll have to double-check, but since we started the migration I don't think we've added many nodes in the web or API or git fleet.
B: So we had about 1500 cores when we were running on VMs, and about 5000 gigabytes of memory. So I have two panels here. This is actually a little bit negative — I'm not sure why this single stat... I'll have to figure out — or no, it's actually not negative, never mind. It's very...
B: So maybe that's why it's showing zero, but it's kind of funny if we add up the numbers. So I have the total number of cores for virtual machines right now, and the total number of cores in the Kubernetes cluster right now, and if you add these up, as well as the amount of memory, we're sitting at approximately the same amount we were running under VMs — which kind of makes sense; we're about even. So what I'm hoping is that right now we're a bit over-provisioned in the Kubernetes cluster.
B: So what we can do is hopefully drive this memory savings and CPU savings up a little bit, so that the total amount of cores and the total amount of memory is less than we were running prior to the migration.
B: I'm not done with this dashboard yet; I plan to add some more things. What I want to do is break it out by service and show the total number of cores and memory and utilization per service, as well as maybe showing both node autoscaling and HPA scaling, to see how it's working in Kubernetes. Is there anything anyone would like to see here, or — first of all, I guess...
E: It could be just my day-to-day and the warmth here, but I'm really having a bit of trouble understanding the numbers I'm seeing. Sure — is this the current state, as in there is no historical comparison, right? Like, we're not going to compare, say, registry from August 2019 with registry from August 2020 — this is just what's happening right now. Is that correct?
B: These are the number of cores we were running without Kubernetes, and this is the amount of memory we were running without Kubernetes. What it's taking is the total number of cores between VMs and Kubernetes and dividing it by the number of cores we were running prior to the migration. So as we shift cores over from the VMs to Kubernetes, this number might remain the same, but hopefully we're utilizing cores more efficiently in Kubernetes, and that number — the savings — will increase.
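As a quick worked example of that single stat (assuming it really is just the current VM-plus-Kubernetes core count divided into the pre-migration baseline): with the 1514-core baseline mentioned later in the meeting and a current total of roughly 1500 cores, the savings come out to about 1%, i.e. roughly even. The VM/Kubernetes split below is a hypothetical placeholder; only the total was stated.

```go
// Illustrative calculation of the "savings" stat described above.
package main

import "fmt"

func main() {
	coresBeforeMigration := 1514.0           // VM fleet prior to the migration
	vmCoresNow, k8sCoresNow := 1100.0, 400.0 // hypothetical split summing to ~1500
	savings := 1 - (vmCoresNow+k8sCoresNow)/coresBeforeMigration
	fmt.Printf("core savings vs. pre-migration baseline: %.1f%%\n", savings*100) // ~0.9%, i.e. roughly even
}
```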
E: Yeah, I guess what I'm missing here is — well, first of all, the per-service breakdown you said you're working on: that's going to be a good thing, because right now we're mangling too many things at the same time, so it's hard for me to understand what's happening specifically for, say, registry. What would actually also be interesting to know is whether there is a way for us to say what number of requests we were serving at a certain point.
E: So, for example, if we were serving a thousand requests in August and we are serving 15,000 requests now, are we doing better or worse?
B: ...and memory — you want to know how many requests we were serving, because obviously it could be that we're using more cores and memory just because the number of requests has increased.
D: It's very difficult over time to — I understand, and I think the problem is that the workload changes. So if the workload was fixed, then it would be very easy to do that calculation, but obviously, you know, endpoints get more efficient and endpoints get less efficient, and so you're trying to juggle lots of different variables.
D: Like, if you could get it over a longer period — if you took a month-long period, not a short period like an hour or a minute, and you said, right, in this month the growth was three percent and we used 20 percent less cores, or whatever, then that would be valid. But I think if you were looking over a short period, it would be...
B: Yeah, cool. But yeah, I mean, I was kind of surprised to see that we're about even right now — but I guess not too surprised, like...
B: Yeah, so first of all, this includes everything except Pages, Patroni, Redis, Gitaly and Praefect — so, all the stuff that we're going to migrate over. When I say "even", what I'm saying is that before, when we were on VMs, we had 1514 cores. If you sum up the total number of cores between the virtual machines and the Kubernetes cluster now, we're at about 1500.
B: Well, keep in mind that git — well, git, web and API — are the services that we are scaling the most, and these are snapshots of today, not of a year ago. Registry is a snapshot of a year ago or so, but we typically don't have scaling issues for registry — at least on VMs we didn't; we were running...
B: Yeah, so I think it's difficult, but I like just being able to see the number — keeping a tally of the number of cores and the amount of memory we have in the cluster and making sure that's under control, because...
E: ...and improve it the way you want, but just do it, like, you know, in a Google Sheet: this is the amount of cores we had before the migration; this is the amount of requests we had; this is the amount of cores we have right now; this is the amount of requests we have right now. It can be as simple as that, and then we can just plot that — very low-level.
A: Cool. Will this end up being a way of us actually seeing some of the progress of the migration as well? I'm guessing we'll see, like, the number of cores coming down on the VMs, yeah. I guess it's not a hard progress tracker, but it's a little bit of a visual of progress, right?
D: Okay, cool. I don't know how deep to go into this, but I can give you a quick demo of what I did. I think it was last week or the week before that we were talking about how to figure out which Sidekiq jobs are talking to NFS, and the two approaches. Maybe I should close the door — give me one second.
D: The kids decided to start playing outside my room as I start talking. So basically, the two approaches are: one, we send a message to Sentry every time an access happens, or something like that — which is the approach that we used on Gitaly when we were doing a similar sort of migration to move away from NFS, near the end of the Gitaly project, and that was pretty successful.
D: But this one was a bit more of a risk. The idea was we could use kernel instrumentation to trap the NFS calls and then figure out what was going on at that time — figure out what's making NFS calls — and so it's a little bit more risky. So I thought I'd give it a try and just spike it, and if that didn't work, then we could go the other route, which is sort of less risky.
D: Also, if you have any questions, let me know. So basically, the way it works is: I used a library called gobpf. BPF is a set of kernel tools where you can write C code — good old-fashioned C — and then that C gets compiled and injected into the kernel. The reason this is phenomenal and incredible and amazing is that in the past, if you wanted to run things in the kernel, you'd have to be a hardcore kernel developer and spend years writing device drivers, and even longer before you put anything into a production environment. The thing with BPF is that it's got a thing called the verifier, which takes a look at your code and guarantees that it's safe, right?
D: So it is impossible to write BPF code and inject it into the kernel that is not safe, and so it's got a whole bunch of limitations — like, you can't have a for loop, which is pretty standard; the reason you can't have a for loop is that then it doesn't know the program will ever end, so it could just go around and run forever. And because you can't have a for loop, comparing strings becomes quite difficult, because if you wanted to compare two strings you'd have while loops and iterate through them. So there's a lot of stuff...
D: ...that's kind of weird, but it's actually also surprisingly powerful, and people are doing incredible stuff with it. But this is a very, very simple program, and all it does is — it's got these entry points, and when something happens, a little bit of code gets called. In this case, what's happening is: whenever an NFS operation — a write operation, a read operation or a file open operation, and I can't remember if I did getattr — actually yeah, and get-attributes...
D: ...if any of those calls get called in the kernel, this code gets executed, and all it does is get the process ID of the process that made the call originally, stick it into a table, and push that table up to userland. And then in userland there's a Go program that I wrote that basically reads all of those events — all the processes that are doing NFS access — and then does something with it.
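A heavily simplified sketch of that kernel-side piece — not the actual program from the demo — assuming the iovisor gobpf BCC bindings; the probed NFS functions and the map layout are illustrative only.

```go
// Sketch: count NFS operations per PID with a small BPF program injected via BCC.
package main

import (
	"encoding/binary"
	"fmt"
	"time"

	bpf "github.com/iovisor/gobpf/bcc"
)

// Embedded BPF C: a hash map keyed by PID, incremented whenever one of the
// probed NFS functions is entered.
const source = `
#include <uapi/linux/ptrace.h>

BPF_HASH(nfs_calls_by_pid, u32, u64);

int trace_nfs(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *count;
    count = nfs_calls_by_pid.lookup_or_init(&pid, &zero);
    (*count)++;
    return 0;
}
`

func main() {
	m := bpf.NewModule(source, []string{})
	defer m.Close()

	probe, err := m.LoadKprobe("trace_nfs")
	if err != nil {
		panic(err)
	}
	// Attach the same handler to the NFS entry points of interest
	// (write, read, open, getattr) — illustrative function names.
	for _, fn := range []string{"nfs_file_write", "nfs_file_read", "nfs_file_open", "nfs_getattr"} {
		if err := m.AttachKprobe(fn, probe, -1); err != nil {
			panic(err)
		}
	}

	// Periodically pull the PID -> call-count table up to userland.
	table := bpf.NewTable(m.TableId("nfs_calls_by_pid"), m)
	for range time.Tick(10 * time.Second) {
		for it := table.Iter(); it.Next(); {
			pid := binary.LittleEndian.Uint32(it.Key())
			calls := binary.LittleEndian.Uint64(it.Leaf())
			fmt.Printf("pid=%d nfs_calls=%d\n", pid, calls)
		}
	}
}
```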
D: So in this case, what I did was: when the program runs, it monitors the Sidekiq log — you know, the good old-fashioned JSON-formatted Sidekiq log that we've got — and it keeps track of all of these "start job" events.
D: So when a job starts, an event gets emitted with the process ID of the job and, obviously, the name — the class — of the job that was running. And so what I do is I tail that, and it says, you know — sorry, let's say a WebHookWorker started and it was on PID 53.
D: So that's running along on the one side, and on the other side I've installed the BPF program into the kernel, and it's telling me process ID 5 did NFS, process ID 10 did NFS. And then, if it matches up — if it says process ID 53, say, did an NFS write — then process 53 is a WebHookWorker, and so using those two things I can say: well, it's probable that a WebHookWorker was doing NFS work.
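A rough sketch of that userland matching, under the assumption that the Sidekiq JSON log exposes job_status, class and pid fields (the exact GitLab log schema isn't shown in the meeting):

```go
// Sketch: map Sidekiq "start" events to PIDs so BPF-reported NFS accesses can
// be attributed to a worker class.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type sidekiqLogLine struct {
	JobStatus string `json:"job_status"`
	Class     string `json:"class"`
	PID       int    `json:"pid"`
}

func main() {
	// pidToClass records the most recent worker class seen on each PID.
	// With a Sidekiq concurrency of one, a PID runs a single job at a time.
	pidToClass := map[int]string{}

	f, err := os.Open("/var/log/gitlab/sidekiq/current") // assumed log path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f) // a real implementation would tail the file
	for scanner.Scan() {
		var line sidekiqLogLine
		if err := json.Unmarshal(scanner.Bytes(), &line); err != nil {
			continue // skip unstructured lines
		}
		if line.JobStatus == "start" {
			pidToClass[line.PID] = line.Class
		}
	}

	// Each PID reported by the BPF table is then looked up here, e.g.:
	if class, ok := pidToClass[53]; ok {
		fmt.Printf("NFS access on pid 53 probably belongs to %s\n", class)
	}
}
```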
D: The only sort of requirement is that we don't have more than a concurrency of one in Sidekiq, and the reason is that if you have 10 different things running inside a single Sidekiq process, we don't know which one of those 10 things did the NFS, because we can't go to the thread level.
D: We can only go to the process level — although, now that I'm thinking about it, I think we could do that if we just add a small thing to our logs, but let's see how well it goes. So basically what it does is: it's tailing the log, picking up all the start events, and then it's tailing all the NFS — well, monitoring all NFS activity — and then what it does with that is: every time a job starts...
D: ...it increments one Prometheus counter, and every time an NFS access happens for that job, it increments another one. So theoretically, what you could do is run it for, you know, 12 hours, and at the end of it you could look at the percentages and anything that's above a few percent. I suspect there will be a few things where the timing between what we get from BPF and what we get from the logs isn't synced...
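A minimal sketch of those two counters using the Prometheus Go client; the metric names are made up for illustration, and the listen port is the one mentioned later in the demo.

```go
// Sketch: two counters (jobs started, NFS accesses) labelled by worker class,
// exposed on a /metrics endpoint.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	jobsStarted = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "sidekiq_nfs_monitor_jobs_started_total", Help: "Sidekiq jobs started, by worker class."},
		[]string{"class"},
	)
	nfsAccesses = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "sidekiq_nfs_monitor_nfs_accesses_total", Help: "NFS accesses attributed to a worker class."},
		[]string{"class"},
	)
)

func main() {
	prometheus.MustRegister(jobsStarted, nfsAccesses)

	// Called from the log tailer for every "start" event.
	jobsStarted.WithLabelValues("WebHookWorker").Inc()
	// Called whenever a BPF PID event matches a running job.
	nfsAccesses.WithLabelValues("WebHookWorker").Inc()

	// After letting it run for a while (say 12 hours), the per-class ratio
	// nfs_accesses / jobs_started gives the percentages discussed above.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":10282", nil)
}
```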
D: That was probably something else, so it's kind of a sketchy idea, but it didn't take very long to write this. It basically tails the log, parses it as JSON, looks to see if it's a start-job event, associates that job with the process that's running it, and then polls NFS — sorry, reads the BPF data from the kernel. And it actually works pretty well. So — am I showing my whole screen? I think I am, yeah.
D: So Skarbek has set up a machine that's running Sidekiq with concurrency one — but, Skarbek, it doesn't look like it's actually running any jobs.
D: So basically, most of the time for actually putting this together was spent getting a Linux environment that matches our Linux environment, with the same version of BCC — we're actually running a very old version of BCC — so that was the most difficult part, getting that all working. Basically, I set up a Vagrant box — good old old-school Vagrant — got it onto the same version, and then I could compile it on there, copy it around, and run it.
D: So basically — let me just remind myself which port — when this starts up, it will listen on 10282 and it will have metrics there.
D: So if I go like that, it should fail — and the reason it fails is that it's tried to inject BPF code into the kernel and the kernel's not happy about it, because it's not root. So either you can give the program privileges or you can run it as root, but it's pretty safe. And then, if we start letting that run... basically, now — what was the name of the... was it...
D: If I do curl — sorry about the noise, I don't know if you can hear the kids having a nice fight next door. So basically, what we've got here are just the Sidekiq Prometheus metrics. Let's open this up a bit, and you can see the Sidekiq NFS monitor "jobs started" metric: it's got the name of the class and the number of jobs of that class that have started, so you can see it's picked up six of those.
D: You know, however many of those — and then down here it's picked up the number of NFS monitor accesses. Now, the problem is that on this catchall fleet it's actually going to be slightly wrong, because it doesn't know how to attribute them. Basically, at the moment it's made an assumption that there's a concurrency of one, and there's not — the concurrency is eight or whatever — so it's pretty much wrong and you can't rely on this. But once we have the concurrency-of-one node working, we can...
D: ...we can test this properly, and it should work and the numbers should be correct. But you can see it's working — it's picking up and associating the NFS accesses with the different Sidekiq jobs — so it seems to work. I'll just kill that off. It uses about 10% of one core in terms of CPU, so it's not free, but that machine that's got, you know, concurrency one would probably have a bit of extra...
D: ...server capacity, I mean. The other thing is, if we know there are jobs that absolutely don't do NFS access, it'll kind of validate the script, because hopefully what we'll see is that no NFS accesses are being attributed to those jobs. And if we do get, like, high NFS access for them, then maybe this isn't a great approach — but at least it'll help with that.
D: Yeah, I mean, however you want to — obviously, because we have lower concurrency, the way this works is statistics, and so what you want is a high probability that when these jobs run they are doing NFS, and in order to get that we need to kind of focus and run as many of those jobs as possible.
D: So obviously, if we are spreading out and running a lot of jobs on that fleet that we know don't do any NFS, we're kind of taking away sample size from the ones that we do want to run there. So I think initially it's good to test it with that, but then eventually just leave it on the unknowns, yeah. Obviously the known yeses and the known nos — sorry, the known NFS accesses and the known non-NFS accesses — we can kind of ignore, and then it's just whatever's left.
D: Yeah, we could do that. The only other thing is that it's running as root, so I don't know if we want to have it running for too long. The other thing we could do is just give it the permission that it needs, which is, I don't know, some BPF thingamabob.
D: You know, we could give it that one permission — and I presume that's what we did, because we've got an eBPF exporter running on all the nodes already, and presumably that's not running as root. So we're already doing that on our fleet, and that might be a safe way of doing it.
D: Like I think I mentioned before, I would trust the results after there have been, like, a thousand invocations of a job, right? And if we're struggling to get that many, then maybe we need more nodes. But, like you just saw, I ran that thing now and there were a lot of jobs, right? Yeah, I mean, it was a lot higher than I actually expected, so hopefully we'll get something like that on that new node.
C: ...when we actually start the migration to Kubernetes, we could start looking at Sentry for those errors, and we could also look at our existing dashboards to determine — well, you built it into the script, but we still have our existing dashboards as well for determining how often those jobs were called. So we have a few things we could look at — that's what I'm trying to say.
A: Cool. On your final one, Andrew — I mean, it's a great question. I don't know if it's directly related to the Kubernetes migration, but it does raise a good question of whether we should be discussing those sorts of things, the deployment stuff, as well. But I'm guessing it's up to you, Skarbek — are you keen to see another...
D: I was thinking Alessia was going to be here so that we could kind of run through it — he's kind of the main target. But if you're interested: we had a really good call this morning to figure out what we're going to do there, and basically what it is, is we're going to have a separate set of thresholds for monitoring for release purposes, and they're going to be much stricter than...
D: Basically, what happened was we decided to use the monitoring thresholds that we use already, and then this morning there was a bit of a hiccup, but it wasn't enough to wake someone up. So the thinking was: we've kind of optimized the alerting around not waking people up, but for the thresholds that we want for starting a deploy...
D: ...it'd actually be ideal if they were decoupled from that, so we could move them up or down, and if we make them tighter we don't have to worry about whether people are going to get paged on the weekend. So we've basically split it into two different sets of thresholds: same metrics, but with two thresholds.