From YouTube: 2020-03-13 Background jobs improvements demo
Description
Part of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/96
B
The problem with this is that we don't really have any idea what best-effort is doing: which kinds of jobs are running there, how we can make sure they run efficiently and aren't getting in each other's way, stuff like that, and what happens when new jobs, new workers, and therefore new queues get added to it.
B
We also don't really know what kind of work they will be doing, so they just get lumped into the best-effort shard with the rest of the best-effort jobs, and yeah, we have no idea. So that's going to change: we've added annotation attributes — worker attributes — to the workers. They need to indicate whether they're latency-sensitive, whether they are CPU- or memory-bound, and whether they have external dependencies. Am I missing something? I think that's all of them.
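For reference, a worker annotated this way might look roughly like the sketch below. The worker itself is hypothetical, and the attribute methods follow GitLab's worker-attributes DSL; it only loads inside the GitLab Rails codebase, where ApplicationWorker provides these class methods.

```ruby
# Hedged sketch of the worker annotations described above; assumes the
# GitLab Rails codebase, where ApplicationWorker defines these methods.
class ExampleMirrorWorker
  include ApplicationWorker

  urgency :low                      # latency sensitivity of this worker
  worker_resource_boundary :cpu     # :cpu or :memory
  worker_has_external_dependencies! # talks to systems outside GitLab

  def perform(project_id)
    # ... the actual work ...
  end
end
```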
B
Cool. So where we want to go is to use those attributes to send jobs to nodes that are configured for that workload. For example, the one we did first was low-urgency CPU-bound jobs, which means we want to have a kind of limited concurrency on them, and this is the first thing that we've deployed; we do that using this selector.
B
So all the workers in the GitLab codebase that have their resource boundary set to CPU and don't have any specific urgency set on them will be picked up by this kind of worker, and, as we can see, that's being deployed right now, and that VM is now picking up jobs.
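Expressed as plain Ruby, the selection rule described here would be something like the following sketch (hypothetical code and sample metadata, not the actual selector implementation):

```ruby
# Hypothetical worker metadata; the real selector reads these attributes
# from the workers in the GitLab codebase.
workers = [
  { queue: 'project_export', resource_boundary: :cpu,    urgency: nil },
  { queue: 'post_receive',   resource_boundary: :cpu,    urgency: :high },
  { queue: 'emails_on_push', resource_boundary: :memory, urgency: nil }
]

# The rule described above: CPU-bound and no specific urgency set.
selected = workers.select { |w| w[:resource_boundary] == :cpu && w[:urgency].nil? }
selected.map { |w| w[:queue] } # => ["project_export"]
```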
B
What we could do is, as soon as that worker gets added — we don't have the resource boundary; well, really it's memory, but we don't have a resource boundary — create an issue and check back in on it in two months or whatever, when it's been running and we can make that calculation.
A
Okay, can someone just — I'm looking at this sheet now that Craig created, and the new shard, low-urgency CPU-bound, has a couple of these queues in it. I don't really understand why this is CPU-bound. Let's use an example: update all mirrors? Oh sorry, no, sorry: stuck import jobs. Why would that be CPU-bound?
C
The other thing I'd say, Marin, is that it's kind of difficult to guess all the reasons why things would be CPU-bound. So that's why we just use the data that we have to do that division. If it's spending something like 33 percent of its time on CPU, then we consider it CPU-bound; otherwise, we don't. That's the basis.
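That classification rule fits in a few lines; here is a minimal sketch, with the one-third threshold standing in for the "33 percent" from the discussion:

```ruby
# A queue counts as CPU-bound when at least a third of its wall-clock
# time is spent on CPU, per the data-driven split described above.
CPU_BOUND_THRESHOLD = 1.0 / 3

def cpu_bound?(cpu_seconds, duration_seconds)
  return false if duration_seconds.zero?

  cpu_seconds / duration_seconds >= CPU_BOUND_THRESHOLD
end

cpu_bound?(20.0, 60.0) # => true: a third of its time on CPU
```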
C
They're all allowed the same, and that's patently not true: some queues are much more sensitive to failure than others. So obviously, I think only the least critical should be allowed something like a ten percent error rate, but I'm using 10 percent at the moment for everything, because that's kind of what I can get — I have to use the lowest common denominator, otherwise we'll end up with lots of alerts.
B
When we do that, we're also going to have work on determining which of those workers should currently be marked as having external dependencies. But then, for example, we sometimes use errors incorrectly for retrying. For example, the update-all-mirrors worker — no, not that one — the repository update mirror worker, the one that actually does the pull, sometimes fails deliberately so it can be retried later, which affects our indicators, but it shouldn't.
C
Yeah, and we want to give people an easy way of doing a retry. But maybe we should have an error code, or an error condition, that tells Sidekiq "this is a transient failure and it's all okay", and we rely on Sidekiq to try it again and don't count it as an error, something like that.
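One possible shape for that idea is sketched below. This is a hypothetical pattern, not an existing GitLab or Sidekiq API; the helper and error names are made up:

```ruby
# Hypothetical marker class: "transient, retry me, but don't count me as a
# real error" in the error-rate indicators.
TransientFailure = Class.new(StandardError)
RemoteTimeout    = Class.new(StandardError) # hypothetical transient condition

# Stub standing in for the real pull; pretend the remote timed out.
def pull_repository(_project_id)
  raise RemoteTimeout, 'remote unreachable'
end

def perform_pull(project_id)
  pull_repository(project_id)
rescue RemoteTimeout => e
  # Re-raise as the marker class: Sidekiq still retries the job, while
  # instrumentation could exclude this class from error-rate indicators.
  raise TransientFailure, "will retry: #{e.message}"
end
```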
C
But I think there's a lot of stuff at the moment where, you know — there's a job around expiring web hook events, and it runs four times a day, I think, and I don't think it's passed in months. The reason it's failing is that the SQL is timing out. So it's really hurting Postgres, because it's running and just timing out, and we don't have any alerting on it. So we need to put alerting on these things.
C
Total SQL time, yeah. Do you want to just press 'v' on there to make it big, so people can see the screen a bit better? Yeah. So I find, when you're looking at these things and trying to figure out if there's a problem, summing the amount of time spent in something like this is much more useful than a straight p95 latency, because obviously you can have lots of small jobs or one big job, and it's a way of being able to aggregate that out.
F
So yeah, as a summary: we've been working on rolling out the profiling for all the Go services that we have, and we first started with the Workhorse web fleet; that's actually on staging, and we can see some interesting things already. Craig is working with me to release it for other fleets as well, so we are going to have Workhorse for web, Git, and API separated here. It's useful to have a better understanding of what's happening on each fleet, which is interesting.
F
Yeah, we have CPU time and heap out of the box, which is interesting. It's also really interesting that we can filter by the top 50 percent, so it's spanning from one millisecond to two seconds — sorry, one second to two seconds. So you can see here: handle file uploads. Can you see my screen? Yeah? Yes — so that's the file upload path in the internals of GitLab Workhorse, and we can see the stack trace here.
F
It's not that good a UI for having a look at this stack trace, but let me select the frame here — yeah, it goes to another screen and filters; there's some kind of fancy filter here, so it just filters by this file uploading, which is good for getting a better look. One thing that we still need to do is improve the versioning, because one thing the Go profiler does is separate profiles into different versions.
F
So let's say we just bump a version of Workhorse and we want to compare the differences in the profiles against an old version. We don't have that yet; we need to provide this version, because initially we were expecting to have it labelled with the version of the service, but that's not the reality, because we are not using Google App Engine, I think, yeah. We are using GCP, which doesn't have this env var, and we were discussing that — me and Jarv — these days. So we will need to pass this to the profiler.
F
Which means that, for instance, if we release another version of Workhorse, the profiles will get a little bit mixed, so we'd need to see that, okay, this version was released at this time, then filter on that time span and say: these are the profiles for the new version that was released after this time. It's a little bit trickier — it's doable, but...
F
Yeah, we can just filter to after a deploy and see the time spent there, so we make sure that just for this space of time we have the profiles for that release, for instance. But yeah, the version would make this a lot easier, and it should be just a single-line change: we already pass the version to labkit when initializing the profiler in the monitoring package, so it just needs to use that version, basically.
C
It's the canonical thing, and we use it in Prometheus, you know, so even if it's wrong, it's going to be the same as Prometheus; we should rather use the same observability values. If you just go back to the front page — take off those extra filters that you've got there — and go back to the CPU time.
C
That whole stack over there — actually, I'm pretty certain that could just be removed, because almost all round trippers are singletons, and so the fact that we're creating all of these, all the way down to a big wasteful syscall there... And if you go back up, step out of that — obviously the vertical, sorry, the horizontal length is the percentage of time that we're spending in that. So that's...
C
And yeah, ten percent. So I bet this is a really easy win: there's almost never any reason to be creating a round tripper on every request, and I bet you that's what's happening there. So if all of this goes away, that makes Workhorse ten percent more efficient.
A
But the problem here is: if we have it for GitLab staging, that means we also have it for GitLab production, which is going to be far more useful than staging. And even if we have those groups that you are mentioning, Andrew, that means that everyone will have to go through access requests, and no, no...
D
It'll have to be an entitlement, and it's going to have to be part of onboarding and offboarding, I think, to add people to this group. That's the only way we're going to do it. But I think it might also be useful for log access, because I have a feeling restricting access to production logs is going to be coming soon, and we're going to need to use a Google group for that as well.
A
There is one other thing about this topic, sorry. Can we make sure that we have things working for Workhorse first before we continue expanding further? So make sure that versions are working, make sure that people have access on staging and on production, before we invest any additional work into expanding this to other fleets.
A
Well, it will be useful. I question whether it's going to be useful if we cannot figure out access — if we have to continuously go back and forth between people wanting access and us telling them, well, you need to go to access requests, without any proper form. I would like to have that handled before we invest further work into it.
F
The idea is moving sidekiq-cluster closer to core, and that's a big topic. We are planning to use sidekiq-cluster as a single source of truth, using it in Omnibus and source installations everywhere possible. We're already using it on GitLab.com, so it would be interesting to spread usage even to the GDK. That's the step I'm on right now, so I'll just quickly show how it is working right now.
F
Everyone can see my screen? So this uses the work that was done to pass a wildcard to sidekiq-cluster and then run all queues. We are using the same script that we use in production right here, behind the scenes, and if you run it passing the sidekiq-workers setting, it will try to run these background jobs as sidekiq-cluster processes — we're still figuring out the best name for this — and we are trying to simplify the interface, making it the same interface for the old script that runs Sidekiq and the new script.
F
The background-jobs script is being used to stop, start, and restart, and a supervisor is being used there, so we still need to use and maintain this. But the good thing is that, for instance, we are parsing out these settings, and if you pass the sidekiq-workers setting — say one or two — behind the scenes it's going to use sidekiq-cluster, so yeah, starting two processes. So that's the command we're going to recommend for the GDK.
F
No, I was just going to say that the other commands just work the same, basically. When passing the sidekiq-workers setting, this defaults to the new sidekiq-cluster supervisor, let's say, and if we are not passing it, this basically runs the old script that runs plain Sidekiq.
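The switching behaviour described here boils down to something like the following rough sketch; this is not the actual GDK code, and the environment-variable and script names are made up:

```ruby
# Hypothetical sketch: a workers setting switches between the new
# sidekiq-cluster supervisor and the old plain-Sidekiq script. Each '*'
# queue-group argument gives sidekiq-cluster one process running all queues.
workers = ENV['SIDEKIQ_WORKERS'].to_i

if workers.positive?
  exec('bin/sidekiq-cluster', *(['*'] * workers))
else
  exec('bin/background_jobs_plain') # hypothetical name for the old script
end
```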
E
Yes, what I was going to say was: we did have something that Mattias had an issue with. It shouldn't be an issue for anybody actually using this, but it's worth calling out if you're starting sidekiq-cluster manually. A lot of the new queue syntax — including the thing that means "run all queues", which is star — is already special characters in the shell.
E
So you have to quote the arguments. If you run sidekiq-cluster star and you're running, say, bash, it's going to glob that star into every file in the current directory, which is not what you want; you need to quote it. The GDK script and Omnibus already quote those correctly — I'm just calling it out in case anybody's playing with this, because it happened to Mattias when he was trying it out, and it's not an obvious failure mode.
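When building such a command from Ruby, the safe approach is to escape the queue selector, as in this small sketch:

```ruby
require 'shellwords'

# An unquoted * is globbed by the shell, so `sidekiq-cluster *` expands to
# every file in the current directory. Escaping keeps the literal star.
queues = ['*']
command = "bin/sidekiq-cluster #{queues.map { |q| Shellwords.escape(q) }.join(' ')}"
puts command # => bin/sidekiq-cluster \*
```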
F
For instance, if we just remove that and start it in the foreground, it's going to use plain Sidekiq — correct, stop it — cool, yeah. It's using two different scripts. That's basically the same strategy that we used for rolling out Puma: we have the same idea, with web for Unicorn, web-puma for Puma, and a single web script falling back to those two different things to manage.
E
So the stuff I've got to demo is not only in progress but in some cases broken, because I've had a week of struggling with various things. One of them actually just got fixed during this call, so that's good — that reduces my to-do list for the rest of the day. So, first thing: we already had a histogram for CPU time in Prometheus.
E
We just didn't actually display it anywhere, so now we do. I added ones for database and Gitaly time too, but this is for the post-receive queue, and you can probably notice that this isn't right: the post-receive queue is doing git operations, so the p50, p90, p95, and p99 should not all be the same all the time. I need to look into how that's recording wrongly, but it's not a bug in the dashboard.
E
...which gets rid of everything we have in the InfluxDB one except method call timings — and do we even have the InfluxDB one anymore? Yeah. So the method timings actually can be really useful on the InfluxDB one, and the only thing about them is, I'm quite curious what happens if we turn off the method call instrumentation. Does it go twice as fast? Well, yeah.
E
This is the thing, right: it's very useful to know that expire branch cache took 19.79 seconds in the slowest case, because obviously that's slow. But there is an overhead to instrumenting this, and if we're not really supposed to be using these, it would be interesting to switch them off. But I'm not touching that right now; I'm just mentioning it as a side note. And yeah, I don't know — I use this; I don't know if anybody else uses this. Okay, I might add a warning to that.
C
Yeah, I would be happy turning this off and seeing how it goes.
E
Yeah, I don't want to do that right now, but anyway. So the other thing I did relating to this stuff was a change to the latency-sensitive attribute: instead of a latency_sensitive label, we now use an urgency label. Now we have urgency: high, where we used to have latency_sensitive: yes/no, but we realized we needed more than two options, so I changed this. And the problem was, when I changed it...
E
We knew that we'd need to tell the on-call SRE when it happened, because we knew it would trigger some alerts, but I don't think we really thought through that it would affect the overall SLAs as well, because it feeds up into those. So here's an example of what happens: this first query is using latency-sensitive, and it's the apdex for the latency-sensitive job completion, and then this second query is the same but for urgency.
E
So what we had in production was one of these two queries at a time, and you can see the problem: we have this one, but then we lose data, and we have this one, but we don't have data for the past. At this point it was recording for the last hour — I think Andrew's changed that in the meantime — but the problem is, I haven't...
E
So what we actually want — and we already have this for other metrics, just not for apdex — is a combination. This is quite a long query, but it's not that complicated. The combination is just this — oops, clicked the wrong line — it's just this line here. Okay, that looks spiky, but that's because the scale's changed; basically you can see that it's pretty flat compared to the other two and, more importantly, it touches both ends of the chart, which neither of the other two does — which is the big issue there.
E
So, having got some help from Jarv to build a new version of the images so that I can use the current version of jsonnet to build this, it's not that complicated now to add a combined apdex query — that's actually adding the possibility to do it. So this is what we could, slash, should have done at the time: instead of changing this from using latency_sensitive: yes, we call "combined", and we have both urgency: high and latency_sensitive: yes in there. And then, once that's been in production for a sufficient amount of time...
E
We uncombine it and just put it back to this. But there's no particular urgency on uncombining it, because, you know, the combination of a thing that exists and a thing that doesn't exist should be the same as just the thing that exists anyway. So it's mostly just for clarity that we'd uncombine it, but it does reduce the dependency on having a person there telling the SRE on call about this, and it also means that it won't affect any SLIs or SLAs because of it. The other option...
E
This repo is easier to get stuff merged into — this repo deploys immediately, as opposed to, you know, having to get held up for a security release or whatever; it doesn't affect customers. We'll also be getting the extra labels if we emit them on the application side too. So this is definitely the better way forward. So yeah, that's part of the fix for that. The second part of the fix is: I'm going to add Danger to this repository to...
E
...warn if you change something like this. Often, when you change these, you're intending to change them — even in the MR where I changed this, I was intending to change it — but if we just have a warning or a heads-up like that, it might help. We still need to figure out exactly what format that takes, but yeah, just having Danger there will also help us out with other things if we want to. So yes, please.
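Dangerfiles are plain Ruby, so the warning could be as simple as the sketch below; the path pattern and message are placeholders, since the exact format is still to be decided, as noted above:

```ruby
# Hypothetical Dangerfile rule: warn when an MR touches files that define
# recorded metric labels, since renaming a label breaks continuity for
# alerts and SLA dashboards.
touched = git.modified_files.select { |path| path.include?('recordings') }

unless touched.empty?
  warn 'This change touches recorded metric labels. Renaming a label loses ' \
       'history for alerts and SLAs; consider a combined query during the ' \
       'transition.'
end
```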
A
This is — honestly, this is so important; this is not sidetracking. It's really important for our sanity and health. So if you're adding reviewer roulette, that's awesome — add Danger. And if you add that to some other projects that we have, like the chef-repo and so on, that would be a bonus for everyone in the department. So we're optimizing here for the department, not only for us, yeah.
B
I was wondering a little bit — I don't know if that's really something worth discussing — but regarding the whole dropping-duplicate-jobs topic: we can now see duplicate jobs in the logs. Sorry about the snake_case naming there. Soon we will be able to drop duplicate jobs, and I've already marked the authorized projects worker as idempotent, but there's some work still to be done to make sure that we have enough visibility into that, yeah.
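Marking a worker idempotent is a one-line annotation; here is a hedged sketch. The worker is a hypothetical stand-in, and the `idempotent!` attribute follows GitLab's worker-attributes DSL, so the snippet assumes the GitLab Rails codebase:

```ruby
# With `idempotent!`, scheduling the same job twice with the same arguments
# can safely be collapsed into a single execution.
class ExampleAuthorizedProjectsWorker # hypothetical stand-in
  include ApplicationWorker

  idempotent!

  def perform(user_id)
    refresh_authorizations_for(user_id) # safe to run more than once
  end
end
```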
E
My concern with that is: it is very useful for the authorized projects worker, but we also want to be very cautious about changing that, because, you know, we had a bunch of issues in the past where permissions weren't recomputed correctly, and we currently don't have a good way to recompute permissions for everybody. Stan created an issue for that, and it's in the GitLab epic that he created.
E
Sorry — some authorized projects jobs have two arguments. The second argument is a unique key for a web request that's waiting on it. So if the second argument is different, we can't deduplicate it. But we are not going to add that key when you have, say, 10,000 jobs to schedule, like we sometimes do, because the web request will never — the web request would wait.
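That constraint follows from deduplication comparing the full argument list; a small illustration with hypothetical job hashes:

```ruby
# Jobs only count as duplicates when class and *all* arguments match, so a
# per-request unique key in the arguments defeats deduplication.
job_a = { 'class' => 'AuthorizedProjectsWorker', 'args' => [42] }
job_b = { 'class' => 'AuthorizedProjectsWorker', 'args' => [42, 'req-abc123'] }

duplicate = job_a['class'] == job_b['class'] && job_a['args'] == job_b['args']
# => false: the unique key makes these distinct, so both jobs run.
```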
A
I did just want to share — actually, no, let's end on a high note. Hope everyone has a good weekend; thanks for — hope you have a good weekend, thanks for another great week. A lot of great work here, and, I have to say, it's very rewarding seeing some of these things finally reach production. So thanks to Craig as well for starting to roll this out.