From YouTube: 2017-MAR-08 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
B: All right, let's start with pull requests. Let's see, there's one that changes the RocksDB interface to use the RocksDB range-delete operator, which would help BlueStore in a number of ways. There are lots of places where we have to delete a bunch of keys; we enumerate them and range-delete them, and apparently RocksDB has a sort of remove-range primitive. But there's a big fat warning in the header about how it's slower in some cases.
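For reference, a minimal sketch of what the range-delete path looks like against the RocksDB API (the key names are illustrative, not BlueStore's actual schema):

    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>

    // Instead of enumerating keys and issuing one Delete per key, write a
    // single range tombstone covering [begin, end).
    rocksdb::Status delete_range(rocksdb::DB* db,
                                 const std::string& begin,
                                 const std::string& end) {
      rocksdb::WriteBatch batch;
      batch.DeleteRange(db->DefaultColumnFamily(), begin, end);
      return db->Write(rocksdb::WriteOptions(), &batch);
    }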
B: Some of those have gone in, and the other ones will probably follow soon; there's just some cleanup left there. That's good; that's all for BlueStore. Also related to BlueStore, there was a pull request that enables processor-accelerated CRC32 for RocksDB, which wasn't getting enabled in the build. Ceph's own CRC is optimized, but RocksDB's wasn't using the best Intel instructions. That's fixed now, which speeds up the RocksDB stuff quite a bit, apparently. So that's also good, for reference.
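For reference, the hardware CRC32C on x86 comes from the SSE 4.2 crc32 instruction; a sketch of the kind of loop the accelerated path uses (illustrative, not RocksDB's actual implementation):

    #include <nmmintrin.h>  // SSE 4.2 intrinsics; compile with -msse4.2
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    uint32_t crc32c_hw(uint32_t crc, const uint8_t* data, size_t len) {
      uint64_t c = crc;
      while (len >= 8) {             // 8 bytes per instruction
        uint64_t word;
        std::memcpy(&word, data, 8);
        c = _mm_crc32_u64(c, word);
        data += 8; len -= 8;
      }
      while (len--)                  // byte-at-a-time tail
        c = _mm_crc32_u8(static_cast<uint32_t>(c), *data++);
      return static_cast<uint32_t>(c);
    }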
D: Speaking of that, I guess there is another possible reason why someone could see such unfair results: the fast CRC can fall back to the slow CRC, which is strikingly slow in comparison. The fast function is basically a little dispatch function that calls the slow one on some paths, and I'm worried about whether the compiler is inlining it. That could be the reason.
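A sketch of that inlining concern, with illustrative names: if the thin dispatcher below is not inlined at call sites, every call pays an extra indirection even on the fast path.

    #include <cstddef>
    #include <cstdint>

    uint32_t crc32c_hw(uint32_t c, const uint8_t* p, size_t n);  // fast path
    uint32_t crc32c_sw(uint32_t c, const uint8_t* p, size_t n);  // slow path
    bool cpu_has_crc32();                                        // CPUID probe

    // If this wrapper fails to inline, the function-call overhead is paid
    // on every invocation regardless of which path is taken.
    inline uint32_t crc32c(uint32_t c, const uint8_t* p, size_t n) {
      static const bool fast = cpu_has_crc32();   // probed once
      return fast ? crc32c_hw(c, p, n) : crc32c_sw(c, p, n);
    }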
D: At the moment the bitmap allocator leaf aggregates bitmap zones indirectly, using std::vector. I think that's a tricky part that could potentially disable, or at least impede, the functioning of the CPU prefetcher. To optimize that, I want to embed the bitmap zones directly in the array, and after that put them on huge pages to relieve the TLB. That's where the other part will go.
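A minimal sketch of the layout change being described, with illustrative types: one contiguous, hugepage-backed block for all zone bitmaps instead of per-zone std::vector indirection.

    #include <sys/mman.h>   // madvise, MADV_HUGEPAGE (Linux)
    #include <cstdint>
    #include <cstdlib>

    struct BitmapArea {
      uint64_t* zones;   // all zone bitmaps embedded in one flat block
      size_t    nwords;
    };

    BitmapArea make_area(size_t nwords) {
      const size_t huge = 2u * 1024 * 1024;   // 2 MiB huge-page size
      size_t bytes = ((nwords * sizeof(uint64_t)) + huge - 1) / huge * huge;
      void* p = std::aligned_alloc(huge, bytes);
      if (!p) return BitmapArea{nullptr, 0};
      madvise(p, bytes, MADV_HUGEPAGE);   // fewer TLB misses on hot scans
      return BitmapArea{static_cast<uint64_t*>(p), nwords};
    }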
B: Awesome, sounds great. Okay.
B: The other thing that I've noticed is that the bitmap allocator is just slow to start, because it has to initialize all of its zones, touching every bit in them. When you run it on a small dev setup it's not that big a deal, but when you're running it on an actual disk that's, you know, five or ten terabytes, it takes many seconds to actually start up and initialize everything. So that path could probably be optimized too, I'm guessing.
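One way that startup path could be trimmed, as a hedged sketch rather than what the allocator actually does, is to initialize zone bitmaps lazily on first use instead of touching every bit up front:

    #include <cstdint>
    #include <vector>

    struct Zone {
      std::vector<uint64_t> bits;   // empty until the zone is first touched
      void ensure_init(size_t nwords) {
        if (bits.empty())
          bits.assign(nwords, 0);   // pay the cost on first use, not startup
      }
    };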
B: Awesome, yep; good to see that getting some attention. Let's see, other stuff: the fast-CRC stuff was merged a while ago, actually that might have been before last week, so that's good; it seems to be working fine. We didn't actually measure much of a difference, but that code path is so much simpler and cleaner that I'd have a hard time imagining it isn't helping. And let's see, there was also a pull request trying to optimize the allocator's reserve and release map paths.
B: What else? There are some other already-made changes trickling in. Let's see, there's a new pull request that adds an interface so you can explicitly control the recovery priority for PGs on a per-PG basis. This was sort of meant as a last-resort interface, in case the automated recovery prioritization isn't doing what it's supposed to do or something, and it gives a little bit more control over the system. So that will likely go in.
B: Then there are the changes, the experiment, that aren't ready to merge, but they would basically batch up those deferred writes until you have a bunch of them and then submit them all in parallel, and that quadrupled performance on the spinning disk. So that's pretty encouraging. But in the process of doing that I was cleaning up a bunch of other things in BlueStore, and realized that there are some sort of deeper issues with the deferred-write WAL stuff.
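A hedged sketch of the batching experiment as described (the names and threshold are illustrative, not BlueStore's actual code):

    #include <unistd.h>   // pwrite
    #include <cstdint>
    #include <vector>

    struct DeferredWrite { uint64_t offset; std::vector<uint8_t> data; };

    class DeferredQueue {
      std::vector<DeferredWrite> pending_;
      size_t batch_size_;
      int fd_;
     public:
      DeferredQueue(int fd, size_t batch_size)
          : batch_size_(batch_size), fd_(fd) {}
      void queue(DeferredWrite w) {
        pending_.push_back(std::move(w));
        if (pending_.size() >= batch_size_)  // wait until we have a bunch
          flush();
      }
      void flush() {
        // Issued back to back, neighboring writes can be merged by the IO
        // scheduler and serviced in one pass of the disk arm.
        for (const auto& w : pending_)
          pwrite(fd_, w.data.data(), w.data.size(),
                 static_cast<off_t>(w.offset));
        pending_.clear();
      }
    };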
B: I don't know, maybe. I also tested it with the journal on an SSD and using the disk for data, and saw a corresponding speedup there with just the change I have now; it's doing more like 400 IOPS, where you're basically doing the journal on SSD and all the data writes are going to the disk.
B: I was running writes to separate objects, so they're all going to the journal, and then asynchronously they're being written to the disk and getting batched up; basically the IO scheduler is saying: oh, these are all writes to adjacent blocks, I'm just going to do one IO, or they're in the same track, or whatever.

A: How big were the writes?
B: This was just rados bench with 4k writes, so they're effectively sequential ones, actually; so it's sort of a best case in that sense. If you're doing random 4k overwrites it'll be a little bit worse, but even so you'll see it'll be better, because the deeper the queue, the more the disk arm is going to be able to get to.
B: Yeah, yep, all right. But I think the main thing that we can do there is basically, whenever we have those deferred writes, if we just batch them up a little bit on a spinning disk it will help, because the disk arm can stay by the journal, and then when it goes and does all the other stuff it can do it all at once instead of going back and forth. Yeah, that seems to make a big difference.
B: It will take a little bit of tuning to figure out what the right batch size is and when to trigger a batch flush and all that stuff. But when I rewrite the deferred-writes code I'm going to make sure I structure it so that we can have sort of flexible batching, so you can either flush the batch immediately, or batch things up indefinitely, or whatever.
B
The
code
will
be
structured
so
that
to
enable
that
sort
of
policy
so
yeah
that
belt
thats
come
later
a
little
bit
later,
that's
yeah,
oh
I'm,
planning
to
spend
the
rest
of
the
week,
hopefully
dealing
with
the
deferred
right
stuff,
the
store.
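A sketch of what that flexible batching could look like as a pluggable policy (the knobs here are purely illustrative):

    #include <cstddef>
    #include <cstdint>

    enum class FlushPolicy { Immediate, SizeThreshold, Deadline };

    // Decide whether the deferred-write batch should be flushed now.
    bool should_flush(FlushPolicy p, size_t queued, uint64_t age_ms) {
      switch (p) {
        case FlushPolicy::Immediate:     return true;
        case FlushPolicy::SizeThreshold: return queued >= 16;  // tunable
        case FlushPolicy::Deadline:      return age_ms >= 50;  // tunable
      }
      return true;
    }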
B: My guess is that it's because the current form has a loop in it, and usually the compiler won't inline functions with loops. There was a pull request forever ago that just unrolled the loop, basically, into a bunch of conditionals. It might be that doing something like that will help. It never actually merged; it's still open, and if you look at the pull requests tagged for BlueStore it's one of the oldest ones. Okay.
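For illustration, the unrolling trick being referred to: compilers are often reluctant to inline a function containing a loop, so the loop is replaced with a short chain of conditionals (a toy example, not the actual patch):

    #include <cstddef>
    #include <cstdint>

    // Loop form: many compilers will decline to inline this at call sites.
    inline uint32_t mix_loop(const uint8_t* p, size_t n) {
      uint32_t h = 0;
      for (size_t i = 0; i < n; ++i) h = h * 31 + p[i];
      return h;
    }

    // Unrolled form for small n: straight-line conditionals, which inline
    // cleanly when callers pass a small constant length.
    inline uint32_t mix_unrolled4(const uint8_t* p, size_t n) {
      uint32_t h = 0;
      if (n > 0) h = h * 31 + p[0];
      if (n > 1) h = h * 31 + p[1];
      if (n > 2) h = h * 31 + p[2];
      if (n > 3) h = h * 31 + p[3];
      return h;
    }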
D: I think that if we eradicate those misses, we'll be able to move on to another bottleneck. I guess we'll be submitting IO requests one by one, in other words one per call, basically, because we don't support batching in the block device, right? On the other hand, we are already working on that: we have a preliminary, work-in-progress pull request that brings batching across the whole BlueStore pipeline.
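For reference, batching at the libaio layer means one io_submit() call carries many iocbs instead of one apiece; a minimal sketch:

    #include <libaio.h>
    #include <vector>

    // Submit a pre-filled set of iocbs (e.g. prepared with io_prep_pwrite)
    // in a single syscall rather than one io_submit per IO.
    int submit_batch(io_context_t ctx, std::vector<iocb>& ios) {
      std::vector<iocb*> ptrs;
      ptrs.reserve(ios.size());
      for (auto& io : ios) ptrs.push_back(&io);
      return io_submit(ctx, static_cast<long>(ptrs.size()), ptrs.data());
    }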
B: It's all that sort of overhead work that's been killing us. But that's good news, that's good news! Okay, well, I'm delighted to see that you're doing this detailed profiling; this is awesome, finding all kinds of good stuff. Oh nice. Thank you.
D: Hello; once again my machine just crashed, sorry, so I'm using Adam's computer to run the meeting. So, it looks like the BlueStore data structures, I mean especially the extent map, blobs, etc., are scattered all over memory. They are linked together using boost intrusive pointers. By the way, in a development build (not in a release one, but definitely in development builds) we have an assertion in the increment operator of the boost intrusive pointer. One assert is not so big a deal.
D: However, we have millions to billions of places where we use it. That's one problem, but I guess it's quite easy to deal with. The second, and the worst case, is that we have data structures spread all over memory. I'm working on that; for example, I started using boost object_pool to have the extent maps placed in a contiguous region of memory. It's already partly done, and I hope I will publish it soon.
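A minimal sketch of the object_pool approach described (the Extent type is illustrative): nodes come from the pool's contiguous chunks instead of individual heap allocations.

    #include <boost/pool/object_pool.hpp>
    #include <cstdint>

    struct Extent { uint64_t logical_off, blob_off, length; };

    int main() {
      boost::object_pool<Extent> pool;   // allocates in contiguous chunks
      Extent* e = pool.construct(Extent{0, 0, 4096});
      pool.destroy(e);   // or let the pool release everything at once
    }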
B: The cache ref? Right, the write context... that's gone; there used to be a reference in the transaction context, but that's been removed, so I think we're good, yeah. It's really just the onode and the extent, so you could switch this to, like, a unique_ptr or something, or a raw pointer. Yeah, I think that would be fine.
B: We're still waiting on that, so there are two possibilities. One is that those reads are from compaction itself, where it's sequentially scanning, or supposedly sequentially scanning, SST files in order to generate a new SST; those IOs should be sequential and hopefully not too painful on the device. Maybe we're not doing readahead; I guess that's one possibility, and if so we could fix it. That's sort of scenario A. Scenario B is that it's the cache invalidation that happens in RocksDB that results in a bunch of cache misses post-compaction, and then these are random read IOs to the new SSTs, which is what this write buffer would sort of alleviate. So, yeah, I guess we're still waiting to find out whether it's going to help or not. It might be, Mark, that we need to just reproduce this workload and instrument it carefully, end to end.
B: That's all good. Oh, it might be worth asking: does anybody else have anything they want to talk about? I have one other thing to mention. We have a periodic call with some folks at UC Santa Cruz who are working on tiering, sort of explicit tiering. They want to build an application that knows what data is going to be accessed ahead of time, so it can stage certain columns on SSD before their computation runs. As a result of that, they're talking about extending the RADOS interface so that you can pass an explicit hint that says: this omap key, or this prefix for omap keys, should be on the fast tier or the slow tier. And then there's the question of how BlueStore would actually do that, where RocksDB or the OSD would have to put those on a different device. They have a rough plan for how to do it, but right now it's sort of pre-research.
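To make the shape of the idea concrete (and to be clear, nothing like this exists in librados; every name below is hypothetical), the hint might boil down to a per-prefix tier registry along these lines:

    #include <map>
    #include <string>

    enum class Tier { Fast, Slow };

    // Hypothetical registry: remember which omap key prefixes were hinted
    // hot, so a backend could place them on the fast device.
    class TierHints {
      std::map<std::string, Tier> by_prefix_;
     public:
      void set(const std::string& prefix, Tier t) { by_prefix_[prefix] = t; }
      Tier lookup(const std::string& key) const {
        auto it = by_prefix_.upper_bound(key);
        while (it != by_prefix_.begin()) {
          --it;
          if (key.compare(0, it->first.size(), it->first) == 0)
            return it->second;
        }
        return Tier::Slow;   // default: cold tier
      }
    };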
B: I think it's, yeah, I think it's sort of a wrong assumption, because there's a tendency now to conflate APIs with use cases, and therefore with the storage back end; so, like, object is assumed to be big and slow, and block is small and fast, or whatever. But that's not really true, and as the APIs become more broadly adopted they'll be used, and abused, in different ways. So you can imagine having an object store that has zillions of objects that are super cold, or object stores that are very high performance where everything would be on solid state, contrary to the assumption that some things are slow and some things are fast.
B: You don't necessarily need all those indexes on SSD, though that's sort of a limited example, because the index is going to be a minority of the data set. In their particular use case, most of their data would be in omaps, because it's a database, and they would have column groups effectively stored in an omap piece, yeah. The whole point of this is that they would take certain columns; so they might have, like, a hundred-dimensional, or hundred-column, data set.
B: So that's half of it, the omap pieces; the other half would be doing the same thing on the object, so you could say: this range of the object I know is hot, and I want to hint that it be staged on SSD rather than HDD. And the mechanism we're looking at for doing that is actually just making it so that the IOs BlueStore generates through the libaio interface can be tagged; we want a kernel interface to tag individual AIOs as hot or cold.
B: That requires some kernel changes, because the AIO interface doesn't quite do it, and the caching layers aren't looking at per-IO hints; they're only looking at per-process hints. So there are some pieces that have to be done, but once all that infrastructure is in place, we'll have this general mechanism, and we can have BlueStore specifically hinting that this should go on flash and this should not, which would be useful in sort of all of the cases involving any hybrid array.
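For context, the per-IO tagging discussed here did need kernel work at the time; mainline Linux later (4.13, mid-2017) added per-file write-lifetime hints, which is the closest existing mechanism. A hedged sketch of its use:

    #include <fcntl.h>   // fcntl()
    #include <cstdint>

    #ifndef F_SET_RW_HINT          // from linux/fcntl.h, kernels >= 4.13
    #define F_SET_RW_HINT 1036
    #endif
    #ifndef RWH_WRITE_LIFE_SHORT
    #define RWH_WRITE_LIFE_SHORT 2
    #endif

    // Tag everything written through fd as short-lived ("hot") data.
    // Note: this is per-inode; the per-IO tagging discussed above would
    // still need further kernel work.
    int tag_hot(int fd) {
      uint64_t hint = RWH_WRITE_LIFE_SHORT;
      return fcntl(fd, F_SET_RW_HINT, &hint);
    }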
B: This might actually be of interest to the Intel team, because their CAS caching product, as far as I know, has some pretty sophisticated hinting that they sort of kludged together, and I think it basically implements the hinting functionality that we need, except for the AIO interface gap. But that's what we want: to basically get that hint through.
A: Those are the kind of experimental things I'm curious about. I've wondered for a while whether it would make sense to have some kind of OSD that could do some kind of smart distribution of data across different disks. So, instead of using CRUSH to go all the way down to the disk level, you use it to go to a specific OSD, but then that OSD makes more real-time choices about which disks are busy or not.
B: Yep, even pretty simple scenarios, like a pair of disks or three disks where you chain them together and use one of them for, like, sequential journaling and one of them for the random writes; so you're just doing some fairly basic heuristics about directing different types of IO to different devices, and getting pretty substantial wins, yeah. So this...
B: There's... sorry, go ahead.
B: I think there's something to be said for having local redundancy as well as the sort of multi- or inter-OSD redundancy, yep. So you can do local repair and recovery, and that's a win, yeah. And you can even make the failure domain sort of reasonable too: you could imagine an OSD that has multiple disks, where it tries to keep PGs isolated on a single disk; so if you have a spindle failure you only lose certain PGs, and the others are still intact. You could do things like that too.
B: I don't want to think about it yet, okay? But it could certainly be done there. I'm a little bit hesitant to pile too much into the same layer, so if we can stack on top of, you know, a DM block target that does this, and have a rich enough set of hints that we can communicate what we need to communicate, that would be preferable, I think, because otherwise you're creating an increasingly complex monolith, yeah.
B: So you get a continually near-perfect distribution of placement groups across those OSDs without very much overhead. We're hoping to get the infrastructure for that at least in place in Luminous, even if we don't actually have all the parts to actually generate the distribution; but maybe we can have some things that will let them do it, and then we'll see. But the variance between OSDs is a big source of performance loss, because you just don't have an even distribution, and so you're underutilizing some of them.
E: Sorry, I didn't quite hear that part about the OSD distribution function. Is there something written down about it, or...?
B: Okay; there's a branch that implements sort of some of it, but it's not complete yet. The nice thing is, you know, typically people increase the number of placement groups to even out the distribution, which means that having a sort of explicit mapping is expensive, because you have a lot of PGs to enumerate. But if you have this mechanism you actually don't need to do that; you can get by with a much smaller number of PGs, in which case the exception table can be quite small.
A: Hey, regarding that: where would you see a layer living that would try to automatically re-create exceptions when the topology changes? Where would that live?
B: The way the remapping is structured, you can make those sort of explicit mappings conditional on the existing mapping, so you basically say that for PG X, remap OSD 1 to OSD 4, because I know that 4 is underutilized and 1 is overutilized. And if that PG's mapping changes, because there's a big topology change and CRUSH maps it differently, then the exception doesn't apply anymore and is just ignored: if the PG no longer maps to OSD 1, it does nothing.
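A toy sketch of the exception semantics just described (illustrative, not the actual OSDMap code): the remap applies only while CRUSH's result still contains the source OSD, and silently does nothing otherwise.

    #include <algorithm>
    #include <vector>

    struct Remap { int from_osd, to_osd; };

    std::vector<int> apply_exception(std::vector<int> crush_up, Remap r) {
      auto it = std::find(crush_up.begin(), crush_up.end(), r.from_osd);
      if (it != crush_up.end())
        *it = r.to_osd;    // exception still valid: swap 1 -> 4, say
      return crush_up;     // otherwise the raw CRUSH mapping stands
    }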
B: So it sort of gracefully reverts to whatever CRUSH is doing, which to a first approximation should be pretty good, yeah. So I think it'll be, I think it'll be okay. Okay, well, we'll find out, I guess. We could also do this sort of synchronously in the monitor, but that's not going to scale very well to large clusters, so we'll have to be careful.
B: Like we already do with priming pg_temp, so you could do a similar thing here, okay, if we did that. And it would kind of live in the monitor; but whatever the code is that does it, I think it's fine to factor it out and make it callable, so that it can be run wherever appropriate. We don't want to, I don't think we should, limit ourselves to doing it in the monitor, because for large clusters that might not be feasible.
B: But the thing is, the concern is that it might be a computationally expensive thing to do that analysis, yeah. So it's not so much where, I guess; it's less where and more when, and doing it synchronously in the monitor during an OSDMap update might not be feasible, yeah. Maybe if the PG counts are small it might be fine, but...