From YouTube: 2020-01-23 :: Ceph Performance Meeting
A: All right, let's get going. The people from core will show up soon-ish. All right, new PRs this week. As a result of our meeting last week, a couple of changes landed for PG auto-scaling: one from Neha to change the default pg_num, teaching it to default to 32, and the min as well.
A: Let's see, only one performance PR closed this week that I could see: just a doc change for osd min PG log entries. That's another one that came out of discussion from last week.
A couple of PRs got updated; one got reviewed for splitting read I/O if the I/O size is large on an NBD device. I'm not totally sure what the effect of that will be, but theoretically, hopefully, they've tested it well.
A: Let's see. Adam's PR originally failed tests, but it looks like that may have been due to something else not related to his PR, so it's back in testing again; looking forward to seeing the results of that. Hopefully it means we can merge soon. The RGW filtering logic in cls that Eric added looks like it did pass QA but needs a rebase now, so that will need to be rebased. I don't know if it will need to go back through testing again or not, but that's good.
A: Related to that, hopefully we can figure out if we can move some of the things that we need to filter on out of the data structure there, so that we don't actually have to decode the entire thing in the cls method, but instead can just decode whatever we're filtering on, and kind of reduce some of that overhead. That was pre-existing prior to this PR, so it's not like it added anything new.
A: That PR got updated and rebased, it looks like, and then Adam's other BlueStore work got some new changes based on a short review that we did either last week or the week before. I think he's working on simplifying a couple of things there before we merge that. So that was about it for this week so far. Oh, and perfect timing: it looks like some of the people from core are trying to show up here.
B: In Octopus we currently have min_alloc_size set to 4K, and this works around the issue, but at the same time we are still thinking about bringing this back to 16K or even 64K in a backported version, since large allocation sizes are a benefit from a performance point of view, and hence we need to pay some attention to this space amplification issue.
B: For some use patterns you'd have multiple allocations, perhaps, per 64K blob, and hence allocations spread across the space; space amplification might be pretty high. So I provided a simple and straightforward scenario to reproduce this. It's quite synthetic, but more realistic scenarios can suffer from the same issue as well: like, if you fill an RBD image and then partially rewrite some 64K extent, then on each, maybe not on each, but on each 4K piece of the write, which takes its own 64K range, you'll get a new allocation without the previous extent being released.
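The blow-up being described can be sketched with a toy model (the helper name and the worst-case accounting here are illustrative, not BlueStore's actual logic):

```python
# Toy model of the scenario above: an RBD image is pre-filled, then 4K
# pieces of one 64K extent are overwritten one at a time. With a 64K
# allocation unit, each small overwrite can land in its own fresh 64K
# allocation while the original extent is not yet released, so allocated
# space balloons well past the 64K of logical data.

def space_amplification(extent_kib=64, alloc_unit_kib=64, write_kib=4):
    """Worst case: every small overwrite takes a new alloc-unit-sized
    allocation and the original extent stays pinned."""
    n_writes = extent_kib // write_kib                   # 16 overwrites of 4K
    allocated = extent_kib + n_writes * alloc_unit_kib   # old extent + new blobs
    return allocated / extent_kib                        # vs 64K of user data

# 64K units: 16 partial overwrites can pin roughly 17x the logical space.
# 4K units: each overwrite allocates only what it needs.
```

With a 4K unit the same sequence pins about 2x (the old extent plus exactly the rewritten bytes), which is the work-around currently in master.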
D: The only question is, like, what do we do with a spinner? So, you see, if it's a spinner and replicated, that's the only question, right? And there, it really seems like this is a pretty narrow case, because if you get a big write, then we're going to do a big allocation, right? There's nothing really to do. I think the only question is: what do we do if we get a small write on a hard-disk replicated pool?
B: Because we have some... I recall the unused bitmap probably depends on that, and maybe something else. So if you look at this PR with flexible min_alloc_size selection: I need to persist which value applies to this object, and I need to provide the min_alloc_size to a number of functions. Yes.
D: Well, I guess I'm not thinking about this in terms of min_alloc_size, because min_alloc_size by definition is the allocation unit, and that is always 4K, right? It's not so much min_alloc_size; it's how big we allocate our blobs. How big do we allocate: that's actually the question, not min_alloc_size. So if we get a large write, for example, then we'll do a large allocation.
D: The question is: how big do we allocate for a small write? By default, the current behavior is that we always allocate the smallest multiple of min_alloc_size that's sufficient for the new write. That's the current behavior, and that's basically the crux of it; that's the thing that we might need to change. Because if you have, let's say you pre-fill an RBD image, so it has a bunch of big allocations, and then a small write comes along.
D: It's going to write into that same allocation; it's going to do an overwrite, so there's no allocation anyway, and it's fine. So the only question is: what happens if we get a small random I/O in a part of the image that hasn't been written yet? Do we do a small allocation, or do we do a large allocation, write only a small piece of it, and mark the rest as unused?
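The two policies under debate can be written down as a small sketch (names and constants are illustrative; this is not BlueStore's API):

```python
# Sketch of the two sizing policies for a small write landing in a
# not-yet-written region of an object.

ALLOC_UNIT = 4 * 1024    # min_alloc_size: the allocation granularity
BLOB_TARGET = 64 * 1024  # hypothetical larger blob size for HDD/replicated

def alloc_size(write_len, policy):
    if policy == "smallest":
        # Current behavior: smallest multiple of min_alloc_size that fits.
        units = -(-write_len // ALLOC_UNIT)   # ceiling division
        return units * ALLOC_UNIT
    if policy == "coarse":
        # Proposed HDD behavior: allocate a full blob, write the small
        # piece, and mark the remainder "unused" for later fills.
        return BLOB_TARGET
    raise ValueError(policy)
```

The "coarse" policy trades extra space up front for less long-term fragmentation on spinning disks.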
D: I guess I think I might be misunderstanding something. I'm not seeing what's complicated, because my understanding is this is how BlueStore has worked from day one, right? A blob can be big, it can be small; there's a different write path for write-big, and it has to decide how big the blob should be allocated. There's a maximum size, so they aren't extremely big, but they're always different sizes.
D: It's not just technically possible, but probably true of almost every object in the store that has had non-trivial I/O. Because if you have, say, a min_alloc_size of 64K, you do a 384K write and then you do a 128K write, you're going to have two blobs that are totally different sizes. And then you do a 4K write, and it's going to do another 64K blob that's mostly empty but has a little 4K part filled in, with its unused bits set and all that stuff. If you...
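The write sequence just described can be tallied with a rough model (this ignores BlueStore's max blob size cap and uses made-up helper names):

```python
# Rough model of the example above: with a 64K allocation unit, writes of
# 384K, 128K and 4K each get their own blob, rounded up to the unit. The
# last blob is mostly empty; the hole is what the unused bitmap tracks.

MIN_ALLOC = 64  # KiB

def blob_for(write_kib):
    """Each write's blob, rounded up to the allocation unit."""
    units = -(-write_kib // MIN_ALLOC)  # ceiling division
    return units * MIN_ALLOC

writes = [384, 128, 4]
blobs = [blob_for(w) for w in writes]
wasted = sum(b - w for b, w in zip(blobs, writes))  # unused KiB across blobs
```

Three writes, three differently sized blobs, and 60 KiB of the last blob sits allocated but unwritten.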
D: Can we set that pull request aside for now and just talk about what currently happens and what we want to change? Because I don't want to... Let's just talk first about what the code does and what behavior we actually want to change, and then talk about possible ways that we can get there. But let's...
B: Well, the second use case is similar to RGW, but some general-purpose pool with plenty of small objects. This differs from the RGW case in that we can't control whether objects are written in a single shot or not. In RGW we can say that small objects are written in a single shot, and that simplifies the handling, which we can't assume in a general-purpose pool.
D: That's the shared blobs, not the new writes; it's the new allocation. Okay. So if we basically talk about current master, we changed min_alloc_size to 4K, so all three of those issues are solved, right? RGW objects are small, and small objects written in pieces are still small, and the EC overhead is mostly gone, because the allocation size, at least, has gotten as small as it can be, given that we have to do the cloning and shared-blob dance. Well...
D: Yeah, I think that'll mostly just cause a minor increase in your onode, but whatever; I think that's a separate discussion. So I think just setting min_alloc_size to 4K solves like 90% of our problems. On SSD we always want 4K; on erasure-coded pools we always want 4K; for RGW we always want 4K. The only real problem is if it's a hard disk, and I'm...
D: Okay, because it seems like there are two categories of issues here. One category is how big our allocations should be: if I get a small write, should I do a large allocation and write only a small piece of it, because that's going to be better in the long run, or, if I get a small write, do I write to a small allocation?
D: That's one category: how big should I allocate. The other category is: is the allocator just behaving well, right? Like, if I do a large allocation, is it aligned so that when I do small allocations things are less likely to fragment, or, you know, whatever it is; I think there's stuff there that could probably just be improved. Or: if I do a large allocation, and there are both large extents free and small extents free...
D
Do
I
return,
the
logics,
tent
or
jury
returned
a
bunch,
a
little
extents
mushed
together
and
that
I
think
it's
just
an
alligator
implementation
question
the
things
that
we
could
good
or
maybe
should
improve.
If
I
don't
know
what
the
current
behavior
is,
but
maybe
that's
that's
like
a
segment,
the
second
category
of
problem
that
makes
sense
I.
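The second category, returning one big extent versus stitching little ones, can be sketched like this (a toy free list, not the real allocator interface):

```python
# Toy allocator illustrating the question above: given a large request,
# return one contiguous free extent if any is big enough, otherwise mush
# smaller free pieces together until the request is satisfied.

def allocate(free_extents, want):
    """free_extents: list of (offset, length) pairs, mutated in place.
    Returns the list of (offset, length) extents handed out."""
    # Prefer a single extent that covers the whole request.
    for i, (off, length) in enumerate(free_extents):
        if length >= want:
            free_extents[i] = (off + want, length - want)
            return [(off, want)]
    # Otherwise stitch small pieces together (the fragmenting path).
    got, out = 0, []
    while got < want and free_extents:
        off, length = free_extents.pop(0)
        take = min(length, want - got)
        out.append((off, take))
        got += take
    return out
```

If the observed behavior is that BlueStore asks for big blobs and gets the stitched path even when contiguous space exists, that would be an allocator bug rather than a sizing-policy question.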
D: Yeah, just because we're doing allocations, right? So that feels like the first category, of how big I should do my allocations. Because, like, I can imagine that for RBD, for instance: say it's a completely empty image and we get a 4K random write at some random offset. The current behavior is: 4K is a multiple of min_alloc_size, so I'm just going to allocate 4K, and so you're just going to get a tiny allocation.
D
And
then
you
write
randomly
right
before
K
next
to
it
and
you'll
get
another
location
and
so
on,
or
you
could
say
it's
a
4k,
random
right
but
I
know
this
is
a
block
image
and
I
want
to
like
reduce
long-term
fragmentation
and
so
on,
and
so
I'm
actually
going
to
allocate
64
K.
Even
though
I
need
four
K
of
it
right
now,
and
that's
a
decision
that
you
have
to
make
on
that
first
rate
right.
D: It could also be... all right, just one sec, Mark. It could also be that there's a different issue, right? It could be that I do decide to allocate 64K, but that 64K allocation I do isn't aligned to a 64K boundary; it's, like, shifted. And as a result, when I do, you know, a million allocations and releases and so on, I get some fragmentation even though most of my allocations are big.
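The alignment point can be made concrete with a couple of small helpers (illustrative names, KiB units):

```python
# Sketch of the alignment concern: a 64K allocation that does not start
# on a 64K boundary straddles two aligned regions, so even mostly-large
# allocations can shred the free space over many allocate/release cycles.

BLOB = 64  # KiB

def is_aligned(offset_kib):
    return offset_kib % BLOB == 0

def regions_touched(offset_kib, length_kib=BLOB):
    """How many aligned 64K regions one allocation overlaps."""
    first = offset_kib // BLOB
    last = (offset_kib + length_kib - 1) // BLOB
    return last - first + 1
```

An aligned 64K allocation occupies exactly one region; a shifted one dirties two, leaving two partial holes when it is freed.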
D: But I think that's not really what we're worried about right now. I mean, we can file that one and think about it later, but I think the real question right now is just how big we allocate. And it's really only for RBD and CephFS on hard disks on replicated pools; that's the only time it matters. The rest of the time we always want to just do the smallest allocation we can.
D: Yeah, yeah. And I would actually say that in the case of RBD we basically always want to do slightly larger allocations, because we don't have this sort of small-file issue; it's a block device, and an XFS on top of it is going to pack, or not pack... Like, I don't think we should get bitten by issues with that; we should assume it's sort of a sparsely but coarsely allocated chunk of space that the file system is going to...
D: ...is going to use, yeah. Though, I mean, even something as simple as: if it's an RBD pool and it's rotational and it's replicated, then we should make the min blob size 64K instead of whatever it is right now, min_alloc_size, and for RBD I think that would basically solve it. Well, okay. So here's actually the question: we keep talking about how the min_alloc_size made something slower. What actually was tested, and what actually was slower? What was...
D: Writes get faster but reads get slower, and that's just because they're smaller; that's the small writes where we would want to do a larger allocation even though we don't have to, so everything I was just talking about sort of makes sense. But what about this large write? Do we have any idea why that's the case? Like, why would a large write be half as fast, less than half as fast?
D: So it seems like that's the thing that we need to understand: what is actually going on. Because, just guessing from this, it seems like it's maybe a problem with the implementation of the allocator. BlueStore is probably asking for big blobs, and the allocator is giving it a bunch of little ones; probably just something that the allocator should do differently.
D: I mean, I would suggest that we first just rerun these tests and then look at what's actually happening and try to understand what the changed behavior is. If, in fact, it is just getting smaller allocations, and we actually understand why, then we can figure out how to make the allocator give out bigger pieces.
D
Just
or
just
look
at
the
log
and
see
what
have
you,
how
big
the
action
requests
are
and
how
big
the
results
are
sure
than
that,
and
then
it
could
compare
it
for
the
to
forgive
and
just
look
at
the
a
new
case
that
that
sounds
like
problem
number
one
and
then
problem
number
two
is:
if
we
want
to
do
something
about
the
small
right
behavior,
where
we're
basically
trading
right
performance
for
read
performance,
and
that
would
be
more
subtle
thing.
We're
not
here,
I
guess
here.
D: ...and maybe that's just because that same allocator behavior is breaking our large writes. Yep, and that actually makes sense: because if you're dealing with a mix of 4K I/Os and, like, 128K I/Os, then you're going to have a bunch of 4K holes, and then you do a 128K allocation and it's going to give you, like, a bunch of...
D: So if that's the case, then it seems like there's basically one problem, possibly, or mostly one problem, and that's just that if there's fragmented free space and you ask for a large allocation, the allocator is giving you little pieces when it doesn't know how to give you a big one. Okay, so way back when, I don't know, four years ago now...
D
Whatever
remember
first
talking
about
this,
when
we're
talking
about
the
bitmap
allocator,
we
had
this
idea
that
there
should
be
hierarchical,
bitmaps
where
there's
the
bitmap
of
the
actual
granular
blocks,
but
then
we
would
also
have
a
coarse-grained
bitmap.
That's
like
I,
don't
know
a
couple,
an
order
of
magnitude,
more
granular,
less
granular.
That
would
either
be
completely
free
or
completely
used
or
whatever,
so
that,
if
you're
doing
large
allocations,
you
could
efficiently
find
those
large
blocks
and
if
you're
doing
more
allocations
and
you'd
end
up
going
down
to
the
is
that
the
data
over?
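The hierarchical-bitmap idea can be sketched in a few lines (purely illustrative; the actual bitmap allocator is C++ inside BlueStore and its structure differs):

```python
# Sketch of a two-level bitmap: a fine bitmap tracks individual blocks,
# and a coarse bitmap marks which 16-block regions are entirely free, so
# a large allocation can scan coarse bits instead of every fine bit.

RATIO = 16  # fine bits per coarse bit (e.g. 4K blocks per 64K region)

class TwoLevelBitmap:
    def __init__(self, n_fine):
        self.fine = [False] * n_fine                # False = block free
        self.coarse = [False] * (n_fine // RATIO)   # False = region all-free

    def mark_used(self, i):
        self.fine[i] = True
        self.coarse[i // RATIO] = True              # region no longer all-free

    def find_large_free(self):
        """Find a fully free region by scanning coarse bits only."""
        for c, touched in enumerate(self.coarse):
            if not touched:
                return c * RATIO                    # first fine index of region
        return None                                 # fall back to fine scan
```

Large requests hit the coarse level; small requests drop down to the fine bitmap, which is the behavior described above.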
A: The biggest problem we had with the 4K min_alloc_size was the encode/decode of metadata; it was, actually, in all cases, yeah. And now we've improved everything to the point where, on flash, it looks like we were often better with a 4K min_alloc_size in all cases, including small writes, where we then don't have to eat the blob read, because we're... okay.
D: Right, so I think we have our line of investigation. We want to try the AVL allocator, see what happens, and also look at the bitmap one, and understand what the allocations look like, and see if we understand what the core issue is. But I think we have a decision point for Octopus that we have to figure out: whether we want to stick with the 4K alloc size or go back to 64.
D
That's
like
I
think
what
we're
basically
trading
is
with
a
64
klx
eyes
and
we've
have
tons
of
space
amp.
The
performance
is
better
at
least
four
large
rights,
whereas
with
the
4k
alex
is
that
we
have
right
now,
then
the
space
overhead
issues
go
away,
but
large
streaming
rights
are
slower,
but
smaller.
D
Bester
yeah
and
mixed
rights,
or
slower
right,
it's
like
only
the
very
small
ayahs
that
go
faster,
exactly
which
is
like
just
any
pesticide
that
does
that
describe
any
work
whatever
like.
Even
though,
like
realistic,
torturous
or
VD
workload.
Isn't
it's
gonna
be
a
mix
right?
It's
gonna
be
a
mix
of
small
and
big.
It's
not
gonna,
be
just
yeah,
probably
unless
you're
doing
something
like
crazy
on
it
right
yeah,
which
basically
means
it's
just
slower
yeah.
If
we
trade
it,
we
trade
space
amp
for
performance,
so.
D: It was basically just setting the alloc size to 4K, but requesting 64K at a time and setting the flag that says it has to be contiguous; yeah, that's what we were saying. Okay, so that also, I think, makes me nervous until we understand what the allocator behavior should be, because that has the risk, with an aged store, of just wasting CPU, because it has to do this...
D: It seems to me that, if we want to do something, the conservative thing would actually be, for hard disks, to create the store with a min_alloc_size of 4K but have a flag on the allocator that says you're required to actually do 64K-multiple allocations for the whole device, always, just so that the disk format is the same. And so when we eventually fix this... yeah.
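That conservative option, a 4K on-disk unit with a flag forcing coarse allocations, can be sketched as follows (hypothetical names; not the actual BlueStore allocator interface):

```python
# Sketch of the conservative scheme above: the on-disk allocation unit
# stays 4K (so the disk format never changes), but a policy flag rounds
# every allocator request up to a 64K multiple on hard disks. Dropping
# the flag later requires no reformat, since the format already speaks 4K.

UNIT = 4 * 1024      # on-disk min_alloc_size stays 4K
FORCED = 64 * 1024   # HDD policy: always hand out 64K multiples

def rounded_request(want, force_coarse):
    """Size the allocator is actually asked for."""
    grain = FORCED if force_coarse else UNIT
    return -(-want // grain) * grain  # round up to the chosen granularity
```

Once the allocator behavior is fixed, stores created this way keep working with `force_coarse` turned off.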
C: That's a good question. Hard disk users I would expect to be less performance-sensitive than that, so it's basically less sensitive for that case. Okay. The nice thing about the RBD small writes, though, is that we know the write size is going to be 4K, or whatever small size it is, when we're rewriting it; it's not going to change. So if we did have this layout of 4K blocks inside 64K allocations, I mean, we're doing immutable writes.
G: So, to put it simply, I think we are going to improve things for customers who are using SSDs, and we are not changing anything for our customers who are using hard disks. We can live with that, as long as we're saying that we're making forward progress towards that in a later release, or even a later Octopus point release. So as long as we're not getting worse, I'm fine with that.
G: I'm just saying that in the worst-case scenario, what we are able to do with Octopus is that we go down to 4K, which we have already done for master, and we just go back up for hard disks to 64K, which is literally what we were doing before. For the hard disks case we are not changing anything; that's what I'm trying to say. In the future, we know that it is a problem, and what Igor is going to investigate will get us to a better solution.
C: Yeah, it's late by a couple of weeks, but there's enough time to make sure that whatever we come up with is tested enough that we're confident in it for Octopus.