From YouTube: 2018-08-09 Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
E
It doesn't, it doesn't... I think I'm a bit afraid about the relationship between setup costs and the actual speedup, because the most prominent user of our crypto abstraction is cephx, and we are going there with very, very small chunks, 52 or 48 bytes, and unfortunately, because of the messenger's restrictions, we cannot cache...
E
...the EVP context. It appears that its creation cost is, I would say, pretty high, and moreover it varies from OpenSSL version to version; it can even vary within the same OpenSSL version because of FIPS certification, okay.
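For context on the cost Radek is describing: it is the per-operation allocation and initialization of an OpenSSL EVP cipher context. Below is a minimal C++ sketch of the difference between paying that setup for every 48-52 byte chunk and reusing a cached context, the kind of reuse the messenger restriction currently rules out. The cipher choice, key handling, and buffer sizes are illustrative assumptions, not Ceph's actual code.

```cpp
#include <openssl/evp.h>

// Naive path: build and tear down an EVP context for every small chunk.
// For 48-52 byte cephx-style payloads, EVP_CIPHER_CTX_new/free plus
// EVP_EncryptInit_ex can cost more than the AES work itself.
// 'out' must have room for in_len plus one cipher block of padding.
void encrypt_per_call(const unsigned char* key, const unsigned char* iv,
                      const unsigned char* in, int in_len, unsigned char* out) {
  EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();        // heap alloc + init, every call
  EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), nullptr, key, iv);
  int len = 0, fin = 0;
  EVP_EncryptUpdate(ctx, out, &len, in, in_len);
  EVP_EncryptFinal_ex(ctx, out + len, &fin);
  EVP_CIPHER_CTX_free(ctx);                          // teardown, every call
}

// Cached path: keep one context per connection/thread and only re-key it.
struct CachedCipher {
  EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();        // setup cost paid once
  ~CachedCipher() { EVP_CIPHER_CTX_free(ctx); }

  void encrypt(const unsigned char* key, const unsigned char* iv,
               const unsigned char* in, int in_len, unsigned char* out) {
    // Re-initialize key/IV on the existing context instead of reallocating it.
    EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), nullptr, key, iv);
    int len = 0, fin = 0;
    EVP_EncryptUpdate(ctx, out, &len, in, in_len);
    EVP_EncryptFinal_ex(ctx, out + len, &fin);
  }
};
```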
D
Let's come back to that in a minute, because I think this is a bigger topic. Sure. Let's see: okay, there's a tiny-appends pull request from Igor; we're going to talk about that in a few minutes too. There's this new EC partial stripe reads one; I don't think I've seen this one. Actually, it looks like Greg's on it.
D
Okay, that needs a closer look. There's one doing snap rollback; Jason reviewed that. All right, so Radek's pull requests were merged. The async recovery one, there was a whole discussion that moved to the list, although I'm still confused about what...
D
...on that, through the testing. Okay, let's see: someone's working on MDS balancer stuff; there's another tracker pull request there. Maybe this is the one that you were just talking about. There are Mark's pull requests, two of them actually, that put a cap on the OSD memory, or manage the cache more smartly, and then Shawn Peng is working on one in BlueStore that does the shard completions in the OP worker thread.
D
Okay, that's amazing! All right, let's go to the discussion topics. Let's talk about the EVP thing first, just because we already touched on it, Radek. It sounds to me like we sort of avoid the issue if we keep the cephx signature checks using the low-level API but then change the more general abstractions to use a higher-level API. But I don't remember what the users are.
D
The other thing to keep in mind is that the cephx signature checks are for the current version of cephx, the current messenger v1 protocol, and that's going to change drastically in the next couple of months with messenger v2, so I wouldn't bother worrying about that. I would just, maybe this is the question, look at whether there's an opportunity to improve our RGW crypto performance or not, look at it from that angle, and just take a little bit of care not to break the messenger v1 crypto checks in the process.
E
To provide data from clusters running RBD: at the moment all we have is just a micro-benchmark. I made some very preliminary tests using similar conditions to what we had in the case of OpenSSL, by which I mean the all-ones scenario: a one-gig RBD image fitting entirely in cache, one client, everything set to one, and I'm getting no difference, or overall a regression of around one and a half percent.
B
Oh, I'm the instigator of this, I guess. Maybe six months ago I had the opportunity to work with an actual user who was trying to improve their BlueStore performance, and that interaction is actually kind of what led to all this work on trying to make BlueStore's cache settings easier, because they were really, really confused about a lot of different things. But one of the big things that came up was that they had defined...
B
Well, it wasn't overriding... I think they had forgotten that they had even set an SSD setting somewhere, and then they were wondering why, when they were changing the bluestore cache size setting, it wasn't doing anything. That makes sense; yeah, that's actually how it's supposed to work.
B
Now that we've got the auto-tuning stuff for the cache sizes, the only thing hopefully left for the user to set is the amount of memory they want the OSD to try to keep itself to. But we're now again left with the question: well, do we want to have separate SSD and HDD defaults for that? And it kind of makes sense.
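The model Mark is describing, one user-set memory target from which the cache sizes are then derived automatically, can be sketched roughly like this. This is purely illustrative, not Ceph's actual tuning code; the `tune()` policy and its constants are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative autotuner: converge the whole process on a single user-set
// memory target by shrinking or growing the aggregate cache budget.
class CacheAutotuner {
  uint64_t memory_target;  // the one knob the user still sets
  uint64_t cache_min;      // floor so the caches never collapse to zero
  uint64_t cache_bytes;    // current aggregate cache budget

public:
  CacheAutotuner(uint64_t target, uint64_t min_cache)
      : memory_target(target), cache_min(min_cache), cache_bytes(min_cache) {}

  // Called periodically with the process's measured memory usage.
  uint64_t tune(uint64_t mapped_bytes) {
    if (mapped_bytes > memory_target) {
      // Over target: shrink the caches by the overage, down to the floor.
      uint64_t over = mapped_bytes - memory_target;
      cache_bytes -= std::min(cache_bytes - cache_min, over);
    } else {
      // Under target: grow by a fraction of the headroom to avoid oscillating.
      cache_bytes += (memory_target - mapped_bytes) / 4;
    }
    return cache_bytes;  // then split among onode/buffer/rocksdb caches
  }
};
```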
D
...the settings, the HDD and SSD ones. The reason why we broke out the defaults is so that users never have to touch them; it will just do the right thing. For example, the thread count or whatever: nobody should be worrying about that unless you're a power user; nobody should be thinking about it. So just having a default, I think, makes sense. I think this option is a little different, because this is actually something that is, by its nature, a user decision, not a sort of magic hands-off option.
D
...kind of thing anyway. So I'm not so sure about this template idea. I would prefer to keep all these other options in the category of things that users shouldn't really touch unless they really are thinking about it, and the HDD/SSD thing is just a way to make sure our default is sort of the right one given what we know.
D
But it's worth noting that for the memory thing you can't actually set different defaults cluster-wide in the cluster config, at least; but even then I'm not even sure that makes sense. I think what's actually going to happen is that people are going to have a bunch of old nodes, in old chassis, old servers from the early days of the cluster, that have less memory, and they're going to...
D
...deploy new nodes that are newer and faster and better and have more memory, and they're going to want to apply the settings to those. And unfortunately there isn't really a way to tag config settings based on what revision of the chassis they happen to be in, unless that also maps to, like, what rack they're in; then they could map their memory per rack.
B
It almost feels like you want a template for a node, right? You want to be able to say: I've got this class of node, I've got these other classes of nodes, and they have this many disks and this much memory and whatever, and this is how much memory I want each OSD to use, and I want these disks in each chassis to be used.
D
Yep, I mean, that's kind of what the device classes are meant to be. The system automatically puts you in hdd and ssd classes, and maybe an nvme class; I don't remember if that was a real thing or not, or whether it ever merged. So you can make a class of OSD that's, you know, gen 2 or gen 3 or whatever you want to call it, yeah, that way. So I think we kind of have the tools there. I guess, long story short, I think you're right.
C
With these tiny writes, originally I used the O prefix, which is the same as the onode one; that makes all the records be kept in the same namespace. This improved performance, but it might make it harder to cache them and probably makes caching less effective, and another suggestion was to create another namespace and put such records there. So here the O-prefix approach is the original one, the K prefix is the new namespace, and I'm trying to compare the approaches, as well as against the original model. Right? Yeah, Mark.
C
Yeah, that's another record, and potentially even multiple records; actually, for this test case, just one record. But the policy that enables these tiny writes at the moment is an append happening at unallocated space aligned with the allocation granularity. So in fact, if you perform multiple appends aligned properly, it will create multiple records. I'm not sure if that's the best strategy; I was just trying to implement the thing that allows...
C
...that allows this procedure for small writes that are performed using just a single write, so very small objects written in just one write. But unfortunately we don't have any flag saying that no more writes are expected for this object or something like that. So currently it's a more complicated procedure which might create multiple tiny records, but actually that's a bit of a different story.
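A rough sketch of the policy Igor describes, a hypothetical reconstruction rather than the PR's actual structure: a small single write landing on unallocated, allocation-aligned space becomes a RocksDB record under a dedicated prefix instead of consuming a whole min_alloc_size extent.

```cpp
#include <cstdint>
#include <string>

// Hypothetical tiny-write policy: payloads below a threshold that append
// into unallocated space, aligned to the allocation granularity, go to the
// KV store instead of allocating a full min_alloc_size extent on disk.
struct TinyWritePolicy {
  uint64_t min_alloc_size;  // e.g. 16 KiB on this device class
  uint64_t tiny_threshold;  // max payload size eligible for the KV path

  bool use_kv_path(uint64_t offset, uint64_t length, bool extent_allocated) const {
    return !extent_allocated &&             // writing into a hole
           length <= tiny_threshold &&      // payload is small enough
           offset % min_alloc_size == 0;    // aligned with allocation granularity
  }
};

// Key sketch for the separate-namespace variant: a dedicated "K" prefix keeps
// tiny payloads out of the onode ("O") records, at the cost of a second
// lookup on read (endianness of the binary key ignored for brevity).
std::string tiny_record_key(uint64_t object_id, uint64_t offset) {
  std::string key = "K";
  key.append(reinterpret_cast<const char*>(&object_id), sizeof(object_id));
  key.append(reinterpret_cast<const char*>(&offset), sizeof(offset));
  return key;
}
```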
C
The first rows are about the SATA drive, which is rather slow, but the difference in numbers is pretty significant here. You can see that for writing we have two and a half thousand IOPS for the original approach, which is writing to the store with the default write procedure, while the new writes get about seven and a half thousand IOPS, and actually the numbers are pretty comparable for both the K and O prefixes.
C
I'd suggest that's rather compaction, because one of the things that I also monitored was the DB space usage during each read, and for the second and the third read it's pretty stable, both the current size and the maximum size of the 8 GB DB volume, while for the first read I can see that the maximum column might go up and then stabilize. Actually, the last column here is an aggregate.
C
So instead of publishing all three or even four columns for these numbers, I just wrote here, well, the stable DB volume size after performance stabilized; from the files I can see no compaction is happening, and the maximum size mostly matches it, even during writes and the first read.
D
Okay, this looks super promising. It's somewhat similar to what the original NewStore code was doing forever ago, where I was putting some number of writes in the KV store; they were called lazy writes or something, I forget, but we eventually ripped it out at some point because it didn't seem to be helping. But now it clearly does. I think it makes...
B
So that concerns me, just given that usually it's the bstore_kv_sync thread that's the bottleneck for, you know, random writes for us. I don't know if it would affect your read numbers much, but at least in the write tests I think you might see different numbers, so it might be worth looking at.
D
I think if I just went with my gut, I would focus on the K namespace, just based on what we've been thinking about with the rest of the trade-offs involved. But I think the big question in my mind, and the big thing to sort of eliminate as a concern, is the point that Greg brought up in the chat, which is that if we're putting more data in the SSTs for these small writes, then RocksDB is going to have a higher compaction load, and that can be significant.
D
So my sense is that we need to do kind of a worst-case-scenario test where a lot of data is going into these tiny writes, maybe all of it, the OSD is basically filled with it, and see what the steady-state performance is with that behavior, with all this RocksDB compaction of all these writes going on in the background.
D
...a uniform random workload over them, and seeing what the IOPS are then, when you have sort of sustained compaction going, and comparing that to sort of the same situation where... though you might not actually be able to fill it with as much data, because the min_alloc_size is, is it 16K now? What is our min_alloc_size these days, Mark, do you remember?
D
16K, so I actually would only be able to fill it with 1/16 as much data. But maybe I only have to fill it with the same number of objects, or still fill it to 80 percent, because users could potentially do that. But either way, try to figure out how to do it so we have some reasoning or some data about what that heavy compaction impact would be. The other thing is that 8 gigabytes is kind of a small data set for a big...
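The 1/16 falls straight out of the allocation arithmetic; taking 1 KiB payloads purely as an illustration (the call doesn't pin down the tiny-write size):

```latex
\[
\frac{\text{payload stored at device-full}}{\text{raw capacity}}
  \approx \frac{s}{\text{min\_alloc\_size}}
  = \frac{1\,\mathrm{KiB}}{16\,\mathrm{KiB}}
  = \frac{1}{16}
\]
```

That is, with each tiny object burning a full 16K extent, the device reports full after only 1/16 as much user data; the KV path stores roughly the payload bytes themselves, so reaching the same raw utilization takes about sixteen times more payload, or, as suggested, the same object count.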
F
Nope, no, that wasn't directed specifically... well, I guess in a way it was. I mean, the only piece of it that was related to Igor's proposal was whether storing data like this in RocksDB makes spillover more likely, and do we need to provision differently if we're using this, do you think?
D
...last time, whatever. The possibility of pull requests like this, that change BlueStore's behavior and how it uses the DB, is partly why I don't want to be specific and prescriptive in how to size the DB; it should just be as much as you can afford. I think we should be providing high-level guidance, and I think we should keep the spillover boogeyman in context.
D
When spillover happens, it means that you're putting storage on the hard disk instead of the SSD; it means that it's not as fast anymore. But having a small amount of flash and getting some spillover is always going to be better than having no flash and having everything, you know, spilled over by default, or whatever. So it's important to understand what is happening when spillover happens, but I'm not sure that it...
F
We had a discussion about that this morning in the hall, where somebody mentioned the idea that if you used LVM instead of partitions, maybe you could expand the RocksDB volume dynamically. I don't know how BlueStore would react to that. Would that be a disaster, or would that be something that...
C
Not that specific case, but yeah; well, actually the basic functionality for that is already present in the code base: you are able to resize the database volume offline. And I'm trying to extend this functionality to be able to migrate between volumes, and to simplify this; right now it's several steps, as far as I remember.
D
Actually, one more thing... I know, I know. Right now the ceph-volume batch thing is just carving it up into N pieces, so it's assuming that the advertised size is the size that you want to use, which is fine as a first starting point, but yeah, again, if we get sophisticated, we'll want to make it...
B
And I don't know how much it ties into the PR that you linked, but, I don't know, I think I maybe sent this to a couple folks before, but in the chat there's a link to a document that, well, it's very much just a small set of all the possibilities of what people could do. But this is for 4K writes to RBD, or 4K...
F
Yeah, it was useful. The problem I think people had was that it was so complicated, because you needed to know what workload you were dealing with, and then, you know, try to estimate the ratio of objects you were going to have within that workload, and, oh yeah, the probability of getting it right...
F
...the first time was pretty low. So the fallback position was: let's just allocate some percentage of the total HDD space as RocksDB partition space, and that's a formula that, you know, you can be very generous with, and try to just make it less likely that this spillover occurs, even if we waste a little space in some cases.
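A worked instance of that fallback formula; the 4% fraction and 12 TB drive are illustrative choices, not numbers from the call:

```latex
\[
\text{db\_size} = f \times \text{hdd\_size},
\qquad
f = 0.04:\quad 0.04 \times 12\,\mathrm{TB} = 480\,\mathrm{GB}
\]
```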
B
I see what you're saying; it might be worth trying. I would suggest at least definitely making sure that the assumption is right, but you might not need to carve up your SSD in that kind of weird way when FileStore's not being used. Yeah, I gotta look into that; you're probably right. And then you can go back to, yeah...
D
...them if you wanted to. All right, I tuned out, but, and maybe this is for another conversation, I thought that in the normal case you have an NVMe and you have like eight hard disks or whatever, and you divvy it up into eight pieces; and I thought the case that Ben was suggesting is that you might have an NVMe, you take half of it and create a dedicated OSD, and you take the other half and you divvy it up for the hard disks, and you have...
D
Probably not, if you just... yes, but it does mean that one advantage of doing it that way is that you still have a dedicated NVMe OSD, you have it separate, cool, and then you can scale it independently. So then you can add a bunch of, like, pure NVMe OSDs if you need more index capacity, or fewer; whereas if you have these two on the same device, you're sharing the same disk.
D
Yeah, I think the thing that's annoying here is that there's a whole bunch of complexity you have to invest in, though: the Ansible stuff, tooling to provision them that way, and the end result is just complicated. And so even if it's a little bit better, it might just be simpler to say: deploy BlueStore uniformly with an NVMe and some hard disks, and just bank on the fact that your omap is going to end up on the SSD. All right, that's good, but it's a little bit...
D
So you want to have some confidence that it's going to be small enough, and I actually don't have any numbers on how much data is actually in the omap pools compared to data in the data pools. It would be helpful to have that: if it's usually 1%, then we're safe, but if it's like 20%, then we're not so safe, and we have to be really careful about how we balance them, I think.
D
I mean, assuming that the contention between the two sharing the same NVMe, the OSDs or whatever, isn't an issue, then everything should be fine until you have so much data that it spills over; and the equivalent, if you had a four-way partition, would just be that your omap-pool OSDs fill up. So there's always a high-level thing where the user has to deploy more SSDs in order to make it work. I'll just throw that in there...
B
It's always a decision process, though, right? Because how do you know? Maybe it's better to have your RGW omap data for bucket indexes spill over to a hard disk than to have, like, your RBD metadata or whatever spill over. How do you decide which of those things is more important? We've got all these things that all say their metadata needs to be on SSD...
D
Thanks again, Igor. I think the tiny-write thing looks super awesome, and also a little bit of encouragement on the BlueStore tooling stuff to resize those volumes: I think that's also going to be pretty useful.