From YouTube: 2018-MAR-01 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
B: All right, let's go through these quickly, I guess. There's the PG log trim max 2K PR, which is passing tests now; it just needs to go through QA, and we need to make a call on what the default setting should be. I can come back to that when we talk about your results. Same thing with BlueFS bluefs_buffered_io = true: it sounds like that is, and actually it's a bit of an issue because...
A: It seems like we benefit from it, but if we could figure out how to cache things optimally without the page cache, maybe then it goes away; maybe then it's better. I think we're going to need to design something much more sophisticated than what we have now to make the advantage go away. Yeah, all right.
B: Okay, the dout thing merged; no, hold on, I think it merged: the micro-optimization to make the dout condition checks faster. So that's good. And the op history thing from Peter merged; I think we'll backport that, but let's let it sit in master a little bit just to make sure there are no unexpected issues.
B: Okay, okay, okay. Let's see, the BlueStore cached onode removal: I think I need to change the logging level of the cache trim stuff. Every shard logs something every 50 milliseconds at level 10, which just fills up the log even when the OSD is idle, so I'm going to turn that up. And then readahead was enabled for Jewel; that's merged, so that's good.
B: We can come back to that later. Okay, there's an improvement to assert to make it a little bit lower overhead; sounds fine, get it reviewed, I guess.
A: You know, "what hardware should I buy", that kind of thing. And the other part of this is that, even beyond just how much metadata we have in RocksDB, there's also a question of caches: how much space should be devoted to the onode cache, how much space should be devoted to the block cache, and, within the block cache, how much of that space is being used for indexes and filters?
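For context, the RocksDB knobs that control whether indexes and filters live inside the block cache look roughly like this; a minimal sketch against stock RocksDB APIs, not the exact options Ceph sets:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch: make index and filter blocks compete for block-cache space
// instead of living in unbounded heap outside the cache.
rocksdb::Options make_options(size_t block_cache_bytes) {
  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = rocksdb::NewLRUCache(block_cache_bytes);
  table_opts.cache_index_and_filter_blocks = true;  // charge them to the cache
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;
  table_opts.pin_l0_filter_and_index_blocks_in_cache = true;  // keep hot levels resident

  rocksdb::Options opts;
  opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return opts;
}
```

With those options set, "how much of the block cache is indexes and filters" becomes a question the cache itself can answer, which is what the rest of this discussion circles around.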
A: So this is really just a high-level overview of what I wanted to talk about today; I didn't even link any of the current graphs or data in there. But the very high-level gist of it is that how much metadata we have is intimately tied to the number of objects we have, which probably makes sense, but it's not something we like to talk about.
A: We don't want it to be tied to the number of objects, but it really is, and how many objects you can store is intimately tied to the min_alloc size in BlueStore. So if you have a high min_alloc size and you are writing out, like, 4K objects in RGW, you're going to use a heck of a lot more space than you would if you used, like, a 4K min_alloc size.
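The space math behind that is just rounding every object up to the allocation unit; a quick illustrative sketch (16K matches the bluestore_min_alloc_size_ssd default of that era, but treat the numbers as illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Back-of-envelope sketch: every object occupies at least one
// min_alloc-sized extent, so tiny objects get padded up.
int main() {
  const uint64_t obj = 4096;  // 4K RGW object
  for (uint64_t min_alloc : {uint64_t(4096), uint64_t(16384), uint64_t(65536)}) {
    uint64_t stored = std::max(obj, min_alloc);
    std::printf("min_alloc=%-6llu -> %2llux space per 4K object\n",
                (unsigned long long)min_alloc,
                (unsigned long long)(stored / obj));
  }
  return 0;
}
```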
A: So, having said that, the difference between using RBD with, like, four-megabyte objects and doing 4K objects in RGW is vast. If you're doing small writes, like 4K writes, to an RBD volume: for a 256-gigabyte RBD volume we're using maybe one gigabyte of metadata. It's really small, and the indexes and filters are tiny, like maybe 25 megabytes of indexes and filters. The per-object metadata is higher.
A: So if your workload is tiny objects in RGW, you need a heck of a lot of DB space to be able to cache everything without it rolling over to the block device, and the indexes and filters are large too. I think it's 1.2 gigabytes of indexes and 800 megabytes of filters, which definitely exceeds our current defaults.
A: So for a 64K RGW workload with the default 16K min_alloc on an SSD, we were talking about roughly six and a half gigs of metadata on a 256-gig object corpus. So, not bad, but still significant, right? And if you've got, like, eight-terabyte SSDs, which some people are now starting to look at: 20 gigs, almost.
A: Well, maybe. I mean, it depends on how full you're making your nodes. But that plus the BlueStore cache means that if you want those indexes and filters cached, and you want, you know, a decent amount of onode cache, we might be talking like 10 gigs of RSS memory for a single OSD process. Yeah.
B: So if we think of this, and if we assume that we only have a simple solution where we have some fixed balance between BlueStore and RocksDB, then we want to prevent the worst case, presumably, right? And that 50% split would mean that you'd have to dedicate 10 gigs of RAM to the OSD in order to effectively cache a one-terabyte SSD that's full of tiny RGW objects, which seems doable.
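A rough reconstruction of where a figure like 10 gigs comes from, scaling the measured 256 GB numbers up to a 1 TB drive; the onode-cache allowance here is an assumption for illustration, not a quoted measurement:

```cpp
#include <cstdio>

// Illustrative sketch of the sizing argument: scale the measured
// 256 GB tiny-object figures (1.2 GB indexes + 0.8 GB filters) up to a
// 1 TB drive and add an assumed onode/omap cache allowance.
int main() {
  const double measured_corpus_gb = 256.0;
  const double index_filter_gb    = 1.2 + 0.8;   // measured above
  const double scaled = index_filter_gb * (1024.0 / measured_corpus_gb);
  const double onode_cache_gb = 2.0;             // assumption, not measured
  std::printf("indexes+filters: %.0f GB, plus onode cache: ~%.0f GB total\n",
              scaled, scaled + onode_cache_gb);
  return 0;
}
```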
A: Yeah. You see the last point I make here, and in this whole thing: I think the priority that we want is indexes and filters first, onodes second, maybe tied with omap depending on the workload, and then SSTs for compaction last. Which kind of gives us this weird, you know, if-this-then-this-then-this pattern going between BlueStore and RocksDB.
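That ordering amounts to a waterfall over the memory budget; a hypothetical sketch (the tier names and helper are invented for illustration, this is not Ceph code):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical waterfall allocator: satisfy each tier's demand in
// priority order until the budget runs out; later tiers get the rest.
std::vector<std::pair<std::string, uint64_t>> allocate(
    uint64_t budget,
    std::vector<std::pair<std::string, uint64_t>> wants) {
  for (auto& [name, want] : wants) {
    uint64_t give = std::min(want, budget);
    want = give;      // record what this tier actually receives
    budget -= give;
  }
  return wants;
}

// Usage mirroring the ordering above:
//   allocate(total_bytes, {{"index+filter", a},
//                          {"onode+omap",  b},
//                          {"sst_data",    c}});
```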
E: Here's my concern. Mark, if I read your email right, it sounds like if we do what you're suggesting we're going to have three caches going on: the Linux buffer cache, the RocksDB cache and the onode cache. And I think what you were suggesting is that if you give it all to RocksDB, somehow that lessens the evil. But it's not an ideal situation.
A: Yeah, either that or we ditch the concept of using a key-value store that requires compaction; I mean, that's the other trade-off, right? You know, there are multiple directions you could head, but yeah. We just don't have a good candidate for that exactly, so we could write one, or we could...
E: I had a related question about compaction, because that seems to be the root of the issue here. So with this RBD workload that Micron, well, that partner, was doing: it's a common workload for RBD to pre-populate the volumes. So you would think that if you're doing random writes to pre-populated volumes, there would be no metadata change; therefore there would be no need for compaction, because the thing...
E: Sorry; sure, sure, no problem. So my thinking is: let's say you've got a set of RBD volumes and you fully populated them, so there's no allocation going on in those volumes, and then you do writes, correct? And let's assume that we're doing 4 KB writes and the min_alloc size is bigger than 4 KB, so we're going to be writing in place.
B: It does change, though, for two reasons. The first is that the data has a different checksum, so the checksums change. The second reason is that every time we update an object, we update the attribute on the object, whatever it is, that has the object info: the version, the log entry, the timestamp and all our random crap. So there's metadata churn as you do writes. But I think, as Mark mentioned, in the RBD case it's small, so it's not a lot of metadata. So there will be compaction, yes, but you're...
A: One of the things I think might be triggering all this compaction that we're seeing: there's a link in there about read/write compaction. It happened that I had some data lying around that I was able to go back and analyze, and there is a lot of time spent in compaction there. I really wonder if it's all of the PG log and dup ops coming in that are making it just thrash compaction, because it's not getting the tombstones out fast enough. I could be wrong, but I...
B: If you're not logging them at all, yes. I'm worried that might have skewed it; that might scale the overhead we see from compaction, because you're going faster. Or, yeah.
B: So, coming back to the main point here: it feels to me like I see sort of two points in the solution space. One is what's in the branch right now, which is basically a 50/50 split between BlueStore and RocksDB. That's clearly not perfect, but it's better than what we have right now, and it means that if you know that your cluster is all block, or that it's all KV-heavy, then you can tune that knob if you want to.
B: You have, like: I have 12 PGs that are all data, and I have 12 PGs that are 50% KV, and I have 4 PGs that are a hundred percent KV, or whatever it is, and you can come up with a blend, an overall blend, based on this: I think that 80% of my memory should go to RocksDB because there's more KV, or 20% of my memory should go to BlueStore because it's more block.
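A hypothetical sketch of deriving such a blend from per-PG stats; the helper and its inputs are invented for illustration:

```cpp
#include <vector>

// Hypothetical helper: average the per-PG KV fractions into one memory
// split (0.0 = all BlueStore cache, 1.0 = all RocksDB).
double kv_share(const std::vector<double>& pg_kv_fraction) {
  if (pg_kv_fraction.empty()) return 0.5;  // fall back to the 50/50 split
  double sum = 0.0;
  for (double f : pg_kv_fraction) sum += f;
  return sum / pg_kv_fraction.size();
}

// e.g. 12 PGs at 0.0, 12 at 0.5 and 4 at 1.0 average to 10/28, about
// 0.36, so roughly a third of the memory would go to RocksDB.
```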
A: Now, with our current settings, we appear to be good up to around 16 million RGW objects on one OSD, and even after that we're probably still not bad; but that's the point at which the indexes and filters are going to start getting paged out for SST files. So I guess the question in my mind is: do we have any idea, for most of our users, how many objects per OSD we end up seeing, even on big clusters? Because, I mean, 16 million objects on one OSD is a fair amount.
A: FileStore doesn't even handle that; FileStore falls over when we get up to about that many objects, because of the way splitting works. So I wonder: do we really have users that are doing significantly more objects per OSD than that right now? I'm sure we will as we get bigger, but it was falling over.
A: They were mistuned; they had, no question, lots of compaction. It was not indexes and filters, and I'm almost a hundred percent sure of that. I could be wrong, but everything we've seen so far indicates this is not an issue with the indexes and filters being paged in and out of memory.
A: So for 64K RGW writes, we had like 6,750, 6,760 megabytes of metadata, and then we had 62.8 plus 42.1 megabytes of indexes and filters. So that ends up working out to about 105 out of 6,760, which is like 1.6 percent. So you're close to two percent; call it 1.6 percent. So...
B: ...of objects per pool. But again, it's a little bit weird, because in the omap object case we don't know how many omap keys there are, and it could be, you know, objects that have four key-value pairs, or it could be objects that have ten thousand key-value pairs, like bucket indexes and things. So I don't know that that's good enough. You can choose a middle-of-the-road value, but I'm...
A: It seems like, as long as we're just doing puts and gets, then regardless of the object size or the min_alloc size and all this stuff, we end up with a relatively, not exact but relatively, tight cluster of the average number of keys per object in the database. It does change some, and I think it's because there's, you know, a set of existing keys. So there's just some static set.
B: I don't know about that; yeah, right. But as far as, like, return on complexity investment... but how...
B: So, okay, my proposal is still to start with the current KV split branch, because it's better than what we have. And then I think the question is: if the end goal is that we're watching the hit rates on RocksDB and on the BlueStore cache and sort of dynamically adjusting based on that, do we want a midpoint, something that's not quite auto-tuning but is sort of a guess on the way there? Sure.
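A hypothetical sketch of what that hit-rate-driven end state could look like; the names and step size are invented, and a real tuner would need damping and hysteresis:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the auto-tuning end state: nudge the division
// point toward whichever side is missing more, in small damped steps.
struct CacheStats { uint64_t hits = 0, misses = 0; };

double adjust_split(double kv_share, const CacheStats& kv,
                    const CacheStats& bluestore) {
  auto miss_rate = [](const CacheStats& s) {
    uint64_t n = s.hits + s.misses;
    return n ? double(s.misses) / double(n) : 0.0;
  };
  const double step = 0.01;
  if (miss_rate(kv) > miss_rate(bluestore))      kv_share += step;
  else if (miss_rate(bluestore) > miss_rate(kv)) kv_share -= step;
  return std::clamp(kv_share, 0.1, 0.9);  // never starve either cache
}
```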
A: Let me publish the data that I've got, because that might inform how good a guess would be versus looking at hit rates. Right? Like, you know, how tightly clustered the number of keys per object and the size of the data blob are, per kind of workload, or per... you could even get down to, like, per object, I mean.
B: Why do we need it, I guess, is the question I'm asking. Because at the end of the day it doesn't matter how many objects there are; that's just a proxy for how much metadata we have. So what really matters, looking at your stuff here on the pad, is the 124 gigabytes of metadata; that's the important part. That's what determines how big the indexes and filters are, roughly, right? Whereas if...
A: And we can change the block cache size dynamically. I don't know how to do it, but I saw someone reference it once, in a bug somewhere: hey, we can just dynamically resize the cache. So someone claims that you can do it. I've never seen how to do it, but someone claimed it.
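For what it's worth, stock RocksDB does expose this: a live rocksdb::Cache can be resized with SetCapacity. A minimal sketch:

```cpp
#include <memory>
#include <rocksdb/cache.h>

// rocksdb::Cache can be resized while the DB is open: keep the handle
// passed to BlockBasedTableOptions::block_cache and call SetCapacity.
void resize_block_cache(const std::shared_ptr<rocksdb::Cache>& cache,
                        size_t new_bytes) {
  cache->SetCapacity(new_bytes);  // evicts immediately if shrinking
}
```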
B: My hope would be that, within each of those halves, we could use the memory most effectively, and that it isn't necessary to move the division point that much. So, for example, if you have a bunch of hot RocksDB data and mostly cold RBD data, we'd have enough in the RocksDB part of the pool to fit all the indexes and filters in there. But if half the store is fully idle, then they'll get booted out and we'll be caching...
B: Well, so: you wrote this cache priority list, which I think makes sense. But it seems that if we can use the RocksDB interface to query the size of the indexes and filters, then we can do exactly that. We can put priority one as number one, and then two and three are sort of 50/50; that would be trivial to do, right? And then, if we wanted to get fancy, we could balance two and three by looking at the hit rates.
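Stock RocksDB properties can approximate that query; whether they map cleanly onto what BlueStore needs here is the open question, but a sketch would be:

```cpp
#include <cstdint>
#include <rocksdb/db.h>

// Approximate the index/filter footprint from stock DB properties.
uint64_t index_filter_bytes(rocksdb::DB* db) {
  uint64_t readers = 0, pinned = 0;
  // index/filter blocks held outside the block cache:
  db->GetIntProperty("rocksdb.estimate-table-readers-mem", &readers);
  // blocks pinned inside the block cache (indexes/filters, when
  // cache_index_and_filter_blocks and pinning are enabled):
  db->GetIntProperty("rocksdb.block-cache-pinned-usage", &pinned);
  return readers + pinned;
}
```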
A: Ben's comment is making me wonder, though; maybe he's right. Maybe an index and filter read isn't that bad if it doesn't happen very often, compared to the prospect of doing more onode reads. Maybe the BlueStore onode cache doing a miss there, maybe that's worse than the prospect of doing, like, an index and filter miss.
A: So then, I guess, there's the balancing act of: what's your likelihood of getting a filter miss? That means looking at each level, and the potential at each level of missing. You know, if your level-0 and level-1 filters are more likely to be cached than your level-5 filters, what's your likelihood of actually missing, and how does that balance out versus the...
H: I've done some additional profiling. I was curious why, on the write path on my development machine, I'm seeing a lot of impact from instruction caches, from data caches, from TLBs, from branch predictor buffers; from everything, literally everything in the CPU that requires a warm-up to work effectively. So I've switched to investigating things that could affect all of them, like syscalls.
H: However, the opposite situation can also appear, because futexes are also used to implement condition variables. So if you have a long pipeline divided into stages, with the synchronization between stages implemented on top of condition variables, you will face the load from the kernel side of the futexes as well. So...
H: In the default configuration for SSDs we have two worker threads, two workers per shard, and when we are picking a PG item to work on, we are locking it. We are not doing a try-lock and, if it's contended, getting something unrelated to do; no, we are sleeping, we are waiting, we get scheduled out.
D: I like this whole line of investigation. You know, in our Seastar OSD stuff we threw around swapping a simple lock for a set of lock-free primitives, and I know they're going in on Seastar anyway, but it would be useful to look seriously, up front, at what we could do in a short timeframe, even if we don't worry about...
D: They are a source of high latency; they're a source of scheduling, of wicked scheduling or something, and that's what scares us here. I'm very interested in this analysis because of that, but I think the solution is to try an experiment. The thing I will do next is experiment with ways to take those blocks and sleeps out.
C: Yeah, and I'd like to remind you that during the work on this op history optimization I also ran into problems working with condition variables, which simply made things slower than before I even tried to use them. So I think removing condition variables from at least some of the paths might be beneficial. I'm not sure how beneficial; it depends on the...
B: Yeah, so maybe there are opportunities there, but I'm trying to think of where the key contexts are. One is in the messenger; there's stuff in there that I think is not something we can very easily address. There's the one that Radoslaw mentioned, when we dequeue a PG, where we could choose another PG to work on.
B: Instead, maybe it's worth doing a simple proof of concept where you hack up one of the schedulers to give you the next event and re-queue the one that you didn't successfully lock, and just see if it makes a difference. The other one, though, is that the IO completions are put on a finisher, and those...
H: Still, I'm not sure, but I guess it could bring a really, really significant benefit. It would allow us to implement batching in a very, very transparent way, if we have three operations of the same kind and each of them can be divided into some stages.
H: Let's say we have operations A, B and C, each divisible into stages one, two and three. Then we could try to trade some latency for throughput: instead of going synchronously, I mean A1, A2, A3, and so on, we could get some reuse of CPU locality, like the instruction cache, by handling A1, B1, C1 and then switching to the other stage.
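A hypothetical sketch of that stage-major batching idea, with invented types; the point is only the loop structure:

```cpp
#include <functional>
#include <vector>

// Hypothetical sketch of stage-major batching: run stage 1 for every
// queued op, then stage 2, so each stage's code stays hot in the
// instruction cache instead of alternating stages per op.
struct Op {
  std::function<void()> stage1, stage2, stage3;
};

void run_stage_major(std::vector<Op>& batch) {
  for (auto& op : batch) op.stage1();  // A1, B1, C1, ...
  for (auto& op : batch) op.stage2();  // A2, B2, C2, ...
  for (auto& op : batch) op.stage3();  // A3, B3, C3, ...
}
```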
B: Well, coming back to your first point, I think the main opportunity that I would see for trying a try-lock and avoiding the futexes would be that dequeuing of a PG, so I think that's the proof of concept that I would try. And maybe, if your try-lock fails, just dequeue the next one and unconditionally do a blocking lock and wait on that one, and just move forward; just something simple. That should work most of the time, so start from there, or whatever makes sense.
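A hypothetical sketch of that proof of concept using plain std::mutex::try_lock; the types and queue structure are invented for illustration:

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

struct PG { std::mutex lock; /* queued work ... */ };

// Hypothetical proof of concept: skip contended PGs via try_lock and
// only fall back to one unconditional blocking lock if all are busy.
PG* dequeue(std::deque<PG*>& q) {
  for (std::size_t i = 0, n = q.size(); i < n; ++i) {
    PG* pg = q.front(); q.pop_front();
    if (pg->lock.try_lock()) return pg;  // uncontended: take it
    q.push_back(pg);                     // contended: re-queue, try next
  }
  if (q.empty()) return nullptr;
  PG* pg = q.front(); q.pop_front();
  pg->lock.lock();                       // blocking fallback, as described
  return pg;                             // caller unlocks when done
}
```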
H: From a very different bucket, a quick question I do have: what about the policy for backporting performance-related stuff? I have some pull requests that I back-ported to Luminous, but the branches are still waiting for some acknowledgment. I don't know whether we should... and what about tracking? We are lacking that.
B: As long as they're low-risk, then we should do it; we should definitely backport those things. So I think maybe just send an email with links to the pull requests, so we can make sure that they get reviewed and then merged for the next point release. That's fine; we're not shipping immediately, so they get plenty of testing in the Luminous branch. Okay.