From YouTube: 2018-Jun-21 :: Ceph Performance Weekly
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
Let's just take a quick look. I don't actually have to run in, like, ten minutes to go pick up a daemon, so I won't cut this short. See, it's an RGW one — that sounds fine. That's from Doug doing DFG stuff; let's see, that was for a fragmentation calculation. I think Adam is gonna take a look at that, or was supposed to. No, that's the aging tests.
Like, he's been doggedly tracking this, like, super obscure peering bug. I'm just curious whether it works.
But I think it's getting close. The last thing on the huge pages, Radek, is trying to figure out what page size to use.
Fortunately, the default huge page size looks reasonable for current processors. Basically, I guess it's quite rare to get something different than 2 megs. Maybe if someone is running an odd system with PSE-36, instead of physical address extension (PAE) or AMD64.
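As an aside, a minimal sketch (not from the call) of checking that default on a Linux box — assuming the usual /proc/meminfo source, which typically reports 2048 kB on x86-64:

    // Print the kernel's default huge page size; on most x86-64 systems this
    // shows "Hugepagesize: 2048 kB", i.e. the 2 MiB default discussed above.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
      std::ifstream meminfo("/proc/meminfo");
      for (std::string line; std::getline(meminfo, line); ) {
        if (line.rfind("Hugepagesize:", 0) == 0)
          std::cout << line << '\n';
      }
    }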
Okay. The only — my only concern with that throttle one is just making sure that we don't introduce a new wait race condition, where you end up with somebody who's waiting even though the throttle got dropped down. So we need to make sure there's no race between somebody trying to get the throttle and somebody putting the throttle, with something getting stuck. But I had one theory about where there might be an issue that I wasn't sure about — I don't know if you fixed it or not. Okay.
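A minimal sketch of the invariant being asked for here (my illustration, not the patch under discussion): waiters re-check their condition under the lock on every wakeup, and every put or limit change notifies them, so nobody can be left waiting after the throttle drops.

    // Waiters always re-evaluate the predicate under the mutex, so a racing
    // put() or set_limit() can never strand a waiter.
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class SimpleThrottle {
      std::mutex m;
      std::condition_variable cv;
      uint64_t limit;
      uint64_t current = 0;
    public:
      explicit SimpleThrottle(uint64_t l) : limit(l) {}
      void get(uint64_t c) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return current + c <= limit; });
        current += c;
      }
      void put(uint64_t c) {
        std::lock_guard<std::mutex> l(m);
        current -= c;
        cv.notify_all();  // wake every waiter; each re-checks its predicate
      }
      void set_limit(uint64_t l2) {
        std::lock_guard<std::mutex> l(m);
        limit = l2;
        cv.notify_all();  // a raised limit may unblock waiters immediately
      }
    };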
Sure, as long as — as long as I don't have to say a whole lot. I guess maybe I'll say something quick, at least. I did all the work on this, and it appears to more or less do what it says, which is good, but it won't work in its current form — it's... what?
There's no coalescing of anything, so we're just, like, packing this stuff into 4K chunks even though they're, like, 1K in size. And then there's also, you know, tons of 4K random I/O happening to all these different logs on disk, rather than, you know, appending these things in the transaction in the write-ahead log in RocksDB. So, at least in my test setup, it's looking like it's kind of a wash between, you know, doing it in RocksDB versus doing this in these random I/Os all over the place. Well...
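To make that overhead concrete, a worked sketch using the approximate numbers from the call (a ~1 KiB log entry padded out to a 4 KiB block):

    #include <cstdint>
    #include <iostream>

    int main() {
      const uint64_t block = 4096;
      const uint64_t entry = 1024;  // approximate PG log entry size from the call
      uint64_t padded = (entry + block - 1) / block * block;  // round up to 4 KiB
      std::cout << "padded write: " << padded << " bytes, overhead: "
                << (padded - entry) << " bytes ("
                << 100.0 * (padded - entry) / padded << "% of each I/O)\n";
      // prints: padded write: 4096 bytes, overhead: 3072 bytes (75% of each I/O)
    }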
The other thing I'm wondering, though — this leads me back to the question of, okay: it looks to me, just based on this, like getting all of this stuff out of RocksDB's database is beneficial, but we're paying a big penalty for having it. At least on this kind of hardware, we're paying a big penalty for having these tiny little PG log writes that aren't even 4K. Maybe it'd be better if it was, like, a 512-byte sector size and we were actually, like, you know, using multiple sectors. Yeah.
How big is that — like 128 bytes? Okay, that's nowhere near it; we're not gonna be batching up enough of them. That's nice if it writes every time, any...
Would something like this alternate scheme work, where you have a single log — a current active log — but then you kind of mark it immutable at some point, and you compact — you know, I guess in this case, if it was not RocksDB's other data, if it was just this stuff — then you just mark it immutable, and then once all the references to it have gone away, then you can delete it. You eat the space amplification, but never, never rewrite. I wonder — I wonder how something like that would do.
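A rough sketch of that scheme as I read it (the names are mine, not actual Ceph code): an append-only segment gets sealed, and is unlinked only once the last live reference to it is trimmed — space amplification in exchange for never rewriting the data.

    #include <cassert>
    #include <cstdio>

    struct LogSegment {
      int live_refs = 0;    // entries in this segment still referenced
      bool sealed = false;  // no further appends once marked immutable

      void append() { assert(!sealed); ++live_refs; }
      void seal()   { sealed = true; }

      // Called when a log entry in this segment is trimmed.
      void release() {
        if (--live_refs == 0 && sealed)
          std::puts("last reference gone; whole segment can be deleted");
      }
    };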
Yeah, we're worried about introducing other kinds of, like, asynchronous background work like that, because you always end up running into some threshold where you have to clean up at least as fast as you're writing. If you're doing the cleanup, maybe you're doing it constantly, with that constant overhead — which is kind of similar to doing it inline — and if you're not doing the cleanup, you're gonna have more variable latency later.
Yeah — I mean, just from the latency perspective, I think it might be worth doing this, even if it's a wash in terms of overall throughput for all these workloads. The other case where it might help would be RGW on hard disks.
So to me, it seems like the area where — like, the place where this could potentially really, really be good, right, would be if you've got a small amount of, like, ridiculously fast persistent storage — you know, NVDIMMs or, you know, Optane or whatever it is — especially if it's, like, reasonable to do cache-line-sized writes to it. Right, then, you know, the waste shouldn't be too bad if it's, like, 128 bytes or whatever, and then presumably the random nature of all...
I mean, even if it's padded out to 4K, I don't think it would matter that much if four random objects can fit in a 4K block. I think the issue is the code complexity, and I think we need to have both, because for hard disks we still want to put it in RocksDB — because of the extra I/O — and then, is that worth it?
What's interesting is the tail latency RocksDB has because of the compaction — it has much higher tail latency on writes. At least it was something like — or was it, like, 25 versus 9? Yeah, yeah — 25,000 versus 9,000 microseconds.
The other thing I'm wondering about here, too, is — okay, so, I mean, right now it's really suboptimal, right? You're shoving, like, 1K of data into a 4K write, whereas on something like — maybe on Optane — maybe you can really do, like, byte-addressable small writes faster. I don't know, I'm just asserting that, but maybe you can. Maybe you don't need to actually do, like, a padded 4K write.
The other thing I was wondering, too, is whether or not you might see a better situation if you did something where you're still coalescing the writes into one log — maybe with, or maybe without, the other RocksDB data — but then just marking those old logs as being immutable until all references to them have gone away, and then not compacting anything that you've marked as being, well, short-lived.
...RocksDB, or you have a custom BlueStore — or whatever-store — write-ahead log that is smart about — yeah, like, it's moved on, and what gets left is space amplification. Potentially bad space amplification, if you end up, like, in a pathological case where you end up with, like, one entry per log that's sticking around, or something ridiculous. But is it in practice that bad? Yeah — if it really did that in practice, you could compact; like, you could say, you know, every hour, compact all this ridiculous stuff that's sitting around. Yeah.
...if the thing's full, you just, like, don't write it and keep the logs around, because it's small, and then you try again the next time. Benefits...
Could it be something as simple, though, as: for certain classes of data — you have them flagged — you avoid compacting them in the first round, but then you keep those files around and you compact them in a second round that maybe is much longer-lasting, right? So for the first, default —
Yeah: default data, you compact once you've hit, you know, however many log files have filled up, in their current settings — the ones they have for saying, you know, you start compacting after two logs or three logs or whatever have filled up — and then for this other class, you start compacting after ten logs have filled up.
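A hedged sketch of that two-tier trigger (the names and numbers here are placeholders from the discussion, not real RocksDB or Ceph options): short-lived data simply gets a much lazier compaction trigger than default data.

    #include <cstdint>

    enum class DataClass { Default, ShortLived };

    struct CompactionPolicy {
      uint32_t default_trigger     = 3;   // compact default data after ~3 full logs
      uint32_t short_lived_trigger = 10;  // let short-lived data pile up longer

      bool should_compact(DataClass c, uint32_t full_logs) const {
        return full_logs >= (c == DataClass::Default ? default_trigger
                                                     : short_lived_trigger);
      }
    };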
...can make it — does that look better? Yeah? All right. So, the gist of this is that when you set BlueStore's cache size to something — let's say 3 gigabytes — you don't get really consistent memory usage for the OSD as reported by top or ps or anything that measures RSS memory. In some of the tests that I've done, like an RBD workload...
...you might end up with, like, roughly four and a half to five gigabytes of RSS memory usage; with an RGW workload that does, like, small object writes, you might end up with something closer to, like, seven and a half. I spent some time looking into madvise and tcmalloc and what it's doing, and when you release memory, all it's really doing is marking it madvise MADV_DONTNEED. So you end up with a bunch of memory that's unmapped, but there's no — as far as I can tell — and I do not claim...
...that this is right; I still am very, very confused as to what the kernel actually does. But it appears to me that the kernel may or may not reclaim those pages, and it may actually do it opportunistically when there's memory pressure, but otherwise just kind of leave them alone, sitting around. Again, I don't know that that's totally right. It definitely seems like it varies by platform — OS X may do something different than Linux. Linux, you know — theoretically I thought they were supposed to be reclaimed right away, but I don't know that that's necessarily true.
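For reference, a small sketch of the mechanism being described, using the gperftools MallocExtension API (my example, not the branch under discussion): ReleaseFreeMemory() madvise(MADV_DONTNEED)s free spans, and those bytes then show up as "unmapped" rather than leaving the process immediately.

    #include <gperftools/malloc_extension.h>
    #include <cstdio>

    int main() {
      MallocExtension* me = MallocExtension::instance();
      me->ReleaseFreeMemory();  // internally madvise(MADV_DONTNEED)s free pages

      size_t heap = 0, unmapped = 0;
      me->GetNumericProperty("generic.heap_size", &heap);
      me->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);

      // "Mapped" memory as defined later in the call: heap size minus the
      // pages we've told the kernel it may reclaim.
      std::printf("heap=%zu unmapped=%zu mapped=%zu\n",
                  heap, unmapped, heap - unmapped);
      return 0;
    }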
So anyway, the goal of this is to try to control the memory usage of the OSD by tuning the cache size in BlueStore based on some target. So in this case there's an option that's been added — I think, in the branch I have right now, it's called, like, osd memory soft cap or something like that, but maybe "target" is a better name, I don't know — and we try to then...
...tune the cache size so the overall memory usage of the OSD is around that target. Memory usage in this case I've defined as the amount of mapped memory: so that is basically the heap size of the process minus the unmapped memory, which is what you get once you do tcmalloc's release-memory call or whatever. So the branch appears to more or less be working. This is an RBD workload; the RGW numbers in the other tab aren't quite done yet.
I'm not finished pasting them in, but you can see, basically, at first, before any work is really being done — when it's just kind of doing this pre-fill or whatever — we aren't using very much memory yet. So this auto-tuning thing will set the — we started out, actually, at the very beginning, using the default flash cache size from BlueStore, which is around three gigs, but very quickly, since the mapped memory usage is low, we push that all the way up to the OSD target.
So, potentially, in a theoretical world, you could have the entire amount of memory be devoted to — the entire mapped memory going to — the BlueStore cache, which isn't real; but until we start using lots of memory, that's what it's setting it to, as kind of an upper bound. Once we start doing writes — like, real writes; this is, like, a pre-fill stage where they're four-megabyte writes — in this case the data caching in BlueStore is enabled, and so, very quickly...
That's what I'm calling it, but — that's, yeah, that's what it is: the heap size minus unmapped pages. Okay — if there's a better term for that, I'm happy to use it; that's just what I came up with in, like, you know, two seconds when I named it. And the actual RSS memory usage is, like, kind of variable between the amount of mapped memory and the heap size. So that's, to me, indicating that the kernel is, like, opportunistically reclaiming these things, but not guaranteed to do so.
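Putting the pieces together, a minimal sketch of the feedback loop being described (my illustration, not the actual branch): sample mapped memory each interval and nudge the BlueStore cache toward the target, clamped at the hard-coded floor mentioned later in the call.

    #include <algorithm>
    #include <cstdint>

    struct AutoTuner {
      uint64_t target;                 // e.g. the osd memory target, in bytes
      uint64_t cache_min = 128 << 20;  // hard-coded 128 MiB floor (see below)
      uint64_t cache = 3ull << 30;     // start from the configured default

      // mapped = heap size minus unmapped pages, sampled each interval.
      void tick(uint64_t mapped) {
        int64_t err = (int64_t)target - (int64_t)mapped;
        // Move the cache a fraction of the error each tick so it converges
        // instead of oscillating.
        int64_t next = (int64_t)cache + err / 4;
        cache = std::max<int64_t>(next, (int64_t)cache_min);
      }
    };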
Yeah — really, it'd be nice if the user was just saying, okay, here's how much memory I want to target for the OSD, and it just goes and tunes itself. I don't want to think about ratios; I don't want to think about what cache BlueStore has. Maybe — I don't know — maybe I want reporting on it, I don't know. Yeah.
And it'd be nice to know that, like, you know, 10% of it is being used for BlueStore, 90% for PG logs — that might be useful information for, like, understanding other things — but we can report all that stuff out. So I guess there are just a couple things that come to mind here. One is that right now, at the beginning, it immediately boosts the cache all the way up to the target. We could probably build in, like, a baseline that's a little bit more conservative, because probably that's not the case...
...here now. But yeah, you're right — I mean, it can overshoot a little bit. But, interestingly, at the beginning it doesn't overshoot as badly as it does during normal operation. Yeah — like, during normal operation, once you get into it, you've got RocksDB: when it compacts, it uses way more memory than it normally does, unless you set, like, the hard cap, which then can block writes. So RocksDB can, when it's compacting, read tons of stuff into the block cache and overshoot the block cache target.
Because what happens, if you target RSS, is the RSS usage doesn't go down — it remains, like, flat — but you keep decreasing the cache in an attempt to compensate for it, and then you end up with, like, no cache and a huge amount of unmapped pages, and it's super irritating. That's why I tried it first, so...
...of the target versus the RSS, to adjust what your factor is, and so on. You know, you would do this maybe, like, five times over the course of this whole time span: you, like, slightly adjust what your ratio of your mapped target — or your target to your actual RSS — is, because RSS is eventually tracking somewhat, right? Well...
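A sketch of that suggestion as I understand it (my interpretation, with made-up names): rather than chasing RSS directly, occasionally correct the mapped-memory target by the observed target-to-RSS ratio, damped so one noisy sample doesn't swing the cache.

    #include <cstdint>

    struct RssCorrection {
      double ratio = 1.0;  // mapped-target / observed-RSS, learned slowly

      uint64_t corrected_target(uint64_t user_target, uint64_t observed_rss,
                                uint64_t mapped_target) {
        // Update only a handful of times per run, and blend with the old
        // value, since RSS tracks mapped memory only loosely.
        double observed = (double)mapped_target / (double)observed_rss;
        ratio = 0.8 * ratio + 0.2 * observed;
        return (uint64_t)(user_target * ratio);
      }
    };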
The problem seems to — well, it seems to be dependent on memory pressure. Like, that's the feeling I'm getting right now: if there's no memory pressure, the kernel's just, like, whatever, you know, and you might end up with, like, a ton of RSS memory usage; whereas if it's under memory pressure, like, this number might actually be way, way closer, if my hunch...
This was on my dev box, so I can do that pretty easily, since I don't have a ridiculous amount of memory. I think the harder part, though, is we don't know what the high memory watermark is going to be. Like, if RocksDB ends up, like, way, way overshooting its cache size, the high-water mark could be, you know, maybe six gigs or something. If there's no memory pressure, then the RSS value...
Exactly, yeah — I think you're right. I think what we probably have to do is say that this will approximate your RSS memory usage under memory pressure, with a big asterisk, and have, like, an FAQ-type thing: why is my RSS higher than my target memory? Well, because there's no memory pressure on your system, and if you do something like X so that you have memory pressure, the kernel should reclaim it, and it'll converge towards what the target is. And...
This is everything except for the RSS memory usage with an RGW workload, and in this case, instead of using, like, seven and a half gigs or whatever of memory, we've tuned the cache size a little bit lower than it was on RBD. So on RBD the cache size was closer to, like, 2.8 gigs; on this one the cache size was down around, like, you know, 2.4 instead. But, interestingly, that seemed to improve things.
Right now I've got, like, a hard-coded minimum of, like, 128 megabytes or something ridiculous like that. But, you know, we can decide what we want that to be and make it an option; it's not a big deal. Sorry — for the cache size.
Okay, so, just going back to the beginning — the one user-facing question: should we just make it so that if you set osd target memory — if you set that option — then the BlueStore cache size is just completely ignored, and then we have two new options to go with it, osd target memory ssd and hdd, kind of like we do with the BlueStore cache size, and we just set a new sort of default memory footprint? I...
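A hedged sketch of the option fallback being proposed (the option names here are placeholders from the discussion, not settled Ceph option names): an explicit target wins and the legacy cache-size knob is ignored; otherwise the per-device-class default applies.

    #include <cstdint>
    #include <optional>

    struct OsdMemOpts {
      std::optional<uint64_t> osd_target_memory;    // user-facing knob
      uint64_t osd_target_memory_ssd = 4ull << 30;  // per-device-class defaults
      uint64_t osd_target_memory_hdd = 2ull << 30;  // (illustrative values)
      uint64_t bluestore_cache_size  = 3ull << 30;  // legacy knob

      uint64_t effective_target(bool is_hdd) const {
        if (osd_target_memory)           // explicit target wins; the BlueStore
          return *osd_target_memory;     // cache size is then ignored entirely
        return is_hdd ? osd_target_memory_hdd : osd_target_memory_ssd;
      }
    };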
I'd say let's — let's make it consistent with all the other stuff that auto-tunes. Right now the auto-tuner will take the default ratios and start out at them, but then it will change them. So if we're doing that for those, I think we should do the same thing here. But if we want to just ignore it completely, then we should have, like, everything ignored completely, and just, like, start out with — I don't know — some...
I mean, I think the ratios still make sense within the BlueStore bucket, because it's trying to figure out what the starting point is — or how to divide the memory; we don't know that yet. Eventually, maybe those will become obsolete, because the cost-based cache auto-tuning stuff that you're talking about in that other discussion, or whatever, would work.
But I think the goal should be that this is the only option that they set, right, and everything else is just using the default ratios, or is auto-tuned. And in this case, the main thing that's being traded off against, like, BlueStore is just, like, PG logs — and we have options for PG logs, but they're not changed at all; they're, like, fixed, and we're not changing that immediately. So for now, it just doesn't make sense for the BlueStore cache size to do anything.
I think that's the main thing I would change: make it so that until — until we, like, have reached our full memory usage, I would, like, have a conservative bound of, like, only eighty percent goes to BlueStore. Or you can look at the mempools and, like, see whether they're all — they're fully populated. If the mempools are only at 10 percent of their configured capacity — the PG log, say — then we should also be conservative about cranking up BlueStore, because we know we're gonna have to claw it back anyway.
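A sketch of that conservative-baseline idea (illustrative only; the 80% and 90% figures come from the discussion or are my stand-ins): cap BlueStore's share of the budget while the mempools are still mostly empty, since memory handed to the cache now has to be clawed back later.

    #include <cstdint>

    uint64_t conservative_cache_target(uint64_t osd_target,
                                       uint64_t mempool_bytes,
                                       uint64_t mempool_expected) {
      double fullness = (double)mempool_bytes / (double)mempool_expected;
      if (fullness < 0.9) {
        // Mempools (PG logs etc.) haven't reached steady state yet, so only
        // let BlueStore take, say, 80% of the remaining budget.
        return (uint64_t)(0.8 * (osd_target - mempool_expected));
      }
      return osd_target - mempool_bytes;  // steady state: cache gets the rest
    }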
Yeah, whatever — it could just be something like that, I guess, or even, like, 25% of it or something. Like you said, if you just happen to, like, hit the cache before the first iteration, right — when the thread is starting up, the mempool thread in BlueStore — you know, as soon as you enter that while loop, it's gonna, like, adjust it and change it. But, you know, if, between the time when that happens, the cache, like, overshoots, they could — I guess you could end up — and...