From YouTube: Ceph Performance Meeting 2022-01-27
A
Okay, so a little bit of a quiet week compared to previous weeks. The Quincy freeze has happened; I think people are still trying to get some things in, but there's a little bit less movement for PRs on the performance front. Ronen has a PR related to scrub that he emailed me about earlier this week, basically just changing some of the chunking that happens for this. If I remember correctly, we don't get the chunk size... oh, good, you're here, I didn't see you, sorry.
B
Go ahead, you talk. Okay, the basic idea is simple: there is a configuration parameter (actually two, let's say) for the size of the chunks we are using when scrubbing. For each chunk, we are requesting the maps, the scrub maps, for that chunk from the secondaries, from the replicas.
B
Now the chunk size is currently the default, 25, which is very small when considering a PG that might include millions of objects; we have an example of a million and a half. What I suggested is that we should separate the chunk sizes between deep scrubs and regular shallow scrubs, and allow larger chunks for shallow scrubs, since we're assuming that a regular scrub has less effect on the amount of I/O, or the effort invested.
B
I think the idea is accepted, apart from the fact that we need to make sure that we do not create a problem, a latency issue, for regular client requests. And I would remind everyone that, up to a point, a client request preempts a running scrub, up to, I think, five times; it's a configuration point.
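(For reference, a rough sketch of the knobs in question. The shallow/deep split name comes from the PR under discussion, so treat it as tentative; the other two are existing options with these defaults:)

    # ceph.conf sketch, or `ceph config set osd ...`
    osd_scrub_chunk_max = 25            # today's default chunk size, shared by both scrub types
    osd_shallow_scrub_chunk_max = 100   # proposed: larger chunks for the cheaper shallow scrub
    osd_scrub_max_preemptions = 5       # a client op may preempt a running scrub this many times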
A
And I know that that PR for adding scrub testing to CBT is pretty big, but did you look it over? Does it look like it would be helpful for you for testing?
A
I was just going to offer help too. I think you can probably help him more than I can at this point, since you know that code much better than I do. But maybe, as part of this, we can get that merged, which is very much my fault, I'm sorry; but it sounds like it's a very useful PR, so maybe this is a good test case to actually do that.
C
Yeah, I had actually opened it in draft state because I had been making a lot of changes to it, but I also wanted to share it with Sridhar to show him the way I was testing scrub. But I think now I can raise an actual PR, because that's what we've been using for QoS testing, and it does what Ronen wants to do, or run scrub with client I/O, and maybe we can see what stats he needs and what we can do about that.
D
Go ahead. Sorry, no, I just said the same: it's great!

A
Okay, great, wonderful, all right! Let's see; well, moving on, then.
A
I
saw
was
this:
is
setting
tracing
to
be
in
compiled
by
default?
I
don't
think
deepika
is
here.
The
the
gist
of
it
is
that
there
is
a
little
bit
of
performance
overhead
by
doing
this,
it
sounds
like,
but
but
not
too
bad,
and
the
benefit
would
be
that
we
could
have
users
and
and
customers
more
easily
be
able
to
set
tracing
to
be
enabled
when
they
run
into
a
problem.
A
And of course there is overhead when the tracing is enabled, but with it simply being compiled in, it sounds like the impact is quite low; if I remember correctly from the PR, maybe it's around one percent or less. So that potentially could be worth doing, and I think Deepika reviewed it. I don't know if that's set now to be enabled, if we're actually planning to merge it or not, I guess, but anyway, that's there.
E
So I've been following this one. Originally, when they were doing benchmarks, they were seeing a big performance hit on the OSD, because there are a lot of shared pointers that come with each of these spans for these traces, and this PR is basically removing a ton of spans from the OSD.
A
Would you say a lot less detail? Like, how much are you talking, Casey?
E
I don't know exactly how many sub-spans there were initially; it might have been like 20, or a dozen maybe, and now I think there's just one or two.
G
Okay, yeah, I think in the short term it's okay if you're getting some of those, but we'll probably want those back in the longer term. I guess there are just kind of two audiences that this is targeting, and I think Yuval's mainly been focusing on traces that make sense from the user's perspective; but from a developer's perspective, for being able to profile things and pinpoint where things are going wrong, we're definitely going to want more than two spans in the OSD. So we'll figure that out in the future.
G
But for the short term, I think it's okay if we reduce that just to get tracing into users' hands, because today it's not even compiled in by default.
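(For context, a sketch of what that looks like today: compiling the tracing support in is a cmake switch, and the runtime toggle is a config option; flag names as I recall them, so verify against the tree:)

    ./do_cmake.sh -DWITH_JAEGER=ON                      # build with tracing compiled in
    ceph config set global jaeger_tracing_enable true   # turn the tracing itself on at runtime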
A
Okay. And you think we could do that? We could have variable spans without introducing any additional compile-time overhead, or not? Sorry, one at a time.
G
Yuval's been discussing with the upstream libraries about how to get rid of the shared-pointer piece. I think that if that's eliminated in the interface, it wouldn't matter how many spans we had at compile time or runtime; but we need to add the ability to configure those at the Ceph layer, to turn them on and off, if we want them to have different views for different use cases.
A
Well, neat. I mean, it's exciting, right, because, you know, having good tracing and such would be amazing, and we've talked about it for years, so this is... yeah, yeah.
A
Cool, all right. Well, sounds like we're making progress on it, at the very least. That's really good!
A
So, nothing closed this week that I saw; please let me know if I missed it, but I didn't see anything performance-related anyway. Updated, though: it looks like this PR, Ronen, that you reviewed, around using thread-local pointer variables to save the shard, that's now in retesting, so I think you had approved that previously. Any other comments on that one?
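(A minimal sketch of the thread-local caching idea with illustrative names, not the actual PR code: each worker thread caches the shard pointer it resolved last, so the hot path skips the repeated lookup.)

    struct Shard { int id; };           // stand-in for per-shard OSD state
    static Shard shards[8];             // stand-in for the sharded structure

    static Shard* lookup_shard(int shard_id) {   // the existing, slower path
      return &shards[shard_id % 8];
    }

    static thread_local Shard* cached_shard = nullptr;

    Shard* get_shard(int shard_id) {
      if (cached_shard == nullptr)
        cached_shard = lookup_shard(shard_id);   // lookup paid once per thread
      return cached_shard;
    }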
A
All right, let's see: Radek's PR for introducing huge-page-based read buffers. I don't think Radek's here. I think it's merged.
A
All right, next. This actually probably should be "no movement": that first pass at the omap bench test, Neha, I apologize, is still just sitting on the back burner, but...
A
Yeah, having said that, it's actually something that would have been really nice right now for Quincy testing. I'm really tempted to actually go through and run it anyway, like just apply the PR and then run it for these tests. Yeah.
A
Yeah, I still, you know... if we make it generic for existing OSDs, it's just going to have to look really, really different than it looks now, whereas we could just separate this off into, like, a separate gtest tool or something like that that looks very similar to what we've got right now, and then just, you know, have it there. We could even pretty easily backport it all the way to Nautilus.
H
Yeah, sorry. Well, so yeah, it looks like it's ready for review, so I'm calling for reviewers. And, well, I did another run, which looks good to me; just a few tests that dropped, which I'm double-checking.
A
All right, very good, very good. That was it; that was all I saw for this week. I know everyone's working hard on Quincy, so that's completely understandable, but we haven't seen some of these other ones. Anything I missed from anybody?
A
All right, if not, then the discussion topic I have for this week: I put it in the etherpad, so, you know, feel free to take a look here, but the gist of it is that we are trying to do some testing for Quincy on our...
A
We have a fairly decent number of AMD Rome nodes in-house now, we have ten of them, so we're using those for this release, for going back and doing some comparison tests to previous releases, and we're seeing some different behavior than we've seen in the past with our Intel nodes. For reference, I put a link to some of our previous tests on our older Intel nodes, where we were seeing really consistent improvement going from Nautilus to Octopus to, at that time, master, which was, you know, kind of the movement toward Pacific.
A
This is a little complicated because we're also looking at one versus two OSDs per device on these tests, which was requested at the time; but nevertheless, we're seeing general improvement there for the most part. On some of the new testing that I've just been doing on our AMD nodes, our newer AMD nodes, sometimes we're seeing improvement, and in some cases we're seeing what appears to be fairly significant regressions going to Octopus and Pacific.
A
I was saying in the core meeting that, depending on how you look at this, this is good news or bad news. We've seen some reports on the mailing list of people saying that Pacific was slower for them than Nautilus, and we were not able to reproduce it in-house on our Intel machines; and in fact, you know, this has been kind of the case both in our own analysis on our test cluster, along with tests that have been done by the DFG workload team and other folks.
A
So maybe the good news here is that maybe this is now allowing us to reproduce some of this, so we can go back and figure out what it was that did this. It's sidetracking us a little bit from the Quincy testing, which is what the real purpose of all of this was; but, you know, it's never too late to go back and figure out what you might have done wrong. So, you know, hopefully this will help us understand...
A
...if maybe there are some significant differences between these two different test platforms that we now have access to. Probably before I dig back into this, I'll run some initial tests on the Quincy freeze, just to see how we're comparing. I suspect that some of these tests are going to look better due to Gabi's work on reducing the amount of data that we store in RocksDB.
A
That seemed to be a pretty big win in the write path, especially for small random writes, and that's one of those cases where we did see some regression, specifically going from Nautilus to Octopus on this platform, so we'll see what happens there. But in any event, I wanted to share this with folks.
A
So they can see, you know, kind of what I'm seeing on the ground right now. It's possible that this could change over the course of the next week or two as I try to dig into what's going on, but that's kind of what I'm seeing. Any questions on any of this?
A
One thing I want to mention: I don't think this is due to BlueFS buffered I/O. It looks like we backported those changes to all the different releases that I've tested here, so I believe we're using buffered I/O in all cases, not direct I/O.
A
All right. Well, then, Alex, would you like to take over and talk about what you're seeing with TTL and RocksDB?
I
What we've noticed over the last few months is that we see the latency steadily increase, and no matter how much IOPS we have in the back end, it happens, right? We have some clusters with SSD index, some with NVMe index, and it still happens. And the problem was also exacerbated while we were trying to delete some very old and extremely large shards. They are so large, in fact, that we cannot use the radosgw-admin command for them, because they would impact the whole cluster.
I
So what we've been doing is deleting N keys at a time every X seconds, and as we're doing that, we're also seeing a latency increase.
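(A sketch of that throttled deletion, with placeholder pool/object names and batch sizes:)

    # remove N omap keys from the oversized index object, pause X seconds, repeat
    rados -p "$POOL" listomapkeys "$OBJ" | head -n "$N" | while read -r key; do
        rados -p "$POOL" rmomapkey "$OBJ" "$key"
    done
    sleep "$X"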
I
So what has been happening in our production environment is that at some point in time, one index OSD starts to pile up slow requests, and that blocks our entire cluster after a while. So we've been looking at logs for it, and we kept seeing slowness in the omap iterator, taking, like, over 10 seconds to list a few keys. And so we started to look at that, and found out that it was extremely likely caused by tombstones.
I
It's such a problem that even on clusters where we don't have this large shard deletion, we have to compact three times a day, and even then, as customer workloads start to ramp up, the compaction...
I
...times out; like, we hit the OSD command-thread timeout, or suicide timeout, or something like that, which causes further issues. So over time it gets worse and worse and worse. So we looked at RocksDB and found these two options: one that is available starting with Pacific, periodic_compaction_seconds, which forces compaction regularly; but because we are still on Nautilus in production...
I
...we started to look instead at TTL, which will, also periodically, look for the tombstones and trigger compaction to remove them. So I've run some benchmarks on that. Let me...
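(For reference, both options can be fed through BlueStore's RocksDB option string; a sketch with illustrative values, appended to whatever options are already set:)

    bluestore_rocksdb_options = <existing options>,ttl=21600                            # 6 hours
    # or, where the bundled RocksDB supports it (Pacific defaults this one to 30 days):
    bluestore_rocksdb_options = <existing options>,periodic_compaction_seconds=2592000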
I
Sorry, here. Oh, so before the benchmark: I think I've seen this in the mailing list as well, many times, people having this issue with the index, and the recommendation so far has always been "oh, compact regularly". But we also found out that on our older cluster that used SSDs, a live compaction of an OSD would just bring the OSD down.
I
So that's why we started to look at different options, and TTL seemed to be beneficial. So I ran some benchmarks; they are very specific to our environment, and that's why I want to have a discussion and see if people would find this valuable, or if we need to run something else for you guys. So in the benchmark, we delete ten thousand keys at a time, at up to ten thousand keys per second, and you can see, with the default options of Ceph...
I
...you can see that latency increases steadily over time, and then there is a compaction event, I assume, and it goes down, but stays far from the baseline where it started, and then goes back up; and you kind of have a trend where the latency slowly increases over time.
I
Whereas if you use TTL, latency still steadily increases, but after a while it just drops back down to the nominal value that you had at the beginning of the benchmark. As for other tests that are more... it's still not very production-like, but basically what I'm trying to do with this test is just generate the worst-case scenario possible, just to amplify the issue as much as possible, to better understand it.
I
So when the TTL compaction starts to run, you see this huge spike in latency, and then, when it's done, everything goes back down quite nicely. And I think the spike is just due to the concurrency of it all, because in the benchmark I'm basically running a single pool, a single writer, one object with a large omap; so yeah, pretty much the worst-case scenario. And so it's been beneficial for us on the index. I have a colleague, Josh, Josh Baergen; they're deployed in production, we can talk about that as well.
I
It's been very, yeah, very beneficial. We set it at, I think, six hours. I don't think it's going to be that beneficial for non-RGW workloads to set it at, like, six hours, but I did notice that Pacific by default, well, the RocksDB version in Pacific by default, has it enabled at 30 days. So I don't know if that helps with pg log and all of that. Yeah, I mean, Josh, if you're here, do you want to talk about the production environments?
J
Yeah, you can hear me, right? Great, okay. Yeah, so we've been trying this in production; it's helped quite a bit. We actually did have some clusters that were NVMe-index-based where we were running compaction regularly. They were stable enough to do compaction, but their compactions were starting to go up to, like, the 10-15 minute mark for a run, and were steadily increasing over time. So that just was not a sustainable approach.
J
We actually tried TTL at 30 minutes to start with, and that worked fine, but the biggest downside we've had from TTL so far is the write amp, which is not surprising; that's basically what the RocksDB docs warn about, that your write amp goes up, and with, like, a 30-minute TTL...
J
...our write amp was such that it would probably torch an SSD in about three years, just due to drive writes per day. So dropping that down from 30 minutes to six hours reduced the write amp by about a factor of five, so there's tons of headroom there at that point, and it's actually been really stable, too, on our SSDs.
J
On some of our bigger indexes, it probably took about two hours... no, I want to say more like three or four hours for TTL to actually clean up the database of all the stuff that was accumulating over the years that it's been running; and when it did that, there are several cases where we've seen latencies go down for index operations. But I guess the more important thing is, and this is something we're going to be learning soon:
J
...some of our big deletion operations, cleanup operations: are they going to be better? We literally rolled this out to our worst-case cluster yesterday, or finished yesterday, so we just don't have that data yet. But in terms of stability, it's been looking really good, and it's outright replaced our compaction jobs.
A
Do you happen to know, compared to the default TTL, what kind of write-amp increase you saw with, like, six hours?
A
Or, for whatever you've measured: what did changing that do to your write amp?
J
Like, it absolutely did. So I actually don't have the number for, like, no TTL versus TTL, what the write amp looks like, but it definitely increases significantly. Basically, the drives were sitting at 0.01 drive writes per day; and so, if you imagine dropping down from 0.01 drive writes per day to whatever is torching a drive in three years, and doing the math in my head right now, it was probably well over a 10x.

A
Wow, okay, okay.

J
But the thing is, our, like...
J
The issue is not disk access; the issue was entirely the CPU usage incurred by accumulating tombstones over time.
J
Yeah, and what I see in the graphs is, like, we'll get bursts of compaction now with TTL. I set it at six hours, so obviously you're getting a burst at least every six hours; if you've got a steady workload coming in, it's maybe a little bit more often than that, but it'll run for maybe 10 minutes, 20 minutes of compaction every once in a while. Which was the other thing I was going to say, right, yeah: the other thing is, I don't think this is good enough.
J
I've really been wanting to turn off BlueFS buffered I/O when I can; the buffered I/O actually really hurts us. We've been turning it off in as many clusters as we can, because, I think I've mentioned this at this meeting before, but once you have a dm-crypt layer in there, it actually causes, like, a huge IOPS amplification for writes, and that really hurts us, especially for our SSD-based systems, where they just can't take that level of IOPS.
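(The switch in question, for reference; whether it helps is workload-dependent, per the read-ahead caveat that follows:)

    ceph config set osd bluefs_buffered_io false   # have BlueFS use direct I/O instead of the page cache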
A
We really need to figure out why RocksDB is not properly doing read-ahead and, like, reading from its cache when it should be, and is instead relying on the page cache for it; it's ridiculous. But if we could fix that, we could just get rid of buffered I/O, I think, and go back to doing direct I/O.
J
Yeah, and it's something that I would love for us to be able to spend some time on too. But until we're at Pacific (we're still on Nautilus), until we're at Pacific, I don't really want to be spending that much time on it yet.
A
Well, hey, this is fantastic. This is really, really good. This is the kind of stuff that we don't see as often, right, because we don't have the kinds of clusters that you guys do. So this is excellent. I think the trick will be figuring out what the right balance by default is: you know, doing regular compactions like this, versus write amp, versus, you know...
A
...you know, what makes sense for people. But certainly more so than what we have right now; it sounds like it's very reasonable, yeah.
J
And it's so workload-dependent, right? I mean, like Alex said, we do have workloads that are just so delete-and-replace-heavy that this makes a big difference. We saw disk usage drop by 10 to 20 percent when we turned this setting on, so that tells you how much redundant data, or undeleted data, there is sitting in our databases. Not everybody's going to have a workload like that: if your workload is write-once, read-many, you're not going to benefit from TTL at all.
J
Whereas for them, their list performance just gets brutal, because they get into this cycle; and the thing is, RocksDB is only compacting away tombstones when it has to, when a level is full or whatever, right? And if they're not deleting enough, maybe the files that hold their tombstones aren't even getting attention; it could be other files that are actually the ones being compacted. So something like this at least gives us a guarantee:
J
...no tombstone is going to last longer than six hours. Six hours is probably too aggressive for standard workloads, because, like, if you don't actually see symptoms for a few days, you could run with, like, a one-day TTL, or even a three-day TTL, or whatever, right? It doesn't have to be six hours. It's just that, for some of the things that we want to do on these indexes, six hours is what makes sense for us. I'd love to do it at 30 minutes, but, like I said, the write amp is way too high.
A
Yeah, when we were seeing really, really bad behavior with the, like, bulk delete stuff a while ago, I think, Igor, we had... we talked about even, like, trying to do compaction on, like, an iteration basis, right? Like, you iterate over stuff, and maybe...
A
...or delete stuff, right? Maybe after some number of ops you end up going back and doing compaction, rather than basing it on the amount of data that you have waiting to compact. But I don't know if that makes more sense than just doing, like, a TTL-type thing.
J
You know, yeah. One of the things that Alex, and then another one of our colleagues, Matt Vandermeulen, looked into a little bit is that there are RocksDB calls for saying, like, compact over a range; or, if you do a deletion... I can't remember how it works. It's something like: if you do a deletion over a range, you can tell it to, like, compact all the way down through all the levels, or something like that.
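(The call being referred to here is presumably RocksDB's CompactRange; roughly, a sketch:)

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    // Compact just the deleted key range, forcing the bottommost level so the
    // range's tombstones are actually dropped rather than left behind.
    void compact_deleted_range(rocksdb::DB* db,
                               const rocksdb::Slice& begin,
                               const rocksdb::Slice& end) {
      rocksdb::CompactRangeOptions opts;
      opts.bottommost_level_compaction =
          rocksdb::BottommostLevelCompaction::kForce;
      db->CompactRange(opts, &begin, &end);
    }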
J
It didn't sound like as much of a guarantee as you'd think it might be, and the issue is that your client has to somehow give that hint, and that's not always possible, because it's not like RGW is reaching into the internals of RocksDB on the index OSDs, right? And so I think it gets really complicated to try to do stuff like that.
J
So, I don't know if periodic compaction is actually just running a filter and then deciding whether or not it should compact a file; and if that's true, maybe it has less write amp. Like, I have a hard time, from the documentation, predicting what RocksDB settings are going to do, so we also need to try it and see what happens.
A
You guys might frankly be more expert on this than anyone; I mean, like, Adam and I have tried to kind of look into some of it, but not a ton, and already you're teasing out details here that, I think, you know, we don't necessarily have any more expertise on. So, you know, absolutely come and report what you find, because this is really good.
L
One point: we worked with a startup called Speedb, which currently provides a RocksDB drop-in replacement; they promise to open-source it, yeah. And during our work, they showed tremendous improvement on a RocksDB benchmark. But when we put it into Ceph, we saw that it didn't help us, because RocksDB wasn't the bottleneck in what we were doing; but they had, like, a 10x performance improvement in RocksDB.
L
Doing just RocksDB benchmarks, it could be that this is a case where such a thing could help; because, you know, when RocksDB becomes the bottleneck... we just couldn't find a real test case. We tried to find a real use case where RocksDB was the bottleneck and their improvement actually made a difference.
A
I talked to those guys just a little bit; Adam did most of the work talking to them and working with them. I don't remember hearing, when I talked to them, that this was something that they had tackled, you know, as a big performance bottleneck compared to typical RocksDB: specifically looking at the compaction behavior and the performance of tombstones in cases where you have a lot of accumulated junk.
L
I talked to them a lot, but I don't... yeah, I don't know whether, when I talked to them, we had coordinated the work on this; I don't have a clue whether they tackled this problem. I know that they claim that they reduce their write amplification significantly. So if one of the problems here is write amplification, because of what we are doing, they claim that they have a significant improvement in this, again.
A
Our write amplification already... that's kind of the whole reason why we've got these giant memtables and giant write-ahead-log buffers, right? You know, there's always this desire to make them smaller, which I completely understand and agree with, except for the fact that, because of the way that pg log works, we end up with, you know, all of this temporary data that gets moved into level zero and moved into the database...
A
...if you have very small buffers. So we've kind of, like, you know, gone about as extreme as we can in this direction of trying to reduce write amplification, and now it's kind of the question of, well, how much do we bring back to make the behavior nice? If they can showcase that they can introduce all of these nice behaviors that we want while keeping the write amplification low, you know, that is valuable. But that's the big question.
L
If I'm correct (it was some time ago, and I don't remember all the details), I think that what they actually did was implement the compaction differently, in a more efficient way. So, just from the sound of what's going on, because I didn't fully understand the exact problem that Josh and Alex explained, but just from the sound of it, it seems like it touches the same points. But, you know, we tried it in the past.
L
I'm not sure that it works with the version that you'd see in Nautilus, so it could be that. But maybe they will do it for us, because they're in a good relationship with us; maybe, if we give them a version, they will build that version for us. But if you think it's worth testing or checking this, I could start talking with them and see what we can do.

A
Do you know what their timeline is for open-sourcing it?
L
No, I... and I can check. I can check it, but I don't know. They had several meetings with us, us as Red Hat, on what economic models they could use, what the benefit is, how they'd do it, all these kinds of things. They have some grand plan to do something much larger than this RocksDB improvement, and they have a lot of incentive to open-source it, but I'm not updated, at least for the last three months, so I'll need to check, sure.
A
And Adam might... I think he's on PTO this week, but I think he's back next week, so I'm sure he'll have lots of feedback to give on what he saw as well. So maybe we should wait until he's back, so we can get his feedback.
L
Okay, maybe you could raise this, you know, put it on the agenda for, you know, two weeks from now, and we'll try to figure out what's going on. Yeah, yeah.
A
We usually... I do this kind of week by week. Would you just send me a reminder, if you wanted it for the week after this week? Okay, cool, cool.
I
So basically you're asking about running it where it runs but there is no tombstone to compact? Yeah, yeah, that's a good question; we haven't thought of observing that.
J
Yeah, we've been entirely focused on our index use case, so we haven't looked at, like, what happens in an RBD cluster if we implement this, for example; and I don't think we would, as the write amp would be unacceptable there for drive life and that sort of thing, we think. But yeah, no, it's a good question: we don't know what happens with a, quote-unquote, normal workload, where there's low tombstoning or no tombstoning and just writes of unique data.
J
Yes, like, my very rough understanding from the docs, and again, I don't know how much I trust the docs versus, like, digging into the code, but basically, when a file gets written in a level that's not the bottom level, it gets a timestamp, and then the TTL is just, like, a periodic check of that timestamp.
J
So if you compact anyway, if the file gets shoved down a level or something like that, then it's likely that timestamp would change, and it's not going to get TTLed. Now, like I said, almost everything seems to end up in the bottom level. So what I don't know, like, the big question in my head, is: is the timestamp when that file came into existence, or is it a timestamp of when that file was last updated in some way? Right, that's a fairly critical difference.
J
So we were concerned that, because we're going to go and turn this on on an OSD that's been running for three years, hours of TTL compaction would basically starve level-size-based compaction. But it doesn't: if you look at the loop, it basically says, do I need to compact for this reason, for this reason; and at the very bottom it's TTL: I've got no other work to do, okay, I'm going to schedule some background compaction.
A
Yeah, keep us in the loop on this. This is super, super good.
J
Yeah, for sure. We'll be gaining more experience with this in the coming weeks as we start to re-enable some of our more delete-heavy workloads and see how it actually does in the wild. Cool.
A
All right, well, we've got about five minutes before we're at the hour. Anything else from anyone this week?
A
All right, well then, have an excellent week, everyone. Thank you so much for coming, and we'll talk again next week. Bye, guys, bye.