From YouTube: Ceph Performance Meeting 2022-07-28
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A
All right, I don't think we'll get them for a little while yet. Braddock hasn't joined yet, so I think they're going to be a little while, but we can get this thing started now and they'll show up when they do. Okay, so I'll confess, I did not really get through the PRs this morning; I started late.
A
It's a work-in-progress PR for kind of trying to retune our RocksDB settings, with the focus on keeping write amplification at least close to what we currently have, while hopefully both improving performance and, the big one, improving tombstone behavior: the performance when we are iterating, especially doing these kind of crazy iterate-delete cycles that we tend to do, which RocksDB is really, really bad at, right.
A
Now, when you do this, you start iterating over tombstones that have not yet been compacted, and everything slows way down and becomes awful. So right now it's just some settings that are changing, and adding the ability to use RocksDB's ability to compact on iteration, which they added at some point.
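For reference, this is roughly what enabling RocksDB's deletion-triggered compaction support looks like, assuming the NewCompactOnDeletionCollectorFactory utility; the window and trigger values here are illustrative, not the ones proposed in the PR:

```cpp
// Sketch only: enable RocksDB's deletion-triggered compaction on a
// column family. When an SST file being written contains at least
// `deletion_trigger` tombstones within any `sliding_window_size`
// consecutive entries, RocksDB marks that file as needing compaction,
// so tombstone-heavy ranges get cleaned up without waiting for the
// normal size-based triggers. Thresholds below are illustrative.
#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

rocksdb::ColumnFamilyOptions MakeTombstoneAwareCfOptions() {
  rocksdb::ColumnFamilyOptions cf_opts;
  const size_t sliding_window_size = 32768;  // entries examined per window
  const size_t deletion_trigger    = 16384;  // tombstones that mark the file
  cf_opts.table_properties_collector_factories.emplace_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(sliding_window_size,
                                                    deletion_trigger));
  return cf_opts;
}
```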
A
I don't know when, but it had support for it. The big piece of this, though, would be to actually track deletes on a per-column-family basis, and then some locking and trickery to basically, once a transaction has been committed, increment a counter for the number of deletes that have been successfully committed on a per-column-family basis, and then only trigger a compaction and a flush for that column family.
A
The memtable flush is the one that we don't have anything for right now. Right now, in memtables in RocksDB you can accumulate tombstones, and if you have to iterate over them there's really no way to fix that, short of doing something like this where you actually track it yourself and then manually issue flushes, so...
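A minimal sketch of the kind of per-column-family delete tracking being described, with hypothetical names and thresholds; the actual work-in-progress PR may do this differently:

```cpp
// Sketch only (not the work-in-progress PR): count tombstones per
// column family and, once enough deletes have been committed, manually
// flush the memtable and compact just that column family so tombstones
// cannot pile up un-iterable in the memtable. The trigger value is
// purely illustrative.
#include <cstdint>
#include <rocksdb/db.h>

struct ColumnFamilyDeleteTracker {
  rocksdb::ColumnFamilyHandle* cf = nullptr;
  uint64_t committed_deletes = 0;   // deletes whose transaction committed
  uint64_t trigger = 64 * 1024;     // illustrative threshold

  // Called after a transaction that removed `n` keys in this column
  // family has been successfully committed.
  void note_committed_deletes(rocksdb::DB* db, uint64_t n) {
    committed_deletes += n;
    if (committed_deletes < trigger) {
      return;
    }
    // Flush first so tombstones still sitting in the memtable reach an
    // SST file, then compact the column family to drop them.
    db->Flush(rocksdb::FlushOptions(), cf);
    db->CompactRange(rocksdb::CompactRangeOptions(), cf,
                     /*begin=*/nullptr, /*end=*/nullptr);
    committed_deletes = 0;
  }
};
```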
B
Yeah, I think for some of the tables the use case for tombstones is limited. So for the onode, there are really few cases where we delete onodes. One of them is snap trim, and I mean deleting them en masse; the second one is if we do a remove for a volume or for a collection. Other than that, I don't think we delete many objects. I mean we could do it from time to time, but the cases where we do it on a large scale are when we delete a file system or a whole volume.
B
That's a different story, but I'm talking about the onode, because that matters also. But I think, for the onode, we could trigger the compaction from those services, which tend to be background, because deletion of a volume, a file system, or snap trim, all of these phases are not performance-critical. And so whenever we start or finish, we could from time to time just add some calls to the compaction; on the normal flow we shouldn't see this happen.
A
So right now we don't expose the ability to do per-column-family compactions or flushes outside of the RocksDB KV store.
A
We could maybe add that, but we end up in the same kind of situation, I think. Well, you might be able to avoid locking in that case, but in any event we don't have that exposed right now, so we'd need to add code into the RocksDB KV store to let you do a per-column-family compaction or flush. And then, still, I'm a little nervous about letting people just reach in and kind of issue those themselves. I don't know that we necessarily want that as part of the interface,
A
or if we want the RocksDB KV store code to kind of handle it itself.
A
So what we can do inside the glue code is... I think we can do this: I think we can basically register a listener with RocksDB to say, I want to know when this column family has been compacted, when, like, a compaction event ends, and then, if we are tracking deletes, we can basically reset the delete counter.
A
If another compaction comes in behind the scenes that we don't know about, we can kind of make this fairly clever. I don't know if it all will work or not, but that's kind of the current thinking I had: we can basically increment our delete counter and issue a flush or compaction depending on some criteria, whatever it is, however many deletes we want that have finished successfully, based on...
A
You know, knowing that the transaction finished. And then we register a listener with RocksDB, so that if RocksDB decides to compact the column family, then we don't just do it ourselves, you know, blindly; we reset our counters based on that and then start re-incrementing them, so that then, you know, we wait until more deletes have happened, because RocksDB compacted and all the tombstones are gone.
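A minimal sketch of the listener idea, assuming RocksDB's EventListener interface; the reset callback is a hypothetical hook into the delete-tracking structure, not code from the PR:

```cpp
// Sketch only: an EventListener that notices when RocksDB itself
// finishes a compaction or flush on a column family, so the delete
// counter for that column family can be reset instead of immediately
// scheduling a redundant manual compaction.
#include <functional>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/listener.h>

class TombstoneResetListener : public rocksdb::EventListener {
 public:
  using ResetFn = std::function<void(const std::string& cf_name)>;
  explicit TombstoneResetListener(ResetFn reset_deletes)
      : reset_deletes_(std::move(reset_deletes)) {}

  void OnCompactionCompleted(rocksdb::DB*,
                             const rocksdb::CompactionJobInfo& info) override {
    // A background compaction already removed tombstones in this CF.
    reset_deletes_(info.cf_name);
  }

  void OnFlushCompleted(rocksdb::DB*,
                        const rocksdb::FlushJobInfo& info) override {
    // Memtable contents (including tombstones) just reached an SST.
    reset_deletes_(info.cf_name);
  }

 private:
  ResetFn reset_deletes_;
};

// Registered before opening the DB, e.g.:
//   options.listeners.push_back(
//       std::make_shared<TombstoneResetListener>(reset_cb));
```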
A
I think if we wanted to do it the other way, we'd have to expose both the ability to do per-column-family flushes and compactions and also pass through the ability to register listeners with RocksDB, and that gets... I mean, it gets kind of messy. Then you're starting to, like, expose a lot of internal details about RocksDB to BlueStore, where theoretically this should be an abstraction. You could do it, it's just, I don't know, starting to feel kind of gross.
A
All right, so in general, though, that's all still work in progress, and, you know, we'll find out what's right or if any of this actually works, or what, I guess, but we probably need to fix it in multiple different ways and just try things. Okay, so next, an update from Igor: this is "get rid of statfs updates on each transaction."
A
This failed QA. That was the deal with this one: I think Adam had approved it, it got some updates, and I think Adam actually maybe wrote some fixes for this, but I didn't look too closely at it. So anyway, it's being worked on. And then after that, I confess, I didn't really get through the old PRs. It's possible some of these got updated, but most of them are probably still kind of just hanging around.
A
So, in terms of other updates this week...
A
So if folks are interested, the final version is out there. Josh, we talked a little bit about this in the PR, but you had mentioned write amplification as a real concern about using TTL, and one of the things in this article that I saw is that with RGW, the write load in RocksDB went way down when we started testing compression; it was huge.
D
And, yeah, I can't remember if I mentioned this in the past, but we actually did run compression on our RGW clusters internally, and I found that it was mostly fine until those OSDs became a backfill source, at which point the load spike was pretty significant.
D
Because of the high read traffic, we actually did get some people with workload slowdowns when that happened. Now, okay, there are things that could be done to tweak that, right; it's not like we have to backfill at full speed, but it was noticeable.
A
Do you remember if you were using Snappy compression, or...? It was Snappy? Okay, okay. I think they're pretty close, although I think LZ4 was a little better, from what I saw on the RocksDB Facebook thing and then also just in testing.
D
Yeah, and to be clear, the write amp that I'm concerned about is less... it's still on the index side, like that six-hour TTL; that was based off of what we found acceptable right now. So basically it's like, okay, assume you want the SSD to last for five years; we just did the math on write amp and everything was fine. We've got tons of write capability on those boxes, so it wasn't a performance issue, it was entirely a wear-level issue.
A
And I have a feeling that all of this stuff is going to increase write amp overall, right: the TTL, the compact-on-iteration, especially if you start tracking deletes and then, like, every, you know, thousand or ten thousand, who knows how many, deletes we issue compactions. You know, we're going to see that write amp go up, so I'm just trying to think of, like, how to kind of start balancing some of this out.
D
Well, and this is the balancing act with RocksDB, right. Yeah, yeah. Like, I was... I'm just catching up; I actually just came back from three and a half weeks of PTO here, but I did see one of my teammates had seen your PR and posted it, and I'm super happy to see a lot of those settings starting to be proposed to mainline, rather than staying in a blog post somewhere.
D
That's awesome. I think for most people it's just not going to matter, and then for the larger shops like us or others, we're probably just going to observe carefully and then tweak it ourselves anyway.
D
We actually... so I don't know if you've come across him at all, but one of my colleagues, Alex Marangone, I'm sure you know who he is. Yeah, yeah.
D
We had tracked down a PG log load issue, which I don't think we were the first ones to hit, the PG log blowup, which basically caused the PG log to be many, many gigabytes in size on a per-OSD basis, and it actually caused RocksDB to roll over into having an L5 of data. That was awful, because now your tombstone buildup is just that much bigger, right; the fewer levels you have, the fewer tombstones you have, because they get compacted out more regularly.
A
At what point in time was... was that issue that you hit the one where we're failing to trim if there is a bogus, basically, like, future update that ends up getting in? That one was obviously...
D
This was each entry getting big. Oh, that one, okay, yeah, yeah. Because of, like, the refcount update in PG log entries, like, every PG log entry was like 20 kilobytes plus or something like that, just for a refcount update. I don't remember the details, but it was huge, and it goes back to some, like, Hammer-era change, right; I don't know exactly how it works.
A
It's this one, actually... now I'm... it's actually a dups entry issue, where if you have a corrupt dups entry that looks like it's in the future, then we just don't trim anything.
A
Yeah, we went through and did a bunch of work on that a week or two ago, and that fix is good, so yeah, that should hopefully no longer be an issue.
A
Okay, so yeah, I guess I will be very interested in you guys' take on this, and especially if you guys have any kind of, like, test setup: when we get the deletion tracking in, I'd be very, very interested in having you guys review it and, if you can, you know, make sure it's not fouling anything up on...
D
Your end, for sure, yeah. I mean, we do have test setups; I'll have to see, because of the way they hook into the infrastructure, whether they would trigger those paths or not. So, I mean, we'll see once the PR is up how they integrate; we might have to tweak how we do our testing.
D
I don't know, I'm still catching up. I find that when I catch up from vacation, like, words just don't come to my head as quickly. But, like, there's a huge array of customer workloads and it's very hard for us to capture that perfectly, so usually, once we're sure it's not going to fall over, our best test is: let's just go put it in production for a subset and find out what happens.
A
Cool, all right. Well, thank you, definitely appreciate it. You know, Alex's tracker ticket really, really highlighted it. I think... I don't know if I would say it... we've known about how bad tombstone accumulation is for a couple of years now, but, you know, it's worse, I think, than we even realized.
B
I don't think tombstones are a real issue when you access an object by key, because I think we did the math. Say you have one million objects and you deleted objects a thousand times, so now you have one billion tombstones and one million objects. Are you with me? You have the one million objects, but after a thousand deletion cycles you now have one million real objects and one billion tombstones; let's assume there was no compaction whatsoever.
B
Okay, are you with me so far? So until now, when you did a search for an object, which is done in log n, it was taking 20 steps for one million objects: log2 of 1 million is 20, so 20 steps. When you grow to a billion entries because of all the tombstones, the search is going to take 30 steps, because log2 of 1 billion is 30.
B
So we really only increased the work that we do by 50 percent, and that's when we have a crazy amount of tombstones; that's not a problem. The problem is when people walk a range, because then they start searching, and when you search, you are linearly affected by the amount of tombstones. Does that make sense to you?
A
Yeah, yeah, agreed. I don't know for sure, but I think when you access something by key, RocksDB internally will go through and hit the bloom filters, right, to see if that key exists in a particular level, and only if it has to will it fall back to doing a search through the SST files. Is that right? Okay. Is that your understanding?
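For reference, a generic sketch of how a bloom filter gets attached to a RocksDB column family; the bits-per-key value is illustrative and these are not the Ceph defaults:

```cpp
// Sketch only: attach a bloom filter to a column family's table format.
// Point lookups (Get) can then skip SST files that definitely do not
// contain the key; iterators and seeks get no such shortcut, which is
// why range walks pay for every tombstone they pass over.
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::ColumnFamilyOptions WithBloomFilter(rocksdb::ColumnFamilyOptions cf) {
  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10 /* bits per key, illustrative */));
  cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return cf;
}
```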
B
The... when we search by range, that is a terrible thing to do. We should never search by range.
B
In the snap mapper there was really no reason to search by ranges, but the way this thing was built forced us to walk by ranges, because we wanted to have a logical map from snap ID to a set of objects. So you can have one snap ID and, say, 20,000 objects, and what we have done is we broke it into 20,000 separate key-value pairs, where each time the key was the snap ID and the object ID. So now we start searching by ranges.
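A schematic sketch of the flattening being described; the key layout here is hypothetical and not the actual SnapMapper format:

```cpp
// Sketch only: the logical map  snap_id -> { object ids }  stored as one
// RocksDB key per (snap_id, object_id) pair, so "all objects in snap S"
// becomes a prefix/range scan that has to step over any tombstones
// sitting inside that range.
#include <cstdio>
#include <memory>
#include <string>
#include <rocksdb/db.h>

std::string snap_key(uint64_t snap_id, const std::string& oid) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "SNAP_%016llx_",
                static_cast<unsigned long long>(snap_id));
  return std::string(buf) + oid;  // hypothetical key layout
}

void list_objects_in_snap(rocksdb::DB* db, uint64_t snap_id) {
  const std::string prefix = snap_key(snap_id, "");
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  // Every Next() here may have to skip tombstones left by earlier trims.
  for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
       it->Next()) {
    // it->key() encodes (snap_id, object_id); it->value() holds metadata.
  }
}
```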
B
So all the searches are done by ranges, and that is extremely expensive. All of this is going to go away with my code, because we no longer use omap in RocksDB whatsoever. And I also think that... my code is now going through some kind of optimization cycle, because I realized that a lot of what we do is handle the worst-case scenario, but we never really use that code.
B
For example, the way that we call the code, we have an operation, we call update, and the update gives you the old map. So if the clone object belonged to, say, three snap sessions, and now you remove one or two snaps, it gives you the new map and tells you you have to remove everything from the old map. So in theory you should walk over all the snaps and remove all the stuff, but in reality this is not how we do things.
B
In reality, what we do is we remove one snap session at a time; we only work on a single snap session. And when you're adding objects, that's a very simple thing: just add the object, and you can batch them to disk. When you do trimming, you do it on a single snap, so you just need to page this single snap map into memory. Maybe you need to sort it, I don't know.
B
I don't know, like, by the object ID order. If you remove something from the onode column family, does it make any difference if you do them in one order or if you do them randomly?
B
Okay, if it doesn't matter, then all we need to do is page this single snap entity into memory and then remove them one by one; you just walk over a vector. It's an extremely simple thing, and you never have to do any searches, you never have to do any sorting, any accesses to the database, or anything.
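A schematic sketch of the one-snap-at-a-time idea, with hypothetical keys; the in-progress code may look nothing like this:

```cpp
// Sketch only: once the object list for a single snap has been paged
// into memory, trimming is just a walk over a vector, batching the
// per-object deletions, with no range search while trimming.
#include <string>
#include <vector>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

void trim_one_snap(rocksdb::DB* db,
                   const std::vector<std::string>& object_keys_in_snap) {
  rocksdb::WriteBatch batch;
  for (const std::string& key : object_keys_in_snap) {
    batch.Delete(key);  // hypothetical per-object key
  }
  db->Write(rocksdb::WriteOptions(), &batch);
}
```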
B
If you remember, the way that we do things, it's one snap at a time. The code, in theory, can support operating on multiple snaps and removing multiple objects for multiple snaps, but we never do that. Actually, I take it back: there is one scenario in which you do that, but that scenario is not critical.
B
That scenario is when you remove a volume; then you need to remove the object from all the sessions. But that thing is low priority and it could be done easily. The way that we do snap trim is always one snap at a time, but the code is always operating as if everything could be modified at any time, and it's always iterating over all the snaps, when in fact we always work on a single snap.
B
This whole code is very general, which is a nice thing if, I don't know, you're trying to build a toolkit for somebody else; it's nice to be general. But when you're doing stuff for yourself, it's okay to say, you know what, we're going to remove snap sessions one by one: we start from one session and once we're done we move to the next one. There is really never a reason to walk multiple sessions.
B
So then the code is much easier to do and it's much cheaper. So that's the way to do things. But the other cases where we do ranges... I know in the PG log we tried to do range remove; I suspect that's another problem, and I think in the end we decided not to do that. In theory a range remove is a cheap operation, because instead of issuing multiple tombstones you just create one mega tombstone; in reality it seems to be more expensive, and I don't know this...
A
Right, and the more of that goes away, the less important some of this other stuff we're talking about becomes, like automatically triggering compaction and flush on deletion and that kind of thing, just to try to get tombstones out. I don't know if we're ever really going to completely eliminate it, though. I don't know, like, on the RGW side... do you know, like, under what circumstances we do a lot of, like, range-based iteration?
A
Yeah, Greg, I'm kind of trying to get a sense for, like, how important a general solution is in, like, the RocksDB store, for issuing compactions on deletion to get rid of tombstones, versus more, like, one-off, kind of, you know, allowing different things to issue compactions periodically. Like, do we need the general solution, or is it good enough to have, like, the ability to let something trigger a compaction here and there?
A
My inclination is a general solution: basically, in the RocksDB KV store code, for every column family, track the deletions and then have some cleverness there where we can reset our counters when RocksDB issues a background compaction, and not expose the ability to do, like, per-column-family compaction and flushing to other stuff; we just take care of it there. That's kind of...
C
Like, yeah. I mean, because, like, the upper layers, even if they're doing bulk deletes, like, they don't know where they're going, definitely. Like, they're not paying attention to what PGs things are in, they're certainly not paying attention to which OSD gets those PGs, and that all matters, I think, for this. Yeah, you know, like, the MDS stores all of its entries in omap, and, you know, sure, it can do a bulk delete of omap sometimes; it's not super common, but, like, if it does...
A
And, I mean, even inside the OSD and inside BlueStore, I don't really want, you know, people thinking about RocksDB-internal behavior, right. Like, we won't, but if we did put a different, you know, key-value store behind this, issuing compactions on deletion might not be what you want to do at all. So I think the key-value store code, or the RocksDB KV store code, is where this probably should live.
C
Sure, that makes sense to me. I mean, I think, if you want to track that, like, if you want to issue it based off the ratio of deletes or tombstones, then there's no reason to do that above RocksDB or our RocksDB handling, because, like, why would you? RocksDB knows, or we know, where the deletes came from.
A
I think that Gabi's idea earlier was that we could kind of try to be smart about, like, waiting until there's, like, a time period to do a compaction when nothing else is happening; like, maybe there's a trade-off that you want to make where you don't want to do the compaction immediately, you just want to wait for a while. But I don't think it's enough of a win.
B
Whatever we do... so if we do snap trim, we could stop the trim until the compaction is done; there is no reason to run them in parallel. If we do anything else, then... yeah, I know, but at least you could stop whatever is clearly generating more tombstones in the same column family.
C
That does not square with my intuition of how compaction and tombstones interact, but maybe you have tests; I haven't looked at tests. I mean, like, if you run a compaction and then write a bunch more tombstones, you just have to go through all the layers again, and I don't think you're reducing the number of levels by that much.
A
Greg, I think the case that I really worry about is where you're deleting and iterating at the same time, or, well, you're interleaving them, right. That's the one that we consistently see break, where you have lots of tombstones and you're not writing anything, so you're never compacting, so you end up with iteration just taking longer and longer, like, you know, seeking to the beginning of a range or seeking to the end of the range. That's the behavior that we've got to figure out how to fix.
B
Why do we do that? We know that we generate a lot of tombstones for others, and we can stop doing that. So in theory, if you run snap trimming very quickly, you would fill the system with tombstones, so you should in a way pace yourself and say, you know what, every N tombstones I create, I'm going to take a break, compact the system, and only then resume trimming, or volume remove, or whatever it is I'm doing.
A
I see... memory or whatever, but if you're only working on, like, one thing at a time, are you consuming a lot of resources if you end up blocking?
A
You won't... you're just submitting more work without bound, yeah, because even if it's not completing...
A
Greg, that's not... it's part of it, but the big, the real part of it, is that when you're, sort of, you know, finding the beginning of a range, you end up having to iterate over all of those tombstones to do it. So, like, as you go, because you're doing this kind of, like, iterate, find something, delete it, start over at the beginning again, iterate, find it, delete it, you end up doing, you know... it's not like a linear growth, it's a bigger growth in...
C
Okay, I mean, yeah, so that seems like something we should work on, because, I mean, yeah, I don't know what all the patterns are, but, I mean, just from how much work it takes to do the compaction and propagate things: you know, having all the tombstones in there at once means we have to do one pass when we come back, whereas doing it every thousand, if we have 20,000 tombstones, means we have to do 20 compaction passes, and that seems bad too. So, I guess, yeah.
C
We should look at our iteration patterns. We don't really want to have to, like, stop iterating; like, we can't remove all iteration from the...
A
Yeah, we... I think we need fallback code, because we can't have everything just kind of grind to a halt. But generally speaking, if we can, especially if we can avoid restarting and, you know, like, recomputing a lower bound, right... like, that's what a lot of times we see, that RocksDB is spending, like, a horrific amount of wall clock time seeking within lower bound, because we're restarting over and over again.
B
I mean, removing the object, all the processing, all the transactions, that's expensive; but if even the search is expensive, to the point that you say, I cannot do it for that long, then you are actually making it worse. Sure, you never block it for a very long time, but aggregated, you will block for a much bigger time.
E
In general, the smaller, more frequent compaction strategy has been applied pretty successfully in other LSM variants like Pebble, and in other academic papers, so I think that is worth looking into.
A
Yeah, Josh, that's kind of why I'm leaning towards this plan of tracking deletes, you know, per column family, and then, you know, we could tweak it, but generally speaking, issuing compactions... I'm afraid of the write amplification overhead, and we probably are going to increase write amplification doing this, but I think, probably, from the whole-system-working-well kind of perspective, it'll improve things.
A
Yeah, it gets tricky, but the gist of it is that it's probably going to involve locking, which I don't love, but we'd have, basically, like, a column family wrapper thing that's got a bunch of, well, a couple of different counters in it. We increment our counters when we issue rmkey or rmkeys or whatever we're doing, but we don't actually apply what we increment until the transaction completes through RocksDB; then, once we have that particular counter get high enough...
A
We don't want to issue a compaction or a flush if RocksDB did so recently. So, in addition to this, we would, I believe we can do this, register a listener with RocksDB, so that each column family would basically have a listener associated with it, or each wrapper would have a listener for its column family associated with it.
E
Yeah, I think so. I guess I wonder about the listening for compaction events and flush events; it seems like that may or may not be necessary. Are you sure that it is?
A
Yeah, that would probably be... it could be a speed enhancement, right. I mean, it's not the end of the world if you do a compaction immediately after RocksDB just did one; it's not great, but...
E
Too... I guess with these, like, delete-heavy workloads, it seems like it's not compacting that fast or that quickly to keep up with the... correct?
A
With the delete workloads, your... I don't know if eventually we'd even trigger a compaction if you literally fill the memtable with deletes; like, there's some extremely small size associated with that tombstone, I guess, I would assume there is, but yeah, it's either never compacting or just compacting extremely...
E
Okay, so what you're saying is, it implies that the compaction activity is triggered based on fullness, and because tombstones are so small, they're not hitting that fullness very quickly.
A
Yeah, or just never, right, and I don't actually know. But yeah, so potentially the bad case, right, would be if you've got some kind of, like, moderate or small write workload and then, like, a delete workload mixed in with it, and the writes are triggering compactions, but then we're also triggering compactions on deletes, and you end up, like, kind of colliding with each other and having, like, twice the number of compactions
A
that you maybe should have, right, because they're kind of stepping on each other's feet. Sure, sure, that would be, you know, the thing that you maybe want to try to listen for and avoid, but, like you said, it's probably not the end of the world; it would mean more write amplification and maybe be a little slower. But the big thing is actually the deletes: tracking the deletes and then applying what you see on a successful transaction to RocksDB,
A
on the, you know, successful write of the transaction to the write-ahead log, and then issuing the compaction, or, rather, the flush and the compaction, only for that column family, if you see however many deletes you want to trigger on.
E
Would it be worthwhile to think about a smaller range within a column family, too? Would that help?
A
I haven't thought about it that far, but you can do, like, a range for a compaction.
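For reference, a minimal sketch of a range-limited compaction using RocksDB's CompactRange; the key bounds are hypothetical:

```cpp
// Sketch only: compact just a sub-range of a column family instead of
// the whole thing. Only SST files overlapping [begin, end) are
// rewritten, so the cost (and write amplification) is limited to the
// keys that actually need it.
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/slice.h>

void compact_subrange(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                      const std::string& begin_key,
                      const std::string& end_key) {
  rocksdb::Slice begin(begin_key);
  rocksdb::Slice end(end_key);
  db->CompactRange(rocksdb::CompactRangeOptions(), cf, &begin, &end);
}
```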
E
Or maybe longer term we could, like... if we see a compaction is taking a very long time for the full column family, we could try to reduce the scope a bit. Sure.
A
Documentation... I don't know if I want that to be in here.
E
Yeah, it's not a lot of English to...
A
But it's not worth trying to do this... so, yeah, maybe, but we don't need to do it to start out, right; we can start off just trying to flush the whole thing.
A
So, yeah, I've kind of been trying to lay it out in my head, and it seems like it all works; I'm afraid I'm going to have to protect the wrapper thing with the lock on deletion.
A
I was thinking the other day it could be... well, let me think about it again; I haven't thought about it in a couple of days. Yeah, atomics... I think that didn't work. Okay.
A
So, yeah, anyway. I mean, I'll try to take a look at that once a lot of these other, like, random fires get put out that we have right now with performance stuff. But yeah, that's kind of the plan for me, anyway, to see if we can do this. Yeah, I guess that's it, and we're at the end of the hour, but...