From YouTube: Ceph Performance Meeting 2022-03-17
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: Good morning, folks. Starting a little late; the core meeting is just wrapping up now. So, unfortunately, we're not gonna have Adam or Gabby today; they're both out on PTO. So we won't end up talking about their proposal today, but I do have some stuff I want to talk about with CyanStore. Before that, though, let's see, the core team is probably gonna arrive here in moments, I'm guessing, so I think I will just get this going right now. So, PRs.
A: I haven't done this in a while. I've been on PTO, so I haven't really looked at it closely, but with all the work that's going into Quincy, there hasn't been a ton of new stuff. In fact, I didn't really see anything that was really new over the last couple of weeks coming in. Feel free to add it if there's something that I didn't see, but there were a couple of different closed PRs.
A: So I'll start on those. There was a PR from Matt Benjamin on the RGW side that improved performance when you had different logging that you were looking at. That PR got closed without being merged; I'm not totally sure why. Casey, are you here? No, it doesn't look like it.

B: I'm here.
A: Well then, moving on. Josh Solomon's PR for improving performance in some rare cases in the balancer finally merged. We do not have Josh today, and I don't see Laura; I think she did quite a bit of the review on that PR. I don't remember exactly which cases this actually helped, but they both have been doing a lot of work on the balancer, trying to make it better and improve it.
A: So my guess is that that's not the only PR we'll eventually see in this area, but for now it's a general improvement; look at it if you're interested. There's a PR for using a thread-local pointer to save the shard. I don't remember much about this.
A: Laura looked at it, and Kefu actually looked at this PR. Is Ronen here? Ronen, do you remember much about this? This is which...

C: One second. I didn't think it was anything drastic or important, but...
A: Yeah, that was my impression too, and there were no performance testing results or anything, but it theoretically is better. So, okay, moving on then: testing classic with the performance tag, from Chris. Let's see, I have to get my window back open. Do we have Chris today?
A: I don't think so. So that got closed, but I do believe he is still working on trying to take some of the work that Kefu and Radek and others did with running Jenkins performance tests for Crimson, and having those actually run tests against our classic code as well. So that particular PR maybe was closed, but I think he is still working on the general idea.
A: So, not sure what happened there, why we closed that one, but in any event, I think that work remains ongoing. And the last one I had for closed PRs: Igor's work on speeding up pool removal by introducing collection list prefetch. That has been around for a very, very long time, and that was closed by Laura.

E: The new PR is for PG removal.
A: All right, so, updated PRs. I only had three in my list. One is this one for setting tracing compiled in by default.
A: This is on the RGW side... the RBD side, and I think Casey's also looking at this. The thought originally was that this had fairly low overhead, and I think there were plans to merge it. It went through some testing, and then, for whatever reason... I don't know if it didn't pass that, but there's been more updates and more discussion on it. So that has not merged yet and, I think, is still actively being looked at.
A: Do we have anyone from that group that was looking at it here? Casey, you're here; have you looked at that recently? This is the tracepoint stuff, right?
F: Yeah, I haven't heard any recent results performance-wise. You know that they switched to some tracer sender that batches things instead of sending everything synchronously, and that helped quite a bit. But I don't know what the latest numbers are.
A: All right, cool. Well, it'll be exciting to see what updated numbers they have after some of that work. Okay, next PR: this is just a doc PR for rewriting the hardware docs. I was, I think, tangentially involved in some of this, sort of, but I've not been super good about reviewing it. Dan from CERN has been kind of reviewing it, I think, so hopefully that's to people's satisfaction, primarily the CPU section, you know, what our upstream hardware recommendations are.
A: A lot of that documentation was quite out of date. Okay, and the last one: an MDS PR for skipping inode-with-caps iteration for empty directories. This is from Patrick. It looks like there were some bug fixes and more discussion going on there.
A: Sure, okay! Well, I think that active discussion is still ongoing, so that's it for that. Lots of stuff in the no-movement category, but I think the only one here that maybe I would want to follow up on immediately is that we do want to get the tcmalloc thread cache size moved into a Ceph configuration option, so I'll try to follow up with Adam on that one, for containers.
A: This makes it a lot nicer to be able to change that setting on the fly, rather than needing to change, basically, the Ceph deployment in the container itself. So yeah, that's probably something that we should try to get in sooner rather than later, I think. All right, so there were some comments... oh, before I move on: anything I missed from anyone that they'd like to talk about, any PRs?
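For reference, the tcmalloc thread cache limit is already runtime-settable through the gperftools MallocExtension API, which is presumably what an on-the-fly Ceph config option would call under the hood; today the knob is usually baked into the deployment via the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable. A minimal sketch, assuming the process is linked against tcmalloc (e.g. built with -ltcmalloc):

```cpp
// Sketch: adjust tcmalloc's aggregate thread-cache limit at runtime.
// Assumes the gperftools allocator is actually loaded into the process.
#include <gperftools/malloc_extension.h>
#include <cstddef>
#include <iostream>

int main() {
  const char* prop = "tcmalloc.max_total_thread_cache_bytes";
  size_t cur = 0;
  if (MallocExtension::instance()->GetNumericProperty(prop, &cur)) {
    std::cout << "current limit: " << cur << " bytes\n";
  }
  // Raise the limit to 256 MiB on the fly; no daemon restart or
  // container redeploy needed, which is the point being made above.
  if (!MallocExtension::instance()->SetNumericProperty(prop, 256u << 20)) {
    std::cerr << "allocator does not support this property\n";
  }
  return 0;
}
```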
A: Okay. So, Kenneth from SoftIron had posted in the chat window that there has been some ongoing work looking at Nautilus-specific performance regressions. I did quite a bit of work on that before I went on PTO, and they've been doing quite a bit of work on looking at that as well. So Kenneth had mentioned here that they've been doing a lot of performance testing on their side and noticed a few things.
G: A little noisy where I'm at, but I'll unmute.

A: You're good, yeah, go ahead.

G: Most of our hardware is ARM; we're actually using the AMD Opteron A1100, you know, the Seattle chipset, on a lot of our storage nodes, especially the density boxes. So that's a little unique versus what a lot of folks are doing. Our higher-performance boxes, the NVMe nodes specifically, are, you know, AMD EPYC, a little more normal, I guess, for what most other folks do.
G: So, a couple of folks on our team have more info than me; I wish they would have joined, but they're busy, so you're kind of getting secondhand or maybe thirdhand information here. But they showed me a graph this morning: there were apparently some assembly changes on ARM for ISA-L; they backported that to Pacific, recompiled, and noticed a pretty substantial performance increase. And also, apparently, our GitLab CI, where we're doing our Ceph builds, wasn't setting the CMake build type to release-with-debug-info (RelWithDebInfo), and fixing that made a pretty dramatic difference.
G: Especially at the larger block sizes, that made a huge difference. We're pretty close now; there's still a gap, like, Pacific is still, you know, demonstrably slower than Nautilus, especially on ARM. We haven't quite figured it out yet, but we're much closer now, at least.
A: So then we ultimately end up seeing a situation where, you know, Pacific looks worse than Nautilus, when it looked better when we initially tested it. That seems to be an unfortunate trend, which I guess means that we probably need to be doing a lot more performance testing on backports, which isn't going to be fun, but maybe is a requirement going forward.
G: Yeah, we're thinking about testing one more thing: since we have such tiny caches on our ARM chipsets, we're thinking about building with, what is it, there's a CMake option like MinSizeRel or something like that, and seeing if that makes a difference. Hold on.
G: I'll post to the mailing list, yeah. I was gonna give you an update when I had something more useful, I suppose, rather than anecdote, because anecdote isn't very helpful. But we're going to try changing some of the compile-time options on ARM, since the cache sizes are so tiny on those chipsets, and see if that makes a difference.
G: Sure. Also, what's interesting is, at least again on ARM, and I don't know why this would be different, but we see about a 10% performance drop running the daemons in containers, running them in Docker versus not, which I wasn't expecting to see either. And it's pretty consistent: the same compile-time options, the same linker options, and the same Ceph configurations; just moving them into containers on ARM, for us, we see about a 10% performance drop.
G: Right. I was wondering about that; I haven't looked deeply into it, and I'll admit my understanding is probably elementary-school level, but I was wondering if there was some sort of, like, socket or networking communication that was the bottleneck.
G: So it was interesting to see, and that definitely drives my decision on whether we're going Docker or not, or Podman, or what have you, right?
A: Yeah, yeah, and I think that's the struggle I go through too, because a lot of the performance tests that I run are on bare metal, specifically because I really don't want other random bottlenecks polluting, you know, the info we get out of this. But then it means that we're not necessarily running in the same way that a number of users are. So yeah, I understand your reasoning; especially if you're making an appliance, just running bare metal is easier. Yes, yes, yeah.
G: That's all I have today. I just wanted to join and say hi, give you an update. I joined last week, but you were on PTO and...
A: Yeah, well, I owe everybody more work on that too, because I ended up having a ton of PTO I had to use, so I kind of, you know, stopped working on it. But now it's time to go back and try to wrap up some of that. Also, for Quincy, we're gonna try to, you know, at least be able to showcase it. A lot of Quincy is gonna look better than Pacific, especially on the write path, due to the work that Gabby did, but it's also going to hide some of our sins from the backports, I think, that we saw. So there's probably more work to do there, even if the numbers look better.

G: Got it, got it. Well, I have to run in nine minutes for another call, but I'm glad you got around to, you know, my comments in the chat. Again, I appreciate everything you're doing.
A: Long term, if there's time to do it, and it's always down at the bottom of the pile, but it'd be really nice to make an automated system for doing performance bisects: just have it, you know, walk through doing benchmarks, and, like, downloading new versions of stuff, and just go through the whole process in an automated way.
G: We have a benchmarking tool that we use that I'm trying to integrate as part of our CI/CD pipeline, for when we do new Ceph builds, to have some sort of baseline that we track. I'm hoping to release some of that to the community this year, for what it's worth.

A: That would be great.

G: Yeah, it speaks native RADOS, rather than going through some other sort of abstraction, so I think that'll probably be valuable; we'll see if the community likes it.
A: I've got tests that I wrote that live in the ceph_test_objectstore world that are really useful for looking at, like, omap and kind of the behavior that we see there. But we probably want that to exist as something that you can test against an existing cluster, not just as, you know, a standalone test.
A: Cool. All right, okay... oh hey, Gabby, you're here!
A: Gabby, I see that you're out in the real world somewhere, you know, probably enjoying fresh air.
A: Yeah, no worries. Since you were on PTO and Adam was on PTO, I figured we would maybe wait to discuss your proposal. If you want to, it's fine; whatever you prefer to do.
A: Maybe in the meantime, while Gabby's figuring out his mic issues, I'll give just a brief update on Crimson and CyanStore. So Josh and I talked earlier this week, Josh Durgin and I, about trying to showcase kind of the upper half of the Crimson stack a little bit better than what we've been able to do in the past. So right now, CyanStore doesn't really showcase it super well; it's not bad, but it's not really, you know, as fast as maybe it could be. So earlier this week I went back and re-ran some tests, specifically on CyanStore and Crimson, just to see kind of where we're at right now, and the results were kind of interesting. Let me quick grab some of the numbers that I gathered; I'll throw these in the chat window here.
A: It didn't copy properly... there we go. So this is just 64K random reads and 4K random reads, and 64K random writes and 4K random writes, and the best results out of this were definitely the 4K random results. That's what we've seen in the past; we're getting about 66,000 IOPS out of that, which is higher than we've done, I think, on classic OSDs ever. And, you know, of course this is in memory, so it's not completely reasonable, but it's not bad.
A: What was interesting is that in both the 64K random read case and in the 4K random read case, we saw send message as the primary consumer of wall-clock time. In the 64K random read case, it was like 99% where we're stuck in send message; I don't know why. Those numbers will probably improve dramatically when we figure out whatever it is that we're doing that's making send message consume huge amounts of time. In the 4K random read case, it's like 25%, so there's still a really big advantage.
A: If we can figure out why that part of the stack is taking so much time, we're probably gonna see some pretty big advantages there. On the write path side, what I'm seeing is that we are spending a significant amount of time in bufferlist substr_of. That code is basically splitting up bufferlists; there's a couple of while loops in there where we're just kind of iterating and creating new buffer pointers, if I remember right. So here, I'll link the line of code in CyanStore.
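As a rough illustration of the pattern being described, here is a simplified model of a substr_of-style walk. This is not the actual ceph::bufferlist code, just a sketch of why it costs: it loops over every overlapping segment and constructs a new reference-counted sub-pointer for each.

```cpp
// Simplified stand-ins for ceph::buffer::ptr / ceph::bufferlist, showing
// the two-while-loop shape discussed above; not the real implementation.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>

struct Ptr {
  std::shared_ptr<char[]> raw;  // refcount is bumped on every copy
  size_t off = 0, len = 0;
};

struct BufferList {
  std::list<Ptr> segs;

  // Make *this refer to bytes [off, off+len) of `other` without copying
  // the data, only the segment descriptors.
  void substr_of(const BufferList& other, size_t off, size_t len) {
    segs.clear();
    auto it = other.segs.begin();
    // Loop 1: skip whole segments until `off` falls inside one.
    while (it != other.segs.end() && off >= it->len) {
      off -= it->len;
      ++it;
    }
    // Loop 2: emit one new sub-pointer per overlapping segment. Each
    // iteration allocates a list node and touches the shared refcount,
    // which is where a wall-clock profile would show the time going.
    while (it != other.segs.end() && len > 0) {
      size_t take = std::min(len, it->len - off);
      segs.push_back(Ptr{it->raw, it->off + off, take});
      len -= take;
      off = 0;
      ++it;
    }
    assert(len == 0 && "substr range ran past the end of the source");
  }
};
```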
A: This is basically where we're doing that. I'm not totally sure yet what the right approach is to improve this; it was about 14% of the overall wall-clock time being spent in that portion of the code. Otherwise, we're kind of spread out over a bunch of different stuff: peering code, just memory allocation code, freeing xattrs and omap, interestingly enough, this being RBD; hobject_t comparisons; just a whole slew of random different things. So the big one was the substr_of in bufferlist, figuring out why there's so much overhead there. But the good news is that it looks like there are some interesting things to explore, both on the read path and on the write path. I think we can maybe do better; I'm not sure how much better, but I think we can make it more efficient.
A: So I've been talking to Adam a little bit about this problem, and Radek as well, and then also about whether or not we can figure out how to shorten the path in the upper half of the stack, meaning from the network buffer down into the object store. That's kind of the direction I want to go as I do this investigation: to see if there are ways that we can improve and shorten the path there.
A: I don't know if it will work or not, but that's kind of the approach I'm taking right now. Anyway, that's about it for that. Any questions on that, or thoughts or comments?
A: All right, well then, if not... Gabby, I guess, is not gonna have luck with his mic right now, so maybe we'll wait till next week to talk about his proposal. That's all I have. Would anyone else like to bring up any topics, or have anything that they would like to talk about with the group?
I: I have a topic related to an issue that we saw on a Pacific cluster in production that seems to be related to RocksDB performance and bucket listings for RGW, and it sounded like this was the right platform for talking about that.
A: Go ahead, yeah.

I: So let me just give you a little context about the scenario that we encountered. We were on a Pacific 16.2.7 cluster in production. We had a customer that was using a Veeam client and doing backups, writing about 50 megabytes per second constantly for a week or two, and at the time we saw issues, they had about 20 million objects in the bucket. And then their bucket listings started timing out, after we fixed an issue with bi list... it was a...
I: It was a backport that we patched in; it was kind of a known issue. So the Veeam client was doing, like, one bucket listing per second on this bucket, and when we looked into it...
I: So on the relevant OSDs, we saw they were using one core at 100% CPU. This bucket index pool was deployed on NVMe, and those weren't, like, touched at all; they were barely doing anything. And we were seeing a lot of extra space consumption on the bucket index pool as well when we were doing a df.
A: What we've seen in the past is that this is almost always tombstones: there are all these tombstone entries for deletes that you end up iterating over, and it makes the iteration extremely slow. And then, when you compact, it gets rid of all that junk that's in there; it actually, you know, reduces the working set that you're iterating over, and then things are fast again.
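A minimal, self-contained way to reproduce the effect being described (the path and key counts are made up for illustration): write a pile of keys, delete most of them, time a full scan, then CompactRange() and scan again. The first scan has to step over every tombstone; after compaction they are gone.

```cpp
#include <rocksdb/db.h>
#include <chrono>
#include <iostream>
#include <memory>
#include <string>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* raw = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/tombstone_demo", &raw);
  if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }
  std::unique_ptr<rocksdb::DB> db(raw);

  for (int i = 0; i < 1000000; ++i)
    db->Put(rocksdb::WriteOptions(), "key" + std::to_string(i), "v");
  for (int i = 0; i < 999000; ++i)   // delete almost everything
    db->Delete(rocksdb::WriteOptions(), "key" + std::to_string(i));

  auto scan = [&] {
    auto t0 = std::chrono::steady_clock::now();
    std::unique_ptr<rocksdb::Iterator> it(
        db->NewIterator(rocksdb::ReadOptions()));
    long n = 0;
    for (it->SeekToFirst(); it->Valid(); it->Next()) ++n;
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - t0).count();
    std::cout << n << " live keys scanned in " << ms << " ms\n";
  };

  scan();  // slow: the iterator steps over ~999k tombstones
  db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  scan();  // fast: compaction dropped the tombstones
  return 0;
}
```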
A: It was a really, really big problem when we were trying to do, like, the... shoot... like bulk deletes, delete range. When we were trying to implement that, this came up as a huge, huge issue.
A: I think we've talked about it in this meeting before, and Igor had mentioned the possibility of, like, constant background compaction. Right now it's triggered when you end up with, like, a ton of writes coming in, right: the memtables grow big enough, and that triggers flushing and compaction. But what we really need is the ability, when there's, like, a delete workload coming in and you have all these tombstones...
I: Yeah, well, it is interesting. We see in the logs that compactions are happening, but apparently they're not doing the compaction to the extent that it does when we shut down the OSD and do it manually, and I don't know why that is. But we did see, like, in the RocksDB documentation, that having iterators open causes compaction to not be able to delete a lot of the files, because the iterator is holding on to stuff, essentially. So the idea was, maybe, like, we need mutual exclusion when the compaction happens, to make sure, at least at some cadence...
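One detail behind that: a long-lived RocksDB iterator holds an implicit snapshot, and compaction cannot drop tombstones that a live snapshot might still need. A hedged sketch of one way a long listing loop could let compaction make progress, using Iterator::Refresh() to release the old snapshot between passes (the pass size and resume logic here are made up):

```cpp
#include <rocksdb/db.h>
#include <memory>
#include <string>

// Scan the whole DB in passes of ~1000 keys, refreshing the iterator
// between passes so its snapshot does not pin tombstones indefinitely.
void scan_in_passes(rocksdb::DB* db) {
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  std::string resume;
  for (;;) {
    if (resume.empty()) it->SeekToFirst();
    else it->Seek(resume);      // re-reads the resume key once; fine here
    int n = 0;
    while (it->Valid() && n < 1000) {
      resume = it->key().ToString();
      // ... process the entry ...
      it->Next();
      ++n;
    }
    if (!it->Valid()) break;    // reached the end of the keyspace
    // Drop the old snapshot so background compaction can reclaim
    // tombstones, instead of holding one iterator open the whole time.
    rocksdb::Status s = it->Refresh();
    if (!s.ok()) break;
  }
}
```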
A: I think it'll be really important to make sure that the compaction that you're seeing is actually on the same range that you're reading, right? Because you might be seeing compactions, but is it actually compacting the things that you think you're compacting, or are they compacting, like, a different portion of the database?
I: Yeah, I don't know for sure the answer to that, so we'll have to look in more detail and verify that. That's a good point.
A: So yeah, I guess my take on this is: first, we should just make sure that it's actually compacting the ranges that you're holding iterators open for; that'd be the first thing to figure out. But the next thing to maybe figure out would be... sorry, Igor, do you remember, we were talking about this before? I think we were talking about trying to introduce some kind of background compaction after, like, a certain amount... go ahead.
E: So, as far as I understand, offline compaction helped for a while, and it looks like some background compaction is happening online, but it doesn't help. Well, you might want to try one more background compaction, to see if it helps again. And then the question is what prevents online compaction from...
E: And I didn't perform a root-cause investigation, but it looked like active iterators might impact that badly.
A: Some of our worst-case scenarios are where we, like, have an iterator open for a while, and then we end up, like, going back and re-iterating over the same range, and then just, like, going a little farther than we did previously, and we do this over and over and over again. But you can imagine that in between those passes, especially if we're doing work in between, like we're deleting at the end and then we re-iterate, it seems like there should be the ability in between those to be able to do a compaction.
E: But the question is who is responsible for triggering compaction properly, at exactly the moment when you are between the iterators, yeah? If the issue is with open iterators, then it might be inefficient as well.
I: So that's, like, 30 seconds where you have iterators open, and then the client is making calls once every second in our case, so we're never getting ahead, because they're all taking forever. Now you're almost certain to always have iterators open. So if you're doing compactions in the background randomly, there's a really, really good chance that when you try to do it, there's going to be an iterator open, and the problem keeps getting worse and worse, because the queue keeps getting more full and the iterations keep taking longer.
A: Igor, I think Cory is right, right? Like, if you aren't compacting after a lot of deletes... You start out by doing all this iteration and deleting: iterate and then delete, iterate and then delete. But then you're never compacting after the deletes, because there's no write workload, and then it's making the iteration take longer, and, like, it exactly would be a feedback loop, right?
E: And, well, maybe one more question, about data removal. So what is causing removal in your cluster? Are you moving PGs, maybe removing some pools, or is that a regular use case from RGW?
E
So
previously,
the
major
issue
with
bulk
removals
was
due
to
pull
removal
of
pg
eg
moving,
but
honestly,
I've
never
seen
a
user
remove
user
data
removal
to
cause
something
like
that.
I'm
curious!
What
what
kind
of
removals
do
you
have.
I: Yeah, and I don't know the specific answer of what they look like at the lowest level, but it is a Veeam client doing backups, and, like I said, they're doing, like, 50 megabytes per second, so they're doing a lot of writes constantly. And this is a versioned bucket, so I don't know if it's part of the versioning stuff that's kept in omap that is constantly...
I: Supposedly, yes. I mean, supposedly our architects and sales team have talked to them about that, and they think they need that. So...
B: Well, it's really very expensive, and, yeah, it doesn't allow our implementation to scale as effectively as we would hope. If they generate an update in there, at least this could eventually spread out, but the versions of the same object end up on the same bucket index shard, and that's probably not changing anytime soon. That's a concern.
B
But
it
then,
if
they
don't
and
if
they
don't
prove
prune
old
versions,
then,
which
was
that
which
was
then
becomes
their
response,
becomes
someone's
responsibility.
If,
if
this,
but
this
can
be
managed
by
a
policy
and
a
vm-
doesn't
already
do
it,
it
can
be
folded
in
somehow
if
it
could
tolerate
deletion
of
older
of
old
versions
using
the
lifecycle
method
do
so
on
some
more
appropriate
schedule.
That
would
that
could
that
they
could
burn
down
a
lot
of
that.
A: Matt, can you think of a scenario using versioned buckets here where we would see a lot of deletes coming in at the same time, but not a lot of writes?
A: Yeah, yeah. But right now, I think the reality of all this is that we're so reliant on writes to trigger compaction, and we don't do it on... I'm sorry, on deletes, rather, that when we see tons of deletes coming in at once, it can just basically completely destroy iteration performance.
I: Well, we actually have some meetings with the Veeam guys tomorrow; our company is a partner with Veeam and we have some close relationships. So we're actually going to talk about that with them tomorrow, try to understand their client behavior better and whether it can be improved from their side, because we're also unsure why they're doing so many bucket listings.
B: The first kind of implementation of... I forgot what it was actually called... server inventory, which materializes listings, via lifecycle, into objects that are in CSV or Parquet format for us, or it could be ORC up in AWS, yeah.

I: We also have certification by Veeam, but I don't...
B
I've
never
talked
to
veeam
developers
that
I
knew
of
it
would
if
they,
if
they
would
be,
if
they're
prepared
to
use
server
inventory
already,
then
I
probably
have
the
solution
to
this,
because
one
of
the
solutions
is
offline.
A
So
I
I
just
took
a
look
and
there's
in
rocks
db,
there's
what's
called
ttl
compaction,
which
is
basically
just
you
know,
after
a
certain
amount
of
time
and
there's
not
been
a
compaction,
do
it
and
then
it
looks
like
in
the
last
two
years,
there's
maybe
a
way
in
rocks
db
to
compact
on
deletion.
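For concreteness, both knobs exist as column family options in recent RocksDB; a sketch with illustrative (not recommended) values:

```cpp
#include <rocksdb/options.h>

rocksdb::Options make_time_compacting_options() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  // TTL: SST files whose data is older than this become candidates for
  // compaction even if the usual write-driven triggers never fire.
  opts.ttl = 30 * 60;                          // seconds
  // Periodic compaction: files older than this are re-run through
  // compaction (and any compaction filter) on a schedule.
  opts.periodic_compaction_seconds = 60 * 60;  // seconds
  return opts;
}
```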
A: Yeah, I was just trying to see if I could find the way to set that; I don't remember.
A: Oh, Cory, I see that you pasted another... yeah, performance degradation with deletes. That's exactly what we're talking about.
A: Yeah, it's the exact same thing, I think. So yeah, the terms "TTL compaction" and "periodic compaction" are probably worth looking for. Oh yep, and they reference, right at the end of that... February 20th, 2021. So it's, like, a month old. I didn't find many references surrounding NewCompactOnDeletionCollectorFactory; that's the new thing that they're talking about for compacting on deletion.
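That factory lives in RocksDB's utilities; a sketch of wiring it up (the window and trigger values here are made up): each SST file tracks at write time whether any sliding window of N consecutive entries contained at least D deletes, and files that do get marked for compaction on their own, which targets exactly the delete-heavy, write-free case discussed above.

```cpp
#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

rocksdb::Options make_delete_aware_options() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  // Mark a file as needing compaction if any window of 128 consecutive
  // entries contains 32 or more deletion entries.
  opts.table_properties_collector_factories.emplace_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(
          /*sliding_window_size=*/128,
          /*deletion_trigger=*/32));
  return opts;
}
```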
A: So, Cory, I guess maybe the next step would be: let's see if we can figure out if there's some way to tell RocksDB to do some periodic compactions, or use this fancy compact-on-deletion thing. That might be maybe the next step, to just see if that can improve the situation for you. And if we can't make that work, then we might need to adapt it and get something into the code that does this ourselves.
A: Upgrading the RocksDB version is a little frightening. We've done it in the past; granted, this was an intermediate version between the releases we had upgraded to, and it actually introduced a regression that caused data corruption, and we ended up having to put in our own patch that, like, reverted the database version, basically, because it wasn't backwards compatible. That doesn't typically happen with the released versions; that was kind of a no-no on our part, that we upgraded to an intermediate release.
A
So
so
you
know
lesson
learned
there,
but
we've
always
just
since
then
been
a
little
bit
gun
shy
about.
You
know
pulling
in
a
new
rock
cb
version
into
like
a
back
port.
I
I
think
we've
done
it
before,
but
we
really
like
testing
master
before
we.
We
do
that,
just
to
make
sure
there's
nothing
crazy
coming
in.
So
if
we
did
that
we'd,
probably
first
backport
to
like
you
know,
whatever
version
is
being
used
in
quincy
and
backward
that
the
pacific
rather
than
the
newest.
A: You know, it especially tends to crop up if you're doing a lot of iteration, for whatever reason, and it gets mitigated if you're doing writes at the same time, because then you're triggering compaction regularly. So it's really the use case where, if you've got lots of deletes and lots of iteration and no writes, that's when we tend to see this kind of thing.
A
Cool
yeah,
hopefully
we
can,
we
can
get
it
worked
out
for
you
quickly
and
you
know
just
get
this
issue
generally
taken
care
of
it's
big
camp
authority
on
our
side.
So
all
right
anything
else
from
anybody
before
we
wrap
up.
A
All
right
well,
then,
thank
you,
everybody
for
coming
and
have
a
great
week
and
next
week
we'll
we'll
talk
about
gabby
and
adam's
ideas
for
pg.
That's
it
have
a
great
week.
Everyone
bye
thanks
mike.