From YouTube: CDS Reef: Performance
Description
The Ceph Developer Summit for Reef is a series of planning meetings around the next release and some community planning.
Schedule: https://ceph.io/en/news/blog/2022/ceph-developer-summit-reef/
A: This week for pull requests, I saw two new ones. The first is Corey's excellent investigative work and subsequent PR for setting upper and lower bounds on RocksDB omap iterators. Casey is reviewing that and offered some really good suggestions there for avoiding copies and memory leaks. So that's good. Corey, maybe we can talk about that after we go through pull requests, since it deserves a lot of time.
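The idea behind bounded iterators can be modeled outside RocksDB. This is only an illustrative sketch over a `std::map`, not the actual PR; the real change sets `rocksdb::ReadOptions::iterate_lower_bound` and `iterate_upper_bound` so RocksDB can seek straight to the range and stop at its end instead of scanning (and prefix-checking) keys, and potentially tombstones, that lie outside it:

```cpp
#include <map>
#include <string>
#include <vector>

// Model of a bounded range scan: seek directly to the lower bound and
// stop as soon as the key reaches the upper bound. Keys outside
// [lo, hi) are never touched at all.
std::vector<std::string>
range_scan(const std::map<std::string, std::string>& kv,
           const std::string& lo, const std::string& hi) {
  std::vector<std::string> out;
  for (auto it = kv.lower_bound(lo); it != kv.end() && it->first < hi; ++it)
    out.push_back(it->first);
  return out;
}
```

The same shape applies to omap listings: the caller already knows the object's key prefix, so both bounds are known up front.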
A: The other new one is only tangentially performance related: a PR came in to allow RGW to use DAOS as a back end. For folks that don't know, DAOS is kind of an offshoot of Lustre; they actually used Lustre as some inspiration when they started writing it. It's an object-based store that is really focused on high performance, tied to Optane drives to some extent, but it's still fairly experimental. From what I can tell, they only barely have replication at this point, but it is there, and it is really high performance in some sense.
A: So it looks like someone is working on trying to make a storage abstraction layer for DAOS in RGW. That's really interesting. It'd absolutely be interesting to see what that does, how it works, and how it compares to our own OSDs, and maybe we'll learn something from it. So definitely interesting work there.
A: There were two closed PRs this week. One was from me, so I'll talk about it a little bit later on as well, but this is basically the minimal fix that we can implement right now for what we did in a previous change to the AVL allocator.
A: The problem is that if we gave up on a search, we wouldn't update the cursor position, so every subsequent allocation request would continue from that same position and fail. This is kind of the minimal PR to change that behavior, and it fixes some issues that we saw in Quincy.
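The stuck-cursor failure mode can be sketched in a few lines. This is a hypothetical toy allocator, not Ceph's AVL allocator code; it just shows why failing to rewind the cursor on a failed search makes every later request resume at the same dead end:

```cpp
#include <cstdint>
#include <map>

// Toy cursor-based first-fit allocator over free extents (offset -> length).
struct CursorAllocator {
  std::map<uint64_t, uint64_t> free_extents;
  uint64_t cursor = 0;  // where the next search resumes

  // Returns the allocated offset, or UINT64_MAX on failure.
  uint64_t allocate(uint64_t want) {
    for (auto it = free_extents.lower_bound(cursor);
         it != free_extents.end(); ++it) {
      if (it->second >= want) {
        uint64_t off = it->first;
        uint64_t len = it->second;
        free_extents.erase(it);
        if (len > want) free_extents[off + want] = len - want;
        cursor = off + want;  // resume just past this allocation next time
        return off;
      }
    }
    // The fix described above: rewind on failure so the next request
    // rescans from the start instead of resuming at the same dead end.
    cursor = 0;
    return UINT64_MAX;
  }
};
```

Without the `cursor = 0` line, a single failed search would leave the cursor parked past usable extents, which is the repeated-failure behavior the PR addresses.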
A: I think that's been reviewed and there have been some updates, though I'm not sure exactly what. And then there's Adam's PR here. I think parts of that maybe have been merged in other PRs, and it looks like Matt's just asking here if there's anything left from it. Adam, is Adam here? Yeah, Adam! Did I get that right?
B: There are things left from that, but they pretty much need to be redone. Casey made a much better string iterator class that's more idiomatic, so yeah, most of that can probably go, and it should also probably just be broken out.
A: Okay, cool. Lots of stuff in the no-movement category; I don't think there's anything here that's super pressing.
A: Okay, so someone asked in the discussion topics if we should continue the PG log discussion. I suspect maybe that was Gabby, but Gabby is not here yet, so maybe not.
A: Sure, it's possible too that core standup is still ongoing for some, so he might be here in a little bit. Let's just wait a little bit on that one. Okay, I'll move on to the next topic then: recapping the AVL allocator changes. I kind of already talked a little bit about what we did for this minimal change. I still think there is a valid reason to change the way that we are switching into best-fit mode from near fit.
A: Even with this kind of minimal change, we still see ourselves giving up on near fit really quickly and often, sometimes even on 4K searches, and it's just a result of the limits that we have in place.
A: We can increase those limits, but it's really unclear how many bytes forward we should search, or how many iterations we should allow in near-fit mode before giving up. So I have this other PR that basically changes to a time-based search; I'll copy it and paste it in the chat window here. Personally, this makes a lot more sense to me. Instead, we're basically just limiting the search to a certain amount of time, and I think you can go down to microseconds.
A: I don't know exactly what resolution it goes down to, but certainly down to the microsecond level we can say: okay, this is how long we want to search in near fit before giving up. And to me that's a lot easier to tune.
A: If we tune it to around 100 microseconds, we see that the number of near-fit searches that fail goes way down.
A: If we set it to something like 10 microseconds, we actually have, I think, fewer misses than we do currently, when we're only searching something like 16 megabytes forward or 100 iterations. But it's still pretty high; you can kind of see that the curve goes up.
A: It probably looks a little bit like an exponential curve. So I'd argue that at the very least we should increase the defaults, and preferably, from my standpoint, we change to a time-based search so that we're not trying to tune on multiple axes at once.
D: Well, maybe we can support both limits for a while, just to make sure this time-based limit works properly, and then maybe get rid of the bytes limit.
A: Look at the way I wrote it: I tried to avoid that by starting out doing eight iterations per timer check, and then, as you move on, you increase the number of iterations per timer check, assuming that if you set a higher limit you can go farther per check at the timer.
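The scheme described above (a wall-clock budget, with the clock checked only every few iterations and the batch size growing each check) can be sketched roughly like this. The names and structure here are invented for illustration, not taken from the actual PR:

```cpp
#include <chrono>
#include <cstdint>

// Run `step()` until it reports success or the time budget runs out.
// Returns the number of iterations on success, 0 on timeout (the
// caller would then fall back to best-fit mode). The clock is read
// only once per batch, starting at 8 iterations per check and
// doubling, so a slow run still bails out early while a fast run
// pays for very few clock reads.
template <typename StepFn>
uint64_t timed_search(StepFn step, std::chrono::microseconds budget) {
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + budget;
  uint64_t iterations = 0;
  uint64_t batch = 8;  // iterations between timer checks, grows each check
  while (true) {
    for (uint64_t i = 0; i < batch; ++i) {
      ++iterations;
      if (step()) return iterations;  // step() returns true on a fit
    }
    if (clock::now() >= deadline) return 0;  // give up
    batch *= 2;
  }
}
```

As noted in the discussion, this trades a little precision for overhead: a run can overshoot the deadline by up to one batch of iterations.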
A: I mean, you could try to probe how many iterations you can do in a certain amount of time and then iterate at that resolution. We could do it that way; I don't know that we need to. This is a little sloppy, and you see that sometimes you go a little over your timer setting, you might be like 10 over or something, but is that that big of a deal? I don't know.
G: Yeah, I like Matt's idea of adapting the number of steps of the algorithm by time: you adapt how many steps you do per unit of time, you just make it adaptive, and that will basically give us just one or two timer reads per action.
A: Adam, the reason I don't necessarily like that, unless you're rechecking it periodically, is that if all of a sudden you start going slow, you might have a lot of iterations to go through. The way I've written it, you start out small and grow it each time, so that if it's going slow for some reason, you start out with a low number of iterations per check.
A: That's why I like restarting this each time. Even though, yes, you might have a couple more time checks involved, at the rate we're going right now it doesn't really matter, honestly.
A: You can tune it though, Matt, right? If we get to the point where we're fast, you can increase the minimum number of iterations per time check. The way it's written right now, that's tunable. It shouldn't need to be tuned until we actually see it being a problem, but right now we're doing eight iterations minimum per time check, up to a maximum of, like...
A: Yeah, and keep in mind here too that the overhead for a large IO is really different than for a small IO. If we add a couple of nanoseconds to a four-megabyte allocation search where we're already descending down something like an 800,000-node AVL tree, the cost of that search in the AVL tree is the dominating factor. Whereas for a 4K IO we might not have to search nearly as much, but we actually, surprisingly, do a lot of searching in that tree.
A: It's not good, but that's a different topic. Even there, presumably you'd have a much faster time of finding a slot, finding space, and hopefully you're not doing as many iterations, which means you're not calling the time lookup as often. Actually, one thing we should look at is maybe doing the upfront search of eight cycles without doing any time lookup at all. That would be an optimization for the first pass.
F: Sounds like Adam's got some other interesting points there: a giant AVL tree that everything hits.
A: Yeah, so as I've looked at this, I've been kind of eyeballing it and thinking maybe we need to rethink some of our strategy with some of the allocators here, but that's a much bigger topic. Very minorly here, what I want is to make it fairly easy for people to tune: they shouldn't have to understand how far they want to search in this tree or how many iterations that is, which is so meaningless to tune. You know, it'd be better.
A: If, with some degree of accuracy, they could just say, okay, here's how long to spend in near fit before we give up, that's easier, at least for me, to wrap my head around. But the bigger picture of all this, I think, is that we need to avoid doing really deep searches in this tree, and maybe we shouldn't be doing searches in a tree like this at all, at least not an AVL tree. But that's a much bigger discussion, a much bigger topic, maybe.
A: In any event, the fact that we're seeing misses with an empty disk for 4K allocations is surprising. I see us actually giving up after searching forward 16 megabytes, or even more than 16 megabytes, and then going into best fit for a 4K allocation. Something still seems kind of off here to me.
A: But in any event, more to do there, definitely. So we have everyone from core standup now. Oh, actually, does anyone else want anything regarding allocation stuff? I think we probably covered most of what I wanted to talk about there anyway.
C: It just seems like a good topic for CDS. Gabby, I understand you did some testing where you eliminated the PG info writes?
H: I did eliminate it, but I was testing something else. I realized that disabling that column family, while increasing IO performance, caused some write amplification, about five percent extra write amplification, which made absolutely no sense to me, because we write less data to RocksDB. How come it generates...
H: A change in the ratio: the write amplification for 4K writes is about five percent higher when we don't put the allocation map in RocksDB. Now, the thinking is that by stopping the allocation map from going to RocksDB we made ourselves faster, and by being faster we increased the distance between PG log creation and PG log deletion, because we are now able to push more information in the same amount of time, and that means the PG log entries now exist in more levels.
H: So to test this idea, I disabled the PG log, and with the PG log disabled I tried once writing the allocation map to RocksDB and once without it. Now, I got this result: I was expecting that when allocation maps are not going to RocksDB, we'd be doing five percent less disk writing. Does that make more sense?
H: But it's not. If you use the mechanism we use for the deferred write, and Adam, please correct me if I'm wrong, it means you need to update RocksDB, and RocksDB would write to the write-ahead log. But my understanding is that with a deferred write, when we do an update we generate yet another object; every update is just yet another object.
C: The advantage is that it's going into the write-ahead log, so it has the same lifetime semantics as the write-ahead log. If...
C: It's because the lifetimes are far shorter. If it's actually still a problem, if trimming the write-ahead log creates the same tombstone problems that the PG log does, then part two would be updating BlueStore to write directly to the write-ahead log. I understand it's possible to co-opt RocksDB's own log.
C: Yeah, so this is doing it in two different steps. The proposal is to use the data payload portion of BlueStore objects to store the PG log; that's part A. Yes, immediately, in the very short term, it would still be written to the write-ahead log keys in RocksDB, but further optimizations to BlueStore would improve that pathway as well, eliminating that component.
C: So I think what I'm saying is there are two problems here. The first is that we are writing the PG log into RocksDB directly as its own keys. That part is a problem because it more or less guarantees tombstones. So the first step is to change the BlueStore interface we're using to do writes, to use a normal object payload.
H: Right, the write-ahead log protects you against failures, as you understand, of course, and it allows you to bundle multiple updates together. At the moment, what we do is write the onode, PG info, and PG log as a single update to the write-ahead log, and the onode we know that we need.
H: So we piggyback the PG log on the onode's write to the write-ahead log, and that's the reason we assumed that pushing it to RocksDB was going to be very cheap: the write-ahead log is the actual cost, and this thing is piggybacked on the onode. What we didn't take into consideration is that eventually the memtable gets flushed, and a tombstone follows for every object.
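The flush-time cost being described can be modeled with a toy memtable. This is not RocksDB internals, just a sketch of the mechanism: every key that is written and later deleted leaves a delete marker that must still be written out at flush so older levels learn the key is gone, which is why "cheap" piggybacked PG log keys still cost on the flush path:

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Toy memtable: a value of nullopt is a delete marker (tombstone).
struct ToyMemtable {
  std::map<std::string, std::optional<std::string>> ops;

  void put(const std::string& k, const std::string& v) { ops[k] = v; }
  void del(const std::string& k) { ops[k] = std::nullopt; }

  // At flush, live values go to an SST; delete markers become
  // tombstones that are written out too, one per deleted key.
  size_t flush_tombstones() const {
    size_t n = 0;
    for (const auto& [k, v] : ops)
      if (!v) ++n;
    return n;
  }
};
```

Even if every PG log entry is created and trimmed before the flush, the trims still materialize as tombstones.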
C: There is no benefit, that's what I'm saying. So what I'm saying is: we are using a key range in RocksDB as a write-ahead log. Alternately, RocksDB has an interface it uses to access the sequential-write file that it uses for doing the write-ahead log in the first place; we plug that through to BlueFS. What I'm suggesting is that if writing the PG log via RocksDB keys becomes too expensive, we could bypass that layer and write it directly to the underlying journal.
H: Yes, it is doable, but it's very tricky. That's one idea we suggested in the past, and the logic is very clear, it should fix the problem, except that that code is tricky to get right. Now, there is another option I tried to suggest that keeps the same semantics. The option goes like this: bundle PG logs from different PGs, don't differentiate between PGs, and create a new mechanism, a PG log repository, once a request arrives to the system, to the messenger.
H: But eventually, when this thing reaches execution, it's going to say: I need you to commit it. By that time, the new mechanism would see how many objects it got, bundle all of them together, and create a single update of a bigger PG log. It's going to be a PG log container; it's just going to flush everything it got into a single object.
C: That has a bunch of other problems. For instance, you don't actually know what order those things are going to go to disk in. We haven't done the work yet to find out whether we can even serve that IO synchronously; it may still need to block on recovery or any of the other ten things, so it's much more complicated than that.
C: That may be, but I'll point out that it does exactly the same thing you would achieve if you instead wrote those PG log updates to the write-ahead log and then batched them after the fact.
C: So if we batch up, let's say, eight client writes, those eight client writes cannot commit until their corresponding log entries come out. So one way or another, we have to write those PG log entries with the corresponding object writes. The whole point of using a journal, or one of the optimizations available to you if you use a journal, is that you don't actually have to do that: you can perform the writes as they show up, with lower latency, and then retire them after the fact.
H: No, I do not. I don't understand: what is it that you batch in your design? At the moment, what we batch together is the onode, PG log, and PG info, and if we happen to have more of them, then we do them too. But usually, because of the way we split things on PG boundaries, if you've got 128 PGs, then you're never going to have more than two of them on the same PG, never more than two active together.
G: I'm just thinking of maybe some solution that might be more feasible with the current architecture. In RocksDB there is an ability to insert a merge operation. Could it be...?
G: Basically, I guess I'm asking: do you think it would be possible to somehow keep having as many objects as we have PGs, and create a merge operation that will basically modify that PG info and log state somehow? Then on writes we would only add that merge operation into the write-ahead log, and if something goes wrong and we have to recover from failure, we reconstruct the entire PG info and log state after re-reading the write-ahead log. Do you think that's the...
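The merge-operation idea can be sketched with a toy model. This is not RocksDB code (the real hook would be a `rocksdb::MergeOperator`); all the names here are invented for illustration. Instead of rewriting the whole PG info/log value on every write, each write appends a small operand, and recovery rebuilds the full state by folding the operands over the last persisted base value:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-PG log state.
struct PgLogState {
  std::vector<std::string> entries;
};

// One small record appended per client write.
struct MergeOperand {
  std::string new_entry;  // entry added by this write
  uint64_t trim_to = 0;   // drop this many entries from the front
};

// Fold one operand into the state: the "merge" in merge operator.
void apply(PgLogState& s, const MergeOperand& op) {
  s.entries.push_back(op.new_entry);
  if (op.trim_to > 0 && op.trim_to <= s.entries.size())
    s.entries.erase(s.entries.begin(),
                    s.entries.begin() + static_cast<std::ptrdiff_t>(op.trim_to));
}

// Recovery: replay the logged operands over the last persisted base.
PgLogState recover(PgLogState base, const std::vector<MergeOperand>& wal) {
  for (const auto& op : wal) apply(base, op);
  return base;
}
```

The appeal is that steady-state writes only touch the log of operands; the full value is only materialized at compaction or recovery time.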
H: I think what you suggest is something we tried, and that's how this whole thing started. We tried recycling the PG log keys: we had a mapping from the real PG info to the PG log key, so we kept recycling the same keys and just kept doing updates. And at that time, this thing actually caused a performance degradation.
H: My thinking now is that even if you keep recycling, at some point the memtable is going to reach the disk, because every time the write-ahead log is filled and then removed, everything is flushed. So by recycling and never deleting (at the time, we did recycle and never delete) I think maybe that was the problem; maybe we should have introduced a secondary garbage collector.
H: We kept, I think, three thousand entries in the PG log table, and everything was remapped, so we never had to do a delete. But what happened is that when the write-ahead log was removed, this thing arrived at the disk, and then a few seconds later those things reached the disk. So every few seconds we created entries which were never removed.
C: So let me ask about what you were talking about before: how hard actually is it to modify the way we do deferred writes so that they don't become RocksDB keys?
H: It is possible, but once you do that, you need to take control of the flush operation, because once the write-ahead log is filled it gets removed, and you need to know that it's been removed, because once it's removed you need to store whatever you have in memory in some other place. The write-ahead log is not persistent storage; it just persists the writes.
A: I think that Igor's plan to try looking at implementing our own write-ahead log outside RocksDB has a lot of merit.
H: Yeah. I still don't know Igor's work plan, but if Igor would in fact replace or separate the RocksDB write-ahead log, then we'll be positioned to make the other change. And I think, after making the first change, separating the write-ahead log from RocksDB, then giving us a way to store PG logs without pushing them to RocksDB is an incremental improvement which should not be very hard.
J: Yeah, I thought Gabby had said that he moved the PG log out and it increased the number of IOPS being used by twenty percent, the IOPS that we were seeing on the object.
A: One early attempt at this, too: Lisa from Intel had tried to just write out PG log updates to 64K allocations in BlueFS, and she was not seeing any benefit to it, so she gave up on it pretty quickly. But it was never clear to me whether or not that was really a good test, and it sounds like, Igor, you're having much better success with yours.
A: All right, well, we're at the hour, guys. Corey, if you're still here, I apologize we didn't get to all of your excellent work. I will have it as the first topic for next week, if that's okay with you.
A: All right, well, we are rapidly losing people. Oh good, Corey, okay, great, thank you. I want to do your work justice, because it's excellent. So next week, first thing, let's talk about your work with RocksDB iterator boundaries, and then potentially we can continue this PG log discussion as well after that. So thank you everyone for coming, have a great week, everybody, a happy holiday weekend for those that are celebrating it, and we'll meet next week.