From YouTube: Ceph Performance Meeting 2022-03-24
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: Well, it looks like we've got people from core now, so I'll get the meeting started. Okay, I didn't have a whole lot of updated or new PRs this week, but that's okay; we're still in the kind of final stages of Quincy here. So the only new one that I saw specifically regarding performance was Gabi's PR.
A: This came out of some testing that I'll talk about a little bit later in the discussion topics, but this is to fix an issue with the NCB ("no column B") code, where it was defaulting to using the bitmap allocator in certain circumstances. So hopefully I'll be testing that pretty quickly here, just to see if it helps resolve the performance regression that we're seeing. There are two updated PRs, just some updates on the tracer code.
A: I don't remember that it was anything real major, just more discussion going on there. And then this MDS PR from Patrick; I think there was a little bit more discussion also going on with that PR. Or actually, perhaps not: when I look at this, I think maybe it was supposed to move down to the "no movement" section. In any event, not a whole lot there. Everything else that I looked through, I didn't see updates for.
A: If I missed anything, let me know; I've got kind of a head cold right now, so I'm a little out of it. Any updates I missed, for anyone?
A: All right, well then, we've got a couple of different discussion topics to cover, so that'll be good. All right, first topic: the Quincy large-write performance regression. We've been doing performance tests on Quincy versus Pacific to make sure that everything looks good for the release, and in aging tests we saw that we were regressing.
A: You can see it in the first link that we've got in the Etherpad; I'll put that in the chat window as well for folks here. But it's pretty obvious, the regression there. It took a little bit of work to narrow it down to NCB, but we're pretty sure that's what it is now.
A: Hopefully Gabi's PR will resolve it, but if not, then we'll have to make a decision about how to proceed: whether we want to try to get NCB fixed quickly before we release, or whether we disable it for the first release and then fix it and re-enable it later in a follow-up release.
A: Let's see. So one offshoot of that testing that was kind of interesting is that I went back and actually looked through both Pacific and Quincy with the different allocator implementations, doing the same kind of aging stress test. It's in the same document, in another tab, but I've put the link directly to the tab in the chat window too, if folks are interested. There's a really big variability in the kind of aging performance that we see in this hypothetical aging test with the different allocator implementations. I found it really, really interesting.
C: But "stupid" is reserved for the allocator.
A: You know, the funny thing, though, is that the stupid allocator's not doing terribly in these tests; it's actually performing fairly well. And I will say, though, that this is only a 3.2 percent fill of the disks. My intention, if I had time, was to go back and do the same iteration of tests with much fuller drives, but I haven't been able to do that.
A: Not yet; I've been just too busy trying to get this other set of testing done. But it is kind of interesting that stupid is doing pretty good in these. Having said that, the hybrid allocator back in Pacific, at the very least, was quite good, right?
D: Then I have a comment about that: it looks like Pacific 16.2.7 and prior versions lack at least one, well, they actually lack a set of fixes for the AVL and hybrid allocators, and I could see a pretty significant performance drop with this version of Pacific and the hybrid allocator when the disk is highly fragmented. Okay, and the disk is highly fragmented here, and that's the only disk for those tests.
A: I had a feeling that if I did more tests with the disk extremely full, these results might look a little different; I just hadn't been able to run them yet. But yeah, this is a very small overall data set relative to the size of the disks, but with a lot of rework over that data set.
D: Just an example: database compaction using the hybrid allocator took 10 times more time, something like 50 minutes, versus, let's say, five minutes using the stupid allocator. (Wow.) I could see up to 100 millisecond latency per single allocation for BlueFS when using the hybrid allocator.
D: I can later show you the cluster latency when using the hybrid allocator versus a subsequent switch to the stupid one.
D: Yeah, bitmap is somewhere in between. So it's okay; it's better than the hybrid allocator in this case, but the stupid one is even better.
A: Well, good, good! Okay, I'll take down a note to keep that in mind. It wouldn't be worth doing a bunch of testing on 16.2.7 if it's just telling us something we already know and have fixed. Okay, any thoughts or any other questions?
E: For anyone? I'm still trying to understand the second graph. What's the difference? There are two graphs, and the one showing the "60 NVMe BlueStore allocator aging test"...
A: Oh shoot, I didn't fix the labels on those. I'm sorry, Gabi; it's supposed to be 4KB, and I just didn't get the title and the labels fixed. I apologize. Okay, there we go; that should now hopefully make it more clear what's going on.
A: So, Gabi, you were explaining what your PR does; you had started a little bit. Do you want to continue that and explain? Yeah.
E: Just a quick explanation: the way NCB keeps the knowledge, when we are writing to disk, of whether we keep using RocksDB or not, is by overloading the freelist-manager type. When we move to NCB, I was setting the value to null, and then later I would put it back to the bitmap allocator.
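A minimal sketch of the mechanism being described, with invented names (the real BlueStore code differs): the freelist-manager type persisted in the superblock doubles as the NCB marker, and the fallback path is what forced the bitmap allocator regardless of configuration.

    #include <iostream>
    #include <string>

    // Hypothetical sketch: "null" as the stored freelist type means NCB is
    // active and no freelist is maintained in RocksDB.
    std::string pick_allocator(const std::string& freelist_type,
                               const std::string& configured /* e.g. "hybrid" */) {
      if (freelist_type == "null") {
        // The defect the PR addresses, as described in the meeting: this
        // path ignored the configured allocator and always chose bitmap.
        return "bitmap";
      }
      return configured;  // honor the configured bluestore_allocator
    }

    int main() {
      std::cout << pick_allocator("null", "hybrid") << "\n";  // prints "bitmap" (the bug)
    }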
E: Without the new PR, which I just made and sent to you, when you use NCB you always get the bitmap allocator; you cannot use anything else. So I don't know how come we are seeing different numbers here, because you are showing that on Quincy, AVL has different numbers than bitmap and than stupid, but I don't see how that's possible, because I think you always...
A: Yeah, I mean, looking at the allocator tests, with Quincy, when you choose hybrid, it's significantly slower than everything else with NCB enabled. So to me, that seems to indicate that this is something that goes beyond just which allocator is being used. But, you know, we can try the...
B: Well, I will chime in; my thinking was, I don't know what's going on with the copy, but with NCB enabled we basically disable all functionality in the freelist manager, so there is only one allocator, exactly the allocator that tracks and gives out disk space, and on shutdown we iterate this allocator just to persist the free space to the store. That's my thinking; that's why I explain it that way.
B: I don't really follow the PR that was proposed, and I did not want to, well, criticize it until I understand it, because I'm definitely not understanding the issue here.
E: Okay, sorry, I was just... I'm now back on, so I might be missing something here. But on startup, when we read the freelist type from disk, from the super/meta column family, I was always setting it, no matter what was inside: if it was NCB, I would always set it to bitmap. But could the freelist type also have values like AVL or hybrid?
D: So the difference, again, is during regular operation, which is write handling. With NCB we do not have any different or additional logic; we just remove the freelist manager from the path. How could it be slower than before?
D: But again, NCB on just removes the freelist manager from the path and removes the DB update.
A: So, Gabi, the plan I had before you had your PR out was to go back and rerun the tests, but then start investigating the behavior of the OSD and BlueStore while it's happening. Maybe I'll revert back to that plan if we don't think that your PR will necessarily fix it. What do you...
A: These tests take about two hours to run, so... just compile. And I had to modify your PR a little bit to get it to apply to Quincy; it looks like it must depend a little bit on stuff that's in master.
B: Well, Mark, I can offer my help today to keep digging into it, but for now I don't have anything; I don't think we should continue talking about it. There is something there; clearly Pacific was faster than Quincy is now. But one thing that stands out: you are testing it on a very fast machine. You have numbers that I would never be able to get.
A: It shouldn't be that bad to just run through a test with it, and then after that, assuming that it hasn't fixed everything, I'll go through... I think I will just run a quick test, seeing if I can investigate what's happening inside the OSD in the test that I actually have right now. But I don't know; maybe it makes sense to see if I can replicate it on a single OSD. We'll see; I'll decide, I guess, when I get there, but one way or another I'll try to do both.
D: Well, in fact, you don't need these eight iterations. A single... well, at least two iterations is enough to see the difference.
A: Yeah, I started out with three. The first time we saw it was with three, and that was almost by accident, because when you just did one, it wasn't consistent: sometimes it looked fine and sometimes it was bad.
D: Well, I'm curious whether, if there is no interleaving of megabyte and 4K chunks, it is still reproducible after a while or not. So is it really this chunk-size interleaving which causes that picture, or just writing for long enough?
A: I don't know for sure. I crafted that benchmark because I have another set of benchmarks, just kind of a standard set that iterates over a couple of different test types, so random reads and random writes at different IO sizes, and that was where it was first showing up. So I figured that this was probably a smaller set of tests I could do that would make it show up, and it very much did. I haven't tried it with just doing block...
A: All right, well, I think we've probably beaten that one to death. I'll keep working on it, Igor and Adam and Gabi; I'll keep you guys in the loop and let you know what I see.
A: All right, yeah, let's move on then. Okay, Gabi and Adam, since you're both here, do you guys want to talk a little bit about your idea for PG log optimization?
E: So it's still not fully baked, and we do wish to get more ideas from everybody who is more familiar with the code. But there are two parts: the PG log and the PG info. The PG info is the easier of the two. We have very few of them, so the number of objects is minimal; it's maybe 100 or 200 PG infos.
E: There are two types of information: there is the global state, like the snap version and such, and there is a lot of statistics and information that we update on every IO. The first type we could maintain, and it's no big deal; we have very few objects. We could probably just keep them in the default column family as they are. I think they already are, but if not, we can put them in the default column family: very few objects, and they are relatively big, but they don't update often.
E: The cost of executing a full RocksDB update on every IO is very expensive. The PG info is a relatively big object, and when you try to push it to RocksDB, you need to serialize all the fields, so everything has to be copied and merged and encoded and decoded and such. And all we get from this is just some information that we have anyway and don't really need to make persistent, because that information is part of the global state.
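A rough sketch of the cost shape being described, with hypothetical stand-in types (the real PG structures are much larger): every per-IO mutation re-encodes and rewrites the whole info blob even though only a counter changed.

    #include <cstdint>
    #include <map>
    #include <string>

    // Hypothetical stand-ins for the real PG structures.
    struct PGInfo {
      uint64_t last_update = 0;
      uint64_t num_objects = 0;
      // ... many more fields in the real structure ...
    };

    // Full serialization of every field: the expensive part being described.
    std::string encode(const PGInfo& i) {
      return std::to_string(i.last_update) + "," + std::to_string(i.num_objects);
    }

    struct KVStore {
      std::map<std::string, std::string> kv;
      void put(const std::string& k, const std::string& v) { kv[k] = v; }
    };

    // Today (as described): each IO pays a full encode plus a DB put.
    void on_io_today(PGInfo& info, KVStore& db) {
      ++info.last_update;
      db.put("pginfo", encode(info));  // whole object re-encoded per IO
    }

    // Proposal: mutate in memory only; persist on clean shutdown, and
    // rebuild the statistics by scanning onodes after a crash.
    void on_io_proposed(PGInfo& info) { ++info.last_update; }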
E: So what we could do instead is make the PG info just keep the big state object, the global state like snap info, while the per-IO updates are kept in memory. When we shut down, we should persist it to a file, like we do with the allocation map, and restore it from the file. And if we had a disaster, we are already iterating over all the onodes...
E: So while iterating, we could accumulate all the statistics that the PG info is holding: how many onodes you've got, how many shards, how many blobs, how much allocated space, how much in-memory space. All this kind of information could be rebuilt on the fly. Recovering the allocation map is what adds a significant amount of time; this thing, I think, is going to add maybe one second to the whole process, because we iterate over the onodes anyway, so you just need to aggregate information.
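A minimal sketch of the rebuild-by-scan idea, with hypothetical names: since the NCB recovery path already walks every onode, the per-PG statistics can be folded into the same pass at near-zero extra cost.

    #include <cstdint>
    #include <vector>

    // Hypothetical onode summary visible during the existing recovery scan.
    struct Onode { uint64_t allocated_bytes; uint32_t num_blobs; };

    struct PGStats { uint64_t objects = 0, blobs = 0, allocated = 0; };

    // Fold statistics into the scan the allocation-map rebuild already does.
    PGStats rebuild_stats(const std::vector<Onode>& onodes) {
      PGStats s;
      for (const auto& o : onodes) {   // one pass, shared with NCB recovery
        s.objects++;
        s.blobs     += o.num_blobs;
        s.allocated += o.allocated_bytes;
      }
      return s;
    }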
E: That's about all we must do, and I don't think there's anything we can't recover from the objects. Josh mentioned today that we already have some kind of a scrub task which can rebuild the PG info state. So again, if this scrub really does a good job, it means that the state can be recreated from everything else that we have in the system, and we should not be paying a per-IO update cost.
H: Oh, I was going to wait until you got to the PG log part. Okay. As far as I recall, the only bits that matter to me there are the last-update and stats. You're right, we could rebuild them from the store, but the primary value has been the ability to cross-check them against the contents of the store; that's how we catch most bugs, whether in the store or the OSD.
H: There are two obvious pieces of information that get updated on every IO. There's the last_update field in the info, which obviously we can rebuild if we have a way of rebuilding the PG log. The other piece, the stats, is indeed duplicative of the contents of the store, but that's been a feature, not a bug, in the past, as it has allowed us to detect a lot of bugs.
H: I said there are two things, obviously, that we update on every IO: last_update, which we can reconstruct if we have a way of reconstructing the PG log, as it's simply the last version of the log. The other is the soft stat information. You're right, it's duplicated with the contents of BlueStore; it's just that the fact that it's duplicative has allowed us in the past...
H: Sorry, the worst mic. So there are two things that obviously get updated on every IO, and some other things that I think are less important. There's the last_update field, which is basically just the version at the end of the log; so obviously, whatever your strategy for the log is would also apply to that, so we'll ignore it for now.
E: Two things I'm saying. Onodes: there is no way around it; they're always going to be committed to RocksDB. But do we need to commit the PG info after every IO?
E: I'm not saying that's the only problem; what I'm saying is it's a problem, and I think we should address the problems one by one. I should probably come back to you with numbers, like we did before, and try to remove the PG info from the write path and see what kind of performance increase or decrease we get. I'm expecting to see some benefit. My take on this is that there are two things; the biggest one is, well, I don't know about the biggest, but one big one is encoding.
E: Encoding the object is a very CPU-intensive operation, and it also consumes memory operations. The second one is the RocksDB update operation; it's also expensive. That thing depends on how many objects you've got in the memtable: the more objects you've got, the more operations you have to do. But really, at the moment everything is speculative; we can just run the test with the PG info update disabled and see what kind of...
A: But I want to highlight what they say all the time: don't look at performance, look at behavior, right? The reason why we see performance gains is because of the single kv sync thread, and any CPU time savings there gives us a win. That's why any of this stuff kind of matters. But it's not really good enough just to look at performance wins; we want to look at the behavior, right? We can get performance wins in other ways.
H: It's genuinely useful having PG stats that are maintained independently of the contents of the store, because it allows us to detect bugs. I'm willing to get rid of them if there's a good reason that they're expensive to maintain, and we expect that reason to be durable across multiple ObjectStore implementations.
E: Okay, so let's go now to the PG log. So, the PG info: I need to measure, because at the moment everything we talk about is just theory. If we see that the performance impact is significant, then we can go back to this and see how much of it we should remove. And then, going back to the PG log: the PG log is a different kind of beast.
E: We really abuse a mechanism. RocksDB was never meant to hold something like the PG log, because RocksDB doesn't assume, by design, that objects are deleted so frequently. I actually tried to push a change to RocksDB to allow an object to be marked as single-occurrence, so that when you delete it, you can delete it inside the memtable. They refused to take it because...
E: The argument was that we abuse the system; the system was not designed for this, so they don't want this optimization. From their perspective, we should be doing something else; we should not be using RocksDB for this, but we do. So the problem is, in RocksDB we create a PG log object and later we create a tombstone.
E: Even if the tombstone finds the object inside the memtable, it's not going to eliminate it; it would still go to disk, to level 0. RocksDB does not support in-memory deletions; nothing is deleted in memory, only on disk when they do a merge. I don't know if the move from the memtable to level 0 does the delete, or maybe from level 0 to level 1...
E: I am not clear about it, but that's something we know is a problem, because tombstones always go all the way down, and since we are pushing so many of them, it becomes expensive. Again, it's a theory. Some performance analysis, I think by Mark and others, showed that you can get about 40 percent extra IOPS.
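To make the tombstone pattern concrete, here is a small RocksDB example of the lifecycle being described (the key names are illustrative only): every PG log entry is a put followed shortly by a delete, and the delete is itself a record, a tombstone, that must flow down through the levels during compaction before the space is reclaimed.

    #include <cassert>
    #include <rocksdb/db.h>

    int main() {
      rocksdb::Options opts;
      opts.create_if_missing = true;
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/pglog-demo", &db);
      assert(s.ok());

      // The PG-log-like pattern: short-lived keys, written once, deleted soon.
      s = db->Put(rocksdb::WriteOptions(), "pglog.0001", "entry payload");
      assert(s.ok());

      // This does not erase the key in the memtable; it writes a tombstone
      // that is only resolved during compaction, level by level.
      s = db->Delete(rocksdb::WriteOptions(), "pglog.0001");
      assert(s.ok());

      delete db;
    }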
E: Yeah, so again, it's worth repeating the test, because for everything we do now we need numbers. But I think there's a significant gain to be made there. And the next thing, as a side effect: once we are able to get rid of the PG log and PG info, mainly the PG log in that case, then the next goal would be to shrink the memtable.
E: If you do a write, if you write to disk, and SSDs now use 4K or even 8K sectors, then every 32-byte write would become 8K.
H: Okay, so the traditional solution to that problem would be: you would add an interface to ObjectStore that defines a special buffer object, with a special interface and a special transaction operation, that allows BlueStore in the back end to aggregate these very small writes into the journal and...
D: So, at this point I'm pretty sure we can use that new write-ahead log I'm working on for these sorts of things. And even that issue you mentioned with space amplification, write amplification, is probably not that critical, since you have to merge your PG log updates.
D: So you need to track them as a single transaction: you merge this data before writing to the write-ahead log, and hence you don't need to write just these 16 bytes you mentioned. You would write both the onode metadata update and the PG log update to the write-ahead log, and then not commit that to RocksDB.
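A sketch of the batching being suggested, with hypothetical structures: the onode metadata update and the PG log entry for the same IO travel in one WAL record, so the tiny log entry never costs a device sector by itself.

    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical WAL record: one IO's onode update and PG log entry
    // are merged into a single transaction before hitting the device.
    struct WalRecord {
      std::vector<std::string> onode_updates;  // metadata deltas
      std::vector<std::string> pglog_entries;  // ~tens of bytes each
    };

    struct Wal {
      std::vector<WalRecord> records;
      void append(WalRecord r) { records.push_back(std::move(r)); }
    };

    void commit_io(Wal& wal, std::string onode_delta, std::string pglog_entry) {
      WalRecord rec;
      rec.onode_updates.push_back(std::move(onode_delta));
      rec.pglog_entries.push_back(std::move(pglog_entry));
      wal.append(std::move(rec));  // one device write; nothing goes to RocksDB here
    }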
E: That should be doable. I mean, I'm not saying everything here is extremely doable; the design, the theory here, is simple. I think the implementation is going to be a bit tricky, because you need to know when you are able to release the write-ahead log, because the write-ahead log is...
H: Wait a minute, that's not quite the proposal I had. My suggestion is that we go ahead... So your concern is that if we only write this to the write-ahead log, then we have to keep the segment of the write-ahead log that we wrote the PG log entry to until that PG log entry gets trimmed, right? Yes, exactly. So...
H: The natural solution to that is to keep a buffer of the portions that are only in the PG log, or in the WAL rather, and write them to a real object when they get trimmed. Yes, it means we write twice, but you get to do the writes in big pieces, so the write amplification component isn't a huge deal. It's slightly less efficient, but it entirely removes the coupling between the PG log entry lifetime and the WAL lifetime.
H: Yeah, so my suggestion is: the ObjectStore interface defines a special object type, a log object, which you write sequentially. Every PG gets one, and every PG structures its log updates as writes, or as appends, to this log object.
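A minimal sketch of that interface shape, with invented names (the real ObjectStore transaction API is different): the log object supports only sequential append and tail trim, which is what lets the backend treat it specially.

    #include <cstdint>
    #include <string>

    // Invented interface sketch: a per-PG log object supporting only
    // sequential append and tail trim, so the backend can aggregate.
    class LogObject {
    public:
      // Append one encoded PG log entry; returns its offset in the log.
      virtual uint64_t append(const std::string& entry) = 0;
      // Drop everything before 'offset' (trimmed PG log entries).
      virtual void trim_to(uint64_t offset) = 0;
      virtual ~LogObject() = default;
    };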
A: Sam, can you talk a little bit more about ensuring that the cadence is right, and making sure that in the fast case we're not ending up in the scenario where we're, you know, just passing things through?
H: So, specifically, what we're talking about is a versioned log, right? Updates to BlueStore add things to the end of the log, or trim from the log, or both. In memory we keep a record of which entries are currently live for each of these objects. Remember, there are only like 100 of them, so we can afford to spend a very small amount of memory state on this.
H: As long as the head and tail bounds are within the current WAL, we don't need to do a writeout; it's only when we trim a WAL entry that contains something in that range that we need to do a writeout. So it's not that we ensure that the cadence is correct; we simply ensure that if it is, we don't do extra work.
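A sketch of that liveness check under assumed names: per log object we track head and tail offsets, and a WAL segment can be retired without a writeout only if no live byte of the log still depends on it.

    #include <cstdint>

    // Hypothetical per-PG tracking: which byte range of the log object is
    // live, and which WAL segment the oldest live byte resides in.
    struct LogLiveness {
      uint64_t tail;             // oldest live offset (advances on trim)
      uint64_t head;             // newest written offset (advances on append)
      uint64_t tail_wal_segment; // WAL segment holding the tail
    };

    // When retiring WAL segment 'seg': if the live range still has data in
    // it, that data must be written out to the real object first.
    bool needs_writeout(const LogLiveness& l, uint64_t seg) {
      return l.tail < l.head && l.tail_wal_segment <= seg;
    }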
H: The PG info now simply remembers the head and the tail of that structure, of where we are in that buffer. We simply write to the head of the buffer and trim from the tail. From BlueStore's point of view, this simply looks like a sequence of random writes to a 16 MB object, right? And what is BlueStore going to do? It's going to write these writes into the WAL, and then later it'll do a deferred write, at least in the most common configuration.
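A sketch of the circular-buffer view (sizes and names assumed): appends wrap within a fixed 16 MB object, so from the backend's perspective they are small writes at rotating offsets.

    #include <cstdint>

    constexpr uint64_t kLogObjectSize = 16ull << 20;  // assumed 16 MB hint

    // Hypothetical: map a monotonically increasing log offset to a position
    // in the fixed-size object; appends simply wrap around.
    struct CircularLog {
      uint64_t head = 0;  // next append offset (monotonic)
      uint64_t tail = 0;  // oldest live offset (monotonic)

      uint64_t append(uint64_t len) {
        uint64_t pos = head % kLogObjectSize;  // where this write lands
        head += len;
        return pos;
      }
      void trim_to(uint64_t offset) { if (offset > tail) tail = offset; }
      bool full(uint64_t len) const { return head + len - tail > kLogObjectSize; }
    };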
H: So the problem with this approach is that we're going to end up writing out a lot of entries that don't turn out to be live, because by the time the WAL gets around to that point, we'll probably already have trimmed that PG log entry, so the write we're doing is pointless. With me so far?
H: Yeah, okay. So the reason why that's happening is that BlueStore doesn't know that we trimmed the entry. It doesn't know that it's an entry at all, right? It just sees some very small write. So to fix that, we change the ObjectStore interface to make explicit the fact that we're sending versioned updates and that we're trimming versioned updates.
H: So now we're still semantically doing the same writes; we're still sending the same writes we would have sent before, but now we also send a piece of metadata that allows BlueStore to know "oh, this is version 2000", and we're also sending things that say "by the way, trim up to version 1796".
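A sketch of what those two extra hints might look like on a transaction, with invented names: the payload write is unchanged, but it now carries its version, and a separate op advertises the trim point.

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Invented transaction ops mirroring the description: each log write is
    // tagged with its version, and trims are first-class operations.
    struct VersionedWrite { uint64_t version; std::string payload; };
    struct TrimTo        { uint64_t version; };  // "trim up to version 1796"

    struct Transaction {
      std::vector<VersionedWrite> writes;
      std::vector<TrimTo> trims;

      void log_write(uint64_t v, std::string p) { writes.push_back({v, std::move(p)}); }
      void log_trim(uint64_t v)                 { trims.push_back({v}); }
    };

    // Usage for one IO: append version 2000, advertise trim point 1796.
    //   Transaction t;
    //   t.log_write(2000, encoded_entry);
    //   t.log_trim(1796);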
H: So now, when we're doing our deferred writes, we'll do two things. One, we're not going to try to do deferred writes for objects in this category until as late as possible, until we're actually trimming the entry from the WAL in the first place; and two, we won't bother to do them at all if we've already trimmed the entry in question.
H: So we're still writing to the data payload of an object, and this is still an internal detail of BlueStore. But for a very quickly updated PG, one that updates at a cadence roughly in line with or faster than the WAL, you don't end up doing the writes. On restart, you simply replay the WAL, you rebuild the in-memory buffer representing this object, and when the PG goes to read it, you pull it out of memory.
E: So the idea was to walk over all the objects and find... every object has the version number, and then you see how many versions you miss. If you miss that many versions, somebody needs to push them to you, which in the worst case is a full resync. I mean, it's not going to be a full resync; it will only be for every object you missed.
H: So now we get into reasons why Gluster's recovery strategy doesn't work very well. It gives you no way to detect divergent updates; that's the biggest problem. One of the things the log gives us is the ability to detect when an OSD comes back up after peering has happened and has updates that never actually happened.
H: With scanning the local OSD's object store to rebuild the log, the challenge there is that not all log entries are creations or mutations of objects. We have some log entries that encode things related to tiering; those would still need to be written down. They wouldn't happen in the common path, but they would still need to be supported. But the real...
H: Yes, which means, from a complexity point of view, in addition to still needing to maintain all of the code we currently maintain for the PG log, we would also need to maintain code for rebuilding the mutations. So that may be a viable strategy, but my suggestion is that we find a way to make writing the log down cheap; that would be a lot easier.
B: Yeah, Sam, I've got one question: how much improvement do you think would justify that additional complexity, exactly in the form of, or similar to, what you just proposed?
B: I was just asking for your feeling, because if we do something similar to what you proposed, we will have some additional complexity, and my question was how much improvement in IOPS or throughput would justify that additional complexity.
E: Okay, what about another approach, in which you sync everything you do with everybody around you; you tell them what you do. So for the PG log, you don't commit the entries locally; you use some kind of network memory, so it's going to preserve something, and every 100 entries, once you reach 100, then you do one commit, whatever.
H: Yes, but I'm telling you that it's extremely common. So, as a general rule, I'm going to start by reasoning about the failure case and then worry about optimizations. Reasoning about the failure case here means that all of the OSDs would have to do a scan to find the most recent update, the most recent several updates, in fact.
E: But sorry, I'm talking about performance from a customer perspective, since delete is not all they do; a customer doing just that wouldn't be doing anything, it wouldn't be a customer. But I'm saying, and I'm talking now about performance, not about code writing: when we do a delete, we will use exactly the same code that we use today. If there is no delete, then we could buffer things until we get a delete, and anything else which is complicated requires special handling.
H: Yes, I understand that it'll be faster. So let me observe that that particular proposal does not in any way require network memory.
H: All you need to do is, on startup, rebuild the PG log from what's in BlueStore. Yes, right? Yes. So that's theoretically possible. I mean, I'm not massively opposed to that strategy, because it does not change recovery protocols: once you've booted up, you have the same structures in memory that you normally have, and in that sense it's exactly as persistent as a normal PG log.
H: Yes, right. So that's tricky, because now we actually embed metadata that allows recovery to only recover the mutated portions of an object, which was a pretty substantial recovery performance win, if I recall, and that information is embedded in the PG log entries and cannot be recovered from the objects. So that's a challenge.
H: It's not a property of the write when it happens, but the other...
E: Yeah, if you got opt-in, then you'd have a very inexpensive delta.
H: Yeah, it makes sense. The OSD does not care whether BlueStore is writing the PG log entries to a RocksDB entry or not. I mean, it does at the moment, because it's using the omap interface, but there's no reason we actually have to do that. If we instead write to a circular buffer, with some cleverness to avoid writeout, then from the OSD's point of view what we have is a log object that it just reads like a big buffer on startup; from BlueStore's point of view...
A: Right, Gabi, we're not doing it the way that we're doing it right now. The way that Sam's describing this, we'd be writing this out to, like, a static set of BlueStore objects. Or, you know, it doesn't have to be static, but it could be; it doesn't matter, it's an implementation detail inside BlueStore. This is removing this from the RocksDB WAL.
H: Or, to put it another way: right now, when the PG is created, what we do is create a PG metadata object with an omap region that we use primarily for the PG log. The keys are the eversions, the values are the encoded buffers, and the data payload of the object is, I think, the PG info, I can't remember. But instead, what we'll do is make a new object, a PG log object; we will give BlueStore a 16-megabyte allocation hint, and then from then on, forever...
E: But the object is going... okay, so now we need to maintain... how do we do that? Because the object is going to have PG log update, update, update, PG log delete, delete, delete inside the same object. You might have an uneven number: the number of opening brackets and closing brackets might not be the same, so you still might owe some delete operations.
E: Okay, and you say, since this thing is working... sure, there is some cleverness happening underneath, because somebody needs to manage what happens when stuff is on disk and in memory, and whether you have duplication, and how you remove things. Because once you've committed stuff to its own disk, you need to know that this thing should not go against the disk again. I mean, if you put...
H: Well, I mean, again, the external interface of BlueStore requires that that part work. The challenge is, it won't be efficient, at least not particularly efficient; it will be better than what we have now. I think the first step is to go ahead and implement this thing I'm describing in a PR. It's been done a few times in the past, because RocksDB being slow at doing PG log entry updates is a pretty common observation.
H: So let's go ahead and do that experiment. But there is a way to improve this if we want to make BlueStore more complicated, which we can discuss later. But yeah, the simple version is just building on things BlueStore already does; it would require no changes in BlueStore whatsoever.
H: But that's the difference. I'm just going to tell you, you need to read the RADOS paper. There are a whole bunch of reasons why that won't work, at least not simply. The most obvious one is that you can't actually rebuild the PG log that way, because you won't have the intermediate log entries, so it would break a lot of stuff in the OSD.
H: I'll tell you exactly why we did it: we assumed RocksDB would have a good implementation, which it sort of does, in some ways; RocksDB does a moderately okay job of this. But once you saturate the OSD hard enough, as you guys have observed, the tombstones make garbage collection a problem. So it turned out to be a bad choice, but it's an easily undone one.
A: All right, well, have we beaten this to death for this meeting?
H: Oh no, I don't want to do that, because if we do it that way, SeaStore has to duplicate that trick, and it seems really fragile.
H: The second piece is that the omap isn't actually constrained in the way we want it constrained. There's nothing stopping the OSD from removing omap entries from the middle of a PG log, which, for what we want to do here, would be bad; that would break any optimization you were trying to do. So that imposes a restriction on the OSD's behavior that wouldn't be obvious from the OSD code. Since we actually control the code in both places, I'd really rather just fix it. Does that happen?
H: If you want to poke something like this up through RADOS, that's a different discussion; that's really a question about protocols. Okay, that's not quite the same thing, but yeah, okay, it would be an important building block, yeah. Any RADOS users that are attempting to use the omap to do exactly this would be running into the same problems the PG log is: they'd be generating spurious tombstones we don't really want, and otherwise generating pointless traffic in RocksDB. So, yes.
H: There's a simple version that shouldn't be very complicated; I'm wondering if there's a way to do better, but no, I don't think it's a big deal. Okay, what you have in your head right now as the way this would work in RADOS, at least at an interface level, is probably correct.