From YouTube: 2019-12-10 :: Crimson SeaStor OSD Weekly Meeting
A
Okay, last week I was on PTO, and I just started working on reviewing brunei's PR to bring the support of stock to crimson. I think I will finish reviewing the PRs preparing the alienized BlueStore soon. But I think you could amend the commit message with the details and the strategy of your change; that would be helpful.
C
Yeah, so for me, like same as last week, I have been working on CBT. I'm actually at the debugging stage right now, just figuring some stuff out. I'm really close to finishing and being able to present the CPU cycles per operation; then I'll make a pull request. Currently it's on my fork and I'm just fixing some problems, and after that I will continue working closely with Radek.
B
So I had to disable that logger in the alienized BlueStore, and then I replaced the code. I got some compile errors when I used the Seastar allocator, so I have asked Radek to help me solve that problem.
B
When I use the Seastar allocator there are some compile errors that I don't really understand; currently I just read the code. There was a "use of deleted constructor" error; that one was fixed with Radek's help. Radek told me how to fix it, but there are still some build errors when I use the Seastar allocator.
B
So I think, once the build errors are fixed, maybe the release version will work, because the memory issue, I think, I have solved.
C
Hello, I'm on a video, so I'm working in a kind of tourist mode here. Okay, first of all, I'm working on a problem in the CBT op metric, and I also posted a new series of patches for the persister work according to Harvey's comments, and added some new comments there.
C
Yeah, and I need to backport the non-Crimson change; I need to backport the zeroization, the typicalization, to Nautilus. That's all.
C
I also added this better shutdown code, and all of this is working. If I get the comments right, I'll deliver everything today, including the clang-format script I used for these files. So if anybody wants to tweak the clang-format file, it will be part of the commit message. Other than that, I'm reading and trying to learn things related to Seastar, and now this document.
D
Sam, do you want to do the status things first? That way I can just dig right away into the document.
D
For crimson, at least, I've been mostly working on the SeaStore design doc. I did have a quick question; well, I guess I'll talk about that at the end. Let me see if I can get this shared, half a sec.
D
Okay, so this won't take super long, because the overall concept is pretty high level at this point. But the right sort of design guideline to think of here is btrfs on a log-structured file system.
D
So if you know anything about btrfs, it uses essentially a wandering-tree sort of deal with a single superblock that it periodically writes out so you can find the root, but it does not attempt to do the log-structured thing where it always cleans a segment before it writes back to it. So we're going to make a few changes so that we are able to do that part efficiently, but other than that, the overall design should look at least a little bit like btrfs.
D
It clearly doesn't share any structures with btrfs, as we are not implementing a file system, and there are a lot of things we don't need to do, but the overall tree structure is analogous, I think.
D
Who just joined? ... No? Anyway, welcome. Okay, so the first part is just some details about the physical layout on disk. I think this shouldn't be terribly surprising.
D
The goal here is twofold. One, we want to be able to physically read the disk and interpret which things are data blocks versus which things are metadata, just at a coarse level. So each... oh, and I made a few choices for terminology: rather than talking about segments or zones, I'm talking about streams, but for "streams" you can definitely read "zone" if you want, that's fine; blocks mean aligned... go ahead.
D
Here it's not necessarily just two, and this is per shard, by the way. So if we have, let's say, 32 cores on a disk, that would mean 64 streams open, two per shard. But I really want to...
D
I am not putting a flag in the ground about this. All I'm trying to point out is that there are at least two different regimes: disks, where we probably don't want more than one, and SSDs, where more than one is probably fine. Depending on exactly how many streams we can productively have open, we will make different choices, so the metadata layout needs to be generally flexible enough to permit writing to more than one stream at once.
D
Does that make sense? Okay, thanks. Yeah, thanks. I'm specifically trying not to introduce details of the physical structure of the disk beyond the assumptions we actually care about, which are sequential access, pretty much, and that reads are generally cheaper than writes.
D
I'm going to point out the following things that are maybe not super obvious. For a single shard (and for the rest of this conversation we'll just talk about a single reactor, so a single core), all operations are serial, and we're not going to make statements about what things transactions can span. So all transaction deltas have to live in a single stream. I'll let that settle for a bit, but it does not follow that every block has to be in the same stream.
A
If I understand correctly, what it means is that a transaction should only write to a single stream, instead of scattering different changes to different streams just because they belong to different trees.
D
By contrast, deltas are what we would normally consider in a file system to be a journal, and this part is sort of a departure from btrfs; as far as I know, btrfs doesn't work exactly like this. I like this design because it makes it easy to overlay different internal structures. I'll talk about that a bit more.
D
But the idea is that if we want to write out a transaction that changes the allocation tree, or, let's say, we want to write out a transaction that creates an object: what do we have to touch? We have to create a new block containing the data for that object; let's say it's like a 2K object.
D
So all of those different updates need to land in the same data block and get written at the same time, because they have to happen atomically: during recovery we have to recover either all of them or none of them. Does that make sense? It's why journals are always written to a contiguous segment of disk.
A
I'm not sure I agree that it's essential to use a single monotonically increasing sequence number for a transaction.
D
If you go look at RocksDB, what you do is: the journal agent inside of the file system or storage system writes to a declared contiguous portion of disk and then writes a checksum at the end. When you're doing recovery, you read journal segments until you find something that doesn't checksum right, and that's when you know you've reached the end of the journal. That way, every journal transaction you read is either all there or you completely ignore the whole thing; you never apply a part.
D
That's just sort of basic WAL stuff. So in this design, that unit is a record containing all of the deltas that comprise the transaction, and, optionally, any blocks that you want to include. You may have chosen to write out blocks to another stream prior to this commit, but those blocks won't be committed until you commit the deltas that link them into the tree.
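As a sketch of the commit unit being described (all names invented here, not actual SeaStore code): a record carries the transaction's deltas plus any blocks written alongside, and is sealed with a checksum over the whole thing, which is what makes replay all-or-nothing.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical record layout: the unit of commit. Deltas are
    // unaligned (they get decoded on replay anyway); blocks are aligned
    // so they can be loaded straight into the cache.
    struct record_t {
      uint64_t journal_seq;         // position in the journal
      std::vector<uint8_t> deltas;  // encoded mutations for this txn
      std::vector<uint8_t> blocks;  // optional blocks committed with it
      uint32_t crc = 0;             // covers everything above
    };

    // Any real checksum works; FNV-1a stands in to keep this
    // self-contained.
    inline uint32_t fnv1a(uint32_t h, const uint8_t* p, size_t n) {
      for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 16777619u; }
      return h;
    }

    // Seal before writing. Recovery recomputes the checksum; a mismatch
    // means a torn record, which marks the end of the journal.
    inline void seal(record_t& r) {
      uint32_t h = fnv1a(2166136261u,
                         reinterpret_cast<const uint8_t*>(&r.journal_seq),
                         sizeof(r.journal_seq));
      h = fnv1a(h, r.deltas.data(), r.deltas.size());
      h = fnv1a(h, r.blocks.data(), r.blocks.size());
      r.crc = h;
    }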
D
It's more like it's forward-allocated; different file systems have different terminology. It's definitely free: we definitely wrote a block that is available for writing; we haven't overwritten anything, but we haven't updated the tree so that it can actually be found, so that part is what the transaction that commits it is for. To do that, we need to update the leaf in the allocation tree where the old block used to be.
D
We need to update the leaf in the allocation tree where the new block is now. We also need to update its parent pointer, let's say in the extent tree for a particular onode. So those three changes all have to be written atomically for the move to be atomic.
D
Otherwise we could end up with two pointers to the same place, or a block that is part of the tree but which we consider to be free, either of which would be really bad. Those would be, you know, corruption.
C
Okay, and then, two questions. First: don't we end up with a block that is forever lost to us? Any further work on that block will require this...
C
Yeah, what I was saying is that this block should be marked as only partially allocated, because... yeah, but we...
D
...if it was sealed properly, that is, we got to the end, we wrote a checksum, and we closed it. And if not, we're going to walk backwards or forwards through the stream from the first record until we find an incomplete record; then we'll close the stream with information indicating where it really ended, and that's it. So that block that we wrote will not be in the allocation tree, so it's going to be free; it's no different from a block we wrote and then overwrote later.
D
So the fact that we wrote to the block does not mean that we can actually find it again. We'll keep in-memory state saying that we did in fact write to this block, so future writes should be after it, but that's it; we don't have to remember anything else.
D
This is the core concept for the rest of it: each one of these deltas is basically an atomic modification to a block. So it's worth, I guess, talking about how recovery works. When we recover, first we scan each stream quickly to find all of the open ones; then we figure out where the end of the most recent journal is; then we replay from the beginning of that journal...
D
...until we get to the end of the journal. Assuming that algorithm works correctly, every block that was dirty when the crash happened should now be present in cache again, with the exception of anything that was only dirty in memory because we hadn't actually committed the write yet. And it should be atomically correct: we shouldn't have any partial transactions, because we won't have read any partial records, because they're checksummed.
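A minimal sketch of that recovery loop, again with invented names; the important part is that a bad checksum ends replay, so a torn tail is ignored wholly:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct delta_t {
      uint64_t paddr;               // block this delta mutates
      std::vector<uint8_t> bytes;   // encoded mutation
    };

    struct record_t {
      std::vector<delta_t> deltas;
      bool crc_ok;                  // result of re-checksumming on read
    };

    struct JournalReader {          // yields records in journal order
      std::vector<record_t> records;
      size_t pos = 0;
      std::optional<record_t> next() {
        if (pos == records.size()) return std::nullopt;
        return records[pos++];
      }
    };

    // Replay from the head of the newest journal epoch: every block that
    // was dirty at crash time becomes dirty in cache again.
    template <typename Cache>
    void replay(JournalReader& j, Cache& cache) {
      while (auto rec = j.next()) {
        if (!rec->crc_ok) break;              // torn tail: end of journal
        for (auto& d : rec->deltas) {
          auto& block = cache.load(d.paddr);  // read block if not cached
          cache.apply(block, d);              // redo the mutation
        }
      }
    }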
D
That's just one of the immutable laws of this kind of storage. The good news is it's not so bad: we can batch up multiple transactions into one record if we happen to have them available. So if the OSD is actually busy, we don't have to write out an empty record; we can write out one with several different transactions baked into it.
D
And the deltas don't have to be aligned. I'm going to point out that the data blocks have to be, because we want to load them directly into memory and it's inefficient to load unaligned stuff from NVMe; but deltas we have to decode anyway, so we don't really care about that.
D
That is, it knows if it's a... it just has a little integer that says: am I a B-tree interior node? Am I an extent? Am I a block? Am I whatever? Then code can go look at the block itself and figure out what it needs to do with it. In general, however, we figure out whether a block is in use by using a secondary structure that I haven't introduced yet.
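Sketching the little type tag just mentioned, with hypothetical names; note that liveness is deliberately not answered here but by the secondary structure (the allocation tree):

    #include <cstdint>

    // Hypothetical per-block header. The tag tells code reading the disk
    // what the block is at a coarse level; it does not say whether the
    // block is in use.
    enum class block_type_t : uint8_t {
      ROOT,             // special: meaningful even though it is not in
                        // the allocation tree
      ALLOC_TREE_NODE,  // special: the allocation tree can't contain
                        // itself
      ONODE_TREE_NODE,  // B-tree node of the onode tree
      EXTENT_NODE,      // extent-tree node
      DATA,             // object data
    };

    struct block_header_t {
      block_type_t type;
      // ... per-type metadata follows ...
    };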
D
That hobject, if you've looked at the OSD code, translates to a seven-tuple that begins with a hash and a pool id. So we traverse the hobject tree from the root that we have in memory down until we get to the onode, which has the extent tree in it. We then traverse the extent tree until we get to offset 256, which should point to an actual on-disk block containing the data for that block. We then read the block.
D
Some of this may be in cache; some of it won't be. Any prefix of the tree will always be in cache, so if we have any block in cache, we have all of its parents in cache too; not an uncommon design choice, honestly. So much of that traversal will be free, and then the rest of it turns into log(n) reads.
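Compressing that read path into one hypothetical helper (the types are placeholders, not real crimson interfaces): any prefix already cached is free, and the remainder costs O(log n) reads.

    #include <cstdint>

    // hobject -> onode -> extent tree -> data block.
    template <typename Cache, typename OnodeTree, typename Hobject>
    const void* read_block(Cache& cache, OnodeTree& onodes,
                           const Hobject& hoid, uint64_t offset) {
      auto onode = onodes.find(hoid);                 // onode B-tree
      auto paddr = onode.extent_tree().find(offset);  // extent tree
      return cache.read(paddr);                       // cache hit or disk
    }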
C
Okay, and here we have a problem: we have a larger size for the log, which is written to disk, compared to the cache that we have.
D
It's no different from doing writeback in any other file system; a little more expensive, a little less expensive, it depends. So let's say we have a dirty block and we're experiencing cache pressure. What do we do? We start at the leaf node and we begin writing out dirty nodes.
D
Each time we write out a node, we of course dirty the parent, because we have to change its location. But, generally speaking, we'll write out in a breadth-first... my bad.
D
Whichever one it is, the one where you do the leaves first (I want to say breadth-first, but whatever): we write out leaves before parents. That way, if we do that properly, we'll tend to amortize several updates per parent when we finally do the write-up. It sounds like a lot of write amplification, and it is, but the truth is that every metadata technique involves a lot of write amplification.
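A sketch of that writeback order with an invented node type; the point is just that relocating a child re-dirties its parent, so flushing leaves first amortizes several child updates into one parent rewrite:

    #include <vector>

    struct node_t {
      bool dirty = false;
      node_t* parent = nullptr;
      std::vector<node_t*> children;
    };

    // Post-order flush under cache pressure: children before parents.
    template <typename WriteFn>
    void flush(node_t* n, WriteFn&& write_out) {
      for (node_t* c : n->children)
        flush(c, write_out);          // leaves first
      if (n->dirty) {
        write_out(n);                 // relocates n on disk...
        n->dirty = false;
        if (n->parent)
          n->parent->dirty = true;    // ...so its parent must be updated
      }
    }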
C
I have probably a more basic question: if you have cache pressure and you decide to write a block and all its pointers...
A
Yeah, that's good until... that node does not need to be split or anything, okay.
D
And that's a real modification; that's an actual real thing. So when we do that, we get to do the same thing: we record a delta for... like, let's say we need to do a B-tree split. That works fine: the code that controls the B-tree allocates two new blocks with the two split halves and writes them to wherever, and then, either as part of that transaction or... sorry.
D
Precisely, and that's why I'm choosing to be really agnostic about how the deltas work. It means that, as we write the tree structures that we use for our metadata lookups, we don't have to reinvent... we don't have to write a whole new allocator each time; we can just use this one, and it's easy to combine all of the different transactions into one transaction.
D
Any other questions? Okay, so the rest of this I'm not attached to; I'm just trying to work out which on-disk state I need to do garbage collection. But, broadly speaking, there are metadata structures that need to be in the stream so we can figure out things like... So for the journal stream: since we only have one journal per CPU, we could just number them. Each journal epoch gets a number, and when we start writing out the next journal epoch...
D
...eventually we finish, and we write a record in that stream that says: okay, this is the real new journal now. That implicitly deallocates all previous journals, which means that all we ever have to keep track of, for a stream, to find out whether its deltas are still relevant...
D
...is the newest journal id that has deltas in that stream. Because if we're on journal sequence 10 and we're looking at a stream that only has journal ids up to nine, then we know that none of the deltas matter, and we don't even have to look at them.
D
I'm sorry, that's what I meant to say; that's a good question, I meant epoch, yeah. So if we're on epoch 10 and we're looking at a stream where, within that stream, it has no deltas for a journal sequence above six, then we know that none of those deltas matter; we only have to worry about garbage collecting blocks.
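The per-stream bookkeeping this implies is tiny; as a sketch (names invented):

    #include <cstdint>

    // Newest journal epoch that wrote deltas into this stream. If we're
    // on epoch 10 and a stream's deltas stop at epoch 6, none of them
    // can matter for replay; only the stream's blocks still need GC.
    struct stream_info_t {
      uint64_t newest_delta_epoch = 0;
    };

    inline bool deltas_relevant(const stream_info_t& s,
                                uint64_t current_epoch) {
      return s.newest_delta_epoch >= current_epoch;
    }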
D
Okay. I will also point out that, just by nature, the beginning of every journal must contain the roots of the tree.
D
So necessarily there's a top of the tree, which I'll get to in just a second. It's basically the address of the top of the onode tree and the top of the allocation tree. So we need something that references those from the beginning of the journal, because other than that they're always in memory.
D
We will record special deltas that tell us where the roots are, but honestly that's not even necessary; we don't ever have to move the root. If you think about it, because the root is necessarily very, very small, a delta to the root and the root itself are the same thing.
D
So I think, instead, what we're going to do is mark... like, this header thing here: we'll have some metadata for each block, and there will be two block types that are special. Root blocks are special, and you have to look at them whether they're in the allocation table or not, because they won't be; and allocation tree blocks you have to look at, because allocation tree blocks are not themselves in the allocation tree. But that's just a detail of how to bootstrap the garbage collector.
D
So, at a high level, if we look at the logical structure (all of this is overlaid on what I described before): the root is a single block, and it has references...
D
It has physical references to the top of the onode tree and the top of the allocation tree. The onode tree is a map from hobject_t to onode_t; that is, it is a B-tree with leaf elements that are encoded onodes. And the allocation tree is a tree with keys that are a tuple of a stream id and a stream offset, which is to say a physical address, and the tree leaves are an allocation unit, or an allocation-information thing, which needs to have a few things in it.
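The logical structure, sketched as types; these names are illustrative, not from the design doc itself:

    #include <cstdint>

    struct paddr_t {            // physical address
      uint32_t stream_id;
      uint64_t stream_offset;
    };

    // The root block: physical references to the two tree roots.
    struct root_block_t {
      paddr_t onode_tree_root;  // B-tree: hobject_t -> encoded onode
      paddr_t alloc_tree_root;  // tree keyed by paddr_t
    };

    // Leaf value of the allocation tree: enough backpointer information
    // to find every block that references this physical address (for an
    // object's data block, just the object name and offset within it).
    struct alloc_info_t {
      uint64_t length;
      // ... backpointers ...
    };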
D
I'm not going to go into a lot of detail about that. What's important is that if you look in the allocation tree and find a physical address, the information recorded as back pointers is enough to find every block that points to it.
D
For a data block for an object, it basically is just the object name and the offset within that object, because, if you think about it, descending from the root with that is enough to find the block.
D
Moreover, the allocation tree is conservative, or rather is consistent, at all times. So if it's in the allocation tree, it really is referenced by the onode tree; and the flip of that is that any non-root, non-allocation-tree block that is not in the allocation tree is free to be reused. That's really important, because, if you remember, before I mentioned that for every byte we write, we need to garbage collect one byte.
D
But the hope, if we've done our jobs well and the workload has behaved, is that by the time we choose to clean a stream, most of the blocks in the stream have already been replaced, because we're mostly trying to clean streams that have short-lived data.
D
For the most part, streams that have long-lived data we're not going to want to touch; they'll only become free very slowly, so we wouldn't touch those if we don't have to. So, mostly, when everything is working well, we're garbage collecting streams where most of the blocks have already been replaced because they were overwritten or something.
D
So it's a bunch of CPU and read overhead to answer a question we were already pretty sure was false. So it's, in my opinion, probably worth maintaining a secondary structure, namely this allocation tree, that gives us an almost constant-time ability to scan forward in a stream to find the next live block.
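A sketch of why that scan is cheap: the allocation tree is ordered by physical address, so "next live block at or after this offset" is a single ordered lookup. std::map stands in for the on-disk tree here:

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <utility>

    using addr_t = std::pair<uint32_t, uint64_t>;  // (stream id, offset)

    // Find the next live block in a stream while cleaning it, without
    // re-deriving liveness from the logical trees.
    inline std::optional<addr_t>
    next_live(const std::map<addr_t, int /* alloc_info_t */>& alloc_tree,
              uint32_t stream, uint64_t offset) {
      auto it = alloc_tree.lower_bound({stream, offset});
      if (it == alloc_tree.end() || it->first.first != stream)
        return std::nullopt;   // nothing live left in this stream
      return it->first;
    }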
D
I was going to do that in the onode; I figure the onode will just track mutation-region recency.
D
Right, because when we're traversing down to the onode, we have to load the onode, so we can keep the most recent write time in the onode. That way, when we're doing the relocation, we'd have a pretty good sense of when the object was written and when it was last rewritten. But, yeah, we can embed additional information in the allocation tree too; I'm open to all of that.
D
Well, no, sorry; when I say RGW objects, I mean the actual data objects.
D
That's a good question, though; that's a tuning question we'll have to think about. And if we want to, we can also stamp these blocks, as we write them, with the timestamp we wrote them at, and then when we pull them back off disk we can go: oh okay, this is however many seconds old.
D
So if we did want to make that happen, we could; that would allow us to detect, like, cold subtrees. So let's say someone's using RGW for timestamped data, and it happens to be the case that their object names are sorted by time.
D
...like user-provided file names with no locality properties of any kind, in which case the key space just kind of grows, with keys inserted wherever the hell they want to; and in that case the B-tree as a whole actually mutates pretty quickly, and none of the nodes would be stable. So both are possible.
D
Okay. The btrfs version of this is the backref tree, if you're interested; in fact, I'm going to link the paper in here after this meeting, I forgot to. It's a good paper, or, well, it's an okay paper; it does an okay job of explaining the concerns. Anyway, right.
D
This GC part is really the Achilles' heel of every log-structured file system ever created, and we're making sort of one other change that systems usually don't do: I'm not really assuming the existence of asynchronous garbage collection. I am assuming that we will actually do it inline with every client operation.
D
If all used space is live, then there's no garbage collection work to do unless you're out of space; and if most of your used space is dead, you should do some garbage collection work. So we could create a range where you're allowed to have up to 20 percent of dead data, provided you have the free space, and when we're not servicing transactions we'll go ahead and do GC.
D
But if you do have a consistent, permanent client workload, then you'll quickly get to that 80 percent threshold, and every single client transaction will also have GC work mixed in. That's the reason why I really wanted this sort of online GC structure, where all we have to do is maintain a very little bit of state about the allocation tree and the stream we're currently cleaning, and we can always squeeze in a little bit of extra work into each transaction that moves a block when we need to.
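A sketch of that policy; the 20 percent figure is from the discussion, while the rest of the numbers are illustrative tunables, not decided values:

    #include <cstdint>

    struct gc_policy_t {
      uint64_t used_bytes = 0;
      uint64_t dead_bytes = 0;   // used but no longer live

      // Past the allowance, GC work is mixed into client transactions;
      // below it, GC only runs when the OSD is otherwise idle.
      bool must_gc_inline() const {
        return dead_bytes * 5 > used_bytes;   // more than 20% dead
      }

      // Units of cleaning work to piggyback on one client transaction,
      // growing as dead space accumulates.
      uint64_t inline_work_units() const {
        if (!must_gc_inline()) return 0;
        return 1 + dead_bytes * 5 / used_bytes;
      }
    };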
D
Does that make sense to you guys, both how we're doing it and why? Yes? Cool. I'm also really hoping this will help, because right now we're having this problem with BlueStore where we don't have a good way of assigning a cost to a transaction, because it's wholly dependent on BlueStore's internal state: we have no idea whether it's about to do a compaction or not, and we don't know how much work it's built up. This way it should be really easy to account; we'll always know.
D
...that just really exploits the pmem to the greatest amount possible. So, for that reason, if you look down at components, I plan on abstracting all of these things out. Like the allocation tree: with this design as outlined here, it has a dependency on the stream manager stuff, but that doesn't have to be true if it's backed by pmem.
D
So, in that case, you take the allocator part and the onode part, and possibly even the omap part, depending, and you would replace them with implementations that use pmem instead. GC would still consume the allocator to figure out which blocks need to be cleaned, but it would never have to clean any of the metadata trees; it would only ever be cleaning actual data blocks.
D
I know it's fuzzy, but I hope that was enough of an answer for how we plan to evolve in that direction.
D
That's what I was going to do next: I was going to start actually writing code, well, at least interfaces, and start writing the stream manager component, and then develop a Gantt chart for which parts can be developed in parallel. In other words, the brief answer to your question is that nothing I've addressed here has anything to do with persistent memory.
D
You're correct, the journal would be another component that could live up there; I forgot to add a journal part.
D
And the other thing is the cache: if the dirty blocks in cache are actually persistent, you don't really have to write them back.
D
The cache is obviously very important for how this works; I was a little sketchy there, but the goal clearly is for each core to have its own pool of memory and...
D
...allocate strictly out of that, and then we'll use some variant of, or perhaps literally, Mark Nelson's cache fairness system to ensure that, for instance, the onode cache and the block cache, and any additional caches we choose to create, are able to share the same pool of memory fairly.
D
Did anyone else have radically different strategies they wanted to talk about? It's worth not rat-holing on... or, rather, it's important not to get stuck on a single design too early in the process, and we are not late in the process; we can still choose a different design, though it would be good to voice it now. I will point out that none of this actually cares about B-trees.
D
And, like I said, the goal is for these components (the GC, the journal, the allocator, the omap) all to be pluggable, so we can create more than one implementation of the way the omap works, as long as it respects some basic assumptions about how it interacts with the allocator. But I don't think those assumptions are particularly onerous, so I think it should be fine.
D
They both actually do have to do range searches or range scans, as it turns out. The difference is that we know a lot about the key for the onode tree. For instance, we know the first 48... no, 64 bits will be the pool id and the hash, so if we wanted to, the top elements of the tree could be really heavily optimized for the fact that they're numeric keys.
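A sketch of exploiting that key structure near the top of the tree; the field widths and packing here are assumptions, not the real hobject_t layout:

    #include <cstdint>

    // Fixed-width numeric prefix of an onode-tree key: interior nodes
    // near the root can compare plain integers and never touch the
    // variable-length parts (namespace, object name, ...) that follow.
    struct onode_key_prefix_t {
      uint32_t pool_id;   // hypothetical packing into 64 bits total
      uint32_t hash;
    };

    inline bool operator<(const onode_key_prefix_t& a,
                          const onode_key_prefix_t& b) {
      if (a.pool_id != b.pool_id) return a.pool_id < b.pool_id;
      return a.hash < b.hash;
    }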
D
I suspect that will be a worthwhile optimization, because RGW bucket and item names are often hierarchical, and so they'll often share, like, path components and such that we don't necessarily want to keep copying around. But those are all details of how we implement the tree, and it is correct that it has a large bearing on our write amplification and efficiency, and also, I want to point out, perhaps an even larger bearing on our CPU efficiency as we move these things on and off of disk and construct deltas for them. So that'll be another concern.
D
I'll also point out (I think I skipped this part earlier) that deltas do not have to be byte-range modifications to blocks. The idea here is for every block to be typed, so when it gets read off disk the cache will know what type it is. So when we're going to apply a delta, it's not just applying a delta to a byte range.
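So a delta can be a logical, type-aware mutation rather than a byte patch; a dispatch sketch with invented types:

    #include <cstdint>
    #include <vector>

    enum class block_type_t : uint8_t { ONODE_NODE, ALLOC_NODE, DATA };

    struct block_t {
      block_type_t type;
      std::vector<uint8_t> bytes;
    };

    struct delta_t {
      std::vector<uint8_t> payload;  // encoded, type-specific mutation
    };

    // Apply a delta according to the block's type, e.g. "insert key K
    // into this B-tree node" rather than "overwrite bytes [a, b)".
    template <typename OnodeOps, typename AllocOps>
    void apply_delta(block_t& b, const delta_t& d,
                     OnodeOps& onode_ops, AllocOps& alloc_ops) {
      switch (b.type) {
      case block_type_t::ONODE_NODE: onode_ops.apply(b, d); break;
      case block_type_t::ALLOC_NODE: alloc_ops.apply(b, d); break;
      case block_type_t::DATA: /* data blocks are written whole */ break;
      }
    }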
D
Okay, so I'm going to start sort of working on this for the rest of the month, and I'll hopefully have something to report next week, although probably not a ton, I expect.
D
I'm going to start by writing this part, the stream layout: there's a stream manager that needs to actually do writes, so I'm going to write that part. Then I'm going to start working on the cache, the memory component, because that's what interprets blocks off of disk. By then I should have had to make decisions about enough things that this document will be wrong, so I'll fix it, and then we'll move from there.