From YouTube: 2018-June-26 :: Ceph Code Walkthrough: BlueStore part 2
Description
Second part of Sage Weil's presentation about BlueStore code and internals.
Ceph Code Walkthrough BlueStore (part 1):
https://www.youtube.com/watch?v=f0H-XhcZGP0
All right, let's start in the header file; a good place to start is just walking through some of this, around where the perf counters are. So: class BlueStore — obviously, that's the ObjectStore implementation. It's a config observer and handles config changes; there are a bunch of helpers that are called on startup and whenever a config option changes. They take the config settings for checksums and compression and so on, and then initialize various in-memory structures based on that.
Compression, for example: it gets a handle for a compressor, and it figures out the blob sizes based on the configured mins and maxes, and other assorted stuff.
TransContext is the declaration of a class that tracks a write — a transaction that's being assembled and then making its way to disk. I'll get to that shortly; it's sort of the key piece of the write path.
Buffers exist in an LRU, and there are also lists by state, so we can identify all the writing buffers easily. There's a bunch of helpers for things like truncating — getting data off the end — and maybe reallocating the buffer list, so it's one buffer. Nothing terribly crazy. BufferSpace is sort of the next level up in the cache. This is for a single object: it's a mapping of object offsets to buffers. In the Linux kernel, the analogous mapping would be the address_space, I guess.
So it has a map of buffers here, which is an intrusive red-black tree — actually no, it's just a regular std::map of offsets to Buffer — and there are some helpers for adding buffers and removing them.
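The offset-to-buffer map just described can be sketched roughly like this. These are hypothetical names, not BlueStore's actual classes — just a minimal illustration of an ordered map keyed by offset, with the kind of lookup that finds the buffer covering a given byte:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of the BufferSpace idea: a per-object map from
// object offset to a cached buffer, a bit like a tiny address_space.
struct Buffer {
  uint64_t offset = 0;
  std::string data;  // stand-in for a real bufferlist
  uint64_t end() const { return offset + data.size(); }
};

struct BufferSpace {
  std::map<uint64_t, Buffer> buffer_map;  // keyed by offset, as in a std::map

  void add(uint64_t off, std::string data) {
    buffer_map[off] = Buffer{off, std::move(data)};
  }

  // Find the buffer covering a byte, if any: step back to the last
  // buffer starting at or before `off` and check that it reaches `off`.
  const Buffer* lookup(uint64_t off) const {
    auto it = buffer_map.upper_bound(off);
    if (it == buffer_map.begin())
      return nullptr;
    --it;
    return (off < it->second.end()) ? &it->second : nullptr;
  }
};
```

A write at a new offset simply inserts a new Buffer; the real code additionally handles overlaps, clipping, and the cache hooks discussed below.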
There are helpers to clip at a particular position and so on, and for doing discards, writes, reads. These implementations are a little bit tricky because there's this indirection through the cache that contains the BufferSpace, which I'll get to in a minute. The other sort of interesting thing going on here is the write path: a write into a BufferSpace creates a new Buffer with the data at a particular offset.
Yep — the difference is that it's actually not the entire cache, it's a shard of the cache. The way it's structured, BlueStore has some number of cache shards matching the OSD's work queue shards — five, or eight, I can't remember. What it is by default depends on whether it's a hard disk or an SSD; BlueStore sets the same number of cache shards, and the idea is that a given shard tends to be handled on the same CPU — it's always the same work queue.
There's generally no contention on the locks for the cache at all, except when it's doing things like checking stats for trimming: there's a background thread that periodically wakes up and trims stuff, and that goes through and takes the lock on each shard in turn. Right — so Cache is an abstract base container. It does a few things.
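The sharding idea can be sketched roughly like this. Names and structure are hypothetical, not the actual BlueStore classes — the point is just that each shard carries its own lock, and an object's hash always selects the same shard, so unrelated objects rarely contend:

```cpp
#include <array>
#include <cstdint>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch: a cache split into N shards, each with its own
// lock and LRU, so mutations on different shards never contend.
struct CacheShard {
  std::mutex lock;             // taken for every mutation on this shard
  std::list<std::string> lru;  // most-recently-used at the front
  std::unordered_map<std::string, std::list<std::string>::iterator> pos;

  void touch(const std::string& key) {
    std::lock_guard<std::mutex> g(lock);
    auto it = pos.find(key);
    if (it != pos.end())
      lru.erase(it->second);
    lru.push_front(key);
    pos[key] = lru.begin();
  }

  size_t size() {
    std::lock_guard<std::mutex> g(lock);
    return lru.size();
  }
};

template <size_t N>
struct ShardedCache {
  std::array<CacheShard, N> shards;

  // The same hash always lands on the same shard, mirroring how an
  // object's cached state stays on one shard (and one work queue/CPU).
  CacheShard& shard_for(uint64_t hash) { return shards[hash % N]; }
};
```

A background trim thread would then iterate the shards, taking each shard's lock in turn, as described above.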
It's got a lock and some counters. I think the counters are atomics just so that we can read them without taking the lock; you're generally always holding the lock when you're modifying the cache, so the atomicity isn't needed for the writes. You'll also notice that all these methods are virtual.
Basically there are two types of data structures that the cache manages — onodes and buffers — and they're managed mostly independently, except when it comes to the space accounting, which is a little bit wonky. You'll see basically parallel sets of methods for add, remove, and touch for those two different types of structures. And there's a trim, where you pass in the maxes and it'll trim down to them.
Yes, it's implemented the same, but it's a bit more complicated in terms of how things get faulted in and whatever. All right, so that's the cache. So that's why this class over here, BufferSpace — these implementations are weird because they have various hooks into the cache to add the buffer. All these audit calls on it: that's basically a debug method that just traverses the whole structure to make sure everything is in sync, but it compiles out in normal builds; I think you have to define something to turn it on.
Right, so those are buffers. First, you'll remember from last time the on-disk data structure: you have onodes, which represent objects, and the data portions of onodes are mapped to blobs. A blob is basically just a hunk of data that's stored — not necessarily in one extent, but it's sort of treated the same, and the metadata is managed together: it's got all the checksums, how it's compressed, the metadata about that data on disk.
Because we can clone objects, sometimes blobs are referenced by multiple objects, and when that happens there's a shared blob that's actually stored on disk that keeps track of the reference counts on those extents. In memory that's also true, but it's slightly different, because the in-memory cache is always associated with the SharedBlob structure. So any blob that's ever instantiated in memory has a SharedBlob — a second allocation alongside the blob — and the cache, the BufferSpace, hangs off of it.
The buffer caching is attached to that. So you'll see here there's a SharedBlob. It belongs to a collection — I think just via a cache pointer on here... I guess not; the collection is mapped to a cache shard. It has some flags to indicate whether the state for that shared blob is loaded into memory or not, and whether it actually has any persistent state. Most of the time objects aren't cloned, and so these are both false, because there isn't actually an instantiation of that shared blob on disk — it's not shared at all.
It's just the blob, and the SharedBlob is basically just a container for the BufferSpace. But when it is shared, nothing changes: you still have the buffer memory associated with that shared structure. Hopefully that makes sense — it's a little convoluted. So they have a shared blob ID, a unique identifier for the shared blob, which is only defined if it's loaded or if it's persistent; if it's not persistent, then I think this doesn't even return anything — it would be undefined. And we track references on them.
So there are some helpers here, but mostly you can just think of SharedBlob as a container for the in-memory buffers, plus a pointer to the on-disk state when that exists. The SharedBlobSet is a slightly more tricky mapping that lets you actually find these. Most of the time, when you load a blob into memory, you just look at the onode: it has a flag that indicates whether the blob is shared — it's just part of the blob's flags field.
So if it's not shared, then it just creates a SharedBlob and uses the BufferSpace attached to it, and that's it. But if that flag is set on the blob, then the shared blob actually has a persistent counterpart, and when it allocates the SharedBlob it also registers it with the SharedBlobSet, because it'll have an ID associated with it.
(This is what happens when you don't prepare.) Anyway, if it is actually shared, then it'll register with this SharedBlobSet, which is basically just a hash table from shared blob ID to SharedBlob pointer. And what happens when you clone: as soon as the blob becomes shared — when you clone it — it becomes immutable, and so we basically just copy the blob metadata to the target object. So there are two copies of the blob, with all the extents and the flags and everything, in both objects, and then we aren't allowed to change them.
This is slightly tricky, because you have cases where the reference counts for the shared blob go away and you have to deregister it from the SharedBlobSet — so you can only look one up if its reference count is still nonzero. This isn't done easily; you have to take a lock to remove it from the set.
I think we talked about this before — the extents and blobs, I believe. All right, at least peeling this apart: extents are basically maps. You have an onode; you have a collection, which has a map of onodes; each onode has an extent map, which maps logical regions to blobs; and each blob may point to a SharedBlob. Possibly multiple objects will point at those blobs — the clones.
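The hierarchy just listed can be sketched as plain data structures. These are hypothetical stand-ins for the real types, shown only to make the ownership chain concrete: Collection → Onode → extent map → Blob → (optional) SharedBlob:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical sketch of the containment hierarchy described above.
struct SharedBlob {
  int ref_count = 0;  // only matters when clones share the bytes
};

struct Blob {
  bool shared = false;                      // the "shared" bit in blob flags
  std::shared_ptr<SharedBlob> shared_blob;  // refcount state, when shared
};

struct Extent {
  uint64_t length = 0;
  std::shared_ptr<Blob> blob;  // the hunk of on-disk data for this range
};

struct Onode {
  std::map<uint64_t, Extent> extent_map;  // logical offset -> blob
};

struct Collection {
  std::map<std::string, std::shared_ptr<Onode>> onode_map;  // per-collection lookup
};
```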
Those blobs — the SharedBlob — will point to buffers owned by the cache shard. Similarly, all the onodes will also be owned by a particular cache shard. Most of the time, all this stuff lines up with the shards and the blobs: when you clone an object, you're always cloning to an object that has the same hash ID, so it always lives in the same shard, in the same collection. So these things essentially never move around in the namespace.
When onodes do get moved, they go from one cache shard to the other cache shard, and they also move to the destination collection's onode map, because the onode lookup hash tables are per collection — so they're localized and the hash tables stay smaller — and then it also has to move the shared blobs, buffers, and all this stuff. So there's all this moving of stuff here to keep that straight. That's, I think, most of it. So, just a quick example: if you do get_onode... actually, let's do fault_range.
All right, so it'll load up a bunch of the code... it decodes a shard that comes out of the key-value database — a shard of an onode's extent map — and in here we have a whole bunch of blobs that are encoded. It's a really annoying and clever encoding that tries to make them compact and also reasonably fast to decode.
I don't know, sorry — yeah, this is a different blob ID. But basically, if we get down here, if it's a shared blob, then we have to open the shared blob, which is another helper function. In the general case — the common case — the blob wasn't shared: it just creates a SharedBlob and it's done. But if it is shared, then we first try to look it up; if we already had it, great; if we didn't, then we create it and we register it. That's the basic thing.
A
The
interesting
thing
here
is
that
when
you
load
the
blobs,
you
don't
actually
load
the
shared
blob,
necessarily
because
it's
actually
a
different
key
value
and
the
only
thing
that
it
stores
is
the
ref
counts
on
the
extents.
So
if
you're
just
reading
an
object,
you
don't
care
what
other
clones
are
also
referencing,
those
same
bytes.
The
blob
itself
has
enough
metadata
to
to
read
the
extents
and
we
have
the
Czech
stones
in
both
objects,
and
so
it
doesn't
matter
the
only
time
you
actually
have
to
load.
A
The
shared
blob
is
if
you're
modifying
the
object,
and
you
have
to
change
the
reference
count,
which
is
nice-
that
only
happens
in
the
right
path,
but
soon
you
do
have
to
load
it.
Then
you
go
and
you
actually
get
the
key
out
of
the
database.
You
set
loaded
to
true
you
load.
This
persistent
allocate
this
persistent
shared
blob
T
just
for
the
on
desk
structure
in
memory
I'm
in
decoded
make
sure
blob
is
used
in
the
case
for
your
cloning,
an
object
so
initially
the
blob
isn't
shared.
Obviously,
but
you
clone
it.
You have to make all the blobs that aren't shared yet shared, and so it marks them dirty, sets the shared flag, allocates the persistent piece, and initializes the reference counts for the two objects that will initially be referencing it — or maybe just the first one, and then the actual clone code does the rest. There is also some code in BlueStore to try to unshare blobs, because with erasure-coded overwrite workload patterns we make a clone of the object before we do...
A
An
erasure
could
overwrite.
So
we
can
roll
back
and
so
there's
sort
of
this
blue
and
blue
sort
of
try
to
like
clean
that
up,
so
that
that
we
don't
end
up
littering
the
entire
every
object
with
all
these
shared
blobs.
And
so
that's
what
the
make
lob
unshared
is
it's
a
set
of
heuristics
that
work
most
of
the
time
unless
you
have
sort
of
a
weird
workload
padding.
So
it's
good
enough,
but
this
is
just
the
opposite:
write
it
under
registers.
All right, so let's move on. We looked a little bit at read before — should we look at that again, or should we look at write? Write is where all the complexity is, so let's just get right to it. Everything in the ObjectStore layer goes through queue_transactions.
This bit is for debugging, where we sort of pretend that we're throwing away IOs. You get a pointer to the collection and to the sequencer — these map one-to-one, but they're different in-memory structures, because the sequencer survives across collections of the same name. But the main thing here is that we create a TransContext. This is the data structure that tracks the state associated with an in-flight transaction.
A
It's
a
child
of
a
io
context
because
it
initiates
AOS
and
it
has
a
completion
callback.
It's
like
it's
called
on
it
by
the
block
there.
It's
got
a
whole
bunch
of
states
and
that
it
works
its
way
through
in
the
process
of
actually
doing
their
right
and
the
prepare
is
what
happens.
Initially,
we
might
be
waiting
for
the
initial
rights
once
those
are
all
done.
A
So
the
sequencer
has
a
an
ordered,
intrusive
list
of
the
transactions
that
are
in
flight,
and
so
this
is
our
handle,
our
whatever
node
in
that
list,
each
transaction
is
a
cost
associated
with
it.
I'm
related
to
the
number
of
things
we're
doing
and
how
many
bytes
are
writing
these
fields
track
and
which
own
ODEs
are
being
modified,
which
objects
are
being
modified,
shared
blobs
are
being
modified
or
being
written,
completions
to
finish
collections
that
got
deleted
that
need
to
be
mopped
up
at
the
end.
...which blocks on disk were allocated or released as part of this transaction, and the delta for our stats. BlueStore transactionally maintains, with each update, a count of the number of bytes of different types — how many bytes are compressed, how many uncompressed, how many allocated and unallocated, yada yada — and those roll up and eventually populate the statfs output. So this is whatever delta is incurred by this transaction.
Yeah, that's mostly what the TransContext holds. Okay, so when you do queue_transactions, we create the TransContext — we can look here at _txc_create — and basically all it does is allocate it and get a handle to the current RocksDB (or key-value DB) transaction that is being assembled. So if you have a whole bunch of transactions, they'll all be piling their work onto the same KV transaction...
...on top of it, and we'll eventually commit the batch. Then this basically adds it to the sequencer's list of transactions that are in flight, and then we call add_transaction. This takes the transaction context that the OSD passed down and does all the work of figuring out what KV bits are happening, what IO, whatever — this is where everything actually happens; I'll get to that in a second. It also calculates the cost, which is based on the number of bytes we've written.
A
We
take
any
o
nodes
that
were
dirty
and
we
write
them
by
basically
putting
them
in
the
key
value
transaction,
so
they're
ready
to
go
and
if
there's
any
deferred
work
deferred,
I/o,
we
encode
those
keys,
also
and
then
finalize
sort
of-
let's
go
through
these
inner
of
this,
so
the
main
one
is
GXE
a
transition.
That's
where
everything
tiding
happens,.
Usually it's just one transaction, and then we iterate over the operations. They're sort of grouped here: things that operate on collections — which may or may not exist in memory — happen right here, where we have a reference to the collection; things like remove-collection and create-collection. All the collection stuff is piled in here. Then, grouped down below, there's a bunch of ops that implicitly create objects.
A
We
set
this
to
true
and
then
we
actually
look
up
the
O
node
and
if
it
doesn't
exist,
but
this
is
an
OP
that
creates
it
then
we'll
actually
create
otherwise,
we'll
return.
You
know
it
and
actually
usually
error
up
at
this
level.
We
basically
are
never
supposed
to
add
an
arrow
st
is
supposed
to
prepare
well-formed
transactions
into
the
store,
so
this
yeah,
so
ghetto
node,
will
either
look
up
an
O,
node
or
if
the
second
argument
is
true,
then
it'll
also
create
a
new
O
node
or
a
new
object.
A
Only
where
these
operations
will
always
get
a
node
back
just
might
be
empty
for
other
ones.
That
might
be
there
might
be
an
old
reference.
So
these
things
like
touch
and
rate
all,
are
getting
a
nice
clean
pointer
to
the
collection
and
to
the
object
and
the
arguments
that
are
actually
gonna
be
the
operation.
A
So
the
the
interesting
one
is:
let's
look
at
touch
the
safar,
the
simplest
one,
just
to
see
what
it
does,
and
here
we
already
got
an
O
node
right
because
ghetto
no
de-allocated
one,
because
the
great
flag
was
true.
So
all
we
have
to
do
is
make
sure
that
there's
a
unique
identifier
assigned
to
that
object
and
NID
sorta
like
an
inode
number,
and
then
we
mark
in
the
transaction
that
we
should
write
the
node
because
it's
been
modified
and
that's
it.
Well, basically it figures out whether it needs to reshard the onode, and then, assuming it's past all that, it will encode the onode into an in-memory buffer — which may live inline.
A
No,
let's
see
extend
it
anyway,
yeah
down
here,
it'll
encode,
the
Oh
node
until
its
key,
possibly
also
the
extents,
the
extent
Maps
records.
These
are
also
changed
and
then
it
will
set
that
key
in
the
transaction.
That's
gonna
be
passed
or
SUV,
but
have
some
stats
about
what
how
do
know
is
being
written?
A
Sometimes
we
modify
objects.
But--
didn't
affect
the
note.
I
can
remember
why
that
happens.
We
have
to
keep
track
of
those
two.
If
we
modify
the
shirt
blobs,
then
we
similarly
we
need
to
go
and
get
the
key.
If
we
deleted
a
shirt
bob,
we
have
to
delete
the
key.
If
we
modify
that,
then
we
have
to
encode
it
and
to
buffer
and
write
that
into
the
transaction,
and
so
that's
basically
it
right.
Now.
It's
either
update
co,
node
and/or
the
share
bobs
that
are
affected
by
that
by
that
transaction.
I guess those are just the XORs — okay, so that's finalize; it should probably be named something like finalize-allocation-changes or whatever. There's some throttling here that slows things down. This will probably go away once we're throttling at a higher level in the OSD with the new queueing stuff, but until then this is where the throttling happens — so queue_transactions can actually block sometimes.
A
Finally,
we
get
down
to
this
thing
where
it's
unprocessed,
this
exe
state
proc
so
remember
the
transit
context
had
those
like
twelve
states
and
it's
basically
a
simple
state
machine
machine
that
goes
think
always
an
order.
Sometimes
it
comes
back,
but
this
just
txt
state
proc
basically
works
its
way
through
those
states.
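The state machine can be sketched like this. The state names are paraphrased from the talk and the transitions are simplified (the real code has more states, e.g. for the deferred-IO branch) — this is an illustration, not the actual implementation:

```cpp
// Hypothetical sketch of the TransContext state machine: each write works
// its way through a fixed sequence of states.
enum class TxcState {
  PREPARE,
  AIO_WAIT,
  IO_DONE,
  KV_QUEUED,
  KV_SUBMITTED,
  KV_DONE,
  FINISHING,
  DONE,
};

struct TransContext {
  TxcState state = TxcState::PREPARE;
  bool has_aio = false;  // did prepare queue any async IOs?
};

// One step of a simplified state_proc: skip AIO_WAIT entirely when there
// is no pending aio; otherwise the IO completion advances us later.
TxcState advance(TransContext& txc) {
  switch (txc.state) {
  case TxcState::PREPARE:
    txc.state = txc.has_aio ? TxcState::AIO_WAIT : TxcState::KV_QUEUED;
    break;
  case TxcState::AIO_WAIT:
    txc.state = TxcState::IO_DONE;  // set from the IO completion callback
    break;
  case TxcState::IO_DONE:
    txc.state = TxcState::KV_QUEUED;
    break;
  case TxcState::KV_QUEUED:
    txc.state = TxcState::KV_SUBMITTED;
    break;
  case TxcState::KV_SUBMITTED:
    txc.state = TxcState::KV_DONE;
    break;
  case TxcState::KV_DONE:
    txc.state = TxcState::FINISHING;
    break;
  default:
    txc.state = TxcState::DONE;
    break;
  }
  return txc.state;
}
```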
It starts in the PREPARE state — _txc_create starts everything out in PREPARE. If there are no pending aios, we go all the way through; if it does have aios, we go to AIO_WAIT, saying we're waiting for those aios, and we call this function here that actually queues the aios to the kernel device.
When those complete, _txc_finish_io basically says: okay, we're done with the IO; it switches to the IO_DONE state and updates the accounting. But the trick here is: say you had two transactions, A and B, and they both started some aios — A came first, B came second — but say B's IOs finished before A's do. B will be in the IO_DONE state, but the transaction ahead of it is still in the AIO_WAIT state, and so we can't actually do anything when B's IOs finish.
A
We
solved
away
for
a
to
finish
first,
so
this
update
updates
us
and
then
we
basically
start
at
the
front
of
the
list,
and
we
look
for
things
that
are
have
finished
I/o
and
then,
if
they
are,
then,
if
a
is
done
and
it
looks
at
as
many
transactions
as
are
done-
and
it
finishes
them
all
so,
basically
loop
so
for
starting
at
the
front,
all
the
transactions
or
an
iodine
state.
We
process
them
in
order,
and
this
is
the
thing
that
ensures
that
even
though
I
always
happen.
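The in-order completion rule just described can be sketched as follows. This is a hypothetical model, not the real intrusive-list code: transactions sit in submission order, and when an IO completes we only commit the prefix of the queue whose IO is done, so B can never pass A:

```cpp
#include <cstddef>
#include <deque>

// Hypothetical sketch: commit transactions strictly in submission order,
// even when their IOs complete out of order.
struct Txc {
  int id = 0;
  bool io_done = false;
};

struct Sequencer {
  std::deque<Txc> q;  // front = oldest in-flight transaction

  // Called when some transaction's IO completes; returns how many
  // transactions were committed (popped in order from the front).
  size_t on_io_finished(int id) {
    for (auto& t : q)
      if (t.id == id)
        t.io_done = true;
    size_t committed = 0;
    while (!q.empty() && q.front().io_done) {
      q.pop_front();  // commit in submission order
      ++committed;
    }
    return committed;
  }
};
```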
...we still commit them in order at the end. So that means we're calling back into state_proc with IO_DONE down here. We already have the queue locked here — it's a little bit wonky, but basically it's a lock that protects this list of transactions. Again, these locks are almost never contended, because we're always doing this in the same thread; it's pretty rare that you're actually blocking on the mutex, but they're needed for a few unusual cases.
Normally this condition is false; normally we go into the KV_QUEUED state, and we basically just get put on a list — kv_queue — of all the transactions that are ready to be committed, and then we wake up the kv sync thread to go commit them. There are a bunch of optimizations here that we've played around with: basically, for certain cases the transaction can be committed synchronously, so you might have, you know, eight threads in the OSD that are processing these transactions, and they could all be queueing directly to RocksDB from those worker threads.
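The kv_queue hand-off can be sketched like this. It's a hypothetical model (no real threads or RocksDB here): workers enqueue transactions that are ready to commit, and the single sync thread swaps the whole queue out under the lock and commits it as one batch:

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical sketch of the kv_queue hand-off between the OSD worker
// threads and the single kv sync thread.
struct KvQueue {
  std::mutex lock;
  std::deque<int> kv_queue;  // txc ids ready to be committed

  // Called by worker threads once a transaction's IOs are done.
  void enqueue(int txc) {
    std::lock_guard<std::mutex> g(lock);
    kv_queue.push_back(txc);
  }

  // One iteration of the sync thread: grab everything queued so far,
  // then (in the real code) flush the device and commit the batch.
  std::deque<int> take_batch() {
    std::lock_guard<std::mutex> g(lock);
    return std::exchange(kv_queue, {});
  }
};
```

Batching like this is why the average queued latency is roughly half the sync-thread's commit interval, as discussed further down.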
A
When
that's
the
case,
then
you
can
get
some
of
better
performance,
but
they're
a
bunch
of
commission.
You
can
only
do
that
when
there
are
no
ordering
issues
with
the
iOS
and
a
whole
bunch
of
complicated,
complicated
conditions,
though
there's
a
subset
here,
you'll
notice.
If
we
just
want
to
do
it
synchronously,
then
it
can't
have
allocated
a
new
ID,
because
we
have
to
update
a
key
that
has
the
max
it
has
to
be.
Oh.
A
The
previous,
if
the
previous
one
is
already
going
through
the
key
value
thread,
then
we
have
to
continue
going
through
the
key
value
thread,
or
else
we'll
sort
of
jump
ahead
of
an
order.
If
there's
unstable,
I/o
with
other
transactions,
then
we
can't
jump
ahead
whatever.
But
if
we
get
lucky,
then
we
can.
We
can
do
it
directly.
A
Now
we
jump
straight
to
submit
it
and
we
caught
applied,
but
usually
it
doesn't
happen
so
well,
so
ignore
that
so
mostly
most
the
time
you
get
put
on
the
kbq
and
then
there's
another
thread,
maybe
think
that
you'll
hear
mark
talk
about
all
the
time,
because
this
is
what
does
most
of
the
work.
Maybe
not
must
work
a
lot
of
the
work
in
blue
store.
This
is
Katie.
Most of the time it'll do a flush on the block device, to make sure all of those asynchronous IOs that were sent to disk and completed are actually committed durably to storage; it'll block waiting for that. Once that happens, we've got a handle on the transaction and we iterate over...
Submit — that's not blocking... I'm not really seeing it right now, but basically we have this whole batch that's all ready to go, and then we push the whole thing through with the one call that actually has the flag that says, for RocksDB: you should write this, flush it to disk, and wait until it's actually durable.
A
So
this
blocks
does
a
whole
bunch
of
work
in
rocks
TV
and
when
it
finally
returns,
then
that
whole
batch
is
committed,
and
here
we
basically
take
all
of
those
lists
of
transactions
that
we
had
and
we
hand
them
off
to
a
new
set
of
lists
that
are
yet
another
thread
is
gonna
mop
up
at
the
end,
this
is
one
of
the
performance
things
that
happened
over
the
last
year,
a
little
bit
better
performance
on
SSDs.
If
you
sort
of
separate
the
part,
that's
pushing
directs
to
be
an
in
sort
of
doing
that
after
effects.
A
So
this
pushes
it
all
into
the
these
new
lists
called
committing
to
finalize
and
down
here.
There's
the
finalize
thread
that
wakes
up
for
all
of
these
things
that
just
got
finalized.
It
calls
you
know,
it
asserts
that
it's
in
the
submitted
state
they
showed
just
been
submitted,
and
then
we
called
state
brock
back
over
here
and
as
action
is
still
working
its
way
through.
A
Or
is
it
submitted?
We
did
it
calls
a
little
helper
txt
committed
kv
that
basically
triggers
the
finishers
accuse
the
finish
for
work
associated
with
that
transaction
update
some
stats
and
then
we're
now
in
the
kV
done
state,
and
then
we
go
to
of
one
of
two
ways
if
there
was
deferred
IO,
which
means
that
the
original
transaction
basically
Journal
of
entry
that
says
I'm
going
to
do
some
ayah
later
and
then
it
committed
now
we
can
actually
do
that
asynchronous
IO.
If
that's
the
case,
then
we
put
it
on
the
deferred
queue.
A
Or
is
this
all
another
work
you
if
all
that
different
stuff
that
gets
patched
up
and
pushed
out?
Otherwise
we
go
straight
to
finishing
and
in
the
deferred
I'm
not
going
to
go
into
that
right
now,
but
in
the
deferred
work.
You
basically
similar
thing
where,
once
we
clean
up
blah
blah
blah,
we
get
pushed
into
finishing
state
and
it's
the
same
thing
with
finishing
we're
kind
of
like
with
the
IO.
It
might
be
out
of
order.
A
We
might
have
one
transaction
that
doesn't
have
the
furred
io
another
one
that
does
and
another
one
that
doesn't.
We
don't
actually
clean
all
these
transactions,
we
clean
them
up
in
order,
so
we
have
to
wait
for
the
for
the
slowest
one
which
might
have
to
Freud
IO.
That
has
to
finish,
but
notably
the
things
that
happen.
In
finish.
Are
we
mark
the
write
complete,
though
those
buffers
states
change
from
writing
to
clean?
This used to be done quite a bit earlier, but we pushed it back to the very, very end, because there were some really crazy race conditions that could happen where we would overwrite data that hadn't fully been dereferenced yet. Oh yeah — because the deferred IO might have been on an extent that later got deallocated, and so we want to make sure that we deallocate in order, after the deferred extents' IOs have actually completed. And that's it — again, stuff to make sure that deferred IO...
A
Doesn't
stall
zombie
sequencers
are
sort
of
one
annoying
thing
where
you
might
have
a
collection,
1.2
or
PG
1.2
it
gets
deleted
and
then
you
recruit
the
OST
recreates
the
same
collection,
but
maybe
the
one
that
was
deleted
had
a
bunch
of
deferred
I/o
that
was
sort
of,
even
though
the
transaction
that
deleted
the
connect
collection,
finished
and
completed,
and
the
OST
says
it's
all
done.
It's
actually
like
still
doing
a
bunch
of
work.
If
you
rien
Stan,
she
ate
that
same
collection.
A
We
want
to
make
sure
that
we
use
the
same
sequencer
so
that
it
the
ordering
it's
all
all
correct,
and
so
that's
what
the
zombie
ones
are
if,
if
the
collection
gets
recreated
before
all
references
of
it
have
basically
drained
out
of
the
system,
then
we'll
sort
of
resurrect
that
sequencer.
Yeah — and actually you can look at these: if you run the daemon perf command on an OSD, there's a whole bluestore group, and they've called out the ones that are the most interesting. But let me look at the actual states so I can remind myself... there's going to be aio_wait, and kv_queued — there's going to be some latency there, because the kv sync thread is one thread that's just taking a batch and committing it.
A
So
the
average
latency
is
like
whatever
half
that
half
that
time
interval
right
and
then
there's
also
some
time
and
submitted.
And
then,
if
you
look
at
the
F
at
the
top
here,
where
we
initialize
our
perfect
order.
...there's how long the initial flush is taking: before we even submit the key-value data, we have to do a hardware flush on the device to make sure everything is stable, so this measures how long that takes. That will vary depending on the type of SSD or hard disk — hopefully it's small on an SSD; on a hard disk it'll be big. And then there's the commit latency, which is basically how long it takes for the batch to actually be committed to RocksDB.
A
Which
part
that
is
I,
don't
know,
I
think
there's
one
of
them,
that
is
sort
of
the
whole
the
whole
span
and
one
of
them
and
others
are
so
narrowing
in
on
this.
Just
on
that
key
piecing
thread,
I
can't
remember
exactly
how
they
go.
Unfortunately,
look
at
the
code,
but
yeah
there
there
bunch
of
these,
oh
and
then
how
long,
how
long
waiting
average
ao8
latency.
There's not much more here — a bunch of stats for things like hit rates on the onode cache and the buffer cache, how much time we're spending compressing and decompressing or checking checksums, how successful the compression is actually being, and so on. So there's a whole bunch of stuff.
A
Ok,
so
let's
look
at
let's
go
to
look
at
you
right,
cuz
I
said
over
the
last
piece
and
it's
kind
of
nutty,
but
this
is
most
of
the
interesting
transactions
actually
write
data
and
that's
for
a
lot
of
the
complexity
of
all.
This
stuff
falls
out
because
we
have
to
wait
for
Aereo's
and
deferred
iOS
and
so
on.
So
in
do
writes
for
lucky
we're
not
actually
writing
anything,
but
usually
we
are
choose
right
up
in
options
fills
out
this
right
context.
A
So,
whether
or
not
this
right
is
going
to
be
buffered
or
that
we're
gonna
write
into
the
buffer
cache
or
not
I'm,
whether
we're
going
to
compress
it.
How
big
we
want
the
blobs
to
be
how
big
we
want
to
check
some
chunks
to
be,
and
they
can
vary
on
a
per
write
basis
because
they
vary
per
pool
I'm.
You
can
have
Poole
settings
that
specify
whether
this
pool
is
compressed
or
not.
A
They
can
vary
by
I/o
because
you
might
have
hints
so,
for
example,
rate
escapable
hint
that
most
reeds
are
large
and
sequential
and
rights
also,
and
so
will
make
big
checksum
chunks
and
that
will
use
less
metadata
because,
usually
you
don't
have
to
be
a
very
small
read
and
you
wouldn't
have
the
read
amplification
and
associate
with
that.
So
choose
the
options
and
we
make
sure
that
the
extent
map
has
all
the
relevant
extents
for
the
node
in
memory.
We
actually
do
the
data
right.
We
allocate
I,
do
a
bunch
of
stuff.
A
Do
write
data
this
is
sort
of
the
preparation
stage,
so
any
any
right
to
an
object.
There's
going
to
be
an
alignment
to
the
metallic
size
and
so
Bernie
right,
there's,
maybe
a
small
piece
of
the
beginning.
That's
a
partial
allocation
unit,
there's
a
part
in
the
middle,
that's
a
multiple
of
allocation
units
and
aligned
to
those
and
then
there's,
maybe
a
little
bit
at
the
end.
That's
also
another
lined.
So
that's
the
head,
the
middle
on
the
tail,
and
so
we
have
two
helper
functions
to
write,
small
and
do
write
big.
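The head/middle/tail split can be sketched as a small piece of arithmetic. This is a hypothetical helper, not the real code — given a write at [offset, offset+length) and a minimum allocation size, it returns the lengths of the unaligned head, the aligned middle, and the unaligned tail:

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical sketch of the head/middle/tail split by min_alloc_size:
// head and tail go down the small-write path, the middle (whole aligned
// units) goes down the big-write path.
std::tuple<uint64_t, uint64_t, uint64_t>
split_write(uint64_t offset, uint64_t length, uint64_t min_alloc_size) {
  uint64_t end = offset + length;
  // First aligned boundary at or after the start of the write.
  uint64_t head_end =
      (offset + min_alloc_size - 1) / min_alloc_size * min_alloc_size;
  if (head_end > end)
    head_end = end;  // tiny write entirely inside one allocation unit
  // Last aligned boundary at or before the end of the write.
  uint64_t mid_end = end / min_alloc_size * min_alloc_size;
  if (mid_end < head_end)
    mid_end = head_end;  // no fully-aligned middle at all
  return {head_end - offset,   // head: partial unit at the start
          mid_end - head_end,  // middle: whole aligned units
          end - mid_end};      // tail: partial unit at the end
}
```

For example, a 10000-byte write at offset 1000 with a 4096-byte unit splits into a 3096-byte head, a 4096-byte middle, and a 2808-byte tail.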
A
We
do
small,
writes
on
the
little
end
bits
and
then
do
write
big
on
the
big
part
in
the
middle.
Do
write
big
is
nice
and
simple,
because
it's
we're
writing
full
allocation
units,
and
so
we
can
just
write
to
a
completely
new
part
of
the
device
I'm
doing
copy-on-write
or
whatever
you
want
to
call
it.
I'll
just
write
to
a
new
newly
allocated
region.
A
So
we
throw
out
the
old
extents
that
reference
that
part
of
the
the
object
we're
gonna
allocate
a
new
set
of
blobs
we're
gonna
loop
over
the
length
figure
out
how
big
this
blob
is
going
to
be
based
on
the
next
blob
size
or,
however,
big
of
a
right
we're
doing.
If
we're
not
compressing,
then
well,
there
complicated
bits
here,
because
we
try
to
reuse
old
blobs
because
it
saves
a
lot
of
CPU.
A
So
the
code
is
kind
of
complicated
here,
I,
don't
understand
it
super
well,
because
you
can
read
it,
but
the
short
version
is
that
if
it's
a,
if
it's
a
new
extent
for
them
for
those
big
writes,
we
just
write
a
new
space
and
we'll
create
new
blobs
in
sort
of
the
general
cases,
some
of
you're,
not
reusing,
and
that,
where
that
actually
happens,
is
we
create
a
new
blob?
We
said
you
bought
the
true
and
we
call
right
context
right,
Oh
back
in
the
header
you'll.
It looks at the existing extents and tries to reuse them, and if it can't, then it allocates a new blob that layers over them. So if there's an existing blob for the region of the object that we're considering, and it's immutable — not mutable — then we can't do anything with it, blah blah. But sometimes it is a mutable blob: say the minimum...
A
Minimum
allocation
unit
is
64
K
and
we
were
only
wrote
four
K
into
it
on
the
last
right,
and
so
the
other
60
K
is
unused.
Then
you
small
right
we'll
come
back
and
I'll
say:
oh,
this
blob
is
still
mutable
and
it
hasn't
been
used
in
this
other
reason
that
I
want
to
write
it
into
and
so
I'm
going
to
write
into
the
an
existing
blob,
an
unused
portion
of
an
existing
Bob,
and
so
that's
what
one
of
these
cases
handles
right
here:
direct
right
into
unused
blob,
so
the
existing
mutable
blob.
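The "unused portion" bookkeeping described above can be sketched with a toy bitmap. BlueStore keeps similar state as an unused bitmap on the blob; this is an illustrative stand-in, not the real structure, and `MiniBlob` and its members are invented names:

```cpp
#include <cstdint>

// Toy blob tracking which 4K chunks of a 64K blob have never been
// written: one bit per chunk, set = still unused.
struct MiniBlob {
  uint16_t unused = 0xffff;  // bit i set => chunk i never written
  static constexpr uint64_t chunk = 4096;

  // Is the whole range [off, off+len) still unwritten?
  bool is_unused(uint64_t off, uint64_t len) const {
    for (uint64_t o = off; o < off + len; o += chunk)
      if (!(unused & (1u << (o / chunk)))) return false;
    return true;
  }
  // Record that [off, off+len) has now been written.
  void mark_used(uint64_t off, uint64_t len) {
    for (uint64_t o = off; o < off + len; o += chunk)
      unused &= ~(1u << (o / chunk));
  }
};
```

A small write first asks `is_unused`; if the target range was never written, it can go straight into the existing blob, otherwise it falls into the overwrite paths described next.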
At the same time, we get the blob as non-const, so we can then call something that actually modifies it, like updating the checksum. Then we set the extent-map extent that maps to that updated blob, mark that region used, and so on. So the next time this happens, it'll be an overwrite, and we can't do this write into an unused region again.
Sometimes we're writing into a blob but we need to read something in first, in order to do a read-modify-write. When that happens, we figure out what we have to read, and somewhere in here we actually call _do_read. A write here might actually have to read in the old blob so that we can overwrite one byte in the middle of it and queue the whole thing out to be written again.
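The read-modify-write step just described can be sketched in a few lines. This is a hedged illustration of the idea, not BlueStore's code; `read_modify_write` is a hypothetical helper:

```cpp
#include <cstdint>
#include <vector>
#include <algorithm>

// To change a few bytes inside a checksummed, whole-block region we must
// read the old contents, splice in the new bytes, and queue the whole
// block to be written again.
std::vector<uint8_t> read_modify_write(
    const std::vector<uint8_t>& old_block,   // block read back from disk
    uint64_t off_in_block,                   // where the new bytes land
    const std::vector<uint8_t>& new_bytes) {
  std::vector<uint8_t> out = old_block;      // start from the old data
  std::copy(new_bytes.begin(), new_bytes.end(),
            out.begin() + off_in_block);     // splice in the overwrite
  return out;                                // caller queues this for write
}
```

This is why such writes can block: the read has to complete before the new block can even be assembled.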
So sometimes this will block, which is annoying, and again there's all sorts of complicated padding and so on. This was painful code to write and debug, but it's quite stable now. Again, we try to reuse blobs; this is the thing that Igor did that saves CPU, but it's an optimization. If all of that fails, then fine, we'll just allocate a new blob.
A
So
say
they
say
this.
This
region
is
a
shared
blob
that
was
cloned,
and
so
we
can't
modify
it
and
we
can't
do
anything
tricky
to
fill
it
in.
We
just
have
to
allocate
another
blob,
that's
sort
of
logically
layered
over
it
in
the
and
the
extent
dancing.
So
we
allocate
a
new
blob,
we
write,
punch,
a
hole,
we
write
in
to
be
right
into
it
and
it
gets
added
into
the
extent
map
waiting
for
that
new
blob
right.
So basically, so far we've gone through and built out this write context with all the writes. The next step is _do_alloc_write, which actually allocates space and sends it to disk. What happens here is we go through all these writes and say: wait a sec, should we compress them first? If so, then we figure out what the compressor handle, the codec, is, and what the checksum settings are.
A
If
we
need
to
compress,
then
we
compress
all
the
pieces
figure
out
how
big
they
are
allocate
a
bunch
of
space
if
we're
not
compressing
that
we
already
know
how
much
space,
so
we
sort
of
skip
all
the
way
down.
Here
we
allocate
space
to
read
all
this
data,
and
then
we
actually
finally
go
to
the
writes
and
we
actually
can
write
it,
and
we
end
this.
In
the
stage
we
calculate
checksums
your
figure
out
what
the
checksum
water
is
checksum
length.
My
neutralize
check
sums.
A
A
A
We update the extent map to point to these new blobs, and we write into the buffer cache. We do this unconditionally, even if it's not a buffered write, because we have to track in-flight buffers that are on their way to disk, in case we need to find them again or read them. Then we finally queue the I/O. This one actually queues it all the way down... oh wait, this is different, that's something else, yeah.
Normally we call bdev->aio_write, which doesn't actually queue the write yet; it's just put on this IO context, which describes all the I/Os that are going to happen. And then finally _do_write is going to call _wctx_finish, and this is going to take all the old extents, update the stats for them, deallocate them maybe, and handle the shared-blob case, for the EC optimization, at the end of the day.
Yeah, _wctx_finish; it also updates the onode size. This garbage collection stuff is basically to make sure you don't have too many layers of blobs stacked on top of each other, giving you a really complicated map. It collapses them down once it gets inefficient, but it's also complicated, so we won't go into it right now.
What it's really saying is that if you have two extents, say you did a write where you wrote 4K and it ended up in this blob, and then you wrote the next 4K and it ended up in the same blob because that space was unused, so we picked that efficient path in the small-write code, then you'd have a second extent that points to that same blob, adjacent to the first. compress_extent_map just walks the extent map, looks for contiguous adjacent extents pointing to the same thing, and turns them into one reference.
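The merging pass just described can be sketched like this. It's a simplified model of the idea behind compress_extent_map, with made-up types (an `int` blob id stands in for the blob pointer), not the actual Ceph code:

```cpp
#include <cstdint>
#include <vector>

// One logical extent: maps [logical_off, logical_off+length) of the
// object onto [blob_off, blob_off+length) of a blob.
struct Extent {
  uint64_t logical_off;
  uint64_t blob_off;
  uint64_t length;
  int blob_id;  // stand-in for the shared blob pointer
};

// Collapse adjacent extents that point at contiguous regions of the
// same blob into a single extent.
std::vector<Extent> compress_extents(const std::vector<Extent>& in) {
  std::vector<Extent> out;
  for (const Extent& e : in) {
    if (!out.empty()) {
      Extent& p = out.back();
      if (p.blob_id == e.blob_id &&
          p.logical_off + p.length == e.logical_off &&
          p.blob_off + p.length == e.blob_off) {
        p.length += e.length;  // merge into the previous extent
        continue;
      }
    }
    out.push_back(e);
  }
  return out;
}
```

So the two adjacent 4K writes from the example collapse back into one 8K reference.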
At the end of this whole process, then, we have a bunch of I/Os that are described in that IO context, we have onode data structures in memory that have been modified, and onode extent-map records that have been modified, and all the blobs that hang off them have all been modified in memory, and you'll remember those go into the transaction.
A
B
A
Yeah, yeah, exactly: compression. You can't modify a compressed blob, because it's compressed and you don't know where the data is, and you can't take it apart; it's one big lump. So if you're doing random overwrites with compression turned on, you end up with blobs that are layered over each other, and the garbage collection code tries to account for how much space is wasted.
Basically, in these layered blobs, whenever the waste crosses a threshold, it just reads them, recompresses the data, and writes it out as a new blob. It does that as part of the write path, so there's always a bound that's enforced; there's no background activity that cleans it up. It's like: once you do the third write that's layering over it, that write does the work of reading it and putting it back in an efficient form, yeah.
If we're overwriting part of an allocation unit, then instead of creating a new blob, we should overwrite the existing one. But we can't overwrite it before we commit the transaction, or else we might corrupt the old data before the new operation commits. So we have to commit the operation with a promise that we're going to overwrite it, and then do the overwrite afterwards; that's what happens in here.
So any time you do a write that's smaller than a block, it has to be a deferred write, unless it's a new object and you're not actually touching anything that was there before, because then we can just write it directly. But if you're overwriting existing data at sub-block size, then it has to be deferred. It's also... if you're overwriting a blob and you're smaller than the checksum... no, it's actually just any time you're smaller than the blob, then you have to do a deferred write.
The exception is if the blob was not completely written before. Say it's a 64K allocation unit, the first 4K is written and the rest is unwritten, and we track that state in the blob; then you can write into the unused portions. But assuming it's been written before, we can't overwrite it in place; we have to commit first and then do a deferred write. There's one other case where we do deferred writes.
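The rules above can be condensed into a small decision sketch. The function and parameter names are illustrative, not BlueStore's, and the single size threshold is a stand-in for the "prefer deferred" style tunable the talk brings up:

```cpp
#include <cstdint>

// Hedged sketch of the deferred-write decision: an overwrite that is
// smaller than the target blob must be deferred (journaled in the
// database and applied after commit), because overwriting in place
// before commit could corrupt old data. Writes at or below a tunable
// threshold may be deferred anyway because batching them wins.
bool use_deferred_write(uint64_t write_len,
                        uint64_t blob_len,
                        bool overwriting_existing_data,
                        uint64_t prefer_deferred_size) {
  if (overwriting_existing_data && write_len < blob_len)
    return true;  // can't safely overwrite in place before commit
  return write_len <= prefer_deferred_size;  // optional perf tunable
}
```

A brand-new object (no existing data touched) with the tunable at zero always takes the direct path.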
There's a tunable called something like min deferred write, I think, and the observation was that 4K, or anything below 64K, basically small writes on BlueStore ended up being slower than on FileStore. Because up to the allocation unit, we were previously always writing to new blocks on disk, which meant that for a write, the latency was: do the I/O, wait for it to complete and flush, and then commit the transaction, and then wait for that to complete and flush. For small I/Os...
...it's actually faster to just write something to the database that says: this is a transaction, and this is the data that I'm going to write, and then asynchronously go write it. Because the deferred writes get batched up, especially if they're sequential, and they get pieced together into one big I/O. So if you're doing, say, 4K sequential writes, each one will have a transaction to the database; they'll commit, and then we come back and batch up all these deferred writes, and then, once the batch is big enough...
...they'll all get flushed out sequentially. So it helped a lot for small sequential I/Os on hard disks. Okay, so sometimes we do a deferred write even if we didn't really have to, because it wins, and sometimes it's even faster on SSDs, as I recall. You can just play with the values; it doesn't seem like it should help there, but I seem to recall that in some cases it actually did.
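The batching effect described above can be sketched with a toy batch structure. This is an illustration of the coalescing idea, not the real deferred-batch code; `DeferredBatch` here is an invented simplification:

```cpp
#include <cstdint>
#include <map>
#include <vector>
#include <iterator>

// Toy deferred batch: pending writes keyed by disk offset. A write that
// is contiguous with an already-pending one is appended to it, so a run
// of 4K sequential deferred writes flushes as one large sequential I/O.
struct DeferredBatch {
  std::map<uint64_t, std::vector<uint8_t>> ios;  // disk offset -> data

  void add(uint64_t off, std::vector<uint8_t> data) {
    auto it = ios.lower_bound(off);
    if (it != ios.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second.size() == off) {
        // Contiguous with the previous pending write: coalesce.
        prev->second.insert(prev->second.end(), data.begin(), data.end());
        return;
      }
    }
    ios.emplace(off, std::move(data));  // otherwise a new pending I/O
  }
};
```

Flushing then just walks the map and issues one I/O per entry, which on a hard disk turns many tiny writes into a few big sequential ones.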
[Audience question]
Yep, but that's actually mostly a good thing, because each I/O is basically going to go into one of these RocksDB transactions. Adding an extra 4K into RocksDB is not that much additional work, because on a hard disk it's already writing a big blob to the RocksDB log. And then they'll go into the deferred batch, which I can show right here, which is basically a whole bunch of deferred I/O buffers and locations.
[Audience question]
Exactly, right: you get back the benefit we had with FileStore, where all the I/O went into the SSD journal without blocking, and then you sort of lazily flushed it all to the hard disk. This brings back that behavior, but with a better design and a lot more control, because you can still read all of it; all that deferred I/O is in memory.