Ceph Ceph Code Walkthrough, 21 Aug 2021

From YouTube: Ceph Code Walkthroughs: BlueStore SMR

A

Okay, so let me start based on how I added the code. First we have this zoned device specialization for HM-SMR drives. It's pretty much a copy-paste of the kernel device that you already have here, and this one just adds a couple of SMR-specific things.

A

For example, here we have.

A

Yeah, this function. It just makes a couple of SMR-specific calls, like report number of zones to get how many zones there are on the device and the number of conventional zones, and it sets these two variables, the zone size and the conventional region size. Every SMR disk has conventional zones at the beginning. I mean, logically they're mapped to the beginning of the drive, but physically they are at the center of the disk.

A

It changes from vendor to vendor, but, for example, Western Digital has 524 conventional zones, which means these are zones that you can write to randomly, and Seagate has like half of that.
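A minimal sketch of how the two values mentioned above could be derived from a zone report. The `ZoneType` enum and both function names are hypothetical and are not taken from the Ceph source; the only assumption carried over from the talk is that the conventional zones sit at the logical start of the drive.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical zone classification, for illustration only.
enum class ZoneType { Conventional, SequentialWriteRequired };

// Number of leading conventional zones. Because conventional zones are mapped
// to the beginning of the drive, this is also the first sequential zone number.
inline uint64_t first_sequential_zone(const std::vector<ZoneType>& zones) {
  uint64_t n = 0;
  while (n < zones.size() && zones[n] == ZoneType::Conventional) ++n;
  return n;
}

// Size in bytes of the randomly writable region at the start of the drive.
inline uint64_t conventional_region_size(const std::vector<ZoneType>& zones,
                                         uint64_t zone_size) {
  assert(zone_size > 0);
  return first_sequential_zone(zones) * zone_size;
}
```

With 524 conventional zones, as in the example from the talk, `first_sequential_zone` returns 524.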

A

So this sets those parameters, and the rest of it is pretty much the same. Another important thing about this: we no longer need the libzbd library. We need to get rid of it, because the latest kernels have ioctls that you can use to get this information, so you do not need to link with libzbd. The rest of it is pretty much the same.

B

One quick question there: should we keep using libzbd just for compatibility with older kernels for a while? Is there any rush to get rid of it?

A

Yeah, there is no rush. Probably we should keep it, because the ioctls were added very recently, probably a couple of months ago, so if there are people who will be running it on older kernels, then yes.

A

Yeah, that's about it for HM-SMR, and then most of the new SMR-related stuff is in BlueStore. Recently I had to guard all of it with `#ifdef HAVE_LIBZBD`, so I'll just go over all the parts of the code that have this ifdef and explain them. We have a zoned allocator and a zoned freelist manager. I will go into them next.

A

We have new key prefixes, and I think I have explanations for these in the zoned freelist manager. We've also added this new thread for the cleaner.

A

And when we are opening a block device, we just check: if bdev is SMR, then we set the freelist type to zoned. So we have added a new freelist type, zoned.

A

And then, when we are opening a freelist manager, we are doing this ugly hack here. If the device is SMR, then we piggyback the device parameters on top of min_alloc_size and pass it to fm->create, and then within fm->create we extract the information that we piggybacked onto min_alloc_size. Let me quickly jump into this to see what we're packing there. So, for now, to avoid interface changes:

A

We piggyback the zone size in megabytes and the first sequential zone number onto min_alloc_size and pass it to the functions Allocator::create and FreelistManager::create. We just shift these into the higher bits of min_alloc_size, which are unlikely to be used, and then we extract this information in the allocator.
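A minimal sketch of the piggybacking idea described above. The bit layout (low 32 bits for min_alloc_size, 16 bits for the zone size in MiB, 16 bits for the first sequential zone number), the `ZonedParams` struct, and the function names are assumptions made for illustration, not the layout used in the Ceph source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout (an assumption, not taken from the Ceph source):
//   bits  0..31  min_alloc_size in bytes
//   bits 32..47  zone size in MiB
//   bits 48..63  first sequential zone number
struct ZonedParams {
  uint64_t min_alloc_size;
  uint64_t zone_size_mb;
  uint64_t first_seq_zone;
};

// Pack the device parameters into the unused high bits of min_alloc_size.
inline uint64_t piggyback(const ZonedParams& p) {
  assert(p.min_alloc_size < (1ull << 32));
  assert(p.zone_size_mb < (1ull << 16));
  assert(p.first_seq_zone < (1ull << 16));
  return p.min_alloc_size | (p.zone_size_mb << 32) | (p.first_seq_zone << 48);
}

// Recover the three values on the receiving side.
inline ZonedParams unpack(uint64_t packed) {
  return ZonedParams{packed & 0xffffffffull,
                     (packed >> 32) & 0xffffull,
                     packed >> 48};
}
```

The appeal of the trick is that the `create` signatures stay unchanged; the cost is that every consumer must remember to mask the value before treating it as an allocation size.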

A

To actually use those parameters. Okay, now moving on. When we are creating the allocator, we first make sure that we're running with these config settings. First, we make sure that the allocator type is zoned. I have some, you know, kind of arbitrary restrictions here, because we were first targeting getting the simplest common case working, and that is: min_alloc_size should be at least 64K.

A

And the other thing is, we do not want deferred writes, because deferred writes would come in arbitrary order, and that would violate the sequential write requirement of the disk.

A

Those were the two, so we're checking those here, and if those settings are not right, then we just error out. We also call the same function to pass the allocator the same parameters that we passed to the freelist manager, and then here we create the allocator and pass that information in min_alloc_size.
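The three checks described above can be sketched as a small validation function. The `ZonedSettings` struct and its field names are hypothetical stand-ins and do not correspond to actual Ceph configuration option names.

```cpp
#include <cstdint>
#include <string>

// Hypothetical settings struct; field names are illustrative only.
struct ZonedSettings {
  std::string allocator_type;
  uint64_t min_alloc_size;
  bool deferred_writes_enabled;
};

// Returns an empty string when the settings are acceptable for an SMR device,
// otherwise a short reason that the caller can log before erroring out.
inline std::string check_zoned_settings(const ZonedSettings& s) {
  if (s.allocator_type != "zoned") return "allocator type must be zoned";
  if (s.min_alloc_size < 64 * 1024) return "min_alloc_size must be at least 64K";
  if (s.deferred_writes_enabled) return "deferred writes must be disabled";
  return "";
}
```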

A

Okay, so here we're initializing the allocator and the freelist manager. Originally I had extended the allocator interface to add SMR-specific calls, but then we discussed this with Igor, and he suggested that I should not pollute the interface and should instead use dynamic_cast to get the type of the allocator.
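A minimal sketch of the pattern described above: the base interface stays free of SMR-specific calls, and the caller uses `dynamic_cast` to reach them. All class and method names here are hypothetical and only illustrate the pattern.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical base interface; it knows nothing about zones.
struct Allocator {
  virtual ~Allocator() = default;
  virtual int64_t allocate(uint64_t want) = 0;
};

struct PlainAllocator : Allocator {
  int64_t allocate(uint64_t) override { return 0; }
};

struct ZonedAllocatorSketch : Allocator {
  bool zone_states_loaded = false;
  int64_t allocate(uint64_t) override { return 0; }
  // SMR-only call that is deliberately absent from the base interface.
  void load_zone_states() { zone_states_loaded = true; }
};

// Returns true if the allocator was a zoned one and was initialized.
inline bool init_if_zoned(Allocator* a) {
  assert(a != nullptr);
  if (auto* z = dynamic_cast<ZonedAllocatorSketch*>(a)) {
    z->load_zone_states();
    return true;
  }
  return false;
}
```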

A

So that's why we have these dynamic casts here, and then here we're...

A

Calling the allocator; we're initializing the allocator. Here we're getting the zone states from the database. This will become clear once we start looking at the allocator; I'll come back to this. And then we're also passing a lock and a condition variable.

A

These are also being used by the cleaner thread here. Okay, I'll come back to this again. Let me just say that this is the initialization of the allocator and the freelist manager.

A

And this is some cleanup code to stop the cleaner.

A

I think that originally, the way I added it, it wasn't like this, but this scope guard was added recently and they've changed it to fit some convention of using the scope guard. So if it's SMR, then we...

A

Yeah, I don't recognize adding this myself, but I think this was modified.

C

Yeah, that's a recent update from Kefu.

D

A

Unmount: if we're unmounting, we stop the cleaner thread.

A

Okay, this is kind of a more interesting part. For every object we maintain the zone number plus OID as the key and the offset of the object as the value, and this is stored in this namespace. I think this is already out of date, because I have changed the namespace name, no?

B

A

Correct. Okay, so when a new object is written to a zone, we insert the corresponding tuple into the database. When an object is truncated, we remove the tuple, and when it's overwritten, we remove the old tuple and insert a new tuple corresponding to the new location of the object.

A

And now the cleaner can identify live objects within a zone by enumerating all keys that start with the zone number prefix. I talked about this in our last meeting a little bit, so that's the implementation of it. So we maintain this zoned_onode_to_offset_map (we prefixed everything that's zone-related with zoned), and this is basically an onode-to-offset map. Let me first quickly jump to the places where we update it.
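A minimal sketch of the prefix enumeration described above, using an ordered `std::map` as a stand-in for the key-value store. The key format (a zone prefix string followed by the object id) and the function name are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Return the object ids of all keys that start with the given zone prefix.
// Because the map is ordered, every key of one zone is contiguous, so the
// scan starts at lower_bound(prefix) and stops at the first non-matching key.
inline std::vector<std::string> live_objects_in_zone(
    const std::map<std::string, int64_t>& kv, const std::string& zone_prefix) {
  assert(!zone_prefix.empty());
  std::vector<std::string> out;
  for (auto it = kv.lower_bound(zone_prefix);
       it != kv.end() &&
       it->first.compare(0, zone_prefix.size(), zone_prefix) == 0;
       ++it) {
    out.push_back(it->first.substr(zone_prefix.size()));
  }
  return out;
}
```

The scan only touches the keys of the requested zone, which is why the zone number has to come first in the key.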

A

Okay, so this is a map from onode to a vector of object offsets. For new objects created in the transaction, we append the new offset to the vector. For overwritten objects, we append the negative of the previous on-disk offset followed by the new offset, and for truncated objects we append the negative of the previous on-disk offset.

A

We need to maintain a vector of offsets because within the same transaction an object may be truncated and then written again, or an object may be overwritten multiple times into different zones.
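A minimal sketch of the sign convention described above. Objects are identified by a plain string here instead of an onode reference, and the type and function names are hypothetical. The sketch assumes offsets are strictly positive, since offset 0 could not be distinguished from its negation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-transaction map: object id -> offsets.
using OffsetLog = std::map<std::string, std::vector<int64_t>>;

// New object: record where it was written.
inline void note_new_object(OffsetLog& log, const std::string& oid,
                            int64_t new_off) {
  assert(new_off > 0);
  log[oid].push_back(new_off);
}

// Overwrite: negated old location first, then the new location.
inline void note_updated_object(OffsetLog& log, const std::string& oid,
                                int64_t old_off, int64_t new_off) {
  assert(old_off > 0 && new_off > 0);
  log[oid].push_back(-old_off);
  log[oid].push_back(new_off);
}

// Truncate or delete: only the negated old location.
inline void note_truncated_object(OffsetLog& log, const std::string& oid,
                                  int64_t old_off) {
  assert(old_off > 0);
  log[oid].push_back(-old_off);
}
```

Keeping a vector per object preserves the order of events, which matters when one transaction writes, truncates, and rewrites the same object.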

A

So, let's see where we update this.

A

Okay, so we have these short functions, zoned_note_new_object, zoned_note_updated_object, and zoned_note_truncated_object, and within BlueStore we call these in specific places. I'll jump to where we are calling them, but before that, let's see how we process them. So we go over...

A

The offsets. We go over the entries in the map, and for each offset, if the offset is positive, it means the object was added, so we insert it under the prefix. This zone key function is here: given an offset and an OID, it returns a key of the form zone number plus OID.

A

We also need to give it the offset so that it can get the actual zone number, and it gets the zone number by just dividing the offset by the zone size. Then it encodes that and returns a single string that has the zone key plus the object key. So this function returns the key, and then this is the offset, as a bufferlist. If the offset is negative, then it means the object was removed, so we just remove the key.

A

We pass the negated offset so that it is positive when we compute the zone number here, and we just remove the key. Now let's look at the parts where we are actually updating these.
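A minimal sketch of the processing step described above: positive offsets insert a key, negative offsets remove one, and the zone number is the offset divided by the zone size. The key encoding (16 hex digits, a dot, then the object id) and the use of `std::map` as the key-value store are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

using KV = std::map<std::string, int64_t>;  // stand-in for the key-value store

// Key = fixed-width zone number followed by the object id, so that all keys
// of one zone sort together. The encoding is an assumption for illustration.
inline std::string zone_key(uint64_t offset, uint64_t zone_size,
                            const std::string& oid) {
  assert(zone_size > 0);
  char buf[17];
  std::snprintf(buf, sizeof(buf), "%016llx",
                static_cast<unsigned long long>(offset / zone_size));
  return std::string(buf) + "." + oid;
}

// Apply one object's recorded offsets, in order, to the key-value store.
inline void apply_offsets(KV& kv, uint64_t zone_size, const std::string& oid,
                          const std::vector<int64_t>& offsets) {
  for (int64_t off : offsets) {
    if (off > 0) {
      kv[zone_key(static_cast<uint64_t>(off), zone_size, oid)] = off;
    } else {
      // Negate so that the zone number is computed from the real offset.
      kv.erase(zone_key(static_cast<uint64_t>(-off), zone_size, oid));
    }
  }
}
```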

A

Yeah, so we are calling these three functions, the ones that start with zoned_note. So let's look at...

A

So we call new object. This is write, yes. So in _do_write, if...

A

So if the device is SMR and the old extents are empty, it means it's a new object, because we're not overwriting.

A

So we note a new object. If...

A

If the old extents are not empty, then it means we're updating an object, so we call updated object, and we get the old on-disk offset using this. Again, these are assumptions that are not true all the time.

A

This is for the very simple scenario where you have complete objects that map to complete extents on disk. So we call new here, we call updated here, and then we need to call it somewhere else, and this is, I think, in truncate. Yeah: deleting or truncating the object, both code paths go through this function to truncate.

A

And here we are noting the offset of the object that was truncated.

A

Okay, so let's see what else we have.

D

A

Yeah, so in _txc_finalize_kv: up to here, within a transaction, we have made note of all the newly created objects, deleted objects, and updated objects, and all of that information is within this map. And then...

A

Here we are processing them and making the necessary key-value changes to the transaction. And then, after that, I think all those updates to the key-value store that keeps track of the metadata...

A

Regarding the zones and objects are persisted to the key-value store.

A

This is the cleaner thread, so this just starts the cleaner thread.

A

Again, we're using dynamic_cast, and this gets the zones to clean from the freelist manager when it starts.

A

This pretty much follows the pattern that you have here for other threads. I tried to follow the same pattern, with one function that starts the thread and another function that stops it. So here we get the zones to clean, if it's empty...

A

Oh, we have the protocol here for making sure that if we somehow crash in the middle of a cleaning process, then when we resume we can continue. We have a protocol here that makes sure that it...

A

It stays consistent after a crash and resumes where it stopped. I have described the protocol in the PR, and it should also be here, but basically, what we're doing is making note of when we start to clean.

A

Let me actually quickly find the PR so that...

D

B

I had a question about this, because I was wondering when I was reading the code why it writes down which zones are getting cleaned, because any individual object that you're moving out of a victim zone, I think somewhere...

D

B

Like, that's an atomic operation, or whatever it needs should be atomic. And if you're partway through cleaning a zone and then you restart, presumably, if you chose that zone before, you'll be even more likely to choose it again, because there's less stuff in it now than there was before. But even if you didn't, maybe just because you have a different policy around cleaning... I guess I wasn't sure why, yeah.

A

B

Like, remember and continue. Why not just look at...

A

I think probably because you do not have to... so, first, let me...

A

Let me find the commit, yeah, the PR where I actually had it.

C

A

Yeah, so let me stop sharing that and share this.

A

Do you see this?

B

A

So: cleaning multiple zones is not atomic. Therefore, to support resuming cleaning after a crash, the cleaner first persists a list of zones to clean as the value of the cleaning-in-progress zones key. I mean, I think it's just to save the effort of going through...

A

Doing everything again, because...

A

Oh no, no! It's.

A

Oh, that's because there will be an inconsistency if you do not. Let's say you chose five zones and you started to clean them. You update the persistent metadata after you have cleaned all of the zones.

A

So let's say you clean the first zone and you still haven't updated the metadata for that zone in RocksDB. For each zone we're keeping the number of dead bytes in the zone and the write pointer, and we update them after we have cleaned all the zones in a batch.

A

So if we clean the zone but do not persist its metadata, I think what will happen is, after we resume from the crash, we will...

A

Since we haven't updated the metadata for the first zone that we cleaned, we will still use old, stale metadata to decide which zones to clean, and we will offer the zone again as a candidate for cleaning even though it has been cleaned. So what will happen? The cleaner will go through this zone, find that there is nothing to be cleaned because it's already been cleaned, and then move on to the next zone.

A

So it will probably just avoid some redundant work, I guess. But whether it will break consistency... I thought about this when I was designing this protocol, but I didn't make any notes, so I am not 100% sure.
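A minimal sketch of the resume protocol as described above: persist the list of zones first, clean them one by one, then persist the zone metadata for the whole batch. `CleanerStore` stands in for the persistent key-value store, and the detail that a cleaned zone ends with zero dead bytes is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Stand-in for the persistent key-value store (hypothetical).
struct CleanerStore {
  std::vector<uint64_t> in_progress;        // zones noted as being cleaned
  std::map<uint64_t, uint64_t> dead_bytes;  // per-zone dead byte counters
};

// Clean a batch: note the zones first, clean each, then persist the result.
inline void clean_batch(CleanerStore& db, const std::vector<uint64_t>& zones,
                        const std::function<void(uint64_t)>& clean_zone) {
  db.in_progress = zones;                         // step 1: persist the intent
  for (uint64_t z : zones) clean_zone(z);         // step 2: not atomic overall
  for (uint64_t z : zones) db.dead_bytes[z] = 0;  // step 3: persist zone states
  db.in_progress.clear();
}

// On startup, redo whatever an interrupted batch left behind.
inline void resume_cleaning(CleanerStore& db,
                            const std::function<void(uint64_t)>& clean_zone) {
  if (db.in_progress.empty()) return;
  const std::vector<uint64_t> zones = db.in_progress;
  clean_batch(db, zones, clean_zone);
}
```

If a crash happens during step 2, the noted list survives, so `resume_cleaning` revisits every zone of the batch; zones that were already emptied simply yield no work.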

B

I guess what I'm wondering is... I mean, if you look at the state of the system, you can have an object that's going to have some number of extents. Say an object has one extent that's in a zone that needs to be cleaned. So there's going to be some do-move or something similar that's basically going to take...

D

B

That extent: it's going to read it in and write it again somewhere else. But if we can make that transaction atomic, in the sense that it updates the allocation for the new zone, which says it has more used bytes, write pointer, or whatever, and it also increases the dead bytes on the old zone...

B

Then that's a fully atomic operation.
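A minimal sketch of the atomic move being proposed here: the object's new location, the destination zone's write pointer, and the source zone's dead bytes all change in one all-or-nothing step. The `Store` and `ZoneMeta` types are hypothetical, the write pointer is assumed to be relative to the zone start, and building the new state on a copy is only a stand-in for a single key-value transaction.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

struct ZoneMeta {
  uint64_t write_pointer = 0;  // assumed relative to the start of the zone
  uint64_t dead_bytes = 0;
};

struct Store {
  std::map<std::string, uint64_t> object_offset;  // object id -> disk offset
  std::map<uint64_t, ZoneMeta> zones;             // zone number -> metadata
};

// Build the new state on a copy and swap it in, standing in for one
// all-or-nothing key-value transaction.
inline void move_object(Store& s, const std::string& oid, uint64_t len,
                        uint64_t src_zone, uint64_t dst_zone,
                        uint64_t zone_size) {
  Store next = s;
  ZoneMeta& dst = next.zones[dst_zone];
  assert(dst.write_pointer + len <= zone_size);
  next.object_offset[oid] = dst_zone * zone_size + dst.write_pointer;
  dst.write_pointer += len;                 // destination zone grows
  next.zones[src_zone].dead_bytes += len;   // source zone gains dead bytes
  s = std::move(next);                      // commit point
}
```

Because the zone counters change together with the object's location, a crash before the commit point leaves the old, still consistent state, and no separate in-progress list is needed for correctness.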

A

D

And then everything will be fully consistent.

B

A

B

A

Yeah, so if do-move is atomic, then we do not need to do this stuff that we're doing right now. But the issue there is batching a bunch of updates: with do-move we'll have to do a ton of small updates to RocksDB, but by first making a note and then doing all the updates at the end, we'll have more coarse-grained updates to RocksDB.

A

D

A trade-off. You're just talking about the one allocation bit of metadata, though, the dead bytes. I don't think it matters.

B

Yeah, or I mean, even if it did, you could still batch, like, three to ten objects, or one object. Whatever the transaction granularity is, as long as that transaction makes the metadata for the zones consistent...

D

Yeah, that's definitely correct.

B

Look, eventually used bytes will equal dead bytes or whatever, and then there'll be some final step that says: oh, the zone has been completely cleaned, and so I need to reset the write pointer or something.

A

Actually, yeah. If do-move is atomic, again, it's a matter of whether you want to batch updates to RocksDB or do one every time you clean an object. And we'll still have to make do-move transactional anyway, because it will need to update object metadata atomically. But there's also the related zone metadata, because once the object move is complete...

A

We also increase the number of dead bytes in the zone as well. Yes, yeah.

B

That's what we're saying, yeah. We already have a transaction framework on the write path, and so every transaction should be atomic and leave everything fully consistent. So I guess what we're saying is that if we just make sure that we update the allocation metadata in those transactions, just like we do with writes, then...

D

B

We wouldn't need to add this second layer of in-progress cleaning tracking. You could drop all of that, and that'll simplify things.

A

Yeah, I mean, it still saves some RocksDB transactions, but...

B

I think the SMR case is such that it's one key, basically, in an existing transaction that's updating an integer for the used bytes, and so, yeah, it's not worth the effort.

D

Okay, it's not a random write for RocksDB anyway. As long as it's updates to the same key, they're going to be merged in level zero. The cost is zero; it's not real. Yeah, okay.

B

Effectively free yeah.

A

Go back to sharing, yeah.

A

Okay, so we were here: cleaner start.

A

Yeah so we're getting the number of zones to clean.

A

And then we resume: we call _zoned_clean_zone on each zone to clean it, then reset all the zones, and then this marks the zones clean in the freelist manager, which is in RocksDB. So this persists that the zones are clean.

B

A

B

Question here: is there any reason to clean multiple zones at once?

B

D

B

If you're going to have lots of extents that you need to move within a zone, each of those, or even chunks of them, is going to be a transaction. But let's keep it simple and just say each one's a transaction. There's not going to be value in reading in parallel from multiple zones, because it's a hard disk with seek latency, so you're really going to want to sequentially read the entire zone, or whatever the live bytes are in the zone. So this can...

A

B

Be simplified to: pick one victim zone and focus on that zone, and then pick the next victim.

A

So the zone selection process happens in the freelist manager; the freelist manager gives the list of zones to clean. Are you asking why we're cleaning multiple zones at a time?

B

Yeah, well, I'm just wondering if we can simplify that to just say: freelist manager, give me the one victim zone I should work on first. And then you finish that, yeah.

A

B

And ask it again what the new victim is, because the situation will change by the time you're done cleaning that first zone anyway. There's no value in cleaning two zones in parallel, I guess.

A

I mean, it's not cleaning two zones in parallel. It just asks for them once, and that's tunable. Those things we haven't actually... if we look at...

A

So, this one just reads it from the database. I think it's in the zoned allocator, the place where we actually decide what zones to clean. Yeah, here: find zones to clean. Right now it's just set to one, and I've made a note here to make it tunable.
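A minimal sketch of a victim selection with a tunable batch size, as discussed above. The talk does not state the selection criterion, so ranking zones by dead bytes is purely an assumption here, and the function name and signature are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pick up to max_zones zones, preferring those with the most dead bytes.
// dead_bytes[z] is the dead byte counter of zone z. Ranking by dead bytes
// is an assumption for illustration, not the policy from the Ceph source.
inline std::vector<uint64_t> find_zones_to_clean(
    const std::vector<uint64_t>& dead_bytes, std::size_t max_zones) {
  assert(max_zones > 0);
  std::vector<uint64_t> candidates;
  for (uint64_t z = 0; z < dead_bytes.size(); ++z) {
    if (dead_bytes[z] > 0) candidates.push_back(z);
  }
  std::stable_sort(candidates.begin(), candidates.end(),
                   [&](uint64_t a, uint64_t b) {
                     return dead_bytes[a] > dead_bytes[b];
                   });
  if (candidates.size() > max_zones) candidates.resize(max_zones);
  return candidates;
}
```

Calling it with `max_zones` set to 1 gives the one-victim-at-a-time behavior suggested in the discussion.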

A

So, okay, it just depends on what kind of cleaning you want to do. You may want to do stop-the-world cleaning, where you stop everything and no I/O is happening. For example, you know how we discussed having I/O redirected to the two other OSDs while the third one is not receiving any I/O, so all the cleaning is happening there. You may want to do that kind of cleaning.

A

Those are all policy things that are yet to be figured out, but yeah, sure, here get cleaning zones can always return just one.

B

Yeah, I guess the question is: you might have a cleaning strategy where you have one zone that you're trying to clean, but the data might be getting rewritten to multiple zones. Like, maybe you have some way to decide that this is going to be short-lived or long-lived data that you're reading out.

C

B

So you'd be reading from one zone and writing to two of them. But then you might also have a strategy where you have...

B

Just thinking out loud here: maybe you only read out the short-lived data and you write...

A

Again, it's not reading from multiple zones. It's just getting the list of multiple zones at the beginning and then cleaning them one by one, sequentially. It's not doing parallel cleaning.

A

And this one is just the standard pattern for stopping a thread. And this is the cleaner thread. This is where it happens in the loop: it just keeps getting zones to clean, and if there are no zones to clean, it goes to sleep. Otherwise, it first makes a note in the database.

A

These are the zones that it's going to work on, and then it just keeps cleaning zones one by one. This code is similar to the one that we saw in the start function, because that one is doing cleaning on recovery and this one is doing cleaning on the normal, common path. Got it. Okay.

C

And I presume that's the place where we are not atomic. That's why I suggested introducing this recovery process. I don't recall the details, but it looks like that.

B

If that's atomic, then we can drop the recovery path code.

A

Yeah, so here's the part. This is _zoned_clean_zone, which is the function that will take a zone number and clean the objects in the zone by calling an atomic do-move on every object in the zone. So if the do-move is atomic, then yes, we can let go of the recovery. I need to think more about this, but I think, yeah, you guys already...

A

Computed it in your heads and said that doing fine-grained updates to RocksDB will be free. I'm not sure; maybe you're right. We'll figure it out. But yeah, this do-move of an object needs to be atomic anyway...

A

For this whole thing to work, yeah.

D

That part.

A

Needs to be implemented. That's _zoned_clean_zone; it needs to implement do-move and then use that to move objects, and once it's done, we'll have a basic working...

A

You know, a complete working system without any optimizations. So now the only two things left are... I think that's all the ZBD-specific code.

A

Okay, so there may be a few other things. What's this one?

A

Okay, so here we have some.

A

On zoned devices, the first goal is to support non-overwrite workloads, such as RGW, with large, aligned objects. Therefore, for user writes, _do_write_small should not trigger. OSDs, however, write and update a tiny amount of metadata, such as OSD maps and statistics. For those cases we temporarily just pad them to min_alloc_size and write them to a new place on every update. So that's for handling the workloads that I was thinking of: _do_write_small...

A

Shouldn't have triggered, but there are still small metadata updates that are smaller than min_alloc_size, and here we're just padding them to min_alloc_size.
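A minimal sketch of the padding described above: a small write is rounded up to the next multiple of min_alloc_size so that it can be placed as a whole new allocation. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Round a write length up to the next multiple of min_alloc_size.
inline uint64_t pad_to_min_alloc(uint64_t len, uint64_t min_alloc_size) {
  assert(min_alloc_size > 0);
  return (len + min_alloc_size - 1) / min_alloc_size * min_alloc_size;
}
```

With a 64K min_alloc_size, a 100-byte metadata update therefore occupies a full 64K at its new location.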

B

That makes sense, I think. We just need to make sure none of the overwrite paths get triggered in this case.

A

Yeah so I mean, let me write pattern.

A

Oh yeah, can you say that again?

B

We want to make sure that we can support any write pattern, like...

A

B

A

That's what I was... yeah.

B

We also need to make sure that we don't trigger any of the cleverness where we're trying to not allocate a new blob. So, basically, everything should always write a new blob for...

A

Everything. So yeah, for that, I think we may need to revisit the metadata that we keep.

A

B

Well, I think that all the existing data structures should work okay. It's just a matter of, and Igor probably knows better, making sure that we take the right path through _do_write, _do_write_small, and _do_write_big, so that we always allocate a new blob. The allocator is always going to give us a new write in the right place, and we don't take any of the wonky paths.

A

So what do you mean by wonky paths?

B

Sometimes we overwrite a block. Sometimes we journal the write that we intend...

D

B

And then write it back to do the deferred write. We don't want any of that stuff. We don't want to write into the unwritten portions of an existing blob; we can drop that, and so on. Actually, here's a thought, Igor: if I remember correctly, in the blob encoding there are all those fields for what parts of the blob are actually used or allocated. We could probably drop that, because we're never going to do any of these...

B

Subsequent blob updates. We can make the encoding more efficient, maybe. Yeah, okay.

A

So what was this one? Okay, I already covered this one. I think I'm done here, with maybe a few things left: this is just the thread, and these are data structures.

A

D

A

Some specific functions.

A

Okay, so this is another thing that I added: return the offset of an object on disk. This is added to Onode. This function is intended only for use with zoned storage devices, because on these devices the objects are laid out contiguously on disk, which is not the case in general. So this function is supposed to get the offset at which the object starts.

A

I've already forgotten how all these blobs and extents and so on work. I need to review this, but I think that this was correct.

A

This just assumes that objects are laid out contiguously: you have an object that starts at offset N, and all of its data is supposed to live at offsets N plus L, larger than N, which is, I guess, usually not the case. In the general case it's possible that you have parts of it at offset N and other parts in some blobs at offset N minus 100 or something. I'm not sure. Anyway, this is another part that I added.

B

Yeah, okay, so, Igor, tell me if this makes sense. I think this strategy is a little bit off, because we want to allow these arbitrary write patterns, and in order to support that we want to allow blobs to exist in multiple zones for a single object. Which means that, I think, probably what we want is: for any given onode...

B

We should have a set of zones that are touched by that object. That can either be "this object exists in this zone somewhere with one or more extents", so one key per onode-zone pair, or it could be per blob, whatever; we can fiddle with that.

B

But in the simplest case, just say "this onode has some data in this zone". And then we'd also need a similar structure in the write context that says what we're updating, like: this write...

B

Transaction has a new extent in this particular zone, or we've deallocated the last extent for a given object in that zone. So there's the delta there, so that we know when to create and remove the keys, basically, for the cleaner.
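A minimal sketch of the proposal above: compute the set of zones an object touches before and after a write, and derive which onode-zone keys to create and which to remove. The `Extent` and `ZoneDelta` types and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical physical extent: byte offset on disk and length.
struct Extent {
  uint64_t offset;
  uint64_t length;
};

// Set of zone numbers that hold at least one byte of the given extents.
inline std::set<uint64_t> zones_touched(const std::vector<Extent>& extents,
                                        uint64_t zone_size) {
  assert(zone_size > 0);
  std::set<uint64_t> zones;
  for (const Extent& e : extents) {
    if (e.length == 0) continue;
    const uint64_t last = (e.offset + e.length - 1) / zone_size;
    for (uint64_t z = e.offset / zone_size; z <= last; ++z) zones.insert(z);
  }
  return zones;
}

struct ZoneDelta {
  std::set<uint64_t> added;    // zones that need a new onode-zone key
  std::set<uint64_t> removed;  // zones whose onode-zone key can be deleted
};

// Difference between the zone sets before and after a transaction.
inline ZoneDelta zone_delta(const std::set<uint64_t>& before,
                            const std::set<uint64_t>& after) {
  ZoneDelta d;
  for (uint64_t z : after) {
    if (before.count(z) == 0) d.added.insert(z);
  }
  for (uint64_t z : before) {
    if (after.count(z) == 0) d.removed.insert(z);
  }
  return d;
}
```

A key is removed only when the last extent of the object leaves a zone, which matches the "deallocated the last extent" condition in the discussion.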

B

Does that make sense? And then we would get rid of this, because the result of this particular function would change over time, right, if you overwrite those bytes and they go into a new blob in a new zone.

C

So yeah, I think, right.

C

We need a sort of list, at least, instead of just a single offset. So, well, okay, when we...

C

Detect which sets to clean up, we can still use this old extent list, but build a list of offsets, rather than a single one, from it.

B

A

Okay, so these are the main changes to BlueStore.cc, and then we have the zoned freelist manager.

A

Which is which, like the primary things here so there's one.

A

Zone types. Similar to BlueStore types, I also added zone types, which is very simple at the moment. It has one structure that gives the number of dead bytes and the write pointer for each zone, and these are just some...

A

Scaffolding: this just encodes and decodes the zone state. The zoned freelist manager has functions for writing those states into RocksDB and then loading them from RocksDB, and then this one needs some states.

A

And then there's a merge operator for updating, I think, the write pointer and the number of dead bytes.
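A minimal sketch of a per-zone state with encode, decode, and an additive merge. The struct layout, the raw byte encoding, and the assumption that the merge adds deltas to both counters are all illustrative and are not taken from the Ceph source.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical per-zone state: the two counters mentioned in the talk.
struct ZoneState {
  uint64_t dead_bytes;
  uint64_t write_pointer;
};

// Serialize as two raw 64-bit integers (host byte order, for brevity).
inline std::string encode(const ZoneState& s) {
  std::string out(sizeof(uint64_t) * 2, '\0');
  std::memcpy(&out[0], &s.dead_bytes, sizeof(uint64_t));
  std::memcpy(&out[sizeof(uint64_t)], &s.write_pointer, sizeof(uint64_t));
  return out;
}

inline ZoneState decode(const std::string& in) {
  assert(in.size() == sizeof(uint64_t) * 2);
  ZoneState s{0, 0};
  std::memcpy(&s.dead_bytes, in.data(), sizeof(uint64_t));
  std::memcpy(&s.write_pointer, in.data() + sizeof(uint64_t), sizeof(uint64_t));
  return s;
}

// Merge: add a delta to the stored value, in the spirit of a key-value
// merge operator. An empty existing value counts as all zeros.
inline std::string merge_add(const std::string& existing,
                             const ZoneState& delta) {
  ZoneState s = existing.empty() ? ZoneState{0, 0} : decode(existing);
  s.dead_bytes += delta.dead_bytes;
  s.write_pointer += delta.write_pointer;
  return encode(s);
}
```

Under this assumption an allocation merges a delta into the write pointer, and a release merges a delta into the dead bytes, without a read-modify-write by the caller.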

D

A

Create. So here is where we retrieve that piggybacked information, like the zone size, the number of zones, and the starting zone number. The starting zone number is the first zone number that...

A

Is a sequential-write-only zone, and then...

A

Again, this mimics what the current freelist manager does, by just writing this information into RocksDB.

A

Yeah, these are again similar to.

A

Similar to what the current...

A

Freelist manager does. The interesting part is release: when we do a release, we increment the number of dead bytes.

A

And this is standard, just for retrieving values from the key-value store, and this just loads the zone states. There isn't much going on here. Okay, get cleaning zones: this one I just added recently for that consistency protocol.

A

Zoned allocator.

B

Hey, I have a question about these host-managed SMR drives in general, but...

A

B

About the host-managed SMR drives in general.

B

Let's say that you think the write pointer is x. You issue a write at offset x and then you crash, but the write actually succeeds: the drive actually writes at x, but you don't know whether it happened. You restart, your software comes back up, and you think the write pointer is still x, even though the drive already processed a write at that offset.

A

Well, you can retrieve the write pointer from the drive, right? It keeps that state.

B

Internally, got it. Okay, so on every restart you should go and refresh it: the actual write pointer might be greater than or equal to the one you have, yeah.

A

It would be good to do a sanity check, going over them and making sure that they match. Ideally, they should always match.

D

Specifically, the one open for write you do have to check on startup. That's not optional.

A

All of them are open for write all the time.

B

Right, or whichever one you were writing to. If you are in the process of writing a bunch of stuff and you turn the machine off, then on restart you're probably going to have a write pointer that doesn't match.

A

Yeah, so I have another meeting at 12. I'll do as much as I can, but I'll need to sign off.

B

This is good. I think we covered just about everything. I guess the one point there, though, is that when the allocator starts up and loads all its write pointers from the freelist manager or whatever, we should also do one last check that looks at the drive and updates the write pointers based on that too, right?
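
The startup check discussed here could look roughly like the following sketch. The function name `reconcile_write_pointers` and both dicts are hypothetical; in particular, `reported` stands in for whatever the drive returns when asked for its zone write pointers. Treating a drive pointer that is behind the persisted one as an error is a choice made for this sketch, not something stated in the talk.

```python
def reconcile_write_pointers(persisted, reported):
    """Return corrected write pointers, trusting the drive when it is ahead."""
    fixed = {}
    for zone, wp in persisted.items():
        drive_wp = reported[zone]
        if drive_wp < wp:
            # Sketch-only policy: a drive pointer behind ours is treated as corruption.
            raise RuntimeError(f"zone {zone}: drive wp {drive_wp} < persisted wp {wp}")
        fixed[zone] = drive_wp   # equal in the common case, larger after a crash
    return fixed


persisted = {0: 4096, 1: 0}   # what was loaded from the key-value store
reported = {0: 8192, 1: 0}    # what the drive says; zone 0 saw a write before the crash
fixed = reconcile_write_pointers(persisted, reported)
```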

A

So, yeah, this is the allocator. Right now it just starts from the first zone and keeps iterating over the zones, and as soon as it finds a zone that fits the required size, it breaks out and returns that zone.

A

If we have iterated through all the zones and haven't found enough space, then it means we've run out of space.

A

So when it returns here, let's say it has successfully found a zone. Then it increments the write pointer of the given zone and reduces the number of free bytes, and if there's no free space left in the current zone, it also increments the zone number so that next time it starts from another zone. This is very rudimentary. So, find zones to clean, this one.
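
A minimal sketch of the allocation loop as described, assuming equal-sized zones and ignoring locking and cleaning. The class and attribute names are made up for illustration and are not taken from the Ceph source.

```python
class ZonedAllocatorSketch:
    def __init__(self, num_zones, zone_size):
        self.zone_size = zone_size
        self.write_pointers = [0] * num_zones   # per-zone offset of the next write
        self.num_free = num_zones * zone_size
        self.current_zone = 0                   # zone where the next search starts

    def allocate(self, want):
        """Return an absolute offset for `want` bytes, or None when out of space."""
        for zone in range(self.current_zone, len(self.write_pointers)):
            wp = self.write_pointers[zone]
            if self.zone_size - wp >= want:
                self.write_pointers[zone] = wp + want
                self.num_free -= want
                if self.write_pointers[zone] == self.zone_size:
                    self.current_zone = zone + 1   # zone is full; skip it next time
                return zone * self.zone_size + wp
        return None   # iterated over every zone without finding room


alloc = ZonedAllocatorSketch(num_zones=2, zone_size=100)
a = alloc.allocate(60)   # fits in zone 0
b = alloc.allocate(60)   # zone 0 has only 40 bytes left, so zone 1 is used
c = alloc.allocate(40)   # fills the rest of zone 0
d = alloc.allocate(50)   # no zone has 50 contiguous bytes left
```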

A

So for this one, num_zones_to_clean is an atomic, and we have a function here that checks if we are low on space.

A

So, find zones to clean. If there is cleaning currently ongoing, in which case num_zones_to_clean will be non-zero, or we're not low on space, then it just returns. Otherwise, it does a partial sort of the current zones by their number of dead bytes, sorting only as many zones as we want to clean at once, and then it sets zones to clean to the list of zones that we want to clean, based on the result of the partial sort.
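
The selection step can be sketched as below. `heapq.nlargest` plays the role of the partial sort here; the function signature, the `at_once` parameter, and passing the low-on-space result in as a flag are assumptions made to keep the example self-contained.

```python
import heapq


def find_zones_to_clean(dead_bytes, num_zones_to_clean, low_on_space, at_once=2):
    """Pick the `at_once` zones with the most dead bytes, or nothing if not needed."""
    if num_zones_to_clean != 0 or not low_on_space:
        return []   # cleaning is already in progress, or there is no space pressure
    # Only the top `at_once` zones get ordered; the rest are never fully sorted.
    return heapq.nlargest(at_once, range(len(dead_bytes)), key=dead_bytes.__getitem__)


dead = [10, 500, 0, 300, 20]   # dead bytes per zone, indexed by zone number
picked = find_zones_to_clean(dead, num_zones_to_clean=0, low_on_space=True)
skipped = find_zones_to_clean(dead, num_zones_to_clean=2, low_on_space=True)
```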

A

Okay, where was this being called from?

A

Okay, so find zones to clean may trigger cleaning.

A

Every time we allocate, we basically check if there's a need to clean, and this may trigger cleaning, because if it finds zones to clean, then it will notify the condition variable and the cleaner thread will start cleaning. So this is the allocate path. Basically, we just have a very simple allocator that iterates over the zones and returns the first zone that has enough free space, and every time it allocates, it also looks to see if we should start cleaning, and if so it kicks the cleaner thread so that it starts cleaning.
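
The kick from the allocate path to the cleaner thread can be illustrated with a condition variable, as in this sketch. `CleanerSketch`, `kick`, and `run_once` are hypothetical names, and appending to a list stands in for the real cleaning work.

```python
import threading


class CleanerSketch:
    def __init__(self):
        self.cond = threading.Condition()
        self.zones_to_clean = []
        self.cleaned = []

    def kick(self, zones):
        """Called from the allocate path once zones to clean have been found."""
        with self.cond:
            self.zones_to_clean = list(zones)
            self.cond.notify()

    def run_once(self):
        """Cleaner thread body: sleep until kicked, then clean one batch."""
        with self.cond:
            # wait_for re-checks the predicate, so a kick that arrives early is not lost.
            self.cond.wait_for(lambda: bool(self.zones_to_clean))
            batch, self.zones_to_clean = self.zones_to_clean, []
        self.cleaned.extend(batch)   # stand-in for copying live data and resetting zones


cleaner = CleanerSketch()
worker = threading.Thread(target=cleaner.run_once)
worker.start()
cleaner.kick([1, 3])
worker.join()
```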

A

That's the allocate path. In the release path, we just go over the released interval set and keep updating the number of dead bytes for the affected zones. This is not in RocksDB; this is just in memory.

A

Yeah, this just updates it in memory, and then the freelist manager will do it persistently.
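
A small sketch of the in-memory release accounting, assuming released space arrives as (offset, length) pairs and zones are equal-sized. Splitting an extent that crosses a zone boundary is a precaution added for this example; the talk does not say whether that can happen.

```python
def release(dead_bytes, zone_size, released):
    """Add each released (offset, length) extent to its zone's dead-byte count."""
    for offset, length in released:
        while length > 0:
            zone = offset // zone_size
            # Charge only the part of the extent that lies inside this zone.
            chunk = min(length, (zone + 1) * zone_size - offset)
            dead_bytes[zone] += chunk
            offset += chunk
            length -= chunk


dead = [0, 0, 0]   # in-memory dead-byte counters, one per zone
release(dead, zone_size=100, released=[(10, 20), (90, 30)])
```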

A

Yeah, this function, init add free, just adds. There's some hack here as well. I don't remember it at the moment; I need to look at it more carefully to figure that out, but it looks like that code path is

A

Yeah, there should be some code path here that has the part where you first allocate a block for the superblock.

A

I'm not sure. Probably that code does not have it, so I'll just search for that and make sure I've covered them all.

A

Okay, this is another code path, in transactions.

A

So currently you may have multiple threads issuing writes. We have multiple threads that may be executing on these code paths, so there may be allocations and writes that happen out of order.

A

So here, what we're doing is, if it's SMR, we have added a lock so that allocation and the write to disk happen in a single step. So we take the lock here and we release it down here somewhere.

B

Somebody else right.

A

Here, yeah, so we release it here. The purpose of this is to make the allocate and write steps a single step, because it's possible that one thread allocates and the allocator gives it some offset, let's say 0, and then before it can write, another thread allocates and gets an offset like 10, and before the first thread writes, the second thread starts writing. So it ends up writing at offset 10 before offset 0 has been written.
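
The effect of holding one lock across allocate and write can be shown with a toy zone that rejects out-of-order writes. Everything here (`SequentialZone`, `alloc_write_lock`, the fixed 10-byte writes) is illustrative only and not the BlueStore code.

```python
import threading


class SequentialZone:
    """Toy zone that rejects any write not at its current write pointer."""

    def __init__(self):
        self.write_pointer = 0

    def write(self, offset, length):
        if offset != self.write_pointer:
            raise IOError(f"write at {offset}, but write pointer is {self.write_pointer}")
        self.write_pointer += length


zone = SequentialZone()
state = {"next_offset": 0}            # stand-in for the allocator's position
alloc_write_lock = threading.Lock()
errors = []


def allocate_and_write(length):
    with alloc_write_lock:            # allocation and write form one critical section
        offset = state["next_offset"]
        state["next_offset"] += length
        try:
            zone.write(offset, length)
        except IOError as exc:
            errors.append(exc)


threads = [threading.Thread(target=allocate_and_write, args=(10,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the lock covered only the allocation, a thread holding offset 10 could reach `zone.write` before the thread holding offset 0, and the toy zone would reject it.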

A

So this whole thing just makes sure that allocate and write are atomic. And I've made a note here: they're trying to add this new command to the kernel, zone append, where you're not going to give the... Okay, I need to leave right now. Yeah, I'm getting called. We can follow up on this.

B

Okay, thanks so much. This is super helpful.

B

I took some notes. I'll probably follow up with an email that's like a to-do list.

B

Thanks around okay thanks.