From YouTube: Ceph Code Walkthrough 2020-08-25: kRBD I/O Flow
B
Implicitly, there is no feature bit for it, because it is based on snapshots and therefore doesn't require a journal or any other modifications to the I/O path. Also keep in mind that for actually talking to the cluster, the rbd driver depends on another kernel module named libceph, which is basically a stripped-down version of the librados library implemented in kernel space. It is located in the net/ceph subdirectory, and here you can see the authentication framework.
B
At this point, it's just messenger v1, because we discovered several cryptographic weaknesses in messenger v2, and it's not conducive to implementing in the kernel client for a couple of implementation reasons. And so we came up with a new revision of the messenger protocol called messenger v2.1, which fixes those issues and will hopefully come to the kernel client soon, bringing support for full-blown on-the-wire encryption.
B
And then we have the monitor client, the osd client (which is basically the equivalent of Objecter in librados), and osdmap.c, which is where we decode the osd map, invoke CRUSH and post-process CRUSH to get the actual object placement, because we need to account for things like pg_temp settings, pg_upmap settings, you know, primary affinity, etc.
B
So the way rbd images are mapped and unmapped is through the sysfs interface. It's done by writing to these write-only files, which are called attributes, and when you type rbd device map or rbd device unmap on the command line, the rbd tool constructs the configuration string and writes it to one of these attributes.
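To make that concrete, here is a minimal sketch of such a configuration string and where it would be written; the monitor addresses, key and names are hypothetical placeholders, and the exact string format should be checked against the kernel's sysfs-bus-rbd ABI documentation:

```shell
# Hypothetical example values; the general shape is:
#   <mon addrs> <options> <pool> <image> [<snap>]
CONF="1.2.3.4:6789,1.2.3.5:6789 name=admin,secret=AQBplaceholder rbd myimage -"
echo "$CONF"

# Mapping is then a single write to one of the add attributes
# (requires root and a reachable cluster, so it is shown commented out):
# echo "$CONF" > /sys/bus/rbd/add_single_major

# Unmapping is likewise a write of the device id to a remove attribute:
# echo "0" > /sys/bus/rbd/remove_single_major
```

In practice the rbd tool assembles this string for you; writing it by hand is mainly useful for debugging.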
B
Of course, there is a bit more to it, because at least for mapping, the command line tool has to parse the supplied options, and it has to fetch the CephX key and add it to the kernel keyring. After the image is mapped, it waits until udev finishes processing the associated events and creates various symlinks. (Although that does interfere with container use cases, and we're currently working on fixing that.) But in a nutshell, all it takes is a single write system call. Now, the reason that there are two add attributes and two remove attributes...
B
That is, add and add_single_major, and remove and remove_single_major. The reason is that we support two major/minor device numbering schemes. Under the legacy scheme, each rbd mapping consumes a major number for itself, and the problem with that is that on a typical linux system there are only about 200 majors available. And so this limits the number of rbd mappings you can have on the node, and potentially also cripples the node, because if all available major numbers get allocated to the rbd driver, attempts to create other devices might fail.
B
Under the single major scheme, we share a single major number among all rbd mappings and, as a consequence, there is virtually no limit: you can have thousands of rbd mappings if you want to.
B
And so here we take the configuration string, and the first thing we do is bump a reference count on our kernel module. The reason we have to do this is the way module lifetime works in linux: the reference count keeps the module from being unloaded while it is in use.
B
So it's basically a comma separated list of monitor IP addresses, followed by a comma separated list of mount-like key=value options, a pool name, an image name and an optional snapshot name. And notice that there is no field for a namespace name here; the namespace is optional and...
B
Well, let me unshare the screen, maybe it'll help. How about now?
B
We were on namespaces. So, the namespace name is optional, and it gets passed as one of these key=value options.
B
The output is a set of libceph-specific options in struct ceph_options, a set of rbd-specific options in struct rbd_options, and the names (the pool name, the image name) are stashed in struct rbd_spec. rbd_spec also stores ids for the entities that have them; those are filled in later, one by one, as they are discovered. Namespaces don't have ids.
B
We get a libceph instance, and a new libceph instance means a new messenger, a new set of sockets, a new osd client, etc. If we were to create a fresh libceph instance for each new mapping, we would run the system out of resources really fast.
B
So unless the user asks us not to share libceph instances with the noshare option, we go through the list of existing libceph instances and attempt to find one with the same options.
B
And here, if we find one, we bump a reference count on it and return it. And here we ensure that the returned libceph instance has the latest osd map, because it may have been sitting there idle, and it could be that the pool we are mapping an image from has just been created and this libceph instance hasn't learned about it yet, because its osd map is, you know, a few epochs behind.
B
And otherwise, if we are not successful in finding an existing instance, we create a new one, and this is where we open a new monitor session, authenticate, and wait until both the monmap and the osdmap are received. And once that happens, we add the newly created libceph instance to the list here.
B
Next, we look up the pool id based on the pool name that we have, and that's trivial, because we have the osd map at this point, and then we allocate a struct rbd_device.
B
It holds the mapping state, the information about the parent image and some other things, and it is passed pretty much everywhere; it's, you know, more or less a global variable, which we probably need to split into a couple of smaller structures. And the next thing we do...
B
This function is called recursively for probing parent images, and the kernel stack is rather small: it's just 16 kilobytes on x86, I believe, and even smaller on other architectures.
B
So we need to be careful not to overrun it. The first thing we do here is look up the image id based on the image name. There is a special rbd_id object for each image, and what we do here is we format its name, which is a well-known prefix followed by the image name, and we call a class method...
B
An object class method called get_id on it, and you can see that all it takes is allocating a couple of buffers and calling into libceph. Once we have the image id, we can construct the header object name, because that again is a well-known prefix, followed by the image id.
B
I'm concentrating on format 2 now, because image format 1 has been deprecated. The next thing we do is, unless our mapping is read-only, we need to register a watch on the image header, and the way it's done is, again, we call into libceph and we supply two callbacks. The first callback is called on every notification.
B
This is how we know that the image has been resized or a snapshot has been created, for example, and then we need to refresh our view of the image header. Also, this is how the exclusive lock is implemented, so you can see some exclusive lock notifications here, apart from the header update one. The second callback is called on errors, and here what we do is just let everyone know that there has been an error and queue a work item to reestablish the watch when that becomes possible.
B
While we look at the parent information: we want to get the spec of the parent image, so that's the pool id, the namespace name, the image id and the snapshot id, and also the parent overlap value.
B
Say you have a 10 gigabyte clone. If you shrink that clone down to, say, 5 gigabytes and then grow it back to 10, the second half of the parent image can no longer be visible to the clone, because it should be all zeros, just like if you had shrunk a regular standalone image and then grown it. The parent overlap value is how we track this, and whenever we calculate a mapping of clone extents onto the parent extents...
B
We take it into account and prune extents that are outside the overlap region. And here we actually probe the parent image. If we didn't find any parent in the previous function, we bail; otherwise, we make sure that our parent chain is not too long and, after creating a struct rbd_device for the parent image, call rbd_dev_image_probe recursively with an incremented depth. This can go on for a while, because this parent can have its own parent and so on.
B
So now, if there is no object map, we can announce the disk. But if there is an exclusive lock or an object map, what we want to do is grab the exclusive lock here and, once that's done, load the object map, because we want to attribute the latency of grabbing the exclusive lock and loading the object map to the mapping process instead of to the first I/O request.
C
Hey Ilya, I think your screen is...
C
It's still there, but it's still showing dev_image_probe.
B
Can you see it now? I am at rbd_add_acquire_lock. (Much better, thanks. Yep.) And yeah, so once the lock is acquired and the object map is loaded (the object map gets loaded from the post-acquire handler that the exclusive lock state machine calls when the exclusive lock is acquired; I'm not going to show that), we announce the disk to the world and return to sysfs, and from there...
B
All right, so moving on to unmapping: that's another sysfs store callback and, as you can see, it's the same story, we have two of them and they forward to the same function.
B
The input is a block device id and an optional force flag, and here, after parsing that, we look up the struct rbd_device with the id that we need and check the open count. If the image is still opened by someone, we refuse to unmap, but the force flag overrides the open count check. This override was added with a specific iSCSI-related use case in mind, and it isn't really useful for anything else, so don't use it.
B
But if the open count check was overridden, we may have some outstanding I/O, and so we need to freeze the queues and wait for the outstanding I/O to complete.
B
After that we kill the gendisk, tear down the associated sysfs attributes, close the object map and unlock the image header. Here we unregister the watch on the header and flush the notification workqueue, and this is the unprobe function.
B
Here we put a reference on the parent image, which in turn puts a reference on its parent and so on; the entire parent chain gets cleaned up. And then we clean up our own state, in which we free the object map, drop the reference on the snapshot context, and free various header fields, and so on.
D
I have one question regarding the outstanding I/Os. What I see is that the driver is using the multi-queue block layer; if it were a single queue, would anything change?
B
Well, the function is called freeze queue, so it's singular, but really it takes care of all outstanding I/O requests. We actually set the queues up in that rbd_init_disk function that I showed earlier.
B
All right, let's jump to serving I/O. The entry point here is the rbd_queue_rq function; there's the definition. It is registered with the block layer, and the block layer calls it for each I/O request it wants us to handle. Here we grab the pre-allocated struct rbd_img_request, translate the block layer op code into the rbd op code, mostly just for historic reasons (as you can see, the mapping is now one-to-one), and initialize the image request.
B
Note these two operations, discard and zero out. They weren't different for a long time: we guaranteed that discard would zero the discarded region, and there used to be a sysfs attribute called discard_zeroes_data, which was set to true, and some people relied on that. But with the addition of the zero out op, this is no longer the case.
B
The semantics have been relaxed: if your discard request is small enough and we suspect that nothing will actually be deallocated on the OSDs, we will simply drop it; or, if your discard request is big but not suitably aligned, we will reshape it and discard a smaller region. The zero out op, on the other hand, is guaranteed to zero every single byte; we never second-guess zero out requests.
B
So if you want zeroing semantics, that's what you should use. And in the end, we offload the actual work of putting together the image request to the workqueue. This is done because the block layer has certain restrictions on what this queue_rq handler can do, in particular it cannot sleep, but we actually need to take a couple of sleeping locks in the process.
B
So we offload to this rbd_queue_workfn function, and here we grab the offset and length from the block layer request (which is referred to as rq here), run some checks, capture the snapshot context and some other relevant fields from the image header, and move on to filling the image request.
B
Discard and zero out requests don't have any data pages, but read and write requests of course do. The block layer request is more or less just a singly linked list of these so-called bio structs. Each bio struct contains a bunch of data pages in it, which we're going to need to either write out or read into, and here we pass the pointer to the head of the bio list.
B
There are two cases here: the case of a default simple layout and the case of a so-called fancy layout. A fancy layout, for our purposes, is any layout where the stripe unit size is not equal to the object size. I'm going to focus on the simple case here, because the case of the fancy layout is more complicated.
B
We do two passes there instead of one, but I want to point out that the "nocopy" in the name of this function refers just to the fact that, in the complicated case, we have to make a private copy of the page descriptor array. Page descriptors are small structs, and we never copy the actual user data.
B
Instead, we manipulate those page descriptors and arrange the data pages during those two passes in such a way that they are fed to the messenger in the right order. So it is always zero copy, in the sense that the data goes to the wire from its original source and arrives from the wire at its final destination.
B
And here, whenever we deal with block layer requests, the number of image extents is going to be one. It can be greater than one only when we deal with parent images, so this loop is going to be a no-op in most cases. ceph_file_to_extents is a libceph function that does the striping work.
B
Basically, it works through the given image extent and does the mapping one piece at a time. Any time it encounters a new object, it calls this alloc_fn callback, and any time it encounters a new stripe unit, it calls this action_fn callback. And if you look at how we invoke it, for the new object callback we pass...
B
This alloc_object_extent function, which just allocates a new object request and adds it to the current image request; the image request basically serves as a container for the group of related object requests. For the stripe unit callback, we pass a function appropriate to the context, and again, this is how data pages get added to the right object requests in the right order, no matter how fancy your layout is. And on return from ceph_file_to_extents...
B
And here, for each object request, we do some op-specific initialization, because everything before this point was just generic striping stuff. The interesting bit here is that we may end up deleting some of the object requests that we've just added.
B
If you recall the difference between discards and zero outs, and how we can drop discards if we don't think they're going to be useful: this is where it happens, and the logic is in rbd_obj_init_discard. You can see that we do some rounding up and rounding down based on the alloc_size value, and it defaults to 64 kilobytes.
B
As a compromise between bluestore's min_alloc_size values for HDDs and for SSDs, and filestore, which doesn't really have that concept. And if, after reshaping, the object request becomes smaller than alloc_size, this will return a positive value and the request gets dropped on this condition here. And at this point we're done: everything is set up and we are ready to kick off the image request.
B
And this is where it happens, so we are back to that workfn function where everything started.
B
And the actual state machine resides in this function that begins with two underscores. It returns a bool, and when it reaches a final state, it returns true. When that happens, there are two cases: if this image request is not a request to a parent image, then we just let the block layer know that we're done; but otherwise, we need to kick the state machine in the image above us, and this is what this branch does.
B
Here it is and, as you can see, it's pretty simple, because all we do here is: if the exclusive lock is needed and we are not the owner, we kick off the work to acquire the exclusive lock, and when that completes, the post-acquire handler will kick the state machine and we will land in this state.
B
We assert that either we don't need the lock or we are the lock owner, and we kick off the object request state machines. Again, an image request is really just a container for object requests, so all the work happens inside the object request state machines; here we just wait for them to complete and gather the result. Once the pending count hits zero, we return true, which signifies the termination of the state machine.
B
There we don't have to deal with object map updates and the whole separate copy-up state machine business; discards and zero outs go through the write state machine, because they modify the image just like a regular write does. And so, on a read, the first thing we do is consult the object map, if it's present. If the object map says that the object doesn't exist, we move on to handling ENOENT, and this is where it happens.
B
Otherwise... well, a note on the object map: notice that the query is called "may exist" instead of just "exists". This is because the object map is allowed to go inconsistent, but only in one direction. On a write, the object map is updated first and the objects are written to second.
B
If the client crashes after updating the object map but before creating the object, the object map will have a record of an object that doesn't actually exist in RADOS, and that's fine. But the reverse is not possible, because if it were possible, it would lead to data corruption: if the object exists in RADOS, the object map will always have a record of it.
B
The same goes for deletes: first the state is transitioned from "exists" to "pending deletion", then the delete is performed, and then the state is transitioned from "pending deletion" to "nonexistent". This is how we ensure that the reverse inconsistency can't ever happen. Back to the read state machine: if the object map says that the object may exist, we issue a RADOS read. It can complete with ENOENT if the object is not actually there, and that would mean that our object map is inconsistent.
B
But again, that's fine. And here is how our read is issued: we allocate an OSD request, format the read OSD op, allocate the messages and submit them to libceph. When we allocate an OSD request, we provide a callback which libceph invokes when it's done with the request, and in this callback we basically just grab the return value and kick the state machine for the associated object request.
B
So again, back to the read state machine. Let's say we got an ENOENT and there actually is a parent image and there is some overlap: in this case, we end up here and reverse map the object extent onto the parent image.
B
So here we call into libceph to do the reverse striping math, and you can see it's just a bunch of 64-bit divisions and multiplications, which are a bit cumbersome in the kernel, because on 32-bit architectures we cannot use the compiler intrinsics, and so, instead of using the regular slash operator, we have to use these macros.
B
And here is where the parent extents are pruned: the overlap is passed in, and...
B
The extents that are completely beyond the overlap mark are dropped, and the final overlapping extent is trimmed. It could very well be that all parent extents get dropped here and we return an empty array, and in that case it is no different from an ENOENT without a parent image; we handle it again by zeroing the request. But if the pruning process left us with at least one parent extent...
B
What we do is we kick off a read to the parent image, and you can see that here we create the child image request.
B
We associate it with the object request that we're currently processing, capture the necessary fields from the header, and fill it in much the same way as we filled the original image request that was initiated by the block layer. The only difference is that this image request is initiated by the object request that we're currently trying to process, and it's going to get one or more of its own object requests, and they will be filled again...
B
The same way: their page descriptors will be set to point to pages that were handed to us by the block layer for the original image request. So again, there are no temporary buffers or anything of that sort; everything is zero copy.
B
The image request to the topmost parent image can spawn another image request to the second topmost parent image and so on, but you can see that when we do this spawning, we do it via a work item.
B
Again, this is to avoid building up the stack due to the state machine recursion, but eventually we would either hit an object with some data in one of the parent images or hit a hole in the bottommost parent, and at that point the chain of image requests would be unwound by repeatedly taking that slightly obscure branch with the goto label...
B
The one that I showed earlier. And eventually we would get back to the original object request and land in this read state machine, in the read-from-parent state. Here we have to realize that, because we pruned the list of parent extents based on the current parent overlap value, we haven't read anything past the overlap mark, and so there is nothing there as far as we're concerned.
B
And back to the image request state machine: the pending count here will be decremented, we'll take the result (whether it's an error or the number of bytes that we've read) into account, and again return true, this time from the image request, and get back to the dispatch function.
B
I think that's it for reads, and we are nearly out of time, so I probably won't be able to cover the write state machine; let me just quickly go through a couple of states.
B
Here it is, and you can see that it's more complex. In particular, you can see that there are two object map related states here; that's what I mentioned with, you know, handling object deletions.
B
This is where we would flip from "pending deletion" to the "nonexistent" state, and we have an entire separate internal state machine here for copying up data from the parent image when we need to overwrite it in the copy-on-write fashion. You can see here we have states for, again, reading from parent, which is similar to when we read from parent for a regular read, but we have to deal with object maps here and with the interaction with the fast-diff and deep-flatten features. For fast-diff, we have a fourth object map state called...
B
"Exists clean", which allows the fast-diff logic to actually work. And with four states, the object map is a bitmap with two bits per object.
B
Absolutely, those interactions are hidden in these functions, but again, the structure of the write state machine is the same: ultimately, we kick off some requests and we wait for them to complete, with the same pending count and result gathering helper, just like in the read state machine.
B
And yep, I think that's it in a nutshell, but the write state machine is probably a topic for a whole separate walkthrough, just to explain all the copy-up intricacies and the snapshot context logic for deep-flatten and things like that.
D
Thanks, I have one question. What I basically understood from the code walkthrough is that the driver interacts with the multi-queue block layer. What I understand from the multi-queue block layer is that there are multiple software staging queues, and then, from what I saw, there's a function pointer which has a hardware dispatch queue as well.
B
Yes, that's what I showed earlier in this rbd_init_disk function: this parameter refers to the number of hardware queues. The number of software queues is not controlled by the block device driver at all; it's totally up to the multi-queue framework. The block device driver controls the number of hardware queues, and we set it to the number of present CPUs in the hopes of increasing parallelism.
B
We haven't actually benchmarked this; this is a fairly recent change. We used to have a single hardware queue, but that was because we had some global locks in the libceph kernel module, which any sufficiently parallel submissions from the rbd driver would bump into. But those have been fixed a long time ago, and so we've flipped the number of hardware queues to the number of CPUs.
D
And I understand that this is basically the most optimal case, that you can have hardware queues equal to the number of CPU cores. So that's the optimal case. Okay, okay, and last question.
D
What I basically understood from the code walkthrough, and when I read some scientific literature regarding RBD as well: is there any way someone can abstract the kernel module to user space? Because what I understood is that the entire Ceph client is in the kernel module, so it has no user space involved.
B
Yes, there is no user space involved; it's a complete reimplementation in C in kernel space, with all of the associated, you know, constraints and restrictions. These are two totally separate code bases.
B
Yes, if you want to make use of the user space code but get a kernel block device presented, we have the rbd-nbd driver for that. That's basically, you know, the kernel has an NBD client, and we place an NBD server on top of an instance of librbd, and they talk via the NBD protocol, and that way the rbd image ends up being exported to the kernel. But for krbd and libceph and CephFS...
B
These are totally separate, from-scratch implementations in kernel space. There's nothing shared, except for the implementation of the CRUSH algorithm, which is the same; it's written in C, so it's the same in librados and in libceph. And also some header files are shared, which define stuff like the latest feature bits and some parts of the on-wire format, such as messenger tags, the message header and the message footer.
B
You know, what OSD ops get serialized to, what MDS ops get serialized to, stuff of that nature. But those are the only things that are shared; everything else is pretty much separate.
D
Thanks, thanks. So I think, for user space: if someone is interested, the nbd driver, like you said, is the right driver to look at if you have to use user space.
B
Yes, if you're interested in utilizing those features that krbd doesn't support, so, for example, journal-based mirroring, it makes sense to use rbd-nbd.
B
I hope this was useful. Feel free to reach out on IRC or email with anything related to the kernel client or Ceph stuff in general, and thanks for...