From YouTube: Ceph Code Walkthrough: LibRBD I/O Flow Pt. 1 2020-10-27
Description
Apologies for the screen share quality in this episode.
https://tracker.ceph.com/projects/ceph/wiki/Code_Walkthroughs
Hello, everyone. My name is Jason Dillaman, and I am the current tech lead for the RADOS Block Device portion of Ceph. That's the goal of this talk. I guess, to start, I am going to share my screen.
Yep — all right, that's a plus-one. All right, so I'll count that as "it's working." As I get going, since I won't have the chat window visible, if anyone has any comments or questions, I'll stop from time to time and ask if anyone has any comments or questions that I can try to answer inline. Otherwise, if I forget — and I might — feel free to speak up and chime in, so we don't get too far ahead.
Let's say it's a one-terabyte block image: you're going to be able to address all the bytes between byte zero and byte one terabyte. But the way that is internally represented and stored within Ceph is that RBD — whether via krbd, using the kernel driver, or via librbd, the user-space library — breaks those requests down into much smaller, much more manageable backing objects in the Ceph storage cluster.
All of this is documented on docs.ceph.com: if you go to the architecture section, there's a link on how data striping works. Regardless of the size of the image, RBD only talks to backing Ceph objects of a fixed size; the default fixed size is four megabytes. So if you just do, from the rbd CLI, an "rbd create --size 1T" plus an image name, to create a one-terabyte image —
It's going to use the default four-megabyte backing object size when you first start writing data to that image. In RBD, everything is thin-provisioned, so there's actually no data — besides a little bit of metadata describing your image — when you first create the image. It's only as you start writing data to the image that backing data actually starts getting written to the OSDs.
The default case is that the stripe unit equals the object size. So if the object size is four megabytes, the stripe unit is four megabytes, and then the stripe count is one. In that case, bytes zero through four megabytes of the image go into object zero, bytes four megabytes through eight megabytes into object one, and so on and so forth.
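For the default layout just described (stripe unit equals object size, stripe count of one), the offset-to-object arithmetic can be sketched as follows. This is an illustration, not the actual librbd code:

```python
# Sketch of the default RBD striping math: stripe unit == object size,
# stripe count == 1, so the mapping is a plain divide-and-remainder.
OBJECT_SIZE = 4 * 1024 * 1024  # default 4 MiB backing objects

def map_offset(image_offset):
    """Return (object_index, offset_within_object) for an image byte offset."""
    return image_offset // OBJECT_SIZE, image_offset % OBJECT_SIZE

# Bytes [0, 4 MiB) land in object 0, [4 MiB, 8 MiB) in object 1, and so on.
```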
If we look at this rbd_types.h file for the new — and I say "new" loosely, because it's been the default for a while now — RBD image format 2: the way those backing data objects are represented in Ceph is that you have these objects prefixed with "rbd_data.", followed by the unique image ID, which is generated when the image is created, and then a sequence number — like an index into the image.
So again, for the simple case of a four-megabyte object size and a stripe count of one: you can just take that index and multiply it by the object size — four megabytes — and that's how you know which object to go read from and write to when it comes to getting your data.
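Putting the naming scheme and the index math together, a hypothetical helper might look like this. The "rbd_data." prefix and the image-ID component are as described above; the exact zero-padded hexadecimal width used for the index is an assumption for illustration:

```python
def data_object_name(image_id, image_offset, object_size=4 * 1024 * 1024):
    """Name of the backing object covering a given image byte offset."""
    index = image_offset // object_size
    # librbd formats the object index as zero-padded hexadecimal; the
    # 16-digit width shown here is an assumption for illustration.
    return "rbd_data.%s.%016x" % (image_id, index)
```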
So, as I mentioned, what this talk is about is librbd. This is a user-space library that different programs can compile and link against to utilize and access RBD images. I think probably one of the most well-known integrations is going to be QEMU. Here I have the QEMU source code, and in the block subdirectory there's this rbd.c.

This is the driver for how QEMU interacts with librbd. QEMU has its own internal hooks for how to do reads and writes and flushes and all that; all QEMU has to do is swap its internal API calls into librbd API calls. It just so happens that it always uses the same helper function, which is this rbd_start_aio.
It determines if it's a way-old version of librbd, in which case it needs to use a bounce buffer: if so, it has to copy all the I/O — because it might be an iovec — into a bounce buffer before issuing the read or write. But if it's a new enough version, it can just pass that data right to librbd.

So in this case you have librbd calls for AIO write-vector, read-vector, flush, discard, and so on and so forth.
So, at a high level, for somebody that's trying to use RBD, that's how they translate their I/O requests into librbd.

As for those APIs: we provide them in both C/C++ and Python bindings. The Python bindings are actually just a thin wrapper around the actual C API, so it's not a native Python binding.
When it comes to the C API bindings — that's what QEMU uses, because that's what QEMU is built on — if I search for one of these calls, we have both synchronous blocking and asynchronous variants.

The handle is something you get when you invoke the rbd_open call: you give it a RADOS I/O context — which basically is a connection to the cluster plus a reference to a specific pool in the cluster, let's say the rbd pool — you give it the image name, and then it'll do all its magic on the inside and populate this rbd_image_t handle, such that all the remainder of the image-specific API calls are good to go. On the C++ side it's similar, except everything is kind of wrapped in objects.
So in this case we have an Image object that has various methods on it — you know, reading and writing — and the AIO equivalents of all those methods as well.

But librbd specifically is located in ceph/src/librbd, and it's pretty much self-contained from there. We try to break down all the various internal sub-components of what librbd can do into different classes and namespaces — namespaces get their own subdirectory underneath the main librbd subdirectory — but in terms of where the API methods first hit, it's in this file called librbd.cc.
So this is just a giant translation file that keeps the ABI stable for the API, so that no matter what version of the librbd shared library you're running — dynamically linked — against, hopefully (that's the goal) it doesn't matter what you compiled against. This librbd.cc file is what's responsible for maintaining that API's binary interface. And what it will eventually do — if I go and find one of the API methods, like read, like image read...
In the C world, we're given just straight-up buffers — pointers to locations in memory — and a length, so we have to kind of translate around some of those things. Internally we natively use the C++ bufferlists, so we wrap that buffer and length in memory with little helper functions, to represent where we're going to store the read result. In the case of a write, we actually have to copy: we have to copy the memory from the C buffer into our internal bufferlist implementation, which is what you see actually happening here — it's taking the raw-pointer buffer you're providing and creating a bufferlist that points to that data for use internally.
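The asymmetry described here — a read result can target a view of the caller's memory, while a write must be copied into storage the library owns before the asynchronous operation completes — can be illustrated with a small Python analogy (memoryview standing in for a zero-copy bufferlist; this is not librbd's actual implementation):

```python
def wrap_for_read(caller_buffer):
    # Reads: no copy needed -- build a view that points at the caller's
    # memory, so the read result lands straight in their buffer.
    return memoryview(caller_buffer)

def copy_for_write(caller_buffer):
    # Writes: snapshot the caller's bytes into library-owned storage, so
    # the data stays valid even if the caller reuses their buffer before
    # the asynchronous write completes.
    return bytes(caller_buffer)

buf = bytearray(b"payload")
view = wrap_for_read(buf)
owned = copy_for_write(buf)
buf[0:3] = b"XXX"  # caller reuses the buffer...
# ...the read view sees the change; the owned write copy does not.
```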
So this is just the API translation layer, but the next layer it gets into is where we actually start implementing how these I/Os are going to get broken down and sent where they need to go. The historical trend of librbd is that all the internal, non-ABI-locked functions used to be in this file called internal.cc, which grew to a giant file — you know, thousands and thousands and thousands of lines — which was hard to maintain.
...which is where we then handle all the internals of what we need to do. The first set of I/O methods up here are all the synchronous versions — blocking reads, blocking writes, blocking discard calls — but what we kind of do behind the scenes is just translate all those synchronous calls into asynchronous calls, and then wait for the asynchronous completion to complete before we move on. So there's really not much to any of these.
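The sync-wraps-async pattern described here can be sketched like this — a toy model of the idea, not librbd's actual classes:

```python
import threading

class Completion:
    """Minimal stand-in for an AIO completion object."""
    def __init__(self):
        self._done = threading.Event()
        self.result = None

    def complete(self, result):
        self.result = result
        self._done.set()

    def wait(self):
        self._done.wait()
        return self.result

def aio_read(offset, length, completion):
    # Pretend the backend finishes the I/O on another thread.
    threading.Thread(target=lambda: completion.complete(b"\x00" * length)).start()

def read(offset, length):
    # The synchronous call is just the asynchronous call plus a wait.
    c = Completion()
    aio_read(offset, length, c)
    return c.wait()
```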
All of those start down here with the aio_ prefix. So, yeah, when we get a request in, it comes with a completion callback: this AioCompletion object holds enough state information that, when the I/O has completed, we have a pointer back to a function the user provided so that we can call it back — or, if the user didn't provide a callback function, because they're going to do something like polling, we can at least mark the completion as complete.
So when the user that's using the librbd API next checks the AioCompletion object, it's marked as complete and they know it's ready for use. The calls like aio read and aio write have an offset and a length. This offset is in image space — think about it like an LBA address or something like that on a hard disk.

It's representing the absolute position of the byte within the RBD image. So this is before we start translating it down into the internal object-based, object-extent-style I/Os that actually get sent to the Ceph cluster; at the API-level layer, at this point, everything is image-extent-based I/O.
We have some boilerplate code that's been in here — I don't think it's actually ever been used besides by some testing groups — where there was this goal to be able to trace all I/Os flowing through the system, from the front end of the librbd API all the way to the OSDs, to the replication across the OSDs and back, so you could get a complete picture of how a single I/O moved through the system.

So I don't think anyone ever turns this on, but that's some boilerplate code for that legacy tracing functionality. First step: we come in and kind of initialize the completion — we basically say, hey, we started the I/O at this time — and that's how we can start tracking latency statistics later on, once the I/O completes.
There's an alternative: as I mentioned, you can provide the AioCompletion a callback function, or you can do a polling loop. This is for if you have marked your image saying you're going to use an event socket to get notified on, instead of having a callback function.
That just represents that I/O request, and this is what helps keep track of our I/O state as we move through the different dispatching layers — I'll get into that shortly. You can just think of this ImageDispatchSpec as having just enough data to describe a given I/O — in this case a read I/O — but if you go through it, it's the same pattern over and over again: here's an ImageDispatchSpec that describes a write, or a discard, or a write-same operation.
It handles cases where the discard alignments don't properly line up with Ceph's expectations and alignments.
A compare-and-write operation just says: you give it a buffer to say, "I expect the data to look like this on the disk"; if it does, then you can overwrite it with the data I give you; if it doesn't, throw an error and give me the offset at which the data mismatches. And then there's a flush operation, which just ensures that any writes you had issued have actually been persistently written to the backing OSDs before completing the flush.
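The compare-and-write semantics just described can be modeled in a few lines — a toy in-memory version, purely to illustrate the contract (compare first, write only on match, report the first mismatch offset otherwise):

```python
def compare_and_write(disk, offset, expected, data):
    """Toy model of compare-and-write: if the bytes at `offset` match
    `expected`, overwrite them with `data` and return None; otherwise
    return the absolute offset of the first mismatching byte."""
    current = bytes(disk[offset:offset + len(expected)])
    for i, (have, want) in enumerate(zip(current, expected)):
        if have != want:
            return offset + i  # report where the mismatch starts
    disk[offset:offset + len(data)] = data
    return None
```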
So now we've taken an I/O from the front API layer and translated it to the internal API layer, which now starts the process of pushing the data through our internal I/O dispatching engine.
So the first thing we saw was that basically every single I/O in the system gets translated into this ImageDispatchSpec, which just describes the I/O: it has enough data to store whether it's going to be a read, a discard, a write, a write-same, or a compare-and-write, and it stores that in effectively a giant union, so that the data structure only occupies the size of the largest substructure.
...what the offsets and lengths are — and you can have more than one — internal tracing information, and so on and so forth. The other piece of data it stores is the image dispatch layer. As we'll talk about next: when you send and issue this request, the way current librbd works is that, instead of hard-coding a bunch of if-then-else statements throughout...
...throughout the code for handling how we interact with different plugins, we now have a way where we can dynamically and programmatically take an I/O and iterate it through different hooks, letting each hook manipulate the I/O as it wants to. And those hooks are defined here as an enum — I'm in the io subdirectory, the io namespace, and there's the types header — and here's just an enum that describes the various dispatch layers we have for manipulating and doing things with incoming image-based I/O.
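The layered-dispatch idea can be sketched as an ordered enum plus a loop that offers the I/O to each hook in turn. The layer names below are paraphrased from the talk, not copied from Ceph's actual types header, and the dispatch loop is a simplification:

```python
from enum import IntEnum

class ImageDispatchLayer(IntEnum):
    # Ordered hooks an image I/O passes through (paraphrased from the talk).
    QUEUE = 0
    QOS = 1
    EXCLUSIVE_LOCK = 2
    REFRESH = 3
    MIGRATION = 4
    JOURNAL = 5
    WRITE_BLOCK = 6
    WRITEBACK_CACHE = 7
    CORE = 8

def dispatch(io, hooks):
    """Walk the I/O through each registered layer in order. A hook may
    claim the I/O by returning True; return the layer that claimed it."""
    for layer in sorted(hooks):
        if hooks[layer](io):
            return layer  # handled at this layer
    return None
```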
So the first real layer we have is this queuing layer. All it does is take your I/O, put it on a work queue, and then basically return control back to your calling application. Because, when you think about it, if I'm QEMU calling into my API function, it's not until the I/O hits this queuing layer that control returns back from — you know — let's say that aio write call from QEMU. So all that layer does is throw the I/O on a work queue for another thread — a librbd thread — to pick up.
We have a quality-of-service layer that handles throttling. So any time you define throttling parameters — say, a given image is only allowed to utilize 100 IOPS — that's handled by the QoS layer. The exclusive-lock layer is just a hook that, any time it sees a write operation come in, makes sure that the exclusive lock is actually acquired for the image; and if it's not acquired for the image, it attempts to acquire the exclusive lock.
...while QEMU is running. So I could have an rbd CLI process on node A manipulate an image, but that image is currently being used on node B by a QEMU process. We need some way to instruct that RBD client on node B: hey, some data has changed about the image; you have to go refresh the metadata you know about — you know, a new snapshot, or that the image has been expanded, and so forth.
So the refresh layer is responsible for detecting those changes and then issuing an asynchronous call to go refresh the image in the background, and it pauses the I/O while that refresh is occurring. The next layer is a new layer for the Pacific release: the image migration layer. This is in support of instant image restoration from a read-only source. So I could have an RBD image...
...that's on an S3 endpoint, and I could use the rbd CLI to define a new image and say: hey, the parent of this image is actually on this remote HTTP/S3-protocol endpoint; whenever this clone effectively doesn't have the data, go get the data from that external endpoint. So that migration layer is what actually takes care of translating I/Os into whatever format the data might be stored in on that endpoint — because, realistically, it's probably not going to be stored in the native Ceph format.
The journaling layer — this is for the RBD journaling feature. There's a write-blocking layer that's just used for internal metadata: whenever we, let's say, create a snapshot or something like that, we need to basically pause I/O while that snapshot is being created; we can use that write-block layer to basically block all writes from occurring while we're doing this internal bookkeeping, and then we can resume all writes. The writeback-cache layer — this is related to Intel's work on a persistent writeback cache.
That would be a local cache that writes to an SSD or Optane device on your local node; the goal is to hopefully reduce tail latencies on writes. And then, finally, the next layer is the core layer: if the I/O gets all the way to the core layer, that's actually what will send the I/Os on the next step, to get them ready to go to the Ceph OSD cluster.
So, you know, here's the queue image dispatch, here's the quality-of-service image dispatch layer, here's the write blocker — they're all pretty easy to find and, hopefully, well named. But the one I want to focus on next is the core one, the core layer, and that's what starts the process of doing that striping operation — to determine: hey, I'm talking about the image extent from four megabytes through eight megabytes; that means I'm actually going to send this I/O to object one in the backing cluster, using object extent zero bytes through four megabytes, because that's just the way the translation works, if you go look back at that cheat sheet on the data layout. So the core layer is in this...
...this image dispatch — this is the core layer. All it really does is act as a proxy/translation layer between this newer-style pluggable I/O-handling engine and the original legacy I/O state machine, which is this ImageRequest class. So the dispatch layer will invoke the appropriate method — read, write, discard, write-same, compare-and-write, flush, you name it — but here in the image dispatch class it's going to translate it to the associated state-machine class in ImageRequest.
...is that they're just factory methods: what happens, at the end of the day, is that each of those methods instantiates a very particular kind of object — be it an ImageReadRequest, an ImageWriteRequest, an ImageDiscardRequest — and then just invokes the send method on those objects to actually kick off the state machine.
A
It's
unfortunate
and
I'm
apologize
in
advance,
but
we
try
to
do
a
good
job,
at
least
on
all
our
modern
state
machines
that
we
actually
have
a
little
ascii,
drawing
to
describe
the
straight
the
state
transitions
and
between
the
between
the
state
machines.
Unfortunately,
because
this
is
so
old,
it's
one
of
the
one
of
the
original
functions.
It's
just
been
tweaked
and
tweaked
and
tweaked
as
the
years
has
gone
on
yeah
it
doesn't
have
the
the
ascii
drawing.
A
That
makes
it
a
little
more
clear
as
to
how
the
state
machine
transitions
between
the
different
functions,
but
we
can
we
can
just
dive
into
it.
So
if
we
look
at
the
image
read
request
state
machine,
you
first
saw
actually
that
everything
invoked
a
send
method.
It
reads
so
it
creates
instantiates
the
object
and
calls
the
send
method.
That
said,
method
is
the
same
virtual
method
on
every
single
class,
which
is
in
the
base
class,
which
is
the
image
request.
So this is how those timestamps — the modify timestamp and the access timestamp — are updated: there's a little side state machine that, based on your image properties, says "I'm going to update my modified time or my access timestamp every 20 seconds," or whatever your settings are — because it doesn't update it with every I/O.

It just does it on a periodic interval as your I/O comes in. So if that period of time has come up, it'll kick off the state machine to basically update the timestamp, which is just sending off an I/O request to that rbd_header.<unique image id> object to track the access or modified timestamp. But that doesn't really affect your I/O flow; that's just something that gets kicked off and runs concurrently with your I/O.
For the read, the first thing it does is say: well, if you're using the internal RBD cache, and you haven't said that you're doing random I/O, and you have read-ahead enabled, it just optionally kicks off another state machine that says: hey, analyze this I/O pattern and see if someone's attempting to do a bunch of sequential I/O; if they are, in the background — asynchronously to whatever this user is requesting — start reading ahead.
So that's what this mapping does: it's this little helper striper function that does all the calculations to say, given these image extents, map them into one or more object extents. Because if we go back to the picture of how striping might look: striping might get complicated, and it might be that my image extent has to go across multiple objects, multiple times potentially, and include multiple extents of multiple objects as it loops through.
I might have to read stripe units zero, one, two, three, four, five — which means I'm reading two sections from object zero, two sections from object one, one section from object two, and one section from object three — and when all that reading completes, I need to be able to reassemble all that data appropriately, so it goes back in the correct order the user actually expects it in. Because the user shouldn't have to know about the internal details of how RBD is striping its images; that's by design.
A
That's
all
this
function
does
right
here
this
file
to
extends
it.
It
maps
image
extents
file
to
object
extents
in
this
case
extents,
that's
the
naming
is
kind
of
legacy
and
leftover
from
ceffs
and
the
original
cfs
client.
So
that's
why
the
help
the
method
name
is
called
file
to
extends
and
not
like
image
extents
to
object.
Extents.
You
know
because
it's
it's
it's
from
a
point
of
view
of
seth,
but
it's
the
same
math
for
both
and
we
also
keep
track
of
this
buffer
offsets.
A
Is
it
also
as
it's
as
it's
doing
these
mappings?
It's
also
keeping
track
of
when
it
has
to
reassemble
this
data?
All
the
stripe
data
needs
to
know
where
to
go
put
that
data
back
into
the
the
buffer
that
the
user
has
provided,
so
that
the
data
is
in
the
right
order.
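The striping walk described above can be sketched in miniature. This is a simplified model of the file_to_extents idea — it walks stripe-unit by stripe-unit and does not merge adjacent pieces or track buffer offsets the way the real Striper code does:

```python
def file_to_extents(offset, length, su, sc, os_):
    """Map an image extent to (object_number, object_offset, length)
    tuples for stripe unit `su`, stripe count `sc`, object size `os_`.
    Simplified sketch: one tuple per stripe unit touched, unmerged."""
    units_per_object = os_ // su
    extents = []
    end = offset + length
    while offset < end:
        unit = offset // su                       # which stripe unit overall
        stripe = unit // sc                       # which row across the objects
        object_set = stripe // units_per_object   # which group of sc objects
        objectno = object_set * sc + unit % sc
        obj_off = (stripe % units_per_object) * su + offset % su
        n = min(su - offset % su, end - offset)   # stay inside this unit
        extents.append((objectno, obj_off, n))
        offset += n
    return extents
```

With the default layout (stripe unit == object size, stripe count 1) it degenerates to the simple divide; with a stripe count of 2 and two units per object, the "units zero through five" example from the talk yields two pieces from object 0, two from object 1, and one each from objects 2 and 3.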
A
So
so
now
we
have
a
bunch
of
this.
This
file
two
extensions
put
populating
this
object:
extense
collection.
A
We
give
the
the
a
completion,
we
had
stored,
the
read
results
in
it
and
we
also
basically
tell
the
read:
result:
hey
when
you
have
to
reassemble
it.
Here's
the
original
image
extension
of
how
everything
gets
reassembled,
there's
a
couple:
there's
a
couple
different
permutations
that
actually
need
those
image
extents.
Some
of
them
don't
need
it
at
all,
so
they.
This
is
a
no
op
function.
We
can
dig
into
that
later,
but
we
actually
issue
the
request.
A
So
this
ao
completion
we
tell
it
hey,
you're
gonna,
expect
you
know,
object,
extent,
count
number
of
requests
that
are
actually
going
to
be
issued
concurrently.
So
you
have
to
wait
for
this.
Many
internal
requests
to
complete
before
the
actual
like
user
side,
ao
completion
is
actually
complete
and
then
it
iterates
through
all
the
object
extents
and
it
issues
it's
a
very
similar
pattern
here.
So
now,
instead
of
an
image
dispatch
back,
it's
actually
going
to
issue
these
object,
dispatch,
specs
and
it's
very
similar
it.
You
know
it
describes
it
as
a
read
describes.
A
It
doesn't
write,
describes
it
as
a
discard,
depending
on
whatever
state
machine
you're
in
so
this
is
the
process
now
to
start
kicking
off
a
read.
If
I
we
go
to
this
right
side,
most
of
it's
the
same
so
there's
between
the
different
right
methods,
you
know
write
a
discard,
a
write
same
and
a
compare
and
write.
So
most
of
the
methods
actually
get
kicked
off
from
this
generic
send
request
method,
but
it's
very
similar.
A
It
does
the
same.
Computations
of
converting
image
extends
to
object,
extents.
It
does
some
extra
work
about
pruning,
which
is
basically
saying
if,
if
the
the
right
goes
beyond
the
object
boundaries
and
things
like
that,
then
we
we'll
truncate
those
ios
so
that
we're
not
going
past
the
end
of
the
image.
A
But
then
the
same
thing.
We
set
the
number
of
requests
that
we
expect
and
then
the
one
place
we
start
differing
here
is
if
the
journaling
mode
is
enabled-
and
this
is-
this-
is
still
legacy
code,
because
the
journaling
hasn't
been
broken
out
into
its
own
dispatch
layer.
Yet,
but
when
it
does,
this
will
all
be
the
the
read
and
write
methods
will
basically
look
exactly
the
same,
but
right
now
we
have
this
special
hook.
...if you had the writeback cache enabled, your I/O would appear to complete faster, but it's really just being held in memory while the journal event is being appended; and then, once the journal event is securely written to disk, that writeback is allowed to proceed. But the send-object-request step — if we go look at this, there's going to be a different send-object-request for each...
A
Thing
where
it
iterates
over
the
extents
and
then
it
calls
up
your
virtual,
create
object,
request
which
creates
an
object
request
and
then
it
sends
it
so
the
difference
the
different
methods
create
different
objects.
So
here's
a
here's,
a
write,
request,
creating
object
quest,
so
it
just
creates
the
image
dispatch,
object,
dispatch
back,
create
write,
you
know,
you'll
see.
The
same
thing
for
here
is
a
discard
request.
All
it's
doing
is
creating
a
discard
object,
dispatch
back
right,
same
same
thing,
compare
and
right
same
thing.
Any questions? The next section I'm going to talk about is object I/O dispatching — but any questions on image-extent dispatching, on how the I/O has now gotten broken up into object I/O extents? These are the I/O ranges within individual objects: it's going to identify the I/O, saying "this belongs to object zero, and I want to read the extent of that object from byte zero through four megabytes."
A
All
right,
well,
the
object,
dispatch
spec
is
actually
very
similar
in
design
and
function
to
the
image
dispatch
back.
It
just
has
to
store
a
little
bit
different
data,
specifically
the
most
important
piece
of
data
it
needs
to
store,
is
the
object
number
in
which
that
I
o
is
going
to
get
issued
against,
but
on
that
it's
very
similar.
It's
it's
offsets
and
lengths
and
various
other
properties.
A
So
it's
same
same
concept
where
we
store
it
in
a
giant
essentially
union,
all
the
possible
different
io
types
restore
the
current
layer
in
which
this
particular
I
o
is
currently
processing
on.
We store our callback so that we know how to basically complete the I/O, which then hooks back into the AIO completion to finish it off. If we go look at the types again: the object dispatch layer types have a caching layer — this is, you know, the legacy librbd in-memory cache; it works at the object level.
This is different from the Intel persistent write-log writeback cache; this is an in-memory-only cache — writeback, write-around, or write-through. We have a new crypto layer coming in with Pacific, where we can internally have, let's say, LUKS encryption on an RBD device, and librbd handles all the encryption internally; so this crypto layer is actually going to handle block alignments and encrypting and decrypting I/Os as they come in.
We have a journal layer that's responsible for blocking I/Os that haven't committed to the journal yet, based on the journal TID. We have a parent cache — this is something that came in with Octopus. If I have a cloned image, all the data of the parent is read-only; so the parent cache talks, via a domain socket, to this daemon called the ceph immutable object cache daemon, which is responsible for basically promoting hot read-only objects to a fast local cache on your device, so it can serve reads locally instead of redirecting the reads back to the Ceph cluster. If you have a lot of golden images or something like that in RBD, in theory that's what the parent cache could help you with. There's also work, I think, being done to incorporate it into RGW for immutable objects in RGW.
A
Next
layer
is
a
scheduler
layer.
All
this
does
is
it
tries
to
determine
if
you
have
a
bunch
of
sequential
ios.
So
this
is
just
a
very
dumb.
I
o
scheduler.
That
says
it
looks
like
you're
doing
a
bunch
of
sequential
ios
to
a
given
object,
so
it
it'll
try
to
collapse
all
those
sequential
ios
into
a
single.
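The batching idea behind that scheduler can be sketched as merging adjacent extents — a toy illustration of the concept, not the actual scheduler code:

```python
def batch_sequential(extents):
    """Collapse adjacent (offset, length) extents aimed at one object
    into single larger extents -- the idea behind a 'simple' scheduler."""
    merged = []
    for off, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == off:
            # This extent starts exactly where the previous one ends: extend it.
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return merged
```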
Similar to how it's hopefully easy to find the image dispatch layers, hopefully it's also easy to find the object dispatch layers: here's the scheduler dispatch layer. The initial implementation we just called "simple," because it really doesn't try to do anything fancy — it's just trying to detect sequential I/Os and batch them up. So that's the simple scheduler, and then the core layer comes down to just the object request.
A
It
looks
very
similar
to
the
image
dispatch.
All
it
does.
It's
kind
of
a
translation
layer
between
the
old
world
and
the
new
world,
the
old
world.
Being
this
object,
read
request
or
discard
request
to
write,
requests,
state
machines
and
the
new
world
being
this
plugable
dispatch
layer.
So
it
just
translates
between
this
api
and
the
original
legacy
library
api.
So, as the different methods are invoked by the dispatcher — if it's an ObjectDispatchSpec that's describing a read, it's going to invoke the read method; if it's a discard, it'll invoke the discard; if it's a write, it'll invoke the write method — they all get broken down into the appropriate state machines, which are all described here. These are the core state machines that actually perform the I/Os against the cluster, and these ones actually do start to have some ASCII drawings in them.
...the write state machine — but if we go look at, let's say, the first one, the read request: it gets entered with send. So the first state is that it's going to go and just issue an object read request to the cluster.

First, it determines what you're trying to read — whether you're trying to read the head of the image or a given snapshot of the image. If you're trying to read the head of the image — that is, you're not trying to read from a snapshot —
A
That
means
that
we
can
then
go
check
the
object,
map
and
things
like
that,
because
the
object
map
will
be
in
memory.
So
we
can
go
check
to
see
you
know,
hey.
Does
the
object
map
say
if
the
app,
if
we
know
the
object,
may
or
may
not
exist?
If
we,
if
the
object
map,
says,
there's
no
possible
way
for
that
object
to
exist,
we
can
actually
just
skip
ahead
and
we
don't
have
to
issue
that
initial
read
request
to
the
to
the
osds.
We can issue either a sparse read to the OSD or a regular read to the OSD, and we just have a little hint here that says: based on this configuration setting — which, I think, defaults to 64 kilobytes in the config — if your read request is greater than 64 kilobytes, it tells the OSD to try to do a sparse read instead. Because what a sparse read does is not return blank sections of the object: if there's no data there, it'll return no data. It'll say, hey, the data I gave you back really only represents extents from point A to point B and from point C to point D; there's a gap somewhere in there, and it's up to you to do the math to figure it out. But it doesn't inject a bunch of zeros where there weren't any zeros before; it keeps the data thin. And then, yeah, we just execute the method using the librados API.
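The "do the math yourself" part of a sparse read — rebuilding a flat buffer from an extent map, filling the gaps with zeros on the client side — looks roughly like this (an illustrative sketch, not the librbd code):

```python
def assemble_sparse_read(extent_map, total_length):
    """Rebuild a flat buffer from a sparse-read reply: `extent_map` is a
    list of (offset, data) pairs for the regions that actually held
    data; the gaps between them are zero-filled by the client."""
    out = bytearray(total_length)  # implicit zeros for the gaps
    for offset, data in extent_map:
        out[offset:offset + len(data)] = data
    return bytes(out)
```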
This is an asynchronous, non-blocking method, so this goes away and, eventually, once librados comes back with an answer, it's going to invoke its completion callback, which is this handle-read-object. If we get told by the OSD that the object doesn't exist, we may have to go read from the parent, if it's a cloned object; otherwise, if it truly is an error, we'll bubble that error up to the user.
A
Now, if we're good to go, then this request is done, and what this finish does is it goes and tells the original AIO completion that this particular request is complete; and the AIO completion, which is tracking all in-flight requests, once it gets down to zero in-flight requests, will finish itself off.
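The completion bookkeeping just described can be sketched like this. This is a single-threaded toy model (the real AioCompletion uses atomics and locking); the names are illustrative:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Each in-flight sub-request bumps a pending count; the user callback
// fires only when the last sub-request completes. The first error seen
// becomes the overall result.
class AioCompletion {
 public:
  explicit AioCompletion(std::function<void(int)> on_finish)
      : on_finish_(std::move(on_finish)) {}

  void add_request() { ++pending_; }

  void complete_request(int r) {
    if (r < 0 && result_ == 0) {
      result_ = r;  // remember the first error
    }
    if (--pending_ == 0) {
      on_finish_(result_);  // zero in-flight requests: finish ourselves off
    }
  }

 private:
  std::function<void(int)> on_finish_;
  int pending_ = 0;
  int result_ = 0;
};
```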
A
But say we need to read from the parent. Minus this little special flag that you can pass to the state machine that says, hey, do not attempt to read from the parent, in 99.99% of cases it is going to read from the parent. In which case we just invoke this little helper method here, which we can go look at in utils.
B
A
Apparently there it is, all right; this one is missing. All this is is just a helper method, because we had a few places in the code that were doing the exact same logic over and over again, so it got broken out into a helper method. When you want to read from your parent image, you already had an I/O that's coming from a given object; let's say object A.
B
A
That's what this method does: it's just the reverse of mapping an image extent to an object extent; this maps an object extent back into an image extent. And then, once we have these parent image extents, we determine if we have an overlap with the parent, because the parent might be smaller than the current child (because you expanded the child), and then we potentially prune those image requests based on that overlap, assuming we actually have data to read because there actually is an overlap with the parent.
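The overlap pruning can be sketched as follows. This is an illustrative function, not the librbd helper: since the child may have been expanded past the parent's size, any extent (or tail of one) that lands beyond the parent overlap is clipped off before reads are issued to the parent:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Clip {offset, length} image extents to the parent overlap. Extents
// entirely past the overlap are dropped; extents straddling it are
// truncated so only bytes the parent actually covers get read.
std::vector<std::pair<uint64_t, uint64_t>> prune_to_overlap(
    std::vector<std::pair<uint64_t, uint64_t>> extents,
    uint64_t parent_overlap) {
  std::vector<std::pair<uint64_t, uint64_t>> out;
  for (auto& [off, len] : extents) {
    if (off >= parent_overlap) {
      continue;  // entirely past the parent: nothing to read
    }
    out.emplace_back(off, std::min(len, parent_overlap - off));
  }
  return out;
}
```

With an 8-byte overlap, an extent at offset 6 of length 4 is truncated to length 2, and one starting at offset 10 is dropped entirely.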
A
For that given read request, then, really all we do again is just issue another read: we start a brand new AIO completion for tracking the read and we kick off a new read request, but instead of issuing the read to ourselves (this image context represents us), this image context's parent is actually another image context holding all the state about the parent. So it just issues the read directly to the parent image.
A
Yeah, going back to when you've read the parent: same thing, it checks for errors. If the parent didn't exist, that means there's no data to read; if there's an error, we bubble the error up. And then, optionally, there's this field, which I don't think anyone ever uses, but you can enable copy-on-read, or copy-up on read.
A
And all this is is just boilerplate for handling that case of kicking off that asynchronous request: it kicks it off, forgets about it, moves on, and finishes the read request. But on to writes, just noticing that we only have a little time left.
A
I just want to point out that it does very similar things. It checks to see if the object may exist and if there are any optimizations it can do; it potentially updates the object map. Because, again, this is legacy code from before we broke everything out into layers; in the future I would have said the object map would have been its own layer. And then, assuming everything's good to go, it kicks off the actual write request. And again, if there was a copy-up, it can handle that.
A
If it's a child, or a clone, it can put an assertion on it to say: hey, you're not allowed to write this unless I know this object already exists on the child image. Otherwise, it's going to write: give it a write hint, add some write ops, and execute the RADOS API call. And then, once it's done, it gets invoked back down into the handler, in which case, if the object didn't exist, really the only way to get that error would be that existence assertion.
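The guarded clone write can be sketched like this. The op names below are only stand-ins for the RADOS write ops being described: for a clone whose object may not exist yet, the write is prefixed with an existence assertion, so the OSD fails fast instead of writing to an object that should have been copied up from the parent first:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative op descriptor; real code builds a librados ObjectWriteOperation.
struct Op {
  std::string name;
};

// Build the op list for a write: clones with an object of unknown
// existence get an assert_exists guard ahead of the hint and write.
std::vector<Op> build_write_ops(bool is_clone_with_unknown_object) {
  std::vector<Op> ops;
  if (is_clone_with_unknown_object) {
    ops.push_back({"assert_exists"});  // -ENOENT here triggers the copy-up path
  }
  ops.push_back({"set_alloc_hint"});   // the "write hint"
  ops.push_back({"write"});
  return ops;
}
```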
A
A
Otherwise, the write should be complete. The "illegal sequence" error is related to compare-and-write; that's an expected error from compare-and-write if you had bad data. It completes, and then it potentially does a post-update of the object map, which only happens with a discard, because you might remove an object: the first state update would be to mark it as remove-pending, and then finally you go and mark it as non-existent. And, I'm sorry, we're running out of time.
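The two-phase object-map update around a discard can be sketched like this; the state names are illustrative, not librbd's exact identifiers. The object is marked pending before the remove is issued, and only marked non-existent after the OSD confirms it is gone:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum State : uint8_t { NONEXISTENT = 0, EXISTS = 1, PENDING = 2 };

// Phase 1: before issuing the discard, record that removal is in flight.
void pre_discard_update(std::vector<uint8_t>& object_map, uint64_t objno) {
  if (object_map[objno] == EXISTS) {
    object_map[objno] = PENDING;
  }
}

// Phase 2: after the OSD confirms the remove, the object is truly gone.
void post_discard_update(std::vector<uint8_t>& object_map, uint64_t objno) {
  if (object_map[objno] == PENDING) {
    object_map[objno] = NONEXISTENT;
  }
}
```

If a crash happens between the two phases, the pending state records that the object's existence is uncertain and must be rechecked, rather than wrongly claiming it is gone.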
A
Maybe we'll schedule another talk about this to dive more into it. But with a couple minutes left, are there any other comments?
B
A
Definitely, yeah, we have quite the extensive unit test library; we definitely have way more code in unit tests than we have in actual library code, which is great. It gives us, hopefully, good confidence that the code we're putting out there is going to continue to function, and we don't just have to rely on high-level integration tests; we can actually get down into all these classes and test them.
A
Well, I'm sorry that we ran out of time. I was hoping to get a little bit further, but, like I said, we can definitely schedule another one of these to dive in again. I guess thank you for joining, and you can reach me on the mailing list.