From YouTube: ZIO Pipeline by George Wilson
From the OpenZFS Developer Summit 2018
Slides: https://docs.google.com/presentation/d/1ohdmjsp9mejuSRKwDeU83o9297KaiHrqC-tV__kjO6E/edit?usp=sharing
Our next speaker is going to be George Wilson. Most of the people here, I think, know George Wilson: he's one of the earlier OpenZFS and ZFS developers, and he's done a lot of work on performance, on allocation, and on various aspects of ZFS, and he's probably had to save a lot of pools for you guys. So please, everyone, welcome George Wilson, who is going to talk about the ZIO pipeline.
With regards to this talk: I always find it interesting when I have to go and think about a subsystem that I haven't worked on in a really long time, and then you start realizing, gosh, I don't remember this. Why does it work this way? So a lot of this talk is going to be a little bit like that.
So, a quick shout-out to one of my fellow colleagues and partners in crime when it comes to ZFS. There was this tweet back in June that I saw, and I thought: that's kind of interesting; maybe that's a reason to have a talk about this at the OpenZFS summit. So that was the inspiration.
Initially, this was going to be the presentation; I was just going to stop here. Yes, exactly. As Brian mentioned, this is actually in the source base, and he was just letting us all know: hey, this exists, maybe we should do something about it. Hopefully this talk inspires somebody to go do something about it.
So, first of all, let me set the stage of where the ZIO pipeline lives and some things about it. It's kind of fascinating: for as small a number of lines of code as the I/O pipeline has, it's actually extremely dense code, and it does a lot of things that you might not recognize actually happen in that layer.
As you can see in this diagram, the I/O pipeline sits at the lower area of the stack. It's in an area that we refer to as the SPA, it's comprised of a couple of other components, and it works very closely with the vdev layer. It really is the framework for all I/O that is driven through ZFS.
As you're starting to do things, whether it's reads, writes, or even ioctls, they're coming through the ZIO pipeline. It's in this layer where we're actually going to do our translations from data virtual addresses (DVAs) to actual physical locations on disk. It doesn't handle that completely by itself; it works very closely with the vdev layer to do that, but it's instrumental in getting that portion done.
It's also here where we do any transformations, whether that's checksumming, dedup, or compression. A lot of the things that you're used to seeing, like `zfs set compression=lz4`, get set in an upper layer and passed down to the zio, and it's in this layer where that actually happens. And it's kind of neat, because the way the checksum and compression code works is somewhat pluggable, so we can extend it.
We've probably all seen, over the course of the years, that we've extended it and added new checksum algorithms and new compression algorithms; it all happens in here. And there's a bunch of other stuff: quite a lot happens at this layer, whether it's allocation throttling, selecting how we're going to do ditto blocks and which devices have to take I/O for them, or whether or not we do gang blocks. We'll talk a little bit more about those in a few minutes.
Think of a CPU pipeline staying a hundred percent busy by having four different instructions working on different stages. If you think about I/O, we can do a similar thing. If we break it up, we have fetching some data, where we actually get, from the user or the caller, whatever data buffer they're sending us; then we decode it.
So we know whether we're doing a read or a write, we initiate the I/O to the underlying device, and then we get some kind of response. In a sense, we can break this up into pipeline stages, and with ZFS we've done exactly that. They're more intricate than just a four-stage pipeline, and they differ based on the types that exist. So here are some of the basic pipelines that exist in the zio subsystem.
We have two different types for physical I/Os, for doing physical reads and physical writes. We also have logical reads and logical writes. And then we have a bunch of more specialized pipelines: for frees; for claims, which is a very unique thing that happens when we actually do a pool import; for ioctls, which are primarily used to flush out the write cache; for rewrites, which are primarily only used by the ZIL; and then these other ones, the null and the root pipelines, which we'll talk about.
So let's see what makes up these different pipelines. When we're talking about these, we actually have different types of I/Os that get associated with them. We just saw all these different pipelines that exist, but then there are six different types of I/Os that can utilize those pipelines: reads, writes, frees, claims, ioctls, and the special null one. These different types utilize task queue pools to do the I/Os and move them through the various pipeline stages. I call them task queue pools because they're actually comprised of two different task queues: we have a normal-priority one and a high-priority one, and the number of threads associated with each of these task queues depends on the I/O type.
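To make that shape concrete, here is a minimal C sketch of per-type task queue pools. The type names follow the talk; the thread counts and the two-priority layout are illustrative stand-ins, not the actual OpenZFS tunables.

```c
/* Toy model of per-zio-type taskq pools: one normal- and one
 * high-priority queue per type, with type-dependent thread counts. */
#include <stdio.h>

enum zio_type { ZT_READ, ZT_WRITE, ZT_FREE, ZT_CLAIM, ZT_IOCTL, ZT_NTYPES };
enum tq_prio  { TQ_NORMAL, TQ_HIGH, TQ_NPRIO };

struct taskq { const char *name; int nthreads; };

int main(void) {
    struct taskq pools[ZT_NTYPES][TQ_NPRIO];
    /* Writes get many issue threads; claim gets a single thread. */
    static const int threads[ZT_NTYPES] = { 8, 32, 1, 1, 1 };
    static const char *names[ZT_NTYPES] =
        { "read", "write", "free", "claim", "ioctl" };

    for (int t = 0; t < ZT_NTYPES; t++)
        for (int p = 0; p < TQ_NPRIO; p++) {
            pools[t][p].name = names[t];
            pools[t][p].nthreads = threads[t];
            printf("taskq %s/%s: %d threads\n", names[t],
                p == TQ_HIGH ? "high" : "normal", pools[t][p].nthreads);
        }
    return 0;
}
```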
If we're talking about something like writes, we may actually have a lot of issuing task queue threads, versus something like claim, which may only have one task queue thread. The whole general premise behind the I/O pipeline, and kind of a guiding principle, was that we never burn a thread. We never get to a point where, when you're driving through the pipeline, you actually go into a condition variable and block. Whenever we get to a point where we have to give up control, we transfer control to something else.
So here's the way the pipelines work, and I've broken these up to show the differences between logical reads and physical reads. You can see here that the upper section shows the physical reads. We have these stages indicated in red, the ready and done stages: these are our interlock stages, and I'll talk more about what they mean and why we have them. Then we have these green stages, where we're actually issuing I/Os; these are all the vdev communication.
This is where we're actually going to be talking to the vdev layer when we're driving I/O through the pipeline. And then the blue stages are computational stages that happen throughout the pipeline. So in this case, when we're looking at physical reads, we can see that we simply go through the ready stage, where there really might not be very much to do; we issue the I/O to disk; once it completes, we verify the checksum; and then we call done, which calls the callback and notifies the caller.
The write stages get more complicated. Here, when we're looking at physical writes and logical writes, we can see that there's a common stage that gets inserted, which is issue async. This is where we're actually going to transfer control from the caller's thread to one of these task queue pools; we're going to say: do this work on my behalf. Both write pipelines will do that for you, and they'll also generate the checksum.
The difference here is that we see these orange stages, which exist in the logical case, where we're going to transform data. This is where you're going to do the compression, the encryption, the things that you're used to setting as a property on the pool; they're going to be done in these stages of the pipeline.
It's also worth noting where the checksum generate code happens: you'll notice that the transformation and the checksum generate happen after the issue async stage. That's because we want to make sure that we can fan this work out to as many threads as possible to go compress and encrypt the data, and then, once that's stable and the data has finished transforming, we can actually generate the checksum that we're going to put on disk.
The logical case also introduces these purple stages, which are the allocation code. This is where we're going to first throttle the I/O, to make sure that we're not overloading a bunch of disks that potentially can't handle the workload; that's done by the DVA throttle. And then the DVA allocate code is where we're actually going to go talk to the metaslab subsystem and say: find me a block where you can actually write this. So that's in the write pipeline.
Writes have optional stages for dedup, where we want to go update the DDT (dedup table) as part of the write. But we also have this concept of a nopwrite, where it's possible that we're actually overwriting the data with the exact same content. If we're doing that, let's just abort the pipeline early and get a performance win: don't do the allocation, don't do any I/O.
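Here is a minimal sketch of that nopwrite decision, assuming the simplest possible types: if the checksum of the new data matches the checksum already stored in the existing block pointer, the write can bail out before allocation and I/O. The types and function name are made up; the real check lives in the zio code and additionally requires a strong (cryptographic) checksum to be in use.

```c
/* Toy nopwrite check: identical strong checksums => skip the write. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t word[4]; } cksum_t;   /* 256-bit checksum */

static bool nopwrite_possible(const cksum_t *old_bp_cksum,
    const cksum_t *new_data_cksum, bool checksum_is_strong)
{
    /* Only safe when collisions are cryptographically unlikely. */
    if (!checksum_is_strong)
        return false;
    /* Same checksum => same content: skip DVA allocate and vdev I/O. */
    return memcmp(old_bp_cksum, new_data_cksum, sizeof (cksum_t)) == 0;
}

int main(void) {
    cksum_t a = {{1, 2, 3, 4}}, b = {{1, 2, 3, 4}};
    return nopwrite_possible(&a, &b, true) ? 0 : 1;
}
```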
So that's what you see here with this optional stage. And then, as I mentioned, there are a couple of other special pipelines. There's the null, or root, pipeline, which is the simplest of all pipelines: it just has two stages, the ready and done stages, which we refer to as the interlock. The idea behind these is that they create a container for you.
We know that when it's done, all the children I/Os will have been done as well, and we'll show an example of how that works. And then there's the rewrite pipeline, which looks very similar to write, but the thing that's removed here are the allocation stages. It's kind of odd if you think about the fact that we're copy-on-write: why do we have a rewrite stage, when obviously doing copy-on-write means we go allocate a new block?
So, some other things that happen in the zio subsystem that are probably worth noting: gang blocks. I mentioned some stages where you could actually add those in. If we have a condition where the pool is mostly full or severely fragmented, one option is to fail the I/O, and oftentimes, by the time it gets to the zio layer, there's no way to tell anybody that we're about to fail anything. So that's not really a good idea.
The other option is to create gang blocks. Gang blocks take smaller constituent blocks and logically put them together to make a larger allocation. So assume that you have a 128K I/O that you want to do, but you have severe fragmentation, or you're mostly out of space, when it gets to the I/O pipeline. We would simply try to allocate that 128K block and find that we can't do it.
We can't find a contiguous block anywhere, so we create this concept of a gang header and gang members, and we break that allocation up into smaller chunks, and we'll keep nesting this all the way down to the point where we're allocating simple 512-byte sectors, on the assumption that, even with fragmentation, those allocations will always succeed. This is a huge penalty if you get into a mode where you're having to do this.
Because now, what used to be one I/O to go get your 128K block can turn into an entire tree of I/Os, as we build up and create all these little constituent blocks, stitch them together, and then pass the logical data up to the caller. But it's a necessary concept in order to ensure that we can use all the space that's in the pool, and it prevents us from just throwing up our hands and saying: sorry, I couldn't allocate that, and having to stop all transactions and not let anything go through.
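Here is a toy model of that recursive splitting, just to show how one allocation fans out into a tree. The failure threshold and the three-way fan-out are illustrative assumptions (the real code builds a gang header with up to three constituent block pointers via the metaslab allocator).

```c
/* Toy gang allocation: try the full size; on failure, split into
 * smaller pieces and recurse, bottoming out at a 512-byte sector
 * that is assumed to always succeed. */
#include <stdio.h>
#include <stdlib.h>

#define SECTOR 512

/* Stand-in for the metaslab allocator: pretend anything over 4K fails. */
static int try_alloc(size_t size) { return size <= 4096; }

static int alloc_gang(size_t size, int depth) {
    if (try_alloc(size)) {
        printf("%*sallocated %zu bytes\n", depth * 2, "", size);
        return 1;                 /* one leaf allocation */
    }
    /* No contiguous run: gang it into smaller children. */
    printf("%*sganging %zu bytes\n", depth * 2, "", size);
    int ios = 1;                  /* the gang header itself is extra I/O */
    size_t child = size / 3 < SECTOR ? SECTOR : size / 3;
    for (size_t done = 0; done < size; done += child)
        ios += alloc_gang(child, depth + 1);
    return ios;
}

int main(void) {
    printf("total I/Os: %d\n", alloc_gang(128 * 1024, 0));
    return 0;
}
```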
We also have this concept of child types. When we're going through the pipeline, most often we're dealing with logical children; these are I/Os that were requested by some consumer. So somebody said: go read this block. That read becomes a logical read, and most often in our case that's coming from the ARC. But sometimes, within the pipeline, we have to create new children to go do work on our behalf, and that's what these other types are for.
So when we're doing, say, a logical read, if it's a gang block, for example, it might create some gang children to go read all these little constituent blocks, stitch them together, and then transfer that data to the logical I/O, which will then be returned back to the caller.
Something like a container, like a root I/O, lets us get the error status and the completion status of a much larger set of I/Os. The way our dependency graph works is that we have concepts for notifications, which are these two stages, ready and done. These are places where we can actually get notified when an I/O completes that stage.
At ready, their content is finalized: they've gone through the compression stages, they've done the transformations, they've actually done their block allocation, and maybe they're on their way to disk. So that's where we can have these notifications. We can have an I/O that is currently sitting, waiting to get started, waiting for all its children to make it through those initial stages where they're doing data transformation and allocation; and then, once a child reaches the ready stage, it notifies the parent and says: hey, I'm ready.
A parent can't do its own internal modifications until the underlying child I/O has actually finished and gotten to the ready stage. And the stalls can be either very coarse-grained or fine-grained, depending on the operation that's taking place. In some cases you only want to wait for the vdev children to complete; that might be what you see in the vdev I/O done stage, where we issued a bunch of I/Os and we simply want to wait to make sure that the I/Os have made it to disk before we move forward.
So the only thing we really care about there are the vdev children; we don't care about logical children. This is something that you'll see throughout the code, and it's kind of a subtle thing to note, but it's actually a very powerful thing when you're consuming and driving the I/O pipeline.
So here's a dependency graph that we would see, and it's probably pretty common within ZFS: we create this root I/O, and if you remember, the root pipeline simply has two stages in it. We add these children underneath it. Maybe we add the first row of logical children, and they might be reads or writes, we don't really know; they might actually add some additional logical children underneath them; and maybe at the very bottom we get some vdev children, which are actually doing the physical I/O.
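A minimal sketch of that parent/child interlock, assuming the simplest possible bookkeeping: a parent keeps counters of children that have not yet hit their ready and done stages, and it can only pass its own ready/done stage once the matching counter drains. The field names are made up; the real code tracks this per child type and links parents and children through lists.

```c
/* Toy parent zio with ready/done interlock counters. */
#include <stdio.h>

typedef struct pzio {
    int children_not_ready;
    int children_not_done;
} pzio_t;

static void child_ready(pzio_t *p) {
    if (--p->children_not_ready == 0)
        puts("parent: all children ready, proceeding past READY");
}

static void child_done(pzio_t *p) {
    if (--p->children_not_done == 0)
        puts("parent: all children done, proceeding past DONE");
}

int main(void) {
    pzio_t root = { .children_not_ready = 2, .children_not_done = 2 };
    child_ready(&root); child_ready(&root);   /* both children hit ready */
    child_done(&root);  child_done(&root);    /* both children complete  */
    return 0;
}
```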
So what I'm going to start with is looking at: what is a zio? What is this thing that we've been talking about? How does it actually go through this pipeline? What are some critical things in that structure that allow us to do what we do with it? Here's a small portion of the zio structure (it's actually quite large if you look at it), and what I've sorted out to point out are some things that are interesting about it. For example, we have this concept of a bookmark.
The bookmark is a small structure that actually tells us which filesystem we're going to, what file we're operating on, what level of indirection we're in, and what block ID we're going to be modifying. So we have this context of exactly what I/O we're trying to do embedded in the I/O, from a ZFS object-model point of view.
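The bookmark is small enough to show in full. This mirrors the shape of the bookmark structure in the OpenZFS headers (`zbookmark_phys_t`); the typedef name here is shortened for the slide:

```c
#include <stdint.h>

typedef struct zbookmark {
    uint64_t zb_objset;   /* dataset / filesystem being operated on */
    uint64_t zb_object;   /* object (e.g. a file) within that dataset */
    int64_t  zb_level;    /* indirection level; 0 = data block */
    uint64_t zb_blkid;    /* block ID at that level */
} zbookmark_t;
```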
We also have this properties field. I mentioned that there are all these things we do when you write: you either do a `zfs set` of some property or a `zpool set` of some property. Those get passed down to the zio itself, and they're stored in this io_prop field. So, things like: how many copies do you want of this data? Do we want to create additional copies? What checksumming algorithm do you want to use?
I mentioned the I/O types; these are the different types of I/O: reads, writes, frees, claims, ioctls. The child type is whether it's a vdev child, a logical child, or a gang child. And then we also have these callbacks, where we can actually call back to the user that issued the I/O periodically as it's going through the chain. So we have places where we can call back when you get to the ready stage.
That's one of the notification points that I mentioned. As this I/O gets to the ready stage, it can call back to the consumer and say: hey, just letting you know, I'm in the ready stage; and the consumer might do something as a result of that, or do nothing. Likewise, the done callback is typically the most common one, where we'll actually return the response, whether it's returning the data or some status; that's where the consumer will get most of its information.
Also embedded in here are what stage of the pipeline we're in and what pipeline we're using. And it's worth noting that the pipeline referenced here may have changed: we may have started out doing a full logical write pipeline, and then that's been converted to a nopwrite pipeline. So things can change.
The other thing I was going to mention is that we also maintain the priority for the I/O, and although it's not really used and consumed by the I/O pipeline, it is passed down to the vdev scheduler. There's only one place in the pipeline where we actually look at the priority, and that's when we're dealing with the task queue pools, where we want to determine: do we want to send this to a high-priority task queue or a normal-priority task queue?
Okay, so let's talk about how we create one of these suckers. Here's a very common entry point, zio_read, and we can see that when we call zio_read, we want to do a logical read, so we're going to pass in a block pointer. The picture on the right represents what this block pointer would look like, and it has in there three DVAs, which are these vdev/asize/offset triples.
The thing to note here is this parameter, which is currently NULL; it happens to be the vdev that we're going to do the I/O to, and we pass it in as NULL because we're passing in a block pointer, and the block pointer is going to determine the actual physical devices that I need to talk to. We also pass in the stage where we want to start, so we're saying: start in the open stage when you're doing this I/O, and your pipeline is going to be a read pipeline.
The other thing to note is that this function doesn't actually do any I/O; it just simply returns back this structure. So at this point in time, it seems like we should be going off and doing the read, but we're really not going to do that just yet. We're simply going to create the structure, tell it what its intention is going to be, and then sometime later down the road somebody had better go off and do the actual physical I/O.
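Here is a hedged sketch of the shape of that entry point: fill in a zio with the block pointer, a NULL vdev (the BP's DVAs will name the vdevs later), the open starting stage, and the read pipeline mask, then hand it back without doing any I/O. The field names and the simplified signature are assumptions for illustration; the real zio_read() also takes the data buffer, size, flags, and bookmark.

```c
typedef struct blkptr blkptr_t;           /* opaque for this sketch */
typedef struct vdev vdev_t;
typedef void (*zio_done_func_t)(void *);

typedef struct zio {
    const blkptr_t *io_bp;
    vdev_t *io_vd;            /* NULL: derive vdevs from the BP's DVAs */
    zio_done_func_t io_done;  /* "done" callback for the caller */
    unsigned io_stage;        /* starts at the OPEN stage */
    unsigned io_pipeline;     /* stage mask for this zio */
} zio_t;

static zio_t
zio_read_sketch(const blkptr_t *bp, zio_done_func_t done)
{
    zio_t z = {
        .io_bp = bp,
        .io_vd = NULL,
        .io_done = done,
        .io_stage = 1u << 0,  /* OPEN */
        .io_pipeline = 0x7f,  /* stand-in for the read pipeline mask */
    };
    return z;  /* no I/O has happened yet; someone must execute it */
}
```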
Both of these functions, zio_wait and zio_nowait, will call the driver of the pipeline, which is zio_execute, and we'll talk about what that looks like. The other thing to note here is with zio_nowait: when you issue that, it's kind of deceiving; you'd think it's going to be an asynchronous I/O, but it may not be, because it's possible for an I/O to get to a point where it will issue and even return in the same context, because we didn't pass control to anybody else.
We do, however, add this special call to zio_add_child here, and the reason for this is that we have to have a way to keep track of truly asynchronous I/O. The case in point here is: let's say you've gone off and created a bunch of zios, you did a zio_nowait on them, they're off and running, and now you want to export your pool.
So what happens here is the concept of the godfather I/O. The godfather I/O is associated at a pool level, and it becomes the parent of all nowaited zios. When you go to actually export the pool, we're going to do a zio_wait on the godfather I/O, and when that godfather I/O completes, we're guaranteed that all those children have completed. It turns out that the godfather I/O is a simple zio root pipeline; it just has two stages, just like we saw in the I/O dependency graph.
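A minimal sketch of the godfather pattern, assuming a bare counter in place of the real parent/child machinery: every nowaited zio becomes a child of the one pool-level root, so waiting on the root means waiting for everything in flight.

```c
/* Toy godfather: adopt fire-and-forget I/Os, wait for them at export. */
#include <stdio.h>

typedef struct { int outstanding; } godfather_t;

static void zio_nowait_sketch(godfather_t *gf) { gf->outstanding++; }
static void child_completed(godfather_t *gf)   { gf->outstanding--; }

static void zio_wait_sketch(godfather_t *gf) {
    /* The real code sleeps on a CV; the godfather's done stage fires
     * when the last child completes. Spinning here is illustrative. */
    while (gf->outstanding > 0)
        ;
}

int main(void) {
    godfather_t gf = { 0 };
    zio_nowait_sketch(&gf); zio_nowait_sketch(&gf);  /* fire and forget */
    child_completed(&gf);   child_completed(&gf);    /* completions land */
    zio_wait_sketch(&gf);   /* safe to export: nothing outstanding */
    puts("pool export can proceed");
    return 0;
}
```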
So here's what zio_execute does. This is the heart of the I/O pipeline and the driver of how things make their way through it. It's a pretty simple function: it simply takes the zio and keeps running, calling into different stages, until you get to the done stage.
So if you start off and you're not in the done stage (say we started in the open stage), we'll simply increment the I/O stage, and when we increment the I/O stage, we may actually have to increment multiple levels, because not all pipelines have all the stages. These stages are power-of-two values, so we may have to skip over a couple of them.
We skip over those real quick, find the next stage that actually exists in this pipeline, and then determine, first of all: do I need to switch from the current task queue to something else? That happens if, say, you're on the interrupt task queue and you're in the process of trying to do an allocation, where you may have to go off and do a bunch of reads, because we want to load in some space maps or something. When that happens, we want to make sure we switch you back over to the issuing task queue, and that primarily happens because the issuing task queues have a lot more threads than the interrupt task queues. Once you're on the right task queue, being handled by the right thread, you'll simply call the zio pipeline table function, and you can see, based on the stage, these are all the various functions that will get invoked. So when you're trying to figure out, okay, what is this pipeline going to do?
Well, for example, if the pipeline has a stage like read_bp_init in it, it would call the zio_read_bp_init function; if the stage mask has the DVA throttle stage in it, then it calls the zio_dva_throttle function. All of these functions will either return a zio, which says "execute this zio next when we come back through this big cycle," or they'll return NULL, which says "I really want to stop right now." That happens when we have a stall point.
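Here is a compilable toy version of that driver loop, combining the power-of-two stage advance with the function table and the NULL-means-stall convention. The stage set, the table contents, and the names are illustrative, not the real zio_pipeline[] from zio.c.

```c
#include <stdio.h>

typedef struct zio zio_t;
typedef zio_t *(*pipe_fn)(zio_t *);

struct zio { unsigned io_stage, io_pipeline; };

#define ST_OPEN  (1u << 0)
#define ST_ISSUE (1u << 1)
#define ST_WAIT  (1u << 2)
#define ST_DONE  (1u << 3)

static zio_t *st_issue(zio_t *z) { puts("issue I/O"); return z; }
static zio_t *st_wait(zio_t *z)  { (void)z; puts("children pending: stall"); return NULL; }
static zio_t *st_done(zio_t *z)  { puts("done, notify caller"); return z; }

/* Index = log2(stage bit); OPEN never re-runs, so its slot is unused. */
static pipe_fn stage_table[] = { NULL, st_issue, st_wait, st_done };

static void zio_execute_sketch(zio_t *z)
{
    while (z->io_stage != ST_DONE) {
        unsigned next = z->io_stage << 1;
        while ((next & z->io_pipeline) == 0)   /* skip absent stages */
            next <<= 1;
        z->io_stage = next;
        int idx = 0;
        for (unsigned b = next; b > 1; b >>= 1)
            idx++;
        if ((z = stage_table[idx](z)) == NULL)
            return;  /* stall point: no CV wait, control goes elsewhere */
    }
}

int main(void) {
    zio_t z = { ST_OPEN, ST_OPEN | ST_ISSUE | ST_WAIT | ST_DONE };
    zio_execute_sketch(&z);  /* runs "issue", then stalls at the wait */
    zio_execute_sketch(&z);  /* a completion re-drives it to "done"  */
    return 0;
}
```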
So if we're going through the pipeline and all of a sudden we have to wait for a child, and that child isn't done, we'll return a null pointer back to the execute function, and it will actually short-circuit and break out right there. That's a point where the I/O itself is requesting the pipeline to stop. We don't go into a CV wait; we just simply transfer control to somebody else.
Okay, so let's look at more examples of how this happens. Here's the case where we're calling zio_write, and we're calling this from the ARC. Pretty straightforward: the ARC is going to get some information, then we're going to call zio_write, and we're going to get back the zio. Again, what do we expect to happen here?
This happens when we're syncing out and pushing out a transaction group, and this is how it works. We started off here, and we said: okay, the consumer of zio_write was actually arc_write. It gave back a zio, but it didn't issue it. So, okay, let's go figure out who is going to issue it. Well, the caller of arc_write happens to be dbuf_write. Well, dbuf_write simply takes that I/O that was returned from the ARC and assigns it to this dr_zio member; it doesn't actually issue it either. Okay, well, that's kind of interesting.
So let's look at the caller of dbuf_write. There are actually a couple of callers of dbuf_write, and I'm only showing one here, but in this case we see that dbuf_write gets called, we take the zio that was actually part of the dirty record, and then eventually we're going to call zio_nowait on it. So we're going to issue it asynchronously; we're not going to wait for it to complete.
What's interesting here is this recursive call to dbuf_sync_list. dbuf_sync_list is actually where dbuf_sync_indirect gets called. What this is going to do is start looking at all the dirty records and go from the highest level of indirection: create all the zios for those indirect blocks, go to the next level, create all the zios there, and work its way all the way down to the data blocks. And every single time it's doing this,
you'll notice that it passes in a zio, which happens to be the parent: the zio passed to arc_write gets passed in as the parent to zio_write. This has now built up that dependency tree, similar to my earlier example (my example is much smaller than the real mega-tree of zios), but we build up this huge dependency tree where at the very bottom are all the data blocks, then indirect blocks, more indirect blocks, going all the way up to our meta-dnode.
The other thing to note is we build this thing recursively, and it's when we start popping off from the recursion that we start issuing the I/Os. So, although I don't show it here, there's another function called dbuf_sync_leaf, which is going to sync out all the data blocks; it's going to be the first one to actually start issuing zio_nowaits. So it's going to ship off all the data blocks, and then we're going to return from this function.
So when it's coming through the pipeline, you're going to have all these I/Os coming through. The first data blocks are going to come through here; they're going to go async, so immediately they're going to hit that stage. This is being issued typically from the syncing thread; I think, Matt, we now have one thread per dataset for this.
It's also worth noting here that, as I mentioned, DVA allocate contacts the metaslab code; this is where we're going to do our copy-on-write. The vdev layer here is going to contact one of these functions, and chances are it's going to contact more than one of them. So we have this little nuance: every single time we do an I/O, no matter what your pool configuration is,
we always call vdev_mirror_io_start. The reason for this is that we're trying to figure out if we have the copies property or a ditto block associated with that I/O, and we treat that as a mirror. So rather than doing some special case for ditto blocks or copies, we just use the mirroring code, hijack it, and treat those I/Os as mirrors. So you'll always see a call to vdev_mirror_io_start; if you really do have mirrors, you might see a call to vdev_mirror_io_start twice,
and then eventually it calls vdev_disk_io_start to actually do the physical I/O to the constituents of that mirror. So let's look at that. Here's the allocate path: the allocate path will simply call metaslab_alloc; it'll request the allocation that's associated with this I/O. If that fails, this is where we're going to call the gang code. So when we have a failure, we'll call zio_write_gang_block, and zio_write_gang_block will create gang children to go
do this write again with smaller allocation units. So let's say we get through the allocation code just fine; now we get to vdev I/O start. This is that kind of subtle little piece of code where it says: as I go through here, if the vdev is NULL, call the mirror operations to do an I/O start. Every single time a logical I/O comes through here, the logical I/O's vdev is going to be NULL, so it's always going to call vdev_mirror_io_start to deal with the copies property and ditto blocks in the mirror code.
The mirror I/O start code also has special logic, which I don't show here, for when the vdev is NULL; it says: take the block pointer that was allocated, go figure out the vdevs associated with it, and create this mirror map for me. With that mirror map, it will then simply do this while loop and say: for each one of the children in this mirror map, create a child I/O.
So let's say we have two DVAs: we'll create two children, because they're probably going to two different vdevs. I'm going to create a vdev child I/O that's going to go towards child number one, and one towards child number two. So for ditto blocks, it looks like mirroring; and if you have real mirroring, it'll look the same way.
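A minimal sketch of that mirror-map trick: whether the block has multiple DVAs (ditto copies) or the vdev is a real mirror, the mirror code just walks its children and issues one child I/O per copy. The structures here are illustrative stand-ins for the real mirror map.

```c
/* Toy mirror map: one child I/O per DVA, exactly as for a mirror. */
#include <stdio.h>

typedef struct { int vdev_id; long offset; } dva_t;

typedef struct {
    int mm_children;
    dva_t mm_child[3];   /* up to three DVAs in a block pointer */
} mirror_map_t;

static void vdev_child_io(const dva_t *d) {
    printf("child I/O -> vdev %d @ offset %ld\n", d->vdev_id, d->offset);
}

int main(void) {
    /* Two DVAs (copies=2) behave exactly like a two-way mirror. */
    mirror_map_t mm = { 2, { { 0, 4096 }, { 3, 8192 } } };
    for (int c = 0; c < mm.mm_children; c++)
        vdev_child_io(&mm.mm_child[c]);
    return 0;
}
```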
So this is the vdev child I/O. This is another part of the pipeline that has kind of a sub-pipeline, if you will. zio_vdev_child_io has this thing called the vdev child pipeline, which looks like this little piece on the right-hand side: it's just the I/O stages that go to the vdev, plus the done callback, and that's it. So it's a very small pipeline, but it is used very heavily by all logical I/Os that are trying to do I/O to disks.
So every time a logical I/O comes in, it will actually create a vdev child to go do this work on its behalf. The vdev child will run through this pipeline; the logical I/O will also run through these stages, but what happens is it's simply waiting for its children to do the work on its behalf, and it doesn't actually do any real I/O.
The other thing to note here is the pipeline: notice how the start stage here is the vdev I/O start stage right-shifted by one. Unlike all the other pipelines, where we actually start in the open stage, this is one where we start the pipeline at vdev I/O start. So again, it's kind of a powerful thing with zio creation that I can specify where I start in the pipeline, so I can insert an I/O in the middle.
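The ">> 1" makes sense once you remember the driver loop sketched earlier advances the stage bit before running a stage. A two-line illustration, with a made-up stage value:

```c
/* Illustrative stage bit, following the shifting scheme shown earlier. */
#define ST_VDEV_IO_START (1u << 4)

/* The execute loop advances io_stage *before* running it, so creating
 * the child one bit below VDEV_IO_START makes VDEV_IO_START the first
 * stage that actually executes. */
static const unsigned child_start_stage = ST_VDEV_IO_START >> 1;
```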
I don't highly recommend it, but you can do this. There might be special needs you have, where you say: I want to be able to create an I/O that starts at DVA allocate, because I'm doing something funky; it has that capability. In this case it's a convenience, because we wanted to simply drive through the last portion of the vdev I/O stages. You'll also note that we actually pass in the vdev.
So, unlike the other cases, where we passed in NULL and we let the block pointer figure it out, at this point in time we've already taken the block pointer, we've figured out which vdevs we're going to do I/O to, and we're simply going to go do that here. I'm going to point out this little snippet here.
Has anybody ever looked at this code where, if it's a leaf, it's offset plus VDEV_LABEL_START_SIZE? Okay, some people. So this is something to note if you're ever trying to do a translation of: I have this I/O, and this is the offset in my block pointer; I want to know where that is physically on disk. What this is doing is: every time we go to disk, we are shifting and adding to that offset; I think it's four and a half megabytes, which accounts for the ZFS label on that device.
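The translation itself is one addition; here's a sketch using the talk's numbers (2 x 256K front labels plus a 4 MB boot reserve = 4.5 MB). The macro value is illustrative; check the actual VDEV_LABEL_START_SIZE definition in the headers before relying on it.

```c
/* Toy leaf-vdev offset translation: logical offset 0 in a block
 * pointer maps past the front labels and boot reserve on the device. */
#include <stdint.h>
#include <stdio.h>

#define VDEV_LABEL_START_SIZE ((2 * 256 + 4096) * 1024ULL)  /* 4.5 MB */

static uint64_t leaf_physical_offset(uint64_t logical_offset) {
    return logical_offset + VDEV_LABEL_START_SIZE;
}

int main(void) {
    printf("logical 0 -> physical %llu\n",
        (unsigned long long)leaf_physical_offset(0));
    return 0;
}
```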
So the front of the device has a 512K label section and another 4 MB reserved piece for boot; that's the four and a half megabytes, and we shift everything off of it, so that offset zero associated with the logical block pointer really is at four and a half megabytes. That's what that code is doing, and it's all handled in the pipeline. It's a little subtle thing, if you're ever having to figure out how to map a logical block to where it might live on the physical disk.
So I showed you the first half of vdev I/O start, with this funky little vdev-is-NULL case; I'm now going to show you the bottom half of it, which is where it actually does most of the work. So now we come through and we call vdev I/O start again; this time we're coming through it as a vdev child type, not a logical type, because the logical I/O has moved on to done now that it has kicked off its child. The very first stage that the child is going to do is vdev I/O start.
It's also worth noting this little queue I/O call: I may be coming in, and I call vdev_queue_io, and now I'm calling into the I/O scheduler, and the I/O scheduler may come back and say: hey, it's great that you want to run, but this guy over here is the one that really should run next. So you may come in as, say, the zio for vdev 1, and what you get back is a zio for vdev 4, and vdev 4 is the one that's actually going to
execute. So again, another kind of subtle thing here is that, even though you have started the I/O start stage, you may not be the one that's actually running and going to the next phase right away. You may end up getting queued, because you're being scheduled to run later, or maybe you're aggregated to become part of a much larger I/O.
Okay, so a simple case here: let's assume we have a mirror with two disks. This is effectively what it would look like. We would have a logical write; it goes to mirror I/O start, because its vdev is NULL; that creates a child I/O; that child is going to be associated with the mirror; that in turn calls back into vdev_mirror_io_start, where it's going to create two children, because we have two disks associated with this mirror; and those are also going to go to disk I/O start as time progresses.
Our logical I/O moves to vdev I/O done; it stalls, because its children aren't ready to complete. The child I/O associated with the mirror also moves to vdev I/O done, and it's going to wait; it's now got to wait for the two disks that are doing I/O to complete. So this kind of shows you how things progress through the system, in a very simplistic way.
One thing I will note with the pipeline, and you may not have caught this when we were first looking at all the pipelines: I showed that there are these transformation stages for writes, where we compress and encrypt, but we didn't see anything for reads where we decompress and decrypt. That's because they're handled slightly differently: they're handled with transformation stacks, which are set up in this read_bp_init phase. And so in here, you'll see the two things that are highlighted, where we actually decompress the data.
If you're asking for a block to be uncompressed when you're doing the read, then we'll decompress it in the I/O pipeline and pass you back the decompressed data. It's quite possible that in the new world today, when you're using a compressed ARC, this doesn't get called very often, because we're just simply passing back the compressed block anyway. But anyway.
Let's look at a couple of examples. I have this simple little DTrace script that shows me a pipeline based on different functions. In this case, I was tracing all zios that were being executed from dbuf_sync_leaf and dbuf_sync_indirect; this is where we actually do the syncing of a transaction group to disk. So, a couple of things to note: in the first pipeline we see here, we've added this optional nopwrite stage.
It just so happens that when we're doing the writes on a Delphix system, we add that stage in all the time, to see: can we avoid the I/O? The thing we notice is that it actually went to the allocate phase, so it did not avoid the I/O. So this nopwrite case was simply an extra stage that really didn't do anything.
So, I mentioned stall points. Whenever we stall the pipeline, the thread that's going to come back, the one that's actually now waiting, gets its stage reset, so that the next time it gets invoked, it will start off in that exact same stage. This is a case where we stalled: this I/O actually came through, got to zio_write_compress, and said: oh, my children aren't ready for me, I'm going to go off CPU. It gets woken up a second time, now 150 microseconds later, and completes the rest of its pipeline stages.
So here's one looking at reads, again much simpler, and we see a similar thing. This time we're in zio_vdev_io_done, and we see that getting invoked twice: the first time, it's the logical I/O coming through, now waiting for its vdev children to actually do that I/O on its behalf; and then it goes back on the pipeline once that I/O completes.
What do we see with this pipeline? We know the ZIL does writes; or maybe I should say, what don't we see with this pipeline? We don't see allocations; we see checksum generate, and then it goes to ready. Compare that to this one, where we did compression, checksum generate, nopwrite, DVA allocate. From this we know that the first one is a rewrite pipeline: it doesn't have the allocation phases in it. And if we look at the second pipeline that's listed here, what can we infer from just looking at that?
Yes, I can make that available to you. It needs some love, so I would love for people to scrutinize it and figure out how to use it in a better way. Surprisingly, for all the years that I've been doing stuff in the ZIO pipeline, I've never really had a script as good as this zio-trace one, which is probably about a week old.
The reason for it is that without the throttle, and we saw this quite a bit at Delphix, when you have a pool with a lot of devices that are very imbalanced, your spa_sync time is going to be determined by the slowest device that you're writing to. But if you have the throttle in place, the throttle gives out work initially as one big chunk and then gives it out a little at a time as things complete.
So now, if you have a slow device, but a bunch of fast devices along with it, that slow device maybe gets only the initial increment of work, and then the rest of the work gets dealt with by all the other devices; your spa_sync time just shrank. That was one of the reasons why we went to tackle that: we were seeing cases where you would look at, say, ten devices, and two of them were busy.
The rest of them are sitting idle, and the two that were busy tended to be either heavily fragmented or mostly full, and they're waiting for the last pieces of work to complete. And if you've ever monitored spa_sync and just did an iostat, or a zpool iostat, you'd see this huge ramp-up of work where we're pushing out all this I/O, and then it trickles off and trickles off.
Yes, so the question (let me make sure I get this right, Mark) is: does that account for large writes blocking out demand reads? That was a problem that we solved in a slightly different way. Originally, at Oracle, we had kind of a simpler scheduler, and we re-implemented that at the vdev layer, so now there are scheduling queues for each of the I/O types, and that's where the I/O priority is actually passed into the vdev layer.
So, even though you may have a ton of async writes coming through, we give higher priority to a demand read, and we pull off of that queue even though it may have fewer elements than the write queue does. At Oracle, we weren't doing that; we actually had kind of a combined queue, where everything got sorted in together.
So, with regards to the pipeline, no: the fact that we're using 4K physical sectors won't really impact the pipeline as much. It's possible that the gang code (I'd have to go back and look at this) may not be accounting for physical sector size and may actually try to do allocations smaller than the physical sector, but I'd have to see if that's the case, so I'm not certain.
Yes, so Brian made an observation, which is a great one: for those of you in the room, or even online, for some of the things that are presented here, it would be great if we actually had comments in the code (going back to Brian's original tweet) that explain some of these things. That's a great way to get involved and be part of the community. We welcome it.