From YouTube: DirectIO for ZFS by Brian Atkinson
Description
From the 2021 OpenZFS Developer Summit
slides: https://docs.google.com/presentation/d/1f9bE1S6KqwHWVJtsOOfCu_cVKAFQO94h
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
A: Cool, yeah. So I'm Brian Atkinson, I'm from the HPC storage design group at Los Alamos National Lab. Today I'm going to be presenting on the addition of Direct I/O to ZFS.
A: It says: try to minimize the cache effects of I/O to and from the file, and the I/O is done directly to and from user-space buffers. What this really means in Linux is that most file systems will directly map user pages into the file system and then read and write directly in and out of those pages. By doing this they're completely bypassing the page cache, so no copies are ever made of the actual buffers the user is using.
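As a minimal user-space sketch of what that looks like from an application's point of view (the file path and request size below are illustrative, not from the talk): the buffer handed to the kernel is page-aligned, and with O_DIRECT the file system transfers those pages directly instead of copying through the page cache.

```c
#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 1024 * 1024;          /* 1 MiB request, illustrative */
    void *buf;

    /* O_DIRECT transfers must use a suitably aligned buffer. */
    if (posix_memalign(&buf, (size_t)page, len) != 0)
        return 1;
    memset(buf, 0xab, len);

    /* Hypothetical path; O_DIRECT asks the file system to bypass the page cache. */
    int fd = open("/tank/fs/checkpoint.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    /* The kernel maps these user pages in and writes from them directly. */
    ssize_t n = pwrite(fd, buf, len, 0);

    close(fd);
    free(buf);
    return (n == (ssize_t)len) ? 0 : 1;
}
```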
A: But outside of this there are really loose rules and semantics around Direct I/O. If you look at different file systems, they'll have their own alignment restrictions, or even decisions like: do we want to be coherent with buffered reads and writes, and all that kind of stuff, with Direct I/O? So really it's very loose semantics outside of just that main idea of mapping in user pages and directly reading in and out of them. Currently ZFS actually accepts the O_DIRECT flag, but what it actually does today is just ignore it and fall back to buffered I/O.
A: And I can honestly say, while working on this code, I have felt like I was going crazy from time to time, because the ARC is such an integral part of how ZFS works. I wanted to start out by saying that in no way is this presentation saying never use the ARC. The ARC is super important to ZFS and you can get great performance out of it. However, there are certain times where it is beneficial to bypass the ARC.
A
But
I
also
want
to
talk
about
just
our
workloads
at
lanl
specific
to
hbc,
and
we
have
these
run.
Launching
run
long
run
parallel
simulations
and
these
things
can
run
for
sometimes
on
the
scale
of
weeks
to
months
and
highly
parallel.
And
so
often
what
these
simulations
will
do
is
periodically
throughout
time
is
they're
going
to
checkpoint
their
data,
and
this
is
just
a
right
one
situation
in
the
hopes
that
we
never
have
to
read
that
data
back.
A: It's merely to save the state of the simulation itself, and we call this checkpointing. So if that's going into the ARC, there's really no benefit for us there, because our intention is to never have to read that data. But unfortunately, we all know hardware failures happen, and in that case we actually do eventually have to read that data back out of ZFS to restart the simulation and get it back to its old state.
A: We got really surprising results. The first thing I want to highlight on this graph is this top line here, the dashed line, which is the maximum amount of sequential read bandwidth we could get out of 12 Samsung PM1725 NVMe drives, right around 42-43 GB/s. However, when we actually put these NVMe drives inside of a zpool, then we started using different configurations: striping, RAID-Z1, or RAID-Z2.
A: We found we were completely bottlenecked, and the bottleneck got worse as we continued to decrease I/O parallelism, which is the x-axis here. Each one of these data points along the x-axis is sequential readers, each reading its own individual file out of the zpool. In the best-case scenario, which is the low I/O thread counts, we're leaving about 48 to 57% of all the available NVMe bandwidth on the table.
A: The main difference here is that the top line is for all 12 NVMe drives, and that applies to striping, because in that case we actually have all 12 NVMe drives' bandwidth to work with. However, when you go down to RAID-Z1 you lose a disk's worth of bandwidth for the writes, and then in the RAID-Z2 case you lose an additional disk's worth of bandwidth for the writes. And we saw something very similar.
A: Just like we saw with the reads, as we increased I/O parallelism we were just a completely flat line, and in the worst case we're leaving 47% of available NVMe bandwidth on the table, in the best case 34%. But in either of these cases, that's still a lot of available bandwidth that we wanted to capture. This actually led to a meeting back in August of 2019 between Cray, Livermore, Oak Ridge, and LANL, because, funny enough, we were all investigating the same issue and we had all tried different parameters.
A: We had shared that with each other, and we tried just doing small patches of code, but we could never get past these bottlenecks in ZFS with these NVMe zpools. So over the course of the week we decided, all right, let's actually investigate this and figure out what we're missing here. What's our bottleneck? We used a tool called flame graphs.
A: The way you read a flame graph is that the very bottom is the beginning of the call stack, and as you go further up in the flame graph you're getting deeper and deeper into the call stack. What you're trying to find in these flame graphs are plateaus: the longer a plateau is, the more execution time is being spent in that call. We found two places in particular where, with the sequential reads for buffered ZFS I/O, we were getting stuck in memory copies.
A: But this time with writes, again we found all of our execution time was being stuck in this memory copy, but in this case it was simply taking the user's buffer and copying it into kernel space and into the ARC. So when we looked at this, we thought, all right, there's actually a pretty easy solution to this: we just need to actually implement Direct I/O in ZFS, and we can completely avoid these memory copies.
A: Before I get into the dirty details of how we got all this working, I just want to go over the big picture: what do we mean when we say we added Direct I/O to ZFS? I'm going to start with the reads first, and I've got a really simplistic diagram over here on the right of the internals of ZFS. On the left-hand side we have the normal buffered path that ZFS takes with reads.
A: The big picture here is that when that read system call comes in through the ZPL, we'll enter the DMU and we're just going to directly issue that into the ZIO pipeline and read that data off of the vdevs, and we do this by directly mapping the user pages into an ABD. I just want to quickly state one thing here, and then I'll go into way more detail about what I mean by this: with Direct I/O reads we can copy out of the ARC.
A: We allow that because we have this thing called ARC coherency between buffered and Direct I/O, but again I'm going to go into that in much more detail. I just wanted to quickly mention it here.
A: The big idea on the write side is that the typical write path for ZFS buffered I/O is that a user buffer comes in and is immediately memory-copied into the ARC. As soon as that copy is done, we return out to the write system call and we're done. That buffer is then assigned to a transaction group, and eventually, after some time period or once enough dirty data has accumulated in the ARC, the transaction group will transition to its sync phase.
A: What we're saying is: take that user buffer, again directly map it into an ABD, and then immediately send it through the ZIO pipeline. We're still going to do all those transformations just as we did before, and then we're going to issue that write immediately down to the underlying vdevs, and it's not until after we've put that data on disk that we return back to the write system call. So at that point we can guarantee to the user that the write is on stable storage.
A: Okay, your data has landed on the underlying vdevs. So now, to get into the actual details of how we got all this working in ZFS, I just want to stick to general details first, and then I'm going to go into much greater detail on the write side of Direct I/O. And just to start up front, I want everybody to understand:
A: O_DIRECT does not imply O_SYNC, and this is common across file systems. What we mean when we say this is that even though we're guaranteeing the data has landed on the vdevs with O_DIRECT, that says nothing about the indirect block pointers and the metadata of that write. If you want those to go out with that O_DIRECT data, you have to either set sync=always or pass the O_SYNC flag.
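A small hedged sketch of what that means for an application (the file path and size are illustrative): an O_DIRECT write alone only guarantees the data blocks have reached the vdevs, so if the metadata also needs to be durable the application should either add O_SYNC at open time or follow up with an explicit fsync().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1 << 20;                    /* 1 MiB, illustrative */
    void *buf;
    if (posix_memalign(&buf, (size_t)sysconf(_SC_PAGESIZE), len) != 0)
        return 1;
    memset(buf, 0, len);

    /* One option is to request synchronous semantics up front:
     *   open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
     * The path below is hypothetical. */
    int fd = open("/tank/fs/state.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    /* With O_DIRECT alone, this only covers the data blocks themselves. */
    if (pwrite(fd, buf, len, 0) != (ssize_t)len)
        return 1;

    /* The other option: an explicit fsync() to push out the block pointers
     * and other metadata before depending on the write being fully durable. */
    fsync(fd);

    close(fd);
    free(buf);
    return 0;
}
```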
A: The other big thing is that we have alignment restrictions with O_DIRECT. For now we've chosen the page size; it's common amongst a lot of file systems, though some can even go as low as the LBA, but for now we stuck with page size. If a user does request Direct I/O through the O_DIRECT flag and the request is not page-size aligned, in that case we're going to return EINVAL.
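As a hedged illustration of one policy an application might choose when it hits that EINVAL (this fallback is the application's decision, not something ZFS does for it): clear O_DIRECT with fcntl() and retry the same request through the normal buffered path. The helper name is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try a direct write; if the file system rejects it as misaligned (EINVAL),
 * drop back to buffered I/O for this descriptor and retry once.
 */
static ssize_t write_direct_or_buffered(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0 && errno == EINVAL) {
        int flags = fcntl(fd, F_GETFL);
        if (flags >= 0 && (flags & O_DIRECT)) {
            /* Clear O_DIRECT and let the normal buffered/ARC path handle it. */
            if (fcntl(fd, F_SETFL, flags & ~O_DIRECT) == 0)
                n = pwrite(fd, buf, len, off);
        }
    }
    return n;
}
```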
A: So users can actually verify exactly how their request got issued out in ZFS. There's also, of course, arcstat, and you can see through arcstat how your data was delivered, but the extra accounting that we've added is much more fine-grained, if anybody wants to look at that while they're issuing their Direct I/O. The other big thing I wanted to mention is that all Direct I/O requests are issued at sync priority down in the vdev queues, so the ZFS vdev sync max/min active parameters are the ones that apply to any Direct I/O.
A: In the case that, for whatever reason, the user is mixing buffered and Direct I/O operations, we do have a little bit of logic for that. Once the data is on the vdev, in the best-case scenario we have no dirty data in the ARC, and at that point it's like, great, let's just update the block pointer. That means every future read is going to have to issue down to the vdevs below. However, worst case, users will sometimes mix and match buffered and Direct I/O.
A: Although I don't suggest it, they will occasionally do this. So if we do have these dirty records, the first thing we have to check is: okay, are we actually syncing out the dirty record that's associated with this data buffer? If that's the case, we have to wait, because with Direct I/O we want to promise the same consistency semantics that ZFS normally promises.
A: We want that previous transaction to go out and be written to disk. If there is no dirty record syncing, what we'll do is remove all the dirty records, remove the data from the ARC, and then update the block pointer. We do have one other alignment restriction, specifically with Direct I/O writes: each Direct I/O write also has to be record-size aligned, and there's a reason we do this.
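A small hedged sketch of that restriction from the caller's side (the 1 MiB recordsize in the comment is just an example): before issuing a Direct I/O write, check that the offset and length line up with the dataset's recordsize, so the request only ever rewrites whole records and never triggers the read-modify-write cycle described next.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Illustrative check: a direct write that starts and ends on recordsize
 * boundaries rewrites whole records, so no existing record has to be
 * read back, modified, re-checksummed, and written out again.
 */
static bool dio_write_is_record_aligned(off_t offset, size_t len, size_t recordsize)
{
    if (recordsize == 0 || len == 0)
        return false;
    return (offset % (off_t)recordsize == 0) && (len % recordsize == 0);
}

/* Example: with a 1 MiB recordsize, a 4 MiB write at offset 2 MiB qualifies. */
```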
A: It's because we want to avoid a read-modify-write cycle. To explain what I mean by this, I wanted to show two back-to-back Direct I/O writes, one after the other. If we came in and took the Direct I/O path, we went into the ZIO pipeline, we would calculate a checksum based on that user data, place it in the block pointer, and then that data would be on the vdev.
A: If a second write then only updates part of that record, you have to go down to the vdev, read that data off, bring it back up into the DMU layer, modify it, and issue it back out through the ZIO pipeline, because we have to recalculate the checksum, update it inside the block pointer, and then issue the data back out to the vdev. And honestly, we did implement it this way at first: we allowed these sub-record-size updates with O_DIRECT.
A: In that case, the other thing I did want to mention is that for the first block that's written into a file, ZFS actually allows that first block to slowly grow up to a full record size. This is really entwined with how the ARC works, and because we already have this record-size alignment restriction, we decided, you know what, while that block size is still growing, just go through the ARC like you normally would.
A: What we're saying here is that we actually have to write-protect the user's pages, because it's important to remember we've directly mapped in the user's pages with O_DIRECT. To explain why this is important: again, when we go into the ZIO pipeline, we're calculating the checksum and writing it into the data's block pointer. However, whether maliciously or, let's be honest, because users of parallel code sometimes just do weird things,
A: if, for some reason, after we've calculated that checksum and set it in the block pointer, the user modifies the contents of the buffer being written out, everything is fine as long as the data is just sitting there on the vdev. But the issue comes in when we go to read that data back, either because we've issued a read on it or there's a resilver happening or anything of that nature: what we're going to get is a checksum failure, and there's no way for us to actually fix it, because the checksum was calculated based upon the original contents of the user's pages.
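This is why the implementation write-protects the mapped user pages for the duration of the write. As a loose user-space analogy only (not the actual kernel mechanism, which operates on the pages ZFS has mapped in), mprotect() shows the idea: once a region is marked read-only, an attempt to scribble on it faults instead of silently invalidating data that a checksum has already been computed over.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    /* A page-aligned buffer standing in for the user's O_DIRECT write buffer. */
    unsigned char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    memset(buf, 0xab, (size_t)page);

    /* Analogy: "checksum computed, write in flight" -> make the pages read-only. */
    if (mprotect(buf, (size_t)page, PROT_READ) != 0)
        return 1;

    /* buf[0] = 0; */   /* Modifying now would deliver SIGSEGV rather than
                         * silently corrupting data the checksum already covers. */

    mprotect(buf, (size_t)page, PROT_READ | PROT_WRITE);   /* "write completed" */
    munmap(buf, (size_t)page);
    puts("pages were protected while the (simulated) write was in flight");
    return 0;
}
```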
A: As a last note on Direct I/O in ZFS, we did add a new dataset property called direct. The default is standard, which follows all the semantics that I've outlined so far, but we also added always. What that really allows the user to say is: okay, I don't want to modify all my applications to use the O_DIRECT flag, let me just simply try it out. always is kind of a best effort that we do here.
A: In that same vein we also added disabled, because if this patch gets merged, people that were already passing the O_DIRECT flag would be like, what the heck just happened? The performance may be all over the place and all of a sudden they're not getting what they want. Setting disabled just means: hey, just ignore the O_DIRECT flag, just how ZFS does it today.
A: So it basically falls back to the old behavior and just forgets that the O_DIRECT flag was even being passed. Now I want to actually show our performance results with all these additions of O_DIRECT to ZFS. I'm going to start out looking at the sequential read results from the NVMe zpools, and on this graph I have the previous results that we could get with sequential reads.
A: What we actually see is that for each of the different vdev configurations, striping and the RAID-Zs, we get pretty close to capturing all that available NVMe bandwidth. At the higher I/O thread counts, where we are thrashing the ARC, we actually get about a 3x speedup for the reads, and in general, for all cases, the O_DIRECT reads scale pretty well and stay consistent as we continue to decrease that I/O parallelism.
A: With the sequential write results for these NVMe zpools, it's again important to remember that, based on the particular vdev configuration, there's a certain maximum you can achieve. So I just wanted to grab the one case where all three of the different vdev configurations were at their maximum bandwidth, and that was at 512 sequential I/O write threads. For each of the different vdev cases I have a green bar showing the amount of available NVMe bandwidth.
A: But for all three of the different vdev configurations we got over a 1.5x speedup. I also just wanted to quickly state that, unfortunately, I don't have enough time in this presentation to go over all the scaling results with O_DIRECT sequential writes, but if anybody's interested in this data, I do have addendum slides where all that data is available.
A: I also have O_DIRECT write scaling results with dRAID configurations, so anybody can look at that if they're interested. There was one last performance case I actually wanted to go over here, and that was O_DIRECT with disks. The main difference between this graph and the previous graphs I've shown is along the x-axis: this is actually the number of sequential readers.
A: The lighter-colored lines are the buffered cases, and what I really want to highlight here is that for all the O_DIRECT cases we're actually performing worse than buffered. This goes back to what I started the presentation with: we're not arguing to never use the ARC. In fact, with O_DIRECT, if you're looking at it just from a performance perspective, there are a few things to take into consideration.
A: Basically, you want to think about: what is my I/O workload, and will I benefit from this? Even with that I/O workload, what's my vdev configuration, and along those lines, what is the underlying hardware? Because we can see here that even at the low I/O thread counts, the ARC prefetching is doing a pretty good job with this JBOD, and even though it does flatline a little bit and come down, it's still really outperforming the O_DIRECT results.
A: People can go out there, grab it, try it out, possibly discover bugs, give it a whirl. We definitely would like some feedback on the pull request from anybody who does try it. We are aware of some bugs in the code; there are some corner cases we're still working through, in particular with Linux the stable pages stuff.
A: The kernel doesn't really document stable pages too well, so we've been struggling to get that to work for all cases on the Linux side, but we're hopeful we're getting close to getting that done. And on FreeBSD, for some strange reason, with Direct I/O reads I've been occasionally getting EFAULTs when mapping in the user pages. Really, anybody in the FreeBSD community, I'm more of a Linux person and I'm trying to learn FreeBSD as much as I can, but anybody that could help out on that side, it would be greatly appreciated.
A: Mark Maybee initially did the Direct I/O implementation. Matt Ahrens really assisted in getting the semantics nailed down for how he wanted Direct I/O to work in ZFS. Matt Macy did all the FreeBSD porting for Direct I/O, and I really appreciate his help there. And Brian Behlendorf has been great in helping me try to get this work across the finish line, a lot of stuff with the ARC coherency and just general Linux-side implementation details, including the stable page issue that we have right now.
C: So Brian, this is fantastic to see this work coming along. It's been a while, I know.
C: So, in your analysis, I mean, you already highlighted that the Direct I/O work is really a performance effort. You have to be careful when you use it, because it may not pan out in all configurations, and you demonstrated that, obviously, with your examples running on disks and then on NVMe. Have you also done any analysis on small-block versus large-block workloads? Because I would assume that would be another category where you might not see the performance you might expect.
A: No, I haven't done a full analysis there yet, but that is something we're interested in, because actually Matt Ahrens and I were talking about those JBOD results, and really I should have pushed up the record size. I didn't say this, but all the results that I shared were 1 MiB record sizes and 1 MiB request sizes. If I'd actually pushed that up to, say, an 8 MiB record size...
C: Yeah, exactly. I think we have to be careful when we release this feature to make the documentation very clear about that: obviously your performance may vary depending on your workload, exactly.
A: So yeah, for the I/O workload we used the xdd tool, and the reason we used xdd is because...
B: Cool. Joshua says thanks. The next question was from Saji Nair. He notes that O_DIRECT doesn't give you O_SYNC, which means that if you crash, you could lose the O_DIRECT writes if they weren't synced.

A: Yeah, that statement is exactly right.
A: But all of that is still going to be in the ARC. With O_DIRECT we're strictly talking about the data's user buffer itself; we're not talking about any of those indirect blocks, and that again goes back to that O_SYNC idea.
B: Cool. James Simmons asks: what's the timeline to getting this merged?
A: Yeah, hopefully soon. I mean, the stable page stuff has definitely been frustrating on the Linux side. Every time we think we've got it, it's like, oh well, we just tested a case and now we're stuck waiting around and just deadlocking. So really I think that's the last remaining hurdle on the Linux side. Again, there are a few little bugs.
A: We actually do have a random ARC leak that can happen, which has been super hard to trace down; it could be like one out of 100 runs that all of a sudden we leak something in the ARC, but that's only when you're mixing the Direct I/O and buffered stuff. And FreeBSD, again, needs more love and nurturing care than I have given it.
B: Cool, we have a few more questions and I think we have time for them, so I'll keep going. Richard Laager is asking about alignment: if the page size is less than the record size, then what are the alignment requirements? Doesn't it still need to do read-modify-write, and does that fall back to buffered access, or is it still going to be doing Direct I/O?
A: We actually do address that case. Say you have a record size of 1K: as long as you've issued the request as 4K, we will send that O_DIRECT.
A: I think that's too broad of a statement, to be honest, because again, when we were going over the results it was like, well, we should have gone with the bigger record size. And so again, this is where I'd love for the community to grab the pull request, mess with it, and give some feedback. We're obviously doing more in-house to get more results and interpret them, but yeah, I think it's too broad a statement to say, no, this is only going to be great for NVMe.
B: Yeah, I agree. It seems like primarily you can think of this as: O_DIRECT reduces CPU usage and reduces memory usage, and there are some latency trade-offs for making that happen in some cases. But if your workload is not super latency-sensitive, then having more CPU available is good, regardless of the back-end performance.
B: Yuri Vaultwatchkov asks: how about the use case of virtual machine storage using a zvol? Would that benefit from Direct I/O, assuming that you carefully select the record size?
A: It could, and in fact Behlendorf and I were talking recently; they've had a case come up at Livermore where they could see benefits with zvols and O_DIRECT.
A: So we definitely want to get that work in, it's just, again, I don't know if it'll be in this initial pull request; there may be a separate pull request after this gets merged. Without fully testing it out it's hard to say. We first would have to get everything hooked in, and actually we thought we had it working at one point, but we were surprised it wasn't as easy as just putting the hooks into the zvol code, even though the hooks are there.
B: Right. We just have time for maybe these two last questions that are here, a question relayed from YouTube.
B: xrd3k is asking if the speedups would apply to special vdevs as well, special vdevs being those that are used for small blocks and metadata. I think normally those would not be storing regular file data.
B: Ted Kabine is asking: is this largely intended for workloads where the data being accessed is much larger than the memory available for the ARC? In other words, it can't be cached, and the blocks being accessed aren't going to be accessed again soon. So he's basically asking, is this for uncached workloads?
A: Yeah, that's kind of what O_DIRECT implies. And again, the reason we even got this work kicked off, between Cray, Oak Ridge, Livermore, and LANL, is that sometimes our I/O workloads are not common: we're sequentially streaming out large amounts of data, and that's where we saw the benefits of this. So for us it's like, well, we're not really wanting to cache it anyway.
A: And I didn't stress this in the presentation, but ZFS has the range locks out there, so that prevents a lot of bad things from happening with O_DIRECT writes. Because we have the record-size alignment thing, it's like, all right, great, we can have multiple Direct I/O writes going to the same file, and in the N-to-1 cases that we've tested at LANL it works completely fine. You're never going to have two Direct I/O writes overlapping the same record.
B: All right, well, thanks a lot, Brian, for your talk.