From YouTube: ZVOL Performance by Tony Hutter
Description
From the 2022 OpenZFS Developer Summit: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2022
Slides: https://drive.google.com/file/d/1smEcZULZ6ni6XHkbubmeuI1vZj37naED/view?usp=sharing
Can you hear me? Awesome. So I want to talk a bit about zvol performance, and more specifically about a patch I got merged over the summer that could increase, or decrease, your zvol performance.
Back in August of 2021, our archive team at Lawrence Livermore National Laboratory was testing out some new hardware, and they were benchmarking zvols on ZFS. What they were seeing was that when they did parallel dd writes, just normal dd writes, they would get about 1.7 gigabytes a second, and then when they re-ran the same test with O_DIRECT they would get 2.9 gigabytes a second, so basically line rate. And they were asking, well, why is that?
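For context, the difference between those two runs is whether the write goes through the page cache or straight to the device. Here is a minimal sketch of the two code paths at the syscall level, assuming a hypothetical zvol at /dev/zvol/tank/vol1 (O_DIRECT also requires an aligned buffer):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)  /* 1 MiB per write, like dd bs=1M */

int main(void)
{
    /* Hypothetical zvol device node; adjust to your pool/volume. */
    const char *dev = "/dev/zvol/tank/vol1";

    /* Buffered write: data lands in the page cache first, and the
     * kernel later pushes it to the zvol as many small BIOs. */
    int fd = open(dev, O_WRONLY);

    /* O_DIRECT write: bypasses the page cache, so the zvol sees the
     * large I/O directly. Uncomment to compare. */
    /* int fd = open(dev, O_WRONLY | O_DIRECT); */

    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, BUF_SIZE) != 0) return 1;
    memset(buf, 0xab, BUF_SIZE);

    if (write(fd, buf, BUF_SIZE) != BUF_SIZE)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}
```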
So I started down the rabbit hole. This is a non-O_DIRECT write, just a normal write, shown as a flame graph, something I learned about at a past OpenZFS conference, I think from one of your presentations, Matt. Here, with the non-O_DIRECT write, you can see the zvol write path is something like 60 percent of the total time, and if you look at the O_DIRECT case it's maybe 20 percent.
The zvol driver is a block device driver for the Linux kernel, and the API it historically used was submit_bio. With submit_bio you register a function and tell the kernel: whenever you have an I/O that you want to send to my block device, pass me this struct bio. What I was seeing was that when you did O_DIRECT, you'd get these nice big, like half-megabyte or one-megabyte, block I/Os, but if you did regular writes, so non-O_DIRECT, you'd get these tiny little 4K BIOs coming in, tons of them. And that makes sense, because when you're doing a regular write the kernel is going to cache everything in the page cache; that's why you're getting all these little 4K block I/Os.
So how does that relate to dbuf_find? dbuf_find is a function that looks up the dbuf for the block that you're writing. The dbuf is basically the internal representation of the block in cache, and you have to lock the block before you write it. And as you can imagine, there are tons and tons of blocks in cache.
Okay, this is a simplified view of what's going on. Let's say I want to write block zero. I first need to hash to its bucket, in this case bucket three, and lock the bucket so that nobody changes the list out from under me. Then I iterate through all the blocks in the bucket until I find block zero, and I lock it.
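As a rough userspace illustration of that lookup (this is not the actual OpenZFS dbuf code; the names and structure are invented for the example):

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_BUCKETS 4096

/* Toy stand-in for a cached block ("dbuf"). */
struct dbuf {
    uint64_t        blkid;
    pthread_mutex_t lock;   /* per-block lock */
    struct dbuf    *next;   /* hash chain */
};

static struct dbuf     *buckets[NUM_BUCKETS];
static pthread_mutex_t  bucket_locks[NUM_BUCKETS];

/* Find the cached block for blkid and return it locked:
 * 1) hash to a bucket, 2) lock the bucket so the chain can't change,
 * 3) walk the chain, 4) lock the block itself, 5) drop the bucket lock. */
static struct dbuf *dbuf_lookup_locked(uint64_t blkid)
{
    size_t b = blkid % NUM_BUCKETS;

    pthread_mutex_lock(&bucket_locks[b]);
    for (struct dbuf *db = buckets[b]; db != NULL; db = db->next) {
        if (db->blkid == blkid) {
            pthread_mutex_lock(&db->lock);
            pthread_mutex_unlock(&bucket_locks[b]);
            return db;      /* caller must unlock db->lock */
        }
    }
    pthread_mutex_unlock(&bucket_locks[b]);
    return NULL;            /* not cached */
}
```

The point to keep in mind for what follows is that both the bucket lock and the per-block lock become contention points when many small I/Os target the same block.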
If you do a one-megabyte write and it's broken up into little 4K block I/Os, then you're going to have 256 little block I/Os all trying to lock the same block. And the way it works in the zvol driver is that when a block I/O comes in, it hands it off to a thread. So let's just say you have 16 processors, 16 CPUs, 16 threads: it hands off all these little block I/Os to all these threads, and they're all trying to get the same lock.
So when you load the ZFS module, it creates the number of hash buckets based on how much available memory you have, so it could be, you know, thousands. But historically we always hard-coded 8,000 locks, so one lock would cover multiple buckets. So when you had all this lock contention going on, you could be holding a lock that other blocks also hash to, and other parts of the block device wouldn't be able to access those either.
So anyway, block multi-queue (blk-mq) is a newer API for talking to your block driver, and the difference is that instead of getting tiny little block I/Os from the page cache, you get one big struct request that points to multiple block I/Os. That way you can take the block lock once per big request rather than once per little block I/O. That's the main benefit for ZFS, but there are other benefits too. The reason it's called block multi-queue is that it's a queued interface.
So instead of submit_bio, where you submit and then you wait, okay, next one, submit, with block multi-queue the I/O just gets put into a queue. The kernel can rearrange or merge I/Os within the queue if it wants, and then when it's ready to kick them out of the queue it wakes up your driver. And it's called multi-queue because there's not just one big queue; there are multiple queues that you can spread across multiple CPUs, so you get better parallelism.
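To make the API difference concrete, here is a heavily simplified, schematic fragment of what a blk-mq driver registers. This is not the actual zvol code and not a buildable module on its own; the my_* names are invented, and the callback signature is the one used by reasonably recent kernels:

```c
#include <linux/blk-mq.h>

/*
 * blk-mq hands the driver one struct request that can span many pages,
 * so the driver can take its locks once per request instead of once
 * per 4K struct bio, as it had to with the old submit_bio interface.
 */
static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);
	/* ... hand rq off to a worker thread, or process it right here ... */
	blk_mq_end_request(rq, BLK_STS_OK);

	return BLK_STS_OK;
}

static const struct blk_mq_ops my_mq_ops = {
	.queue_rq = my_queue_rq,
};
```

The submit_bio-era entry point, by contrast, receives one struct bio at a time, which is why the old path saw the flood of 4K BIOs described earlier.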
All right, on to the benchmarks. Here I took eight NVMe drives, made them into a RAID 0 pool, and then just ran dd. This is one dd to one zvol, running regular non-O_DIRECT writes, and you can see that with the old submit_bio interface I got between 400 and 600 megabytes a second at different block sizes, and then with block multi-queue I got around 750 to 900-ish, so a pretty good speed-up there.
This is the same benchmark, one dd to one zvol, but with O_DIRECT, and you can see it's close to the same, but block multi-queue is actually a little bit worse in all the cases; I'll go over later why I think that is. Okay, this is a parallel write benchmark: 16 dds spawned in parallel, writing to 16 zvols, and this is kind of the best-case scenario for block multi-queue.
Okay, now let's look at reads. This is a single read, a single dd read from a single zvol, non-O_DIRECT, and the performance is close to the same, but block multi-queue is a little bit better at the 8K and one-megabyte volblocksizes. And then for parallel reads, 16 dd reads, non-O_DIRECT, they basically perform the same again.
Okay, so now I want to look at all those same benchmarks, but this time at the CPU usage. Here I was just running the time command, so I was looking at wall-clock (real) time and system time, that is, CPU time versus wall-clock time. Real is the wall-clock time, sys is the CPU time, and like before you can see:
The yellow line is the submit_bio interface. This is a single write, so it takes a lot longer to complete, and block multi-queue is a lot quicker. But you can also see the red line, which is the block multi-queue system time, and the block multi-queue system time is higher than the submit_bio system time. So it's using more of your cores, using your cores more efficiently.
And this gets even more pronounced in the parallel write case. Again, we know block multi-queue is a lot faster here, but if you look at the red line, which is the block multi-queue CPU time, the system time, and compare it to the submit_bio system time, the green line, you can see that block multi-queue is really using more of your cores, which is good. I mean, you pay for those cores; you want to use them to get faster writes.
Looking at single reads, non-O_DIRECT: block multi-queue was a little bit faster at 8K, which we knew, with about the same amount of system time. And rounding it out: parallel reads, non-O_DIRECT, basically the same; single reads with O_DIRECT, basically the same; parallel reads with O_DIRECT, block multi-queue is a little bit slower at the 8K block size.
Okay, so those were all sequential. These are some numbers that Tony Nguyen ran, a bunch of different tests. I don't want to get into the actual numbers, because they're all over the board, and that's kind of the point. The point I want you to take away from this is that block multi-queue is not an instant win. You really have to try it with your workload. It may be a lot worse for random reads and writes, and if you're running O_DIRECT you probably don't even want to use it.
So, in summary, block multi-queue helps in the non-O_DIRECT cases, especially with non-O_DIRECT writes, and especially parallel sequential writes. It can help a little bit on sequential reads, but otherwise, if you're using O_DIRECT, you really don't need block multi-queue.
Now, I mentioned that block multi-queue was a little bit slower in some of the cases, and I said I'd touch on it later. This is the reason why I think it's a little bit slower (let me check my time here). When you do a write with block multi-queue, it gets put into a queue in the kernel, and then at a certain point the kernel kicks it out of the queue to your block driver, the zvol driver, and says: process it.
So that takes time. Then, once it goes into the zvol driver, we spawn off a thread to deal with it, and that takes time too. So it kind of gets double-queued, two wake-ups, and it doesn't necessarily have to be that way. You could do a write, it goes into one of the block multi-queue queues, and then, when the kernel wakes up your block driver, we could just synchronously call the zvol write path. When I tried that, it worked for like the first four-ish writes, and then it would kernel panic.
So, for example, if you want to use block multi-queue, if you want to test it out, you would set this parameter to one and then import your pool, because that would load your zvols, or you could create a zvol after setting it. You could even have cases where you create a zvol, set this value to 1, and then create another zvol, and those two zvols would use the two different code paths: one uses submit_bio, one uses block multi-queue.
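For illustration, flipping the tunable at runtime could look like the small program below. It assumes the tunable is the zvol_use_blk_mq module parameter and that it is exposed writable under /sys/module/zfs/parameters/; check your OpenZFS version before relying on either assumption. Echoing into the same sysfs file from a shell would of course do the same thing.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed path: Linux exposes module parameters under
     * /sys/module/<module>/parameters/<name>. */
    const char *path = "/sys/module/zfs/parameters/zvol_use_blk_mq";

    FILE *f = fopen(path, "w");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }

    /* zvols created (or pools imported) after this point would use the
     * blk-mq code path; existing zvols keep the path they started with. */
    fputs("1\n", f);
    fclose(f);
    return 0;
}
```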
zvol_threads is actually an existing, pre-block-multi-queue parameter; it's just the number of zvol processing threads in the zvol driver, and I think its default setting says use the number of CPUs. zvol_blk_mq_threads: you can think of this as the number of queues to tell the kernel to create, and I think this is also set to zero by default, to say, okay, default to the number of CPUs.
zvol_blk_mq_queue_depth: you can tell the kernel how big you want your queues to be, but in all my testing it didn't really affect the performance at all. And then finally, zvol_blk_mq_blocks_per_thread: this is kind of a way of telling the kernel how big you want your I/Os to be, or what the optimal size is that I would like my I/Os to be, and this is an interesting value which I'll get into a little bit.
I mentioned that because of the locking, you want to process big block I/Os, or more specifically, you want to process block I/Os that are the same I/O size as your volblocksize. So you'd say, okay, well, if my volblocksize is one megabyte, then I want to tell the kernel I want one-megabyte block I/Os. And you can do that, that's fine, but then remember that it's doing all this overhead with the double queuing.
And this is where I kind of went down the rabbit hole. The kernel provides this function called blk_queue_io_opt, which is documented as "set optimal request size for the queue," but if you set it to one megabyte, or whatever you want to set it to, it doesn't seem to give you those I/Os. So I didn't have any luck with that, but there's kind of a roundabout way to do it with these two other functions.
So this is a graph with different numbers of blocks, or rather different request sizes, and different numbers of threads, but the takeaway here is that the more blocks you request, the faster your reads go. This is a parallel read test, so: more blocks is better for reads.
Just a few other interesting commits I want to talk about. "Reduce dbuf_find() lock contention": this was something that Brian merged, and it basically converts all those dbuf locks from a mutex to a read/write semaphore, and this could potentially help if you have, say, mixed reads and writes. Although we did get a report from one of our vendors saying they didn't get as good performance with it. And then the other commit I want to mention is the dynamically sized dbuf hash mutex array, which I did talk about earlier.
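The idea behind the mutex-to-rwlock change, sketched in userspace terms (again, not the actual OpenZFS code), is that many readers can hold a bucket lock at once and only writers need exclusive access:

```c
#include <pthread.h>

static pthread_rwlock_t bucket_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Lookups only need to read the hash chain, so they can run in
 * parallel with each other under a shared (read) lock... */
static void lookup_block(void)
{
    pthread_rwlock_rdlock(&bucket_lock);
    /* ... walk the chain ... */
    pthread_rwlock_unlock(&bucket_lock);
}

/* ...while inserting or removing a block from the chain still takes
 * the lock exclusively, just as a mutex would. */
static void insert_block(void)
{
    pthread_rwlock_wrlock(&bucket_lock);
    /* ... modify the chain ... */
    pthread_rwlock_unlock(&bucket_lock);
}
```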
Okay, so what else can we do beyond block multi-queue? What are some future things we could do to improve? I've already talked about the double queuing. I think that's probably the biggest win for the least amount of effort: if we can just synchronously call the zvol write path whenever we get a block I/O, I think that's going to be a lot better.
A vanilla request queue: so I kind of lied. There's the old submit_bio interface, and then there's the new block multi-queue interface, but there's actually an interface in between that's just vanilla request queues, just normal request queues. We could support this with not a whole lot of effort, and that could be useful if you're running a kernel that's kind of in between submit_bio and block multi-queue. Like maybe, if you're running RHEL 7 and you really want zvol performance, you could potentially implement request queues in the zvol driver. Then there's the Direct I/O patch.
You have to recompile all your applications to pass in the O_DIRECT flag at open, which no one's going to do. Or you could do this thing called library call interception, where you hijack calls to, say, open or openat, redirect them to your library, and then in your library you add in the O_DIRECT flag and just call the normal open.
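A minimal sketch of that interception trick, assuming a library built with -shared -fPIC -ldl and loaded via LD_PRELOAD (a real version would also need to handle openat, the 64-bit open variants, and O_DIRECT's buffer-alignment requirements):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>

/* Intercept open(2): forward the call to the real libc open,
 * but OR in O_DIRECT so the I/O bypasses the page cache. */
int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...) = NULL;
    mode_t mode = 0;

    if (real_open == NULL)
        real_open = dlsym(RTLD_NEXT, "open");

    /* open() only takes a mode when O_CREAT is passed. */
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    return real_open(path, flags | O_DIRECT, mode);
}
```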
So this is what I want to do for the hackathon tomorrow, and we'll see if it helps. This would only help if you're using it with an application that's using zvols, but it could also help if you have a file system on top of a zvol; you could use this hack there too. So this could speed up things like cp or rsync or tar, things like that.
And then I just listed the pull requests that I had. This got merged in June, I think. Block multi-queue is only in master; we haven't pushed it out to any of the point releases, and again, you have to manually enable it. So that's what I've got. Any questions?
Yeah, the question was: did I try setting zvol_request_sync to one to get around the double queuing? Yes and no. That code path you're talking about is what I used for the synchronous write. I didn't set that value; I just hard-coded it and tried it, and that's what was giving me the problem.