From YouTube: ZFS performance on Windows by Imtiaz Mohammad
Description
From the 2021 OpenZFS Developer Summit
slides: https://docs.google.com/presentation/d/1vcKWOCgw5G3YLiNSolWXg26eK4w7a_iG
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
Awesome, thanks guys, welcome. This is Imtiaz; I work with DataCore. I've been managing a team in Bangalore, and they've been working on ZFS performance on Windows over the last 18 months or so, so I'm going to give a quick snapshot of what we have done there.

All right, a bit about DataCore: we've been doing software-defined storage for over two decades now. Among other things, we have Windows Server based solutions for block storage, so that is what triggered our interest in ZFSin, specifically the zvols, around Q4 of 2019.
Now, we've been working with Jorgen since then, and we have received great support from him. Along with him, we managed to stabilize ZFSin, which is the Windows port of ZFS. Of course, we made zdb functional, and we expanded the zvol size: there was a limitation in the data types which didn't allow us to create zvols of more than, I think, two terabytes or so, and with that change we could go up to seven exabytes. Then we confirmed the integrity of the data that we are writing to ZFS using tools that are homegrown at DataCore. That laid a good foundation to pursue the performance experiments that I am going to talk about in this talk. Alongside, we also managed to introduce perfmon counters.
Perfmon, as you might be aware, is a very popular tool in the Windows ecosystem. We used the WPP framework for doing the tracing, so I'm going to talk about that a little bit as well. And then at the bottom you see the repo; that is where you can get a copy of the ZFSin that we have been using, if you are interested in checking it out. All right, so this has been the focus for us for the last 18 months or so:
Measuring the performance of zvols, especially when deduplication, compression, and encryption are turned on; identifying the bottlenecks; and, of course, fixing the bottlenecks.

So how did we measure the performance? We used this Dell EMC server, which had 128 GB of RAM, 16 cores, and four SSDs of 370 GB each, and the repos that you see there.
When I show you the slide where we compare performance with OpenZFS on Linux, that's the repo that we used, and on Windows, of course, we used the repo that I just talked about. All right, so this is how we configured our pool and zvol. We used up the four SSDs, and on the zvol side we chose to have only metadata cached, we turned on dedup, we used LZ4 for compression, we used sync=always, we used a volblocksize of 128K, and we turned on encryption, using the AES-256-GCM algorithm. The size of the zvol happens to be 500 GB.
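As a rough sketch, a pool and zvol with the properties described above could be created along these lines (the pool name, device names, and key handling are illustrative assumptions, not taken from the talk):

```shell
# Hypothetical recreation of the test pool/zvol; pool and device names are made up.
zpool create perfpool PHYSICALDRIVE1 PHYSICALDRIVE2 PHYSICALDRIVE3 PHYSICALDRIVE4

# 500 GB zvol: metadata-only caching, dedup on, LZ4 compression, sync=always,
# 128K volblocksize, AES-256-GCM encryption.
zfs create -V 500G \
    -o volblocksize=128K \
    -o primarycache=metadata \
    -o dedup=on \
    -o compression=lz4 \
    -o sync=always \
    -o encryption=aes-256-gcm \
    -o keyformat=passphrase \
    perfpool/vol1
```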
And then we used DiskSpd. It's a nice tool which works for Linux as well as Windows; again, the links at the bottom reveal more information. It has a bunch of options: you can specify the block size of the workload, the duration of the test, whether you want to disable write caching, whether you want to track latency or not, the outstanding IOs per thread, the number of threads, whether you want to do random IO or sequential IO, the percentage of writes versus reads, the warm-up time, whether you want random content in every write, and, of course, the target. Typically a very standard set of parameters that you would expect to see in any IO tool.
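For example, a DiskSpd run exercising the options just listed might look like this (the target and the specific values are illustrative assumptions, not the exact test command from the slides):

```shell
# Hypothetical DiskSpd invocation: 128K blocks, 60 s duration, 10 s warm-up,
# caching disabled (-Sh), latency tracking (-L), 8 outstanding IOs per thread,
# 4 threads, random IO (-r), 100% writes, random write content from a 1 MB buffer.
diskspd.exe -b128K -d60 -W10 -Sh -L -o8 -t4 -r -w100 -Z1M \\.\PhysicalDrive5
```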
All right, now, this is how the performance of ZFSin stood a year ago. The first row is deduplication plus compression turned on on the zvol. It was okay; you can see 128K sequential writes were giving us around 400 MB/s. But the problems started when we added encryption: you can see that it pretty much became unusable, and that is what triggered the next set of experiments that we did.
So how did we identify the bottlenecks? Well, one tool we used quite a bit was DTrace. It is something that is well known in the Solaris and Linux communities, and it has been available on Windows for a while; the links at the bottom can give you more information about the tool. Using it, you can find out where it is that you're spending most of your time, and you can use it for tracing kernel code as well, so it gave us quite a bit of insight into what was happening. And then, of course, zpool iostat and arcstat: those have been very well-used tools for measuring performance or doing performance analytics. The only challenge there is that it may not be easy to do the charting or graphing, and that led us to integrate
the output of zpool iostat and arcstat into perfmon. You can see those three highlighted counters there: OpenZFS cache, OpenZFS dedup, and zpool. Those are the three counters that we added to perfmon, and that made it a lot easier to capture the output in the form of the graphs that I'm going to show in a while. Again, if you want to read up a little bit about what zpool iostat and arcstat do, there are a couple of links at the bottom.
So this is how you can choose the vdevs; they are qualified with the full name, and of course you can choose the pools themselves. If you have multiple pools, you can see all of them here and choose whichever you want, and this is how the output looks in tabular format.
We have neatly prefixed the counters with ARC, L2ARC, and slog, as you can see here, and then vdev. So again you can see the active async reads and writes, the pending async writes and reads, the wait count, the wait time, and so on and so forth, all of them neatly organized. If you want to get the counters at the pool level, this is what you could do, and this is the charting that I was talking about.
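Once counters are registered with perfmon, the standard Windows tooling can also log them from the command line; a hypothetical capture with the built-in typeperf utility (the counter-set and instance names here are guesses based on the slide, not verified paths):

```shell
# Sample all pool-level counters every 5 s, 60 samples, to CSV for later charting.
typeperf "\OpenZFS zpool(perfpool)\*" -si 5 -sc 60 -f CSV -o zpool_counters.csv
```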
So this really helped us quite a bit, along with the DTrace tool that I talked about. And then, given that we were just ramping up on ZFS, we also wanted to know how the code paths are hit and which code paths we need to really monitor. How do we learn more about the code? One great way is tracing, but we didn't want the tracing itself to hurt the performance, and that is where we used WPP tracing.
It's a very efficient, very lightweight way of doing tracing in Windows. We modified the ZFSin installer executable; you can see the options here that we added under the trace command: -l 0x4 basically says I want to trace anything that is at level four or below, -s 250 gives the size of the trace file that we want to collect, in MB, and of course -p is the path. So basically we are saying, hey,
I want to turn on a session where I can log the traces at level four or below; it's circular tracing of size 250 MB, and here is where the file resides. When you're done, you can just delete the session using the -d option.
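Put together, the session lifecycle described above might look like this (the flags are those mentioned in the talk; the installer name and trace path are assumptions from the transcript):

```shell
# Start a circular WPP trace session: level <= 4, 250 MB file, given path.
zfsinstaller.exe trace -l 0x4 -s 250 -p C:\traces\zfsin.etl
# ...run the workload to reproduce the behaviour of interest...
# Tear the session down when finished.
zfsinstaller.exe trace -d
```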
We also added a TraceEvent function; you can see there are a few flags here, and also an example. This is, again, Windows-specific code.
You can specify the level here, and then the string that you want to print, with the parameters, just like you would in a printf statement. Alongside, we also instrumented the dprintf function, because there are a lot of dprintfs already in the code, and they map to a default level of four.
So that means that, using this command here, "zfsinstaller trace -l 0x4", we could see all the traces that were coming from dprintf, as well as from any new places that we have added. Again, if you want to explore WPP a little further, there are a couple of links at the bottom that you could pursue. All right, so WPP actually doesn't write the entire string, or whatever you specify in the TraceEvent or dprintf, to the trace file; that is part of what keeps it so lightweight.
The first thing we tried was to use Intel's ISA-L Crypto library, specifically the AES-256-GCM algorithm from it. It's an open-source solution; the link at the bottom talks about it. That gave us a significant performance improvement: it actually leverages the processor advancements, the AVX2 instructions. Along with that, we made a small change in the Storport area. Basically, we tell Storport, hey,
A
If
you
have
rights
of
128k
or
more,
you
could
give
them
in
one
shot
so,
rather
than
chopping
it
into
64k,
we
are
capable
of
handling
128k
rights
at
a
time,
because
that's
what
the
wall
block
size
we
use
underneath.
A
So
it
kind
of
aligns
well,
so
that
change
also
helped
us
a
little
bit
and then we did borrow a couple of changes from upstream. Just to remind you, we started working on this a couple of years ago and we didn't have OpenZFS 2.0 back then, so we had to borrow some nice changes that we learned about from upstream into the ZFSin repo. A couple of things that helped here were the metaslab unload delay
Earlier the default was eight, which means that if you have not seen any activity on a particular metaslab in the last eight transaction groups, you flush that data structure to disk, and that was causing a lot of disk activity. So we bumped it up to 2048, which keeps the metaslabs in memory longer,
so we don't do a lot of IO; that really resulted in good performance improvements there. And then we again borrowed the dirty-data sync percentage from upstream. Earlier the threshold was 64 MB, which means that every time you have 64 MB of dirty data, you flush it to disk. We bumped it up to 20% of 4 GB, which is a configurable number, so that gives us more leeway to
collect more data in a transaction group before writing it to disk. So those are the bunch of changes that we made in ZFSin, and this is what the performance gains look like.
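On Linux, the two tunables just mentioned are exposed as OpenZFS module parameters; a rough sketch of the equivalent settings there (ZFSin exposes its tunables through Windows-specific mechanisms, so these paths are illustrative, not the change the team actually made):

```shell
# Keep metaslabs loaded for 2048 txgs of inactivity instead of the old default of 8.
echo 2048 > /sys/module/zfs/parameters/metaslab_unload_delay
# Start syncing a txg once dirty data reaches 20% of zfs_dirty_data_max (4 GB by default).
echo 20 > /sys/module/zfs/parameters/zfs_dirty_data_sync_percent
```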
If you look at the top three rows, we made significant gains there. In the bottom three, although the percentage gains are there, the base is so small that the percentage gain is not really very helpful as far as practical usage of the driver is concerned, so there's still some room to improve here. The top three are essentially larger block sizes (128K blocks is the size of the data), whereas the bottom three are smaller writes and reads, 4K and 8K.
Now, how does it compare with OpenZFS 2.0 on Linux? That experiment was done, and you can see that the top three rows are still okay-ish, not bad, but again the bottom three is where we still think OpenZFS is doing much better on Linux. Definitely there are some bottlenecks in ZFSin that need to be removed, and it's possible that when we port OpenZFS 2.0 to Windows, those bottlenecks will still exist, so that's something that we continue to look into. All right, so I talked about Jorgen; he's been fantastic in extending support.
Now, we've already migrated the changes that I talked about, the perfmon counters, the WPP tracing, the Storport change, all of them, to a Windows branch of that repo, openzfsonwindows/openzfs. The idea there is that it's kind of a staging area: Jorgen takes all these changes, reviews them, requests any changes that are required, and then bunches them up and upstreams them to 2.0 whenever the time is right. That model has helped us move at pace.
So that's the model that we are following right now. Going forward, we want to look at the 2.0 codebase for Windows, basically try to stabilize it and then work on the performance improvements, just like we did for the ZFSin codebase. All right, so those were the prepared notes; I would be happy to take any questions at this point.
Yes, we did try a lot of combinations, but I don't have the data here, and of course we have limited time to talk about it as well. We did run a lot of other tools too; this was not the only four-corners test that we did. We ran this using several other benchmarks, HammerDB among them, a lot of stuff. But is there a specific question around it?
[Audience] I was just curious, because I know that dedup has its own bottlenecks associated with it, and if you got markedly different results without dedup enabled, it might point to different priorities for bottleneck resolving.
Yeah, the latest test that we did was without dedup and without compression, just 128K writes, and we still see some issues in ZFSin compared to OpenZFS 2.0, especially when you're running it on NVMes.