Description
From the 2021 OpenZFS Developer Summit.
Slides: https://docs.google.com/presentation/d/1kiWrpqvl8T9gHjZmG7Pk0f7mE5LEPCmZnbtUFUsuM9g/edit?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
As most of you are probably aware, the ZFS on-disk structure is a tree of block pointers, and whenever we want to update a file in that structure, we are going to modify a leaf block within an object. Since block pointers checksum what they point to, we have to propagate the change up the tree, all the way to the uberblock, and to make all of this crash-safe, ZFS uses a copy-on-write mechanism.
So, as a concrete example, if we modify a level-zero block, we actually allocate a new level-zero block that contains our changes, and then we need to modify the block pointer for that block in the parent level-one indirect block over here to point to the newly allocated block. So we don't modify that parent block in place either, but create a copy of it as well.
But this time we reuse most of the block pointers, except for the one for the block that we just modified, and this goes on up the tree to the uberblock. Once the new uberblock has been written, it is the new root of the on-disk structure, and some parts of the old tree are now obsolete, so we can get rid of them.
This is quite beautiful to look at in and of itself, but we can't really afford to do this dance for every single VFS operation. Instead, ZFS batches many changes into transaction groups (txgs).
However, with these transaction groups we just got ourselves a new problem, because to make the batching work we must wait for changes to accumulate in DRAM, but at the same time it's not reasonable to block every VFS operation until the batch is large enough. So what we do is let the VFS operation return to user space immediately, and the txg will then be written out in the background.
So the solution that ZFS uses for this is the ZIL. The idea is to extend the on-disk structure with a linked-list head for each object set. The nodes on that list are self-checksumming, and that means we can append to the list independently of transaction groups being synced, because we don't need to update any parent block pointers that point to these nodes. Now, when a VFS operation needs immediate durability, it appends a log record to that list.
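To see why self-checksumming nodes allow appends without touching the main tree, here is a hedged sketch; the field names are illustrative and do not match the actual zil_chain_t layout:

```c
#include <stdint.h>

/* Illustrative on-disk layout of one self-checksumming ZIL list node. */
typedef struct zil_node {
    uint64_t zn_next_blk;    /* location of the next node in the list */
    uint64_t zn_txg;         /* txg this record belongs to */
    uint64_t zn_cksum;       /* checksum embedded in the node itself */
    uint8_t  zn_payload[];   /* variable-length log record */
} zil_node_t;

/*
 * Replay walks the chain from the list head stored in the objset and
 * stops at the first node whose embedded checksum does not verify:
 * that marks the end of the valid log. Because each node validates
 * itself, appending never requires updating a parent block pointer.
 */
void
zil_walk(uint64_t head,
    zil_node_t *(*read_blk)(uint64_t),      /* hypothetical block reader */
    int (*cksum_ok)(const zil_node_t *),
    void (*replay)(const zil_node_t *))
{
    for (uint64_t blk = head; blk != 0; ) {
        zil_node_t *n = read_blk(blk);
        if (n == NULL || !cksum_ok(n))
            break;                          /* torn or unwritten tail */
        replay(n);
        blk = n->zn_next_blk;
    }
}
```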
That log record describes what happened in that operation at the logical level. If everything goes well and the system doesn't crash, then the change is going to be written out in a transaction group, as an update to the on-disk tree, and as part of that new transaction group we write out a new list head that pops off all the log records that are now obsolete, because their changes are part of the tree structure proper.
We need those LWBs (log write blocks) for block alignment, because the log records themselves have variable length, as indicated in this sketch here. The batching of multiple log records into LWBs can also be used to do some tricks on high-latency hardware to make this a little less expensive.
The second thing that complicates things are itxs. The VFS operations actually don't write the log records directly to the on-disk list, because often a VFS operation doesn't even know whether it's synchronous or asynchronous; that is determined later, when we call fsync. So instead, the VFS operation only creates DRAM log record payloads, which we call intent log transactions, or itxs for short.
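In OpenZFS terms, the flow looks roughly like the following simplified sketch (see zfs_log.c and zil.c for the real code; error handling and payload setup are omitted):

```c
/* Simplified: inside the VFS write path, e.g. zfs_log_write(). */
itx_t *itx = zil_itx_create(TX_WRITE, sizeof (lr_write_t) + len);
/* ... fill in the lr_write_t payload describing the logical change ... */
zil_itx_assign(zilog, itx, tx);  /* queue the itx in DRAM, per txg */

/*
 * Only later, if the application actually asks for durability
 * (fsync, O_SYNC, ...), do we persist the queued itxs to the ZIL:
 */
zil_commit(zilog, zp->z_id);
```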
Now let's talk about the performance on modern hardware. The basic principle is that the latency for a synchronous I/O operation is the time spent doing the VFS work plus the time spent on the itx assign and the commit: latency = T_vfs + T_itx_assign + T_commit. The VFS and itx assign parts are purely CPU- and DRAM-bound, but the commit is a little more complicated.
The commit does the work of figuring out which itxs need to be written out to disk, which means building the commit list; then it has to take those itxs from the commit list, convert them into log records and pack them into LWBs; and then it uses the ZIO pipeline to write those LWBs out to the actual storage hardware.
Now, all of these are software steps that add overhead to the actual hardware latency of each LWB that is written. Historically, software overhead wasn't really a problem, because the hardware latency dominated every other component in this equation. But with modern hardware, for example the 3D XPoint technology that is used in the Optane NVMe and PMEM drives,
we get single-digit-microsecond latencies for 4k synchronous random writes, and this means that even a few microseconds of processing time can easily become the performance bottleneck. We can actually observe this by configuring Optane PMEM as a SLOG device with today's ZIL. In the experiment we did here, we used fio to generate 4k synchronous writes onto a zpool with a separate dataset for each fio thread, and the pool had three enterprise NVMe drives configured as top-level vdevs,
so plenty of I/O throughput, and a single PMEM DIMM configured as the SLOG device. What we measured was the wall-clock time spent in each of the latency components, and what we could observe is that the vast majority of the time spent per IOP goes to LWB and ZIO overheads.
Only about 20 percent of the time is spent on DMU and itx work, and only 14 percent of the wall-clock time is spent on the actual interaction with the hardware, waiting for it to store the data; the remaining roughly two thirds are LWB and ZIO software overhead.
There's a lot more nuance to this analysis that I'm not able to present here due to time constraints, and it's also certainly not a workload that is representative of every use case for ZFS, but it's a good example of what is wrong with the current ZIL and why I believe we should re-architect it for modern hardware. To summarize, there were two conclusions that I drew from this experiment. The first was that batching log records into LWBs is not necessary, at least not always: on PMEM, which is byte-addressable,
we don't need to adhere to block boundaries at all. And even where we do, because we are on NVMe drives for example, tricks like the batching and the LWB timeout and all the stuff that we do for high-latency storage hardware don't really buy us a latency advantage anymore; they are more of an overhead at this point.
The second conclusion is that the ZIO pipeline adds much overhead for very little benefit. The problem is that all the context switching that's going on in there adds latency and latency jitter, and most likely it's also not particularly helpful for data locality. Essentially, the entire design is geared more towards high throughput than low latency.
So, given these observations, I think that a new ZIL design is needed, and it should have the following properties. First, it should abandon LWBs as a concept and store individual log records instead. It may or may not do some batching under the hood to optimize things, but to get the lowest latency for fully synchronous workloads, we should just store individual log records.
Second, we should no longer have pointer chains on disk like we do with LWBs today. Instead, we should defer the serialization work, at least for the I/O operations, as much as possible to the time of replay; this will enable more parallelism on the write path for independent operations. And third, we should bypass the ZIO pipeline to avoid its overheads and write directly to the storage hardware.
We don't give the SLOG device's space to the SPA; instead, we let the ZIL consume the space of the hardware directly. The ZIL then constructs a storage substrate on top of it, which is used to store all the log records of all datasets in the pool. That storage substrate behaves like an unordered set of log records: you can put records into it, you can iterate over it in an arbitrary order, and it will automatically garbage collect itself
in the background, to avoid running out of space; it uses the log records' last-synced transaction group to determine which log records need to be garbage collected. On top of this very, very minimal interface, we then implement the actual ZIL functionality. The idea is that each dataset adds a bunch of metadata to each log record, and if we should actually crash, the replay code will use that metadata to figure out which records need to be replayed, and in what order.
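As a sketch, the substrate interface described here could be as small as the following (the names are illustrative, not the prototype's actual API):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct substrate substrate_t;   /* opaque; one per pool */

/* Durably insert one log record (a crash-consistent append). */
int substrate_put(substrate_t *s, const void *rec, size_t len);

/* Iterate over all surviving records in arbitrary order (replay). */
void substrate_iterate(substrate_t *s,
    void (*cb)(const void *rec, size_t len, void *arg), void *arg);

/*
 * Tell the substrate which txg has last synced; records belonging to
 * that txg or older are obsolete and may be garbage collected in the
 * background.
 */
void substrate_set_last_synced_txg(substrate_t *s, uint64_t txg);
```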
Here's a quick visualization. At the time we write the ZIL, zil_commit will be writing the log records in some logical order, but the storage substrate is free to organize its log space however it sees fit; it just has to ensure that it will find those log records again if we crash. And while we are writing new log records, the garbage collection will kick in in the background.
This will just be how the thing operates all the time, and we hope that there is free space at the physical level at all times. Now, let's assume that we crash. The replay code will scan through the storage substrate's contents to find the records of the dataset in question, filter out obsolete log records, and then reconstruct the replay sequence.
That replay sequence is then applied to the dataset to recover the committed state, just as it is with the current ZIL today. And if the storage substrate loses some log records, for example due to bit rot or data corruption, then the replay algorithm has to deal with the fact that those log records won't show up when it scans the storage substrate.
So data integrity is also covered here. Now, why is this more performant? There are two main reasons. The first is that we've eliminated write-after-write dependencies on the I/O path for independent writes. Of course, if the writes actually depend on each other, they must have a mechanism to wait on each other so that replay can succeed, but at least we now have the option to do independent writes fully in parallel.
So now that I've established the high-level idea, let's make things a little more concrete with an example. What we are seeing here is a visualization of the storage substrate's contents and the metadata of each log record. On the x-axis we have the generation number, which we use to encode logical dependencies between records; on the y-axis we have the transaction group of the individual records. The name of each record, for which we use a letter, is also a piece of metadata.
It identifies the entry uniquely within a generation, and for clarity we are using unique names for all log records in this example. At the beginning of our example the storage substrate was empty, and now we've added a log record, which we call A; it is for a change in transaction group 4, in generation 11.
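Put as a data structure, the per-record metadata in this example might look like the following sketch (field names are hypothetical; the counters are explained further below). The constant 3 matches the ZFS invariant that at most three transaction groups can be unsynced at any time:

```c
#include <stdint.h>

#define TXG_CONCURRENT_STATES 3    /* open, quiescing, syncing */

typedef struct record_meta {
    uint64_t rm_txg;       /* txg of the change (y-axis above) */
    uint64_t rm_gen;       /* generation number (x-axis): encodes
                            * logical dependencies between records */
    uint64_t rm_id;        /* unique id within the generation (the
                            * letter in this example) */
    uint64_t rm_counters[TXG_CONCURRENT_STATES]; /* see below */
} record_meta_t;
```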
Then we write another record, C, into the next generation, 13. And while we are writing new records, garbage collection might kick in in the background and remove the records for transaction group 4, because transaction group 4 has been synced in the background. So this will happen, but the garbage collection is fully independent of the write path. So even while garbage collection is going on, we are still writing another record, D, and D is actually for the same generation as C.
A
Now
we
continue
to
write
records,
e
and
f,
which
also
share
the
same
generation
number
14,
so
e
and
f
don't
depend
on
each
other,
but
e
and
f
both
depend
on
c
and
d
for
replay
and
finally,
we'll
write
records,
g
and
h
into
a
new
generation,
and
it's
also
starting
some
some
new
txts.
So
the
number
is
going
up
there
note
that
at
this
point
we
can
observe
that
garbage
collection
is
going
to
kick
in
pretty
soon
because
at
any
given
point
there
can
only
be
three
unsung
transaction
groups.
So at this point it's clear that transaction group 5 has synced out, because otherwise there couldn't be an entry for a log record in transaction group 8. Now let's recap what metadata we've observed here: we have seen the transaction groups, we have seen the generation numbers, and we have seen the unique IDs for each entry, which are represented by the letters. However, this is not really sufficient to detect lost entries at replay time.
So for this we use a counter per transaction group, and to explain how these are computed, we're going to run through the example again and show which counters each individual log record carries. For the records of the first generation, all the counters are set to zero, and after we are done writing a generation,
we sum up how many records were written in each transaction group, and this running sum is kept in a table called the counters table. For the next generation's records, we then use the counters table's contents as the counter values for each individual log record. We can see this with entry B: we've copied the table into the entry B, and once B's generation, generation 12, is done, we do the accounting again and account for the fact that we have written another log record in transaction group 5.
Remember, this is a running sum, so we don't reset the table after a generation. Now, if there are multiple records in a generation, like in generation 13, we use the same table in every record of that generation, because, as we'll see later, these don't depend on each other; they only depend on the last generation and every generation before it.
A
So
again
we
just
copy
over
the
table
and
at
the
end
of
the
generation
we
account
for
the
records
written
and
this
time
we've
written
two
records.
So
we
have
to
bump
two
counters
here
now
and
this
this
will
be
like
again.
It's
the
same
procedure
for
generation
14.
We
do
the
accounting
again
and
we
write
records
for
generation,
15
15.
Now, there's one interesting property here: the table that we store for generation 15 only contains the counters for transaction groups 8, 7 and 6. We've dropped the counters for transaction groups 5 and 4, and the reason is that we know that these have synced, so we know that we won't have to validate those counters later during replay.
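A hedged sketch of this write-path accounting, reusing the hypothetical record_meta_t and TXG_CONCURRENT_STATES from the earlier sketch:

```c
#include <stdint.h>

typedef struct counters_table {
    uint64_t ct_base_txg;                     /* oldest unsynced txg */
    uint64_t ct_count[TXG_CONCURRENT_STATES]; /* records written per txg */
} counters_table_t;

/*
 * Every record of a generation carries the table as it stood when
 * that generation started.
 */
static void
stamp_record(record_meta_t *rm, const counters_table_t *ct)
{
    for (int i = 0; i < TXG_CONCURRENT_STATES; i++)
        rm->rm_counters[i] = ct->ct_count[i];
}

/* After a generation has been fully written: update the running sum. */
static void
account_generation(counters_table_t *ct, const record_meta_t *recs, int n)
{
    for (int i = 0; i < n; i++)
        ct->ct_count[recs[i].rm_txg - ct->ct_base_txg]++;
    /*
     * When a txg syncs, ct_base_txg advances and the synced txg's
     * slot is dropped, as with generation 15 above.
     */
}
```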
Okay, so we've seen how the counters are computed; let's now put them to use and exercise the replay code path.
Let's assume that we crashed after transaction group 5 was written out, but before it was garbage collected. In that case, the records that need replay are D, F, G and H. We ignore B and E, even though they are still present, because their changes are already part of the main tree structure; if we attempted to replay them, the replay callback would fail, because the replay actions encoded in these log records are not idempotent.
A
So
our
plan
is
to
replay
d,
f,
g
and
h
and
then
use
the
counters
to
detect
lost
records
along
the
way.
Let's
first
cover
the
happy
case
where
we
haven't
lost
any
record.
In
that
case,
we
so
we
initialize
our
counters
table
to
zero
during
replay
and
then
look
for
the
first
records
counters
and
for
each
counter
that
is
greater
than
the
transaction
group
at
which
we
crashed.
The
counters
must
match
what
we
have
in
the
replay
table,
so
these
counters
are
all
from
generations
that
are
5
or
older.
so there's nothing to check, and we can just replay D. After doing the replay action, we update the counters: once we've replayed all entries of a generation, we update the counters table just as we would do on the write path.
Now, moving on to F: the counters for transaction groups 5 and 4 can be ignored again, but the counter for transaction group 6 must match what we have in the table, and that is in fact the case. So we can replay F, do the accounting, and move on to entry G. Again we compare the counters and observe that they match, so we can replay G as well, and H we can replay too. Great, so that was easy. Now let's look at the case of actual data corruption.
Suppose some bit rot has corrupted records E and F. Then the storage substrate would not show them to us when we scanned it to construct the replay sequence; we wouldn't even know that they existed in the first place, because the storage substrate doesn't tell us about them. So for now, our replay sequence is going to look like D, G and H.
Now we want the replay algorithm to replay record D, but it mustn't replay records G or H, because they depend on F, courtesy of the generation numbers. So let's see how this works out: we initialize the counters table and compare the counters for record D. They match, or rather they can all be ignored, so we can replay D and do the accounting.
When we get to G, however, its counter for transaction group 6 doesn't match our table, because we never accounted for F, so we mustn't replay G; and H we cannot replay either, and we'll stop replay at this point, because we've reached the end of our tentative replay sequence. The end result is that we've replayed as much as possible, given the constraints of the generation numbers. Great. And the important thing is that we can actually present witnesses for a missing record.
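The replay-side check can then be sketched as follows (again hypothetical code, reusing the structures from the earlier sketches): rebuild the counters table while replaying, and treat any mismatch in a txg newer than the crash txg as a witness for a lost record:

```c
/* Returns 0 if rm may be replayed, -1 if a lost record was detected. */
static int
replay_check(const record_meta_t *rm, const counters_table_t *ct,
    uint64_t last_synced_txg)
{
    for (int i = 0; i < TXG_CONCURRENT_STATES; i++) {
        uint64_t txg = ct->ct_base_txg + i;
        if (txg <= last_synced_txg)
            continue;    /* that txg synced; its counter is irrelevant */
        if (rm->rm_counters[i] != ct->ct_count[i])
            return (-1); /* e.g. G above: F was never accounted for */
    }
    return (0);          /* safe to replay this record */
}
```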
But what we are going to focus on right now is the concrete implementation, and later on, benchmarks. The short name for this entire project is ZIL-PMEM, and it's the product of my master's thesis.
The goal of the thesis was to design a system that makes synchronous I/O in ZFS as fast as possible using persistent memory, so this was really a no-compromise approach to making synchronous I/O fast. We've already covered the high-level ideas and the algorithms, so the rest of the talk is about the implementation. First, though, a few words about persistent memory itself.
You might know persistent memory under a different name: there is non-volatile main memory, or storage class memory; the naming depends on which branch of industry or academia you're following. In the case of "persistent memory", it's a concrete product name, branded by Intel.
The idea is generally always the same: instead of speaking a storage protocol like NVMe, you map the PMEM directly into the address space and then use normal load and store instructions, plus maybe some cache flushes and so on, to perform I/O to it. So there is no storage protocol anymore.
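As a minimal user-space flavored sketch of that technique (plain stores, then cache-line flushes, then a fence; this is the general pattern, not the prototype's code, and it assumes a CPU with CLWB, e.g. compiled with -mclwb):

```c
#include <stdint.h>
#include <string.h>
#include <immintrin.h>

#define CACHELINE 64

static void
pmem_persist(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);                  /* normal stores to the mapping */

    uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)dst + len; p += CACHELINE)
        _mm_clwb((void *)p);                /* flush each dirty cache line */

    _mm_sfence();                           /* order flushes before "done" */
}
```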
The second advantage of PMEM is that it's byte-addressable. This is ideal for the ZIL, because ZIL log records have variable length and are often very short, so with a ZIL on PMEM we don't really have to worry about padding, block alignment, or the space wastage that might result from padding up to a certain alignment.
What we haven't covered yet is how we can actually reach that hardware in an operating system like Linux. First of all, there are several operating modes for PMEM, and we're going to use the App Direct mode in the fsdax configuration here; you can just ignore those details if you're new to the topic. What's important is that in that mode the PMEM shows up as a block device node in the device fs, with a kernel driver that provides this device.
Block device consumers can then use the corresponding APIs to check whether the block device is actually PMEM, and if that is the case, they can establish a direct memory mapping to the PMEM. Once the mapping is established, the consumer can issue load and store instructions and cache flushes and so on directly to that memory mapping, and the operating system is completely out of the picture.
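In kernel code, the shape of this is roughly the following (simplified; the exact DAX function signatures vary across kernel versions, so treat this as an illustration rather than a copy-paste recipe):

```c
#include <linux/dax.h>
#include <linux/blkdev.h>

/* Returns a kernel virtual address for nr_pages of PMEM, or NULL. */
static void *
map_pmem(struct block_device *bdev, pgoff_t pgoff, long nr_pages)
{
    struct dax_device *dax_dev = fs_dax_get_by_bdev(bdev);
    void *kaddr;
    pfn_t pfn;

    if (dax_dev == NULL)
        return NULL;    /* not PMEM: fall back to regular block I/O */

    /* After this, plain loads/stores (plus flushes) hit the PMEM. */
    if (dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, &pfn) < 0)
        return NULL;
    return kaddr;
}
```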
What we see in this screenshot here is an example from the ext4 source code, where there is a conditional optimization for PMEM: a different implementation of how we read data from a file, if the file is on an ext4 instance that is deployed on PMEM. Great. So, given this framework, the goal for ZIL-PMEM was to make it fully transparent to the user.
Now, we can't throw away the old code for a bunch of reasons, so the first step was to refactor the ZIL so that different persistence mechanisms can coexist at runtime; the result is a concept called ZIL kinds, and I'll give more details about that in a minute. After that, I actually implemented the high-level ideas that I presented earlier.
If you remember, we needed a storage substrate for PMEM and an implementation of the higher-level algorithms. The storage substrate on PMEM is called PRB, the high-level algorithms are implemented in a code module called the handle, and the handle data structure exists once per dataset instance. Now, to make these data structures easier to test, I implemented PRB and the handle as standalone modules, so there is some glue code necessary to integrate them into the ZFS code base.
So now, with this refactoring in place, I could introduce a vtable that decouples the persistence API from the general ZIL API, and we can have different implementations of this vtable coexist at runtime; the name for these different implementations is ZIL kinds. Now, any kind will need some place to store per-dataset information, like the LWB list head for ZIL-LWB; ZIL-PMEM, for example, stores some metadata there so that it can find its log records again.
A
So
basically,
we
needed
a
place
for
this
and
the
place
the
ideal
place,
for
this
is
the
zil
header
and
so
with
the
kinds
this
header
now
becomes.
A
tech
union
and
the
union
tag
is
the
enum
value
that
represents
the
silk
kind.
So
this
also
means
that
when
we
decide
we
need
to
decide
which
v
table
to
use
at
runtime.
We
just
refer
to
the
union
tag
that
we
find
in
the
zill
header.
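A sketch of what this looks like (the names are illustrative, not the actual patch):

```c
#include <stdint.h>

typedef enum zil_kind {
    ZIL_KIND_LWB  = 1,          /* the classic LWB-based ZIL */
    ZIL_KIND_PMEM = 2,          /* ZIL-PMEM */
} zil_kind_t;

/* The on-disk ZIL header as a tagged union. */
typedef struct zil_header {
    uint64_t zh_kind;           /* the union tag: a zil_kind_t value */
    union {
        uint64_t zh_lwb_meta[16];   /* e.g. the LWB list head */
        uint64_t zh_pmem_meta[16];  /* ZIL-PMEM's per-dataset metadata */
    } zh_u;
} zil_header_t;

/* Per-kind persistence vtable. */
typedef struct zil_vtable {
    void (*zlv_commit)(void *zilog);
    void (*zlv_replay)(void *zilog);
} zil_vtable_t;

extern const zil_vtable_t zil_lwb_vtable, zil_pmem_vtable;

static const zil_vtable_t *
zil_vtable_for(const zil_header_t *zh)
{
    switch ((zil_kind_t)zh->zh_kind) {
    case ZIL_KIND_LWB:  return (&zil_lwb_vtable);
    case ZIL_KIND_PMEM: return (&zil_pmem_vtable);
    default:            return (NULL);  /* unknown kind: refuse to use */
    }
}
```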
A
Now
I
also
want
to
spend
a
few
minutes
on
the
storage
substrate
implementation,
because
I
really
think
it
highlights
how
simple
that
layer
can
be
when
prb
is
initialized.
It
takes
a
pima
mapping
and
a
partition
and
it
partitions
it
into
equal
sized
chunks.
Each
of
those
chunks
is
then
an
append,
only
sequence
of
log
records
so
that
when
we
want
to
write
a
record
to
prb,
we
can
just
pick
any
record
that
has
sufficient
space
and
insert
the
lock
record
at
the
tail
of
that
sequence
in
a
crash,
consistent
manner
and
for
garbage
collection.
And if we have a sufficient number of chunks, this can be a quite performant implementation. For example, we can have one open chunk per CPU, so that there is no contention between parallel writers for access to the chunks, and if we make the chunks large enough, we also minimize how often different writers need to coordinate, when they need a new chunk or when the garbage collector runs and so on. And this is really all there is
to this design. Of course, there are some DRAM data structures for bookkeeping and garbage collection and so on, but these are really boring details and they don't have much overhead. The key observation is that, at least for PMEM, the storage substrate implementation is very, very thin: easy to understand, easy to audit and, most importantly, very low overhead.
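A sketch of the chunk scheme (illustrative, not the thesis code), reusing the pmem_persist() helper from earlier:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct prb_chunk {
    uint8_t *ch_base;   /* start of this chunk in the PMEM mapping */
    size_t   ch_size;   /* all chunks are equal-sized */
    size_t   ch_used;   /* append offset: the tail of the sequence */
} prb_chunk_t;

/*
 * Append one log record to an open chunk. Each CPU owns one open
 * chunk, so in the common case there is no contention; on -1 the
 * caller grabs a fresh chunk (the rare, coordinated path).
 */
static int
prb_chunk_append(prb_chunk_t *c, const void *rec, size_t len)
{
    if (c->ch_used + len > c->ch_size)
        return (-1);

    /*
     * Persisting the record before advancing the in-DRAM tail keeps
     * the append crash-consistent: a torn record is simply not found
     * when the chunk is scanned after a crash.
     */
    pmem_persist(c->ch_base + c->ch_used, rec, len);
    c->ch_used += len;
    return (0);
}
```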
Sorry about the bullet points here. Okay, so the next step was to wire this up into a prototype that could be used to run the actual benchmarks. If you remember, the goal was to activate ZIL-PMEM automatically when we add a PMEM SLOG device. The problem with this is that when the pool is already instantiated, we'd have to switch over vtables while they are potentially in use, and I didn't have the time to cover this during the thesis. So the workaround was to determine the ZIL kind ahead of time, when we create the zpool.
So we have this ugly module parameter here: when we set it to pmem and then create the pool, we check that the vdev config matches, i.e. that it contains exactly one PMEM SLOG device, and once that's done, we set the root dataset's ZIL kind to ZIL-PMEM.
And now, whenever we import the pool, we look at this root dataset ZIL kind and recover the pool-wide ZIL kind from that field; of course we check that the vdev config still matches, and then we instantiate PRB on top of the PMEM SLOG's space, and then the individual zilog instances and the commit routine
just use a pointer in the zilog to access the PRB. Obviously we also need to prevent operations like zpool remove of the SLOG device, because we don't want to pull the PMEM out from underneath PRB while it's still in use. All of this is quite hacky, I will freely admit that, and it probably needs more refactoring, but it was sufficient to get the job done, in the sense that we could run benchmarks on top of it.
The first thing that the commit routine does is acquire a mutex that is per dataset, and then it uses the itx code to get the commit list. Then it walks over the commit list, and for each itx on the commit list it converts it into a log record representation and writes those log records into the handle's PRB data structure, so into the storage substrate, and we pick a new generation for each record.
Picking a new generation per record gives us exactly the same kind of dependencies that we have with LWBs, and this was just the safest choice to use here. Now, when we are done with this, we release the mutex and the next commit call can start writing. And of course, if the datasets are independent, then they only need to coordinate at the substrate level, and we've seen that this can be made very efficient.
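Put together, the commit path described here looks roughly like this sketch (the dataset type and the helpers are hypothetical; only the general itx machinery exists as such in OpenZFS):

```c
static void
zilpmem_commit(dataset_t *ds)   /* dataset_t, helpers: hypothetical */
{
    mutex_enter(&ds->ds_commit_mtx);    /* one commit per dataset at a time */

    itx_list_t cl;
    get_commit_list(ds, &cl);           /* reuses the existing itx code */

    for (itx_t *itx = list_head(&cl); itx != NULL;
        itx = list_next(&cl, itx)) {
        log_record_t lr;
        itx_to_log_record(itx, &lr);          /* DRAM conversion */
        lr.meta.rm_gen = next_generation(ds); /* new generation per record:
                                               * same ordering as LWBs */
        prb_chunk_append(cpu_open_chunk(), &lr, lr.lr_len);
    }

    mutex_exit(&ds->ds_commit_mtx);     /* next zil_commit() may proceed */
}
```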
Now, there are some points where we can improve this. One is to use something like the commit waiters, so that we can get a little more parallelism: we can do this by pre-computing the generation ranges that each writer will use and then letting the writers do the writing in parallel. The catch is that we'd need to change the APIs a little under the hood, so this can be done, but just hasn't been done yet.
We start with the primary workload of the thesis, which was 4k synchronous random writes with a separate dataset per thread. Again, I know this is not really representative of most ZFS workloads, but it's surely a torture test for the ZIL. In this experiment I compared the performance of four different configurations.
They were all on a zpool with three enterprise NVMe drives and one persistent memory DIMM as SLOG device. The LWB and PMEM configurations use the respective ZIL kinds, and the async configuration has sync=disabled set, so it is meant as an estimate for the upper bound of what we could achieve if the persistence code were maximally efficient.
It's just a software change, and we can observe that ZIL-PMEM scales up fairly well, to 400,000 IOPS with four threads, which is still a 5.5x speedup over what we can achieve with ZIL-LWB.
Now, for higher thread counts, we see it align with the fsdax curve over here, and the async curve actually shoots up way higher. What this means is that we are reaching the PMEM throughput limit at this point, and if we had more PMEM bandwidth available, we could potentially land even higher IOPS.
If we look a little deeper at how the latency is distributed, we can observe that with ZIL-PMEM the ZIL persistence now only takes about 25 percent of the wall-clock time of each IOP, whereas it was about 80 percent with the LWB-based ZIL. In turn, this means that the asynchronous part, the DMU and the itx work, now becomes the dominant component in the latency equation, and we need to optimize there.
I also did some more realistic benchmarks. Most of you would probably call those still fairly academic, and I would agree, but it's really the best I had. I'm not going to go into each of them in detail; the gist is that they are all doing synchronous writes in one way or another, either through write-ahead logs or metadata-heavy sync operations and so on, and, in contrast to the previous workload, they all run on one dataset.
One workload saw a 5.8x speedup over the LWB ZIL, Redis a 2.7x speedup, MariaDB a 2x speedup, and there were some workloads that didn't benefit as much, but in general it's a pretty good result. If you crank up the scaling factor, so the number of threads that are simultaneously making requests to these kinds of servers, doing put operations and so on,
what we could observe is that ZIL-PMEM doesn't really scale linearly; if it did, the orange bar for scaling factor four would have to be four times as high as the bar for scaling factor one. But there's still a substantial improvement, so there is some scalability there, and ZIL-PMEM still performs better than the LWB ZIL in most of the workloads.
In this one we also see a big advantage for ZIL-PMEM in these workloads over here, and I think the reason is that we have less write amplification with ZIL-PMEM: the block-based ZIL sees a block device underneath and blows everything up to 4-kilobyte blocks, whereas ZIL-PMEM uses the PMEM natively and writes the small log records directly.
Now, these numbers are quite impressive, but we should talk about some of the drawbacks of ZIL-PMEM before we wrap up, although we're running short on time, so I'll skip over some of those. First, the prototype that I developed in the thesis has a bunch of weaknesses; in particular:
We only have one implementation of this architecture, so we don't really know whether it's a leaky abstraction. Then there's a problem with workloads that only do occasional sync operations; we haven't really looked at those in the benchmarks, but there are a bunch of inefficiencies in the implementation there, and we could work around them, but haven't done that yet. There are also unaddressed performance issues with parallel fsyncs on the same file, because we could get some performance improvements there if we used something like commit waiters.
Then there are some features that are missing: support for native encryption would be a must, I think, if we consider upstreaming something like this, and mirroring of PMEMs, so that we get some redundancy for this log, is also not implemented yet. Also, the glue code is quite hacky, as you may have noticed, so we should probably revisit some design decisions there. And, more importantly, the design also has some inherent weaknesses; I would like to thank Alexander Motin specifically for his feedback on this.
But if we were fine with relaxing those guarantees, we could potentially get a lot more performance, and maybe we can discuss in the breakout room whether relaxing the guarantees is an option for us. Another aspect is the amount of DRAM allocations and DRAM-to-DRAM copies that are happening in the ZIL:
we don't have more memcpys than the LWB ZIL, but we don't have fewer either. There are also some maintenance concerns: if we have this custom space allocation going on in the storage substrate, then we need to think twice every time we do any tricks with space allocation elsewhere. For example, zpool checkpoint is a candidate that generated some headaches during the design phase.
There is also no graceful fallback mechanism if this log is full, so if datasets cannot be replayed immediately. This is an actual problem on small persistent memory devices like NVDIMM-N; it's not so much a problem on Optane, because the smallest unit for Optane is 128 gigabytes. And the last thing is that the Linux DAX APIs are GPL-only, so we cannot actually use those APIs in the ZFS module upstream unless you set the module license to GPL.
Regarding my personal commitment to all of this: as I said, this is all content from my master's thesis, and my employer is not involved with any of this at this point. So currently I'm only able to contribute to this in my spare time, but I'm quite eager to explain stuff to people and to help if there's any interest in upstreaming some of the work. So yeah, thanks for your attention, and I'm looking forward to the discussions in the breakout room.
[Q&A] Yeah, we have this option: upstream ZFS has dedicated allocation classes right now, so in theory we should be able to change this so that we pre-allocate space on any vdev; that should work. The problem is that the space needs to be directly addressable, so essentially we would need to do large allocations, spa-max-blocksize allocations, and then we'd need some way to ask: given this block pointer to this spa-max-blocksize allocation, please give me the direct memory mapping, if we want to do this for PMEM. So yeah, at some point we need to bypass the I/O pipeline.