From YouTube: Shared L2ARC by Christian Schwarz
Description
From the 2022 OpenZFS Developer Summit https://openzfs.org/wiki/OpenZFS_Developer_Summit_2022
Slides: https://docs.google.com/presentation/d/1Fp3yLnpFKyPyG_4lViqMbg-q1p9o2DwW5lPjOfcvwrc/edit?usp=sharing
So we are a software-defined, scale-out file storage solution, and the core functionality that we provide is NFS, SMB, and multi-protocol shares that are accessed by clients. Under the hood, those shares map to ZFS datasets that are spread across many zpools, and the architecture works like this: there is a compute part, meaning ZFS and the protocol servers, which run inside VMs on top of the Nutanix hyper-converged infrastructure, and there is a storage part, meaning the vdevs of each zpool.

Each zpool that we have in the system is imported in one VM at a given time, but the zpools can move around between the VMs for load balancing. Moving those zpools around is cheap, because we just need to coordinate which VM owns a given zpool at a given time. We don't need to move the data, because the data resides somewhere in the hyper-converged infrastructure.
We just consume it via iSCSI, so ZFS's role in this architecture is really not that of a physical volume manager; instead, it's a POSIX-compliant file system with nice enterprise-level features that we use to provide useful functionality to users. Now, on to the actual project that sparked this shared L2ARC idea.
We had a project called Files extended buffer cache, and the goal of this project was to accelerate read-heavy workloads whose working set exceeds the size of the ARC of an individual compute VM. So the plan was to take a local disk of the VM's current host system and attach it to the VM, and then we would use that host-local disk as an L2ARC inside the VM, which would allow us to serve the reads from that local disk.
That approach had two problems. The first one was that if we add the host-local disk as an L2ARC to the zpools, then we can no longer move the zpools around, because we now have this host-local device attached to them, and what we wanted to avoid, for various reasons, is having a bunch of L2ARC add/remove operations in the load-balancing path. But the even more severe problem was that the L2ARC is a per-zpool construct, and in our product we cannot really predict which share or zpool will need the acceleration the most.
A
So
that
would
lead
to
underutilization
in
those
cases
and
where
it
was
would
actually
be
more
attractive
to
just
put
the
entire
capacity
and
as
dynamically
shared
among
the
important
zippoids
and
then
the
Zippo
that
is
currently
hot
could
use
sort
of
the
capacity
Before we get to the solution, let me give a quick recap of how the ARC works internally. The ARC, from a high-level point of view, maps hash keys, which consist of the SPA load GUID, the data virtual address (DVA), and the birth transaction group of a given piece of data in the normal vdevs. It maps those to a structure called arc_buf_hdr_t; usually we refer to that structure as the ARC header. The ARC header points to the storage location of the cached data in the L1 and the L2 ARC. In the L1 ARC, the storage location is identified by a pointer to the DRAM buffer.
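To make that mapping concrete, here is a minimal sketch in C of the relationship just described. It is an illustration only, not the actual OpenZFS definitions; the real arc_buf_hdr_t and its L1/L2 sub-headers carry many more fields, and every name prefixed with example_ is made up for this sketch.

```c
#include <stdint.h>

/* Illustrative sketch of the ARC header mapping described above; not the
 * real OpenZFS arc_buf_hdr_t, which carries many more fields. */
typedef struct example_dva {
	uint64_t	dva_word[2];		/* encodes vdev and offset of the block */
} example_dva_t;

typedef struct example_arc_hdr {
	/* Hash key: pool (SPA load GUID) + DVA + birth transaction group. */
	uint64_t	b_spa;
	example_dva_t	b_dva;
	uint64_t	b_birth;

	/* L1 location: DRAM buffer; becomes NULL once the L1 copy is evicted. */
	void		*b_l1_data;

	/* L2 location: which cache device holds the copy, and at what offset. */
	struct example_l2arc_dev *b_l2_dev;	/* NULL if there is no L2 copy */
	uint64_t	b_l2_daddr;
} example_arc_hdr_t;
```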
There's a kernel thread called the L2ARC feed thread, and that thread iterates over the L1 buffers that are eviction candidates of the L1 ARC, meaning they are at the tail of the most-recently-used or most-frequently-used lists that the ARC maintains. So we have our regular L1 buffers, and then there are a bunch of eviction candidates, and this L2ARC feed thread iterates over those eviction candidates and applies the following rule: it writes each candidate to a cache device of the same zpool that the buffer belongs to, and records the L2ARC offset in the header. Now, when it comes to eviction of the L1 buffer, we keep the ARC header structure in DRAM and we remember the offset in the L2ARC, so that when we later read that location, we will get an L1 miss, because there is no longer an L1 buffer, but the read can still be served from the L2ARC location we remembered.
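In rough pseudocode, the per-pool behavior just described looks like the sketch below. It is a simplified model with made-up types and a hypothetical example_write_to_cache_vdev() helper; the real logic lives in the L2ARC feed/write path and additionally handles write-size limits, headroom, prefetch exclusion, and locking.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative model only; these are not OpenZFS definitions. */
typedef struct example_hdr {
	uint64_t	spa_guid;	/* pool the cached block belongs to */
	void		*l1_data;	/* DRAM copy, dropped on L1 eviction */
	int		has_l2;		/* does a copy exist on the cache device? */
	uint64_t	l2_daddr;	/* offset of that copy on the device */
	struct example_hdr *next;	/* next eviction candidate */
} example_hdr_t;

typedef struct example_pool {
	uint64_t	spa_guid;
	example_hdr_t	*evict_tail;	/* tail of the MRU/MFU lists */
} example_pool_t;

/* Hypothetical helper: copy a buffer onto the pool's cache vdev, return offset. */
uint64_t example_write_to_cache_vdev(example_pool_t *pool, const void *data);

/*
 * Upstream rule: the feed thread walks one pool's eviction candidates and
 * writes them to that same pool's cache device, recording the offset in
 * the header. When the L1 copy is later evicted, only l1_data is dropped;
 * the header stays in DRAM, so a later read misses in L1 but can still be
 * served from l2_daddr.
 */
static void
example_feed_one_pool(example_pool_t *pool)
{
	for (example_hdr_t *h = pool->evict_tail; h != NULL; h = h->next) {
		if (h->spa_guid != pool->spa_guid)
			continue;	/* a cache device only caches its own pool */
		h->l2_daddr = example_write_to_cache_vdev(pool, h->l1_data);
		h->has_l2 = 1;
	}
}
```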
That's how the L2ARC works from a high level. Now, let's see how this system behaves if we have multiple zpools: each zpool requires its own cache device.
That is an invariant that upstream ZFS currently imposes, so we have to supply a cache device per zpool, and then the rule from the previous slide still applies: a cache device in a given zpool will only host L2 buffers for that zpool, because that's what the L2ARC feed thread does right now.
Again, we just don't know which zpool will need the acceleration, so it's better to pool all the cache device capacity that we have and share it among all the zpools, and that's what we did. With shared L2ARC it looks like this: we have the vdisk-based zpools, which just have normal-type vdevs, and then we have a special zpool called nutanix fsvm local L2ARC. That name is subject to review, obviously, and that pool only consists of the host-local devices that we attach to the VM.
We took the L2ARC feed thread and changed it so that it no longer applies the rule where it partitions the buffers per pool; instead, we changed it so that it feeds the buffers from any zpool in the system to the L2ARC zpool's cache vdevs. So effectively, the buffers are now spread over those cache vdevs, and that was really all we needed to do to solve our problems and make this work.
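A minimal sketch of that change, reusing the illustrative model from the earlier sketch (made-up names, not the actual patch in the draft PR):

```c
#include <stdint.h>
#include <stddef.h>

/* Same illustrative model as before; not OpenZFS definitions. */
typedef struct example_hdr {
	uint64_t	spa_guid;	/* still records the buffer's own pool */
	void		*l1_data;
	int		has_l2;
	uint64_t	l2_daddr;
	struct example_hdr *next;
} example_hdr_t;

typedef struct example_pool {
	uint64_t	spa_guid;
	example_hdr_t	*evict_tail;
	struct example_pool *next_imported;	/* list of imported pools */
} example_pool_t;

uint64_t example_write_to_cache_vdev(example_pool_t *pool, const void *data);

/*
 * Modified feed: walk the eviction candidates of every imported pool, but
 * always write to the cache vdevs of the dedicated L2ARC pool. The header
 * keeps its original spa_guid in the hash key, so tagging and cache
 * validation behave exactly as before.
 */
static void
example_shared_l2arc_feed(example_pool_t *imported_pools,
    example_pool_t *l2arc_pool)
{
	for (example_pool_t *p = imported_pools; p != NULL;
	    p = p->next_imported) {
		for (example_hdr_t *h = p->evict_tail; h != NULL; h = h->next) {
			h->l2_daddr =
			    example_write_to_cache_vdev(l2arc_pool, h->l1_data);
			h->has_l2 = 1;
		}
	}
}
```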
Now, the obvious question is: is this correct? I believe it is, because we do not make any changes to the ARC or the L2ARC invariants themselves. The tagging remains the same, cache validation remains the same; everything I can think of basically remains the same. There is one case where we need to make some minor changes.
That is the case where a read from the L2 device fails, for example because the L2 device is dead, or there's a checksum error, or the buffer got evicted after we started the read but before we finished it, something like that. In those cases we fall back to the primary pool and read the data from there, and that is fully transparent to the user; that's how the ARC is supposed to behave.
The problem here is that the primary pool might be exported while we're doing the L2ARC read, but to make the fallback work we need to guarantee that that doesn't happen. The solution here was pretty simple: we just hold the SPA config lock (SCL_L2ARC) of both pools instead of only the pool where we do the L2ARC read, and that solves our problem, because it prevents the primary pool from going away as well.
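As an illustration of that locking idea, here is a sketch that would sit inside the OpenZFS tree; it is not the actual patch, and the function name is invented. Upstream already takes the SCL_L2ARC config lock on the pool it is reading for before issuing an L2ARC read; with a shared L2ARC the cache vdev belongs to a different spa, so the idea is to hold the lock on both pools for the duration of the L2 read and its fallback path.

```c
#include <sys/spa.h>

/*
 * Sketch only: take the SCL_L2ARC config lock on both the pool that owns
 * the cache vdev and the primary pool that owns the data, so that neither
 * can be exported while the L2 read (and a possible fallback read from the
 * primary pool) is in flight. Both locks are dropped with spa_config_exit()
 * when the I/O completes.
 */
static boolean_t
example_l2arc_read_enter(spa_t *primary_spa, spa_t *l2arc_spa, void *tag)
{
	if (!spa_config_tryenter(l2arc_spa, SCL_L2ARC, tag, RW_READER))
		return (B_FALSE);
	if (!spa_config_tryenter(primary_spa, SCL_L2ARC, tag, RW_READER)) {
		spa_config_exit(l2arc_spa, SCL_L2ARC, tag);
		return (B_FALSE);
	}
	return (B_TRUE);
}
```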
The primary risk that I had in mind was that we would have headers from the primary pool that reference structures associated with the L2ARC pool. That is new: previously, all of these would be for the same zpool and they would have the same lifetime, but now that's no longer true. They have different lifetimes, because you can export the pools independently. So when we export the L2ARC pool, we will need to make sure that we invalidate all those ARC headers. Well, it turns out that the code is already structured this way, so my understanding is that the existing code and locking are sufficient to deal with this, but that comes with a quick disclaimer: it is just my understanding.
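The invalidation requirement can be pictured with the following sketch (illustrative names only; as said above, the existing L2ARC device-removal and eviction handling is what actually provides the equivalent behavior today):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative model only; not OpenZFS definitions. */
typedef struct example_hdr {
	const void	*l2_dev;	/* cache device holding the L2 copy, or NULL */
	uint64_t	l2_daddr;
	struct example_hdr *next;
} example_hdr_t;

/*
 * When the shared L2ARC pool is exported, every ARC header from any primary
 * pool that still references one of its cache devices must forget that
 * reference, because the device can now disappear independently of the
 * header's own pool.
 */
static void
example_invalidate_l2_refs(example_hdr_t *cached_headers,
    const void *exported_cache_dev)
{
	for (example_hdr_t *h = cached_headers; h != NULL; h = h->next) {
		if (h->l2_dev == exported_cache_dev) {
			h->l2_dev = NULL;
			h->l2_daddr = 0;
		}
	}
}
```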
So that was it about where the project is right now. Let's talk about the future. Right now, the project is a proof of concept; it hasn't been productized yet. What I did was publish the rebased code to GitHub, so it's available as a draft PR, and every one of you can look at it and check it out. There are a bunch of to-dos.
First of all, I did the rebase, but I didn't take the new features into consideration, so for a general design we'll need to think about how we handle those. And then, obviously, we cannot use this hard-coded magic name; we need some more dynamic and generic representation of whether the L2ARC devices should be shared or not. A property seems like the right choice for this.
So probably we should also have a property that controls whether we want to use the shared L2ARC in a given zpool, or whether we only want to use that zpool's own cache devices for datasets in that pool. So that's another property we could throw in; that is all subject for debate.
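To make the two knobs concrete, here is a purely hypothetical sketch; neither property exists today, and the names are invented for illustration only.

```c
#include <stdbool.h>

/* Hypothetical per-pool properties; invented names, nothing like this exists yet. */
typedef struct example_pool_props {
	bool	share_l2arc;		/* "my cache vdevs may serve other pools" */
	bool	use_shared_l2arc;	/* "cache my datasets on the shared L2ARC" */
} example_pool_props_t;

/* The feed loop could then skip pools that opted out of the shared cache. */
static bool
example_should_feed(const example_pool_props_t *data_pool,
    const example_pool_props_t *cache_pool)
{
	return (cache_pool->share_l2arc && data_pool->use_shared_l2arc);
}
```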
I would be happy about comments on the PR or in the Q&A right now. And that was my talk.
If you're interested in the code, have a look at the PR, or we can look at it together in the breakout session or during the hackathon; and if you want a demo, maybe we can also do that in the breakout room after the talk. If you have questions or comments on the design, now is the time. Before I hand over to Matt, I would like to say thank you to my team at Nutanix and the ZFS community at large, in particular [inaudible] and George Wilson. Both of you answered lots of questions that I had while I implemented this. Thank you.
Yeah, so that was actually one of the alternative designs. It was just too fiddly in, like, the device-management code, which I wasn't particularly familiar with. Actually, under the hood, the vdev auxiliary type, which is like the, I don't know what it's called, abstract base class or whatever, is basically a piece of code that is shared between spare vdevs and cache vdevs.
Also, you have to think about what the different subcommands would do. So, for example, zpool status or zpool iostat: will it show the I/Os for the device per pool, or will it, like, distribute the statistics based on which pool did how many accesses to this device? It seems simpler to just have one zpool.
Yes, as I said, we are on 0.7, so we don't have persistent L2ARC, and so we don't have that particular use case. In general, when we export the pool, it's going to be gone for several minutes, and by then the L2ARC contents are probably irrelevant, so we didn't need to think about that. We did think about whether, if we streamed it persistently to the L2ARC, we would be interested in persistent L2ARC for the product, and yeah, we concluded that we aren't.