From YouTube: ZFS TRIM Explained by Brian Behlendorf
Description
From the 2019 OpenZFS Developer Summit
slides: https://drive.google.com/open?id=1Osc5IajVUqfrFlXFiz5m7p9UXa0iFV0f
... but your mileage may vary. So why is this, again? I don't want to dig too deep into the media details here, but with SSDs I think it is worth briefly mentioning the background, where this phenomenon comes from. Fundamentally it comes from the fact that on NAND SSDs, at least, you can only write pre-erased pages. These are typically 4K pages on the media side, and writing them is very, very fast.
A
You
can
write
them
very
very
quickly
when
they're
pre
erased,
but
you
can
never
overwrite
them
when
you
need
to
erase
them,
you
have
to
erase
full
erase
blocks.
These
are
much
much
larger
and
you're
gonna
erase
a
lot
of
pages,
so
this
operation
is
very,
very
slow
and
hurts
performance
so,
like
I,
say
a
couple
quick
examples
of
this,
because
I
think
we're
all
kind
of
up
to
speed
in
how
SSDs
work
they're,
not
exactly
new
technology
but
write
amplification
is
that
effect,
you're,
gonna
have
or
SSDs.
So
here's
a
really
simple
example.
Here we've got three erase blocks, basically, and we get a new write coming in; those are the green blocks. We can write it all very, very fast. But if you want to overwrite it, you just can't overwrite that one sector. What you're going to have to do is read all the previous data, likely, and write it to a new, unwritten spot, because you can't overwrite in place.
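To put rough, illustrative numbers on that (these figures are assumptions, not from the talk): with 4 KiB pages and a 256 KiB erase block, overwriting a single 4 KiB sector in a full block can force the drive to copy the other 63 valid pages elsewhere and erase the whole block, roughly 64 times more media writes than the 4 KiB the host asked for. That multiplier is the write amplification being described.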
This can obviously get much worse. Along with this, another effect people know about is garbage collection. So here we've got the same three erase blocks, and now we've got old data on the middle one. This is data that the SSD moved internally; it knows it's not valid anymore, and it knows where the real, good data is. Well, we can't reuse those blocks until... well, if you get another write coming in, you're going to need to erase that block, and erasing that block is really, really slow with SSDs.
A
So
you
come
in
your
you
in
a
race
you're
right
data
there,
this
full
erase
right
cycle
is
really
expensive.
So
that's
just
a
quick
issue
of
the
problem
with
SSDs
and
why
trim
is
important.
So
the
solution
to
this
needs
to
be
tackled
at
the
file
system
level,
and
this
is
why
people
are
keen
on
having
trim
in
their
file
systems,
because
only
the
file
system
knows
which
blocks
are
actually
allocated
and
which
blocks
are
actually
free.
The SSD internally can be very, very clever about making sure things get pre-erased and moving stuff around, but at the end of the day it can only do so much, and this gets a little bit harder the fuller the SSD gets. The more data it has to manage internally, the worse this problem gets, and it manifests itself as reduced write performance over time with your SSD. Originally file systems were optimized for hard drives, so this wasn't a problem: unlike SSDs, hard drives don't suffer this rewrite penalty.
A
You
can
pretty
much
expect
the
same.
Rewrite
performance
more
or
less,
for
your
drive
for
the
life
of
the
drive
for
SSD
is
trims
kind
of
more
requirement
right
because
you
expect
the
performance
degradation
it
internally.
Like
I
say,
the
drive
could
only
do
so
much
so
there's
some
nice
effects.
I
come
from
this
and
motivations
for
implementing
trim
support
in
ZFS.
Again,
that's
probably
why
it's
been
one
of
the
most
requested
features.
Typically you'll see improved write performance if you have pre-trimmed devices, because you've got a lot of these pre-erased cells that you can write to very quickly. Additionally, it's going to improve device longevity: if you can minimize erase-and-write cycles on your media, so much the better, because all of these flash devices have a limited lifespan. So what you've seen is that TRIM support has systematically been added to most Linux file systems at this point, and I believe this is true for most other platforms as well.
Ext4 added TRIM, Btrfs added TRIM; even FAT, GFS and XFS have all added TRIM support at this point on Linux. But the story doesn't stop there. It's actually a little bit trickier than that, because a file system can have TRIM support and still not behave particularly well. It's for this reason that it's disabled by default on a lot of Linux distros, just out of the box: if you make a new ext4 filesystem on some distributions, TRIM just isn't enabled, because there can be performance concerns with certain devices where TRIM performance isn't great.
So you may have noticed I skipped one file system before: ZFS. ZFS has actually had TRIM support for a long time now. Going all the way back to FreeBSD 9.2 there's been a version of TRIM in ZFS, and Nexenta added a version to their product too. So TRIM support has been understood to be a good thing for a long time, and there have been versions of it out there, but none of those made it into the Linux port. That was true until the last release, at least for ZFS on Linux.
A
Oh
wait:
we
went
and
took
it
as
an
opportunity
to
revisit
why
trim
support
had
never
been
added
and
to
look
at
how
we
really
wanted
to
do
it
in
the
in
the
pork,
so
design
goals
the
first
one,
probably
a
no-brainer
right
we
want
online
trim
support,
did
not
hurt
applications;
they
should
actually
help
application
performance.
We
don't
want
it
to
do
negatively
impact
anything
additionally
and
we
wanted
to
interoperate
seamlessly
with
all
of
the
existing
open,
ZFS
features.
This
is
one
of
the
things
that
held
back
getting
trim,
support
into
the
Linux
port.
For quite a long time, as you may have noticed, new features have been added to ZFS at a pretty steady clip, and one of the things we pride ourselves on is having all of those features work really well together. So porting one of the previous versions of TRIM was easier said than done: you couldn't pick up the old code exactly as it was and add it in; you'd have to adapt it to work with all the new functionality that was in ZFS.
This again goes to not adding a lot of additional code and keeping the design easy to reason about and relatively straightforward. We didn't want it to be overly complex, and because we wanted this to be used everywhere, I think it was important to keep an eye towards portability between platforms.
For TRIM in particular this is a little bit bigger of an issue, and something I'll touch on later, but we wanted the final implementation to be pretty portable. So these are all good things to strive for, but why didn't it happen until now? The reason it happened now, and was much easier, is a recent OpenZFS feature that got added in the last release.
A
V
dev
initialize
got
added
as
a
feature
also
known
as
the
eager
zero
anybody
here
familiar
with
Vita
initialize.
They
use
it.
Oh
good.
This
is
the
right
crowd
yeah.
So
if
you
dev
initialize
added
as
a
feature
to
initialize
on
allocated
space
in
the
pool,
basically
as
a
background
task
for
performance
concerns,
it
turns
out
that
on
some
systems
you
may
have
a
first
access
penalty.
A
When
you
do
your
first
right,
if
it's
like
a
thinly
provisioned
device
in
a
virtual
environment,
it
may
cost
you
a
fair
bit
to
perform
that
first
right
and
if
you
can
hide
that
first
access
penalty,
that's
great!
So
that's
what
be
dev
initializes.
Therefore,
in
the
background,
it
writes
a
pattern
zeros
to
all
the
unallocated
space
in
the
pool
when
you
think
about
it.
This
behavior
is
exactly
what
you
want
for
trim
right.
A
This
is
exactly
what
you
want
to
do
with
trim
you
want
to,
in
the
background,
not
issue
zeros
to
all
the
unallocated
space.
What
you
want
to
issue
trim
iOS
to
the
all
the
unallocated
space
so
building
on
vida
I've
initialized,
like
put
in
place
a
large
numbers
of
components,
we
already
needed
to
implement
trim,
and
we
could
do
it
in
a
way
that
leveraged
all
of
that
existing
functionality,
which
was
really
nice,
so
in
particular,
I'm
gonna
talk
about
all
these
components
in
a
little
bit,
but
just
a
touch
of
them.
Briefly: vdev initialize added really nice CLI options, along with good reporting, and the ability to enable and disable metaslab allocations, which is how we can safely initialize the unallocated space in the background. It also added some infrastructure for figuring out where to submit the I/Os to the physical disks for the unallocated space. But everything we needed wasn't quite there; some new work was required. Originally vdev initialize was written just for that sole purpose, so it needed to be extended a little bit and made more generic.
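For reference, the initialize feature described here is driven from the command line roughly like this (the pool name is a placeholder; check the zpool man page for your release):

    zpool initialize tank        # start writing the zero pattern to unallocated space
    zpool initialize -s tank     # suspend the initialization
    zpool initialize -c tank     # cancel it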
So what got added? Manual trim. There's a zpool trim command in the latest release, and it does exactly what you'd think it should: it initiates an on-demand trim, and it trims all the unallocated space in the pool. The command-line syntax is very much like vdev initialize: you run zpool trim and give it the pool name. Additionally, if you want to restrict it to particular vdevs, you can do that; just specify the ones you want. There are a handful of options it takes.
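As a concrete usage sketch (pool and device names are placeholders, not from the talk):

    zpool trim tank              # trim all unallocated space on every device in the pool
    zpool trim tank sdb sdc      # restrict the trim to particular devices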
If you want to rate-limit it, you can give it -r, or if you want to do a secure trim and your devices support it, you can do that too. But its purpose is basically to go through and efficiently issue TRIM I/Os for all of the unallocated space in the pool, and that includes merging ranges. The earlier talk today about range trees and metaslabs set me up great here: the range trees basically handle all the burden for ZFS trim, and we just walk that space.
Furthermore, a manual trim will skip very small ranges when issuing TRIM I/Os to disk. If you look back at the SSD background I mentioned, there may be no point in issuing really small TRIM I/Os to the SSD if they're smaller than an erase block, because the drive can't really do anything with them, and they may also be slow. So those are things that can often be skipped.
Additionally, with a manual trim you can cancel it, you can suspend it, and you can resume it. This is because we preserve the state for it on disk, so if your pool crashes, or whatever, or you export it and re-import it, we can continue on. So it behaves nicely from an administrative, user perspective, and all of this has been integrated with the existing zpool status utility, so you can get good reporting on what's going on. So that was a lot of what it can do, but what does it actually look like?
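The cancel, suspend and resume behavior maps onto the same command (placeholder pool name):

    zpool trim -s tank           # suspend an in-progress trim; its state is kept on disk
    zpool trim tank              # running it again resumes from where it left off
    zpool trim -c tank           # cancel the trim entirely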
So zpool status looks like this now in the current release. If you issue a zpool trim and then run zpool status, you'll see it broken down by individual vdevs, which ones are trimming, and there'll be a little "trimming" note appended after them if they're currently in the process of a manual trim. Digging a little deeper, you can give it the -t option if you want to know a little bit more about the actively running trim.
In this case, it looks like the raid-z devices here are about two-thirds of the way through and they're currently running; it'll show you their percentage complete. The devices in the mirror I've actually gone out of my way to suspend here. What you can do is run zpool trim -s and list particular vdevs, and it'll suspend the progress for them, so they won't advance until you resume them. So if there's a performance concern, or some other issue where you want to pause them, you can do that. And then the last one...
...here is done: the log device is smaller, it finished really quickly, we trimmed the whole thing, and it's complete. So you can always run zpool status -t, integrated with the existing tools, and get a good idea of what's going on with a manual trim. So that was a lot of high-level stuff, but let's go dig down into the internals a little bit.
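As an illustration of that reporting (abridged, made-up output; the exact wording varies by release):

    $ zpool status -t tank
      ...
        raidz1-0   ONLINE  0 0 0
          sda      ONLINE  0 0 0  (66% trimmed, started at ...)
          sdb      ONLINE  0 0 0  (66% trimmed, started at ...)
        mirror-1   ONLINE  0 0 0
          sdc      ONLINE  0 0 0  (suspended, 40% trimmed)
      logs
          sdd      ONLINE  0 0 0  (100% trimmed, completed at ...)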
I'm going to talk a little about each of the pieces I mentioned. Metaslabs again: we know a lot about metaslabs now, which is great, so I don't need to go over much of that background. What we're interested in on the metaslab side here is one range tree in particular. For manual trims, there's a range tree in each metaslab called ms_allocatable, and it's what's used to track the unallocated space on disk.
So this describes all the free space, and that's what we're going to need to trim from this particular metaslab. Building on the work that was done for vdev initialize, we added the concept of being able to enable and disable allocations on one of these metaslabs. When you disable a metaslab, it doesn't mean it's completely inaccessible; what it means is that it's unavailable for new allocations. This is the fundamental mechanism that was added to allow us to safely trim the metaslab.
What you never want to have happen is for someone to write a new block to that metaslab, have it land on disk, and then accidentally trim it, and that's what this prevents. Again, we don't want to have too many of these loaded concurrently, because they take a lot of memory and there are performance concerns with that, so we also want to limit the number of metaslabs we have loaded and disabled at any one time.
Vdev translate is the next important bit of the puzzle here. With the ms_allocatable range tree we know which logical ranges we need to trim, but it's a little trickier to map those to physical offsets on disk, and that's what we get out of vdev translate. Vdev translate is a helper that was added to the vdev ops structure.
It lets you do this logical-to-physical translation: you can call the vdev translate function and specify a particular leaf vdev and the logical range you're interested in, and it will return the physical ranges on that vdev, and those are what you can trim. The strategy for this is pretty straightforward; excuse me... starting at the child, you walk up to the parent.
For most vdevs there's not a lot of complicated mapping that needs to be done there, but for raid-z there's a custom one, which helps you figure out which part of that logical range exists on your particular leaf vdev, and then when it winds all the way down to the child it returns that range. So this provides a way to basically ask: which part of this logical range exists on my child vdev, and what are the physical offsets for it? That provides the second bit of this. Now...
...we know what we should trim, and we know where it is on each leaf vdev. The last bit is the bit that was added for trim; the previous two parts were added as part of vdev initialize because, like I say, it has basically the same problem. We needed the ability to actually issue trims, and this is where things diverge a little bit from the other ports.
One thing that bubbled up as interesting when looking at the FreeBSD and the Nexenta versions (I didn't realize this at all until I started looking at them) is that the implementations at the system level for trim are very, very platform-specific. How FreeBSD handles a trim is very different from how illumos handles a trim, which is very different from how Linux handles a trim.
The interfaces to do it are pretty different, and what we really wanted to do was prevent that kind of platform-specific detail from bubbling up into the core ZFS code, which it had in the other implementations, and it was awkward; things just weren't quite in the right place. So what we did in this version is make trim a first-class citizen of the ZIO pipeline.
Basically, you can now issue not just ZIO reads or ZIO writes, you can issue ZIO trims, and they're handled like reads or writes, like anything else in the pipeline. They get aggregated properly, they get added to the queues properly for issuing to the devices; they're normal, full I/O operations.
This way each platform can do the right thing to handle trim for their system; like I say, I was surprised how different they are. For a file vdev, again, you just implement the right thing for it: on Linux the file vdev's trim path is actually a wrapper around fallocate. So if you're familiar with hole punching on Linux, this really boils down to a hole punch: for a file vdev it turns the backing file into a sparse file by punching a hole for that range, if the underlying file system supports it, and ext4, XFS, ZFS...
...all of these support hole punching on Linux, so you even get trim for your file vdevs, which is nice. Most importantly, the higher, core components of ZFS don't know anything about this; all they know is that they're issuing a trim for a particular range that is unallocated.
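For a sense of what that Linux hole punch looks like outside of ZFS, the userspace equivalent is roughly this (illustrative, with a made-up file path):

    # deallocate 1 MiB at offset 0 in a backing file, on a filesystem that supports it
    fallocate --punch-hole --offset 0 --length 1M /var/lib/vdevs/file0.img

This is the same FALLOC_FL_PUNCH_HOLE mechanism the talk says a file vdev's trim gets translated into.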
So, putting that all together, this is what it looks like from an architecture standpoint. When you issue a zpool trim, we're going to spawn one thread per leaf vdev; those are the orange boxes here at the bottom.
These threads are relatively short-lived, and they're going to make one single pass over the metaslabs. The idea is that they go through each metaslab and disable metaslab allocations, so we know it's safe to trim that metaslab, since no new writes are going to be coming in, and then we issue the trim I/Os for the leaf in question.
There is some logic in there to rate-limit things and make sure only so many trims are outstanding at any given time, but basically you issue them all and you wait for them to complete. When that's completed, you re-enable metaslab allocations, and that metaslab has been trimmed. Then you repeat: go to the next metaslab and repeat the process. That, at a high level, is...
...what happens with a manual trim. While it's going along doing this, progress gets saved in a leaf ZAP: each one of these leaf vdevs has its own ZAP, and the state is stored in it. This is what's used for resuming if you suspend it, and for progress reporting, so it knows how far along it is. The state gets tracked on a per-leaf basis, and you can cancel, suspend and resume.
So what does this look like in practice? Here's what we expect. We expect performance to be good again for our SSD, and we expect it to drop off; but when you issue a trim, we trim all that free space and performance should be restored on the system. We've trimmed the device, performance is good, but it will drop off again. This is where a lot of file systems kind of stop, and they say, well, this is good.
You know, every day or so you put in a cron job to trim your file system, or whatever, and you'll have pretty good performance, but you still end up with this sawtooth behavior, depending on the activity of your file system and how busy it is. So it's good, it solves the problem, to have this kind of manual trim that you can do, but it's not quite what you really want.
What you really want is for this to happen continuously. That way, the underlying storage always has an up-to-date mapping of what the file system thinks is in use and what isn't in use, and it can much more efficiently manage that storage to deliver good performance. This was added to ZFS via the autotrim property: there's a pool property you can set, autotrim, on or off, and when it's on you're doing automatic background trimming. To make this work there's one more piece of the puzzle, and that's the free block life cycle in ZFS.
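The property itself is just (placeholder pool name):

    zpool set autotrim=on tank       # enable continuous background trimming
    zpool get autotrim tank          # check the current setting
    zpool set autotrim=off tank      # turn it back off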
When you have a block and you're really about to free it (it's not referenced by a snapshot anymore; it's really going to be freed), the block gets added to the ms_freeing tree. This is a tree that's also attached to each metaslab. As transaction groups get synced it migrates, and the block actually gets freed into these trees called the ms_defer trees. It'll hang out there for a while before we return it to the ms_allocatable range tree.
Once it's on the ms_allocatable range tree again, it can be allocated again for new blocks. This is the normal cycle that blocks go through in ZFS. When you turn the autotrim property on, what happens is that we get this other tree, the ms_trim tree, which gets added to the metaslab. The ms_trim tree is always going to be a subset of ms_allocatable, because it contains only the recently freed blocks on the system, so it'll be some subset of ms_allocatable, and it also only exists in core.
So this should look familiar: automatic trimming. This is the same diagram, but it's a little bit different for autotrim. In this case, when you turn on autotrim, you get one thread per top-level vdev instead of one per leaf. It can also be a long-running thread instead of a short-running one; in fact, it'll run as long as the property is enabled, just looping in the background. There are a couple of advantages to having one thread per top-level vdev instead of one per leaf vdev.
A
In
this
case,
the
big
one
is
probably
that
it
makes
it
easier
to
only
disable
one
meta
slab
at
a
time
when
you're
trimming
it
because
we
don't
really
want
to
disable
more
meta
slabs
and
we
have
to
and
disable
allocations
for
more
places.
We
want
to
have
minimum
impact
on
the
applications,
so
if
we
can
get
only
disable
one
meta
slab
while
we're
trimming
it
and
work
on
all
of
those
children
that's
ideal.
Also
one
thread
is
really
more
than
enough
to
issue
the
iOS
for
the
child
be
devs
here.
So we have one thread for that, and it works very similarly to the manual trim process, except that we continuously iterate over the metaslabs in this case. It'll start at the beginning, it'll disable one of the metaslabs (starting at the top, I guess), and then it's going to consume that ms_trim tree. What I mean by "swap" here is that when it consumes the ms_trim tree, what it actually does is remove it from the metaslab and insert an empty one in its place.
Basically, this allows new frees that happen on the system to be added back to the ms_trim tree, and we'll get to them on the next pass. It's not a big deal that they accumulate there for a while; it's fine. It just makes it easier to operate on the current ms_trim tree, so we can traverse the whole thing and be done with it.
We go through and issue trim I/Os to all of the children in this case and then wait for them to complete, like I said before. Since we're running at the top-level vdev here, we actually have to issue the I/Os to all the children under us, whether that's a mirror or a raid-z, and we use the translate function again to figure out where those offsets are, passing in the right children. We wait for the trim to complete, re-enable allocations, and we're good: that metaslab has been trimmed.
You also don't want to come back and re-trim the same metaslab too frequently, so the solution for that has been to group these metaslabs into metaslab groups, for lack of a better term. By default we group all the metaslabs into 32 different groups, and the idea is that at most we're going to process one of these groups per transaction group. That means it's going to take a minimum of 32 transaction groups before you get back to trimming the same metaslab, which gives you 32 transaction groups' worth of time to aggregate adjacent ranges together so they can be issued efficiently.
We found in practice that 32 works pretty well in testing. It's controllable, you can adjust it, but 32 works pretty well. Furthermore, the automatic trim is set up such that it never forces a transaction group sync. The whole process is driven by the normal transaction group syncs occurring on your system, but if you have pending frees in a transaction group, they won't force the transaction group to sync, because you really don't want them to: you want this process to be driven by writes.
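On ZFS on Linux that batching shows up as a module parameter; the name below is the one used in the 0.8-era code, so treat it as an assumption to verify against your release:

    cat /sys/module/zfs/parameters/zfs_trim_txg_batch       # defaults to 32
    echo 64 > /sys/module/zfs/parameters/zfs_trim_txg_batch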
Just because there are frees, you don't want to go deal with them; if your pool is idle, this way it stays idle, as long as you're not actually doing I/O to it. You can't see this in the zpool status output, but you can see it in zpool iostat. If you run zpool iostat -r while auto trimming is enabled, you can see there are additional columns on the right, and there's a trim column, which shows you the outstanding trim I/Os, in this case.
We've got, I don't know, some middle-sized ones in progress, but most of the outstanding trim commands are pretty large. So if you want to monitor it, that's an easy way to do it while it's running. With zpool iostat -w you can also get the request times for each of these trim I/Os that are outstanding.
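The monitoring commands being referred to are (placeholder pool name):

    zpool iostat -r tank 5       # request-size histograms, including the trim columns, every 5 seconds
    zpool iostat -w tank 5       # latency histograms, including how long trim I/Os wait and take on disk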
So what do we expect from autotrim? We start a pool with autotrim off; we expect performance to degrade again, and then at some point we turn it on, and then we expect performance to gradually improve as blocks are allocated and freed on the system, because eventually every block that we free is going to get trimmed on disk.
So here's the test case I ran, because I wanted to convince myself: does it actually work as intended? The test case is the time to copy the Linux kernel source. What I wanted to test, at least to make sure the trim was working as advertised, was to compare the performance of copying the Linux kernel, which is a couple of gigabytes and, I don't know, a couple of tens of thousands of files at this point.
A
How
long
does
that
take
on
a
pool
that
has
trimming
enabled
and
one
that
doesn't
have
auto
trimming
enabled
and
to
do
it
at
a
constant
pool
capacity?
All
right,
because
we
know
all
I
mentioned
before,
there's
this
cliff
past
about
80%,
we
could
totally
skew
your
results
so
at
a
target
pool
capacity
about
80%.
What
does
how
long
does
it
take
to
talk
copy
this
colonel?
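A minimal sketch of that kind of test loop, assuming a pool mounted at /tank and an unpacked kernel tree in /src/linux (the paths, iteration count and pre-filling to 80% are assumptions, not the exact script from the talk):

    for i in $(seq 1 2500); do
        /usr/bin/time -f "%e" cp -a /src/linux /tank/copy     # time one copy of the kernel tree
        rm -rf /tank/copy                                      # free it again so there is space to trim
    done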
So, details: here's the hardware config; mirrors and raid-z2 were tested. What does this look like? Here's the data, with the mirror and raid-z runs on here. These are the averages over five runs, and then I've just got a little scatter plot there, for each test point, of each of the runs. At the very beginning, on a pre-trimmed pool, performance is good; we're making quite a few copies of the kernel per minute.
A
It
looks
like
about
six
copies
per
minute,
something
like
that,
which
is
what
we'd
expect
it's
completely
empty
pool
performance
is
just
fine
as
we
fill
the
pool
performance
does
degrade
all
right.
It
gets
worse
over
time
and
eventually
it
bottoms
out
somewhere-
and
this
is
all
with
you
know-
no
auto
trimming,
nothing-
fancy
just
enterprise-grade
SSD
that
we're
testing
with
here.
So
that's
the
sustained
performance
without
it
there
at
the
bottom
to
convince
ourselves
that
really
this
is
not
because
the
pools
full
cuz.
You
can
imagine
that.
...maybe this is just a performance drop-off due to filling the pool, we issue a zpool trim. Sure enough, when you issue the zpool trim, just as you'd expect, performance pops back up, almost back to where it was before, which is great. But, also as expected, it mostly goes back down again pretty quickly; it doesn't take too many more copies of the kernel before performance goes off a cliff. We issue a zpool trim again, just to convince ourselves that's really...
...what's going on, and sure enough it is: performance recovers and then drops off again. Then the really good bit, here at the end, is that autotrim does work as intended. Once you set autotrim, and you're freeing and allocating a ton of blocks, it actually doesn't take particularly long for performance to recover, and once it does recover, it's able to maintain that performance almost as if it were a new pool. So the good news here is that all that theory actually works: trim does work as advertised, and performance is pretty good.
Doing something like removing a whole dataset and doing all those frees? Yeah, so I haven't run that exact test. I imagine it would work pretty well, because those frees are all handled like any other frees. We rate-limit how often we issue trims, so it might take a while to process all those frees as they aggregate, but it shouldn't affect the rate at which we're actually issuing trims, as long as you're talking about the autotrim case.
A
Well,
so,
yes,
they
will
be
broken
up
at
the
lower
level.
There's
a
threshold
where
I,
don't
I,
don't
know
what
it
is
offhand,
but
we
guarantee
that
we
own
tissue
trims
higher
than
I
want
to
say
it's
like
16
or
maybe
32
megabytes,
something
like
that.
But
yeah
there's
a
cut-off
where
we
say
no,
no.
This
is
just
a
bad
idea.
Don't
do
it.
That's a good question: they can absolutely both be run at the same time. Oh, the question was how manual and automatic trims interact when they're run at the same time. So a manual trim and an automatic trim can run at the same time. The manual trim basically runs as fast as possible; it's the administrative thing, and the intent is to trim all the free space in the pool as quickly as possible, so that will run normally, in parallel. The autotrim will run, but I don't think it...
We need a separate tree for ms_trim because we only care about blocks that have recently been freed, not all the blocks that are currently free in the pool; you only want to trim things that were recently released. We could go through ms_allocatable there, but that's a lot of extra work, and it's possible that a lot of those ranges have already been trimmed by a previous pass. You could, for example, run a manual trim over the pool and trim everything in ms_allocatable, and then all that stuff is already trimmed.
Yeah, so the question is about how I mentioned that we don't force transaction group syncs, so when the pool is idle we don't go do trimming in the background; basically, we wait for some new write to come in. The question is: isn't that actually a good time to do trimming, because the pool is otherwise idle? It seems like the perfect time to do trimming, because it's not going to impact any kind of application workload. Yeah.
You could imagine some kind of more complicated machinery, I suppose, where you go through and you trim everything once, you drain all the trees, and then once everything is fully empty you quiesce and stop. That could be left as future work; there might be value in it, but it didn't seem necessary in the first implementation, I would say.
Rate-limiting for the trim: the question was what kind of rate limiting was done for the trim. I didn't mention it; the trim I/Os are issued as part of those threads, and they rate-limit themselves. Basically, there's a control that limits how many outstanding trim bytes will be issued down to the pipeline, and then the pipeline itself does some limiting and breaking up of those I/Os before they get submitted to disk. But yeah, I glossed over that; there's more detail in how those get broken up.
The impact on read latency: I didn't gather careful data on the read latency, so I do not know. I'd hesitate to guess at that; like you say, it could vary widely between drives. From my personal experience, how SSDs behave does vary widely between consumer grade and enterprise grade, between manufacturers, between a lot of different things, so I don't know exactly how it behaved there.
So I think the question was: what is the memory impact of maintaining this additional range tree tracking the ranges that are to be freed? It's pretty minimal in our experience, mainly because the tree itself doesn't get that big, typically, because we're pruning it fairly aggressively; every 32 transaction groups you basically drain the entire tree. It's also a range tree, which happily has been nicely optimized now, so that helps: frees in contiguous ranges don't take up that much space. So we haven't found it to be a problem in practice.
I suppose it could be, if you suddenly free a lot of stuff that happens to not be contiguous and that makes really big range trees. It's not capped at the moment; it could be, I mean, if it turns out to be important we could cap it. Yes, the question was whether this is in the current release, in 0.8. Yes, it is.
So the question is: what's the x-axis on this? Right, I conveniently left it off. The x-axis here is test iterations, basically; each of these dots is one cycle of that remove-and-copy loop, from beginning to end. It's about 2,500 runs of the test, something like that, for perspective. When performance collapses here, it probably took maybe, I don't know, 30, 40, 50 copies, runs of the test, before performance was back down below that. That's going to vary based on how full your pool is.
In this particular case the pool was 80% full, so we weren't leaving ourselves a lot of extra free space, and that'll contribute to it. The trim time... yeah, I didn't mention: it must have been about 2 terabytes of usable capacity, something like that, to get about 200 copies of the Linux kernel in it. And how long did the trim take?
Yeah, so the question is: did I run the manual trim while the test was running, or did I run it in between? The answer is that I ran it in between, so you're not seeing the effect of a running manual trim and what it does to performance on the system; the scripts weren't set up that way. I'd be curious about that myself; I don't know what impact a manual trim would have.
Good question. They were close enough that I didn't investigate that; raid-z and mirror are about the same, right about what I'd expect. So no, I don't know exactly why it's a little bit slower.
Yeah, so one other nice thing to add about that, which I meant to mention at the time: because there's a generic helper for this, it's easy to extend the mechanism, so it works with things like dRAID. When dRAID gets integrated it'll be able to do that mapping and trim, and we won't need additional I/Os; it'll just more or less work with dRAID.
A
All
right
so
the
trying
to
be
summarized
what
Matt
just
said
is
it
it
doesn't
matter
because
the
MS
allocatable
were
walking
just
represents
like
what
blocks
are
allocated,
doesn't
care.
What's
in
them,
it's
just
what's
allocated
and
what's
frite
right
and
we're
just
trimming
all
the
stuff.
That's
nothing
is
using
right.
We have not done testing on a large number of... so the question was whether we're keeping a list of disks that we know are problematic. No. We try to just buy good disks for the most part, so we don't have a really long list of disks you should avoid. It'd be cool if someone had something like that, but no.
The question was: is there any additional delay between when a block is deleted and freed and when we actually go off and trim it? No, there isn't. Once it's added back to ms_allocatable it's considered eligible for trim. That might not mean we get to it immediately; it depends on when the thread revisits that metaslab to trim it, but yeah, it's immediately eligible, so there's not an additional deferred delay. Well, even for automatic trim you do get the two-transaction-group cushion, because it does go through the ms_defer trees. So you get two transaction groups, and you can rewind your pool as far as you normally would be able to safely rewind it, without any risk of your data being gone. But once a block was eligible to be overwritten, like in a normal file system, it's also eligible to be trimmed, and it may be trimmed at any time. So, no additional delay.
Yeah, so the point was that this does potentially eliminate or restrict your ability to roll back, because we're pretty aggressively trimming these things. That's absolutely true. If you don't want that, I would just not turn on automatic trimming, or we could think about extending it: if that's useful functionality you want to preserve and you wanted to delay it further, that's the kind of thing you could extend.
I mean, I think that would all be interesting to explore generically, as part of the ms_defer tree, like Alan was saying; maybe you want to let stuff last a lot longer. That's totally a reasonable thing to request, but I think it should apply uniformly to trim and to normal file system I/O, right.
So I guess the comment there was that this may behave a little bit differently on SATA drives, which have different limitations. That's right; this testing was done on SCSI drives, and hopefully those do a little better with trim, but your mileage may vary.
There was some trouble chasing data corruption bugs for a while. That was something we cared a lot about, obviously; it's one of the things that really delayed this work for quite a while: how do you convince yourself that this is working absolutely correctly and is never going to do anything wrong? Because if it does, it's bad. So there was a lot of runtime and testing and running down those kinds of bugs to convince ourselves that it was absolutely solid.
So the question is: if you're running a pool with all hard drives that don't support trim, what happens with the properties? At the moment, nothing, basically. The properties are still there and you can turn them on, but each drive is individually detected as to whether it supports trim I/Os or not. If the device doesn't support trim, we won't issue trims to that device; if it does, we'll issue trims, and we report which ones do and don't support it. So it just disables itself.
I think the throttle is based on the bytes outstanding, not the number of ops. Yes, it's throttled at two levels. One layer, for the manual trim, is throttled based on the number of bytes outstanding that we've issued to the lower I/O layers in ZFS, the I/O pipeline, and then the pipeline itself will determine when to issue those I/Os based on other activity.
A
On
the
system,
like
I
said
it's,
a
full
trim
was
added
as
a
full
class
member
for
trim
top
for
I/o
types
in
ZFS,
so
it'll
trade-off
between
outstanding
reads
and
oustanding,
writes
and
trim
and
trim
is
like
the
lowest
priority
thing.
So
if
there
are
outstanding,
I
think
scrub,
my
people
out
trim,
but
it's
below
reads
and
write.
So
if
there
are
outstanding
reads
or
writes
that
need
to
be
handled,
they'll
be
handled
first.
A
You,
yes,
it's
possible,
but
the
question
was
like:
is
it
what's?
How
does
the
throttle
work
and
the
throttle
works
based
on
bites
at
the
higher
level,
so
we're
going
to
issue
a
range
to
trim
a
certain
number
of
bites
through
lower
layers,
and
then
the
pipeline
will
issue
iOS
to
handle
those
trims
as
it
deems
appropriate
as
it's
been
tuned
right
to
avoid
impacting
performance.
So
if
there's
a
lot
of
reads
that
need
to
be
handled,
synchronous
reads:
the
trims
will
be
deferred
in
the
queue
and
they
won't
be
issued.
The question was: why have both an automatic and a manual trim? The thought process was that a manual trim and an autotrim serve kind of different purposes. With the manual trim you really want to return a device to a fully trimmed state as quickly as possible, and that's what it does: it runs through the entire pool and trims everything that could be trimmable. It may impact performance, but it will get you to that fully trimmed state as quickly as possible; like we said, it may just be a few minutes to get there.
The autotrim is more for maintaining performance on a pool and keeping things in a fully trimmed state, so they're slightly different use cases. I guess you can make an argument that the autotrim gets you most of what you need and you could just leave that enabled, but at that point it's easy enough to have the manual trim too, and it's useful.
So the question was: if you were allocating from a metaslab and we disabled it because we need to trim it? Yes, your allocations are going to shift to other metaslabs, and we may have to load new ones because of that, but we probably have a couple of metaslabs loaded already. So as long as we don't disable too many of them, which is how the autotrim works, it shouldn't be too much of an impact.