From YouTube: Refining OpenZFS Compression by Rich Ercolani
Description
From the 2022 OpenZFS Developer Summit: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2022
Slides: https://docs.google.com/presentation/d/1og6UY010exjAANYkkmZn9qrAlO9r0TN6/edit?usp=sharing&ouid=112595186103367032517&rtpof=true&sd=true
This is going to be a talk about, mostly, a couple of experiments that I did, some of which panned out and a lot of which didn't, and how I got there, which will hopefully be more interesting than the dry topics in the slides. Well, I mean, it's fine. So, quick disclaimer: the talk is independent. I am employed by Google.
So, rather than just picking up a random thing off a library shelf: trying to update lz4, because people keep asking about that; trying to update Zstandard, because people keep asking about that; trying to add Brotli, because I've had like three people ask about that; and then an experiment that did pan out more, adding an early abort function to Zstandard compression; and then a summary, and, if we have time, a bunch of random other experiments that I didn't go into in depth.

Why did I do this?
You know, it's nice to have some of the results, like the one I'm going to show later about going from two hours to 15 minutes to compress something. Great, people keep asking, and it's nice to have an answer besides "well, it's hard and we haven't tried." And more directly, I started contributing more actively because I spent a bunch of time on leave and I wanted to get back into the habit of actually working, so I started trying to reliably contribute a little bit, and the habit stuck,
and it stuck with me. So, unique requirements about compression here: in a couple of places, ZFS assumes that if it decompresses and recompresses something, it gets the same thing back.
The one I can recall offhand is L2ARC persistence: it always saves data compressed, but if you have an uncompressed ARC, then you may be sad if there's a mismatch. So swapping out just lz4 or Zstandard or whatever for a new version will produce different results; both can compress and decompress, and both can decompress each other's output, but ZFS will get sad, and when ZFS is sad, it will throw errors.
A
This
would
also,
for
example,
cause
problems
for
like
an
operator
dedupe,
because
you
know
different
result,
different
checksum,
that
time
technically
right
now,
that's
a
problem
with
using
gzip,
because
Linux
and
everything
else
use
different
z-lib
compression
versions
and
also
a
problem.
If
you
use
the
Intel
qat
offload
stuff,
but
nobody's
really
complained
about
it.
It
just
is
true,
so
maybe
it's
not
really
a
problem.
we have to worry too much about, or maybe nobody's using gzip, who knows. But more interestingly, ZFS has tiny records that we're compressing, right, like 128K, one meg, 16 meg if you really turn it up, and a lot of things are focused on large streams of data or large masses of data at once. So, for example, Zstandard's performance tests, as far as I can tell, are mostly focused on either parallel streams or large sets of data, like tens or hundreds of megs, whereas we're never going to get that large (well, I suppose if we got really creative, we could, but why), and ZFS won't save space in tiny units anyway.
Well, if you save like another 3K on 128K, it's not worth the trade-off, so it won't make a difference, even though, you know, 3K out of 128K across your whole data set is kind of a lot.
Each of the compression algorithms currently would get to implement handling that themselves, which is not impossible (like, I have branches for this), but it is additional complexity you have to deal with, so any time you want to update it, you would have to consider whether that complexity is worth it versus what you get.
And currently, as I alluded to earlier, Zstandard, for example, does a bunch of things about parallelizing compression, and the interface for doing that gets slightly worse compression sometimes but is parallelizable. But that also means it can be unreliable, in that it doesn't necessarily guarantee you'll always get the same compressed result if you do that; that is my understanding. So that would run into problems with caveat one.
So here's how I did the graphs that are coming up. I made a couple of data sets that I thought would be differently reflective: one of them is a 20-year-old maildir I've got, which is, you know, text, so mostly highly compressible, but tiny files; a bunch of firmware blobs that I have from updating various devices over the last decade, which are mostly incompressible; and then a snapshot of my root file system, because that's just a wild card of all sorts of things. Then I wrote them at different record sizes as ZFS send streams, sent them to receives backed by fast storage a couple of times, and averaged the results.
The space saving can be kind of variable, because you're also weighing how much the metadata, among other things, compresses, and that's going to vary based on where it puts things, or, you know, the phase of the moon at the time of writing, whatever; there's variation that's going to happen naturally even if you didn't do anything differently. So if you see results below like 30 megs or something, it is probably just noise. And I did this with my Ryzen desktop,
A
My
Intel
coffee
link
desktop
my
Raspberry
Pi's
and
a
Mac
Mini,
which
seems
like
a
reasonable
set.
I
would
have
also
done
like
a
spark,
but
I
tried
letting
that
run
for
12
hours,
and
it
still
wasn't
done
so
I
I
said
no
much
as
I
enjoy
running
things
on
The
Spar,
but
you
know
surprising.
Nobody
thinks
that
are
one
CPU
intensive
task.
Essentially
a
lot
are
going
to
vary
wildly.
There was a lot of spelunking involved to find this. I was trying to figure out a good way to add backward compatibility without a feature flag change, because it seems like a shame, since the formats are forward and backward compatible. Zstandard does this by putting a header at the front with versioning information, but doing that would break old lz4.
I... I can't... I'm apparently hearing things, great. You can stick a version field on the end, in the gap between how long lz4 thinks the record is and how long it actually is, and I checked, and the old code is perfectly happy with this; it doesn't care. Great, and you can handle all the rest however you like. It's very tiny; lz4 is basically lz4.c and lz4.h, and also lz4hc.c.
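To make that trick concrete, here is a minimal sketch of the idea as described, not the patch that actually landed. It assumes the usual ZFS lz4 record layout (a 4-byte big-endian compressed length followed by the lz4 stream) and uses hypothetical names for the tag value and helpers:

```c
/*
 * Hedged sketch of the version-in-the-slack idea, with illustrative names.
 * Old code only ever reads the 4-byte length header plus that many payload
 * bytes, so a byte stashed in the slack after the payload is invisible to it.
 * A real implementation would need to guarantee the slack starts out zeroed
 * so old records never look "tagged" by accident.
 */
#include <stddef.h>
#include <stdint.h>

#define LZ4_IMPL_VERSION_NEW	1	/* hypothetical tag value */

/* After compressing: tag the record if there is slack to hold the tag. */
static void
lz4_stash_version(uint8_t *record, size_t record_size, uint32_t lz4_len)
{
	size_t used = sizeof (uint32_t) + lz4_len;	/* header + payload */

	if (used < record_size)
		record[used] = LZ4_IMPL_VERSION_NEW;
}

/* Before decompressing: was this written by the newer implementation? */
static uint8_t
lz4_peek_version(const uint8_t *record, size_t record_size, uint32_t lz4_len)
{
	size_t used = sizeof (uint32_t) + lz4_len;

	return (used < record_size ? record[used] : 0);	/* 0: old/unknown */
}
```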
Basically, it's kind of really easy to just drop it in and go have a nice day. So I tried this on the maildir and the firmware blobs, and the space delta I got, even at a one-meg record size, was like, you know, 15 megs one way, five megs another way; nothing, it's noise.
A
So
I
tried
again
to
see
how
long
it
would
take
to
write
and
read
and
mailed
her,
and
you
know:
writing
was
noise
like
there
was
no
difference
really
but
reading.
On
the
other
hand,
like
the
newer
decompressor,
that
you
know
that
that's
a
pretty
good
Delta
and
like
the
range
on
it
was
pretty
large
across
a
bunch
of
different
data
sets
like
sometimes
it
was
a
little
faster.
Sometimes
it
was
a
lot
faster,
but
it
wasn't
really
ever
slower
so
that
Arrow
wasn't
supposed
to
show
up
that
way.
Oh well. After all that, the compressor wasn't really worth it and the complexity wasn't really worth it, but the decompressor was a good win. So why don't we just take that and go? And I did, and it landed after not very much review, because it was like: it works, we can always just pull it out if it doesn't work, and it's not in a stable release right now. But I have been running it since before it got merged and I haven't found anything that breaks, so it'll be in the next release. Great. The Zstandard update is the one a lot of people have been agitating about wanting. The one we're running was released in May 2020 and merged in August 2020; I believe that's not when the PR was opened, just when that version was merged. There were a bunch of different files kind of involved.
So originally it was all aggregated into one file, because Zstandard has a thing to do that for you, and it already has built-in versioning, so no fun on-disk format meddling to deal with. One unfortunate thing is that their testing was such that 1.5.0 was so much faster that they decided to turn up the compression settings for each of the levels in 1.5.1 and newer. So, as a result, those are all slower than they were, because they figured they had performance bandwidth to burn.
Negative numbers: that's a bunch of different versions that I bolted on in addition to 1.4.5, and the difference from them to 1.4.5. So, you know, Zstandard 5 is a nice improvement, 7 and 11 are okay improvements, and the rest is noise, is how I would interpret that; your mileage may vary. But then on a different data set, or actually the same data set with a different record size, if I recall, the space usage goes up or does nothing.
A
Isn't
so
you
know
going
from
sorry
again,
I'm
hearing
things
great,
you
know
going
up
by
like
30
seconds
out
of
120
or
so
not
really
a.
And then this result, which was the incompressible data being really, really slow. It's like: oh no, this is a bad idea. So, after all that, it would complicate things to have something to handle the Zstandard versioning properly (and I'd argue for that, because otherwise, you know, people with dedup and nopwrite would be very sad indeed), but then we'd need to keep it
forever. And yeah, it's sometimes markedly slower for no better results. The early abort thing that I mentioned earlier, and am going to talk about shortly, might be very helpful for this. I thought
I had something set up with that integrated already, but I didn't, and I ran out of time, because I discovered I had done half of these tests wrong: when I was rerunning them with updates, I had half-integrated the old version and half the new one, and that wasn't going to do anything useful. So I fixed that but did not have time to rerun this, so I'm probably going to do that during the hackathon tomorrow, unless I have a better project, and we'll see how that goes. All the graphs are from after fixing that, to be clear; the previous graphs were much worse results.
A
So
the
other
thing
another
thing
I
tried
was
adding
broccoli
because,
like
a
couple,
people
came
to
me
and
said:
hey
here:
have
you
heard
of
this?
Compression
thing
was
like
I've
heard
of
it
I've
not
heard
that
many
people
use
it
that
often
but
I've
heard
of
it,
and
you
know
it
was
already
in
like
self-contains
the
great
not
like
another
experiment:
I
did,
which
was
trying
Snappy,
but
snappy
is
written
in,
go
so
not
as
convenient
sure.
A
Let's
go,
I'm
not
intended
broadly
compression
goes
like
zero
to
nine,
and
so
you
know
that's
a
fairly
simple
range:
there's
not
anything
complicated
technically,
it
says
10
and
11
too,
but
those
are
a
very
different
thing
and
should
not
ever
be
used
interactively
do
not
do
it.
It's
bad!
It's
bad,
don't
do
it.
A
It
turned
out
to
be
a
little
more
complicated
because
it
turns
out.
Rotley
wants
to
do
floating.
Point
Mass
when
it
compresses
running
floating
Point
math
in
the
kernel
can
be
sad.
They
I
asked
on
mailing
lists.
They
were
considering
adding
fixed
point,
but
they
haven't
done
it.
A
So
you
know
we
get
to
do
that.
We
put
barriers
around
every
call
in
then
it
turns
out.
The
allocator
has
a
problem
where,
when
it
says
no
sleep,
it
actually
means
no
sleep
unless
I
want
to
and
sleeping
when
you
have
preemption
disabled
is
bad.
Don't
do
that.
Linux
gets
really
mad,
though
oddly
it
doesn't
mind
on
older,
x86
kernels
for
some
reason,
but
newer
ones
or
any
other
platform.
It
gets
real
mad.
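For context, a minimal sketch of what "barriers around every call" can look like on Linux/x86, not the actual patch: the kernel_fpu_begin()/kernel_fpu_end() pair makes FPU use safe in kernel context (OpenZFS has its own kfpu_begin()/kfpu_end() wrappers), and brotli_compress_block() here is a hypothetical stand-in for the Brotli encoder entry point.

```c
#include <linux/types.h>
#include <asm/fpu/api.h>	/* kernel_fpu_begin()/kernel_fpu_end() on x86 */

/* Hypothetical wrapper around the Brotli encoder; not a real symbol. */
extern size_t brotli_compress_block(void *dst, size_t d_len,
    const void *src, size_t s_len, int level);

static size_t
zfs_brotli_compress(void *dst, size_t d_len, const void *src, size_t s_len,
    int level)
{
	size_t c_len;

	kernel_fpu_begin();	/* FP math is now safe; preemption is off */
	c_len = brotli_compress_block(dst, d_len, src, s_len, level);
	kernel_fpu_end();

	/*
	 * Note: with preemption off, any allocation inside the encoder that
	 * decides to sleep anyway is exactly the problem described above.
	 */
	return (c_len);
}
```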
This is a bunch of comparisons of Zstandard, gzip, Brotli, and nothing. Brotli is red, Zstandard-fast is light blue, and Zstandard is dark blue. All of that is mostly to be referenced later, but the point I wanted to make is that it looks like Brotli at the lower levels can be better and faster than the other options that you might have, like you can see.
A
So
you
can
see
here
that,
like
broadly
at
its
lowest
level,
is
nicer
than
say
gzip
or
Z
standard
one,
while
also
being
pretty
fast
compared
to
them.
So
you
know
not
I,
don't
I,
don't
think
that
it's
worth
the
difference,
especially
with
the
complications
and
overhead
that
I
had
to
do
to
get
it
working.
Cool. So, early abort was a feature where I thought maybe this could work, because one thing that gets talked about a lot, that you can find in lots of blogs and people talking, is that one of the reasons lz4 is nice on ZFS is that it will bail out early rather than wasting time trying to compress things, and, you know, Zstandard is another compression thing originally by the same author.
A
So
surely
it
would
do
something
similar
or
have
some
similar
functionality?
Are
we
maybe
not
using
it
because
I'm
sure
if
you've
used
the
higher
C
standard
levels,
you're
familiar
with
how
unfortunate
it
can
make
your
system?
If
you
don't,
if
you're,
not
careful
and.
So it skips through small portions at a time, so you still get decent compression even if it's a mix of compressible and incompressible data, without burning all your CPU time on things you can't compress. That's my understanding from reading the code carefully; I did not go ask the author, so if I find out I'm wrong, I'll tell people, but that's my understanding.
So as an initial experiment, thinking I'd do something more refined after this if it maybe worked, I tried gluing lz4 on as, like, a pass filter to decide whether Zstandard should compress something or not, and the initial results looked really, confusingly good. That is a non-linear axis on the left, because otherwise it's just not readable. So, a notable thing here with the incompressible data: this was on my Ryzen, and on there it took 10 minutes to write the incompressible blobs at a one-meg record size
without this change, and with this change it took about a minute and a half. So, you know, kind of a difference. And the difference in the amount of space you use was, like, you know, 100 megs, or 10 megs, or something really small.
So, you know, I'll take a fifth of the time for 10 megs out of 45 gigs, you know, great. Except if I try this on highly compressible data, then the delta gets a lot bigger; it was like two gigs or so, if I recall. It didn't take much longer, so that's fine, but the delta was like losing two gigs of compression, and that's not really okay. Okay, so that first result was really good.
Yeah, me too. But so it turns out that all of them are really bad as a first pass compared to just using lz4, in terms of space savings; they all give up worse. And I don't show it here, but the amount of time they take is also sometimes worse; the higher you go, the closer it gets to lz4, because lz4 is really the king of what it does. It's really, astonishingly good at it.
A
But
what
if
we
try
doing
both
there's
no
way
like
running
two
compression
passes.
First
is
going
to
be
time
or
space
efficient
right
right.
You know, up to like using Zstandard 2, the delta between it and just running Zstandard 3 is tiny. So how much time does that take? Let's run that test again where we do the incompressible blobs, and, you know, it looks basically the same. I think I didn't run it quite as far out, because I didn't feel like waiting for Zstandard 18 to run. So, okay, that's still good savings, great. And on the highly compressible stuff you can see the delta is like nothing.
And the time difference as well, in addition to the space difference, is still, you know, pretty negligible; actually, until you get up to like Zstandard 15, it's still basically the same, which is pretty good for, you know, taking some of the data and trying two different compressors first, before you try the thing you were actually going to do. Okay, so this was all run on, like, my high-end Ryzen; I've got a lot of cores and a lot of computation per core. There's no way this should work on a Raspberry Pi.
Actually, funny story: it does. If you do the same incompressible-data test, it goes from taking, you know, well over... I'm sorry, I said two hours here, but I'm thinking... yeah, that math is okay, I did check. It goes from two hours to like 13 minutes, and the space delta is like nothing. Okay, you know, I'll take that; I will absolutely take not taking two hours to write this at that compression level. That's great. So, I skipped over playing with different record sizes and trade-offs to decide when to do this, because ultimately I picked at least Zstandard 3, and at least 128K.
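As a rough illustration of the scheme being described (cheap passes first, then the expensive level), here is a hedged sketch using the userland LZ4 and zstd APIs. The thresholds follow the "at least Zstandard 3, at least 128K" choice mentioned above, and the structure is my reading of the talk, not the code that was merged into OpenZFS.

```c
#include <lz4.h>	/* LZ4_compress_default() */
#include <zstd.h>	/* ZSTD_compress(), ZSTD_isError() */
#include <stdlib.h>

#define EARLY_ABORT_MIN_LEVEL	3
#define EARLY_ABORT_MIN_SIZE	(128 * 1024)

/* Returns the compressed size, or 0 to mean "store this block uncompressed". */
static size_t
zstd_compress_early_abort(void *dst, size_t d_len, const void *src,
    size_t s_len, int level)
{
	size_t ret;

	if (level >= EARLY_ABORT_MIN_LEVEL && s_len >= EARLY_ABORT_MIN_SIZE) {
		void *tmp = malloc(d_len);

		if (tmp != NULL) {
			/* Pass 1: lz4 is nearly free; does it fit in d_len? */
			int ok = LZ4_compress_default(src, tmp, (int)s_len,
			    (int)d_len) > 0;

			/* Pass 2: zstd-1 catches some data that lz4 misses. */
			if (!ok)
				ok = !ZSTD_isError(ZSTD_compress(tmp, d_len,
				    src, s_len, 1));
			free(tmp);
			if (!ok)
				return (0);	/* looks incompressible: bail */
		}
	}

	/* The real compression at the requested level. */
	ret = ZSTD_compress(dst, d_len, src, s_len, level);
	return (ZSTD_isError(ret) ? 0 : ret);
}
```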
A
I
also
skipped
over
finding
out
the
door
was
a
bug
in
how
the
ark
did
recompression
that
never
came
up
unless
you
ran
this
I
still
don't
understand
why
it
never
came
up
unless
you
ran
this,
but
it
sure
did
so
that
got
fixed
as
I
alluded
to
I
have
a
lot
of
graphs
and
I
can't
just
push
this
up
to
Z
standard,
because
you
know
they
don't
only
operate
in
tiny
chunks
like
this.
There's a backport, but, like, that's... the actual amount of code change is not that large, but, you know, it's kind of a significant change in what you might expect it to do. So it's not in a point release, is my understanding.
Am I not coming through audibly? I thought I did that. Oh. But I'm not suggesting we merge it for that reason, right; that's too much work for too little. But it's a really fun data point, and really strange, and, you know, I thought it would be somewhat entertaining to people who find this sort of thing funny. Laughing at compression is not necessarily what you were expecting, but here we are. So here's a summary, right. lz4 update: the decompressor is in.
A
If
you
ever
decide
you
want
to
do
it.
Let
me
know:
I
have
a
branch.
It
works
it's
great,
but
that
would
be
my
opinion.
The
standard
update
seemed
like
a
bad
idea.
I
realized
halfway
through
testing
I.
Did
it
wrong
and
did
not
have
enough
time
to
wire
it
up
properly
to
run
all
the
tests
again
before
this?
My
apologies,
but
so
far
it
looks
like
it
could
maybe
be
a
win
because
it
turns
out,
as
we
learned
with
early
abort,
a
lot
of
the
time.
You know, Brotli was fun, but, like, you know, it's a compression algorithm; it doesn't magically, I don't know, use neural networks to magically recreate your data in five bytes. Early abort, again, to reiterate: this doesn't... this should not work, but I cannot argue with it; as far as I can tell, it definitely does.
A lot of the time. So, I tried updating zlib, because, you know, integrating our own zlib copy would avoid the problems I mentioned earlier with different gzip versions, which nobody really runs into, but they're still there; we just don't hit them in practice. But a lot of the implementations that are faster than just baseline gzip, or zlib, I should be consistent, mostly rely on doing FPU instructions to be better, don't actually seem to be consistently better, and are often significantly worse at compressing in the limited testing
A
That
I
did,
and
some
of
them
are
really
hard
to
get
to
compiling
the
kernel,
because
they
are
not
remotely
similar
styles
of
code,
because
they've
reshuffled
everything
as
I
mentioned
I
found,
while
doing
this
Linux
actually
did
a
similar
thing
to
what
I
did
with
the
lz4
decompressor
and
merged
the
zlib
decompressor.
That
was
newer,
15
20
years
ago,
but
let
the
compressor-
because
it
had
this
tiny
regression
on
arm
and
nobody
ever
carried
again.
A
Really
matter
because
I
haven't
heard
lots
of
people
using
Giza
right,
like
lz4,
is
better
at
one
thing:
Z
standard
is
better
than
other,
so
you
would
really
only
use
it
if
you
were
trying
to
have
some
compatibility
with
things
that
don't
understand
either
of
those,
and
that
would
be
a
very
Niche
set
of
people
as
I
mentioned.
Oh
snap,
peoples,
oh
I,
remember
what
I
was
thinking
of
it's
S2,
which
was
written
and
go
Snappy
is
written
in
C
plus,
which
you
know
lobbing
that
into
the
kernel,
not
not
fun.
A
Don't
do
that,
but
one
of
the
Linux
Colonel
doves
wrote
A
C
implementation
to
consider
something
similar
a
few
years
ago.
So
I
could
just
use
that
my
experience
was
that
it's
bad
at
General
use
since
as
far
as
I
understand
it
was
really
intended
to,
like
my
understanding,
is
we've
basically
intended
to
compress
like
blocks
of
text
that
that
was
the
goal.
A
You
know
like
a
tiny
thing
to
compress
text
have
a
nice
day,
so
it's
not
too
surprising
that
throwing
General
results
or
general
sets
of
data
at
it
did
not
end
well.
A
S2
is
an
interesting
project
where
a
I
understand
correctly,
a
database
developer
decided
to
integrate
Snappy
decided
it
didn't
perform
well
enough
and
wrote
their
own
re-implementation
of
it.
That
is
backwards,
compatible
compression
and
decompression,
but
markedly
faster
and
better,
which
is
a
neat
trick,
but
the
their
implementation
is
written
in,
go
so
I'm,
not
lobbling
that
into
the
kernel,
I
I,
know
and
I
haven't
spent
time
trying
to
re-implement
it.
A
But
it's
an
interesting
thing:
if
anyone
wants
to
consider
it
and
I
tried
playing
with
the
Z
standard
memory
allocator,
because
as
anyone
who's
looked
at
the
code
knows
it
does
its
own
custom
pooling
allocation
thing
which
works.
But
you
know
it's
a
weird
custom
thing:
it
would
be
nice
if
we
didn't
have
to
have
this
custom
thing
over
here
when
we
have
all
these
other
things,
but
Linux
at
least
has
limits
on
its
own.
A
Like
caching
allocator
things
in
terms
of
how
large
the
thing
you
can
cache
is,
it
will
just
complain
if
you
try
to
make
like
a
32
Meg
allocation
or
something.
A
So
you
can't
build
a
pooling
thing
out
of
that,
because
it
won't
do
it
and
you
know
just
dynamically
allocating
on
demand
the
size
of
the
allocation
that
this
thing
needs.
Sometimes
it's
just
sad,
so
I
tried
using
the
Zio
allocators
after
seeing
a
patch
that
Alan
made
at
one
point
to
do
that,
and
it
seemed
slightly
faster
like
cold,
but
then
the
more
you
ran
it.
It
basically
became
noise
compared
to
the
other
outages.
So
I,
don't
really
think
that's
worth
trying
to
merge.
A
If
it's
going
to
not
be
better
belonging
about
it,
I
was
going
to
say
that
I
believe
it's
the
last
slide
so
now
I'm
happy
to
talk
about
anything
I
just
said,
or
lots
of
other
random
experiments.
I've
done
that
I
didn't
because
I
thought
well
when
I
practiced.
This
I
was
better
at
talking
slower.
That's something that Brotli does, and that's the thing it does that needs floating-point math: it does, at higher levels anyway, an entropy estimation calculation on the block you hand it, and that can be expensive. I think lz4 mostly doesn't do that; Zstandard does at higher levels.
Doing that fast is kind of a problem, though, right? Like, you can basically... my understanding, and I am not an expert in this field, is that what you want to do basically runs into a bit-counting problem, or something to that effect.
A
So
modern
CPUs
do
have
fast
instructions
for
that,
so
someone
could
probably
write
one
to
use
for
this
purpose.
That's
a
better
refinement
than
just
the
brute
force
of
running
multiple
compressors
great.
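For a sense of what such an entropy estimate looks like in its simplest form, here is a hedged sketch (not from the talk or from any of the libraries discussed): a byte histogram and a Shannon entropy figure in bits per byte, where values close to 8.0 suggest the block is effectively incompressible. The log2() call is exactly the kind of floating-point math that is awkward in kernel context, which is why a fixed-point or bit-counting variant would be preferable there.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Estimate Shannon entropy of a buffer, in bits per byte (0.0 .. 8.0). */
static double
estimate_entropy_bits_per_byte(const uint8_t *buf, size_t len)
{
	size_t hist[256] = { 0 };
	double entropy = 0.0;

	for (size_t i = 0; i < len; i++)
		hist[buf[i]]++;

	for (int b = 0; b < 256; b++) {
		if (hist[b] == 0)
			continue;
		double p = (double)hist[b] / (double)len;
		entropy -= p * log2(p);
	}
	return (entropy);	/* near 8.0: likely not worth compressing */
}
```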
But aside from that, I don't know how fast you could be at it, other than having to iterate over the whole thing, or having data on it initially, like having data on what your input is before you came in, is my understanding.
It has, I believe, the compressed size of the data at the front, and that's it. So I couldn't just shove it in there, because there was already something there, and I wanted to maintain the backward compatibility as much as I could, because it seems like a shame to have a feature flag bump for no reason. So I could do that, but it's not necessary there, because there's no existing Brotli implementation I have to care about.
A
No
I'm
just
throwing
it
out
and
that
wouldn't
necessarily
work,
because
the
way
that
it
works
is
that
it
operates
on
tinier
chunks
than
the
whole
thing
I
hand.
It
so
I,
don't
remember
the
constant
self-hand,
but
you
know
like
it
if
it
gets
like
16,
it
gets
like
12K
into
16k,
and
it's
not
done
it.
It
just
will
skip
to
the
next
16k,
so
you
still
get
some
compression,
even
if
it
and
then
and
all
of
the
algorithms.
we have integrated, we hand them a smaller buffer based on the 12 and a half percent I mentioned earlier, and if they run out of space they give up rather than overrunning the buffer, because that would be bad. So, no, it's just running over the whole thing, or 87.5% of it, potentially, depending. I'm not using any dry-run flags.
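A hedged sketch of that policy as I read it from the talk (illustrative names, not the actual OpenZFS functions): the destination buffer is one eighth smaller than the source, so any compressor that cannot save at least 12.5% simply fails to fit and the block is stored uncompressed.

```c
#include <stddef.h>

/* Generic compressor signature: returns compressed length, 0 on failure. */
typedef size_t (*compress_func_t)(void *dst, size_t d_len,
    const void *src, size_t s_len, int level);

/* Returns the compressed length, or s_len to mean "store uncompressed". */
static size_t
compress_with_threshold(compress_func_t cf, void *dst, const void *src,
    size_t s_len, int level)
{
	size_t d_len = s_len - (s_len >> 3);	/* 87.5% of the source */
	size_t c_len = cf(dst, d_len, src, s_len, level);

	if (c_len == 0 || c_len > d_len)	/* didn't save 12.5%: give up */
		return (s_len);
	return (c_len);
}
```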
I did experiment with using the higher levels of lz4, which we don't expose, but all the code is there; it turns out turning the level up at all just made lz4 markedly slower and did not significantly improve the results.