From YouTube: ZSTD Compression by Allan Jude
Description
From the 2017 OpenZFS Developer Summit:
http://www.open-zfs.org/wiki/OpenZFS_Developer_Summit_2017
All right, so my name is Allan Jude. I'm a FreeBSD committer, I co-authored FreeBSD Mastery: ZFS and FreeBSD Mastery: Advanced ZFS with Michael Lucas, and I run a video streaming company in my day job.
So the compression algorithm is called Zstandard. It was designed by Yann Collet, who also wrote LZ4, which we use heavily in ZFS; today he works at Facebook. The general concept is to get compression ratios closer to what you get with gzip, but faster, whereas with LZ4 you get less compression but more speed.
It's actually a combination of a number of different compression algorithms, including a finite state entropy encoder and a Huffman encoder. And like gzip has its nine levels, Zstandard currently has 22 levels (soon to be more), which gives you a much greater range of speed and memory trade-offs, and then there's also a dictionary training feature, which we'll talk about in a little bit.

So, just quickly comparing Zstandard to zlib (which is gzip) and LZ4: you can see that instead of about a two-to-one compression ratio, you can get closer to three to one. It's not as fast as LZ4, but it's four times faster than gzip, and these numbers are per core. So if you have a reasonable number of cores, then you're going to be much faster than your spindles, or maybe even your SSDs, and so the trade-off for having the compression is pretty low.
I originally started working on this when Zstandard 1.0 came out in the middle of 2016. The very beginning of it was quite easy.
ZFS has a nice clean API: you just add some functions to a table, and it's like, here's the buffer I want to compress, here's the buffer I want you to write the compressed version into, and here are the sizes. It was very straightforward to do, made easier because in the design of Zstandard they actually provide a way for you to specify your own memory allocator, instead of it using malloc and free.
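To make "add some functions to a table" concrete, here is a minimal userspace sketch of the shape involved. It is an approximation, assuming a simplified version of ZFS's compression table; the struct layout and the zfs_zstd_compress name are illustrative, not the actual ZFS declarations.

```c
/* Userspace sketch (cc ztab.c -lzstd); not the actual ZFS kernel code. */
#include <stddef.h>
#include <zstd.h>

typedef size_t zio_compress_func_t(void *src, void *dst,
    size_t s_len, size_t d_len, int level);

/*
 * Hypothetical wrapper with a ZFS-style contract: write the compressed
 * block into dst, or return the source length to signal "store this
 * block uncompressed" when compression fails or does not help.
 */
static size_t
zfs_zstd_compress(void *src, void *dst, size_t s_len, size_t d_len, int level)
{
	size_t c_len = ZSTD_compress(dst, d_len, src, s_len, level);

	if (ZSTD_isError(c_len) || c_len >= s_len)
		return (s_len);
	return (c_len);
}

/*
 * A table row pairing the property name with the callback, in the spirit
 * of ZFS's compression table (field names and order here are assumptions).
 */
static struct compress_entry {
	const char *ci_name;
	int ci_level;
	zio_compress_func_t *ci_compress;
} compress_table[] = {
	{ "zstd", 3, zfs_zstd_compress },
};
```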
So it was easy to adapt that to say: here, hook up to the FreeBSD kernel memory allocator. Although in the 1.0 version of Zstandard they also used a lot of stack space, which caused all kinds of grief; luckily, in later versions they offered a heap mode, much like LZ4 has, and that meant we could just allocate the memory that way.
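The hook Zstandard exposes for this is ZSTD_customMem, a struct of alloc/free callbacks accepted by ZSTD_createCCtx_advanced() in the advanced API. Below is a minimal sketch of wiring it to FreeBSD's kernel malloc(9), assuming a hypothetical M_ZSTD malloc type; the actual glue in the port may differ.

```c
/* Kernel-side sketch: route zstd's allocations through FreeBSD malloc(9). */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>

#define	ZSTD_STATIC_LINKING_ONLY	/* ZSTD_customMem is advanced API */
#include <zstd.h>

MALLOC_DEFINE(M_ZSTD, "zstd", "zstd compression contexts"); /* hypothetical tag */

static void *
zstd_kmem_alloc(void *opaque, size_t size)
{
	return (malloc(size, M_ZSTD, M_NOWAIT));
}

static void
zstd_kmem_free(void *opaque, void *ptr)
{
	free(ptr, M_ZSTD);
}

static const ZSTD_customMem zstd_kmem = {
	.customAlloc = zstd_kmem_alloc,
	.customFree = zstd_kmem_free,
	.opaque = NULL,
};

/* Usage: ZSTD_CCtx *cctx = ZSTD_createCCtx_advanced(zstd_kmem); */
```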
It's already been updated a couple of times; I think when we imported it, it was 1.1, and now we're up to 1.3. And in the FreeBSD base system we actually install libzstd, although we install it as what we call a private library, meaning its name is namespaced off so that only applications that are part of FreeBSD use it, whereas third-party packages that users install won't be able to find it. If something depends on Zstandard, or they want to install Zstandard, they get the version of the library from ports.
There were a few other challenges with memory. Unlike LZ4, which has a fixed context size for compression and decompression (it's slightly tunable in LZ4, but the one we're using in ZFS is just a fixed 16-kilobyte memory allocation, so there's one kmem cache), with Zstandard you get different context sizes for the compression and decompression with the different levels and different record sizes.
So the approach I took so far is to create an array of kmem caches. I picked three of the compression levels (the minimum, the default, and the maximum) instead of implementing all of them, because there are only so many spots in the enum in the on-disk format for the compression types, and it turns out we don't actually want to put all of them in there anyway. A decompression context with Zstandard is 150 kilobytes, and then the compression context varies, starting with a 16K record at the minimum compression level.
So we have an array of the compression levels and the record sizes, and we use a function in Zstandard that estimates the context size, and we create a bunch of kmem caches that we can use. Initializing those doesn't really have a cost, and they only get used if you actually start using that block size and that compression level; so, you know, those 50-megabyte kmem caches won't actually take up any memory unless you actually start compressing the 8-meg blocks of data that there are in the newer version.
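The estimating function referred to here is presumably ZSTD_estimateCCtxSize() and friends from zstd's advanced API; in modern versions, ZSTD_getCParams() plus ZSTD_estimateCCtxSize_usingCParams() let you account for both the level and the record size. A sketch of the idea, assuming the Solaris-style kmem_cache_create() interface ZFS uses; the array layout and names are illustrative.

```c
/* Sketch: pre-create a kmem cache per (level, recordsize) context size.
 * The zstd estimation calls are real advanced API; the arrays and the
 * cache naming scheme are assumptions, not the actual patch. */
#define	ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <sys/kmem.h>		/* Solaris-compat kmem_cache_* used by ZFS */
#include <sys/systm.h>		/* snprintf */

static const int zstd_levels[] = { 1, 3, 19 };	/* min, default, max */
static const size_t zstd_recsizes[] = { 16 << 10, 128 << 10, 1 << 20 };
static kmem_cache_t *zstd_cctx_cache[3][3];

void
zstd_caches_init(void)
{
	char name[48];

	for (int l = 0; l < 3; l++) {
		for (int r = 0; r < 3; r++) {
			size_t sz = ZSTD_estimateCCtxSize_usingCParams(
			    ZSTD_getCParams(zstd_levels[l],
			    zstd_recsizes[r], 0));

			(void) snprintf(name, sizeof (name), "zstd_%d_%zu",
			    zstd_levels[l], zstd_recsizes[r]);
			/* A kmem cache costs ~nothing until first use. */
			zstd_cctx_cache[l][r] = kmem_cache_create(name, sz,
			    0, NULL, NULL, NULL, NULL, NULL, 0);
		}
	}
}
```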
So this led to a question. There are nineteen levels, or twenty-two if you count the ultra modes, in Zstandard, but in the on-disk format we only need to know that a block is Zstandard: when we go to decompress it, we can use that decompressor. We don't actually need to know which of the 22 levels was used to compress it in order to decompress it.
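In other words, the on-disk value only has to name the algorithm, with the level as write-side policy. A sketch of what that could look like, using hypothetical ZIO_COMPRESS_ZSTD* names rather than whatever the prototype actually uses:

```c
/*
 * Hypothetical additions to the on-disk compression enum: the level is
 * a write-side policy baked into the table entry, while decompression
 * only needs to know "this block is zstd".
 */
enum zio_compress {
	/*
	 * ... existing values: ZIO_COMPRESS_LZJB, ZIO_COMPRESS_GZIP_1
	 * through _9, ZIO_COMPRESS_LZ4, ...
	 */
	ZIO_COMPRESS_ZSTD_MIN,	/* zstd level 1 */
	ZIO_COMPRESS_ZSTD,	/* zstd default level (3) */
	ZIO_COMPRESS_ZSTD_MAX,	/* zstd level 19 */
	ZIO_COMPRESS_FUNCTIONS
};
```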
But I'm having trouble reasoning about how to handle that, because, you know, if you set the compression level for Zstandard to ten and then you switch to gzip, ten isn't a valid gzip compression level; and the same with LZ4, which doesn't really have compression levels, although it has something we could use like a compression level. I don't know how to have a property that's very tightly coupled with another property, where, you know, what happens when you change from Zstandard to LZ4?
So for now, instead of filling up the enum, I just created the minimum, the default, and the maximum compression levels in the prototype. I did a little benchmark here, compressing the Silesia compression corpus (a standard benchmark for compression): with the minimum level you can compress about 335 megabytes a second per core, with a compression ratio of 2.8 to 1.
Whereas looking at gzip: with the minimum gzip level you only get 2.7-to-1 compression at 77 megabytes a second, and at about that same speed you could get 3.4-to-1 compression with Zstandard. And with gzip-9 you're barely getting the compression ratio that Zstandard would get at 20 times the compression speed, so you can get a lot more throughput.
A couple of weeks ago I was in Paris for EuroBSDCon and was talking to one of the vendors that was there; they run a payment processor in Europe. It's mostly an append-only database, and I was helping them debug some performance problems they were having, and the first thing I noticed for their MySQL database
was that they were using a 128K record size. I assumed it was because they didn't know better, but when we discussed it with them, it turned out they actually do it on purpose, because they get a better compression ratio, and they have a 20- or 25-terabyte database that has to fit entirely on SSDs, and they can only afford so many SSDs. And since it's mostly an append-only database,
they don't get as much write amplification as you would with random access, but they're using the larger record size because it got them an extra, like, 0.5 to 1 on the compression ratio. So obviously stronger compression that would still be fast enough might be quite interesting to them. So I grabbed a database that we have at work, which is our ticketing database for a pay-per-view system; it's about 14.2 gigabytes with LZ4.
We get about 3.8-to-1 compression with the regular 16K blocks we actually use in the database, but if we scale that up to 1-meg blocks in the database, we actually get 5.4 to 1 even with just LZ4, and writing that data takes about 50 seconds, to write the 14 gigs and have it be compressed with gzip.
One of the other interesting features I touched on: part of the reason why Facebook is so interested in Zstandard is that it has this custom dictionary training compression mode. Their main goal with it is that someday browsers will support Zstandard, and they'll be able to send their, you know, ten JSON messages that are based on the same dictionary but with different content, compressed with a custom compression dictionary that is able to abstract out the repeated parts of the structure of the data. And so I was digging through the ZFS code and wondering:
if we wanted to actually offer this to end users, to be able to say "here's a dictionary, or some dictionaries, for the types of files I'm going to write into this dataset", what would the ZFS API for the user to load that dictionary into ZFS look like, and how would we manage them? I don't know what that would look like.
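For reference, the dictionary workflow inside the zstd library itself looks roughly like the sketch below; ZDICT_trainFromBuffer() and ZSTD_compress_usingDict() are real zstd entry points, but how a trained dictionary would be handed to ZFS is exactly the open question.

```c
/* Userspace sketch of zstd dictionary training and use (link with -lzstd). */
#include <zstd.h>
#include <zdict.h>

/*
 * Train a dictionary from a set of samples. samples points at all sample
 * bytes back to back; sizes[i] is the length of sample i. Returns the
 * trained dictionary size (or an error code; check with ZDICT_isError()).
 */
size_t
train_dict(void *dict, size_t dict_cap,
    const void *samples, const size_t *sizes, unsigned nsamples)
{
	return (ZDICT_trainFromBuffer(dict, dict_cap,
	    samples, sizes, nsamples));
}

/* Compress one small record using the trained dictionary. */
size_t
compress_with_dict(ZSTD_CCtx *cctx, void *dst, size_t dst_cap,
    const void *src, size_t src_len, const void *dict, size_t dict_len)
{
	return (ZSTD_compress_usingDict(cctx, dst, dst_cap, src, src_len,
	    dict, dict_len, 3 /* level */));
}
```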
Another interesting thing that Zstandard has recently grown, as a contributed project, is an adaptive compression feature. It was kind of designed before we had compressed send and receive, when people would pipe a zfs send into gzip or multi-threaded gzip or whatever: this Zstandard one will actually dynamically adjust the compression level based on how fast the output is being consumed. So if you're going over a slow network link, it will spend more time compressing, but only up to the point where it's not starving the network link.
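The core of such a feature is a feedback loop: watch how full the output queue is and nudge the level up or down. A toy sketch of that control loop, assuming a hypothetical queue-fill measurement; the real contributed tool's heuristics are more involved.

```c
/*
 * Toy sketch of adaptive level selection. fill is how full the output
 * queue is (0.0 empty .. 1.0 full); it is a hypothetical measurement.
 * A slow consumer lets the queue back up, which means we have CPU time
 * to spare, so compress harder; a fast consumer drains the queue, so
 * back off before we starve it.
 */
static int
next_level(int level, double fill)
{
	if (fill > 0.75 && level < 19)
		level++;	/* slow sink: spend more time compressing */
	else if (fill < 0.25 && level > 1)
		level--;	/* sink is starving: keep it fed */
	return (level);
}
```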
So it dynamically adjusts the compression level to keep up with how fast you can drain the output buffers. That might be very interesting in ZFS, where it's like: compress it as well as we can without slowing down our writes to the disk. Although in ZFS we're writing records that are, you know, even in the best case only 16 megabytes, and in most cases 128K or 16K, so you don't have much time to adapt in that kind of context. But I know [inaudible].
So if you have any ideas of what that might look like, or what extra features you would like from a compressor to make it more integrated or more useful in ZFS, let me know. For example, LZ4 has some interesting features for that, like early abort, where it will decide that it can't compress the data into that small a buffer and won't waste a bunch of time trying to compress it. Maybe something like that would also be nice.
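The way that works with LZ4 is that the caller passes a destination buffer smaller than the source, and the compressor returns zero and gives up early when the output cannot fit. A sketch of the calling pattern; the 12.5% minimum-savings figure is an assumption about ZFS's usual policy, not taken from the talk.

```c
/*
 * Sketch: give LZ4 a destination smaller than the source so it can
 * abort early instead of producing output we would discard anyway.
 */
#include <lz4.h>

int
try_compress(const char *src, char *dst, int s_len)
{
	int d_len = s_len - (s_len >> 3);	/* require >= 12.5% savings */
	int c_len = LZ4_compress_default(src, dst, s_len, d_len);

	/*
	 * LZ4_compress_default() returns 0 when the result cannot fit in
	 * d_len, bailing out rather than compressing the whole block.
	 * A zero here means: store the block uncompressed.
	 */
	return (c_len);
}
```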
Zstandard also has both a block compression API and a streaming compression API; there might be some use for that. And what would be nice is talking about whether there are things we could do to reduce the amount of memory it takes when we're only trying to compress, like, 8K blocks: we really don't want to have to allocate a hundred kilobytes of RAM for the context.
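For a sense of the two APIs being contrasted: a one-shot block call versus a streaming context, both real libzstd entry points. For fixed-size ZFS records, the one-shot form with a reused context is the natural fit, and that reused context is where the memory in question lives. A minimal sketch:

```c
/* One-shot (block) API vs. streaming API in libzstd (link with -lzstd). */
#include <zstd.h>

/*
 * Block form: whole record in, whole compressed block out. The reused
 * cctx holds the working memory (the hundred-kilobyte-plus context).
 */
size_t
block_compress(ZSTD_CCtx *cctx, void *dst, size_t dst_cap,
    const void *src, size_t src_len, int level)
{
	return (ZSTD_compressCCtx(cctx, dst, dst_cap, src, src_len, level));
}

/*
 * Streaming form: feed input incrementally; a better match for send
 * streams than for fixed-size on-disk records.
 */
size_t
stream_compress_chunk(ZSTD_CStream *zcs,
    ZSTD_outBuffer *out, ZSTD_inBuffer *in)
{
	return (ZSTD_compressStream(zcs, out, in));
}
```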