From YouTube: Saso Kiselkov - Compression - OpenZFS Dev Summit 2014
Compression (Saso Kiselkov from Nexenta)
And we're going to be talking about, essentially, what I've been working on: compression. So first, a quick primer on compression. Mostly, when you look at the world of compression algorithms out there, they fall into two categories: you've got the archiving guys and you've got the real-time guys. So, the archivers.
The nice thing about them, though, is that they get really good compression ratios, really good, as you'll see in a minute. And the real-time guys try to do just a good-enough job, but they'll be pretty fast about it. Those are the guys down at the bottom there, if you recognize them.
On non-compressible data and on decompression, LZ4 is pretty damn quick, so this is pretty much why we bother with the real-time stuff. It's not just for crazy things like compressing memory or doing CPU-to-memory transfers. It's also used for actual data, because the data savings are complemented by the fact that we get huge performance out of the ARC, and so there's a sort of middle ground.
Sometimes it's just worth it. Sometimes we don't really care about the initial CPU usage; sometimes we're just concerned with serving the data many, many, many times, so the initial CPU cost gets sort of diluted down, while the bandwidth can cost us a lot. So it makes sense to do it right the first time around. Certain stuff even compresses super-duper well: take your average web server's text files, logs, stuff like that; it compresses down really well, up to and over ninety percent, logs even better than that.
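The split the talk describes, text compressing past ninety percent while pre-compressed data saves nothing, is easy to reproduce. A minimal sketch using Python's zlib as a stand-in for the gzip/LZ4 compressors ZFS actually uses (the byte strings here are made-up sample data):

```python
import os
import zlib

# Typical "web server" content: highly repetitive, log-like text.
text = b"GET /index.html HTTP/1.1 200 1024\n" * 4096
# Random bytes stand in for pre-compressed media: incompressible by design.
noise = os.urandom(len(text))

# Fraction of space saved by compressing each buffer.
text_saving = 1 - len(zlib.compress(text)) / len(text)
noise_saving = 1 - len(zlib.compress(noise)) / len(noise)

print(f"text saves  {text_saving:.0%}")   # well over 90%
print(f"noise saves {noise_saving:.0%}")  # essentially nothing (can go negative)
```

The log-like text shrinks by more than 90%, while the random buffer actually grows slightly once compression framing is added.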
However, certain workloads do not compress much at all: pre-compressed stuff. You've got your multimedia, your compressed archives, stuff that's already getting crunched by some compression algorithm; that's pretty much a non-target. And so when we, as administrators, look at setting up compression on our datasets, what are we looking at? Initially we look at: are we going to pay for the CPU cycles? But usually we're pretty good on storage systems; these things have reasonably fast CPUs and they're underutilized. The second question is: am I going to be getting something out of it? We ask that question and we try to intuit the answer in some strange fashion, probably akin to mind-reading what our datasets might be composed of. So we try to gauge whether it makes sense to turn it on. The funny thing is, we should not even be thinking about that.
We should be letting the machine figure that out for us. And the beauty about file systems is, usually when you look at files, they're either one or the other: they're either compressible or they're not compressible, usually. So you've got your text, you've got your documents, your uncompressed audio; that's super compressible. And you've got some stuff that's never going to compress much at all. Unfortunately for us as administrators, though, compression settings are per file system, so there's not much we can do in a fine-grained approach.
Computer science has been all about them for the last 20 years, but so far nobody's actually built a CPU capable of running the damn thing. And decompression is really just grammar processing: you get a compressed stream, which is a complex grammar of various compression primitives, and you're just trying to reconstruct the original data. The more expressive the grammar, the slower it's going to be, pretty much. So these are sort of the two areas we can understand, and we cannot do much about the uncompressible stuff, though it has this range.
Why would that be slow? Usually, as I said, we can profile pretty well whether it's going to compress or not, so why the hell are you spending time trying to figure out the same bit of information over and over again? We're basically compressing a block and forgetting the history of what we've been doing previously.
Let the user decide? Yeah, good luck trying to keep track of a zillion files and setting the compression on them. There are tricks you can do with directories and stuff like that, but pretty quickly you're going to be sick and tired of it; you're just going to turn it on or off again, which gets back to the file-system-wide setting, which originally was a sensible decision. But this is not. And of course, frequently the administrator is not even the user who's actually using the storage system.
We did not modify the on-disk format; we just have ZFS make the decision based on historic performance. There's a simple heuristic: it just checks how often I've succeeded. If I've not succeeded much at all, I'm not going to try for a while; then I'm going to retry again, and it progressively either backs off or becomes more reluctant to back off.
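The back-off heuristic just described can be sketched in a few lines. This is a hypothetical illustration, not the actual ZFS code: the class name, counters, and the doubling-with-a-cap policy are all assumptions chosen to match the description of "don't try for a while, then retry, progressively backing off."

```python
class CompressHistory:
    """Hypothetical sketch of the back-off heuristic: track recent
    compression successes, and after failures skip actual compression
    for a progressively longer window before retrying."""

    def __init__(self):
        self.skip = 0      # blocks left to skip without even trying
        self.backoff = 1   # length of the next skip window

    def should_try(self):
        """Return True if the next block is worth trying to compress."""
        if self.skip > 0:
            self.skip -= 1
            return False   # history says it won't compress; don't bother
        return True

    def record(self, succeeded):
        """Feed back the result of an attempted compression."""
        if succeeded:
            self.backoff = 1                       # success: stay eager
        else:
            self.skip = self.backoff               # failure: skip a window
            self.backoff = min(self.backoff * 2, 256)  # grow it, capped
```

After a run of failures the tracker only retries every few hundred blocks, so incompressible files cost almost no CPU; a single success resets it to eager again.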
The beauty about this approach is it works for any file, any data. You don't lose much compression ratio at all; so far I haven't actually seen any loss. And it works even for composite stuff: take your VM disks, which are full of all manner of data, and we can sort of dynamically adapt to the various write patterns going on in there.
So, yeah, there's a couple of confusing lines here, but essentially the important lines are the top one and, sort of, the bottom one. This is compression performance with a certain amount of constant input data flowing in that's not compressible; essentially your garbage.
So you can see there's about, I don't know, a 20% performance increase. Actually it's even greater, because for some reason gzip-1, the regular compression path, is slower when it gets fed incompressible crap, so the performance difference there is actually much larger.
So the algorithm works essentially by remembering its results. We track the compression performance on a per-file basis; whether we compress a block depends on what the current state of the file is. So when it has been hit by a lot of incompressible data, chances are we'll skip it; it will just be accounted as a bunch of data we tried that didn't compress, and we'll try again later on. When the file has been getting good compression, we'll try and compress it.
So the thing with LZ4's test for whether something is compressible is that it is fast, that's true, but it's sort of a quick check, so it will only loosely determine whether something's good. So there's a good chance that it will impact your compression ratios pretty severely. It's actually tunable inside of lz4.c how hard it should try, and the effect of that is:
With LZ4, you'll see it's not there: the non-compressible performance is exactly, or almost, the same as the compressible one.
It's because if you do these quick checks, chances are you'll throw away data that you could have compressed. It's, unfortunately, a trade-off. But the beauty about the smart compression approach is that we don't even have to do that check anymore; we can remember our previous performance. That's the point of the compression history.
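The trade-off above, a fast check that can wrongly discard compressible data, can be illustrated with a sketch. This is not LZ4's actual early-abort logic (that lives in lz4.c); the function name, probe size, and threshold are assumptions, and zlib stands in as the cheap probe compressor:

```python
import os
import zlib

def quick_incompressible(block, probe=1024, min_saving=0.05):
    """Hypothetical stand-in for an early-abort check: cheaply compress
    only a small prefix and declare the whole block incompressible if
    the probe saved less than min_saving of its size."""
    probe_data = block[:probe]
    compressed = zlib.compress(probe_data, 1)   # fastest, cheapest level
    return len(compressed) > len(probe_data) * (1 - min_saving)

random_block = os.urandom(128 * 1024)
# A block whose first 4 KiB is random but whose tail is highly compressible.
mixed_block = os.urandom(4096) + b"A" * (124 * 1024)

print(quick_incompressible(random_block))  # True: rightly skipped
# The trade-off: the random prefix hides a compressible tail.
print(quick_incompressible(mixed_block))   # True: wrongly skipped
```

The mixed block is over 95% compressible, yet the probe only ever sees its random prefix and gives up, which is exactly the ratio loss the per-file history avoids by judging on actual past results instead of a prefix.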
So, yeah, there are no persistent changes in that; we do not modify the on-disk state. There's one property we have, which is certainly not really a format change; it's not incompatible, and it will turn on automatically. If you have compression turned off on the dataset, it's sort of a separate setting, but by default it's on: when you do use compression on your dataset, you're going to get this feature automatically.