From YouTube: OpenZFS Developer Summit Part 6
http://www.beginningwithi.com/2013/11/18/openzfs-developer-summit/
Performance on full/fragmented pools (George Wilson)
So our customer base is very well known for having imbalanced pools. This is the case where you start off with your pool configuration — you have, say, four LUNs — and you start loading a bunch of data, and then sure enough: I'm out of space, I need to add more space. So now I have four LUNs that are full and four new ones. I have this big disparity between LUNs that are full and LUNs that are completely empty, and performance can nosedive from there.
That's a problem our customers face a lot, just because they're always adding more and more data. Somebody with a storage appliance may start off with massive capacity — lots of terabytes, or a petabyte of storage — and maybe they don't see the issue because they've had it all at once. Our customers don't do that; they tend to add it piecemeal.
If you're, you know, 39% full and the other device is at 38.75% full, then it's fine, because new writes kind of bring them back into balance and they're both at 39%. But when one is at 90% and the other one is at zero, you can't get there and balance them out fast enough. So we came up with this thing, which is in OpenZFS.
I don't think we've really talked very much about it, but it's called zfs_mg_noalloc_threshold, and the whole idea behind it is that it sets a threshold of free space that you want to maintain on your device. As long as there's that much free space on the device, the device is eligible for allocations; if the free space on that specific device drops below the threshold, then it's no longer considered eligible.
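The shape of the check is simple. Here is a minimal C sketch of the idea, assuming a percentage-based comparison — the helper name and layout are illustrative, not the actual OpenZFS metaslab.c code:

    #include <stdint.h>

    /* Tunable from the talk: percent of free space to preserve per device. */
    int zfs_mg_noalloc_threshold = 0;

    /* Hypothetical helper: is this device (metaslab group) allocatable? */
    static int
    mg_allocatable(uint64_t free_space, uint64_t total_space)
    {
            uint64_t free_pct = free_space * 100 / total_space;

            /*
             * A device stays eligible for normal allocations only while
             * its free space is above the threshold; once it drops below,
             * the allocator skips it so emptier devices fill up first.
             */
            return (free_pct > zfs_mg_noalloc_threshold);
    }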
What we did was make this change with zfs_mg_noalloc_threshold so that, again, I can fill the pool up, and once I get to a certain point and add my additional disks, this time when I actually start to write, I write much, much less to the full devices. The writes that we still allow to go to those devices are the minimum size that you can accept — 512 bytes. That's what you get on those devices, simply because if we fail a 512-byte allocation, we can't gang it.
A failed 512-byte allocation turns into an error, so we at least allow that size to go through. But then we write the big data onto the other devices, and we keep writing until they all come up to the same threshold. Once that happens, you're allowed to write the same amount of data to all devices — you start striping across them all once again. That was a change we made because we needed to force things to get to these new devices much, much faster.
The old behavior just was not aggressive enough. You would see instances where — with no RAID-Z or mirrors, just simple stripes — we have a stripe width that effectively says you do, I think, 512 meg per device before you round-robin to the next, and it was like 400 to one and 600 to the other. It wasn't balancing.
So in this case, the small writes are metadata writes that happen to be 512 bytes, which I can't fail. If you've spent any time in the metaslab allocation code: if you try to write the smallest unit that ZFS allows — 512 bytes — and that ever fails, then it's considered an I/O error, and that causes the entire pipeline to stall. So if you ever try to write 512 bytes to that device, it had better succeed, because we can't gang it.
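Gang blocks are ZFS's fallback when a full-size allocation fails: the block gets split into smaller pieces tracked through a gang header. A 512-byte block is already the minimum block size, so there is nothing smaller to split it into. A rough C sketch of that reasoning, with hypothetical stand-ins for the real allocator entry points:

    #include <errno.h>
    #include <stdint.h>

    #define SPA_MINBLOCKSIZE 512    /* smallest block ZFS will allocate */

    /* Hypothetical stand-ins for the real allocation routines. */
    static int try_allocate(uint64_t size) { (void)size; return (-1); }
    static int allocate_as_gang(uint64_t size) { (void)size; return (0); }

    static int
    alloc_or_gang(uint64_t size)
    {
            if (try_allocate(size) == 0)
                    return (0);             /* plain allocation succeeded */

            if (size > SPA_MINBLOCKSIZE)
                    return (allocate_as_gang(size));    /* split it up */

            /*
             * A minimum-size allocation cannot be ganged: there is no
             * smaller piece to fall back to, so the failure surfaces as
             * an I/O error and stalls the pipeline. That is why even a
             * full device must keep accepting 512-byte writes.
             */
            return (EIO);
    }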
B: So it sounds like this is an ease-of-implementation issue, rather than something you intentionally made because allocating 512 bytes there is the right thing to do? It's kind of an artifact of wanting to implement it in an easy, straightforward way.
A: It is a bit of a crutch, yes. Again — as Adam kind of alluded to earlier, somebody was asking questions like: how do we do these performance changes, and how do we roll them out?
By default on OpenZFS it's there as a tunable; we've enabled it on some of our customers so that we can see how it behaves and how it performs. What we're learning from all this is that as we run across these different scenarios and discover something that's effective, we can find ways to actually leverage it and enhance it — and I'll get to that.
For us, when you're in this situation, read performance is typically fine: most of your data is actually on the full devices because it hasn't been striped across, so for reading it's not that big of a deal. Where you're losing is write bandwidth.
With the new enhancement, you're writing most of your data to the new devices, so instead of getting four spindles of bandwidth you're technically getting two. But for our customer base — and I would assume anybody who's been in this situation, where they start off with some disks and keep adding more — they were already getting two spindles' worth of bandwidth, and now they're still getting two spindles' worth of bandwidth.
The other thing that we've been working on is trying to understand free space. Again, Adam asked the question about 30,000 segments — was that a large number?
We never looked at the number of segments that actually made up a space map. Everybody was happy, everybody was running fast, and nobody really complained about it. Well, you look at the number of segments and that doesn't really tell you very much, because you don't know how the free space is laid out. You could look at one metaslab and it may have 30,000 segments, or it may have 250,000 segments, and it has, you know, four gigs of free space.
What does that mean? I have no idea. I have a lot of segments on one, fewer segments on the other — but still a high count — and the same amount of free space. So the thing we wanted to do was create a histogram and start storing it on the pools. There's a new feature flag out there that you enable, and it upgrades your space map objects to start storing histograms.
These are just power-of-two buckets of how the free space is comprised on disk, so for the first time it gives us a view of how that free space is made up. Now we can figure out that those four gigs might be nothing but a bunch of 2K segments, and that might explain why we have 250,000 segments: they're all really, really small, and performance is going to be really, really bad on that device.
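Mechanically, a power-of-two histogram just counts each free segment in the bucket of floor(log2(size)). A minimal sketch of that idea — the bucket layout and names are illustrative, not the on-disk space map format:

    #include <stdint.h>

    #define HIST_BUCKETS 64

    /*
     * Count a free segment in the power-of-two bucket for its size:
     * bucket 9 covers 512..1023 bytes, bucket 10 covers 1K..2K-1,
     * and so on up the scale.
     */
    static void
    histogram_add(uint64_t hist[HIST_BUCKETS], uint64_t seg_size)
    {
            int bucket = 0;

            while ((seg_size >>= 1) != 0)
                    bucket++;               /* floor(log2(size)) */
            hist[bucket]++;
    }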
The other thing — today, what's available is that you can get histograms for metaslabs (which is misspelled on the slide).
We've now enhanced this to have it for devices, so you get histograms at the device level, and you're also going to be able to pull this at the pool level. We've been spending a lot of time adding more observability into the way that free space is laid out, so we can have a better understanding of how to deal with fragmentation — especially how you allocate in a highly fragmented pool. The other thing we added was a new block allocator, and again, this is not enabled by default.
It keeps track of the largest segment on a metaslab, simply consumes everything within that segment, and then goes to find another one. The current allocator, which has been shipping for a while, actually has two modes. It has a mode where it tries to find everything based off offset — it's trying to keep things in offset order. So if you last allocated at offset 1000, it's looking for allocations that are close to 1000, but slightly greater, that will satisfy the allocation. Once you get to the point where the space in that metaslab is mostly consumed, it starts going into a best-fit mode, where everything is going to be random. So it has this two-mode property. The new allocator does not: it always goes to the largest segment and consumes that segment in its entirety before moving on. This may help some workloads out there.
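A minimal sketch of the largest-segment behavior, assuming a flat array of free segments and a cursor — the real allocators walk the metaslab's range trees, so these names and structures are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct seg {
            uint64_t start;         /* offset of the free segment */
            uint64_t size;          /* bytes remaining in it */
    } seg_t;

    static seg_t *cursor;           /* segment currently being consumed */

    /*
     * Pick the largest free segment once, then carve every allocation
     * out of it sequentially until it is exhausted; only then hunt for
     * the next-largest one. One behavior regardless of how full the
     * metaslab is — no offset-order mode, no best-fit fallback.
     */
    static uint64_t
    alloc_from_largest(seg_t *segs, size_t nsegs, uint64_t size)
    {
            if (cursor == NULL || cursor->size < size) {
                    cursor = NULL;
                    for (size_t i = 0; i < nsegs; i++)
                            if (cursor == NULL || segs[i].size > cursor->size)
                                    cursor = &segs[i];
            }
            if (cursor == NULL || cursor->size < size)
                    return (UINT64_MAX);    /* nothing big enough */

            uint64_t off = cursor->start;
            cursor->start += size;          /* consume from the front */
            cursor->size -= size;
            return (off);
    }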
We're also changing the way that we handle metaslab selection. How many people are familiar with the way that ZFS does allocations on disk? I'll just briefly go into it.
There are three different phases to an allocation when you're trying to allocate a block. The first thing it does is determine which device to allocate from — or a metaslab group; for those of you familiar with metaslab groups, you can think of them as devices. So it's first going to pick a device; then, within that device, it's going to look at any one of its 200 metaslabs and pick the best metaslab to allocate from; then, within that metaslab, it's going to find the best region to actually satisfy the allocation. So it's device, metaslab, block, effectively — those are the three layers that it goes through.
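A compressed C sketch of those three layers — the types and selector functions are hypothetical stand-ins with bodies elided, not the real metaslab_alloc() call chain:

    #include <stdint.h>

    typedef struct pool pool_t;             /* opaque stand-ins */
    typedef struct device device_t;
    typedef struct metaslab metaslab_t;

    /* Hypothetical selectors, one per layer; bodies elided. */
    extern device_t   *pick_device(pool_t *, uint64_t);
    extern metaslab_t *pick_best_metaslab(device_t *, uint64_t);
    extern uint64_t    pick_region(metaslab_t *, uint64_t);

    /*
     * The three phases of a block allocation:
     *   1. pick a device (metaslab group),
     *   2. pick the best of that device's ~200 metaslabs,
     *   3. pick a region inside the chosen metaslab.
     */
    uint64_t
    allocate_block(pool_t *pool, uint64_t size)
    {
            device_t   *dev = pick_device(pool, size);        /* 1 */
            metaslab_t *ms  = pick_best_metaslab(dev, size);  /* 2 */
            return (pick_region(ms, size));                   /* 3 */
    }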
A
The
problem
that
we
had
is
that
the
meta
slabs,
the
way
they're
ordered
only
took
into
account
free
space
and
again,
you
know,
we've
just
talked
about
how
we
didn't
know
what
free
space
you
know
how
free
space
was
comprised
of
that
metaslab,
because
all
we
knew
was
a
number
well
now
we
actually
have
some
histogram
information.
We
have
the
way
that
the
meta
slabs
are
how
the
free
space
is
broken
up.
A
It
gives
us
the
advantage
to
actually
come
up
with
a
new
weighting
mechanism,
so
we've
come
up
with
a
metric
to
measure
fragmentation
at
a
metaslab
level,
and
we
can
now
use
that
as
a
weighting
mechanism
to
weight
metaslabs
that
are
highly
fragmented
down
weight,
metaslabs
that
are
less
fragmented
up.
So
this
gives
us
the
ability
to
actually
pick
the
best
metaslab,
truly
the
best
metas
lab
to
go
do
allocations.
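A minimal sketch of one way such a weighting could look — scaling free space down by the fragmentation percentage. This is purely to illustrate the down-weighting; the actual weight calculation in OpenZFS differs:

    #include <stdint.h>

    /*
     * Weight a metaslab by its free space, scaled down by its
     * fragmentation percentage (0 = pristine, 100 = fully fragmented),
     * so a badly fragmented metaslab sorts below a cleaner one with
     * the same amount of free space.
     */
    static uint64_t
    metaslab_weight(uint64_t free_space, int frag_pct)
    {
            return (free_space * (uint64_t)(100 - frag_pct) / 100);
    }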
Well, if the demand isn't there, then maybe we shouldn't pick the best one, because all we're going to do is Swiss-cheese it up. Let's wait and save the best one for when there's actually a lot of work to be done. That isn't implemented yet — we've been kicking it around — but it's an idea we've had of trying to take everything we've discovered so far: we now know how much workload is coming in, based on the write-throttle
work that's been done, so we can keep track of how much dirty data there is to be written. If the amount of dirty data to be written is actually very low, then maybe we don't try as hard to go find a really pristine, awesome metaslab. Maybe we just say: okay, you're going to get this one that's mostly fragmented, because we can waste a few cycles there — simply because nobody is really requesting a lot of work from us. We think that could be a win for us.
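This tie-in between the write throttle's dirty-data accounting and allocation effort is, per the talk, an idea rather than implemented code. A sketch of the shape it could take, with an invented threshold:

    #include <stdint.h>

    typedef enum { EFFORT_LOW, EFFORT_HIGH } effort_t;

    /*
     * Scale how hard the allocator hunts for a pristine metaslab by
     * how much dirty data is pending. The 10% cutoff is invented for
     * illustration.
     */
    static effort_t
    allocation_effort(uint64_t dirty_bytes, uint64_t dirty_max)
    {
            if (dirty_bytes < dirty_max / 10)
                    return (EFFORT_LOW);    /* idle: any metaslab will do */
            return (EFFORT_HIGH);           /* busy: find a clean one */
    }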
D: Just another point on that: we can actually use that idleness in the system — just to draw a finer point on it — to optimize our space maps. Knowing that, rather than having 30,000 segments, if we can go find exactly the right place to drop that puzzle piece, then we only have 29,999 segments. If we do that enough, we can actually make it a little more manageable.
So slowly we're trying to take the information that we're learning and be much more intelligent about how we pick the right block, pick the right metaslab — and, you'll see, even pick the right device. The other thing we've done is add preloading of metaslabs.
What we've learned is that any time you actually have to load a metaslab while you're in the allocation code path — so you're in write context — it's very, very painful. So we now figure out which metaslabs on each device are the best ones that we anticipate we'd need to load, and we go load N of those every time we finish a transaction, so they're already loaded in memory and in hand when the next transaction comes through.
When it goes to do allocations, everything's already there; we don't have to spend I/O cycles to go read them and load them in. Hopefully that's enough. That's another area for future work, I think, where we can determine what N is based on the information that we're getting about incoming dirty data: if we know we're getting a gig's worth of dirty data coming in, then N had better be large enough to handle the gig that's going to get synced out in the next transaction.
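A short sketch of the preload loop, assuming hypothetical helpers for ranking and loading candidates — OpenZFS does this as part of metaslab group preloading at the end of sync, but the code below is illustrative:

    typedef struct metaslab metaslab_t;

    /* Hypothetical helpers: rank and load candidate metaslabs. */
    extern metaslab_t *best_unloaded_metaslab(int devidx);
    extern void        metaslab_load(metaslab_t *ms);

    /*
     * After each transaction group finishes, pre-load the N best
     * metaslabs on every device, so the next txg's allocations find
     * their space maps already in memory instead of paying a read
     * while in the write code path.
     */
    void
    preload_metaslabs(int ndevices, int n)
    {
            for (int d = 0; d < ndevices; d++) {
                    for (int i = 0; i < n; i++) {
                            metaslab_t *ms = best_unloaded_metaslab(d);
                            if (ms == NULL)
                                    break;          /* all loaded */
                            metaslab_load(ms);
                    }
            }
    }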
Yeah — it's going to vary based on what record size everybody has, but once you kind of have an idea of what blocks are coming in, you can make assumptions. Say, even with compression, I'm going to be looking for, like, an 8K block; if what you're actually recording is an 8K block, maybe I can fill, you know, a 4K segment — but at least I can get an idea of what I have coming in.
No — this is more about trying to understand the types of I/Os that are coming in. I think this is more than finding the right puzzle piece; it's how do we know which puzzle piece to go after. If I know that I have, you know, a hundred 8K segments coming in, and that's my load, then I'm looking for metaslabs that potentially have a hundred open spaces of at least 8K — but not too much more.
It gets updated every txg. If you've spent any time in the metaslab code — it's not going to be 100% accurate every single time, because there are cases where we actually do frees where the space map isn't loaded.
A
So
what's
interesting
here
is
just
as
we
were
using
this
fragmentation
metric
to
weight,
a
meta
slab
and
try
to
find
the
right
one.
Now
we
can
do
a
similar
thing.
If
you
go
back
when
I
was
talking
about
the
imbalanced
line,
we
had
this
hard-coded
value.
Well
now
what
if
this
value
become
kind
of
becomes
the
metric
of
which
devices
do
I
go
after
to
actually
look
for
allocations,
ones
that
are
more
fragmented?
A
B
We
can
we
can
assert
100,
fragmented
means
that
it's
all
in
the
smallest
chunk
size
possible,
so
like
all
512
by
three
chunks
or
one
k.
Actually
one
k
one.
And again, this is kind of — it's a little nebulous, as Matt was saying. We're trying to come up with a relatively good metric that makes sense to us here, and the idea was that a 16-meg segment is probably large enough to be able to satisfy most allocations of any workload that you generate. We may find that 16 meg isn't quite right, but at least from some of the data we've seen, 16 meg seems to be satisfactory and does not cause performance problems.
A
So
if
we
can
say
that
anything,
that's
16
meg,
any
segment,
16,
meg
or
larger
is
considered
not
to
be
fragmented.
We
use
that
as
our
first
metric
and
then
we
can.
We
step
it
down
as
you
go
from
16
meg,
all
the
way
down
to
512
bytes.
So
1k
is,
I
think,
at
1k.
If
your
segments
are
nothing
but
1k
is
100
fragmented,
so.
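A sketch of the table-driven metric this implies: each histogram bucket contributes a fragmentation percentage, and the metaslab's score is the space-weighted average. Only the endpoints (1K = 100%, 16M and up = 0%) come from the talk; the intermediate percentages below are invented for illustration and are not the table OpenZFS ships:

    #include <stdint.h>

    /* Per-bucket fragmentation, for segment sizes (1K << i). */
    static const int frag_pct[] = {
            /* 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M 8M 16M+ */
            100, 95, 90, 85, 75, 65, 55, 45, 35, 25, 20, 15, 10, 5, 0,
    };
    #define FRAG_BUCKETS (sizeof (frag_pct) / sizeof (frag_pct[0]))

    /*
     * Space-weighted average over the free-space histogram, where
     * hist[i] counts free segments of size (1K << i).
     */
    static int
    metaslab_fragmentation(const uint64_t hist[FRAG_BUCKETS])
    {
            uint64_t space = 0, weighted = 0;

            for (unsigned i = 0; i < FRAG_BUCKETS; i++) {
                    uint64_t s = hist[i] * (1024ULL << i);
                    space += s;
                    weighted += s * (uint64_t)frag_pct[i];
            }
            return (space == 0 ? 0 : (int)(weighted / space));
    }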
So then we factor all that in. This gives you an idea that you have things that are probably somewhere in the four-to-eight-meg range of segments — actually, probably less than that: I think 50% might be 128K segments, so most of your segments are 128K — although, if you look at the real histogram...
So we've added to zdb the ability to actually dump out the metaslab information — the space map information — and that's zdb -m. So maybe what we need to do is just remove it so it never prints under -d, to solve your problem; I don't know why.
There were several solutions, and this was the one that seemed to fit the nicest: being able to take into account the entire range of all the free-space segments in the metaslab and still represent it as one number. And that one number then gives us the opportunity to make better decisions when we go to actually select devices.
A
So
one
thing
to
note,
too,
is
if
you're
upgrading
to
this
feature,
the
space
map
objects
have
to
actually
get
reallocated
in
order
for
you
to
in
order
for
them
to
use
the
larger
bonus
buffer,
so
older
objects
won't,
have
it
newer
objects,
will
you'll
see
if
you
upgrade
fragmentation,
won't
get
won't
actually
show
up
in
z
pool
list
until
like?
I
think
I
have
it
right
now.
At
50
percent
of
the
mena
slabs
within
a
device
have
been
upgraded,
so
if
fewer
than
50
have
been
upgraded,
then
you
just
get
a
dash
there.
A
It
doesn't
show
anything
until
over
time.
They
upgrade
we're
also
adding
logic
to
actually
kind
of
speed
that
process
along
just
because
we
definitely
want
to
get
more
information
about
free
space.
So
we
can
understand
how
to
improve
things.
F
Okay,
question
for
sure.
We hit a point where we actually fan out to a bunch of threads and go through the allocation in this kind of parallel fashion. The problem with that is, if you're allocating data, data, data, data and maybe metadata, metadata, metadata, when it all gets fanned out it kind of all meshes together. So you can get into a mode where you get allocations on the same device of larger block, smaller block, larger block, smaller block — and over time this can add to the Swiss-cheesing.
So every device now has an allocation threshold — effectively, the number of allocations it has pending. When we go to allocate the next block and issue it down to the vdevs, if a device has already reached that threshold, we just skip over it and find one that has already processed its allocations. More than likely, that device either doesn't have as much workload or has run through its queue much faster, meaning it probably doesn't have as much fragmentation.
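A minimal sketch of that per-device throttle — skip any vdev whose pending-allocation count has hit the limit. The struct and the linear scan are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct vdev {
            uint64_t pending_allocs;        /* allocations in flight */
    } vdev_t;

    /*
     * Pick the first device with room in its allocation queue;
     * saturated devices are skipped, which naturally steers work
     * toward devices that drain faster (and so are likely less
     * fragmented).
     */
    static vdev_t *
    pick_unsaturated_vdev(vdev_t *vdevs, size_t nvdevs, uint64_t limit)
    {
            for (size_t i = 0; i < nvdevs; i++)
                    if (vdevs[i].pending_allocs < limit)
                            return (&vdevs[i]);
            return (NULL);          /* all saturated: caller must wait */
    }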
I think that's all I had.
I don't think so — the thing that we've discussed, and the thing that makes more sense to me, is metaslabs that are actually size-based. The problem you have there is that — I think the original intent of there being 200 of them was to allow you to have devices as small as 64 meg, so that you could carve that up 200 ways. If we made our metaslab sizes 2 gig, then obviously you wouldn't be able to have that. But I think that a fixed number doesn't make sense.
In my opinion, rather than saying we're going to pick a number, I think we're better off picking a size — which may limit things — and saying the smallest device that we'll ever support is, you know, a 4-gig device or an 8-gig device, or whatever that size is. But I think that's an area for investigation.
There are actually a lot of benefits to something like this for the ZIL, because the ZIL, even though it uses metaslabs to do its allocation — it makes no sense there. All you really want to do on a ZIL device is scan and do, effectively, plus-plus, cursor-based allocation. That's really what you would like to see: you'd like to say, I start here, I keep allocating until the end, and then I wrap around and start over. And on a solid-state device —
sizing, too, because today we would try to chunk up that, you know, 4-gig device 200 ways, and you end up with small metaslabs that you end up having to toss back and forth, and it doesn't make as much sense. But Adam, to your point, I think that's an area we need to go and really look at; we have not looked in that area to see what it should look like.
And part of it is that the weighting factor that's in there tries to provide a bonus to metaslabs in the lower LBA range. So, in a way, you have to really, really work hard to fill up the first half of your device before it even starts to consider something at the very end.
Agreed — and I think that with the new metaslab fragmentation weighting, that's actually going to come into play much earlier. You're going to find that, even though I'm still allowing a higher weighting for lower LBA ranges, as those get consumed their fragmentation metric goes way up and you start losing that doubling effect — it starts to go away, which makes the metaslabs toward the end more enticing to go do allocations on.
C: So why are we biasing toward one part of the device? It doesn't mean anything in terms of how it impacts real disks.