A: Okay, hello everybody, welcome back. I hope everyone is suitably caffeinated. I've got great news: the lunch has arrived, so there's only one small presentation from George before we can have lunch. Just a quick reminder for everybody who's here: we are doing a hackathon tomorrow, same location, from ten o'clock until, I think, as long as we last, of OpenZFS hacking. In light of that, we wanted to generate some ideas from inside the room. We've got a free t-shirt to give away, which is modeled by my able assistant. Yeah, I think we need a collective "ooh". OK, so we're going to give it out to the best hackathon idea. Raise your hands, I'll throw you the mic, and you can come up with the best hackathon idea. It can be super complicated, it can be one I've already nominated you for whether you like it or not, anything like that. So the floor is open.
B: Thanks, Ryan. So, out of curiosity, how many people here have really dug into ZFS performance?

The question was how many people have run into performance problems? OK, cool. So what I want to go through is — since this is a community forum, this is going to be a community presentation, so I expect a lot of participation. We're all going to, for the next however long, start looking at a performance problem, figure it out, and come up with proposed solutions. So we'll look at how pool performance can be impacted.
What is out there today? What are the things that we've already done, and how do they improve things? And then we'll take a closer look at performance problems and talk a little bit about this thing that Matt mentioned, which is the allocation throttle. So let's start — everybody get excited — we're going to start looking at a performance problem, or maybe there isn't a performance problem. If somebody came to you with this type of information — and this is actually a real system — what things could you glean from it?
Right, so sometimes good, sometimes bad. We do have quite a bit of space, and we can also take a look and see that fragmentation doesn't look too bad. So at first glance we might look at this and say: okay, maybe this pool will actually perform well. What if we take a closer look at it? What do we see here? As we start looking at this pool, now we're looking at each of the individual devices that make it up — what are the things that we notice?
So it looks like devices have been added, so some devices have more free space than others. It's true, some devices are bigger than others. This is actually a very common thing that we see with our customer base. How many people configure the system correctly the very first time, every time? Okay, there's at least four of you. Well, unfortunately, not all our customers do that. Most often, what you see is you create a pool, you start off with some configuration, and then later you decide to change your mind.
B
Primarily
you
change
your
mind
because
you
need
more
space
or
maybe
you
can't
buy
that
drive
anymore,
so
you
get
something
larger
well.
So
in
this
case
we
have
quite
a
number
of
disks
that
are
actually
twice
as
big
as
the
original
ones
that
were
first
added.
We
also
see
that
there's
a
bunch
of
devices
that
are
over
the
eighty
percent
threshold
and
those
that
have
been
using
ZFS
for
quite
some
time.
Eighty
percent
has
been
a
number
that
has
been
out
there
for
quite
some
time.
As
like
the
known
cliff
of
performance
problems.
B
Oftentimes,
you
don't
get
to
eighty
percent
your
to
get
to
eighty
percent.
You
might
be
lucky
I'll
tell
you
that
this
particular
system
has
actually
run
close
to
ninety
percent
at
times
barely
run
at
times,
and
we
also
have
kind
of
a
big
disparity
on
free
space.
So
we're
not
expecting
this
pool
to
perform
great.
So what are the things that we've seen with that particular pool when we started looking at this? This has been an ongoing performance investigation for us. We see this not only in an internal system, which this happens to be, but also in our customer base, which goes through these scenarios quite frequently.
We notice that, as devices start to get full, they take a lot longer to allocate. Oftentimes that's because they're fragmented, so the actual finding of blocks takes a long time and uses a lot of CPU. Writes that we would intend to be sequential turn out to be a bunch of random writes, because the free space is scattered all over these devices.
Okay, a few of you. Just a quick synopsis of metaslabs so that everybody understands them: you can think of them as regions on a disk. The way that ZFS does allocations is it takes an individual device and carves it up into approximately 200 equally sized regions that we refer to as metaslabs. The question is, why 200? Because 100 didn't seem right and 300 seemed like too much. The reality is, 200 is a number that has existed from the beginning of the ZFS days.
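As a rough illustration of that carving step, here is a minimal sketch (not the actual ZFS code) of how a device could be split into roughly 200 power-of-two-sized regions. The constant and the shift arithmetic mirror the idea described above; names like metaslab_size and the 16 MB floor are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: pick a metaslab size so a device of 'asize'
     * bytes ends up with roughly 200 regions. */
    #define METASLABS_PER_VDEV 200
    #define MIN_MS_SHIFT       24   /* assumed floor of 16 MB per region */

    static uint64_t
    metaslab_size(uint64_t asize)
    {
        uint64_t shift = MIN_MS_SHIFT;

        /* Grow the region size (always a power of two) until the
         * device holds no more than ~200 of them. */
        while ((asize >> shift) > METASLABS_PER_VDEV)
            shift++;
        return ((uint64_t)1 << shift);
    }

    int
    main(void)
    {
        uint64_t asize = 4ULL << 40;          /* a hypothetical 4 TB vdev */
        uint64_t ms = metaslab_size(asize);

        printf("metaslab size: %llu MB, count: %llu\n",
            (unsigned long long)(ms >> 20),
            (unsigned long long)(asize / ms));
        return (0);
    }

For the hypothetical 4 TB device this prints 32 GB regions, 128 of them — in the same ballpark as the "approximately 200" described in the talk.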
But anyway, when you have these different regions, in order to be able to allocate from a region you have to load it first. So we carve up the disk into approximately 200 equal-sized regions, and when you have a lot of fragmentation and devices are full, you're having to load these various regions throughout the disk looking for space. We also found that when you have a configuration like we just saw, where the devices are imbalanced and of different sizes...
B
You
are
not
getting
the
full
efficiency
of
the
the
devices
themselves,
so
we're
leaving
a
lot
of
performance
on
the
table
when
we
actually
start
doing
rights
and
again
we
don't
recommend
people
have
a
lot
of
imbalance
luns,
but
we
have
found
that
in
our
customer
base-
and
this
may
be
true
of
system
administrators
throughout
the
world-
nobody
gets
the
pool
right,
the
first
time,
or
maybe
the
second
or
third
or
twentieth.
We
always
kind
of
start
adding
and
that's
the
life
cycle
of
ZFS.
It's
one
of
the
beauties
of
ZFS.
B
We
can
actually
add
devices
to
an
existing
system,
but
we
want
performance
to
kind
of
at
least
be
on
par
as
we
do
this
okay.
So
so
we
listed
a
bunch
of
problems.
We
actually
came
up
with
some
solutions,
so
we're
not.
We
don't
have
to
solve
all
those
problems.
During
this
talk
today,
we
are
going
to
try
to
solve
a
couple
of
them,
so
I'll
do
a
shameless
plug
for
last
open
developer
summit,
we're
actually
presented
on
the
dynamic
mediswipe
selection.
B
This
was
one
of
the
key
things
that
that
we
discovered
and
a
big
improvement
on
performance
when
you
have
these
types
of
pools,
highly
fragmented,
where
you're
seeing
lots
of
loading
and
unloading
of
meta
slab
regions
and
when
you're,
actually
running
low
on
space
and
I'll,
show
you
some
charts
on
where
we.
Actually,
you
know
some
of
the
performance
gains
that
we
got
from
that
those
slides
by
the
way
are
available
on
open,
ZFS
org,
as
well
as
the
videos.
B
That's
been
around
for
some
time
how
many
people
are
aware
of
ZFS
mg,
no
Alec
threshold,
a
couple
people
I'll
talk
a
little
bit
about
this
and
why
it's
just
a
partial
solution,
and
then
we
have
another
problem
that
actually
doesn't
have
a
solution
today,
and
that
is,
we
just
aren't
very
efficient
when
it
comes
to
writing
and
getting
full
bandwidth
when
you
have
configurations
such
as
what
we
just
saw
with
regards
to
ZFS
mg,
no
Alec
threshold.
This
is
a
very
coarse
brain
switch.
B
The
idea
behind
it
is
that
when
you
have
these
these
pools
with
big
disparities
of
free
space,
you
can
actually
set
this
and
say
once
a
device
gets
to
a
certain
capacity
level.
So
so,
when
it
has,
you
know
less
than
say,
ten
percent
free
stop
allocating
from
it.
The
idea
is:
switch
everything
over
to
devices
that
have
more
free
space
that
way
I,
don't
have
to
pay
the
penalty
of
loading
and
unloading,
Metis
labs.
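Roughly, the decision that coarse-grained switch encodes looks like the following sketch. This is a simplified illustration, not the real metaslab.c logic; the ten percent value is just the example used above, and mg_allocatable and its arguments are hypothetical names.

    /* Simplified sketch of the zfs_mg_noalloc_threshold idea: skip a
     * metaslab group (top-level vdev) for normal allocations once its
     * free space drops below the threshold, as long as some other group
     * in the pool still has more room. Not the actual OpenZFS code. */
    static int zfs_mg_noalloc_threshold = 10;   /* percent free */

    static int
    mg_allocatable(int pct_free, int best_pct_free_in_pool)
    {
        if (pct_free >= zfs_mg_noalloc_threshold)
            return (1);
        /* Below the threshold: only allocate here if nothing better exists. */
        return (best_pct_free_in_pool < zfs_mg_noalloc_threshold);
    }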
First, let's talk about the existing solutions that I just mentioned. When we set out to improve write performance, we were looking at making sure that write performance, as you approach the eighty percent cliff that everybody is so fond of, stayed relatively even. We knew that as we started going beyond eighty percent, that performance was going to be tough to achieve. What I'm showing here is random IOPS using a benchmark we call frag.
B
The
baseline
is
the
blue
line
and
we
can
see
this
very
linear
cliff
that
happens
starting
at
about
sixty
percent
and
actually,
if
you
were
to
graph,
this
out
is
even
starts
before
sixty
percent,
but
we
wanted
to
improve
on
this
and
by
doing
some
smarter
selection
of
métis
lab
regions,
we
actually
were
able
to
flatten
this
out
to
the
eighty
percent
mark.
So
you
can
see
here.
This
is
the
top
line
is
actually
showing
us
what
we
were
able
to
achieve
using
the
new
algorithms.
B
To
the
baseline,
until
you
get
up
to
the
ninety-five
percent
mark
and
there's
a
direct
correlation
between
fragmentation
and
performance
once
we're
completely
fragmented,
we
can't
get
any
I
ops
really
out
out
of
the
system
at
all.
So
that's
what
we
have
today
and
that's
actually
maybe
up
streamed.
So
we
think
that's
up
streamed
will
verify
definitely
available
on
the
dell
fixed
repo
that
Matt
referenced
earlier.
A few people. So when you actually create a pool — in this case, this is showing something like a three-wide striped pool — ZFS has always done round-robin allocations. It tries to select a starting point on one device, it'll go to that device, and from there it'll allocate a pre-selected amount of space. Once it reaches that threshold, it then says: okay, I can switch to the next device. What that means is that if I start off and allocate 512K from this device, I'll do that 512K allocation.
B
As
soon
as
that
I
reached
my
threshold,
I
can
now
go
to
the
next
one
and
so
forth
and
I
just
keep
round-robin
across.
That
means
that
we
can
keep
everything
evenly
distributed,
we're
always
doing
approximately
the
same
amount
and
there's
some
very
obscure
logic.
If
you
look
in
the
depths
of
like
the
Metis
lab
code,
where,
if
one
device
happens
to
be
a
little
bit
more,
you
know
has
a
little
bit
more
free
space
than
the
other
than
it
tries
to
give
it
some
fraction,
above
and
beyond
what
the
normal
limit
would
be.
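To make the round-robin scheme concrete, here is a minimal sketch of the rotor-style selection described above. The 512K per-vdev quantum comes from the talk; the structure and names (rotor, ALIQUOT, pick_vdev) are illustrative, not the OpenZFS metaslab allocator itself.

    #include <stdint.h>

    #define NVDEVS   3
    #define ALIQUOT  (512 * 1024)   /* bytes handed to a vdev before rotating */

    struct pool {
        int      rotor;              /* index of the vdev currently being filled */
        uint64_t written_this_turn;  /* bytes allocated from it so far */
    };

    /* Pick the vdev for the next allocation of 'size' bytes. */
    static int
    pick_vdev(struct pool *p, uint64_t size)
    {
        if (p->written_this_turn >= ALIQUOT) {
            /* This vdev has had its share; move on to the next one. */
            p->rotor = (p->rotor + 1) % NVDEVS;
            p->written_this_turn = 0;
        }
        p->written_this_turn += size;
        return (p->rotor);
    }

Every vdev gets roughly the same number of bytes per pass regardless of how full or how fast it is — which is exactly the behavior examined next.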
What does that look like when we actually run a system that way — in particular the system we just showed the zpool list for? I'm going to go ahead, and you're going to see this update. I want you to observe what happens to the pool, and then we're going to collectively try to figure out what is happening and make some sense out of it.
F: So it appears we're writing more to the more full ones.

B: Yeah, interesting, right? I mean, what does this mean? What is actually happening here?
We write the same amount of data, and it takes longer to write to these because they're fragmented. Exactly. If we go back: we write the same amount, we're going to allocate the same amount of data to all these devices, but we have no idea what the characteristics of the devices are when we're doing these allocations — other than we know that if we try to write to this device and it's mostly full, it's going to take us longer to do the allocation part, which is finding the free space.
B
But
it's
also
going
to
take
us
longer
to
actually
write
to
that
free
spay
write
those
blocks
once
we've
done
the
allocation,
because
it
might
write
a
little
bit
here
and
a
little
bit
down
here
and
we're
seeking
all
over
the
place.
So
we
pay
the
penalty
twice.
We
pay
the
penalty
of
trying
to
find
free
space.
We
pay
the
penalty
of
actually
trying
to
get
this
on
disk
to
stable
storage.
And
that's
that's
exactly
what
we
see
here.
B
When
we
start
running
everybody's
got
about
the
same
number
of
allocations,
then
all
of
a
sudden
some
devices
start
going
to
the
point
where
they're
not
busy
at
all
and
the
ones
that
remain
are
the
ones
that
tend
to
either
be
most
full
are
most
fragmented.
So
we
end
up
with
these
devices
that
are
sitting
there
and
we're
not
utilizing
the
bandwidth
of
them
during
the
entire
cycle
of
our
rights
and
those
happen
to
be
the
fastest
devices.
B
So
if
you
were
to
kind
of
observe
this
from
the
perspective
of
a
fast
device
versus
a
slow
device
over
the
course
of
say
an
entire
sinking
of
a
transaction
group,
this
is
kind
of
what
you
would
observe.
You'd
see
that
the
slow
device
kind
of
takes
a
while
it
gets
a
bunch
of
outstanding
iOS
and
over
the
entire
course,
it's
still
working
on
it.
I: Yeah, there's kind of common knowledge that if you add parallel vdevs, you add parallelism. But here we have something that behaves more like a RAID-Z, where the slowest one has to finish for the transaction to finish. I know it's not RAID-Z, but I mean...
B: Yeah, we would like it to give us the bandwidth, and we would like to believe that, yes, adding devices gives us more bandwidth. And in practice, if you add them when the pool is relatively empty and your devices have about the same amount of allocated space, you will see that performance gain.
B
The
problem
is
that
most
people
don't
look
at
adding
space
to
their
pool
until
they're
out
of
space
right
I
mean
you
know,
we
try
to
get
our
customers
to
think
that
think
about
like
not
what
you
have
today,
but
what
you're
going
to
have?
You
know
three
years
from
now
and
I'm,
not
very
good
at
it,
and
I
can
tell
you,
our
customers
aren't
very
good
at
it
and
if
you
know
for
those
of
you,
you
know
that
are
in
the
storage
industry.
B
What
you'll
find
is
that
customers
tend
to
like
put
a
little
bit
on
there,
try
it
out
and
then,
if
they
really
like
it,
then
they
throw
a
bunch
more
crap
on
it
more
than
they
probably
anticipated.
They
might
have
told
you.
Yes,
I
only
need
five
terabytes
sure
give
me
five
terabytes.
That's
all
I'll
ever
need
until
they
like
your
product,
then
the
next
thing
you'll
realize
is
that
they
have
30
terabytes
on
there
and
you're
like
okay.
B
That's
not
what
we
planned
for
and
I
now
have
five
terabytes
that
are
of
devices
that
are
totally
full
and
twenty
five
terabytes
that
are
completely
empty,
but
every
single
time
you
try
to
write
I'm
having
to
do
even
allocations
across
that.
That's
exactly
what
we
have
in
this
pool.
This
is
an
internal
system
for
ours,
which
happens
to
be
a
way
that
we
deploy
a
bunch
of
development
boxes.
A great question: growing an individual device is actually an improvement, to a degree. Now, every single time you grow one of these devices, we actually create more of those 200 metaslabs. So you went from 200, maybe you go to 300; do this enough times and eventually you may find that you have thousands of regions that are all equally sized.
B
So
when
you
get
to
a
point
where
you're
running
low
on
space,
now
you
have
thousands
of
métis
labs,
you're
going
to
be
looking
at
trying
to
load
and
unload
to
see
if
you
actually
have
every
space,
so
you
get
kind
of
some
initial.
You
know
performance
gains,
but
you
may
find
that
in
the
long
run
you
may
be
suffering,
just
as
you
would
with
this
implementation,
and
and
to
be
honest,
we
actually
before
we
had
solutions
here.
B
That
was
our
recommendation
for
customers
is
grow
your
lungs
rather
than
expand,
because
at
least
when
we're
doing
even
allocations,
as
we
saw
the
way
the
of
them
worked.
If
we
do
even
allocations-
and
you
expand
all
your
lungs
evenly,
then
you're
fine,
everybody
just
got
more
free
space,
but
there's
limitations
on
how
far
you
can
actually
expand,
which
led
people
to
to
add.
Yes,.
...one terabyte, to bring them to the exact same size — okay, yeah. So that's another possible solution, because then, as soon as the resilver completes and I remove the old, smaller device, I get that space. It's kind of going back to the expansion logic once again; you're just doing it with a physical device.
But it's possible that might give you a little bit of gain. Now you could round-robin through much faster, but you're still having to pay the penalty of looking for whatever that small amount is, and the devices that do have free space are now only getting a trickle, because it's kind of a global policy. So you may find that performance actually dives down, simply because we're now spending more time just cycling through all the devices looking.
F
All
these
are
like
great
ideas
of
little
tweaks,
but
they
don't
really
address
the
the
key
problem
that
you're
talking
about,
which
is
like
some
disks,
are
faster
than
others
and
we're
allocating
the
same
number
to
each
of
them.
So
we
have
to
wait
for
this,
though
it's
disk
to
do
it's.
You
know
we
allocate
one
fifteenth
of
the
data
to
each
of
these
disks,
so
we
have
to
wait
for
the
slowest
one
to
do.
It's
one
fifteenth
of
the
amount
of
work.
B: Yeah. It's actually data that's not in the ARC, but it's got its own kind of cache. Still, you're able to look through it, because it's just an AVL tree, so walking through it trying to find regions of a particular size is actually pretty fast. Where it takes a hit is that every single time you load a metaslab today, if you don't allocate from it you're going to unload it, which means the next time you come back around to it...
B
B
Yes,
so
there's
now
a
way
for
you
to
keep
some
of
that
data
for
much
much
longer,
because
anticipating
that
you're
going
to
ask
a
very
similar
question
and
also
the
way
we
select
those
meta
slabs
is
very
different,
based
on
the
changes
that
the
the
graph
that
I
showed.
That
has
changed
the
way
that
the
algorithm
works
today,
but
for
this
that
doesn't
help
us.
So
I
heard
several
things
wit.
So maybe a way to store how fast devices actually respond, and use that as a way to select devices. All great ideas. We'll talk about how we addressed this, and you'll see that what you're hitting on can actually be solved with a slight variation of that.
B
So
one
of
the
goals
is,
we
really
think
that
we
should
allocate
less
from
these
devices
because
they're
mostly
full
right
and
if
we
allocate
less
from
them,
then
we're
not
spending
and
using
you
know,
cycles
doing
that.
And
so
then
that
means
that
we're
going
to
allocate
more
from
these
devices
to
have
more
free
space.
B
But
really
what
we
care
about
is
how
do
we
ensure
that
we
utilize
all
the
available
bandwidth,
because
we
don't
really
care
as
long
as
these
guys
are
busy,
then
we
know
that
we're
taking
full
advantage
of
whatever
hardware
and
devices
you've
given
us
and
as
long
as
these
guys
stay
busy
and
we
keep
them
both
busy
for
the
entire
duration
of
the
transaction.
Sync
time,
then,
we
can
solve
this
problem
and
we
can
actually
solve
it
by
kind
of
measuring
how
you
know
how
long
it
takes
for
these
devices
do
allocations.
B
So
that
leads
us
to
what
we're
introducing
now,
which
is
the
allocation
throttle,
so
the
way
that
the
allocation
throttle
works
is
and
again
I'll
talk
a
little
bit
about
how
ZFS
does
things
today
and
how
this
differs.
So
today,
when
you
go
through
and
do
all
your
allocations,
you
end
up
getting
this
onslaught
of
iOS
that
get
created.
So
every
time
you
sync
out
a
transaction
group,
we
create
thousands
of
iOS
and
they
just
get
thrown
into
the
system
and
they
get
handled
by
a
bunch
of
task
queues.
B
These
task
queues
will
actually
create
like
this
big
fan
out.
So
we
start
off
with
an
ordered
type
of
of
right.
Where
we're
writing,
you
know
from
a
particular
file,
the
first
block,
the
next
block,
so
forth
kind
of
writing
it
out,
but
because
we
handled
it
we
hand
this
these
out
of
task
queues.
We
end
up
actually
kind
of
mixing
them
up
in
order
and
the
the
whole
reason
that
we
do.
That
is.
We
want
to
make
sure
that
they
go
through
the
compression
cycle
as
fast
as
possible
with
much
parallelism.
People familiar with top-level vdevs? The concept of a top-level vdev is: if you do a zpool status, it's the first device you see under root. Typically, in a pool, you'll see root, and then you may see mirror, you may see raidz, you may see a disk — but that's your top-level device, and that's where we make allocation decisions: based off top-level vdevs. So in this scenario we have four top-level vdevs.
B
So
when
we
start
off,
we
may
start
off
with
a
certain
amount
of
work,
because
we
only
have
a
limited
number
of
slots.
Every
device
will
be
given
the
same
amount
of
work
as
a
starting
point.
That
kind
of
keeps
us
in
a
point
where
we
get
all
the
devices
busy
right
off
the
bat
each
device
now
has
an
allocation
Q.
You
can
think
of
this
as
like
the
queue
depth.
If
you're
familiar
with
the
scuzzy
world,
they
each
maintain
one
of
these.
B
Those
allocation
devices
get
turned
into
children
iOS,
which
are
actually
going
to
go
out
to
the
physical
disks.
So
if
this
is
a
raid
Z,
there
may
be
multiple
child
owes
that
are
actually
writing
to
the
physical
disks
behind
it.
In
this
case,
this
is
depicting
a
mirror
where
we
have
one
top-level
io
gets
turned
into
two
child
iOS
they're,
going
to
do
the
work
on
behalf
of.
If we had the telemetry, we could just make our decision from that, but instead we're going to rely simply on the completion of the I/Os. As the devices complete, that tells us which ones are actually faster, and we simply give them more work. So if everybody started off with 50 units of work, and these devices can only handle 50 units of work for the entire txg, the rest of the work comes over here: as soon as they complete, they simply get another allocation.
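A minimal sketch of that completion-driven feedback loop might look like the following. It assumes a fixed number of starting slots per top-level vdev and simply hands the next allocation to whichever vdev has a free slot; the structure names and the 50-unit starting point are taken from the example above, not from the actual OpenZFS implementation.

    #define NVDEVS         4
    #define STARTING_SLOTS 50   /* equal work handed to every vdev at the start */

    struct vdev_queue {
        int outstanding;        /* allocation writes currently in flight */
        int max_slots;          /* how many we allow at once */
    };

    static struct vdev_queue vq[NVDEVS];

    static void
    throttle_init(void)
    {
        for (int v = 0; v < NVDEVS; v++) {
            vq[v].outstanding = 0;
            vq[v].max_slots = STARTING_SLOTS;
        }
    }

    /* Called when a vdev finishes one of its allocation writes. The freed
     * slot is what lets a fast device keep pulling new work while a slow,
     * fragmented device still sits on its original queue. */
    static void
    allocation_done(int vdev)
    {
        vq[vdev].outstanding--;
    }

    /* Hand the next allocation to any vdev with a free slot, or return -1
     * if every queue is full and the caller must wait for a completion.
     * Faster devices free slots sooner, so they naturally get more work. */
    static int
    next_vdev(void)
    {
        for (int v = 0; v < NVDEVS; v++) {
            if (vq[v].outstanding < vq[v].max_slots) {
                vq[v].outstanding++;
                return (v);
            }
        }
        return (-1);
    }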
F: In the common scenario, this is actually going to improve that imbalance of reads too, because if you look at what we were doing before, a bunch of disks were more full and some were less full. Reads, if we're reading from somewhere random, are going to go more to those more-full disks and less to the less-full disks. But with this algorithm we're going to end up writing more to the less-full disks, because the less-full ones are faster — the writes aren't scattered.
B: I think one could — I don't know if you would do this, but one can envision leveraging this to have, say, these as flash devices and these as spinning disks, even if they all started completely empty. If these are faster, they may receive more allocations over a period of time and become full much faster, which means reads are going to be targeting those devices — but in fact they are the faster devices to begin with.
F: You're optimizing it for being able to write to all disks at the same time — we're keeping all the disks busy while writing. If reads basically have the same performance characteristics as writes, then reads would also keep all the disks busy at the same time, right? Like, if you wrote a file and it went sixty percent here, sixty percent here, forty percent and forty percent there, then that means we can write to these ones faster than these.
B: It used to be max pending, and it's based off of that. It's a percentage, so you can actually say: I want to start off with — if that's 10, I want a hundred, you know, a hundred per each of my top-levels. So in this case, four hundred slots that can go out, each top-level doing a hundred units of work at a given time. But that's tunable.
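To make the arithmetic explicit, using only the figures from the answer above: 100 slots per top-level vdev x 4 top-level vdevs = 400 allocations that can be outstanding across the pool at any one time.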
I: ...quickly, while the other drives will be much less full. So I understand the reasoning when there's an imbalance, but if that imbalance is intrinsic to the performance of the device, then you just do that: you completely fill the SSDs, and then all you're left with is free space on the slower devices, right?

B: Correct.
I: I don't claim you would do that, but what might happen is: you put in, I don't know, three-terabyte drives, and three years later you put in other drives, and now those drives have gotten faster. Now you have that intrinsic imbalance because of device performance — not because you have done something as crazy as mixing vdevs made of SSDs and physical drives, but because of a generation difference with different performance profiles. Yeah.
B: You're going to see a tapering off: as a device becomes more and more full, it's intrinsically going to slow down. Okay, so if you have devices that are extremely fast — and I would imagine in the case of actual spinning disks you're going to see maybe a marginal difference between generations...
B
But
let's
take
the
SSD
case.
What
the
expectation
is
SSDs
would
fill
up
and
could
fill
up
much
faster
you're
still
doing
allocations
to
the
spinning
disk,
so
you're
getting
allocations
starting
off,
and-
and
this
is
where,
having
that
tunable-
determining
how
many
allocations
I
want
to
give
off
like
for
the
entire
system
at
any
go
might
be
relevant
because
you
can
say
I
want
500
units
of
work
to
go
across
every
single
device
that
might
keep
things
not
as
it
won't
create
as
big
of
a
disparity.
B
So
you
may
still
be
growing
both
devices
at
a
relatively
even
rate.
But
let's
say
you,
you
know
you
say
something
like
a
hundred
and
you
allow
SSDs
to
start
filling
up
they're
going
to
reach
a
point
where
they're
not
going
to
be
performing
the
way
they
did
when
they
were
empty.
So
the
device
is
now
that
remain
for
right
performance,
they're,
going
to
start
creeping
up
and
you're,
going
to
start
seeing
a
switch
in
the
amount
of
allocations
going
from
one
area
to
another.
I.
So the chunk size itself isn't changing — yeah, the chunk size still remains 512K, and that algorithm is still the same. The only difference now is which device is actually going to be serving more I/Os. You're still chunking them up as they come across, in the same way; you're just doing the distribution slightly differently. Any more questions?
This is a comparison, and there's quite a bit of data here. The top devices are the slower devices, the bottom devices are the faster devices — that's what we saw from the previous zpool iostat. The left graphs are showing you average latencies for doing the complete allocation and write for that device.
B
The
right
graphs
are
showing
you
how
much
data
was
written
and
allocated
to
that
device.
So,
as
we
see
here,
we
look
at
these.
They
tend
to
be
averaging
about
80
milliseconds
to
do
an
allocation
and
a
right
as
a
result
they're
getting
somewhere
around
10
meg
over.
That
course
you
know
10
mega
second,
over
the
course
of
that
spa
sync,
the
faster
devices
which
are
averaging
about
maybe
15
milliseconds
are
doing
somewhere
around
25
megabytes
per
second
across
the
course
of
that
transaction
group.
It's actually quite significant for what we saw on our system. We've seen the performance benefit in two different ways. One is, we've actually been able to drive this pool — you saw it at about 71 percent — up to eighty-seven percent without people complaining, which is very rare for our engineers, because they're very quick to complain when performance problems get to that point.
B
But
I
don't
have
like
a
specific
number
but
I
think
that
we're
we're
at
least
like
twenty
percent
faster
in
most
like
over
a
time
period
of
spa
sink
and
in
some
cases
more
and
we
see
kind
of
a
variation
because
it
depends
on
like
which
device
actually
starts
the
allocation.
So
because
you
have
a
round-robin
type
of
scenario,
so
you
may
end
up
where
you
start.
You
start
allocating
from
say
the
emptier
devices
first,
which
means
that
by
the
time
you
get
to
the
slower
devices,
you've
already
processed
quite
a
bit.
F: You'd want to measure it in terms of how many IOPS you can sustain, and this is a production system, so we don't want to just throw an unlimited number of IOPS at it and see how many it can take before performance sucks. So what we have is how many IOPS it happens to get, but that depends on the load, and the load is very variable. And then, you know, there's also...
F
If
you
look
at
it
over
several
days,
the
amount
of
free
space
could
be
very
variable,
so
you
know
we
kind
of
see
like
oh
jeez,
it's
like
ninety
percent
poll
and
people
aren't
freaking
out.
So
that's
like
we've
never
seen
that
before
this
is
great,
but
we
don't
have
like
really
hard
numbers.
Unlike
you
know,
you
can
do
X
I
ops
at
y
%
full
with
vs.
without
this
change,
which
would
be
great
to
get
on
like
a
synthetic
system
and.
B: I think part of the problem — and the main reason we tackled this particular system — was just the fact that it already was an aged pool that had gone through several iterations. We had probably four different occasions where we added devices to that pool. Whereas if we just created a lab configuration where we fill one device up, I can't get the same fragmentation on that device, at least not in a way that feels like I'm reaching the actual problem at hand.
Yeah — so what's the best way to create fragmentation, to try to look at some of these systems? We have kind of a worst-case scenario that we look at, which is what we call the frag benchmark, and then there's something that Matt has put together which is also very — yeah, it's an even worse case, but it's based off customer data.
F: The general story is: basically, we just create a couple of big files with an 8K record size and do random writes to them, and then wait — initially performance is going to be good, then performance gets worse and worse and worse, and we just wait until performance doesn't get any worse; that's as bad as it gets. The variables here are things like how much of the pool you fill up...
...are you saying the pool is going to be ninety percent full versus fifty percent full — and then the block size, and then the distribution of compression ratios. Most of the tests that you showed were with a constant size and no compression, where it's just all 8K blocks, and then we've also done tests where we make each block compress by a different amount, so you have all these different physical block sizes that we're trying to allocate, which is really, really horrible for fragmentation.
B: As mentioned, it uses fio. So again, you figure out, say, you want your pool to be sixty percent full: you create a file that consumes sixty percent of the space, and then it runs fio over top of it using random writes. Then you monitor the throughput, look for a steady state, and that's when you know you've reached that threshold.
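For anyone who wants to reproduce that kind of aging without fio, a minimal sketch of the same idea — random, record-sized overwrites of one big pre-created file until throughput levels off — could look like this. The 8K record size matches the discussion above; the file path is just a placeholder, and in a real test the buffer should be filled with incompressible data.

    /* Illustrative fragmentation workload: random 8K overwrites of a large,
     * pre-created file on a dataset with recordsize=8k. Run it while
     * watching throughput; stop (Ctrl-C) once it reaches steady state. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define RECORDSIZE 8192

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/pool/fs/bigfile";
        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror("open"); return (1); }

        off_t filesize = lseek(fd, 0, SEEK_END);
        off_t nrecords = filesize / RECORDSIZE;
        char buf[RECORDSIZE];
        memset(buf, 'x', sizeof (buf));   /* placeholder payload */

        for (;;) {
            /* Pick a random record-aligned offset and overwrite it. */
            off_t rec = (off_t)(drand48() * nrecords);
            if (pwrite(fd, buf, RECORDSIZE, rec * RECORDSIZE) != RECORDSIZE) {
                perror("pwrite");
                break;
            }
        }
        close(fd);
        return (0);
    }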
B
Don't
go
away
just
yet,
there's
actually
more,
which
we
don't
I.
Don't
have
a
lot
of
slides
on
this,
but
I
know
Matt
mention
this,
but
we
did
want
to
kind
of
announce,
compress
dark
which
is
actually
functional
and
in
our
internal
repo,
and
that
has
been
completed
as
Matt
kind
of
mentioned
earlier,
compressed
dark,
effectively
mimics
what
the
compression
that
you're
using
on
disk.
So,
if
you're
using
gzip
nine
on
disk,
you
effectively
have
a
gzip
nine
compressed
version
of
the
block
in
memory.
B
Just
some
preliminary
tests
that
we
did
I
had
a
small
system.
20
gig
of
Arc
had
a
creative
35
gig
file
on
a
using
LZ
for
compressed
file
system,
I'm,
getting
about
2.6
4x
compression
ratio
for
that
file,
I'm,
actually
able
to
read
that
entire
file
into
the
ark
and
keep
it
completely
cached.
So
all
subsequent
reads
come
directly
from
the
ark,
even
though
it's
15
gig
larger
than
the
existing
arc.
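To spell out the arithmetic behind that example: at a 2.64x compression ratio, the 35 gig file occupies roughly 35 / 2.64 ≈ 13 gig once compressed, which fits comfortably inside the 20 gig ARC even though the uncompressed file is 15 gig larger than the cache.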
Turning it on and off: there's a big switch which turns the compressed ARC all the way on and off. If you want to try it on a per-dataset basis, then it's going to be based off the compression algorithm, and therefore the compression ratio, you use for that dataset. So if everything is uncompressed except one dataset, then only that dataset's blocks would be compressed in the ARC.
C: I can see that as a bit of a problem, potentially. What if you have a pool that has a mixture of some stuff that's compressible in real time — LZ4, very cheap to decompress — where obviously we want to keep that compressed in memory because it's cheap to decompress, but at the same time you also have some stuff in there that's gzip, or maybe some future archival algorithm we come up with, and for performance reasons you want to keep that uncompressed in memory but compressed on disk?
B
So
so
today
that
is
not
possible
in
depending
on
your
workload.
So
if
your
workload
is
one
where
you're
going
to
be
accessing
that
frequently,
then
for
frequent
accesses
things
will
stay
uncompressed,
but
if
it's
one
of
those
where
you're
accessing
it
once
and
you
want
that
initial
access
to
be
fast
and
then
you're
going
to
wait,
you
know
hours
before
you
access
it
again
or
days,
and
you
want
that
access
to
be
fast,
then
that
isn't
possible.
Today,
George.
Okay, yes — whatever form the block is in on disk is the form that it will take in memory. The advantage this gives us, as Matt also mentioned, is that compressed send now becomes simpler, because you already have the compressed block in memory. We actually have a design for compressed send/receive that allows us to send this compressed block from memory without decompressing it at all, and to write it in a compressed fashion all the way to disk.
I don't know the blog post, so I'm not sure if that's exactly it. I'm wondering if it was referring to the fact that when you had very large L2ARC devices, the amount of memory consumed to store the pointers into that L2ARC was actually pretty high — which has changed. So I don't know if that's what they were referencing.
F: That's happening on Linux, where we will be able to — rather than having different kmem caches for 128K blocks and 16K blocks and 8K blocks, and then having to shrink the ARC and change which ones are being used — compose a 128K block in memory from a bunch of 4K pages. Then, when you are shrinking the ARC, you're just freeing those pages and you aren't involving kmem at all. I have a prototype of it; it needs some more work.
C: I mean, don't quote me on this — this is going by what I remember off the top of my head — but the general consensus is that somewhere between 256 and 512 is the sweet spot. Above that you might run into some issues, but I've seen systems hurt even below that; it really depends on your workload.
I: If I have to make a purchase decision today for a big pool — you know, a petabyte pool — I want to get as much memory as I can, right? But I can't go with a terabyte of memory, because you're telling me that I'm going to hit this problem if I have too much RAM and therefore too much ARC, right?
B: I think your mileage will vary based on workload, so it might be that one terabyte for your case would be fine, but we've definitely seen issues as you go above and beyond 512 gig, at least on illumos — there are things that need to be addressed. I think our largest customers are running around 384; we may even have some at 512, but that seems to be about where there's still some work to be done.
That's going to vary for every system you build. The work that Matt alluded to, that he's got a prototype of, would be more ZFS-specific, and presumably it would carry over to FreeBSD. Primarily, the reason a lot of that work is being done is that kmem reap behaves very differently on the various platforms, and inevitably has problems at different points. We don't know what FreeBSD's might be; we're definitely happy to tell you all about the problems we've seen, though.
I: Another little one. A couple of years ago, at OpenZFS Day — 2013, maybe 2012, but I think it was '13 — I think it was you: you ended a talk with a very enigmatic remark (I was on the live stream that day), something like, on 4K disks you really shouldn't do RAID-Z. Is that still a concern? Is that...
F
Is
a
valid
thing
depending
on
the
workload
so
do
it?
Can
you
bring
it
by
blog
post?
I
wrote
a
blog
post,
which
is
directly
addresses
this
issue
that
the
the
issue
is
essentially
like
if
you're
using
4k
disks
with
raid
z,
you're
using
small
record
size
like
4k
right,
hey,
record
size,
you're,
probably
shooting
yourself
in
the
foot
like
if
you're
using
for
kak
record
size,
don't
use
raids,
irregardless
of
its
4k
or
not,
but.
Even if you have a million small files, it's usually going to be those hundred big files that are actually consuming most of the space, and those are using the 128K block size. I mean, for most files, 128K is tiny nowadays, right? Even a picture file is, like, ten 128K blocks. So, no.
If you're doing streaming and you're serving up a lot of streams at the same time, it kind of looks like random access, but you can afford to cache quite a bit of it. So rather than random access of 128K, you're doing random access of one meg or more, and you're able to get many more megabytes per second by doing it in one-meg chunks than in 128K chunks.
F
Potential
benefit.
The
quality
would
only
be
that
those
one
meg
reads
and
writes:
can
kind
of
stopped
up
the
pipeline
right
like
if
you're
using
a
disk,
then
a
one
Meg
read
is
going
to
take
longer
than
128
k
read.
So
if
you
have
other
latency
sensitive
operations,
then
the
latency
is
going
to
go
up
so.
C: For instance, say you have a workload, a dataset, which you've not previously tuned to a specific record size, and your workload seems to handle, say, 8K. If you do a smaller write to it — not a full-block write, some sort of workload that does partial writes to the blocks — what we do in memory is a read-modify-write: we read the block off the disk, modify it, and then write a new copy.
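As a rough illustration of that read-modify-write path for a sub-record write (a sketch only — the record size, function names, and read/write callbacks are hypothetical, not the actual ZFS DMU code):

    #include <stdint.h>
    #include <string.h>

    #define RECORDSIZE 8192

    /* Illustrative read-modify-write for a partial-record update: fetch
     * the whole record, splice in the new bytes, write the whole record
     * back out as a new copy (copy-on-write). Assumes the write does not
     * cross a record boundary (offset % RECORDSIZE + len <= RECORDSIZE). */
    static void
    partial_write(uint64_t offset, const void *data, size_t len,
        void (*read_record)(uint64_t recno, void *buf),
        void (*write_record)(uint64_t recno, const void *buf))
    {
        uint8_t  record[RECORDSIZE];
        uint64_t recno = offset / RECORDSIZE;
        size_t   off_in_rec = offset % RECORDSIZE;

        read_record(recno, record);              /* read: old block from disk */
        memcpy(record + off_in_rec, data, len);  /* modify: apply the sub-record write */
        write_record(recno, record);             /* write: allocate and write a new copy */
    }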