From YouTube: Metaslab Allocation Performance by Paul Dagnelie
Description
From the 2019 OpenZFS Developer Summit
slides: https://drive.google.com/open?id=1he9APxNsQutYBzJHCjkbveUCmHPlbDhK
So our next presenter is Paul Dagnelie, who is going to present about metaslab allocation performance. Paul has been making several contributions over the years; you might know him especially for ZFS send and receive and his work on redacted datasets. Over the last year Paul has focused more on performance work, so he will talk about some of his findings in this presentation. So please welcome Paul.
I've been with Delphix for about six and a half years now. To give a little context about Delphix: we use ZFS for database virtualization, and we work with a lot of high-performance systems. We have a lot of long-running systems with lots of filesystems, lots of clones, and lots of snapshots.
So we tend to run into a lot of situations where you encounter some of the extreme performance characteristics of ZFS, and hit the places where it starts to break down. So why am I here? What led to the work that we did, and to me giving this talk? Well, late last year we received a number of customer-originated complaints that got escalated to engineering.
So, synchronous writes versus asynchronous writes. Normal writes, the writes that you think of when you're writing to a filesystem, are asynchronous. The idea there is: you hand the kernel some data to write, a file to write it to, and an offset to write it at, and it gives you back control immediately, but it'll do the write at some later point. In ZFS we batch up a bunch of these writes into things called transaction groups, or TXGs, and asynchronous writes are mostly latency insensitive.
Other writes are called synchronous writes, and they are done a little bit differently. When you tell the kernel "write this to this place," you say "don't come back until you're done," and so the kernel will not return control to you until the write is actually persisted on disk somewhere. The advantage of this is that if there's a power failure, or your program crashes, or your system crashes, you are guaranteed that if that function returned, your write is persisted on disk.
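As a rough illustration of the difference from the application's point of view (a minimal sketch, not from the talk; the function names are illustrative), an asynchronous write just hands the data to the kernel, while a synchronous write also waits for it to reach stable storage, which on ZFS is exactly the path that goes through the intent log described below:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Asynchronous write: returns as soon as the kernel has the data; ZFS
     * persists it later as part of a transaction group (TXG). */
    ssize_t
    async_write(int fd, const void *buf, size_t len, off_t off)
    {
        return (pwrite(fd, buf, len, off));
    }

    /* Synchronous write: does not report success until the data is persisted,
     * which on ZFS means the ZIL has committed it to stable storage. */
    ssize_t
    sync_write(int fd, const void *buf, size_t len, off_t off)
    {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n >= 0 && fsync(fd) != 0)
            return (-1);
        return (n);
    }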
So it's good for reliability purposes, and because of this, the filesystem wants to issue these writes as fast as it possibly can, because they're extremely latency sensitive: every millisecond you spend waiting for the write to be issued is a millisecond the program isn't queueing up another write, so your latency and your IOPS are directly tied to each other. ZFS has a special system that it uses to make synchronous writes more efficient, called the ZFS intent log, or the ZIL.
The ZIL is a system that pre-allocates, or specially allocates, a bunch of blocks that your writes are immediately placed into whenever you issue a synchronous write. It bypasses parts of the normal allocation path and uses other parts of it; I'll talk a little bit more about that later. Another important concept to understand is the idea of a separate intent log device, or slog device.
A slog device is a separate disk that you add to your pool and designate specially, and the ZIL will try to do its allocations from that device. This is something you see a lot of confusion and misnomers about in literature online: the difference between the ZIL and the slog. The slog is a kind of device; the ZIL is the ZFS subsystem and the blocks that it manages. So hopefully, when you're reading or writing about those in the future, you can understand the terminology a little more clearly.
So now let's talk a little bit about allocation in ZFS: how it works and what its principles are. An important concept in allocation, in filesystems as in life, is that everything has to go somewhere. All your files have to end up somewhere on disk. All your metadata ends up somewhere on disk: all your indirect blocks and filesystem metadata.
B
Everything
has
to
have
a
spot
on
disk
and
ZFS,
as
probably
most
of
you
know,
is
what's
called
a
copy-on-write
filesystem,
but
for
those
of
you
who
are
not
super
familiar
with
the
concept,
most
file
systems
are
what's
sort
of
referred
to
as
an
update
in
place
file
system.
When
you
modify
a
file,
the
writes
happen
to
the
same
place.
The
data
was
already
living.
Zfs
is
different.
Every
time
you
update
some
data,
it
writes
it
to
a
new
location.
One of the advantages of copy-on-write is that we can get transparent compression. Because you're not always writing to the same spot, you can write the new data at a different size, which means that if the data compresses better you can use a smaller block, or if it doesn't compress as well you can use a bigger block, even if your data isn't all exactly the same compressibility.
compressibility,
so
that
makes
allocation
a
little
trickier
and
ZFS
than
it
would
be
in
other
file
systems
beyond
the
fact
that
it's
a
copy-on-write
file
system
and
an
important
concept
to
understand
related
to
that
is
the
difference
between
a
record
size
and
a
block
size.
Your
file
is
broken
up
into,
has
a
record
size
associated
with,
and
by
default
it's
128
kilobytes
and
the
idea
there
is
just.
This
is
sort
of
the
the
logical
chunk
of
data
for
a
file.
B
Your
big
files
are
like
it's
128,
kilobytes,
yunk
and
then
another
and
then
another
and
another.
The
block
size
is
the
actual
physical
size
of
that
data
on
disk.
So
if
you
have
compression
enabled
or
various
other
features
enabled,
the
actual
block
can
be
much
smaller
than
that.
You
know
your
hundred,
twenty-eight
block
could
be
honored,
kilobytes
or
60
kilobytes
or
three
kilobytes,
depending
on
how
well
it
compresses.
So what is a metaslab? Many of you, if you work in ZFS, have probably heard this term before, but a metaslab really is just a name for a region of a disk. The name sort of implies something about slab allocators; that was historically true, but it is not the case anymore. A metaslab is really just a part of the disk and the data structures associated with it. Metaslabs are about 16 gigabytes in size, and there are usually about 200 of them on a disk, but that varies with the size of the disk.
A metaslab's real job is to track the allocated and free space within that disk region, and this ties into a concept that's really critical to metaslabs, which is the distinction between a loaded metaslab and an unloaded metaslab. When you boot the system, we know how many metaslabs there are and how many disks there are, but we don't know which spots inside of each metaslab you can actually use for allocations, that is, which spots on disk are free.
That information is stored on disk in a data structure called the space map, which I'm not going to go into too much detail about today; I'm just mentioning it for context. And so when it comes time to actually do an allocation, we have to do what's called loading the metaslab. Each metaslab also has a weight, and the idea there is
basically an attempt to distill the overall quality of the metaslab for allocations into a single number, so we can use it to quickly compare which metaslab is the best one, and we always have that weight tracked, whether a metaslab is loaded or not. One thing that, until recently, we only had while a metaslab was loaded, and lost when it was unloaded, was the size of the largest free segment.
So if you have a one-megabyte free segment in your metaslab, we would keep that information around, or if you had a 64-kilobyte free segment, we'd keep that around, and if you then tried to do a 65-kilobyte allocation, we would immediately know that the metaslab wasn't suitable and we wouldn't go looking through its trees for a place to put your data.
There are many other space trees beyond just the allocatable space tree. We're not going to talk about those too much, but they do exist, and I can talk about them if people have questions.
So when it comes time to load a metaslab into memory, the data structure we load it into is called a range tree. For the purposes of this talk, a range tree is used as an in-memory representation of the free space; it can represent any kind of space, but we're going to talk about free-space range trees here. They're built on top of (until recently) binary search trees, and there are actually two trees as part of every range tree: the offset-sorted tree and the size-sorted tree. So I'm going to use this example metaslab to help you understand what's going on.
Consider this example metaslab, with spots 0 through 11. The colors are a bit washed out on the projector, but there are three regions here where you can allocate data, three free spaces: spot 0, spot 3, and spots 5 through 7. All of those are places you could put data if you needed to do an allocation. The offset-sorted tree is the one on the left; as you can see, it's just sorted in the order of where these segments are physically laid out on the metaslab, and this tree is useful for coalescing adjacent regions. So, for example, if I came along and freed spot 4 (I had some data written there and I freed it), you could use the offset-sorted tree to determine that spot 3, the newly freed spot 4, and spots 5 through 7 can all combine into one region, and so you could quickly use that tree to do that coalescing operation.
The size-sorted tree, on the other hand, is useful for finding a place to put your data, which is kind of what this is all about. If you say, "okay, I have a two-wide piece of data that I need to allocate," you'd go into the tree and see that spots 5 to 7 are three wide, and you can use two of those to do your allocation. So it's a very efficient way to find places to allocate data, and also to merge things together when you're doing frees.
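To make the two trees concrete, here is a rough sketch of the idea (simplified, illustrative C; the field and type names are not the actual OpenZFS definitions, which live in range_tree.h and carry more state):

    #include <stdint.h>

    /* One free region on the metaslab: [rs_start, rs_end). */
    typedef struct range_seg {
        uint64_t rs_start;
        uint64_t rs_end;
    } range_seg_t;

    /* Opaque ordered-tree handle; in ZFS this was an AVL tree and is now a B-tree. */
    typedef struct ordered_tree ordered_tree_t;

    /*
     * A range tree indexes every free segment two ways:
     *  - by offset: when a region is freed, look up its neighbors and merge
     *    adjacent segments into one larger segment;
     *  - by size:   when allocating, find a free segment at least as large
     *    as the requested size.
     */
    typedef struct range_tree {
        ordered_tree_t *rt_by_offset;   /* sorted by rs_start */
        ordered_tree_t *rt_by_size;     /* sorted by (length, start) */
    } range_tree_t;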
So now I'm going to talk a little bit about the actual allocation process, exactly what steps we go through. For each allocation, the first thing you do is pick a disk. There are a lot of different mechanisms that go into picking disks, but none of them are super relevant to this talk.
There are thresholds for how much I/O you already have outstanding to one disk, and lots of things like that, but for our purposes you pick a disk; it's basically round-robin. Once you've selected a disk, you need to pick a good-seeming metaslab, and the primary factor that we use there is the weight that I talked about earlier: the metaslab with the highest weight is the one that we think is the best one, and so we try to use that one first.
If the metaslab with the highest weight isn't loaded, then you need to load it: take the space map off of the disk, read it into memory, and build the range tree out of it. Once you've picked a metaslab, and loaded it if necessary, you then need to find a place in that metaslab to do your allocation. There are a lot of different strategies that can be used to do this. The more naive ones are things like first fit, where you just go through the offset tree in order until you find a place your allocation fits and put it there, or best fit, where you search the size-sorted tree for something that's exactly the size of the allocation you want and put it there.
But if you don't find something close, ZFS will now actually go to the size-sorted tree and find a new place to start doing allocations; that was a change that was made somewhat recently. Once you've picked a spot, however you decided to do it, you claim that spot: you remove that space from the allocatable range tree, you modify the on-disk data structure, and you move on with your life.
If there's no spot in the metaslab for your allocation, then you need to go back to the top and pick a new metaslab. You go to the one with the next highest weight and you repeat the process, and you keep going until you find a spot. If you don't find a spot, if you've gone through all the metaslabs on your disk and none of them worked, you pick a new disk and repeat the process, and you keep going until you've gone through all the disks in the system.
If you've tried literally every metaslab on every disk and you can't find a place to put your allocation, you do what's called ganging. Ganging basically means you take your allocation, split it into smaller pieces, and allocate each of them separately. This is really a measure of last resort in ZFS, because it can cause fragmentation to absolutely skyrocket, so we really try to avoid it whenever possible.
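Putting those steps together, the overall flow looks roughly like this (a simplified sketch of the loop described above, not the actual OpenZFS code; every type and helper here, such as pick_next_vdev(), best_metaslab(), and find_free_space(), is a hypothetical stand-in, and the real logic in metaslab.c handles far more detail):

    int
    allocate_block(pool_t *pool, uint64_t size, uint64_t *offset_out)
    {
        for (int tried = 0; tried < pool->p_nvdevs; tried++) {
            vdev_t *vd = pick_next_vdev(pool);          /* roughly round-robin */

            /* Walk this vdev's metaslabs from highest weight downward. */
            for (metaslab_t *ms = best_metaslab(vd); ms != NULL;
                ms = next_best_metaslab(vd, ms)) {
                if (!ms->ms_loaded && metaslab_load(ms) != 0)
                    continue;           /* expensive: reads the space map */
                if (find_free_space(ms, size, offset_out) == 0) {
                    claim_space(ms, *offset_out, size);
                    return (0);         /* done */
                }
            }
        }
        /* Nothing fit anywhere: last resort, gang the allocation. */
        return (gang_allocate(pool, size, offset_out));
    }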
So there's a lot of stuff here, and I made that sound like a lot of different steps; they're all important and there's a lot going on. But in terms of performance, that list should really look something more like this: metaslab loading is far and away the most expensive of these operations. This diagram is not to scale.
Picking a good metaslab can take, you know, 100 microseconds; finding a place in the metaslab, another few hundred microseconds; loading a metaslab can take a second on really big systems, like an actual, real second. It is catastrophically slow in terms of performance. So really, if you're going through this loop more than once, you're having a very bad day; you do not want to be in that situation.
For synchronous writes, things work a little bit differently. Synchronous writes go directly into the ZIL rather than the normal allocation path, and the ZIL uses these log blocks, 128 kilobytes by default. If you come in with, say, an 8-kilobyte write, it'll just drop your data into that log block directly and move on. It still has to do allocations to allocate these log blocks, but it'll try to use the slog devices if you have them available, and those slog devices aren't really used for other allocations, so there's space there that you can use to do these 128-kilobyte allocations.
The actual allocation process for an intent log block is the same as the process for an async block: you go through that same loop, but it'll try to use log devices if it can. Then later on, after your data has been synced to disk during subsequent TXGs, ZFS will actually migrate your data from the ZIL block to a normal resting place on disk. This is why your slog device doesn't just fill up with synchronous writes: the data all gets migrated to the body of your pool; the ZIL blocks are not intended as a permanent resting place. So that's my introduction to how allocation works in ZFS and what the data structures are, and now I'm going to talk about what happened to us.
If you look at this flame graph, which I understand is a little bit hard to read, the key takeaway is that this is a heavy synchronous write workload and we're spending a lot of time in the allocation code. If you add it up, we're spending literally 50 percent of our time in the allocation code, which is not really the place you want to be; you want to be spending your time writing things to disk and compressing things and checksumming things.
We're spending so much time allocating that we're losing significant performance. This is the sort of flame graph we were seeing on our customer systems, and when we thought about it and worked out what the problems were, we realized that we were in the perfect place to run into this problem, because we have relatively small record sizes.
We use an 8-kilobyte record size instead of the default 128 kilobytes, and this is common if you're storing things like VMDKs, or using zvols, or storing database files, because all of these are trying to have small atomic units of data; a VMDK is pretending to be a disk, where sectors are 512 bytes or 4096 bytes. And because we have this small record size and we have compression enabled, you have lots of different small-sized blocks, and you're allocating and freeing them very rapidly.
So we're getting more and more CPU bottlenecked as time goes on. These issues all combined together really started hitting us pretty hard; it was a rough time to be us. So we started looking at the specific problems we were seeing. The first one was that for some allocations we were calling metaslab_load a dozen times: for a single allocation we were loading many metaslabs and, as I said, that can take a long time, so we were waiting something like 10 seconds to do a single allocation.
That's not good. You don't want seconds per I/O, you want I/Os per second; you want the numerator and the denominator in the right places. We were wasting a huge amount of CPU to do this, we were loading all these metaslabs, which takes a lot of memory (I'll talk about that a little later), and we were spending a lot of I/Os to read these space maps from disk. So why were we loading all these metaslabs that we can't use? It's a huge waste of our time. Well, when we have a metaslab loaded, we have the weight that I talked about, that attempt to distill the quality down into a single integer for comparison purposes.
Previously this was based on just how much space was in the metaslab, weighted a little bit based on the fragmentation, but now the actual weight is based on the largest segment size you have available to allocate. If you want a bunch of details about this, you can see Matt's talk at BSDCan in 2016. The way it works now is that you take the largest bucket of free segments.
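As a rough illustration of that idea (a conceptual sketch only; the real segment-based weight in metaslab.c also encodes a count of segments in the bucket and several flag bits):

    #include <stdint.h>

    /*
     * Conceptual sketch: derive a comparable "weight" from a histogram of free
     * segment sizes, where hist[i] counts free segments whose size falls in the
     * bucket [2^i, 2^(i+1)). The metaslab whose largest non-empty bucket is
     * biggest looks best for large allocations.
     */
    static uint64_t
    segment_weight(const uint64_t hist[64])
    {
        for (int i = 63; i >= 0; i--) {
            if (hist[i] != 0)
                return ((uint64_t)i);   /* index of the largest non-empty bucket */
        }
        return (0);                     /* no free space at all */
    }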
So this is a pretty good system, right? If you have a 32-kilobyte allocation and you see this, you think, "oh great, I can definitely satisfy my I/O." If you have a hundred-kilobyte allocation, you think, "okay, if these free segments are evenly distributed throughout this bucket, I can probably do my I/O; it'll probably be fine." But what if it turns out that actually all thousand of those free segments are exactly 64 kilobytes? Then you will load this metaslab for your hundred-kilobyte allocation and find that you can't do anything with it. You loaded this thing, you thought you could use it, and you can't, so you've just wasted that load. And that was happening to us really, really regularly.
So we solved this problem. When a metaslab is loaded, we avoid picking bad loaded metaslabs with this cached maximum free segment size that I talked about earlier. We didn't used to keep that value around when the metaslab was unloaded, because you can still free to an unloaded metaslab: you can only allocate from a loaded metaslab, but you can free to an unloaded one. That's something that has been in ZFS for a long time as a performance boost. But the key insight we had, when we were thinking about it, was:
if we just keep that value around and use it as a rough estimate for whether or not the metaslab can satisfy an allocation, then when you come in with an allocation that's smaller than that value, we know we can satisfy it. We might be able to satisfy larger ones too, but we know we can satisfy at least that much.
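A minimal sketch of that check (illustrative names; in OpenZFS the cached value is the metaslab's ms_max_size, and, as described below, it is eventually aged out rather than trusted forever):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Does this metaslab look able to hold an allocation of 'asize' bytes,
     * without loading it? ms_max_size is the largest free segment size we
     * remembered (and now keep remembering after unload); frees that happen
     * while the metaslab is unloaded can only make the real value larger.
     */
    static bool
    metaslab_looks_usable(uint64_t ms_max_size, uint64_t asize)
    {
        return (asize <= ms_max_size);
    }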
There was a prototype project to do a more sophisticated approach, where you would calculate the expected value of the largest free segment size, and it would increase as more frees were done to the metaslab while it was unloaded. That project is still highly experimental, but the results were reasonably promising.
If you have questions about it, I can talk to you about it afterwards, or show you the results. But it turns out that this change we made, to just keep this value cached and stop trusting it after an hour, gives a huge I/O improvement: we got a 30 percent boost on illumos, where this change was initially developed, on a heavy read/write mix on a system where you couldn't actually keep all the metaslabs loaded in memory.
So great, we solved one of the problems; it made things a little better. But even with all of our loads actually satisfying at least one allocation, we were still loading a huge number of metaslabs and still spending a huge amount of our time on it, and the problem really was that fragmentation was just really bad. If your workload requires a certain number of ZIL blocks per second, and your best metaslab only has 50 of these 128-kilobyte regions, and the next only has 30,
and you have this nice curve downwards, then even if you were always picking the best metaslab you're still going to need to load multiple metaslabs per second, and when they take a second to load, that's going to put a pretty heavy damper on your workload. We really just didn't have enough buffers in each metaslab, so we needed either to load more metaslabs or to just keep more metaslabs loaded at all times.
Keeping more metaslabs loaded is sort of the simplest approach to this problem, and it turns out it works pretty well, but there's a very real cost, which is that when you have this range tree loaded in memory (I showed you the range tree structure before), each free segment, each range seg as I'll call them, costs 72 bytes of memory. We had some customers where their average free segment size was about three to three and a half kilobytes, and they had 100 terabytes of storage.
And so, if you try to load all of those range segments into memory for a hundred-terabyte pool, it takes 2.3 terabytes of RAM. That is not a good amount of RAM; it's a very bad amount of RAM. You have other things you want to use your RAM for, like user programs and the ARC, and we wouldn't have any left for that.
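The back-of-the-envelope arithmetic behind that figure, using the numbers from the talk (72 bytes per range segment, free segments averaging roughly 3 to 3.5 kilobytes):

    100 TB / ~3.5 KB per free segment   ~= roughly 30 to 35 billion range segments
    ~32 billion segments x 72 bytes     ~= about 2.3 TB of RAM for the range trees alone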
Customers would load so many metaslabs that their system would run out of memory, and then they couldn't unload any metaslabs, because until this point you could only unload during a TXG sync, and the system just hung because it couldn't find a place to do an allocation. The fix for that was slightly different from the stuff we're talking about today, but it illustrates how bad things were getting when trying to load all these metaslabs. So we needed to do something else.
We needed to fix these range trees; they were using too much memory. Until recently, the range trees were built on AVL trees, which are a very nice balanced binary search tree data structure. They're super easy to use and they're pretty performant, but the way they're implemented, they have this node structure that you embed in your data structure, which adds about 24 bytes to it, and since we had two trees, we had two of those for each range segment. So 66 percent of the memory in each range segment was devoted to these AVL nodes. That's a lot of memory; that's a lot of overhead.
In addition to that, every range segment is allocated separately, and to go back to that example (100 terabytes, three to three-and-a-half-kilobyte segments), that is around 35 billion segments. That's a lot of segments; regardless of anything else you have to do, think how much time you're spending in malloc allocating 35 billion segments. This is a flame graph that we had from a test system.
We were trying to load a very large pool; we were just loading all of its metaslabs into memory, and this took, I think, something like 30 or 40 seconds. We had customers where it would take 20 minutes to load all of their metaslabs into memory, between malloc and all the AVL tree operations you had to do. It was not good. So we came up with two fixes.
One of them is that we switched from an AVL tree to a B-tree, and I'll talk next about exactly what that means and how it works; and then we also made some other changes to the range segments to make them more efficient. So first, let's talk about B-trees. AVL trees are a binary search tree: they split twice at every level. B-trees are an n-ary tree: they split n times at every level, and it's actually a variable n.
The data is stored entirely in tree-controlled buffers: rather than having these little range segments that you allocate one by one, you're allocating these four-kilobyte tree nodes and storing the data in an array inside them directly. You can see the basic structure of a B-tree here: you have the root at the top.
The root is an array of elements, which are the squares, and it has a bunch of child pointers, and each element acts as a separator between two child pointers, just like it would in an AVL tree, but scaled way up sideways. It's a very simple concept. The root has children, and the children have children, and at the bottom level you just have leaves; leaves don't have any child pointers.
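Roughly, a node looks like this (an illustrative sketch, not the actual zfs_btree implementation, which keeps variable-size elements in a flat byte buffer and tracks counts and parent pointers as well):

    #include <stdint.h>

    #define BTREE_FANOUT    128     /* illustrative fan-out */

    /*
     * Illustrative B-tree node: a sorted array of elements stored inline in the
     * node's buffer, plus (for core nodes) one more child pointer than elements.
     * Each element separates the two subtrees on either side of it; leaf nodes
     * are the same minus the child pointers.
     */
    typedef struct btree_node {
        struct btree_node *bn_children[BTREE_FANOUT + 1];  /* NULL in leaves */
        uint16_t           bn_nelems;                       /* elements in use */
        uint64_t           bn_elems[BTREE_FANOUT];          /* sorted payload */
    } btree_node_t;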
One thing you could do with the AVL tree, if you were, say, coalescing with an existing segment, was just find that one memory allocation and modify the memory, without having to remove it and reinsert it in the tree, because it wasn't changing its sort order. You can still do that with B-trees: you can still get a pointer to the actual buffer and modify that memory directly.
So we made those changes; we switched away from the AVL tree to the B-tree, and that was a good start. Removing the AVL nodes from the range segment, which was 72 bytes to start with, cut out 48 bytes, so 66 percent of the memory is gone. Great start, way to go. The next thing we thought about was this rs_fill entry, which was added to enable sorted scrubs and resilvers.
So it's super useful for those purposes, but you don't need it for the metaslab free trees, right? They don't use this field; it's always zero. So we realized that if we cut that out, we could save another 8 bytes of memory per segment, and the way we did that was to teach the range trees how to deal with multiple different kinds of range segments.
There are range segments that do have this fill entry and range segments that don't, and the range tree uses the right one dynamically, based on things like whether or not you specify an allowable gap size for the range tree. When you create it, it decides which of these to use, and it chooses the most efficient one.
At this point we're down to 16 bytes: a 64-bit start offset and a 64-bit end offset. But you don't actually need all of that, because you're never allocating a chunk smaller than a disk sector. It turns out that if you just start counting from the start of the metaslab rather than the start of the disk, and you count in sectors instead of bytes, basically every disk in existence can be addressed using a 32-bit integer instead of a 64-bit integer. So you just make that change and you save another 8 bytes. This diagram, unlike the earlier one, is to scale: the left side is what it was before, and the right side is what it is afterwards, so it's a nice shrink.
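In rough terms, the shrink looks like this (a simplified sketch of the before and after layouts; the corresponding OpenZFS types are, I believe, range_seg64_t and range_seg32_t, and the real 32-bit offsets are shifted sector counts relative to the region the tree covers):

    #include <stdint.h>

    /* Before: every free segment carried two full 64-bit byte offsets, a fill
     * word, and two embedded AVL nodes of roughly 24 bytes each (shown here only
     * as comments), about 72 bytes in total. */
    typedef struct old_range_seg {
        /* avl_node_t rs_node;      ~24 bytes, offset-sorted tree */
        /* avl_node_t rs_pp_node;   ~24 bytes, size-sorted tree   */
        uint64_t rs_start;          /* byte offset from the start of the disk */
        uint64_t rs_end;
        uint64_t rs_fill;           /* only needed for sorted scrub/resilver */
    } old_range_seg_t;

    /* After: the tree nodes own the memory, the fill word is dropped for the
     * metaslab free trees, and offsets are stored as 32-bit sector counts
     * relative to the start of the metaslab: 8 bytes per segment. */
    typedef struct new_range_seg {
        uint32_t rs_start;          /* sectors from the start of the metaslab */
        uint32_t rs_end;
    } new_range_seg_t;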
So what was the overall effect of this? Well, we took the segment from 72 bytes to 8 bytes in most cases, for these free trees. But you're storing two trees, because you still have the size-sorted tree and the offset-sorted tree, and they don't get to share any space with each other with B-trees; and, as I said, these nodes are not always full, they're between half full and completely full.
So, already a really good start: a third as much memory is a nice win. But we realized we could do a little bit better. Think back to when I talked about the cursor-based allocation scheme, where you remember where you did the last allocation and then look forward a little way to find your next place, and if you don't find anything nearby, you go to the size-sorted tree and find something there. Well, if you're doing a small allocation,
you would usually just find a place to put it using the first-fit algorithm, and only one percent of the time did you actually go to the size-sorted tree. But on the other hand, if you look at the actual range trees, something like 90 percent of all of the segments are these really, really small regions, less than 16 kilobytes, if you have a badly fragmented system. So we just don't store those really small segments in the size-sorted tree, and it turns out
that for an allocation you'll probably use the rest of that space anyway, and if not, maybe you get a small increase in fragmentation. But the results are pretty nice, because now, instead of 30 percent of the original memory usage, it's 16 percent, so we're using about one-sixth as much memory as before to store basically the exact same information. In addition to that, loading the space map into the range tree takes 60 percent as much CPU, so we cut 40 percent of the CPU off just by switching data structures and not loading all of the small segments into the size-sorted tree.
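A minimal sketch of that filtering idea (illustrative only; in OpenZFS the cutoff is controlled by a tunable along the lines of metaslab_by_size_min_shift, and the small segments of course stay in the offset-sorted tree so frees still coalesce correctly):

    #include <stdbool.h>
    #include <stdint.h>

    #define SIZE_TREE_MIN_SHIFT     14      /* 16 KiB, illustrative cutoff */

    /* Only segments at least this large are worth indexing by size; the
     * offset-sorted tree still tracks every free segment. */
    static bool
    worth_indexing_by_size(uint64_t seg_len)
    {
        return (seg_len >= ((uint64_t)1 << SIZE_TREE_MIN_SHIFT));
    }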
So that's again a really significant improvement. And one improvement that we weren't even thinking about, but got sort of by accident, is metaslab unloading. In the old model, when you unloaded a metaslab and freed all these range segments, you were doing all these tiny frees to lots of different pages in your allocator, and you would very rarely actually empty one of the pages in your allocator, so that memory didn't really come back; now that the segments live in a handful of tree node buffers, unloading a metaslab returns the memory much more effectively.
That was a benefit we didn't even know we were going to get. So we've made huge progress, we've made great strides: we're loading metaslabs better, we're keeping more metaslabs loaded, and we're using less memory to do it. Everything's coming up roses. But it's still not working, and the problem, it turns out, is that there just weren't enough 128-kilobyte blocks.
The system was so low on them that we just ran out, and we had to force TXGs to happen faster and faster so that we could reclaim these blocks and do more allocations. We had lots of places to actually put the data, we had lots of small segments we could have used if we were doing the writes asynchronously, but we didn't have anywhere to put these 128-kilobyte blocks; the fragmentation was just too high.
This is sort of the thing a slog device is kind of intended for, right? It gives you this nice pristine space to do these allocations from. But there are some problems with it. One, it requires some logistical overhead: you have to actually go and manage all the pools, add these disks, and set them up correctly. The other thing is that it creates a problematic performance bottleneck, because a lot of our customers already have
all-SSD pools (it looks like the slide is coming back). If you were to just attach another SSD and use it as your slog device, now all of your synchronous writes are hitting one disk instead of hitting all your different disks, and it becomes this really big bottleneck. And in a lot of cases they didn't have faster disks available, especially if you're in something like the cloud, where you already have all the fastest disks.
The only faster disks you can get are ephemeral, and if your slog device is ephemeral, you start losing data, and that's not good. So we really liked the idea of the slog, but we needed some way to alleviate these issues, and that led us to the embedded slog project. The idea here is pretty straightforward: you pick the best metaslab on each disk, best here meaning the one with the most free space, and then you make that a little mini slog. Each disk has its own little slog.
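Conceptually, the selection is as simple as it sounds; a rough sketch (hypothetical types and helpers, not the actual implementation, which also re-evaluates the choice as the pool changes over time):

    #include <stdint.h>

    /* Hypothetical stand-ins for the real structures. */
    typedef struct metaslab { uint64_t ms_free_space; int ms_embedded_log; } metaslab_t;
    typedef struct vdev { int vd_ms_count; metaslab_t *vd_ms; } vdev_t;

    /*
     * Pick the metaslab with the most free space on this vdev and mark it as the
     * embedded log metaslab: ZIL (log block) allocations will prefer it, and
     * normal allocations will avoid it unless there is no other choice.
     */
    static void
    choose_embedded_log_metaslab(vdev_t *vd)
    {
        metaslab_t *best = &vd->vd_ms[0];
        for (int i = 1; i < vd->vd_ms_count; i++) {
            if (vd->vd_ms[i].ms_free_space > best->ms_free_space)
                best = &vd->vd_ms[i];
        }
        best->ms_embedded_log = 1;
    }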
One of the nice things about this change is that it's both forwards and backwards compatible. If you take a pool that was created with this code running and import it on an old system, there's no on-disk change; it's just a priority preference in the allocation process, so it will import fine on the old system.
If you take a pool created on an old system and import it on a system with this feature, it won't be perfect, because all of your metaslabs are going to be somewhat dirty, but it'll pick the best one, and over time, as frees get issued to that metaslab, there will be more and more of these 128-kilobyte segments that you can use to do these slog allocations. So it just improves naturally over time as your system ages.
We'll still use this metaslab for regular allocations if we have to, but when we need to do ZIL allocations we'll try to use it for those. And it turns out this gives a really, really substantial performance win: we would get a 40 to 50 percent increase in IOPS on a random sync write workload in situations where fragmentation was really high. So it was a really, really nice improvement for a really conceptually simple idea: just having this little slog on every single one of your disks.
So, the current status of these projects: the load and unload changes, the max size change, keeping more metaslabs loaded, and some other changes involving parallel allocation and things like that are all in ZoL master. The B-tree and range tree changes also landed in ZoL master, or OpenZFS master I guess, not that long ago. Embedded slog is not yet upstream.
If you want to look at the code, it is in the Delphix ZFS repository. A related project that I didn't talk about, but which is also interesting in the sync write space, is the log spacemap project; if you want more details about that, you can look at Serapheim's talk from the 2017 dev summit. That's actually in ZoL 0.8, so you're already getting performance wins from it if you're running a recent release. So yeah, we managed to resolve a lot of problems and make a lot of things better.
So, when you're allocating in the slog: ZFS will try to go to the slog first to do these ZIL block allocations, but it uses the same basic allocation method; the algorithm itself is the same. So you'll still run into some of these problems if you have sufficiently high sync write workloads, and some of these changes are still beneficial even on plain slog disks, but it takes a lot more work to hit them.
The question was: should we allocate things differently on slogs, because they have this different workload? It's definitely possible that if we had a separate slog allocation policy, we could get improved performance. I haven't done any research on it, but because you're more confident that the disk is just going to have a lot of space available,
you could be more naive about it, like just doing first fit or just doing best fit, choosing a faster algorithm. But the costs of the actual picking process are not that high; as I explained, the costs are really very heavily in loading metaslabs. So it would maybe give you a little bit of CPU savings, but it probably wouldn't be that significant. George, did you ever...?
Yeah, so the question was: did the change to the new segment-based weighting algorithm exacerbate any of these problems? It's actually the opposite. With the old method, you could have a disk where, to take an extreme example, every free segment was four kilobytes and every allocated segment was also four kilobytes, and it would still look attractive because it had plenty of free space in total.
So in practice we found that this algorithm did improve things substantially. There was another idea we came up with, to do an exponential weighting scheme where it would consider not just the highest bucket but also the next highest bucket at reduced weight, and the next one after that. This idea seemed promising, but we ran into the problem that I talked about, where even if we were picking the perfect metaslab every time, there still weren't enough big segments to keep up.
So, the question was about earlier, when I was saying that it would insert these eight-kilobyte writes into these 128-kilobyte ZIL blocks and then just move on. I misspoke slightly: it does actually keep using the same ZIL block, trying to fill up all that space with more and more writes. Also, synchronous writes aren't the only things that end up in ZIL blocks; some other things end up in there as well. But it will put multiple writes into the same ZIL block to use the space.
You can change the ZIL block sizes; one of the workarounds we had temporarily was to change these from 128 kilobytes to 36 kilobytes, which alleviated the problem but didn't completely solve it, because eventually the customers' free segment sizes just got forced down even farther. So it was really just a stopgap while we figured out some of these better solutions.
For those watching the video later: Matt was basically pointing out that I fibbed slightly when making this presentation. I told it as one single story of one single case, but in practice this is really a spectrum of problems. Some customers were fine after the first fix, and some of them really needed all of these approaches to alleviate their problem, but it flows a little better as a story if you tell it as one single use case.
Yeah, George was saying that, because loading them takes so much less memory and because we keep so many more of them loaded, if you go back to that allocation loop you can see that we can make much better choices throughout the entire process, and it really just makes the whole thing significantly easier to work with. Tom?
In 0.8 there was no cap; it could use as much memory as it wanted, yeah. It would just use as much memory as it needed to load the metaslabs it wanted to load. So we did add that cap. It's a tunable that you can change; ZoL, I think, doesn't have a good way to change it at run time, but on illumos and such you can just change it whenever and the system will behave accordingly; it doesn't have to be done at boot time.
So the question was whether metaslab data is counted as part of the ARC metadata sizing. The answer is that it's not. If you load the space maps from disk into memory, the space maps themselves can be cached in the ARC, and that data counts as metadata in the ARC, but the range trees are not actually part of the ARC; they're allocated out of an entirely separate cache, and so they don't count towards any ARC quotas.
Until we added the memory cap, they were just totally uncapped and totally untracked; they would use as much memory as they wanted. So there is a way in which you can have space maps in the ARC, they will just naturally get cached as you're reading them, and that does count towards ARC metadata, but the range trees don't.
Once ZFS fragmentation starts to spike, there's a performance cliff, as we refer to it, and if you're not really nearing that, you're probably not going to have that much memory used by loaded metaslabs, because the range trees are very efficient if fragmentation isn't very high. That same 72-byte range segment could represent 512 bytes of free space or 5 gigabytes of free space.
If the fragmentation isn't high, they work great; it was really only under these very heavy situations that you started to run into specific memory issues. So it is competing with itself a little bit, but it's not usually a problem. Yeah, the ARC will shrink as it sees that memory pressure, so the ARC is the one that's going to yield; the range tree stuff will basically take priority over it.
So, for the B-tree, the leaf nodes, I believe, are four kilobytes. The core nodes are actually allocated slightly differently, just as a design decision: they're always 128 elements wide, and so their size varies according to that. That was just so that I could have more predictable branching factors when I was doing the math, but in practice I think that also works out to a couple of kilobytes.
Yeah, so actually, in a lightning talk, there's a little bit more about this. The question was: can you achieve some of the same benefits of loading more metaslabs by having larger metaslabs, so there are fewer of them? And the answer is yes; the downside is that the loads become even slower and each metaslab takes up even more memory when loaded. So you do get some of the same benefits, but it does mean that if you only needed to load one metaslab, that load costs you more.
The question was: could we have found this without DTrace? And the answer is maybe. Fundamentally, you need some way to do profiling, and DTrace, in addition to everything else, is a great profiling tool. If we'd had some other way to do profiling, we probably would have found it. If we didn't have any way to do it, we would have had to do a lot of instrumentation, like trying to add counters and do analysis and things like that; it would have been very difficult.
This work was primarily done on illumos; all of it has been tested on Linux. The performance wins are smaller on Linux, and we think the main reason for that is just that Linux has slightly different performance bottlenecks. We did all our analysis on illumos, we did all our testing on illumos, and so this was really optimized towards that. But all of these changes give performance wins on Linux as well, usually slightly smaller numbers, but they're still pretty real.
The question was: if you have systems on Linux and you're trying to figure out whether or not they're running into these issues, where do you look? One thing you can do is look at a flame graph; you can generate those on Linux pretty easily, and if you see a lot of time spent in the allocation path, that's a sign.
But specifically, if you start looking at how long your synchronous I/Os take to complete, and then compare metaslab loads to allocations using something like bpftrace, you can see some of the same indicators we were hitting. And if you look at systems that have some of the same workloads we were running, you're probably going to see some hints of these same problems, though maybe not to the same extent.
The question was: would offloading metadata to special, faster devices have helped? If we had that configured on our customer systems and set up properly, it would have made reading the space maps from disk faster, but the problem is that any time your sync write is waiting for a disk read, even a fast disk read, you've already lost.
You can reduce the extent of the loss, but you're still in a really bad place, and even if the space map is loaded in the ARC, just loading it into the range tree itself was a very time-consuming process, so it wouldn't have helped with that at all. It might have changed the fragmentation behavior a little bit, but not significantly.
Yeah, yeah, you could, yeah. I agree: you could get the flame graphs with perf. If you want to do some of the analysis we did, bpftrace is now reaching the level of capability where it can do almost everything we did with DTrace; there are still some things it can't do, but it's definitely progressing pretty quickly, so that's my current preferred tool of choice. You could also do it with SystemTap and things like that; there would just be a lot more cursing. Yeah, that's a sign from George.
For those who don't know, block pointer rewrite is a project where you could actually take the data on disk in ZFS and modify the block pointers, even the ones in things like snapshots that are supposed to be read-only. That would be necessary because, as you move the data, you need to update the checksums of parent blocks and things like that. I know that a number of people have asked for this over the years; an extremely large number of people.
If somebody was sufficiently motivated, I think that block pointer rewrite is a solvable problem, but in practice it turns out most of the things you want it for, other than defragmentation, can be solved in other ways. For example, device removal was previously considered a BP-rewrite class of problem, but we found a way to solve it that was more efficient and didn't require quite as many of the same constraints. So defrag remains sort of a long-term theoretical goal.
I didn't work on device removal, so I don't know exactly what the status of that is. I know roughly how it works, and I know that there's no theoretical barrier to extending device removal further, but I don't know if anybody's working on it or what the status is; I'm sure somebody here knows. And I think we're probably out of time at this point, so we can wrap up.
Yeah, the comment was that if you start with a really small pool and just double your disk size over and over, you end up with something like a thousand really tiny metaslabs, and that's why we don't recommend that you just increase the disk size over and over; you have to actually add new disks and things like that.
Yeah, Matt's saying that even if you do have that situation, where you have lots and lots of tiny metaslabs, with these changes and with the log spacemap changes it makes everything so much better that you'd have to get into a really extreme performance case to start running into these problems again. You don't have to worry about your metaslab sizes nearly as much now. Yeah, you've got at least five years.
We could consider changing things like the minimum or maximum size of metaslabs. I think at least some of these numbers are largely arbitrary; I don't know how much research and math went into them. So if somebody wants to start a campaign to change these numbers, and actually do the data gathering and figure out what the right numbers should be, go for it, have fun; I look forward to seeing what your results are. They'll be really interesting.
Yeah, so the question was: in a lot of use cases synchronous writes are unintentional; were our writes intentional? And the answer is yes: databases over NFS will try to always do these synchronous writes, because they want to actually know that their data has been persisted. We have a mode that we can turn on where we just ignore the fact that they're synchronous and return immediately.
We call it NPM, for NFS performance mode, but the acronym has another meaning: it's called no-pants mode. So you can do that, and it was considered as a possible alternative, as in, well, if our customers are truly hosed we can do this instead, but it does impact reliability significantly. So we were doing sync writes on purpose. If you're doing sync writes accidentally, you have a different problem, and I think you should be looking at that before you start looking at this stuff.