Description
From the 2021 OpenZFS Developer Summit
slides: https://docs.google.com/presentation/d/1r2P2kQHozr_RDLLf5JgPXqJOdVE7VlxH9B92jGAODLI
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
A: All right, let's talk ZettaCache. As you heard a little bit this morning about the ZettaCache, mostly about the object store, I want to start by reviewing a little bit about the object store to set some context here.
A: As we saw this morning, conventional wisdom says that's not a good model for highly transactional, latency-sensitive workloads such as what we want it for, mostly because of latency. But we know how to deal with latency in ZFS, right? You can mask your synchronous write latency with a SLOG.
A: You can add a cache to mask your read latency, like the L2ARC, and you should be good to go, except for those troublesome workloads, and that happens to be the Delphix use case. We have a small block size, an average of 3k, which means a high proportion of metadata to data. We're talking about a random I/O workload, which means little I/O consolidation.
A: We're not talking streaming; we're talking large working sets. For example, a 50 terabyte working set would mean upwards of 18 billion blocks that we need to try to keep track of, and we want to operate in a constrained memory size, something on the order of 128 gigabytes. That means we can't keep all that metadata in memory, so the L2ARC is simply not going to do the job for us. It can't support the target working set, at least not in 128 gigabytes of memory.
A: That was 18 billion blocks. At an average of almost 100 bytes of metadata per block you're tracking in the cache, that means you're going to need on the order of almost two terabytes of memory just to manage the blocks in the cache.
A: In addition to that, it has a FIFO eviction policy, which can be, and is, very ineffective when trying to size your cache. So even if you get a lot of data into the cache, you can't depend on unused data not clogging things up; it doesn't get evicted effectively. And in fact you can have frequently used data in your cache that will still be evicted eventually, just as it works its way through the FIFO queue.
A: So you really can't size the cache to guarantee that your working set is going to stay there. So we want something a little different. Our goal here is to get performance comparable to ZFS on EBS, and that means ZFS backed by S3, with a ZettaCache based on EBS storage. We want to reduce that 20 millisecond latency to a two millisecond latency, so that the high latency from S3 gets averaged out to a nice two millisecond latency, hopefully, from our ZettaCache.
A: We want to support very large caches, and we want to do that with a small memory footprint, on the order of two gigabytes, or less than two gigabytes, for that 50 terabytes of cache. That means it has to be persistent by design, not as an afterthought, and we need to improve the eviction policy: rather than FIFO, we're looking at LRU. We also want to be able to understand the cache size and its effectiveness.
A: So this is what the object agent looks like in ZFS. As George talked about, the ZFS object vdev communicates through to userland to talk to the ZFS object agent, and within that we have the ZettaCache, which is where the I/Os will be looked for first; if a block is not found there, it goes off to look in the cloud.
A: Why did we put the ZettaCache up in userland, as opposed to embedding it in the kernel as our other caches are? A few good reasons. One, in this architecture we wanted to keep the ZettaCache very close to the object agent for efficiency's sake; it's the primary consumer of this, and it made sense, logically, to put it up there.
A: It keeps us separate from the bulk of ZFS, so we could develop without having to inject a lot of new code paths or changes into the ZFS I/O pipeline or other parts of the ZFS architecture. This is a completely separate entity, so it really gives us an easier environment to do our development, because in userland we are not dealing with a new kernel every time we make a modification. We are also able to develop in Rust, which is really nice. So it's helped our development cycle to be much improved.
A: I think I talked about these points, but one of the questions you should probably have in your mind right now is: is it going to perform if it's in user space? And the answer, for us at least right now, is absolutely. Our needs really are only to be able to drive an AWS instance effectively and efficiently.
A: Most AWS instances do less than 20k IOPS to EBS storage, and we can do that now easily using a simple pread/pwrite interface, so it certainly meets our needs. And we believe it will meet any needs in the future, because it's possible to go almost as fast in user space as in the kernel, using userland NVMe drivers, for example, or the io_uring interface. So we don't believe this is going to be a bottleneck to our performance here.
A: The basic components of the ZettaCache are an index, used to find data in the cache; some form of allocator, to store data in the cache; and a checkpoint mechanism, to create a consistent, recoverable state. Now, if you're looking at this picture on the right, you might be thinking to yourself:
A: you know, that looks an awful lot like a diagram of a file system, and file systems are hard. And if you're thinking that, you're right; both those things are true. It does look a lot like a file system, and file systems are definitely hard. I know: as part of the team, we spent 50 engineer-years before ZFS first went to first customer ship. It was a big effort. So we need to simplify the requirements. We aren't actually going to develop a new ZFS here; we're developing a cache.
A: So we should step back and ask: what are the actual requirements we have for this storage? And that is a persistent, searchable index that doesn't need to be, in fact can't be, fully cached in memory, but that can support lookups in at most one read. Remember, our goal here is to emulate ZFS on direct storage.
A: We use checkpoints for that. So let's look at some simplifications. One, we're only going to have one block pointer for each block; we don't need any snapshots or other features, so most of our metadata needs are pretty simple here, and in fact we can store all our metadata in logs, since there are no logical overwrites of data.
A: If you find an entry in the index, then you read that block from the cache. To insert a new block, you get new space on disk for it from the block allocator, write the block out there, and then insert a new entry in the index that says: here's that new block on disk.
A: Okay, I skipped one, sorry. So the index is just a way to map a block ID to a disk location. If you think about ZFS today, for example, in the ARC we have a large hash table that does that for both the ARC and the L2ARC.
A: But that's a little bit of a complicated data structure to have a persistent representation of, so we've chosen a simpler data structure to represent it, which is essentially a log-structured merge array: an immutable, ordered array, or run, of entries that we store persistently, and then we track pending changes in a log and in memory. Whenever we want to transition from one index to the next, we apply our set of pending changes to the old run to produce a new run.
A: So the old run is ordered by block ID, and we apply our pending changes to it to produce a new run on disk.
A: We also keep a summary of the index: an array of mappings, which map the first block ID in each chunk of the index to that chunk's location on disk, and we keep that mapping in memory. That's our first lookup point: when we want to look up a block in the index, we consult our in-core summary.
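A minimal Rust sketch of what that in-core chunk summary lookup could look like; the struct fields and numbers are illustrative assumptions, not the actual ZettaCache code:

```rust
// Hypothetical sketch of the in-core chunk summary: for each on-disk chunk of
// the index, remember the first block id it covers and where the chunk lives.
#[derive(Clone, Copy, Debug)]
struct ChunkSummaryEntry {
    first_block_id: u64, // first block id stored in this chunk
    disk_offset: u64,    // where the chunk of index entries starts on disk
}

// Entries are sorted by first_block_id, so one binary search tells us which
// single chunk (at most one read) could contain a given block id.
fn chunk_for_block(summary: &[ChunkSummaryEntry], block_id: u64) -> Option<ChunkSummaryEntry> {
    match summary.binary_search_by_key(&block_id, |e| e.first_block_id) {
        Ok(i) => Some(summary[i]),      // exact first-id match
        Err(0) => None,                 // before the first chunk
        Err(i) => Some(summary[i - 1]), // falls inside the previous chunk
    }
}

fn main() {
    let summary = vec![
        ChunkSummaryEntry { first_block_id: 0, disk_offset: 0 },
        ChunkSummaryEntry { first_block_id: 1_000, disk_offset: 128 * 1024 },
        ChunkSummaryEntry { first_block_id: 2_000, disk_offset: 256 * 1024 },
    ];
    // Block 1500 lives in the chunk starting at block id 1000.
    println!("{:?}", chunk_for_block(&summary, 1_500));
}
```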
A: All right, now, pending changes are kept in memory in a pending-changes tree; that's for efficient lookups and inserts. But we also need to keep that persistent, because we don't want to lose pending changes when we panic or exit. So we maintain a persistent version of that, which we call the operation log, on disk. Whenever we add entries to the pending-changes tree, we also add entries to the end of our operation log.
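Here is a small, hypothetical Rust sketch of the pending-changes tree paired with an append-only operation log, just to make the relationship concrete; the type and field names are made up for illustration:

```rust
use std::collections::BTreeMap;

// Hypothetical operations that can be pending against the on-disk index.
#[derive(Clone, Debug)]
enum PendingOp {
    Insert { disk_offset: u64, atime: u32 },
    Remove,
    Touch { atime: u32 }, // access-time update
}

// In-memory pending-changes tree (fast lookup/insert by block id) plus an
// append-only operation log that makes the same changes persistent.
#[derive(Default)]
struct PendingChanges {
    tree: BTreeMap<u64, PendingOp>,        // block id -> latest pending op
    operation_log: Vec<(u64, PendingOp)>,  // stand-in for the on-disk log
}

impl PendingChanges {
    fn record(&mut self, block_id: u64, op: PendingOp) {
        // Every change is appended to the log *and* reflected in the tree,
        // so a restart can replay the log to rebuild the tree.
        self.operation_log.push((block_id, op.clone()));
        self.tree.insert(block_id, op);
    }
}

fn main() {
    let mut pending = PendingChanges::default();
    pending.record(42, PendingOp::Insert { disk_offset: 4096, atime: 7 });
    pending.record(42, PendingOp::Touch { atime: 9 });
    println!("{} logged ops, {} distinct blocks", pending.operation_log.len(), pending.tree.len());
}
```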
A: So to move from state to state, we simply create a checkpoint, about every minute, where we persist the pending changes. That means we make sure all the entries in the pending-changes tree have been written out to the operation log, and then we store in that checkpoint the metadata necessary to find that pending-changes list, as well as a pointer to the index, which is already persistent on disk. So if we do have a crash, we lose less than a minute of cache data.
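A hedged sketch of the kind of metadata a checkpoint could carry, based only on the description above; the struct shape is an assumption:

```rust
// Hypothetical shape of a checkpoint: just enough metadata to find the
// persistent pieces of the index again after a crash or restart.
#[derive(Debug)]
struct DiskExtent {
    offset: u64,
    len: u64,
}

#[derive(Debug)]
struct Checkpoint {
    index_run: DiskExtent,     // the current immutable, sorted index run
    chunk_summary: DiskExtent, // per-chunk summary, re-read into memory on open
    operation_log: DiskExtent, // pending changes logged since the last merge
    generation: u64,           // which checkpoint is newest
}

fn main() {
    // Written roughly once a minute; a crash costs at most about a minute of
    // cached (not primary!) data.
    let ckpt = Checkpoint {
        index_run: DiskExtent { offset: 1 << 20, len: 8 << 20 },
        chunk_summary: DiskExtent { offset: 9 << 20, len: 64 << 10 },
        operation_log: DiskExtent { offset: 10 << 20, len: 2 << 20 },
        generation: 17,
    };
    println!("{:?}", ckpt);
}
```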
A: So how does that work? Well, when we reopen the cache at any point, we first read the superblock, at a well-known location, which points to the current checkpoint block. We read that checkpoint block; a checkpoint block has essentially our meta-metadata within it: a pointer to the index run and the summary, as well as the operation log, so all the components that make up our index.
A: So what happens is it simply reads the summary into memory and then ingests the operation log, replaying it to reconstruct the pending-changes tree. Once that's completed, we're able to operate the cache normally. But we can't accumulate these pending changes forever, right? Pending changes represent a set of deltas to the index, and if we just continue to accumulate entries in there, it's going to get too large over time.
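To tie those steps together, here is an illustrative Rust skeleton of the open/recovery path just described; every function is a stub standing in for real I/O:

```rust
// Hypothetical open/recovery sequence, following the steps described above:
// superblock -> checkpoint -> read summary -> replay operation log.

struct Superblock { current_checkpoint_offset: u64 }
struct Checkpoint;          // stand-ins: real versions would carry disk extents
struct ChunkSummary;
struct PendingChangesTree;

fn read_superblock() -> Superblock { Superblock { current_checkpoint_offset: 0 } }
fn read_checkpoint(_off: u64) -> Checkpoint { Checkpoint }
fn read_summary(_c: &Checkpoint) -> ChunkSummary { ChunkSummary }
fn replay_operation_log(_c: &Checkpoint) -> PendingChangesTree { PendingChangesTree }

fn open_zettacache() -> (ChunkSummary, PendingChangesTree) {
    let sb = read_superblock();                        // well-known location
    let ckpt = read_checkpoint(sb.current_checkpoint_offset);
    let summary = read_summary(&ckpt);                 // in-core first lookup point
    let pending = replay_operation_log(&ckpt);         // rebuild the pending-changes tree
    (summary, pending)
}

fn main() { let _ = open_zettacache(); }
```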
A: So we need to merge those pending changes periodically into the current run to produce a new run, as I illustrated on an earlier slide.
A: So we read the old index in order and apply the pending changes to it. When we come across, for example, an insert, we insert a new entry from the pending changes into the index, so the index grows. If we find a remove request, we remove that entry. Or, if we had an access to a block, then we update its atime; I'll talk more about how atimes are used in a little bit.
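A simplified Rust sketch of that merge step, applying inserts, removes, and atime updates from the pending changes to the old run; the real code streams through the run rather than loading it, so treat this purely as an illustration:

```rust
use std::collections::BTreeMap;

// Hypothetical index entry and pending operation, matching the walkthrough above.
#[derive(Clone, Copy, Debug)]
struct Entry { block_id: u64, disk_offset: u64, atime: u32 }

#[derive(Clone, Copy)]
enum Op { Insert(Entry), Remove, Touch { atime: u32 } }

// Merge the old (sorted) run with the pending changes to produce a new run.
fn merge(old_run: &[Entry], pending: &BTreeMap<u64, Op>) -> Vec<Entry> {
    let mut new_run: BTreeMap<u64, Entry> =
        old_run.iter().map(|e| (e.block_id, *e)).collect();
    for (&block_id, op) in pending {
        match *op {
            Op::Insert(e) => { new_run.insert(block_id, e); } // index grows
            Op::Remove => { new_run.remove(&block_id); }      // entry dropped
            Op::Touch { atime } => {                          // record the access
                if let Some(e) = new_run.get_mut(&block_id) { e.atime = atime; }
            }
        }
    }
    new_run.into_values().collect() // still sorted by block id
}

fn main() {
    let old = [Entry { block_id: 1, disk_offset: 0, atime: 3 }];
    let mut pending = BTreeMap::new();
    pending.insert(2, Op::Insert(Entry { block_id: 2, disk_offset: 8192, atime: 7 }));
    pending.insert(1, Op::Touch { atime: 7 });
    println!("{:?}", merge(&old, &pending));
}
```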
A: But the index also cannot itself grow forever. The index is a mapping of where all the blocks are in the cache, and the cache is finite in size. So eventually you have an index that represents a full cache, and we'll need to start evicting data from the cache so it can continue to absorb new entries.
A: So this is where we use our atimes, our access times. Every block has a last access time. The atime counter is incremented every 10 seconds, so it's really a measure of a sort of artificial time since the cache was first created: every 10 seconds it increments, and every block that is inserted into the cache or read from the cache during that 10-second period is marked with that particular access time.
A: This gives us an approximation of an LRU if we create the histogram that you see on the right: every time we ingest a block, we add the size of the block to the histogram at that atime. In the diagram on the right, the numbers represent gigabytes of data. So, for example, at atime one we ingested 20 gigabytes of data, at atime two we ingested 30 gigabytes of data, and so forth.
A: If we access a block, say we are at atime seven and we access a block last touched at atime two, then that read causes us to change the atime of that block. Say it was an 8k block: we would decrement bucket 2 of the histogram by 8k and increment bucket 7 by 8k. That's how this particular histogram grows and shrinks and changes over time.
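The histogram bookkeeping just described could be sketched like this in Rust (illustrative only; bucket layout and names are assumptions):

```rust
// Hypothetical atime histogram: bucket i holds the total bytes of cached
// blocks whose last access fell in atime slot i.
struct AtimeHistogram { buckets: Vec<u64> }

impl AtimeHistogram {
    fn new() -> Self { AtimeHistogram { buckets: Vec::new() } }

    fn bucket(&mut self, atime: usize) -> &mut u64 {
        if self.buckets.len() <= atime { self.buckets.resize(atime + 1, 0); }
        &mut self.buckets[atime]
    }

    // A newly ingested block just adds its size at the current atime.
    fn ingest(&mut self, now: usize, size: u64) { *self.bucket(now) += size; }

    // A read moves the block's bytes from its old atime bucket to the current one,
    // e.g. an 8 KiB block last read at atime 2, read again at atime 7.
    fn access(&mut self, old_atime: usize, now: usize, size: u64) {
        *self.bucket(old_atime) -= size;
        *self.bucket(now) += size;
    }
}

fn main() {
    let mut h = AtimeHistogram::new();
    h.ingest(1, 20 << 30);   // 20 GiB ingested at atime 1
    h.ingest(2, 30 << 30);   // 30 GiB ingested at atime 2
    h.access(2, 7, 8 << 10); // an 8 KiB block moves from bucket 2 to bucket 7
    println!("{:?}", h.buckets);
}
```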
A: Now, when we want to evict, all we need to do is look at this histogram and count up the space used by the blocks with particular atimes. In particular, we look at the current atime, which in this diagram is atime nine, and count back until we reach the cache size that we want to try to maintain. In our example, we're trying to maintain, say, 300 gigabytes of data within our cache.
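A small sketch of computing that eviction cutoff from the atime histogram, with made-up bucket values rather than the slide's numbers:

```rust
// Hypothetical cutoff computation: walk the atime histogram from the newest
// bucket backwards, accumulating bytes, until we reach the target cache size.
// Everything older than the returned atime is eligible for eviction.
fn eviction_cutoff(buckets: &[u64], target_bytes: u64) -> usize {
    let mut kept = 0u64;
    for atime in (0..buckets.len()).rev() {
        kept += buckets[atime];
        if kept >= target_bytes {
            return atime; // keep this bucket and newer; evict older
        }
    }
    0 // everything fits; nothing needs to be evicted
}

fn main() {
    let gib = |x: u64| x << 30;
    // Bytes per atime bucket (index = atime); illustrative values in GiB.
    let buckets: Vec<u64> =
        [0, 20, 30, 40, 10, 60, 70, 40, 40, 90].iter().map(|&g| gib(g)).collect();
    // Trying to keep roughly 300 GiB: counting back from atime 9 we cross
    // 300 GiB at atime 5, so atimes older than 5 are evictable.
    println!("cutoff atime = {}", eviction_cutoff(&buckets, gib(300)));
}
```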
A: So we can evict based off of that: that cutoff, atime five, represents our cutoff for eviction, which we now leverage during merge. Remember, merge is already going through all of the index blocks and all of the pending-changes blocks to create a new index, so it can also examine the atimes while it's doing that job. If it sees a block that has an atime before our cutoff...
A: ...that's an evictable block, and it becomes evicted. You can see in this example we have a set of pending changes, we have an old run, and we're generating a new run, and we see that we have blocks four, five, eight and nine that all have an atime before five.
A: And so when we write our new run, we come along and say: all right, four, that's evictable; five, evict; eight; nine... well, we don't actually evict nine, because we also have a pending change for nine, which has a new atime, so that takes priority. It says: we actually accessed nine since our eviction cutoff, and so nine goes into the new index.
A: Now, it's really kind of cool that, now that we have this mechanism for managing blocks based off of atimes, we can actually leverage that same atime histogram to answer the age-old question of how you should size your cache.
A: How effective is my cache at a particular size? We use that same atime histogram to generate a new histogram, which we call the hits-by-size histogram. The hits-by-size histogram leverages that eviction model: whenever we get a hit in the cache...
A: ...we can say: this block we just hit in the cache, what was its atime? Not the current atime, which we're about to update, of course, because we just did a hit, but the atime it had when we actually found it in the cache. That tells us essentially the latest atime at which it existed in the cache, and now we can count up all the space at that atime and newer and say: all right, a cache of that size would still have captured this hit.
A: On the right you see output from our zcache hits report, and this is a little bit of a toy example; it's a small working set. We're getting about a 98 percent hit rate in this cache, but if you look at the data on the right, you see that 97 percent of those hits are occurring within the first three gigabytes of the cache, according to that graph.
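The per-hit accounting behind such a report could look roughly like this; the helper name and numbers are illustrative:

```rust
// Hypothetical hits-by-size accounting: when a lookup hits, note the block's
// *previous* atime; the bytes cached at that atime and newer tell us the
// smallest cache that would still have produced this hit.
fn cache_size_needed_for_hit(atime_buckets: &[u64], hit_atime: usize) -> u64 {
    atime_buckets[hit_atime..].iter().sum()
}

fn main() {
    let gib = |x: u64| x << 30;
    let buckets: Vec<u64> = [1, 1, 1, 20, 40, 60].iter().map(|&g| gib(g)).collect();
    // A hit on a block last accessed at atime 4 would still have been a hit
    // with roughly 100 GiB of cache; accumulating these per-hit sizes gives
    // the hits-by-size report described above.
    let needed = cache_size_needed_for_hit(&buckets, 4);
    println!("this hit needed ~{} GiB of cache", needed >> 30);
}
```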
A: We can also extend this idea and not just maintain this information for the current cache. If we extend our atime histogram data, we don't throw away the historical data contained within it; the atime histogram records that data for all of our atimes, and we can choose when, if ever, to evict information out of it. We can use that history of sizes to actually say: well, you know, I looked up this block now...
A: We also have to keep a little bit of extra data in our index, of course, which are ghost entries for blocks we've evicted recently; let's say we keep another cache-size worth of them around. That way, when I look up a block, I can say: all right, I looked up that block in the index, and it said no, I don't have that block, but I used to have that block, and when I had it, it had this atime.
A: So I can tell whether this lookup would have been a hit if I were to double the size of my cache, and that can be very valuable. Again, all of this depends on being able to collect this data over a period in which you're actually operating with normal working sets, to understand how your hits are going to behave within your cache.
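One way to picture the ghost-entry idea is a lookup result with three outcomes; this enum is purely illustrative, not the actual interface:

```rust
// Hypothetical lookup result once ghost entries are kept for recently evicted
// blocks: a miss can still tell us the cache size at which it would have hit.
enum LookupResult {
    Hit { disk_offset: u64 },
    GhostHit { evicted_atime: u32 }, // "I used to have this block, at this atime"
    Miss,
}

fn describe(result: &LookupResult) -> String {
    match result {
        LookupResult::Hit { disk_offset } =>
            format!("hit: read block at offset {}", disk_offset),
        LookupResult::GhostHit { evicted_atime } =>
            format!("miss, but a larger cache keeping atime {}+ would have hit", evicted_atime),
        LookupResult::Miss => "miss: never seen (or long since evicted)".to_string(),
    }
}

fn main() {
    println!("{}", describe(&LookupResult::GhostHit { evicted_atime: 3 }));
}
```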
A: At this point, I want to change gears a little bit, and I'm going to turn things over to Serapheim to talk about the block allocator.
B: So, first of all, what does a block allocator do? The block allocator is the subsystem that determines where data should be placed on disk, and when designing one there are multiple things that we should be taking into consideration: things like runtime performance, that is, how quickly we can satisfy allocation requests.
B: Other factors are contiguous allocations, where some devices, like hard drives, get a big performance boost whenever we allocate blocks contiguously on disk; and, finally, the memory consumption of the block allocator's metadata. Basically, the block allocator needs some kind of in-memory structure to keep track of what is allocated and free on disk, so we want to make sure those structures don't take up too much memory and that we play well with the rest of the system.
B: Before I go over what we ended up doing, I'd like to do a small recap and look at what ZFS already has in terms of block allocation. The block allocator in ZFS is part of the storage pool allocator, or what we call the SPA, and what the SPA does is basically divide each vdev into 16 gigabyte regions we call metaslabs, from which we allocate and free blocks of arbitrary sizes.
B: Now, the range trees that track free space in each metaslab can get pretty big, especially for big pools, but even for smaller pools, depending on your workload or the history of the pool. So for that reason we only keep a subset of all these metaslab range trees loaded at a time. Specifically, we only keep loaded the ones that we are allocating from, so we can find where the free space is on disk; but, as I said, we only keep a subset of them loaded because we want to conserve memory.
B: So this is more or less the design that we have today, and it has served us well and continues to serve us well, but that doesn't mean it doesn't come with its own flaws. Specifically, within Delphix, we've seen them on our database workloads, which are mainly characterized by small, compressed, random writes.
B: That's not to say that, because of this, the SPA design is not good. It actually satisfies the requirements of a file system like ZFS pretty well, but for the ZettaCache we're designing a cache, which is different from a file system, so we went back and reconsidered some of these tradeoffs, specifically things like the contiguity of allocations.
B: That is something important for hard drives, and file systems need to support hard drives, but now we're talking about a cache like the ZettaCache, where we are most probably going to be using something like SSDs or NVMe devices, so contiguous allocation on this kind of device is not as important and therefore doesn't need to be a major factor in our design.
B: The second part is disk space utilization. When we're talking about the SPA and the file system, we strive to make the most out of our hardware; our goal is, you know, 100 percent utilization. But for something like a cache, yes, disk utilization is important, but it's not the main factor in the design.
B: Then there's the memory consumption of the allocator's metadata: you want to be able to keep adding more storage to your systems without having to think about the RAM that you're actually using, and this is the whole reason why we do the whole dance with loading and unloading metaslabs. Now, for a cache this is still a very important concern, and it's part of the design because it's a valid concern, but our goal is not as strict; we can be more flexible because we are talking about a cache.
B: So, after considering all of these trade-offs, here's what we ended up doing. Block allocation in the ZettaCache works like this: we take the cache and divide it into 16 megabyte regions we call slabs, and there are three types of slabs. There are the bitmap-based slabs, which are modeled after a classic slab allocator design where all blocks are of the same size; on the bottom left there's an example image of a bitmap-based slab...
B: ...with, say, 5k blocks, where you can see that each slot is of equal size, regardless of whether it's allocated or free. Then we have the extent-based type, which somewhat resembles metaslabs and is mostly used for bigger block sizes; these contain variable-sized blocks, and you can see at the bottom center an extent-based slab picture with a 20k block that's allocated, a 50k block, some free space, and a 31k block. And finally, the last type is the empty slab type.
B: These are the completely unallocated slabs that can be converted to any of the other types on demand, as we see fit. Now, I'd like to talk a little bit more about that: basically, how do we decide what kinds of slab types we want and need, and what exactly is our allocation scheme on top of these structures?
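A rough Rust sketch of the three slab types and of converting an empty slab into a bitmap slab on demand; sizes and names are illustrative assumptions:

```rust
// Hypothetical slab model: the cache is divided into fixed-size slabs, each of
// which is empty, bitmap-based (one block size), or extent-based (variable sizes).
const SLAB_SIZE: u64 = 16 << 20; // 16 MiB

enum Slab {
    Empty,
    Bitmap { block_size: u64, allocated: Vec<bool> }, // one flag per equal-size slot
    Extent { free: Vec<(u64, u64)> },                 // (offset, len) free extents
}

impl Slab {
    // An empty slab can be converted on demand into whichever type the current
    // workload needs, e.g. a 3 KiB bitmap slab for a 3 KiB-heavy workload.
    fn make_bitmap(block_size: u64) -> Slab {
        let slots = (SLAB_SIZE / block_size) as usize;
        Slab::Bitmap { block_size, allocated: vec![false; slots] }
    }

    fn alloc_bitmap_slot(&mut self) -> Option<usize> {
        if let Slab::Bitmap { allocated, .. } = self {
            let slot = allocated.iter().position(|used| !used)?;
            allocated[slot] = true;
            Some(slot)
        } else {
            None
        }
    }
}

fn main() {
    // Converted from Empty into the hypothetical 3 KiB slab group.
    let mut slab = Slab::make_bitmap(3 << 10);
    println!("first allocated slot: {:?}", slab.alloc_bitmap_slot());
}
```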
B: For the bitmap-based slabs we have a slab group for each supported block size: 512 bytes, 1k, 1.5k, 2k, all the way up to 16k. When I say the 2k slab group, I mean the group of bitmap-based slabs that support 2k allocations. Then, for allocations larger than 16k, we have two extent-based slab groups, one for allocations of less than 64k and one for allocations of more than 64k.
B: Now, this distribution of block sizes into groups is not set in stone; it's actually tunable. But regardless of that, each slab itself may change its type and migrate between groups during the lifetime of the cache. Just to give an example, it really depends on the workload. Let's say you just created your pool, you have your data, you put your cache into action, and the cache in the beginning is empty.
B: You start reading, let's say, a dataset that has a lot of 3k blocks, and we start caching those blocks in the ZettaCache. So we go to the block allocator and we keep requesting 3k blocks.
B: What the block allocator will do is convert its empty slabs to 3k bitmap-based slabs, put them into the 3k group, and start allocating and using the space in those. Then later, if for some reason your workload changes to 6k blocks, the cache is basically going to start caching those and evicting the old 3k blocks, slowly freeing over time all those 3k slabs that it made, converting them to empty slabs and then converting them to 6k slabs to satisfy the 6k block allocations.
B: This was more of an example to describe what the runtime behavior of this scheme would look like. Most of the time, in the real world, things are not as black and white. You won't just have 3k incoming blocks or 6k; most probably you're going to have all kinds of block sizes being requested from the block allocator, effectively distributing the slabs into all these different slab groups. Another thing that I simplified a little bit is that you won't as often see slabs getting completely freed up.
B: Before, with the metaslabs, we had block-level fragmentation, because metaslabs hold variable-length block sizes. Now we've traded that problem for fragmentation at the slab level, where we can have underutilized slab groups leading to stranded space.
B: In this example, in the first row, we see that the 3k slab group has a lot of slabs, but it's not actually using the space within them that efficiently, and that's a problem. But for our cache, our block pointer scheme is a lot simpler: unlike the SPA in ZFS, where we have this complicated graph structure of block pointers and indirect blocks...
B: ...which is impractical to move around, our blocks are fully indexed, meaning we have exactly one pointer for each block, and that makes it easy for us to defragment our data and free up slabs. And when I say defragment, I mean literally: locate the underutilized slabs, say these four slabs in this example, and move their data around to free them.
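Because each cached block has exactly one index entry, an evacuation pass can be sketched as "reallocate and update that one pointer"; this toy Rust version is only meant to illustrate the idea:

```rust
use std::collections::BTreeMap;

// Hypothetical defragmentation pass: moving a block is just "copy the data,
// update that one index entry". (In the real cache this would fold into the
// periodic index merge.)
fn evacuate_slab(
    index: &mut BTreeMap<u64, u64>,              // block id -> disk offset
    slab_blocks: &[(u64, u64)],                  // (block id, old offset) in the underused slab
    mut alloc_elsewhere: impl FnMut(u64) -> u64, // returns a new offset for a block id
) {
    for &(block_id, _old_offset) in slab_blocks {
        let new_offset = alloc_elsewhere(block_id); // the data copy happens here in reality
        index.insert(block_id, new_offset);         // the single pointer is updated
    }
    // The slab is now empty and can rejoin the empty-slab pool or change type.
}

fn main() {
    let mut index: BTreeMap<u64, u64> = [(7, 100), (9, 116)].into_iter().collect();
    let mut next = 4096;
    evacuate_slab(&mut index, &[(7, 100), (9, 116)], |_id| { let o = next; next += 16; o });
    println!("{:?}", index);
}
```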
B: And finally, I just wanted to highlight that, overall, our design is a lot simpler. The requirements for a cache are, in comparison, a lot fewer than the list of requirements you need to implement a file system. So it's not that we had some genius idea; it's more that we looked at our requirements and came up with a simpler design, and reasoning about its performance is a lot more straightforward. We hope it's going to be easier to debug in production, too.
B: That's all I wanted to cover today. Mark, would you like to take over?
A: All right, so that's pretty much it for the bulk of the material that we had to present, but I would like to leave you with a couple of thoughts.
A: I believe it is still possible that you could leverage the ZettaCache in a sort of typical block-based storage configuration, by injecting a cache check into the I/O pipeline, so that at some point, once you're past the ARC and L2ARC checks in the pipeline and actually starting to go out to disk, you could upcall to the ZettaCache and ask: does this block exist in some sort of fast storage there? And get a response back.
A: So I think that is possible. Perhaps a more radical and interesting thought experiment is how you could use the ZettaCache as primary block storage, and not just a cache. As I mentioned, and as Serapheim mentioned a few times during his part of the talk, we made a lot of simplifications, and those simplifications depend on the requirements of a cache, which are things like:
A: if we don't ingest that block, that's not a big deal; it's not the end of the world, because we're just a cache, and you'll just get a miss and go to the back-end storage. Obviously, that's not going to be practical for primary storage; you have to ingest all blocks. So there would be a lot of issues to work through, but it is kind of an interesting experiment to think about.
A: I think it's something we do think about occasionally. I think that's pretty much all we have for our slides here. Are there any questions we can answer?
A: So I got a question from Allan, asking whether we could apply the atime histogram and hits-by-size histogram that I talked about to the existing ARC and L2ARC, and do something similar there to be able to generate interesting reports and information. Initially I thought, yeah, that seems plausible, but then Matt pointed out that there is more complexity in the ARC; the ARC isn't just an LRU cache.
A: It's a much more sophisticated balancing cache, between MRU and MFU data, and so it's going to be quite a bit more tricky to do that kind of predictive analysis. I mean, you could do the analysis, I think, fairly easily on either the MFU or the MRU side of the ARC, but the fact that it's constantly shifting back and forth based on workload makes trying to reason about the whole thing much more complicated, I think.
A: So I'm not sure how easy it would actually be to do something like what we've done for the ZettaCache there.
C: The other one was about when we're doing rebalancing: is it better to just evict that stuff?
C: The next question, which Paul asked, was about rebalancing versus evicting.
A: Oh yes. The reason we prefer to rebalance there comes back, again, to being able to reason about the cache. It would be simpler if we just said: you know, we have four blocks in this slab, so let's just chuck it, because it's only four blocks, right, and this is a cache.
A: But suddenly doing something like that begins to introduce more uncertainty about how to reason about your cache, because now you're no longer maintaining a strict LRU policy in terms of evictions. If you're doing a lot of this kind of work, throwing away blocks like that whenever you need space, you really don't know how your cache is behaving; it's not even FIFO, it's just kind of a random eviction pattern, and so it breaks our ability to predict. So it is certainly more efficient, but is it better?
C: Yeah, I think it's something that we could consider; the trade-off would be, you know, obviously you're doing less I/O. I think we haven't yet gotten performance results on how much eviction, sorry, how much rebalancing is really needed under, you know, even hypothetical workloads, and how big of a performance impact it is. Once we see that, maybe we'll have to reevaluate, if we see that it's more impactful than we predict.
A: Agreed.
C: Yeah, so basically Allan is asking Serapheim: could we apply the block allocator that you designed to regular ZFS metaslabs, with strict slabs and all that?
B: I mean, I think Mark covered that a little bit, on potentially using the ZettaCache as a backend for an actual pool. But actually changing the metaslab design right now? I mean, obviously we could try to do that, but things are a lot trickier there.
B: So I can definitely see the benefit, but it depends on a lot of things. Sure, for some workloads it would be great, but the design wouldn't be able to be the same, because, like, what would you even have, slabs within the metaslab?
C: Yeah, and I think that even if, let's say, you work through those problems of fitting thing A into thing B, and the complexity of modifying the current code, that's all theoretically doable. I think one of the big challenges would be rebalancing, right? Because with the cache we can avoid most allocation failures by doing rebalancing before they occur. And then, you know, if we really wanted to avoid allocation failures, you could bump things up to bigger sizes.
C: You could do that, and it might give you some kind of weird results in terms of space usage, but that seems doable. Or you could go the other way, using gang blocks, which the kernel already knows how to do. But the problem is...
C: ...I think the problem would be that you might get into a case where that is common rather than rare, where you might commonly have to be ganging, or commonly have to be bumping up to a much larger allocation size, and to get out of that you'd have to do rebalancing. But rebalancing is extremely non-trivial with ZFS; that's basically the BP rewrite problem.
C: Whereas the rebalancing is almost trivial with the ZettaCache, because we're already rewriting the whole index every once in a while, right? So while you're writing the index, you just look up and see: oh, should I be remapping this? Okay, great, let me store that new location in the new index.