From YouTube: June 2021 OpenZFS Leadership Meeting
Description
At this month's meeting we discussed: ztour; Linux user namespace; ZREPL; ZFS on Object Storage
https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A: All right, let's get started. Welcome to the June 2021 OpenZFS Leadership meeting. It looks like we don't have too many attendees and don't have too many items on the agenda, so this... well, I won't say, because I don't want to jinx it, but let's get started, and we should have time if folks have other things that they want to discuss or questions that are not on the agenda.

A: First was the question from Rich. I don't know if you're on, Rich, but they're asking about ztour: is anyone working on it or an equivalent? I think, if you remember from many years ago, at the developer summit conference there was a demo of ztour, which was like a facility for examining the on-disk format, using something like a debugger — like using mdb, maybe. I think it was using FUSE or something to actually let you browse...
C: As a file, right? Is that the one that Don Brady did?
A
I
don't
see
don
on
here.
As
far
as
I
know,
don
hasn't
done
any
more
work
on
it.
Does
anyone
know
of
other.
D
A
Projects
in
that
space
of,
like
zdb-like
things
for
examining
the
honest
format.
A
All
right:
well,
it's
definitely
an
interesting
idea
and
one
that
we're
thinking
about
a
little
bit
in
the
context
of
the
zfs
object,
storage,
stuff.
So
I'll,
maybe
come
back
to
that
when
I
give
a
little
update
on
that,
but
I
guess
maybe
I'll
give
you
a
preview.
So
the
you
know
there's
an
additional
on
disk
state,
or
maybe
it's
not
disk,
but
you
know
in
in
the
object
storage
state
that
is
kind
of
below
the
existing
zfs
on
this
data
structure.
A
So
we
need
to
come
up
with
some
way
to
examine
it,
either
using
zdb
or
using
some
other
facilities.
So
if
folks
have
ideas
about
like
how
you
would
like
to
see
that
or
do
that,
then
you
know
maybe
one
one
place
to
explore
that
would
be
with
this
object
store
metadata,
since
you
know,
there's
a
lot
less
of
it
than
zfs.
So
if
you
want
to
do
something
new,
you
could
start
mess.
You
know
experimenting
with
how
to
do
it
with
objects
or
metadata.
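For reference, the way people poke at the existing on-disk format from the command line today is zdb; a rough sketch of that kind of invocation, with placeholder pool, dataset, and device names:

```sh
# Dump the vdev labels from a backing device (device path is a placeholder).
zdb -l /dev/sda1

# Print the cached pool configuration ('tank' is a placeholder pool name).
zdb -C tank

# Walk a dataset's dnodes with increasing verbosity (-d repeated);
# 'tank/fs' and the object number are illustrative.
zdb -ddddd tank/fs 1
```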
A: Yeah, no, I don't recall who else worked on it with him, but if you're curious about it, that would probably be a good resource: go check who worked on it during that hackathon, and I think that would have been 2018... sounds right, sounds right.

A: Let's see... 2017 — it was 2017. ztour mixes zdb with FUSE; participants were all from Delphix: Don Brady, Pavel Zakharov, and Prashanth Sreenivasa. Yeah.
A
It's
been
a
while,
if
you,
if
folks,
are
interested
in
this,
I
will
try
to
see
if
don
will
come
to
the
meeting
next
month,
he's
going
to
be
working
on
zfs
more
as
part
of
the
object
storage
project
that
we're
doing
so.
I
think
he
and
a
few
other
folks
would
probably
find
it
useful
to
attend
this
meeting.
E: Is that your item? Yeah — so we've posted the first PR for this, and it works.

E: I'm gonna have to look at the test failures, because it was passing all the tests when I tested it. But basically, if you enter another user namespace — this changes the ZFS-on-Linux code so that, instead of... there's a check used in a bunch of places in ZFS for being in the global zone, and on Linux it was just defined to one, so no matter what, you were always in the global zone.
E
We
changed
that
now
to
actually
look
at
which
username
space
you're
in
and
if
it
isn't
the
root
one
to
to
return
that
you're,
not
in
the
global
zone,
and
so
with
that,
if
you
run,
you
know,
unshared
dash
capital.
U
and
you're.
Now
in
a
different
username
space,
when
you
run
zfs
list,
you
can't
see
any
data
sets,
but
you
can
delegate
data
sets
to
a
namespace.
E
So
if
you
set
the
zone
property
to
on
and
then
do,
zfs
user
ns
add
and
the
namespace
id
to
the
data
set,
then
when
you
run
zfs
list
inside
that
data
set
you'll
be
able
to
see
only
those
similar
to
how
zones
and
jails
work
on
solaris
and
bsd,
and
we
it
works
so
that
you
can
mount
the
data
sets
within
the
namespace.
E
If
you
have
a
mount
namespace
as
well,
so
that
it
can
work
with
something
like
lxc
containers
so
inside
the
container
you
can
as
the
root
that's
not
really
root
can
do
whatever
you
need
to
do.
The
data
sets
and
all
the
same
rules
applies
with
jails
and
zones.
You
know
you
can
change
most
of
the
properties,
but
you
can't
change
the
it's.
The
quota
and
the
limit
on
file
systems
and
snapshots,
and
a
couple
of
things
like
that.
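Roughly, the workflow being described looks like the following sketch; the property and subcommand spellings are taken from the description above and may differ in the final PR, and the dataset name and paths are placeholders:

```sh
# Host shell: mark the dataset as delegatable (property name as described
# above; 'tank/container1' is a placeholder dataset).
zfs set zone=on tank/container1

# Start an unprivileged "root" in new user and mount namespaces.
unshare -U -m --map-root-user /bin/bash

# Host shell again: delegate the dataset to that namespace.  The subcommand
# and argument order here are an assumption based on the PR as described;
# <pid> is the namespaced shell's process ID.
zfs userns add /proc/<pid>/ns/user tank/container1

# Namespaced shell: only the delegated dataset(s) are visible now.
zfs list
```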
A: Cool. Are you looking for code reviewers now?

E: Yes, it's ready for code reviews, but I'm also interested in people with similar use cases. I guess one question is: does zfs list suddenly not showing stuff when you're in a different user namespace break anything for anyone? I don't think so — you know, once you're in a different namespace, you kind of expect not to have access to the stuff from the original global namespace — but it will be a change of behavior, and if it's problematic we might have to protect it behind a module parameter or something.
A
I
mean
I've.
This
is
in
the
area
of
like
platform,
specific
kernel
code,
so
I've
assigned
tony
tony
win
to
be
the
maintainer
for
this.
So
you
can
work
with
him
if
you
need
help
finding
reviewers
or
getting
reviews.
A
A
Cool
questions
about
that.
E: Not currently — currently it is the default. But if it breaks things for people, that might... I don't think anybody assumes that they can access stuff via ZFS when they're... you know, once you're in a different user namespace, you're not root anymore, so unless you've done zfs allow or something, you wouldn't expect to be able to do much in the way of ZFS commands anyway. Yeah, so yeah, this is on by default.

E: But if that change in behavior causes problems for people, we might have to protect it. I don't expect so, because, you know, as soon as you enter that namespace you're not really root anymore, and so you don't expect ZFS to work anyway.
E: ...an AppArmor config file on Ubuntu, but outside of that, no. When you're inside, ZFS just works the same — it's just that you're limited to the datasets you're allowed to see, which are the ones that are delegated to you and have the zone property.

D: Okay, so that would require a change in the graph driver inside Podman, to actually mark the dataset that is inside the container, right?
E
If
you
want
the
users
inside
the
container
to
be
able
to
manage
it
like,
if
you
just
operate.
D
E
From
the
host
to
the
directory
that
it
sees
rooted
in
or
whatever,
then
that
would
still
work
it
just
you
know
it
depends
if
you
want
user
inside
the
container
to
know
that
they're
using
zfs
or
not
or
be
able
to
see
the
data
sets
or
not.
E
Like
I
know,
with
with
jail's
on
bsd
half
the
time,
I
just
create
the
data
set
on
the
host
and
mount
it
to
the
right
spot
relative
to
the
jail's
route
and
the
jail
doesn't
know
that
it's
a
separate
data
set
other
than
you
know.
The
file
system
id
is
separate
or
whatever,
but
it
doesn't
know
the
name
of
the
data
set
or
anything.
It
just
shows
up
as
restricted
or
something
like
that
in
du.
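As a concrete illustration of that host-managed pattern (the dataset name and mount path are hypothetical):

```sh
# Host-side only: create a dataset whose mountpoint lands inside the jail or
# container root, so the guest just sees a populated directory.
zfs create -o mountpoint=/jails/web/var/db tank/jails/web-db

# The guest never sees the dataset name; it only sees the mounted directory.
```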
C: Allan, you mentioned, like, the mount namespace, and having that capability of being able to mount the dataset. So do you have to be in the mount namespace and you have to use a user namespace — or, if you don't set up a user namespace, can you just do the mount namespace and allow a delegated dataset to be mounted?

E: Yeah, so we did the user namespace because that's really where you're deciding: is this the real root or a made-up root user, right? Like, are you real root, or are you some user who ran unshare or whatever and is running an unprivileged container that is just pretending to be root?
E: But in general, the only thing that was really required was setting an extra flag in the VFS bits to say this filesystem is aware of the namespace stuff, so that it handles the user ID mapping or whatever — so that, you know, when you're inside the namespace, you're actually running with... the virtual users inside that namespace have very high UIDs, but those are mapped to regular ones inside the namespace.

E: So ZFS has to know that, you know, UID 6001 actually shows up inside as, you know, user ID 1001 or whatever, so that when you ls the files you create inside, they have the names relative to the namespace. But, you know, on disk those are owned by the user ID that's the real one.
G: There's a bunch of places in the test suite where blocks of code are conditionally marked out, like is-global-zone, which originally was to run with zones. Have you been able to try to activate any of those code paths?

A: Yeah, I think what you're talking about, John, is, like, you could run the test suite from within a local zone and it should more or less work with these. But, like, some tests don't run.
A: Why don't I give an update on where we're at with the ZFS object storage stuff — but if other folks have other questions or topics, why don't we bring them up now, since I...
B: So with zrepl, we have repeated reports of zfs dry-run sends reporting a very large uint64 as the size estimate, and it seems like it's an over... like a negative — it's called underflow? Overflow? If we go negative, I think it's still technically called overflow, but yeah. There appear to be some bugs with that, and we currently have a user who is, like, setting the dataset aside so we can reproduce the case. We just need to hook him up with someone who can tell them what to do.
A
Yeah,
why
don't,
I
would
say,
file
issue
if
there's
one
already
and
then
ping
paul
dagnelly,
he
should
be
on
slack.
We
want
to.
The
first
step
is
to
figure
out
like
there's
so
many
different
code,
pads
like
modes
of
send
and
send
estimation.
A
So
if
you
can
include
like
the
exact
command,
that's
being
run,
then
that'll
help
like
you
know.
Presumably
it's
not
it's
not
what's
it
called
it's
not
redacted
send,
probably
but
like
full
versus
incremental,
and
all
that
and
like
from
bookmark
versus
from
snapshot.
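For reference, the dry-run estimate being discussed comes from zfs send with the -n flag, and the distinct modes mentioned look roughly like this (pool, snapshot, and bookmark names are placeholders):

```sh
# Full send, dry run with a size estimate.
zfs send -nv tank/fs@snap2

# Incremental from an earlier snapshot.
zfs send -nv -i tank/fs@snap1 tank/fs@snap2

# Incremental from a bookmark instead of a snapshot.
zfs send -nv -i tank/fs#mark1 tank/fs@snap2

# Machine-parsable size estimate, if that's what the tooling consumes.
zfs send -nP tank/fs@snap2
```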
A: Yeah, probably not. Okay, but at least that basic info will then tell us what additional metadata we would need to see to figure out why it's going wrong, and then probably we could get that from zdb, and then they could, you know, destroy their stuff.

A: All right then, I'll talk a little bit about the object store stuff, which I think I mentioned two months ago. We're now well on our way to an initial implementation.
A: ...and John Kennedy are working on the performance aspects — so, performance evaluation — which I both look forward to and dread seeing the results of, because we're still very early in the implementation, so there's lots of stuff that doesn't work as we know it needs to, to get good performance. Folks might not remember from two months ago, so let me give you a brief overview of what we're doing.
A: So we're using an object store as the backing storage, you know, probably primarily for cost reasons — although after this initial project we have some more awesome and crazy ideas that will take further advantage of the shared nature of the object store, which I'll talk about later.

A: So the key and tricky thing about our use case is that we want to be able to do this with databases, which are, you know, doing small random writes, with small record size and compression, and everything is small and tiny. So we can't get away with just saying, like, you're using it for backup and you're just going to set, you know, record size to at least one meg, and we'll have one object per ZFS block.
A
That
would
be
like
relatively
straightforward,
but
we
want
to
be
able
to
use
small
continue
using
small
record
size
for
the
databases
that
we're
using
that
we're
storing,
which
means
that
we
need.
We
need
to
combine
a
whole
bunch
of
blocks
like
a
whole
bunch
of
like
three
kilobyte
blocks
into
one.
You
know
one
megabyte
plus
size
object
for
the
object
store,
which
means
that
we
need
to.
We
need
to
worry
about
like
how
do
we
free
these?
A
So
you
know
we
got
to
keep
track
of
like
here's
all
the
blocks
that
need
to
be
freed,
but
haven't
been
freed
yet
and
then
kind
of
consolidate
that
so
that
we
can
like
back.
We
need
to
batch
process
that
so
that
for
every
object
that
we
have
to
read
and
then
rewrite
to
remove
the
free
blocks,
hopefully
we're
able
to
free
a
whole
bunch
of
blocks
within
the
object,
not
just
one
or
two.
A
For
a
lot
of
use
cases,
we
still
want
to
get
good
performance,
so
the
object
stores
you
know
generally
have
really
high
latency
for
get
input,
requests
like
dozens
of
milliseconds,
at
least
and
in
a
lot
of
cases,
if
the
data
so
for
writing.
We
can
kind
of
ignore
the
latencies,
because
we're
batching
everything
up
into
transaction
groups
right
and
we
just
have
to.
Then
we
just
have
to
worry
about
well
what
about
synchronous
operations?
A: ...caching blocks on local disks — but the L2ARC can't get very big in terms of the number of blocks that it can store, because it uses so much memory per block. L2ARC works pretty well if you have large record sizes, but again, having small record sizes causes lots of performance challenges for us. So in particular, you might want to be able to have a cache that's, like, dozens of terabytes at least, which means that even if we were to kind of super-turbocharge the way the L2ARC stores its metadata, you still wouldn't be able to keep that entirely in memory.

A: So we have to deal with an index of the cache that doesn't fit entirely in memory. That's kind of the big challenge of the caching subsystem. So the way that we're implementing all this stuff — the cache, and the talking-to-the-object-store part — is through an agent process in userland.
A
So
like
the
kernel
is
making
read
and
write
requests
that
are
calling
up
to
the
agent,
which
is
then
you
know,
checking
in
the
cache
going
out
to
the
object
store
over
the
network.
Doing
the
background,
processing
of
like
background
freeze,
managing
the
on
disk
index
of
the
cache
and
all
that
stuff
and
the
agent
is
an
agent,
is
all
written
in
rust.
A
So,
let's
see
what
else
do
you?
What
else
should
I
tell
you
guys
we're
we're
pretty
far
along
with
getting
things
to
kind
of
basic
functionality?
A
And
I
I
might
be
able
to
show
you
a
little
demo
if
if
people
are
interested,
but
why
don't
I
pause
and
oh
the
other
thing
that
we
did
is
we
opened
a
issue
that
has
an
outline
of
like
the
project,
the
kind
of
stuff
that
I
just
mentioned,
and
then
the
user
interface
for
how
you
would
create
an
object,
store
based
pool
it's
it's
a
little
different
because
you
need
to
be
able
to
specify
like.
A
A
Don't
think
that's
updated
yet,
but
we
should
ask.
A
Yeah
yeah,
so
we
have
some
enhancements
in
progress
that
will
let
it
like
figure
out
the
credentials
based
on
the
role
that's
assigned
to
the
instance
in
aws
that'll,
make
it
a
little
easier
christian.
I
see
you
have
your
hand
up
questions.
B
Yeah,
so
I
have
thought
about
cfs
and
object:
storage,
one
or
one
and
a
half
years
ago,
and
I
think
one
one
concern
was
cost
analysis
and
cost
prediction.
So
because
you
pay
for
the
individual
requests
as
well.
A: Yeah, I mean, so we're primarily thinking of it for our use case. By comparison, we're comparing it to using EBS — you know, the block storage in Amazon — and the per-byte cost is much lower; it's about a quarter or a fifth or something like that compared to the cheapest EBS storage. But yeah, there are per-operation costs for GET and PUT.

A: We looked at that kind of from a theoretical point of view. The good thing is that, as long as you're using it within AWS — like, you're using S3 from an EC2 instance — they don't charge you for the throughput.
A
They
just
charge.
You
like
a
per
request
fee,
so
those
are
usually
not
too
high
compared
to
everything
else
as
long
as
you're
as
long
as
like
you're
right
as
long
as
your
objects
are
big
right,
because
it's
like
well
yeah,
you
know
I'm
writing
at
you
know
full
network
bandwidth
of
25
gigabits
per
second,
but
it's
with
you
know
one
meg
or
eight
meg
objects.
A
So
the
actual
number
of
requests
per
second
is
like
pretty
manageable.
I
mean
it's
not
nothing.
You
know
it
adds
up
to
like
a
few
dollars
a
month
or
whatever,
but
you
know
you're
paying
25
per
terabyte
per
month
for
the
storage
and
you're
paying.
You
know
a
lot
more
than
that,
probably
for
the
instance
to
run
it
at
25
gigabits
per
second.
So
it's
not
too
bad.
We
did
right.
We
did.
A
I
don't
know
if
paul,
I
think,
paul's
not
on
here,
but
paul
is
working
on
a
facility
like
mmp,
the
multi-mount
or
multi-modifier
protection
for
for
the
object
store
and
that's
gonna,
be
like
mandatory.
Like
you
know,
right
now,
the
mmp
is
like
off
by
default.
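For reference, on block-based pools today MMP is the opt-in multihost pool property; a brief sketch, with 'tank' as a placeholder pool name:

```sh
# Multi-modifier protection is controlled by the 'multihost' pool property,
# which requires a system-unique hostid to be set on each importing host.
zpool set multihost=on tank
zpool get multihost tank
```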
A
You
have
to
like
jump
through
some
poops
to
use
it
or
whatever,
but
for
object
store
it's
just
like
so
dangerous
because,
like
everybody
has
access
to
it,
you
know
very
easily
that
it's
going
to
be
always
used,
and
we
did
have
some
interesting.
A
Designs
around
that,
because
you
know,
if
you're,
if
you're
heavily
using
the
pool,
then
hey
like
paying
a
few
bucks
a
month,
doesn't
really
matter
like
that,
doesn't
matter,
but
if
you're
not
using
the
pool,
then
having
mmp
be,
you
know
putting
down
its
heartbeat
even
just
like
once
a
second
it
does
add
up
to
you
know
like
non-zero
dollars
per
pool,
I
mean,
if
you
just
have
one
pool,
then
it's
no
big
deal
right.
It's
like
three
dollars
a
month
or
something
like
that,
but
all
right.
A
So
now
let
me
get
to
if
you
don't
mind
me,
rambling,
on
a
little
bit
the
the
end
goal,
for
this
is
not
just
you
have
one
pool,
that's
on
object
store.
A
The
end
goal
is
that
you
have
essentially
like
distributed
zfs,
where
you
have
like
this:
a
big
system
of
interconnected
pools
on
different
machines
and
they're,
all
consuming
storage
from
the
same
object
store
and
the
the
real
key
thing
that
we
want
to
be
able
to
do
is
to
have
like
a
snapshot.
A
So
you
might
in
this
new,
like
object,
store
based
world.
Creating
pools,
becomes
almost
as
cheap
and
easy
as
creating
file
systems
is
today,
and
you
can
move
the
pools
between
you
know
between
vms,
so
it
makes
the
overall
system
much
more
flexible
and
in
that
world
you
might
end
up
with
lots
of
storage
pools,
and
so
you
might
have
lots
of
storage
pools
which
are
idle
most
of
the
time
and
so
the
mm,
the
cost
of,
like
writing
out
once
a
second
an
object
for
each
pool.
A
It
might
actually
become.
You
know
noticeable
to
our
customers
wallets,
so
paul
has
done
some
work
to
design
that
so
that
the
basically,
instead
of
having
to
write
out
a
heartbeat
for
each
pool,
we
only
have
to
write
a
heartbeat
for
each
like
for
each
bucket
that
this
agent
is
connected
to.
A
So
the
agent
manages,
like
all
you
know,
all
of
the
pools
are
managed
by
one
agent
process
on
the
machine,
and
so
that
way
like
you,
basically
each
pool
will
have
an
object
associated
with
the
pool
that
says
you
know
I
am
owned
or
was
most
recently
owned
by
this
agent,
and
then
the
agent
will
have
like
a
object
in
the
object
store.
That
says,
like
I'm
still
alive
every
second,
so
that
way,
you
know
you
only
have
to
pay.
You
only
have
to
have
a
heartbeat.
A
Basically,
like
you
know,
one
heartbeat
per
vm,
that's
part
of
the
system
rather
than
one
per
storage
pool.
So
so
that
is
a
point
where
that
did
come
into
come
into
play.
C: Yeah, if I can add something else to Matt — yeah, the other thing, going back to cost: I mean, this is kind of why the ZettaCache is such a big component, being able to, like, cache the majority of the blocks that you might be using for that specific filesystem, so that we don't have to go back to the object store. You know, we get the big blocks going to S3, you're doing big operations there, yeah, and then the majority of your work is kind of happening locally through some EBS volumes. The other thing to note about the ZettaCache, and part of the design, is that unlike L2ARC, which is, you know, kind of dedicated to your pool, this is a global type of cache for all the pools being managed by that agent.
C: Yeah, exactly, yeah. So when we get into this multi-pool architecture, this ZettaCache is kind of like a global entity that these pools will be able to register with and say, "I want to cache some blocks," you know, and we're going to use, you know, whatever disks are, you know, kind of backing the ZettaCache.
A: For reads, you really don't want to be in the case where, like, I'm doing one read and that's causing me to get an object, and then I'm not getting anything else from the object — because then you have huge, you know, read-throughput inflation, because it's like, oh, I'm reading this 4K block and it's causing me to go read this one-megabyte object, and I threw away, you know, 99.9% of the data. You want to either be in the mode of, like, oh, we're scanning — like, we're just bulk-reading this whole thing, so yeah, I read this one block and it caused me to bring in the whole object, but now the next, you know, 300 reads are also going to be to that same object...
A
So
we
actually
used
it,
and
so
it's
fine
or
you
want
to
be
in
the
mode
where
I
may
be
doing
random
reads,
but
they're
all
cached
in
the
cache.
So
basically
you
want
to
have
like
either
either
you
have
like
ninety
percent
plus
cash
hit
rate.
In
this
you
know
l2
like
disk
based
cache
or
you're
like
doing
bulk
reads
or
you're.
A
Not
using
small
record
sizes,
you
know
like
there's,
certainly
lots
of
other
use
cases
for
this-
that
don't
have
all
these
hard
problems
and
you
know
like
you're,
storing
video
files,
they're
all
in
record
size,
16
meg
and
it's
just
one
record
per
object
and
then,
like
everything,
is
much
much
easier
than
the
problem
that
we
have
chosen
to
tackle.
E
Are
you
envisioning
to
use
this
with
the
the
special
vlog
or
something
for
all
the
indirect
blocks
so
that
you
don't
have
the
problem
of
having
to
read
a
one
megabyte
block
to
get
the
indirect
block
to
tell
you
which
one
megabyte
object?
You
need
to
go
fetch
next.
A
For
I
mean
so
for
at
least
for
our
use
cases,
we
would
always
have
some
sort
of
block
based
cache
so
and
you
know
making
sure
that
that's
at
least
one
percent
of
this
of
the
whole
pool
size
is
like
pretty
easy,
so
we
would
always
probably
always
have
all
the
metadata,
like
all
the
indirect
blocks
in
the
cache.
A: Yeah, I mean, within the cache there's definitely a lot of cool stuff that we can do to make it so that it can handle things in different block sizes — maybe I'll get into that in a couple minutes. But in terms of what's in the object store, we aren't doing any... we're kind of letting ZFS's allocation ordering handle a lot of that for us, because I think the way that it works — we can double-check this, though — is that it's basically writing blocks of one layer together. So, like, the ordering, I think, will happen to be, like, you know: first we try to write all the data blocks, then we try to write all the indirect blocks, then we try to write all the L2 indirect blocks — so they'll naturally kind of be clustered together in objects anyways.
A
And
the
big
thing
about
the
I
mean
the
special
devices
has
two
big
benefits.
One
is
like
you
have
a
faster
device
which,
like
you
know,
we
don't
have
faster
and
slower
like
object,
storage.
We
we
have
the
cache
to
solve
that
problem
and
then
the
other
benefit
of
the
of
the
special
devices
is
you
aren't
trying
to
combine
lots
of
big
blocks
and
lots
of
small
blocks
together
into
one
like
metaslab
allocator
scheme,
and
so
like
that's
kind
of
a
self-imposed
problem.
A
You
know,
rather
than
a
fundamental
problem
like
that.
That's
the
problem
of
how
zfs
chooses
to
do
block
allocation
and
we've
kind
of
been
making
some
dents
in
that
over
time,
like
with
the
zill
embedded
slog
metaslab,
basically
like
automatically
carving
out
one
meta
slab
to
use
for
log
allocations.
If
you
don't
have
a
dedicated
log
device,
you
can
imagine
doing
something
similar
to
that.
Like,
oh,
like
I
noticed
that
90
percent
of
the
blocks
here
are
small,
but
you
have
some
big
ones.
A
Let's
carve
out
one
meta
sub
for
blocks
that
are
like
larger
than
32k,
but
we
can
with
object,
storage
like
the
allocation
part
of
it
doesn't
matter
right,
like
you,
can
put
together
big
and
small
blocks
into
one
object
and
that
doesn't
hurt
the
allocator
at
all,
because
it's
just
like
sequentially
spotting
them
down
into
objects.
We
aren't,
we
don't
have
to
go
like
try
to
refill
existing
objects
within
the
cache.
A
Now,
that's
where
it
gets
really
interesting.
So,
within
the
cache
the
undisk
cache
the
so
the
l
torque
the
way
l2
arc
works
it
just
like
starts
the
beginning
of
the
disk,
writes
through
the
whole
disk
and
then
when
it
gets
to
the
end.
It
comes
back
to
the
beginning
and
and
overwrites
what
it
was
there
before.
A: So the allocation — you know, you can mix together large and small blocks and everything is fine — but the eviction policy is horrible, and that has an impact on cache hit rate. And, you know, maybe that's okay a lot of the time with L2ARC, especially since, like, the kind of philosophy of the L2ARC is, yeah, you know, it's like an extension of the ARC, and you're gonna get some hits and some misses, and it's better than nothing. Versus for our use case, it's like, well, in a lot of use cases you want to be getting, like, a 90-plus-percent hit rate in the cache, and if you go from 90 to 80, that's, like, doubling the number of times that we have to go to the object store — misses go from 10% to 20% — which could have a really material impact on your overall performance.
A
So
we
wanted
to
have
a
cache
where
we
control
the
eviction
policy
and
it's
reasonable.
So
we
chose
lru,
which
you
know
it's
not
like.
Amazing.
It's
not
skin
resistant.
The
way
the
adaptive
replacement
cache
is,
but
it's
at
least
reasonable
and
and
easy
to
reason
about,
and
so
that
means
that
we
need
to
be
able
to
choose
what
we
evict
not
just
like,
have
the
have
whatever.
A
Which
means
that
you
know
it
within
the
cache
we're
going
to
be
left
with.
You
know
like
as
you're
ingesting
stuff,
you're,
evicting
old
stuff
and
you're,
leaving
lots
of
little
holes
where
you're
evicting
stuff
and
then
trying
to
fill
those
little
holes
with
the
new
stuff,
which
is
you
know,
basically
the
same
as
the
problem
that
we
have
with
zfs
on
block
storage
and
the
meta
slab.
A: So with the new cache that we're designing, we're gonna be investigating and hopefully implementing some newer, better ways of doing allocation, which may be able to be transferred over to, you know, ZFS proper — block-based ZFS.

A: So what we want to do is divide this space up into slabs, and each slab would have allocations of all the same size.
A: So, like, let's say each, you know, each 16-meg slab has things that are all — you know, maybe they're all 1K, or all one and a half K, or all 3K, or all three and a half K. And so then, when we're doing allocations, you know, we would find, like — I know I'm trying to allocate three and a half kilobytes, so I'm going to go to one of the three-and-a-half-kilobyte slabs and just get one from there.

A: This should sound pretty familiar in terms of, you know, the memory slab allocator — that's exactly the way that works — and the metaslab allocator is kind of named after that, and there were some ideas initially about how we might do that, but those never really came to be. So we're kind of trying to get back to that place, at least for these small blocks.
A
I
think
it'll
be
very
helpful
in
terms
of
memory
usage
and
like
memory
usage
of
the
allocation
state
should
be
much
more
efficient
like
in
the
worst
case.
You
have
one
bit
per
block
to
keep
track
of
of
this
in
memory
I
mean
you
can
do
better
in
some
cases
as
well.
A
Yeah,
okay,
maybe
I
should
pause
here,
pause
my
monologue
here
and
see
if
folks
have
more
questions.
B
I
have
another
question,
so
I'm
not
totally
clear
how
much
like
layering,
like
cross
layer,
hints
you're
doing
there
like
how
many
hints
are
you
giving
from
zfs
into
the
the
user
space
demon
and
and
back?
A
B
I'm
I'm
basically
not
not
clear,
for
example,
how
much
of
the
the
metastab
allocator
still
is
in
place
versus
how
much
you
delegate
to
user
space
and
so
on.
None
okay,.
A: So, you know, you don't really need a metaslab allocator for the cache. You could imagine using the metaslab allocator to allocate the space in there, but we decided to instead kind of do a from-scratch implementation.

A: So the agent that's managing the cache — it's not using the zpool; it's all from scratch. It's just, like, all new code that's consuming the disk and, you know, doing writes to the disk from all-new data structures that we're coming up with.
A
So
from
the
kernel's
point
of
view,
it's
like
there's
some
new
v
dev
type,
that's
a
view
of
object
store
and
we
have
like
special
hooks
in
there
to
say.
Like
oh,
like
if
you're
going
to
object
store,
then
you
know
you
don't
need
to
do
any
allocation
or
whatever,
or
I
mean
the
only
allocation
that
you
do.
A
Is
you
allocate
the
block
id,
which
is
just
like
I
plus
plus
right,
the
blocker
ids
just
move
forward
and
whenever
we
use
them
and
so
that's
very
easy
and
then,
when
you
do
the
write
down
to
the
object
store
v
dev,
it's
just
like
I'm
writing
to
block
id.
You
know
the
next
highest
block
id.
That's
never
been
used
before,
and
then
it
just
sends
that
request
up
to
the
agent
and
then
the
agent
is
like
packing
together
all
these
sequential
block
ids
into
one
object.
A
Yeah,
so
the
agent's,
just
getting
like
this
sequence
of
bytes,
says
just
like
here's
the
size,
pl,
here's
the
block
id,
which
is
you
know
like
the
next
block
id.
Please
write
it
so
we
cert
like
if,
if,
if
we
want
the
agent
to
take
different
action
for
different
types
of
things,
like
metadata
versus
data,
you
know
like
indirect
blocks
versus
user.
A
Hinting
into
it,
but
we
haven't
got
that
sophisticated
yet
and
so
far
we
haven't
needed
to.
But
if
you
have
thoughts
on
like
what
we
should
be
doing,
that
would
definitely
be
I'd
love
to
hear
them.
B
I
think
there's
lots
of
academic
literature
on
like
predicting
predicting
data
block
placement
and
so
on,
and
if
we
already
have
the
information
about
the
object
type,
for
example
like
we,
we
don't
have
to
do
all
the
fancy
prediction
stuff.
We
already
know
it's
metadata
or
its
data
and
so
on
so
yeah.
I
think
we
can
maybe
could
profit
from
that.
But
then
again
it
depends
on
your
use
case,
mostly.
A
B
Yeah,
for
example,
like
you,
could
have
separate
objects
for
data
and
metadata,
because
the
metadata
is
supposedly
changing
much
more
frequently
and
then
maybe
you
have
another
look
at
okay.
So
is
this
an
allocation
for
an
object
set
that
is
like
one
megabyte
block
size
or
just
an
allocation
for
an
object
set?
That
is
eight
kilobyte
block
size,
for
example,
and
then
maybe
based
on
that
guess,
whether
it's
archival
storage
versus
hot
storage
and
yeah,
maybe
tier
that
into
different
objects,
a
little
so
that
you
have
better.
A: Yeah, so in terms of the object store, the biggest problem that we have is going back and freeing stuff, and what you really want is for your frees to be very clumpy — like, you want them to be clustered into objects. So, like, you know, we have a million frees, and, you know, you have a thousand objects, but those million frees are concentrated in just these hundred objects — so we only need to read those hundred objects and then rewrite them to process the frees, rather than having them be...
A
If
they're
evenly
distributed
over
all
of
our
objects,
then
it's
like
well,
I
have
to
read
and
then
rewrite
every
object.
You
know
the
entire.
You
know
petabyte
of
your
storage,
which
you
know
would
really
suck.
I
mean
you
just
can't
do
that
very
often.
So
you
know
you
end
up
with
a
lot
of
outstanding
freeze,
which
you
know.
The
good
thing
is
unlike
block
based
storage,
you
don't
have
a
finite
size,
so
it's
not
like.
Oh,
like
you're,
out
of
space.
Now
your
pool
is
now
like.
A
You
can't
do
anything
because
your
pool
is
full
and
we
got
to
go
to
process
the
background
freeze
instead,
it's
it's,
it's
just
cost
right.
It's
just
that.
You
know
you're
paying
more
for
you're
paying
to
store
these
blocks
that
are
no
longer
going
to
be
used,
and
so
you
know
to
be
nice
to
our
customers
wallets,
you
know,
and
eventually
our
wallets.
If
we,
you
know,
hopefully
end
up
running
this
as
a
service
we
need
to
be,
we
need
to
free
it
reasonably
soon.
A
So
if
we
could
predict
like
what
we
really
want
to
predict
is
when
will
this
block
be
freed
and
which
other
blocks
will
be
freed
around
the
same
time?
A
If
we
could
predict
that,
then
you
know
we
would
say
like
for,
for
a
given
txg,
we
can
kind
of
choose
how
we
map
its
blocks
into
objects.
So
if
we
were
like
for
this
for
this
to
extrude
like
these
are
the
blocks
that
are
going
to
be
long
lived,
and
these
are
the
blocks
that
are
going
to
be
short-lived,
we
could
pack
all
the
short-lived
blocks
together,
so
that
when
we
go
to
free
it,
we
only
have
to
hit
those
blocks.
A: For that — so that's a different problem, right. That's for allocating all the three-and-a-half-K things; here you're talking about the cache, right?

A: The problem of, like, the allocator having all the same problems as the metaslab allocator — yeah, so, I mean, we definitely thought about all these things. Like, maybe the cache should be, you know, allocating one-meg chunks from the disk and packing a bunch of blocks into there — then your writes will go really fast, but then, you know, when you're evicting, it's like, well...
A: Exactly. And it's like, well, look, we already have some... like, the object store thing basically solves the SMR problem. So if you want SMR, like, basically just use the object store stuff — use it as an object store, right? And if each object is one file on XFS and you put the ZFS object store on top of that, it would work great, because at the end of the day, you know, you're writing in these big chunks and then freeing the big chunk and then overwriting it, so the SMR level of remapping is not going to have to do a lot of work. But the advantage is, you get — like, you know, you can be doing random writes, you can be storing a database on it, you can do whatever you want, and then the ZFS object store layer takes care of making sure that everything is actually happening in big chunks to the drive.
A
Well
looks
like
we're
out
of
time.
Thank
you
all
for
coming
to
the
zif,
the
matt
aaron's
zfs
object,
storage,
monologue.
I
hope
you
enjoyed
my
comedy
set.
You
need
to
record
this
for
the
next
conference.
It
is
recorded
so
yeah.
Well,
the
recording
will
be
up
on
youtube
later
this
week
and
yeah
I
mean.
Hopefully,
you
guys
can
tell
like
I'm
pretty
excited
about
this.
It's
a
lot
of
fun
stuff,
we're
learning!
A
You
know
we're
learning
rust
rust
is
a
great
programming
language,
especially
if
you're
coming
from
c
and
yeah
and
and
we're
having
a
lot
of
fun
with
it.
I
recognize
that,
obviously
this
is
a
new
sort,
a
sort
of
new
use
case
for
zfs,
so
it
doesn't.
A
A
But
hopefully
this
is
going
to
you
know,
make
bring
bring,
make
zfs
have
some
like
relevant
capabilities
in
the
new
kind
of
the
way
that
storage
and
computing
is
moving,
namely
like
towards
cloud
and
towards
you
know
more
abstract
data,
storage
and
services
so
making.
A: ...some weird SMR drives on top of an object store. You want to be able to move stuff around from VM to VM — as opposed to, like, if you're the one implementing the stuff underneath that and you're consuming raw drives, then obviously, like, the existing ZFS capabilities are what you need.
A
Cool
thanks
a
lot
we'll
see
you
all
in
four
weeks
at
the
later
time,
so
that'll
be
july,
20th
and
I
I
may
be
on
vacation
events,
so
I'll
have
someone
else
run
the
meeting,
and
hopefully,
you've
got
your
dose
of
matt
aaron's
zfs
monologues
this
week
this
month
and
you'll
be
good
for
these
two
months.