From YouTube: April 2021 OpenZFS Leadership Meeting
Description
At this month's meeting we discussed: corrective zfs recv; blake3 checksum; LXD containers; ZFS on Object Storage
meeting notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A
Welcome everyone, it's one after the hour, so we'll get started. I see that Alan is wearing a great t-shirt and perhaps has rearranged... you rearranged your office? Yes, I turned my desk sideways in order to try to feel like I wasn't in the same place anymore. Cool, yeah, so you have a new office; that's a good excuse. Cool, we have a good variety of topics on the agenda today, so we'll start working through them. The first one that I had was about userland zfs.
C
Yeah, hey. So I took some time and worked on the corrective receive patch, and I have it in a spot where most of the tests are passing, except for this one FreeBSD test for redacted sends that causes FreeBSD, and only FreeBSD, to panic, and apparently not every single time. With my patch I've been able to reproduce that on my VM too, and without it it doesn't happen, but I don't think I'm doing anything in the path where the healing flag isn't set for the receive.
A
Yeah. John, who's on the call... I think John Kennedy might have written that test, and Paul Dagnelie wrote the redacted send and receive code; I don't see him on the call.
A
Yeah, that's great. This is the code that you posted quite a while ago, but it's great to see you making progress on it. Do you want to, because it's been a couple of years, maybe give just a quick overview of what the point of this is and how it works, so that people who might be interested can realize that they are interested and help you review it?
C
Sure, yeah. The talk from a couple of years ago from the ZFS Developer Summit is linked in that ticket, so somebody can go and listen to that presentation, but I can give a quick overview. This is a patch that enables you to take a send file and use it to heal corrupted data on the pool. Without it, if you have permanent corruption in your dataset, you can't do much about it other than destroy and re-receive it, if you have a backup, right. But this enables you to use a send file from, well, somewhere; maybe you have that dataset backed up somewhere, and you can use a send file from the backup system to heal the corrupted data. And some of the things this patch can do: it can re-encrypt data and re-compress it.
C
So
if
the
those
settings
don't
match
between
source
and
destination,
it
takes
care
of
those
things
of
those
details
and
there's
an
arbitrary
restriction
of
the
guide
of
the
snapshot
has
to
be
the
same
as
the
snapshot
you're
healing.
So
basically,
the
send
file
has
to
come
from
the
same
snapshot
we,
which
doesn't
it's.
C
I
say
it's
arbitrary
restrictions,
since
it
doesn't
quite
always
have
to
be
that
way
to
be
able
to
heal,
but
it
seems
to
make
sense
to
kind
of
enforce
that
and
make
sure
that
the
data
you're
you're
ingesting
is
can
be
used
for
healing
for
sure
yeah,
that's
kind
of
the
the
overview.
I
think
that
should
be
cool
starting
point.
Yeah.
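[Editor's note: a minimal sketch of what a healing receive like this looks like in practice, assuming the -c ("corrective") receive flag proposed in the pull request; the host, pool, and snapshot names here are made up.]

    # zpool status reports permanent errors in tank/data@monday.
    # Replay a send stream of that same snapshot (same GUID) from a
    # backup system to heal the damaged blocks in place:
    ssh backuphost zfs send backup/data@monday | zfs receive -c tank/data@monday

    # Then verify the repair:
    zpool scrub tank
    zpool status -v tank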
A
And I remember back in the day there was some discussion about tooling around creating a smaller send stream. Did that ever happen?

C
No, I didn't do that. In fact, I also took out the spill block healing, so this only does write records now, and I'm probably going to look at the spill blocks after.
A
Yeah, cool. There's also a very old and incomplete PR from a Delphix intern about being able to better identify which blocks were damaged, like telling you which snapshots a block appears in and so on. It seems like that would be really helpful for figuring out what send you need to do to do the healing properly, or even...

C
Yeah, or giving you all of the snapshots that reference a particular block.
A
Yeah, yeah. So if somebody is interested in this kind of stuff, we could definitely point you at it; I mean, we could give mentoring to somebody who wants to pick up that PR, since we probably aren't going to do it ourselves in the next six months or so.
A
Yeah, I'll take a look at that. Maybe I'll see if I can get Paul to take a look at the code review as well. I mean, for other folks who are interested in corruption and send and receive, I think this is probably a nice pull request to review.
D
Yeah, that's me; I'm from Germany, hello. Welcome! Yes, Blake3. I've been at it for about three or four weeks, I don't know. It was very fast to implement, but then I had a problem with the FreeBSD support, because there are undefined references in the OpenZFS FreeBSD module, and I don't know how to implement this properly in the BSD stuff. I'm a Linux-only user, so I ran into problems making it ready to submit, to really submit it, because, yeah, that is the problem now.
D
The testing on the FreeBSD machines shows it; that much is very clear.
A
I'm not sure I totally understood, but are you saying that, like, within the Linux kernel there's infrastructure for doing checksums? Yes?
A
Yep, so I put a link in the doc. Someone else on the FreeBSD side has been working on the same thing, where in the FreeBSD kernel our crypto framework has support for doing the SHA-2 offloading, for x86 mostly. The AMD parts have it, whereas the Intel ones don't, and for arm64... yeah.
A
I posted a patch, it's a bit of an old patch now, that looks at plugging that in on the FreeBSD side, and you can see we made an os-linux version of it that just returns ENOTSUP so far, but it could be plumbed into whatever function you would call in the Linux kernel to do the checksum.
A
Yeah, so unfortunately, using the kernel encryption routines on Linux is not really a great option, because, you know, the OpenZFS project is not using any GPL-only exports. Which is, I think, why the ICP stuff is there from illumos: that's all CDDL-licensed code that was copied from illumos, all the C code, and I think there's maybe some very light acceleration in there, but yeah.
A
I totally believe that there are more optimized routines that we could use, and if we can find ones that are appropriately licensed, we could bring them in. Like, if these FreeBSD ones are BSD-licensed, we could potentially copy those into the ZFS source base and replace the illumos ICP ones. The main bits of it are assembly files from Intel or something that are... yes.
A
And those are, I think, dual-licensed BSD and GPL. Yes, all right, yeah. I know the person doing the work on FreeBSD is mostly interested in the AMD offload of SHA-2, like Ryzen and Epyc, which I don't know if it's exactly the same.
A
So, just before you go to the next thing, for the Blake3: can you give a... I remember I researched this a while back, but can you give a summary of how Blake3 compares to the other checksumming algorithms? Like, which ones is it faster than, and which ones is it more secure-ish than? What's kind of...
D
It's... and Edon-R is also a lot faster, I checked, but not really that much faster. But the blocks must have some more bits, at about 4K and up; then Skein is also a lot faster. But that would also be a question: which block sizes should we benchmark? Currently this is very much fixed within the Fletcher benchmark, and it shouldn't be fixed, because you can set up different record sizes, and then you also get a different speed for each algorithm.

D
So I would make the hashing bench a file where...
A
I mean, I would see a lot of value in just seeing the results. Like, if you just made a table that said, hey, on a modern CPU that has all the extensions, here's how each algorithm performs for each block size. Being able to regenerate that on demand all the time... I don't think that's something that needs to get run more often than when they come out with new instructions or whatever, which is maybe only once every five years or something.
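[Editor's note: a minimal sketch of the two sides of this discussion: turning on the proposed checksum, and comparing hash throughput at different block sizes from userspace. The property value "blake3" and the userspace benchmarking approach are illustrative assumptions; the patch's in-kernel benchmark may differ.]

    # Once the feature is merged, selecting the checksum per dataset
    # (assumes the property value is spelled "blake3"):
    zfs set checksum=blake3 tank/data

    # Crude userspace comparison of hash speed across block sizes,
    # approximating the per-record-size benchmark discussed above:
    for bs in 4k 128k 1M; do
        echo "block size $bs:"
        time dd if=/dev/zero bs=$bs count=2000 2>/dev/null | b3sum      # BLAKE3
        time dd if=/dev/zero bs=$bs count=2000 2>/dev/null | sha256sum  # SHA-256
    done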
D
Yes. And Alan, you would help me with the FreeBSD stuff, and then we can make Blake3 happen.

A
Yeah; looking at it, it looks like it's just having trouble finding a .h file that comes from the illumos compat stuff. It might just be a matter of adding an extra path to the include list in the makefile or something.
A
Cool. So, were there other things that you wanted to talk about, like the SHA-2 arm64 stuff?

D
I could make some code and get it into OpenZFS; I have no problems with this. I would reuse other people's code, make it fit, and then it's in. But... where there's a need. I think for Linux there is a need, because we can't use the fast stuff from the kernel. So I think I would do it with extra pull requests.
A
Cool, thank you. And I think the next topic that we had was the LXD containers. Allan? Yeah, so my company, Klara, is doing a bunch of work right now to add support for LXD containers in the same vein as jails and zones.
A
So you can delegate permissions: when you basically jail a dataset to a user namespace for LXD, then root in that namespace can, you know, do zfs commands and create new datasets and so on, but they can only see the datasets you've delegated to their user namespace.
A
So basically it hooks up the enforcement to, you know, the global-zone-style primitives that are in ZFS. So when you call into /dev/zfs from inside the user namespace, you can only see the datasets that have been delegated to that namespace, plus the parents that you have to see to get back to the root or whatever, and you can create new datasets and so on, and they all happen inside the namespace.
A
The next thing we're working on now is hooking up the user ID and group ID mapping. Currently, without that, when you create a file as one of the kind of pseudo user IDs inside the namespace, the file ends up showing as owned by user ID 264,001 or whatever, which is the actual UID on the system, but in the namespace that's supposed to show up as, you know, a low-numbered normal user ID. So we're adding the mapping there and so on.
A
The main goal behind this for our customer is to be able to run Docker, using its ZFS driver, inside of an unprivileged LXD container, so they can run their CI stuff for multiple projects on the same machine, and inside the container Docker can do the ZFS creates it needs to run that way. But we're interested in what other use cases people might have for this, and what test cases would make sense for it.
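[Editor's note: for context, Docker selects its ZFS backend via the storage-driver setting; it then creates a dataset per image layer under whatever dataset backs /var/lib/docker, which is why the container's root needs delegated zfs create/destroy/mount permissions. A minimal sketch using the documented daemon flag:]

    # Inside the unprivileged LXD container, with a delegated dataset
    # mounted at /var/lib/docker, start the daemon with the zfs driver:
    dockerd --storage-driver=zfs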
A
We're looking at adding the right ZFS test suite additions for this, and the question we keep having is what makes sense here: what can we expect to exist on the CI machines and what won't? Our client's use case is mostly the Ubuntu LXD tooling and so on, but I imagine for the tests we'll mostly use the unshare command, or whatever they use to create a namespace in Linux, to keep them basic, so that they should work on all the supported kernels and so on. But any input people have about this would be helpful.
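[Editor's note: a minimal sketch of the kind of unshare-based test described above. The delegation subcommand shown ("zfs zone", taking a user-namespace path) is an assumption; the interface was still being designed at this point, and the pool and dataset names are made up.]

    # Create a dataset to hand to a container's user namespace:
    zfs create tank/lxd/ci

    # Start a long-lived user namespace, standing in for what LXD does:
    unshare --user --map-root-user --mount sleep 3600 &
    NSPID=$!

    # Delegate the dataset to that namespace (subcommand name assumed):
    zfs zone /proc/$NSPID/ns/user tank/lxd/ci

    # Root inside the namespace can now manage only the delegated tree:
    nsenter -t $NSPID -U -m --preserve-credentials zfs create tank/lxd/ci/build1
    nsenter -t $NSPID -U -m --preserve-credentials zfs list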
G
Yeah, so one from me: it would be a similar scenario, but using Podman on RHEL, so it would be a Podman container inside Podman containers.
G
Do you think that the work on the ZFS side will also require some changes in the ZFS driver inside Docker?

G
Okay, so if it won't affect the Docker driver, it should also not affect the Podman driver, so that would be awesome, and I would love to test that.
I
Kind of related to this, and I don't know if there would be interest, since I haven't touched this in forever, but there is a change, and I'd have to ask to see who originally did this at Joyent. For anyone that's used ZFS on newer versions of Solaris: for things like zones, which may be applicable then for jails or whatnot, there's basically, I guess you'd call it, an alias. So that when you're in your container, or whatever you want to call it, instead of seeing the whole path to whatever that delegated dataset is, since it can be long, it can basically alias it within that container to just the name of it. So then all your commands and everything use that. And at least in this implementation, I don't know how it was done on Solaris, but with this...
I
Basically, there is an alias property which, in this case... you know, in illumos there's also zone-specific data, which is kind of like thread-specific data but per zone, and so it kind of sets the alias there. So then, when you issue the ioctls, it'll unalias the dataset before it issues them. So there's code for that, which I haven't touched in forever, but I believe it pretty much worked, aside...
I
There were just some concerns that it was kind of abusing, or reusing, certain struct members that it probably shouldn't have. But I don't know if there'd be interest in that, like, yeah, in conjunction with this.
I
Yeah. The only, I guess, potential issue, which really is something that can be checked when you're creating the container, is: obviously, if you have two different datasets where the last portion of the name is the same, that could cause a conflict, although I think it's probably good enough just to say, you know, don't do that.
I
Well, I just mean, like, okay: you delegate two datasets to your container, and the last component of each of those names happens to be the same. Of course, when you alias it... but it's probably one of those things where you could say, you know what, don't do that; just fail.
I
You know, when you're setting up the container, it says, hey, that's ambiguous. I mean, it's probably something to note, but I don't know if it's something that would really be an issue in practice; just know that it is out there. That'd be the only little gotcha. And again, since I don't know how Solaris did it, I don't know if they have the same issue or if they did something else that avoids it, or whatnot.
I
But yeah, I can try, maybe sometime in the next couple of weeks, to update that, and maybe put it up for review then if there's interest. Although the one thing I guess I'll have to figure out for the OpenZFS piece is that, like I said, it does rely on what we call the zone-specific data to hold some of the...
A
Yeah, there's something similar in FreeBSD that we're using, and we had to come up with something like that for the Linux one as well, to keep track of which namespace it's related to and so on. So I think we have an analog for that, but I'm not sure.
I
So that may be something... maybe you can, you know... it would mean a little work, and it may need some testing. I do have an Ubuntu VM, so I can try to see if I can... sorry, yeah, and I'll see how I have to set things up. I actually haven't tried to set up a FreeBSD VM under bhyve on SmartOS, although it should work; I think people have done that. Just so I can test it there.
E
One quick thought... can you hear me? Okay, sorry, I'm on my mobile. One quick thought about the alias thing and the maximum dataset path name length: I assume that the unaliasing in the kernel happens pretty early in the whole code path.
E
So things might be a little surprising if you, say, add to dataset names or make longer dataset names inside the container, and you do not know how long the alias is that is hidden from you inside the container. And I know that at least in zrepl we have some basic checks on whether a dataset name will exceed the maximum dataset name length.
E
Oh, and one other concern, with regard to the Linux support for the jail property: how are you identifying the container inside the kernel?

E
Okay, and so the idea is to extend LXD to know about the user namespace subcommand and then do the delegation?
A
Maybe. I'm not really worried that much about the integration with the Linux tooling yet; it's just about making it actually work so far.

A
Cool. That was all the things that we had on the agenda. If there are other things that folks want to talk about, or questions, we can do that now. We could also talk about...
H
If possible, I had hoped Ryan Moeller would do this, since it was his project, but I'd like just to attract attention to a PR, which is 11919, which talks about a problem with extended attribute portability between FreeBSD and Linux. It originates in illumos; historically it happened that we have two completely incompatible implementations, and we at iXsystems particularly hit this compatibility issue, since we have two products, on FreeBSD and Linux, which are at this point incompatible.
H
All the details are described in the request, but in short: on Linux, I think, user-space attributes are stored prefixed with "user." and then the name, while on FreeBSD there is no such prefix, and the question is how to make them compatible. So if somebody has an interest in the area of extended attributes, or has ideas, I'd like to welcome you to that PR.
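[Editor's note: a quick illustration of the incompatibility being described, using the stock extended-attribute tools on each OS; the file and attribute names are made up.]

    # Linux: user xattrs carry an explicit "user." prefix in the name.
    setfattr -n user.comment -v hello file.txt
    getfattr -d file.txt                 # -> user.comment="hello"

    # FreeBSD: the user namespace is a separate argument, no prefix.
    setextattr user comment hello file.txt
    getextattr user comment file.txt     # -> hello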
A
Yeah, I haven't looked at that very deeply, but thank you for taking on that bit of incompatibility. That is one of those things that hopefully we can avoid in the future through the coordination of OpenZFS; obviously FreeBSD and Linux are using the same repo now, but for the other operating systems as well. Cool. Well, I can talk a little bit about a new project that we're working on at Delphix, which is ZFS on object storage.
A
So the idea is: today you create a storage pool on some disks, you know, block storage. We want to make it so that ZFS can run on top of an object store like Amazon S3, or on-prem ones like MinIO; NetApp has an object store product as well.
A
I think there have been some attempts at, or thoughts about, this before. The one kind of interesting thing that we're bringing to it is that our use case is databases, so we need this to work well and perform well even if you're using small block sizes. The performance of object stores generally is such that the latency is very large, and you also need to use large objects to get good throughput.
A
So a simple one-to-one mapping of ZFS blocks to objects could work well for archival use cases, where you can set the record size to 16 meg, but for our use case, where we have databases with a record size of 8K, that would not work at all for performance. So we're taking a bunch of ZFS blocks, combining them into one object, and then storing that in the object store.
A
The goal is that you should be able to get performance that's similar to a block-based storage pool, if you have a big enough block-based cache that you're getting more than 90% cache hits in that cache. So in principle this is kind of the same tech...
A
...in terms of having, you know, a caching layer that's much faster than the storage of the main pool. But the L2ARC as it exists today doesn't really cut it for our use case. One of the big problems is that the amount of memory required to manage the L2ARC is really big, because it's proportional to the number of blocks that are in the L2ARC; there's a record for each of those blocks in memory. And we want to have really big caches, like hundreds of terabytes, with small record sizes, so you have lots and lots and lots and lots of blocks. So we're looking at basically creating a replacement for the L2ARC that would store the index inside of the cache itself, rather than requiring it to be in memory, so you can have an unlimited-size cache managed with a fixed amount of memory.
A
So those are the two big components that we're working on implementing, and some of the feedback that we'd like from this group is on the administrative model, and also just whatever questions you have. We can probably work soon on publishing what we see as the interface, the user interface, for this. Right now there's a new vdev type for the object store, and there's a bunch of new properties that let you specify things like the endpoint, like the S3 endpoint URL. And yeah, we plan to support other protocols as well, not just the S3 protocol; the other main one is Azure.
A
Almost everybody uses the S3 protocol, even if they aren't Amazon, but Azure has their own protocol for object storage that's basically equivalent, at least in terms of the things that we would be using: just the basic get, put, and delete operations.
A
You know, 100 terabytes of write-back cache for ZFS on a block device... we're looking at just using the existing ZIL in terms of write caching, because the object store can actually get very good throughput as long as you have large enough objects and you can issue enough concurrent put requests; you can max out even more than a 25-gigabit-per-second network on AWS. So the write throughput should be very, very high. But in terms of that problem space, I know Nexenta did something there. They actually implemented a write-back cache; I think they shipped it in their product, and maybe it's still in their product. I think Alex Aizman gave a talk at the conference, like, many years ago about it, so you might be interested to check that out. Good.
J
That needs to be done by either modifying the ZIL to, like, you know, perform better, or some other component, like a write-back cache, that would kind of be a step in between you and the object store. Yeah, so you really get more like a layered file system at that point. Yeah; we think that for our use case the ZIL will work well, but you know, the proof is in the pudding.
G
Go ahead, Marcin. Can you hear me? Yeah. So, one question from me is about access to the pool: I assume it will not be clustered?
A
The project that we're working on currently does not include that work, but attend future conferences, and hopefully we will have more to say about that. Yes.
J
So, Marcin, one question for you: in the use case you're talking about, is it typically the case that your containers need access to the entire pool, or to a subset of the pool, or...?
G
For me, the good-enough situation is when I have the same pool mounted on multiple nodes, but a single node has access to a single dataset. So the dataset doesn't have to be clustered, but the pool should work on multiple nodes; that's good enough, and that would work perfectly for my case, yeah.
G
At the same time? To the pool, yes; to a single dataset, no. So one node writes to a specific dataset, but multiple nodes write to their own datasets.
A
So, when you mentioned using a larger block size for the object size, what kind of scale was that, and do you expect that it might be flexible?
A
Yeah, I mean, it could definitely be tuned. The sweet spot for performance is a few megabytes, like one to eight megabytes on S3, and yeah, that might change with different cloud providers or with on-prem solutions like MinIO. That's probably something that would need to be configured, or, you know, could be configured, in terms of the target object size there.
A
...a bunch of stuff, but, you know, for our use case we really care about the smaller record sizes, so we need to make that work really well, right.
J
Yeah, and the question I was going to ask, with regard to feedback: if folks currently are not necessarily using ZFS, but are using these object stores for different types of workloads, and you think that ZFS might be a fit there, it would be interesting for you to send us that feedback. We obviously have our own use case that we're very interested in, but it would be good to get a feel for what else is out there that folks are interested in, in terms of taking advantage of this technology.
A
Yeah, I mean, we obviously want to develop this for our company's use case, but we want to upstream it and make it part of OpenZFS as well, and so we want to make sure that we aren't missing anything about adjacent use cases that we can also make work well without a lot of extra effort.
E
Maybe not exactly an object storage use case, but I was thinking: if there is some infrastructure to group changes together into object-store-sized blocks, that seems like it could have some synergies with regard to SMR drives, host-managed SMR drives. (Absolutely.) I see some synergies there if the software architecture can be made such that the aggregation of changes into these larger-sized records can happen in a way where the next layer down can be either an SMR drive or object storage. I think there is some potential there that should be, like...
A
Yeah, I mean, I think that all of this object store stuff is, as I mentioned, intermediated by userland. ZFS is talking blocks up to the userland agent. So the ZFS kernel side is saying: write this 3K block, write this three-and-a-half-K block, write this 2K block, sending those commands up to the agent, which is just a process running on the same computer, and then the agent is coalescing those into objects and then sending them over the network to the object store.
A
Object numbers: same name, but, right, a completely different thing, yeah. Right, because, you know, when I first envisioned something like this, I thought there was value in trying to keep a mapping from the ZFS objects back to what's in the object store, and so on, but I think with small ones it doesn't make sense; in your case, for sure, yeah. And then, you know, ZFS has snapshots and copy-on-write and clones, which, you know, our product makes heavy use of, so that kind of brings up ideas about, like, the ZFS, you know... even...
A
Sorry, I just lost my train of thought. Oh, so, SMR drives, yeah. So you could imagine plugging in, like, teaching that userland agent to be able to write the objects to SMR drives instead. I imagine that, since a lot of the SMR drives are going towards what's called drive-managed, where they look like a regular drive and they put a bunch of smarts in there to do the mapping themselves, it only really works, it only really performs well,
A
if you're writing in big contiguous chunks, right. So I can imagine, like, maybe a good-enough solution being to have the agent just take each object and write it with, say, a target object size of 100 megs instead of four megs, and then just write those as files into a non-copy-on-write file system, right; like, use FAT32 on your SMR drive and then just splat down all these huge files that actually, you know, contain ZFS blocks, and then the...
A
One of the tricky things with this is: what if your workload contains frees? So, like, what if you're actually freeing blocks, freeing a significant amount of blocks over time, which ours does for sure? Then, in order to actually release space from the object store, you have to read and then rewrite the object to remove those blocks. You can't just free a little bit of the object, right; you've got to read the whole thing and then rewrite it out,
A
omitting the blocks that are no longer part of it. That kind of scheme would also work really well with SMR, where it's like: okay, great, we're just going to remove this 100-meg file and then, like, reallocate another one, and then, you know, later on we'll splat a different 100-meg file over where that one used to be. I could see that working really well; kind of, you know, not being 100% optimal, but good enough that you would get good performance even with, you know, a small record size and random-ish access.
J
Yeah. Really, the way we've kind of architected this, because we have this new vdev type, you could imagine the SMR being driven by, you know, a userland agent or some sub-kernel component that that vdev object now talks to. So you could implement it, you know: take the things that we're putting into our userland agent, bring them into the kernel, and then do it that way too. A lot of the same principles all apply here, like...
A
Yeah, although I think you would want to leverage the coalescing, right; the coalescing of blocks into a big object happens in the agent, and you'd want to leverage that coalescing, yeah. Which might be a little challenging, because we are writing the agent in Rust. So all...
E
Well, but actually, that's what I was thinking: taking a step back and looking at the overall system architecture and what components make sense to centralize in the ZFS module, because I think the aggregation stuff, and also the stuff related to freeing and re-packing changes into different objects...
A
Cool. Let me try to see if other team members are here. George, do you think it would make sense to have a thread on the mailing list, or maybe even, like, an issue or a PR, to discuss, to, like, show folks the user interface that we're proposing and get feedback on that?
J
Yeah, I think that would be great. I think having a way so that, you know, people can comment on kind of what we're proposing: you know, the keys, the new keywords that we're thinking about, and kind of their meaning, and the various parameters. I think it would be great to get feedback, and, I don't know, maybe if we can put it up as an issue, that would be great; I think that's a little bit easier to manage than, yeah, than the...
A
All right, we can do that, probably before the next meeting. Yeah, I'm very interested in what the user interface looks like, and in trying to, you know, make sure we make something that fits well and has the flexibility we want going forward, because, you know, those interfaces tend to be, once they're released, something we can't ever change. Yeah, I mean, we're kind of doing it all through just properties, so you can always... I would definitely...

A
...you know, have a different thing where it's like: oh yeah, you didn't specify the X property, but instead you specified the Y property, or the Y property has a form that I can recognize, and therefore I relax my requirement that you have to have the X property.
A
I mean, we've talked about that already, because the new properties are, like, the object store endpoint (the HTTP endpoint), the bucket... well, sorry, the region, and the key location, which is kind of similar to the encryption keylocation, where it can be, like, you know, file:// and a path, and then inside there it has credentials that are specific to whatever your object store protocol is. And then the name of the vdev is the bucket name.
A
So you do something like, you know: zpool create -o endpoint=http://blah-blah-blah -o region=us-west-2 -o object-store-key-location=file:///whatever, and then "s3" is the vdev type, and then the bucket name comes after that, just like you'd be specifying, you know, "mirror" and then device one, device two, or whatever. Yeah.
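[Editor's note: the command spelled out above, written as one invocation. This work was unreleased at the time, so the property names, transcribed from speech, are illustrative only; the endpoint, credentials path, and bucket name are made up.]

    # Hypothetical creation of an object-store-backed pool, per the
    # interface described above:
    zpool create \
        -o endpoint=https://s3.us-west-2.amazonaws.com \
        -o region=us-west-2 \
        -o object-store-key-location=file:///etc/zfs/s3-credentials \
        tank s3 my-bucket-name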
A
I think that's one thing we haven't looked at yet with the vdev property stuff: how to specify properties on vdevs as you're creating them. Yeah, I mean, it's interesting, because here, you know, logically these properties obviously apply to the vdev, but not to the pool as a whole. But we can kind of get away with pool properties, because there's really no need otherwise; like, we're planning to have it just be a requirement that if you're doing this object store thing, you have exactly one normal-class vdev. And right, you...
A
Something that I hadn't really thought about until you started talking about it just now is specifying vdev properties as part of the pool creation, and how to do that in a way that doesn't end up, you know, tacking a bunch of stuff onto the end of each vdev as you're specifying it, or each member of the vdev as you're specifying it, because that gets really ugly. Yeah, yeah; we thought about that, and we were kind of like, well, we can just get away with having these be pool properties, because we're only going to...
A
...have one object store vdev in the pool, and so we know which one they apply to. But it is a little, a little wacky. I mentioned I want to, I guess, put it on the agenda for next time: zpool status exposed as pool properties or something, so that you can get at it.
A
And some of it... the problem right now is that most of the way it's constructed is all done in the user-space bits, in libzfs and in the zpool command-line tool itself, and with properties we'd fill out the text on the kernel side instead: bytes scrubbed, number of bytes to scrub, or whatever, and then, you know, building something on top of that that shows a progress bar or whatever. But then also having one for, like, the action text, the thing that says what's wrong with the pool or whatever, and a couple of other ones like that. And some of those may apply to a vdev, not the whole pool, and now you can get really complicated.
A
But yes, more properties is more better. Which brought up the other thing we talked about at some point: do we need an extra property type, like "verbose" or something, that isn't displayed by default but is there if you ask for it? If we get to the point where we're having too many properties, do we need some, like, hidden...
A
...default properties, but not the hidden properties we have now. Okay, well, yeah, that's interesting. I think we're over time, so I want to respect folks' meeting time. We'll have the next meeting in four weeks; it'll also be at this time, one o'clock Pacific, on, looks like, May 25th. So thanks, everyone, and we'll see you in four weeks.