From YouTube: Second March 2020 OpenZFS Leadership Meeting
Description
At this month's meeting we discussed: Deprecate dedup send/receive; new ZFS admin API; O_DIRECT semantics.
Details and meeting notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A
Okay, cool, it looks like it's one after the hour. Welcome, everyone, to the working-from-home edition of the OpenZFS leadership meeting. I hope you're all staying safe and socially distanced, if not isolated. Let's get started. We have a bunch of interesting topics for today, and the first one on the agenda is my topic.
A
So the next release of 0.8 will have those messages in there, and I have a PR open to add a utility, zstream, and a subcommand of that, zstream redup, to take a deduplicated send stream and turn it into a normal one. So you'll still be able to receive any old deduped send streams that you have, or, you know, generate those deduped streams on systems with older software that still has it, and receive them on newer software.
A
You know, indefinitely. So that PR is open; if folks have feedback on the design or the user interface, now would be a good time to get that feedback in. I also incorporated the functionality of zstreamdump into the new command, so now it's zstream dump (two words), and zstreamdump (one word) is maintained as an alias that keeps doing what it was doing before.
A
So it'll live, you know, kind of next to zstreamdump and the zfs command and all the other stuff. Cool. I mean, it's pretty self-contained; it doesn't have a lot of dependencies. Most of the code is in the one place. It does link with libzfs, but it makes minimal use of it, so we shouldn't have to do a lot of work to maintain it. Similarly with zstreamdump: it just kind of sits there and works.
C
All right, so, first question, right: what exactly is this? We already have libzfs and libzfs_core. What's this additional library? What's the point? So essentially, it would be an alternate API for applications to integrate with ZFS. It would have a set of functions equivalent to what the CLI commands provide: zpool create, zpool destroy, list, and so on. All those different functionalities would have an equivalent, but the main deal is that it doesn't need to be exactly equivalent.
C
They essentially call out to the CLI utilities, zpool and zfs, and then parse that output back into some data structure that matches something they have. I'd like to see if we could have an API that they would use instead. There's some business logic in the CLI utilities, which I'll get into a bit later, such that if you were to try to mirror exactly what the CLI is doing, you'd have to add some additional fluff on top of what libzfs and libzfs_core provide; this layer could accommodate that.
C
Also, when you're calling a library rather than shelling out to zfs or zpool, you have a lot less overhead. And for me personally, I like to keep what I'm using native to the exact language I'm in and not shell out, if possible, so this would be really great. One thing that would be really cool is having consistent expectations between what happens when you run a CLI command and exactly what happens with the API. This is very akin to what AWS has done with their API.
A
Right, if I can interrupt with a question about motivation, just to clarify: it sounds like part of this is that people are familiar with how to administer ZFS using the CLI, and then they're like, now we need to write a program to do what I used to do with a shell script, or, you know, my product needs to integrate.
A
With the CLI, I want to have as easy and simple a translation as possible: at the CLI I type, you know, 'zfs whatever whatever', and I want my program to do that, but I don't want to have to deal with parsing a bunch of command output, or, you know, a fork/exec.
A
The tricky thing about implementing this is that, on the one hand, you're saying you can do just the same thing as the CLI, in which case, great: the CLI is already using libzfs, so we just move a bunch of stuff from zfs_main.c into the library and the job is done. But then you have all the same behavior as libzfs. So, like...
A
If you think there's missing functionality in libzfs, then I think this kind of makes sense. But if you're saying libzfs does the wrong things, then you can't just take the CLI and use that code, because that code is using libzfs, right? So, you know, if libzfs is printing to standard error, and you're saying my new utility, my new library, is not, for instance, dumping stuff to standard error, then you've got to go find all the, you know...
A
Your new library is going to be using libzfs, presumably, since that's what the CLI is doing. So now you have to go find everything in libzfs that does stuff you don't want, by outright exiting or printing or whatever. I don't think that is as ubiquitous as it might be made out to be, but there's definitely work...
C
...to do there. And totally agreed; I was intending that to be part of this, to essentially modify those places so there's always a consistent return of an error code plus description, rather than anything being outputted. I don't think that's a problem super far across the board, but there are places where it happens. So that was the kind of thing where, okay, I would change libzfs, but at the same time I don't know who consumes it. So there are those questions; I don't want to break anybody who's expecting that behavior, yes.
E
I was going to say, one of the things we're always told is that libzfs_core is API stable and libzfs is not, right? So this new thing could be stable. As I said in the chat, I would be totally in favor of that, because we wrote, and I presume many other people have written, Python interfaces for ZFS, and it would be nice to have that simplified.
D
libzfs_core actually already exposes a lot of API surface, and the Python bindings, for example, already show how much work it is and how much duplicate code there is: for example, for figuring out what exactly a syscall error means in the context of a particular syscall. That code is already duplicated in the pyzfs bindings, if you can even call them bindings, because there's actually a lot of logic in there. So I think...
D
One key aspect that should be figured out before we write yet another library that exposes the same functionality in a different way is how we transport errors from kernel space to user space. I think we already talked about this a few months ago, but there are a lot of points in user space where the caller of the syscall needs to figure out what the syscall return value means based on the context in which it was issued; for example, EEXIST or EINVAL is heavily overloaded.
D
Also, yet another thing: I talked to a few people at the last two developer summits about this. I have an experimental patch that is halfway done, but it has been sitting around for such a long time that I don't think it would apply now. But I think it would be pretty awesome to have a stable kernel interface that is based on just nvlists. So we put an nvlist in, we get an nvlist out, and that's all there is to it, then.
A
But that's a very different goal than what's being discussed here, because those ioctls are way lower level than what the CLI does. Imagine you're doing, like, 'zfs create -o blah blah blah' on a file system: there's a lot of parsing of that stuff before you can get down to the nvlist. Or how about zfs destroy: 'zfs destroy -R something%something', with a percent sign in it.
A
So you're doing a range of snapshots, plus each thing that's below this point and all their clones and whatnot. And maybe you just want to get the list of what it would do. There's a lot of stuff there; it's a lot of different ioctls to put that all together. I mean, that's why the lib...
C
All right, I'm going to move forward unless someone's got something else. Okay. So what exactly does this look like? I just did some preliminary looking through what's available in zpool_main.c and zfs_main.c. There are a lot of forward declarations, zfs_do_* and zpool_do_*, that make up the different CLI commands. I kind of just tried to map those one-to-one as much as possible. Matt had brought up some use cases where maybe it doesn't make sense to be one-to-one.
C
So if it doesn't make sense, we shouldn't do it that way, but by and large this is the idea: any business logic that's present in zpool or zfs, this being anything that's not really a CLI-related concern, kind of gets shoved down to the libzfs API layer, and zpool and zfs would attempt to use the libzfs API where possible.
C
Not only would this demonstrate usage of the library and how to use it, it also gives you test coverage over any changes we've made, because there's a lot of backing tests that validate zpool and zfs. Our libzfs API wouldn't output to standard error or standard out; it would just return error codes and descriptions, leaving the client in charge of how to output that. If it was a REST server, the REST server would return it in a JSON error format instead, right?
C
C
If you look at zpool destroy (oh sorry, let's pop back): if you look at zpool destroy, it's fairly straightforward. It's kind of a simple, contrived example, but I think it will prove to be useful for this. The main piece here really is the disable-datasets call. It's kind of validating that things are good beforehand, making sure all of the datasets are unmounted, and there's additional logic there before we attempt the destroy. If you were to duplicate this, just looking again at libzfs,
C
you'd have to make this call yourself beforehand anyway. This is just a simple piece of business logic that I think could be pushed down. Other pieces, the open and close, I don't think really need to live at this layer. There's nothing specific about why this needs to be here, at least within what I've looked at so far. There might be other cases where this proves me wrong, but for this case I think it can be pushed down.
C
So essentially, this is what it ends up looking like if we were to rewrite it with the changes. This ellipsis here is basically the CLI parsing (I'm clicking too much here); this is the CLI parsing. We get the pool name, and we call our libzfs API pool destroy, which just takes the name of the pool, whether we're forcing it or not, and the history string. I don't remember what the history string is for, but it's part of the call down. It returns an error code, and we decide what to do from there.
C
So if we didn't succeed in destroying the pool, we follow some of the same logic as beforehand. If there was a slash in the name, we're assuming, hey, you really meant to destroy a volume; you should probably call zfs. If we failed otherwise, we're going to try to relay that back up. This is not equivalent output to what's there today; it's a little more complicated to hit equivalent output, which would be another hard part of this, but essentially we'll try to use that error code for the output.
B
You probably want two sorts of things. I think you want that string: the string that the CLI would emit today should come out as a property in an error object. But it would also be good to be able to support other properties. You know, if the destruction fails because it can't unmount a particular file system, or a particular set of file systems, you could have a list of those file systems, or the datasets, or whatever.
B
They're all sort of limited in that, once an interface is there, it's difficult to evolve it. Whereas here you could have, say, a pool-destroy call that takes a handle, the pool, the history string, and a force flag, and then, if it doesn't go well, you could have calls like get-error-nvlist, or get-error-list-of-devices, or something. We could add additional symbols that refer back to the handle and look at the full state of the last operation.
C
Essentially, we've pulled down the zpool open and zpool close; the disable-datasets call is in here now. The libzfs API maintains its own handle to libzfs, or libzfs_core where applicable, and essentially would return error codes. But, as we've discussed, the error stuff is going to evolve and be different. This is essentially the idea, and then you would have a total understanding of what's possible to return and what those things mean. If we're going to use the error code... I don't even know that this perfectly reflects what it is.
A
I think, you know, my kind of takeaway from this is that this solves the real problem, and I think it is doable. I think the bulk of the work here is in the error handling; that's probably where libzfs is weakest. You know, moving the code around from zfs_main.c into a library is not that hard. Mm-hmm.
B
I'd like to see this, and I think careful up-front API design will be important too, in order to make it from the first version to a stable library. There are a number of subtle C and ELF symbol-versioning design issues with, say, library ABIs rather than APIs; I think we'll just want to make sure we don't accidentally tickle those. Thank you. Yeah.
A
In reality, if I had it to do over again and had the time... or, if we didn't have the time to make a real API, we could have just put all of that stuff, not even making libzfs, into one giant binary that you invoke by calling it as zfs or zpool or whatever. You know, because that's kind of more honest about how that code is really structured.
D
I have a remark on the software engineering angle on that, in particular about the error handling case. There are definitely tools out there, including zrepl, but there are also a lot of other scripts that rely on screen-scraping the error output for proper error handling. A few classics are 'dataset does not exist' and all those kinds of errors; it's very important that these remain stable.
D
So, as an idea for making sure that we don't have any regressions in that regard, I would propose that we start recording the standard out and standard error output of the entire ZFS test suite, or at least of the subset of the ZFS test suite where that makes sense, and put that somewhere (I don't know whether in the repository itself or somewhere else), and have that be the baseline for any regressions with regard to error reporting. And I think whoever introduces this feature should do that.
A
I think that figuring out what aspects are stable, like figuring out what we're changing about the output in a systematic way, is a great idea. As to whether we would...
A
That's a great example. So I think I would definitely recommend implementing something like what Christian said: just record the output of all the commands when you run the test suite, then do another run and be able to diff those, so that we can know what has changed, and then apply that kind of case-by-case judgment call on whether those changes are okay. Or, if it's easy to fix things up so that the output looks the same, then, you know, we might as well do that. Okay.
D
Also, another point I wanted to make from experience developing zrepl: I found it very useful to have internal abstractions that are idempotent, and I think if we just move these commands into a library that works like the ZFS commands, we don't gain anything on that front. Now, there's the question whether... if I'm not mistaken, it's not only that I find it worth having idempotent APIs; in general I find it very useful, from a programmer's perspective, to not have to worry about...
B
That helps sometimes, but what about the vast number of things that could have happened in between? So, say you create the dataset, and that's fine, but then you go and set a bunch of properties, make some changes to it, turn compression on, do some writes, whatever, and then you come back and do that original call that you want to be idempotent again.
B
What if, when you created the dataset, you had included setting compression to off or on or whatever, but in the meantime that property has been changed? Does that call also change the property? And would you want that? Can you imagine, under some conditions at least, a strict mode where you would want to know that the thing you'd asked for hadn't happened, or had happened, or something? Basically, this...
D
...is kind of a transactional problem, I think. Basically, I would not only have the idempotent APIs; this was kind of an opener for a bigger thing. What I do internally in zrepl is to address almost everything with GUIDs, so in particular snapshot GUIDs.
D
These enable, for example, safe renames. ZFS rename cannot be implemented safely by name alone: you can never know whether, between the decision that you want to make the rename and you actually issuing the ZFS call to make the rename, somebody swapped out the dataset underneath. Being able to express that operation with GUIDs instead of the dataset path, I would find very helpful in the library. And the point I'm trying to make is:
D
I don't think it's a good idea to have a kind of lateral movement from CLI to library. Instead, there should actually be thought going into what is useful for programmability, because, sure, in the 99% case the path is fine, but if you have concurrency requirements, then I think it would be very useful to restructure a lot of the APIs, or at least keep in mind that at some point we want GUIDs and not dataset paths.
B
Well, I guess the source could be a name or a GUID, and the destination would have to be a name. And then, if you later wanted to add things like an idempotent create-if-not-exists, or those sorts of things, you could add those as extra properties that you don't necessarily have to support, and all we'd have to do in the function would be to reject properties that are not understood as being invalid.
A
Why don't we leave it there? We don't have a lot of time left and we have a few more interesting topics, so why don't we move on. I guess, as far as next steps for this one: Myka, if I recall correctly, you have some cycles to work on this. So why don't I ask:
A
Could you put together, just kind of refine down further, your proposal in terms of what the APIs will look like, and maybe come up with a couple more examples? Then maybe send out an email to the mailing list, and we can get some more discussion on the more concrete proposal for the error handling and all that. Sounds good. Cool, thanks a lot. So I'm gonna... why don't I skip the... well, I'll mention the O_DIRECT stuff, or I'll tee it up if Brian, Brian, or...
A
I'll keep talking until somebody unmutes himself. So, I was working with Brian, Brian, and Mark, maybe, on the O_DIRECT support, which we talked about at the last meeting, and folks raised some very good concerns there. So we took that away and had a bunch of discussions around the exact semantics that we wanted. I sent out an email with a summary, and there's also a design document that kind of weighs the pros and cons of a bunch of different decisions.
A
A process has multiple threads, and one of the other threads could do a store to the buffer that you're using for the direct write. We want to make sure that the checksum... you know, in a simple implementation, that might change the data in between when we compute the checksum and when we write it out to disk. So we decided that if you have checksums enabled, or are using a bunch of other features, we're going to have to make a copy of that buffer in memory.
A
Obviously, there are other ways that we could do it. The key thing is that the checksum has to be of the data that was actually written out to disk, and making the copy is how we would implement that. Twiddling with the VM subsystem is another way we could do it, but right now we think that would be slower than making the copy; that could change in the future.
A
Because it's in the kernel, right: because Lustre is in the kernel already, yeah, we can just say, well, if you use this particular kernel interface, then you're guaranteeing that you won't change the buffer. And that interface is not what we use for the write system call, because we don't know that a user thread isn't changing it. So that was the main kind of change.
A
I think that's definitely an area where, if there's a good reason, we could definitely change that. That's not a hard requirement on our side; we have a lot of wiggle room, because all the documentation says, if it doesn't meet some alignment, then we won't do it and we'll give you an error. A bunch of other file systems chose 4K, or the page size.
B
So the other thing that was a little confusing, particularly in the preamble talking about the motivations: it mentions a lot about the throughput and latency of write requests and stuff. It doesn't really feel like that's borne out; it feels like the classical O_DIRECT justification is really just about managing your own buffers. So I don't know that it's right to set the tone necessarily on whether huge write latency is acceptable or not. It may be, but I don't...
B
It really feels like a side effect of the justification. It really feels like, oh, O_DIRECT is for Oracle and MySQL and stuff: I'm going to manage my own giant buffer cache and I don't want the double caching. That feels like where it really came from, rather than being about increasing throughput, necessarily.
A
Yeah, I mean, I think there are two questions. One is kind of where it came from, and I think that's definitely true. And I think the other thing that's key about where it came from is that it also disables the POSIX semantics of overlapping writes, or overlapping operations, because POSIX says that if you're doing two operations, like a write and any other operation, and they overlap in the file range, then they happen in a defined order.
A
Most other file systems implemented that by saying, great, each file has a reader/writer lock, and so while you're doing a write to the file you cannot do anything else to the file, and obviously that's horrible. So O_DIRECT throws that out, just as, like: well, then we won't grab the lock, and with overlapping reads and writes, who knows what'll happen. It's your problem; your application...
A
...doesn't do that, so who cares? And, you know, we are not throwing the baby out with the bathwater with our O_DIRECT, because we've had range-based locking from day one. So, yeah, overlapping things are ordered, but you can do two writes to two different offsets concurrently. So I wasn't super concerned with where O_DIRECT came from originally, but I hear your point.
A
My thought was: if it's really about buffer management, if it's really about 'I just don't want this to be double cached; I'm telling you I'd prefer that you not cache this, because you're just going to be wasting memory and I want to use that memory for other stuff,' then a much simpler implementation is possible, which is kind of what I did originally when I was prototyping this. So, for example, for writes, you don't need to bypass any of the code.
A
You can just say: do what we already always did, but when the operation is done, throw it out of the cache. So there's no reason to do the write right away; we just do buffered-style writes, always, and then at the end of spa_sync we notice, oh, this is only in the cache because it was a direct I/O, so we just evict it from the ARC and the DMU and whatnot. And, I mean, I...
A
I think the primary consumers, like the folks who are actually doing the work to implement this, Brian and Mark, care about this for Lustre and other HPC-type workloads. I mean, yeah, the buffer management is nice, because they don't need to cache it, but I think an equally large, if not larger, consideration is that they want to maximize throughput and minimize the per-byte overheads of making copies, making multiple passes over the data, and all that stuff. So I think they're concerned with that.
A
On the other hand, I, and probably other database-type users, are more concerned with the per-block overheads of, you know, instantiating a bunch of dbufs and a bunch of other stuff. As you might have seen, my work with send and receive recently is kind of all about minimizing those kinds of overheads. I think that if you take the principle of 'we really care about maximizing throughput,' then those implementation details flow more naturally from that than if we're just...
B
And so, from that, I do think we should consider having the current behavior today still be the default behavior, and that you should have to turn this new behavior on. Which, you know, the people doing Lustre and stuff can easily opt into, because they're presumably already opting into a bunch of other stuff as well. But then, out of the box, merely by rebooting onto this update, your application that relies on the current semantics is not going to break, basically. Yeah.
B
The other platforms, like FreeBSD: you know, I think we must allow quite a lot more unaligned writing behavior, I think by default, for instance. I believe that's true, but we can confirm that; I think that's true. Yeah.
A
A way to address that could be to make it so that there's no alignment requirement for O_DIRECT, and we will just, you know, do it. If you do a one-byte unaligned read, then, all right, sure, we'll just read the whole thing for you; there you go, we'll fix it in post-production.
B
We could do that, and... then, yeah. Oh, the other thing, too: the other place where it does feel like a breaking change is pools with no redundancy and no encryption and no compression. I guess you would have had to turn the checksums off, right, to get to where the buffers can be twiddled during the write?
B
What I remember from some of the encryption work that went on for a really long time is that some of the last things we chased down were, like, the encryption and decryption and stuff not necessarily always happening at all the places it needed to happen in the pipeline, because the abstractions didn't guarantee it, and there wasn't necessarily one clearinghouse for where those specific decisions needed to be made. Around some of the new flags, is there, in the implementation of this, one place where we're making the decision whether the unlocked buffer read can happen or not?
A
I would think so, because I think the issues that you're talking about are with, like, when you're dealing with an encrypted dataset: sometimes you don't need to decrypt it and sometimes you do, and that's where the trickiness came in. Versus in this case: if encryption is enabled on the dataset, then you do not get this super-duper fast-path behavior, and the encryption, you know, is set at creation time and does not change. So I think that's fine, but...
A
The thing with direct operations that have poor alignment is that applications might be doing direct operations and then checking to see whether they worked or not, and, if they didn't work, then changing their I/O sizes or otherwise modifying their behavior to be more optimal. So, telling them: hey, you tried to do this, and I can't really do it fast the way you're asking me to, so try again, try something else. I'm...
A
Yeah, I apologize, folks, we're out of time. There were a few things that we wanted updates on, and also calls for code reviews; I'm just going to say what those were, and then we'll discuss them next time. And please add... I'll add this on the Slack channel as well. So, the status updates: Josh Paetzel already posted a status update; they haven't made much progress, but they still want to. Allan...
A
...has the status update on the DDT log and the DDT limit, and I saw that he just opened, perhaps during this meeting, a pull request on the dedup limit, or the dedup ceiling. There are also pull requests out that need reviews: persistent L2ARC, Zstandard compression, and the dedup prefetch, which is, like, a 'please go read in the dedup table from disk synchronously.' So, people who are in a position to review that code...