From YouTube: OpenZFS Developer Summit Part 3
Description: http://www.beginningwithi.com/2013/11/18/openzfs-developer-summit/
Vendor Panel (all represented companies sharing their work); Karyn Ritter, Matt Ahrens on the OpenZFS community.
A: So, I'm Adam Leventhal; I'm the CTO here at Delphix. Before that I was at Sun, working with Matt, and then worked on Fishworks, the ZFS Storage Appliance as it's known now.
A: So what does Delphix do? It's easiest to talk about the problem that Delphix solves and how we wound our way down into ZFS. Delphix solves a problem that lots of big companies have: they have monstrous databases, and for every big production database they have many, many copies in non-production use cases. Developers and testers and QA folks and reporting and analytics and backup and production support: all of these people need copies of the production database.
A: What we do at Delphix, we say, is virtualize at the database tier: we make it very cheap and very easy and very efficient to create copies of databases. Is everyone clear enough? Okay, cool. What that does is let people run their projects faster, make their developers happier, and trim time off their dev cycles, stuff like that.
A: So that gets down to explaining it from the ZFS perspective. I joined Delphix about three years ago, in September 2010; Matt and George joined me about two months later. But the choice to use ZFS had actually been made long before any of us showed up at Delphix.
A: Delphix chose ZFS as a platform in 2008 because they wanted a file system that was foremost reliable, had writable snapshots and clones, and was freely available; and I think open source was kind of at the back of the mind as well. So that's why ZFS was chosen. As for what we do with ZFS specifically:
A: First, we've made a lot of extensions on top of it, and I think Matt will want to talk about some of those, but we make heavy, heavy use of snapshots and clones, much more so than we had at Sun or Fishworks or Oracle or any of these other places. Each virtual database starts from a snapshot of what we call a source; we create a clone, a writable snapshot, and then we're able to provision that as a point-in-time database. That's good! All right!
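For reference, the provisioning step Adam describes maps onto two commands; a minimal sketch, with dataset names that are illustrative rather than Delphix's actual layout:

    # Point-in-time copy of the source, then a writable clone of it:
    zfs snapshot tank/source-db@t1
    zfs clone tank/source-db@t1 tank/virtual-db-1

The clone shares all unmodified blocks with the snapshot, which is what makes each virtual database nearly free to create.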
A: I think... and then, do you want to talk about some stuff we've layered on? Let's see, with those, I...
B: I was gonna... I'll also talk about maybe a few things that are forthcoming. We've done a lot of enhancements in ZFS, like the write throttle that I think a lot of you have heard about, but I wanted to mention a few things that haven't been pushed into illumos yet but are on their way. One is ZFS bookmarks, not to be confused with the internal data structure called the something-or-other bookmark in ZFS. This is basically a mechanism
B
That's
specifically
for
doing
a
zfs
send
from
a
place
where
a
snapshot
used
to
be
but
no
longer
exists.
So
the
idea
is
that
if
you're
doing
remote,
replication
you're
continually
like
create
a
snapshot
and
then
send
from
the
previous
version
from
the
previous
snapshot
to
the
new
snapshot
over
to
the
remote
system.
So
you
always
have
to
have
that
previous
snapshot
on
the
sending
system.
Zvs
bookmarks
allows
you
to
create
a
bookmark
at
the
point
in
time
where
that
snapshot
represents
and
then
delete
the
snapshot
and
then
send
from
the
bookmark
instead.
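A minimal sketch of that replication loop, assuming a dataset tank/data whose snapshot @snap1 has already been received on the remote side (all names illustrative):

    zfs bookmark tank/data@snap1 tank/data#snap1    # remember where @snap1 was
    zfs destroy tank/data@snap1                     # reclaim the snapshot's space
    zfs snapshot tank/data@snap2
    zfs send -i tank/data#snap1 tank/data@snap2 | ssh remote zfs receive backup/data

The bookmark costs almost nothing to keep, because it records only the point in time, not the data the snapshot was holding on disk.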
B: Other stuff we've done: you know, Chris Siden and Basil Crow worked on the feature flags, which I know a lot of folks are using, because it's important for us to enable the whole community to build out different parts of ZFS without becoming mutually incompatible as that work proceeds.
B: Another thing that I've worked on recently is the zero-block compression, or embedded data, feature, which is going to illumos soon, once we get through our backlog. The idea is basically that some data blocks compress really, really well; they can actually fit down into about 100 bytes, and then we can put that data into the block pointer itself, so the block pointer doesn't actually point to the data block.
B: It actually contains the data, and this helps a lot for small bits of metadata. We think this will probably improve things like zfs list, because all those properties are stored in a different object with a different block; instead it'll just be one less random I/O, because you'll already have the data. It's also very advantageous for our particular use case, where initialized but otherwise unused Oracle blocks are very common.
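A sketch of how this looks from the admin side once the feature ships; the pool and dataset names here are assumptions:

    zpool set feature@embedded_data=enabled tank   # opt in via a feature flag
    zfs set compression=lz4 tank/data              # tiny compressed blocks can then
                                                   # live inside the block pointer
    zdb -bb tank                                   # zdb's block statistics can be
                                                   # used to inspect the result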
B: One thing that DBAs like to do to benchmark their storage subsystems is create an empty table, which tells Oracle to go write, to initialize, every block of that file. That footer can be compressed with LZ4 to fit into 100 bytes, so it super-accelerates that benchmark special, but it also makes it a lot faster to read that data back, because we already have the data in the block pointer rather than going out to fetch it.
D: What do you mean when you say you keep the data in the pointer?
B: So the block pointer is always covered by a checksum in the block that contains it. For example, you have an indirect block; that indirect block is verified to be correct because the block pointer that points to it has its checksum in it, right? So if we're putting pointers to data in there, those pointers are verified to be correct; if we're putting compressed data in there, that data is verified the same way.
B: Yeah, so there's still a birth time, and the props word; they're still used more or less the way that they are. We should give ourselves a hug and let someone else go. Okay, you're in the front row, do you want to go next? Sure, I like this.
G: Hi, I'm Barbara Mastaki; I'm at Joyent. We're basically a cloud provider, so we use ZFS in a few different ways. Primarily, first off, just for shipping around images and dealing with the fact that customers have VMs: every VM gets its own ZFS dataset, whether it's OS-based virtualization or hardware virtualization. If it's hardware virtualization you get zvols, and those are what turn into your disks. Those images need to be moved around, so that's all through snapshots and clones, very similar to what Delphix is doing.
G: In addition to that, we've done a bunch of enhancements in ZFS around multi-tenancy: basically enhancing the I/O throttle to be more container-aware, so you can define shares on it and have a better sense of fair I/O sharing. Stuff that we have in the works, which is kind of interesting, is fs limits.
G: Basically, you can put limits on the number of snapshots and datasets, because, if you're not aware, you put 10,000 snapshots on your box and things get very sad right away for that machine, and the last thing you want is one bad actor screwing over everyone else.
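A sketch of the limits being described, using the filesystem_limit and snapshot_limit properties; the dataset names and figures are illustrative:

    zfs set filesystem_limit=100 tank/customers/acme   # cap on child datasets
    zfs set snapshot_limit=1000  tank/customers/acme   # cap on snapshots below here

Once a cap is reached, further creations under that subtree fail, so one tenant can't degrade the whole box.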
G: The other big thing we do with ZFS is basically the core of our new object storage and compute service. What happens is, basically, it's an S3-like data store, but you can run compute directly on it: basically spin up zones and go
G: do whatever compute you want; you have all the packages you could ever want. Then, once that's done, you roll back the zone and move on to the next customer. So that's already a pretty big use case, and we're using rollback heavily; you'd have maybe hundreds of rollbacks in a row per second, sometimes.
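A minimal sketch of that per-job reset pattern; the dataset name and the job command are placeholders:

    zfs snapshot zones/compute-1@pristine      # capture the clean zone once
    run_customer_job                           # hypothetical compute step
    zfs rollback -r zones/compute-1@pristine   # discard everything the job wrote

Because rollback just rewinds to the snapshot, the reset cost is independent of how much the job scribbled on disk.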
B: Cool. Do you want to designate someone to go next? Nexenta... that's you, I guess.
H: Yes, very quickly: as you already know from Boris's talk, Nexenta basically has a storage appliance based on ZFS, so we use basically every single feature of ZFS, right from snapshots. We use ZFS as a backend for NAS and SAN protocols: NFS, CIFS, iSCSI, Fibre Channel, that kind of thing. At Nexenta we try to work on, you know, failover, performance...
H: Let me see what else. A lot of the work that we do inside of the company is actually on, you know, driver work, ZFS development kinds of things, networking fixes and such. We inherit a lot of stuff from the ZFS community, and we try to push back as much as we can.
B: You guys are mainly selling software and partnering with people that bundle it with hardware, right?
I: Yes, that's true, right. Yes, it's software storage, meaning that we're running on... obviously there is a, you know...

F: Yes. Honestly, I don't know; I joined very recently.
J: ...pick up the switchboard, and that's hooked up with the x86 port, so you essentially have an OS running which can program the network switch. And in addition to that, since you're doing the SDN provisioning, most of it is essentially zones and virtual machines, like what Robert mentioned.
K: ...Connecticut. I'm the gitmaster, the ZFS understander, development operations engineer. We're a Linux company, generally Ubuntu. Our main product is unfortunately written in PHP, but we are working in a lot of different languages there. We are using the ZFS on Linux kernel module provided by LLNL right now. So that brings me to help item number one: we want to talk about libzfs, how we can integrate with that, how we can stop using the command line.
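For reference, a minimal C sketch of what not shelling out could look like against libzfs as shipped with ZFS on Linux; note the library is not a committed stable API, and the dataset name is illustrative:

    #include <stdio.h>
    #include <libzfs.h>

    int main(void) {
        libzfs_handle_t *lh = libzfs_init();          /* library handle */
        if (lh == NULL)
            return 1;
        zfs_handle_t *ds = zfs_open(lh, "tank/data", ZFS_TYPE_FILESYSTEM);
        if (ds != NULL) {
            printf("opened %s\n", zfs_get_name(ds));  /* no CLI, no grep */
            zfs_close(ds);
        }
        libzfs_fini(lh);
        return 0;
    }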
A: But you...

K: Getting in there and helping the developers, who don't understand ZFS intimately, safely get at the things that they want to get at is one of the things that I've been working on recently, because we had a whole bunch of zfs output piped to grep. That was great.
K: So, we operate our own cloud and we also operate lots of machines out in the field, and we run OpenZFS on all of that. Our cloud is about a thousand servers with over 50 petabytes of data...
K: more than a thousand; so if anybody from Datto watches this, I don't know what the number is, it's more than a thousand. And these are very hot storage devices, they're not cold at all, so we are virtualizing all the time. One of the things that we actually do is just like the Delphix pre-provisioning:
K: we start up these virtual machines in the cloud to let the customer know that whenever they want to come get their data, it's good. So we're constantly starting these things up, and they need about as much RAM as they would have needed to start up anyway. So they have a whole bunch of RAM going to ZFS, and then these guys are constantly taking...
K: Then the next thing... you're missing out on some of the pictures; I should have just done it, I'm sorry. Is it going to be...?
K: I mean, some really killer stuff here, people. I can go into, while it warms up, one of the cool applications that I've done that was not exactly for work: we had a Linux kernel thing that was not working in 3.5 and was working in Linux 3.6.0, and so we asked the question: how can we get to that? A Linux kernel reverse bisect.
K: A thousand servers, 50 petabytes, and it's hot. So, just like the Delphix-style pre-provisioning, we are starting those VMs all the time, but...
K: when a customer comes for their multiple boxes, they want to start up all those VMs and actually run them, not just a pre-provision "does this thing work". So again, we've got lots of RAM going to ZFS, and then, when something comes in for your six devices there, you've got to start them all up. So the talk today about VM integration with memory management is definitely interesting to us: integrating with the Linux page cache, dropping it more aggressively.
K: I know ZFS on Linux got a lot better on that in 0.6.2, and you can only go up from here, right?
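One knob relevant to the RAM balancing being described: ZFS on Linux lets you cap the ARC with a module parameter (the 8 GiB value here is just an example):

    # persistent, applied at module load:
    echo 'options zfs zfs_arc_max=8589934592' >> /etc/modprobe.d/zfs.conf
    # or on a running system:
    echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max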
K: So that means we've got these pools getting up to 85 percent full, and those are actual customer chains; we can't break those out and just put them other places. So we do have to actually make the transfer, and that's really expensive on our end, on that guy over there.
K: So that multi-tiered storage sounds interesting to us, maybe getting SSDs on these guys so that they can ingest faster; anything we can do like that. I think I have a help
K: point three: improved send/receive speed. We're probably doing something wrong; there are probably some configurations that we can do better. There are also things where maybe we can help you guys give back to the community a little bit and pick some of that stuff up. I know there's a buffered send/receive talk on the chalkboard for today, so that'll be great. So here's what a basic product does out in the field, because that was our cloud. So these...
K: when the computers want to back up, we expose a block device over the network for them to write to; they write what's changed since last time; we close it up, take a ZFS snapshot, and then send that one to the cloud. But that first snapshot that you actually do, as a block write from your devices, is way more, gigabytes and up to a terabyte of stuff, than you want going over your network. So we actually round-trip: we take hard drives, put them together,
K: mail them to you, make a zpool on them, zfs send onto it, mail it back to the data center, and upload it directly to the cloud. During that time you're building more snapshots, and you send them all at the end. Works pretty well.
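A sketch of that round trip in commands; pool, dataset, and snapshot names are all illustrative:

    # on the shipped drives, at the customer site:
    zpool create seedpool /dev/sdX /dev/sdY
    zfs send backup/agent@first | zfs receive seedpool/agent
    zpool export seedpool
    # back at the data center:
    zpool import seedpool
    zfs send seedpool/agent@first | zfs receive cloud/agent
    # then catch up with everything taken while the drives were in the mail:
    zfs send -I backup/agent@first backup/agent@latest | zfs receive cloud/agent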
K: Another cool use: I guess this is one of the things I did off of our devices. I back up our GitHub instance; I am the proprietor of our GitHub Enterprise within the office. It's a real game changer.
K: I think you guys have had experience with this if you're using ZFS: you're used to going into a directory and just having the files there, and then having the snapshots available to you through zfs list and that kind of stuff, as opposed to cluttering it up with files all over the place. It definitely changes the way you think about how backups can work and how you can store data, and really cleans that thing up.
K: And then the other cool thing I did was that Linux kernel reverse bisect; I didn't get into it too much. We set up a machine, basically, with a clone of the Linux kernel tree and a script that tested whether or not it worked. So then you compile the new Linux kernel, run the test, get zero or one. Then you make a new virtual machine exactly off of that model, zfs clone again from the first one, and give it the new snapshot.
K: It builds that, tests it, and goes back. Without ZFS, that whole thing would have been a lot more expensive to clone out again: not only the running Linux operating system, but the clone of the entire Linux tree, because you never know where in there you're going to end up. So in 13 steps, and really quick, we had ourselves the answer for which hash fixed that change, and we were able to look at it and see why the Linux kernel was hanging.
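One step of that workflow might look like the following sketch; next_candidate and run_repro_test stand in for the bisection driver and the test script:

    zfs snapshot tank/kbuild@baseline            # kernel tree in a known state
    zfs clone tank/kbuild@baseline tank/kbuild-step1
    cd /tank/kbuild-step1/linux
    git checkout "$(next_candidate)" && make -j8 && run_repro_test
    echo $?                                      # 0 or 1, feed back to the driver
    cd / && zfs destroy tank/kbuild-step1        # throw the whole tree away cheaply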
K: That's about it. I think the other cool one...

B: Next slide. You guys are both from Delphix... you're from...?
N: ...which just went to beta a couple of weeks ago. Cold storage is just another term for archive; the idea being, we're targeting large businesses with lots of data, with petabytes to store, and rather than just storing that data by sticking it to the wall, where you don't know if it's going to be good when you need it in five years, we're targeting ZFS file systems: they can write the data to it, the data can be scrubbed periodically, and if there's a problem with it, the operators can go in and switch disks out.
N: Some of the challenges we're running into: being an archive, we're pretty much going to write once and hopefully never read, unless disaster strikes, but we'd like to power down some of the disks, because there's no point having thousands of disks spinning. The problem we're running into is that powering down disks out from under ZFS isn't so wise; there are all sorts of issues there. So we're trying to find a better solution for powering down disks, whether that be part of ZFS or maybe some external utility.
N: I'm not sure yet. And another thing we'd like to see is better API support for some of these steps, instead of having to go through the command-line tools.
O: Thanks. And actually, I was going to say, the stuff that Boris was talking about with the storage tiering could help your power-on/power-off problem, because if you had metadata on powered-on storage, we could power off data that is sitting on disks that potentially wouldn't need to be read. Yeah, that might be another use case. Oh, that's so interesting, yeah.
B: Right, next...
P: So the main crux of those changes is to do asynchronous resolution of copy-on-write faults. You come in and you do an overwrite of a block of data; we basically allow you to make that update in memory and return to the writer, and let them write additional data, while in the background pulling in the read in order to fill in the rest of the bits of that block.
P: Today, you touch a block and you immediately resolve that block; you pull it from media because you know it's an overwrite. But if there are writers... Yeah, well, it's a partial overwrite, yeah; but I mean, even if you start at the beginning of the block, it's gonna... Oh yeah.
P: Say your record size is 128K and you're writing 8K at a time. ZFS will just go and fetch that whole thing, even though you might be writing a whole stream of sequential 8K writes; like, for instance, maybe your VM is NTFS with a nice 8K native record size, but you really want to use 128K records on your backing store, just to make the metadata more efficient and things like that.
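Concretely, the mismatch looks like this (dataset name illustrative):

    zfs create -o recordsize=128k tank/vmstore   # metadata-friendly record size
    # every 8K guest write now dirties part of a 128K record, and ZFS must
    # read the other 120K back from media before the write can complete;
    # that is the synchronous read these changes push into the background

Dropping recordsize to 8K avoids the read but multiplies the metadata, which is the trade-off being described.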
P: But there's always an asynchronous recovery process required. Like, let's say that it's a Facebook, or maybe it's a yearbook company or something like that; you might...
P: As a researcher, let's say at a laboratory, I may know that this particular data set I'm not going to touch for three years. So the actual consumer of the data can make the decision to drag that thing to deep storage, and also can go look at our system and decide to drag it back. But by making it NAS, any client, any operating-system client that can talk to NAS, can make that data movement without having to have any middleware or anything.
P: ...in order to make that data movement possible. And then we make sure that the data is stable: we create a version of the data and then we push it out to tape. So our main need in the near term is the ability to create that stable version of individual files, and so that's reflinks or something like that.
P: The main reason why snapshots don't quite work for us in that environment is that we want to be able to allow the end user to say, "yeah, that data set, I no longer need it anymore." I would like to be able to individually pick different objects in the data store and blow them away. But with something like reflinks, where it's an immutable hard link, we could provide S3-style versions of objects in a very seamless way, just using the copy-on-write semantics of ZFS.
K: I remember a talk a while ago about performance improvements you'd done, and comments in the code; I remember that seemed pretty exciting. You guys were saying you're gonna get it back in? Yeah, yeah, we're working on it. Okay, cool. Was that what you were talking about earlier in the talk? Yeah, yeah, that looks really... yeah. The comments are always great, yeah. And those would go to FreeBSD first, and then it's...
P: We have thought about them for, like, half an hour, you know; but I did actually attend a seminar about shingled media, so I understand a little bit about it. So the challenge is fragmentation; it's interesting, but maybe with some changes in the metaslab code you might be able to make it work.
F: Hi, hello. Alfred Perlstein; I'm with iXsystems.
B: It's been pretty, pretty awesome for us. With our ten thousand downloads, we're actually able to collect metrics on the FreeBSD kernel and debug problems pre-FreeBSD-major-release, sort of by

R: seeding the OS out to all...
F: ...downloads every day. So we're looking towards more contributions towards ZFS right now, but mostly it's been performance on FreeBSD.

B: And getting the stability. So, I also have a couple of other USB keys in my bag if anyone wants one.
Q: So my name is Alex, and I'm actually a hardware guy. What we've done is a hardware accelerator for SHA-256 that is very easily pluggable into ZFS. We already have a patch for ZFS on Linux, and it works great. It looks like a lot of people are using SHA-256 checksumming for data integrity, and what I heard is that it is often a bottleneck for performance, especially if you're trying to do many disks, or flash storage, things like that. So basically we have this very smooth, low-cost hardware
Q: accelerator that works with ZFS on Linux. We are working on an illumos patch; again, it should be fairly straightforward. It's designed specifically to be high-throughput, low-latency, with very low overhead on the software side. If there are questions or any interest... I actually have something; it's a PCI card. Yeah, yeah, can you show it to us?
Q: ...how to create the device's software/hardware interface, and also a very low-latency, high-throughput implementation. With this card you can get two gigs, two gigabytes; again, this is more a limitation of PCI on this card. Fundamentally, we can do whatever you want.
Q: And on the software side, to give you perspective: we have a two-socket Sandy Bridge server, and with the two SSDs we get about 800 megabytes of throughput. If we run SHA-256 in software, we utilize around...
E: Hi, hi. My name is Chris George, with DDRdrive; I'm the founder and CTO. I'm, I guess, also a hardware guy, but stretched enough to write the Solaris driver. Our product is called the DDRdrive X1.
E: Basically, we marry the speed, performance, and reliability of DRAM with the non-volatility of SLC flash. We're very close to formally introducing... so we've had that product for a couple of years; our driver came out in January of 2010, and back then it required an external AC/DC adapter for the power backup. We're kind of excited to have a supercapacitor solution, which enables us to remove the external power, which, for a lot of customers, obviously was a real pain. So we're kind of excited about the supercap.
B: ...then we need storage for them. So, since we're a scientific computing center, we care a lot about the fact that the data is right. Previously we invested in a hardware solution for this that did checksumming on read and stuff like that, but upon looking around, that was not the way to go. So we made an investment in porting ZFS over to Linux, so we could use it there and get the data integrity protections. All the other good...
B: ...things that came along with ZFS, that was just gravy from our point of view; but the real reason was data integrity, and scalability, I suppose. So I guess I don't have anything else. Biggest supercomputer in the world; thank you, ZFS. Yes, thanks for letting me gloss over the...
B: We don't take distributed snapshots today. We could, but that would require us to coordinate across the whole distributed file system, right, which we'd like to do; distributed snapshotting would be a nice feature for Lustre. So that's certainly a cool thing we'd like to do. I think the more useful use case for us is something like upgrades, where we did take the whole file system down and do a major software update to it, particularly on the Lustre side.
B: A snapshot, but done manually, basically, as an administrative procedure. It would be nice to be able to coordinate that through Lustre, but short of that, it's a nice administrative tool for us. The bigger win for us is actually compression; that worked out great for us. We didn't know our data sets were very compressible, so we got a nice performance boost out of that. So that's cool.
I: Cool. Do you do anything for HA, for the... I believe they're called storage servers?

B: We do. So Lustre supports failover, basically:
B: you then basically start Lustre there and it will pick up the services of a failed partner. So there is some concern there about importing the pool in the failure scenario and accidentally trashing it; that's kind of a worst case for us. So we do have higher-level scripts that coordinate the failover process in an automated way, to make sure we don't accidentally import twice.
B: I think there's room for improvement in ZFS there, without coordinating that externally: maybe allowing it to detect the fact that it's been imported on both nodes concurrently, quickly, and prevent it. There are some papers out there on ways to do that, all right, and some write-ups, but they haven't been implemented yet. That would be a nice feature.
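A minimal sketch of the cautious half of such a failover today; the pool name is illustrative, and fence_partner stands in for whatever fencing mechanism is in place:

    # a plain import refuses a pool that looks active on another host;
    # escalate to -f only once the partner node is known to be down
    zpool import lustre-ost0 2>/dev/null || {
        fence_partner && zpool import -f lustre-ost0
    }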
B: OSD in the Lustre sense, yeah, OSD. So Lustre ties directly into the DMU of ZFS. I...

B: Some things went back into illumos; like, transaction commit callbacks were added for Lustre, I believe.
T: We have lots of customer data, so the customer might not benefit from ZFS directly, but we do, as administrators, in terms of backup and just being able to provision things. We do simple hosting, from just email or web all the way up to VMs, if you want to run your own OS or whatever kind you want, basically; and that's all on top of ZFS. We try to do everything in real time.
T: That was, yeah, quite a nice win; but of course now we have to decide where we're going to go, because we can't stay with Oracle. After that, the new process just wasn't cost-effective anymore.
T: One thing we want to change, and it might seem a bit mean: when you have quotas and you have compression, at the moment the compression benefits the customer, and we'd like to change it so it benefits us instead.
G: Do you think you'd... yeah, we could certainly do it the hard way. The problem is that you don't get the updated data. Yeah, the accounting is done so the size isn't updated in situ, yeah, yeah. We'd love it, yeah.
B: Anyone else? Otherwise we'll move on to the next sessions. All right, so I think... Karyn, are you there? Should we do the community organization? We'll do community organization stuff, then channel program stuff, and then we'll see; maybe it's time for a break, and then maybe Adam will tell us about performance.
U: So I thought it'd just be good to have a discussion about what we need out of the community going forward. Are you interested...?
U: So, you know, we started off just kind of with this idea of having this community. We got a bunch of people together, and they very generously gave their time to try to think of what we should do around the launch, kind of what content we should have, the kinds of things that we should be working on; and that's kind of how we've been managing it. It's a little bit ad hoc; people are just updating the website, which is all great, and I think we definitely want to continue with that.
U: You know, we've got the mailing list; we've just got basically that one for now, although I think the Linux mailing list will someday maybe move to OpenZFS as well. And then we've got the dev summit that we did, but that's pretty much all we've done as a community in OpenZFS. I know all of you guys have been working on stuff for a long time.
U: And so we have this Mailman set up, which the HybridCluster folks have very generously hosted; they actually did all the website design, which has been great, and they said that they'll move the mailing list over, yeah.
U: One of the things we're really wondering about is whether or not we should create an OpenZFS foundation. Obviously Justin and others have had experience in doing this. I don't know how much work it is, but it seems like having the ability to transfer ownership of the logo, as well as the domain names, and just kind of being a resource for people to show their support through donations, and to do these kinds of things separate from any company, would be a good idea.
B: They did get the ZFS trademark; I don't think that would necessarily preclude us from getting "OpenZFS". I think that's something that we would need to basically try: you know, apply for it and hope that it works out. And if...
B: there are instances of people getting trademarks for "open whatever", right? It may be that we can't get a trademark on the word "OpenZFS", but I'm sure we can get a trademark on the logo, like the logo plus the word, yeah.
F: It is always... it's just awesome, right? So we can fund those people, and it seems like a good idea. Maybe someone just wants to...
P: To give you a little background, I guess, on the FreeBSD Foundation: it started in 2000, and it took a really long time to get to the point where it is right now. So just have a lot of patience, mostly because the people behind foundations like this are very passionate people that want to have it happen, but they...
P: ...as well. And at least to start, at least the way the FreeBSD Foundation was for the first six years or so, we had nobody that was paid, right, until we had enough of a financial footing where we could say: okay, we think we can go to that next step of our evolution. And so we hired a part-time person, you know, and then...
P: That's actually not so bad. So when I started the foundation: you have to have three people at least, if you're a US one, right; you have to have a secretary and a vice president or president. And so I was the secretary; I said I'll take care of all the paperwork. And the paperwork, as far as getting the non-profit status, that's difficult: hire a lawyer. The other stuff, though, is not so hard.
P: But we have enough staff on hand that if we were able to double or triple our donation level, our income or revenue, we could probably keep the same staff and drastically reduce our overhead. But you run into these places in your history as a foundation where your overhead is going to be higher than you'd like, and you really need to then go out and get more revenue, and really work it so that you can balance things out again. It's this constant cycle.
B: So, I mean, we all work for companies that would hopefully support an OpenZFS foundation. What would be compelling about that, to solicit your companies? What would you be looking for? Is it, you know, development effort? Like, should we target... should we, you know, fund somebody part-time to go work on making all the makefiles and doing all this stuff with the opened code?
P: The big issue for people is: "I'm building a product or a service that depends on this. I know the technology is gonna change over time. I can't do all that development myself. How do I make sure that four years from now I'm not in a dead-end product?" That's something that the foundation could speak to.
P: The next thing that comes up, talking with, you know, commercial users, is: "what are you working on, right? So can I fund a specific development project?" But I think that's actually not nearly as important as the ability to improve the downstream experience of the people who integrate the technology.
P: So that's probably the third thing that these guys are concerned about when you talk to a commercial consumer: what is your test environment? How do I get fixes back into your community? How do I know that the regression that I fixed doesn't come back again in the future? What's your testing methodology? All of those things about making it a professionally developed piece of software that you can rely on.
P: So sometimes it's talking about training materials: once I've hired somebody, how do I bring them up on this code base? How do I make them effective? How do I then get them into your community, right, so that our changes don't stay here and we have to maintain them over time? So it's really those four things, and I think that, as a foundation, you can speak to all of them very, very well, and...
P: You know, "you have to be... having a problem with reputation? George over here, you know, is the person to talk to; and by the way, have you integrated this change?" That, whatever it is, provides extreme value to companies when you go and talk to them. So almost all of our discussions... when the FreeBSD Foundation goes and talks to a consumer of FreeBSD, it's not about money, because oftentimes that's the barrier to actually getting in and having that conversation: they think you're just going to come ask for money.
B: That's super valuable. Does anyone else have some experience dealing with foundations, either on the creating-them side or, like, you know, interacting with other foundations? I...
B: There might be some really deep pockets to fund a lot of things from external sources. It also seems to function...
C: As many of you probably know, we failed to have an illumos foundation. I am still trying, with the State of California, to wind down the illumos corporation.
C: I would also have a look: there were some interesting talks at the Community Leadership Summit, which takes place before OSCON every year, now for the last five years. Henrik Ingo, formerly of MySQL, gave an interesting series of statistics he pulled together on whether open source projects really need a foundation, or benefit from one, and his findings pretty much came down to "no". So I would think long and hard about whether to go down that path.
C
You
know
it
may
be
worth
it.
It
may
not,
but
I'd
if
things
are
changing
very
rapidly
in
this
space
and
it's
not
as
obvious
as
it
used
to
be.
U
F
U
U: So, moving on to the next thing: do we need community management? I mean, we don't have a community manager. I mean, I don't know; it doesn't seem...
C: So we've had a couple of blow-ups on some of the mailing lists, in which people just went nuts and went off into politics or what have you, and it had to be suppressed. So I think there's enough consensus around these as groups of technologists.
B: And it causes problems. So, I mean, something from our dinner discussion last night: it seems like there...
P: Community management... I mean, if you're just talking about list moderation, stuff like that, that's probably not as interesting, at least from my perspective. But what is interesting is strategic planning: better tools for accounting for what people are working on, and where it is, and what its state is. And maybe by using GitHub, with people having their work pushed up there, that'll take care of some of it, but oftentimes...
P: you know, during these periods where we had less management coming from, say, the FreeBSD core team or what have you, there was a lack of focus, right? What are we trying to achieve in the next six months? What is OpenZFS supposed to be really good at, and what are the spaces that we're not interested in, right? That kind of level of direction...
P: ...coming not just from, you know, one person or one group, but having it recorded somewhere, and making it so it's not just a couple of people who have to constantly, you know, remind people on the mailing list: you know, "that's not what OpenZFS means", or, you know, "we really need to finish this feature, because that's our goal for the next six months."
U: Are there other things that we want this community to be about? I mean, you know, we've seen on the mailing list where people will come up, and they're new users, and they want to know more about just ZFS as a whole. We've tried to do a little bit in terms of writing architecture documents and that kind of thing, but are there other areas that we want to expand into, to try to come out with OpenStack...?

U: That makes me smile. Anything else we want from community management?
W: I mean, we talked about testing methodology here, that was mentioned, but I would just underscore that one of the main things I would want to see is infrastructure, yeah: you know, build machines, automated continuous integration, testing.
B: I know a few people have already volunteered, but any more volunteers to post, and, you know, answer questions and...
B: Thanks, yeah, it worked really well, I thought. It was Google Plus Hangouts On Air; it streams live to YouTube, so anybody that can use YouTube can view it. We had people ask questions either by joining the Google Plus hangout or by doing it on IRC.
B: Just to repeat that... yeah, so part of my question is: I think a lot of these are really good ideas, like, you know, updating the admin guide and creating this foundation and making makefiles, but, but...
B: yeah, so, you know, how should we go about finding people to do this? Should I be going and, you know, posting on the mailing list and saying, "these are the priorities that we gathered from the development summit; who will do them?" and hoping that somebody volunteers? Or should I be going and finding particular people, like: "hey Max, you want to do this, don't you? Come on."
P: So I think one thing that might help you there is lowering the barrier of entry. So when we're talking about all these documents: right now they're these monolithic PDFs, and, you know, trying to understand, well, can we use any of that material, or is it write-from-scratch? What you want to make sure you do, right, yeah, is bootstrapping: finding just a couple of people who will bootstrap it, as in take whatever content is available, get it in some place where it has easy access, and then you can go to the community and say, hey...
G: Okay, so, we've been slowly converting them for illumos, but we have the ZFS one, so it's just a matter of going through, cleaning it up, changing references to names, and doing a few markup things, and then there you go: it's easy to go build it and put it on a website. So they're just sitting around; for those it's kind of easier, I'd say. For a lot of those things, like the on-disk spec, you're better off starting from this copy.
U: Yeah, there's definitely stuff that we can get added to the website as well. I mean, Luke has talked about, you know, possibly setting up some additional things; we asked for OpenGrok to be out there, which hasn't happened yet. You know, they're obviously limited in their resources as well, but if there are things that we want to see added, you know, definitely send me the list and watch.
S: ...going to write our own bytecode and parser and lexer; so that was, that was a bad idea. So these are improving.

B: Chris did implement a lot of that stuff, so...
B: And Max has recently worked on a project that is not yet integrated; do you want to say a few words on it?
V: Yeah, so I started at Delphix about five months ago, and one of the first things I worked on was hole birth. Basically, when you do a zfs send, it sends things called FREE records, which represent holes in a file, where it's basically all zeros. We don't actually have to allocate space on disk for these holes, because they're zeros; we can just, you know, pretend we actually wrote them to disk.
V: But when you do a zfs send, if you have very holey files, very holey datasets, you're sending a lot of these FREE records, which takes up a lot of bandwidth and makes your send streams take a long time. And so I did some work on changing the behavior of the birth times, basically recording when things are freed, so that we don't need to send these redundant FREE records if the target of the send is already aware that those blocks are free.
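A sketch of the scenario this helps, assuming the hole_birth feature flag this work turned into; all names are illustrative:

    zpool set feature@hole_birth=enabled tank
    truncate -s 1T /tank/vols/image        # a mostly-hole file
    zfs snapshot tank/vols@a
    # ... small changes ...
    zfs snapshot tank/vols@b
    zfs send -i tank/vols@a tank/vols@b | zfs receive backup/vols
    # with the freed-at txg recorded, the incremental stream can omit FREE
    # records for holes the receiver already knows about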
S: So we already do this for data blocks: in an incremental send, don't send things that are older than the snapshot your send originates from. But we never recorded, when we freed blocks, what transaction the block was freed in. So any time you saw a FREE record, you had to send it, which caused problems if you have, like, a zvol with some other formatted file system on it.
S: These operations are done in what's called sync tasks, and the idea is: when you run one of these commands, we send an ioctl to the kernel that says, "I want you to create a new file system with this name", and the kernel will queue that operation to happen at the end of the current transaction, or at the end of a transaction. So the idea is, at the end of each transaction group,
S: we kind of hold all the locks, say no metadata updates: a single thread goes through and has kind of the power to change and update all the metadata, create new file systems, take snapshots and stuff. And then, so, this ioctl waits for the current transaction to complete, so that you know that this update happened: that you created the file system, that the snapshot was created.
S: Because sync transactions don't happen that often, and it scales based on the load of the system. So in general, I think the target is, or was, to do it once every second, right? But if you're on a heavily loaded system... because the idea is, in order to finish the transaction, all the dirty data has to be on disk.
S: So if you have lots of dirty data all the time, it takes more than one second to sync it out, and so the I/O throttle kind of scales our target. So we've got a system at Delphix where each txg is about 10 seconds, I think, and...
S: each time you do an administrative operation, you send the ioctl down to the kernel; that puts it in the queue to be finished, you know, 10 seconds from now, basically. And so on our system where we store our VMs, we have kind of some Python scripts that use ZFS properties and metadata, and there are a lot of destroys and promotes underneath the hood.
S: A new VM could take a minute, because it had to do six administrative operations. And sometimes you can do them in parallel, but sometimes, you know, you can't: you can't destroy the file system until you destroy the snapshot, or something like that, so you can't actually send the commands down in parallel.
S: So that's problem number one: just how long that takes. As the system gets more full, you start noticing how long these administrative operations take.
S: Another problem is that within these sync tasks, if you, say, take a snapshot, we do a lot of checks. So, as I was saying earlier, during these sync tasks we hold all the locks; we don't let any other threads update the ZFS metadata, so that we can consistently check. Oh, you know: if we're taking this snapshot, we can check that the file system you're trying to take a snapshot of exists, and between that check and the time you actually create the snapshot,
S: nobody else can modify the metadata. So this works within a single sync task. But now, if you're splitting work across multiple operations, say you want to do a destroy, but you have to do a promote first because there's some clone of this file system, you'll do the promote in one sync task and then, 10 seconds later, you'll do the destroy, and between those two points someone will have created another clone, and that won't work.
S: And there's a lot of logic; for example, zfs destroy -r, which destroys all the snapshots of a file system and then destroys the file system. It actually first gathers the list of snapshots in userland, which hopefully isn't changing, because it had no locks held when it gathered the list; then it sends that list to the kernel and, in one transaction, destroys the snapshots, and then does a second transaction to destroy the file system. So if the system crashes between transactions, or things are changing too much, you get... well...
S: We try to have good results, but there's a lot of complicated logic in userland to ensure that we have the right results for some of these weird cases, just because we can't... We have very specific things we can do transactionally in the kernel, in defining these sync tasks, and everything else we have to kind of wire together and worry about in userland, and worry about spanning different transactions. So "destroy all the different snapshots", capital -R, is even worse; I don't know how many transactions that does, but it's a lot.
S: Let's see; the last problem is kind of that we've been adding more and more features. Matt's been adding more features, like taking snapshots, I think you added this in the last year, given a list of different file systems to take a snapshot of, rather than only being able to take them recursively under a single point. But each time we add... or, you know, add, so:
S: zfs destroy -p and rollback are, like, on my wish list, and the idea is that destroy -p would atomically do the promote and then the destroy in a single txg, to avoid that problem I was talking about earlier. Rollback would do a similar thing: if you want to roll back through a clone, do the promotes to make the rollback work.
B: And, like, every new piece of functionality that we add, like the bookmarks stuff that I was mentioning earlier: you want to be able to create a bunch of bookmarks at once, or destroy a bunch of bookmarks at once, and for all that stuff you've got to go reimplement a bunch of boilerplate that snapshot also has to do, and destroy also has to do.
S: Yeah, for each new feature you end up with this trade-off: either you tie it all together in userland and have really complicated code to deal with the fact that not everything is happening transactionally, or you write this terrifyingly long, giant sync-task function in the kernel and hope that nothing's wrong, or everything will explode. So those are the problems. Max will now talk about the...
V: ...execute them all in syncing context. So, for instance, this could be executing multiple snapshots, destroys, creates, and kind of forcing the ZFS kernel module to execute these all in syncing context, which means you get atomicity, and you get better performance because you don't have multiple transactions, things like that.
V: With atomicity you get consistent state: you don't see, you know, some state where half of your snapshots have been destroyed but not the other half. And you also get more rapid development of new sync tasks: you can quickly prototype up some new combination of these different intrinsics, like snapshot and delete and create, rather than having to, you know, create an entirely new sync task in C; you can just write it quickly in this scripting language.
V: And the other cool thing about this is it's a single ioctl. So, in terms of compatibility across platforms: you only have to support a single ioctl, and the kernel can decide, you know, what intrinsics it supports, and maybe throw back an error if you ask for something that it doesn't provide. So this...
S: And a couple of people had mentioned asking for a better programmatic interface, and this would be a better programmatic interface, especially if we... you know, we're glossing over a lot of details, but it'd be nice to have a way to, for example, return data from this, not necessarily from syncing context. So if...
V: ...just send this constant stream of instructions that we have pre-generated or whatever; it's easy to port. So, just to make it more concrete, this is a simple example channel program: zfs destroy, where you destroy all the snapshots of a file system, then destroy the file system itself. The way this works today, like Chris mentioned, is you list all the snapshots of the file system, you issue a destroy on all those snapshots, and then you destroy the file system target itself. So you could write a simple script instead, with nice scripting.
V: This is Lua, similar to Python; so if you know one, you can probably learn the other. And you just iterate with a nice for loop: you do a destroy of each snapshot, and then you destroy the target. And so all of this gets executed in a single syncing context: it's faster, it's more correct, and the host application doesn't have to worry about kind of working its way around any weird edge cases where you can get into inconsistencies or failures halfway through a long sequence of interconnected operations.
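A sketch of that script in the form channel programs eventually took in OpenZFS, invoked as "zfs program <pool> <script.lua> <fsname>"; the zfs.list/zfs.sync names follow the shipped interface:

    -- destroy every snapshot of the named filesystem, then the fs itself,
    -- all inside one syncing context
    args = ...
    fs = args["argv"][1]
    for snap in zfs.list.snapshots(fs) do
        zfs.sync.destroy(snap)
    end
    zfs.sync.destroy(fs)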
B: So, maybe you're gonna address this later, but do you want to talk about... let's say one of these snapshots can't be destroyed for some reason, right, like it has clones. Then, you know, how would that error be handled? Could I still end up with something where some of them are destroyed and some of them aren't?
V: One thing we didn't really mention is what sync tasks do now: they have a check function and a sync function. The check function makes sure that this operation can complete, and then the sync function actually completes the operation. And so I think what we kind of try to do here is have a sort of dry...
S: ...run. I think the important thing is: within some constraints, you can run a bunch of check functions, and that guarantees that if you then run the sync functions of all the check functions you just ran, all of them should complete properly. So you can kind of do a dry run ahead of time to make sure.
S: Right, so as he's drawn it here, I think this would actually, like, if it fails to destroy a particular snapshot, it would destroy the rest but not destroy that one snapshot. But I think the idea is you would wrap it; Lua has first-class functions, so you would have some function like "dry_run_then_do", or with a better name, and you wrap this in that, and so, for each snapshot in this,
S: it would first run the code once, just doing the check functions, and then, if all the check functions pass, it would run it a second time running all of the actions, actually doing the work. And so, since the check functions all passed, you'd be guaranteed that all the actual functions themselves would pass, I think.
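In the interface as it eventually shipped, that dry-run pass is exposed as zfs.check.*; a sketch of the wrapper being described:

    -- first pass: check every destroy; bail before touching anything on error
    args = ...
    fs = args["argv"][1]
    for snap in zfs.list.snapshots(fs) do
        local err = zfs.check.destroy(snap)
        if err ~= 0 then
            return err
        end
    end
    -- second pass: the same operations for real, now expected to succeed
    for snap in zfs.list.snapshots(fs) do
        zfs.sync.destroy(snap)
    end
    zfs.sync.destroy(fs)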
B: The cool thing about this is that it puts that kind of semantics into the control of the application. So the application writer can think about: you know, am I going to be able to check these? What do I want the semantics to be? Do I want it to just do as many as it can, and if it finds one that it can't do, then just drive on? But, you know, maybe...
S: ...check the error. Like, if you cared about errors, you'd be able to actually define that here and then form some output object that would contain the errors that you cared about, rather than having, like, whatever Matt arbitrarily decided would be the best, according to him, be what everyone has to use. Yeah; which, I mean, for snapshots, I think that's not as big a deal, but when you get into, like, the "oh, I want zfs destroy -p that takes care of the promotes", then there's, like, oh well... or is...
V: So this is kind of the high level, so we chose... What we want to do is basically add a Lua interpreter to the kernel. It'll take these streams of instructions that are passed down from user space and actually execute them. So the great thing about Lua is it's designed as a domain-specific language platform, so we think it's going to be easy to add these kinds of...
V: ...domain-specific language library features to it, and have the Lua interpreter call back into the ZFS functions that are actually implementing the logic. And so, for instance, this would allow for that dry-run capability, because we can basically ask the interpreter to execute these instructions with a dry-run flag set, so that we know we don't actually want to make any changes, and then report back.
S: ...the language and VM, whether that's right. So, before Friday, our plan was: we're going to make some nvlist format that the kernel will parse, that will, like, let it interpret; and it was basically like we just had all these artificial restrictions, based on how much of an interpreter we wanted to write ourselves.
S: So right now, our goal is not "let's put generic Lua into the kernel"; it's "let's put Lua on top of the interfaces", you know, as was said before. First of all, Lua barely has to call out into anything: it has, like, one place where it does memory allocation, which is, once again, something you can configure, and that's, you know, pretty much it; and then it calls back into your stuff. So the idea is we put in the ZFS interfaces;
S: Lua is part of ZFS, not part of the generic kernel interface. We haven't fleshed this out completely, but to make it portable, it would be part of... you know, when we create the OpenZFS repository, it will be part of that.
B: And this is, like... that's just to be conservative, you know. So these channel programs would be written and passed in as an nvlist, and then, like...
B: We also... I also had some other ideas about, like, you know, trying to detect... like, basically say you can't go into an infinite loop, like you have to always... halting problem.
B: Well, you mean not necessarily detect an infinite loop, but basically say, like, you can only run ten thousand instructions before you actually have to, like, give us something to do, you know. And, you know, that prevents some of these worst-case scenarios. I don't know if that's really enough to let, like, ordinary people write these programs, but I wasn't sure what the...
G: Sure, yeah; I was just trying to think about that, because I know, like, I'd be terrified if I had another bug in that. Then, like, it only comes up in some corner case, and now all writes are wrong until I...
B: The same problem exists today, where you can tell the kernel, like, "hey, I would like you to create these 10,000 snapshots." It's like, well, that might actually make your transaction group take longer than is ideal, and hopefully, like, the write throttle will kick in and then everything will slow down gracefully, but eventually, you know, you could get some bad performance there.
S: Well, I think... how would you use ZFS in this situation? Hopefully... this is actually a bad example; we should have swapped it. But zfs list presumably would be better in this case than the current case, because in the current case you do need to read everything into memory; in this case, you could iterate in kernel memory.
B: ...the most constrained environment possible: like, you cannot allocate memory, you cannot take locks, right. Versus here, we're in one of the loosest kernel contexts, where, like, we can save space and we can take locks and, like, all this kind of stuff. So, I mean, it would be cool if you could write your DTrace in Lua too, but that's a lot trickier than doing this project.
B: Other concerns, or flat-out criticisms? Or "this is useless", or "it needs to have X to be legit"? Kind of gathering requirements here. So...
F: I think it's also being built just in general... I mean, we wind up with... I've done some performance work for other products, and the request has been: help us avoid the context switches between userland and the kernel. So, again, the way the system falls together is actually really, really nice. It also provides some privileges that help the security stuff, almost like sendfile, but with, like, a throttle on it. It's pretty cool.