From YouTube: ZFS & Containers by Michael Crogan
Hello, so I'm going to talk about ZFS and containers, and I'm going to attempt to condense exactly 20 minutes of content into 20 minutes of time, so I'll do my best. I've architected a project which can be built upon during the hackathon. These are the URLs to the GitHub project and also to a Vagrant image which will let you use OpenVZ and ZFS simultaneously.
You have divergence of your data, inefficiency in its transfer, and inflexibility in expressing what changes you want to propagate. I wanted to advocate ZFS as a back end for a variety of reasons and to reduce the barrier to entry for new use cases. Dedup is known to be expensive, so there are other ideas around that.
The project's aim is to make it transparently easy, with one command, to say: I've made a change to my data set and I want to snapshot it in time, or I want to derive something from it.
I'll go into more detail on the data aspect next, but I also wanted the project to be versatile and flexible, efficient in terms of incremental transfer, with a notion of dedup that I'll explain later; to enable some kind of greater privacy through a dataset injection that I describe later; and to advocate that this may be very well used in a dev/build/production environment. In a dev environment you get perfect fidelity by synchronizing the exact snapshots that are in production.
You gain efficiency by having a common starting point, and your incremental release is a small delta from your starting point. So I leveraged ZFS and OpenVZ together. Briefly, OpenVZ is similar to Docker: it's a container technology into which you may put an OS image and an application. OpenVZ adds the ability to checkpoint your state, which is handy for debugging, failover or migration, things like that, but under the hood the same principles can be applied with Docker and CRIU.
It turns out that checkpointing your entire OS image in OpenVZ is very efficient: if you're only using 200 megabytes in your applications, it'll write a 200 megabyte image representing your whole OS state. In contrast, sending a full VM image of your entire OS may take 8 or 16 gigabytes or whatever it is, making it problematic to do direct checkpointing or migration. And the whole use of snapshots and clones is very fast and efficient in ZFS. So observe a practice which is an example of a use case.
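As a rough illustration of that checkpoint step, a minimal sketch using the stock OpenVZ tooling might look like the following; the container ID and dump path are placeholders, not taken from the talk.

    # Checkpoint a running OpenVZ container; the dump only contains the
    # process/memory state actually in use, so a small workload stays small.
    vzctl chkpnt 1007 --dumpfile /vz/dump/1007.dump

    # Later, bring the container back from that dump on the same or another host.
    vzctl restore 1007 --dumpfile /vz/dump/1007.dump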
It can be generalized, but this is the example for this discussion. So I have a core starting point: a data set which is like a container image. It's like an entire OS, a Debian 7 starting point, within OpenVZ, and it appears within a hierarchy with the ZFS back end. That core starting point, 1007 I call it, is cloned, copy-on-write, and becomes the basis for an intermediate "common packages" container data set, meaning packages that most of my containers include, like the build environment.
This is an efficient intermediate point for encoding, and then from that I have specific containers: maybe a web server, a build environment, and these are lightweight to encode relative to the 2007. And this is just what it looks like under the hood: I constructed a root directory, meaning within 1007 here are your sbin, etc and so on, and your dump is a single file. To clarify some of the assumptions around this, I assume that derivative hierarchies are largely additive.
If we have a starting point of your base and we have a build environment, largely we're adding content. It's not true all the time, but it is true by construction, or in this particular use case. And because there's commonality in the way that Debian or other software pulls in dependencies, there's identical content that's actually shared between these containers that would otherwise be isolated, and I choose this intermediate point deliberately to be efficient. The last assumption is that the size of the data is larger than the size of the metadata.
This is just showing, more concretely, the structure of the actual content of the data set. So, just to be clear, this is one example: up to the point of the root, this is one ZFS data set, and then up to the dump is another ZFS data set. And likewise, when I checkpoint, maybe on a daily basis or when there's an update, this is just showing that it's a recursive snapshot, and then there's a serial notion of checkpointing over time for a fixed container.
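A minimal sketch of that layout and the recursive snapshot, with made-up pool and dataset names (the talk doesn't give the exact paths):

    # One dataset for the container's root file system, one for its dump file.
    zfs create -p tank/ct/1007/root
    zfs create    tank/ct/1007/dump

    # A recursive snapshot captures both child datasets at the same instant,
    # giving the serial, per-container checkpoint history described above.
    zfs snapshot -r tank/ct/1007@2014-01-15
    zfs list -t snapshot -r tank/ct/1007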
This indicates how I basically start with a basis and derive an intermediate point. I take the example of build-essential, which pulls in a lot of packages, and then from this 2007 intermediate point I add a small addition, nginx, and this is done through clones and snapshots. Here's another example with Apache: a different container ID, with 2007 as the common starting point. And one obvious claim is that the storage footprint of using this ZFS ecosystem is more efficient; it's also more efficient to send, and to know what is actually differing in your intermediate containers. So when I say 5007 given 2007, it's considered an incremental delta from 2007 as a starting point. If we were to compare this to a full concatenation of each of these, that would be duplicating a lot of content that ZFS can encode more efficiently.
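A sketch of that derivation chain, again with assumed dataset names and snapshot labels rather than the ones from the slides:

    # 1007 is the Debian 7 base; snapshot it, then clone copy-on-write into the
    # intermediate "common packages" container 2007 (build-essential goes there).
    zfs snapshot -r tank/ct/1007@base
    zfs clone -p tank/ct/1007/root@base tank/ct/2007/root

    # Specific containers (nginx, Apache, ...) are in turn clones of 2007, so
    # each one only stores its small delta on top of the shared starting point.
    zfs snapshot -r tank/ct/2007@common
    zfs clone -p tank/ct/2007/root@common tank/ct/5007/root
    # "5007 given 2007" is then just the delta since @common, which is what
    # gets stored and sent instead of a full image (see the send sketch later).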
So this is ZFS and containers: combining the two, we can do lightweight snapshot, clone and rollback, and because we are leveraging OpenVZ, it can also include the state of your running system. So, to be able to roll back: when we take a snapshot and it includes state, we do a recursive snapshot of the 1007 data set, and it includes a dump file that is the result of a vzctl suspend.
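Putting the two halves together, a stateful checkpoint might be sketched like this; the container ID, pool name and dump path are assumptions for illustration:

    # Suspend the running container and write its process/memory state into the
    # dump dataset, then capture file system and dump in one recursive snapshot.
    vzctl chkpnt 1007 --dumpfile /tank/ct/1007/dump/Dump.1007
    zfs snapshot -r tank/ct/1007@checkpoint-2014-01-15
    # Resume the container from the dump so it keeps running afterwards.
    vzctl restore 1007 --dumpfile /tank/ct/1007/dump/Dump.1007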
So we capture the state of your files and also a single-file image representing what was running at that time, and this shows what it would look like under the hood, at a low level; this is roughly what would be done to represent that. And here are some example uses of the tool. You see it's very usage driven: a checkpoint will automatically snapshot your state, and you can choose to snapshot every container that's running,
every dataset you've got, or choose a particular container to derive a clone from: suppose you've got your 1007 and you want to derive 2007, your intermediate-step container. This functionality can be built over time, but these are some example commands. You can also see what's differing locally versus your last snapshot, and suspend and resume.
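The exact command names weren't captured in the transcript, so the following is only a hypothetical sketch of that usage-driven interface, with invented subcommand names standing in for the project's real ones:

    # Hypothetical wrapper commands over zfs/vzctl; names are illustrative only.
    ztool checkpoint 1007            # suspend + recursive snapshot of one container
    ztool checkpoint --all           # checkpoint every running container
    ztool clone 1007 2007            # derive a new container from an existing one
    ztool diff 1007                  # what changed locally since the last snapshot
    ztool suspend 1007 && ztool resume 1007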
This gives fidelity of the container, as mentioned, with best practices: efficient storage, integration with the checkpoint/resume of your OS and application state, and it's possible to do something that you usually don't do, which is have an environment and then roll it back. Sometimes I accumulate data, you know, temporary data, and my file system grows and grows, but I realize it was just a temporary experiment, so this makes it easy, one command, to essentially roll back.
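Underneath, that one-command rollback could look roughly like this, a sketch with assumed dataset names; note that -r on rollback discards any snapshots newer than the target:

    # Stop the container, roll each dataset in its hierarchy back to the chosen
    # checkpoint, then restore the process state that was dumped at that time.
    vzctl stop 1007
    zfs rollback -r tank/ct/1007/root@checkpoint-2014-01-15
    zfs rollback -r tank/ct/1007/dump@checkpoint-2014-01-15
    vzctl restore 1007 --dumpfile /tank/ct/1007/dump/Dump.1007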
Now, the core technology of the tool is synchronization of these data sets across hosts, and to do so manually is complicated, especially involving three hosts or coordinating efforts. Without that, it must be done manually, or maybe there's another tool out there, but the data sets become out of sync, and that's not what I want when I have multiple hosts and I want to essentially propagate or share a container across them. So the project identifies a common snapshot starting point.
Suppose you want to coordinate between n hosts: if you have a peer or a central repository, you can simply include in your practice synchronizing to the central place first, but any number of enhancements can be built upon this. The core notion is a push and a pull. As an example, suppose I'm working in a development environment and I notice a bug; I can quickly send the state of my running system and everything in it.
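A minimal push sketch, assuming both hosts already hold the shared @common snapshot; the hostname and dataset names are placeholders:

    # Snapshot the current state, then send only the delta since the common
    # starting point; the receiving host applies it on top of its own copy.
    zfs snapshot tank/ct/5007/root@bug-report
    zfs send -i @common tank/ct/5007/root@bug-report \
      | ssh build-server zfs receive -F tank/ct/5007/root
    # Repeat for the dump dataset, or use zfs send -R to cover the whole hierarchy.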
Likewise, in a production environment, a centralized repository can roll out the incremental builds, and assuming most of the data doesn't change, this can be like a 10x speed-up. And again, you can quickly fork your data and roll back.
A work in progress of the script is injection of a data set, like your home directory. If you have a home directory, it might be part of your build environment, and you don't necessarily want that evolving as part of your shared ecosystem.
So in the spawning of the container you can inject /home/user, or whatever it may be, into the container as it runs, and this tool can treat that data set independently: it can choose to share the home directory, or, if it's a temporary directory, it might exclude it entirely from synchronization. The injection lets you consider the holes or pieces within your environment that you want to consider or do not care about.
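One conventional way to do that injection in OpenVZ is a per-container mount script that bind-mounts a separately managed ZFS dataset into the container at start; this is only a sketch under that assumption, with made-up paths:

    #!/bin/bash
    # /etc/vz/conf/1007.mount -- run by vzctl when container 1007 starts.
    # Bind-mount an independently managed dataset (e.g. a home directory held
    # in its own ZFS dataset) into the container, so it can be shared with or
    # excluded from synchronization on its own terms.
    source /etc/vz/vz.conf
    source "${VE_CONFFILE}"
    mount -n --bind /tank/home/user "${VE_ROOT}/home/user"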
So, I think we have time, so there's also an idea of a rebase; this is the generalized idea around greater efficiency.
Suppose we have a collection of containers and they live in this ecosystem. What we can do, if we've accumulated them over time, maybe less efficiently, is we can rebase: we can reconstruct the identical content, but in such a way that it is efficiently encoded by ZFS through deliberate copy-on-write constructions. So, a very simple way of doing this, this is just the overview.
Now, there will be overlap and there will be parts that don't overlap. Where content is common, it will simply be encoded once in your result; and where there is non-overlap, there will be files that differ and there will be conflicts. But the notion is that the percentage of data that may differ, for example your package database may differ for each individual container, is, in terms of storage footprint, a much smaller portion of the data than what is common.
So the idea is, you derive a consolidated container. You clone that to form primed versions of containers one, two, five and seven, and then you do a perfect-fidelity rsync from the original content to the primed version. You get identical content, deletion is lightweight in terms of footprint, and you're conserving space because you're getting a maximal notion of the union. And, as I said before, the part of the content that differed still exists, it will be encoded with high fidelity, but it is small, so it is efficient to do this.
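A rough sketch of that rebase, with placeholder names; the consolidated dataset is assumed to already hold the union of the containers' content:

    # Snapshot the consolidated (union) container and clone a primed copy per container.
    zfs snapshot tank/ct/consolidated/root@union
    for id in 1 2 5 7; do
      zfs clone -p tank/ct/consolidated/root@union tank/ct/${id}-primed/root
      # rsync with --delete restores perfect fidelity: anything not in the
      # original container is removed from the primed clone, and deletions in a
      # copy-on-write clone cost almost no space.
      rsync -aHAX --delete /tank/ct/${id}/root/ /tank/ct/${id}-primed/root/
    done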
But if this were enabled, then one could encrypt, with, say, a workplace key, those private or proprietary changes to a public data set, and the same with taking that further to the personal component. It's not perfect from an encryption standpoint, because you have some known or anticipated data, you know, chosen or known plaintext, but it may achieve useful effects to be able to add encryption.
Likewise, if Debian upstream, or if we ourselves, knew exactly which directories potentially contain personally identifying information, we would know exactly what to inject; we can consider temp directories as orthogonal, and likewise there's a variety of advantages, sorry, enhancements to the tool. I'll be adding issues, in terms of potential enhancements on the actual content, later today, but also, perhaps if this were taken up as a hackathon project, other enhancements could be taken up.
Yeah, so the question is: how can this be put to practical use, how do we enable it or use it? So, the Vagrant image is a starting point which has an OpenVZ plus ZFS entire Linux distribution as a starting point, and into which you can enclose the project. So that's the starting point: you essentially can either do this yourself or use this as a starting point. OpenVZ is the technology, and you can build your ZFS into that.
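Getting started would then look roughly like the usual Vagrant workflow; the box name below is a placeholder, and the box URL is the one mentioned at the start of the talk:

    # Fetch and boot the prepared OpenVZ + ZFS environment, then work inside it.
    vagrant box add openvz-zfs <box-url-from-the-slides>
    vagrant init openvz-zfs
    vagrant up
    vagrant ssh          # the GitHub project can be cloned and run inside the VM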
Yeah, so there are two parts to it. The mashing of everything together: if there's an identical file in multiple containers, it'll only be stored once, and after cloning, the clone operation takes no storage footprint. Then there's an incremental part for each individual container: we do an rsync, so there's an rsync for container one, container two, container five. Is that how you'd imagined it, or maybe I didn't understand?
Well, so, because the notion is additive growth of either snapshots or derived containers, it's more efficient to go backwards in time. If we were to go forwards in time, we'd be re-encoding things in more than one way. So, as an example, imagine container five and container six share a lot of content, that is, they would contain the same file; if we encode it forwards in that way, we're essentially replaying the same content again. Going backwards, you identify what actually is duplicated.