Ceph Conferences, 11 May 2017

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Ceph Snapshots for Fun and Profit

Description

Ceph includes snapshot technology in most of its projects: the base RADOS layer, RBD block devices, and CephFS filesystem. This talk will briefly overview how snapshots are implemented in order to discuss their implications, before moving on to example use cases for operators. Learn how to use snapshots for backup and rollback-ability within RBD and CephFS, the benefits they bring over other mechanisms, how to use the new snapshot trim config options, and most importantly: how much they cost in

A

All right welcome everybody back to the Ceph day track here, we're going to keep moving along with another SEF FS talk or pieces of cephus I. Guess the the snapshots for fun and profit we're going to talk about all the various snapshot, mechanics that exist within SEF, whether that's the block device, the FS or the or the gateway. So there all kinds of options and Greg Farnham here is going to give the presentation he's a longtime core developer of Ceph, so I'll. Let him take it from here: hey.

B

Everybody that mic dwell okay cool, so this talk is SEP snapshots for fun and profit. My name is Greg Farnham I'm, a principal software engineer at Red, Hat I've, been working on the project for almost eight years now can't believe it. So during this talk, we're going to go through the origin of snapshots and M staff, because that's important for some of the design decisions that have been made. We're going to look at how rights work inside of the OSD and how the snapshotting systems interact with those rights.

B

Look at how snapshots work at a higher level in the RB d and set ves systems we'll look at how the snapshot trimming works inside of the OSD, we'll look at some ways to control that and throttle it and the consequences of the implementation and we'll look at some use cases when I was practicing this earlier. The talk was a little short when I was just like talking through it, but hopefully this one is more understandable. The last time I gave this talk. It was a little too hard to follow.

B

So please, if you have any questions like raise your hands or jump up and down or something because we should have enough time for Q&A, while we're going through so Steph started out at the UC Santa Cruz storage, Research, System Center they were, it was a long term research project. They were trying to build a successor to the lustre HPC file system. It was some of the research was sponsored by the National Lab Sandia and Lawrence Livermore, as they were, setting up lustre for the first time and realizing like wow. This has some downsides.

B

We'd like to not have those downsides things. Msf projects have changed a bit since then there's a lot of open source and hardware companies that are contributing to the project. It's a lot more cloud focused. That's why we're all here most customers are working in virtual block devices RVD or in the s3 and Swift radio Skateway interfaces.

B

But about a year ago, at the OpenStack Austin summit, the Ceph community was really proud to announce that we had a stable, fest file system upstream, and so some of the vendors are now starting to to push that down to some of their customers as well. If you've ever seen a step, talk, you've probably seen the slide. The step project starts off with the reliable autonomic distributed objects, tour that sort of provide that provides the data, durability and consistency mechanics.

B

On top of that, we build various interfaces, a full file system with metadata server and some and a custom client, the radio SLOC device, which is just a client library that sits inside of kimu or inside of the Linux kernel and other systems or a radio Skateway proxy that speaks s3 and Swift the outside world and turns that into internal radius. Operations for itself snapshots were initially envisioned as with the rest of the project as a thing and set in the set file system, and they were designed to be really easy.

B

Every directory and stuff of s has a hidden dot snap directory inside of it. If you want to make a snapshot of that directory and everything underneath it, you just create a directory inside of the dots adapter. So it's just a make two dot snap / snap and then everything underneath that has a new snapshot, that you can reference through the dot, snap directory, dot snap and see the files of the state when you created the snapshot. That was that was a big goal was that you could do this with arbitrary sub trees.

B

You didn't need to specify that the directory was special. Before you made a snapshot, you didn't mean to create the directory in some special way. You didn't need to do sub volumes and things. So it's just. We wanted to work with any directory in the system. Your home derp, the at like as a user, the administrator taking a snapshot of every home door or at the root of the file system or whatever it would all just work.

B

Because of that and the user accessibility and the fact that in HPC applications when people are taking snapshots, it might be a thousand nodes all doing it at once. Those snapshots need to be cheap to create, but we do have one big advantage over some systems, which is that we have intelligent clients. The Southwest client is pretty smart. It does a lot of work.

B

The RB decline, not that we knew about it then, but today, the arbiter Klein is pretty smart and does a lot of work so the so the clients can coordinate the snapshotting across OSDs. We don't need to flood all the OSDs with it with a synchronous message, message system that says: hey, there's this new snapshot. That applies to these objects and indeed, when sage sat down with that system and worked out the first design, then we took advantage that snapshots in ray DOS are actually per object so to the OSD.

B

All it knows is that there's a snapshot, you know snapshots 72 and it has this object in it and he might later find out that you know it's got a second objects, but he doesn't know. Oh hey, there's this new snapshot, 72 and it has these 17,000 objects.

B

Those and that's because the the snapshotting is driven on object right when snapshot. When you take a snapshot in the Ceph file system, then it applies to the whole directory and everything underneath it or if you take a snapshot and RBD volume, it applies to all the objects in the RVD volume, but we don't go out and touch those objects right away. We just when we have a right to them, then we send along a little bit of metadata.

B

That says: hey you're part of this snapshot 72, and that means this data is pretty skinny having every object. There's a list of the snapshots it's part of, and we have a list of snapshots that have been deleted in the cluster. That broadly makes sense. No one's screaming too hard. Oh.

B

Yeah it actually, it actually works with any data that it's been put into the system. But if you bring that up later, we'll get to the file system- and we can talk about a little more just.

A

As a reminder, if we're going to ask questions, please use the microphones. We are recording this for posterity. The question.

B

Was if this works with Stefan ves on open files- and the answer is yes, but we're not going to talk about too much detail so in ratos in the OSD, just normal, writes without snapshots involved, you have object, storage daemons. You probably already know this right now, those consist of a user space daemon that talks to an XFS filesystem, there's a new thing, coming called blue store that manages disks directly and that's going to have a lot of advantages that is being pushed forward.

B

But most of this talk is going to focus on the files on the file store on X FS, because that's what most people have it's the most battle tuned? It's what it's the only one thing that a lot of vendors are supporting right now. So in terms of the network, when you have a a raw ratos client that wants to write something, it says: hey I've got the subject: foo I want to write to it. So it says: hey, like the client says: ok I, find that found the primary for this object.

B

Foo and I want you to write this data and just sends a message to the primary OSD. The primary OSD sends that sends that message to to all the replicas for project for object, foo, and then it sends back an act to the client when they've been committed to disk inside of the OSD. There's a couple different things that need to happen. It needs to look up the current object state to make sure that the clients allowed to touch that object, that the object actually exists.

B

If it's not doing object create to see if it needs to change the sides of the object or whatever. So that's one disk I/o. If that data isn't cached it packages up the right data for its replicas and for its local files, local storage system, and then it sends that to the replicas over the network and to its local storage system to persist. And that's you know, depending on the file system, on what the file system feels like doing right now.

B

That's about one disk access, you'll notice that here I am ignoring the journal that you probably know about in the file store because we're more interested in sort of the throughput rate of the backing hard drive.

B

So in ratos, when we're snapchating, a single object along with that snapshots and radios, are identified by just a single 64-bit number, they don't have names as far as regos is concerned. They don't have any metadata associated with them, except the fact that they have been allocated.

B

We call these snapshots self managed because we're bad at naming, but also because force f of s and for our BD they're sort of managing the metadata about the snapshots. F FS is responsible for knowing which objects are in snaps are in this particular snapshot. 42. It's not the responsibility of Rea dos or anything like that. So to allocate a self managed snapshot a snapshot.

B

The client just says to the monitor: hey I want a new snapshot ID and the monitor does does what we call a paxos commit round and it, and it allocates one internally between its it's lightly, consistent and available and writes that down to disk, and then it says: okay, client, here's, a new snapshot and as far as Ray dose is concerned, that's it there's now a snapshot, it's not associated with anything except that it exists. But that's that's all it takes to do the logical creation.

B

Now the client probably actually has some data at once in the snapshot. So at some later point it says: okay, I, have this object? Foo, let's call it that is in my that is in my snapshot. I got, which is just snapshot 42 and so now, I'm writing new data to object, foo, and so he says, says, sends a message to the primary that says: hey write, the state object, foo, oh and by the way, I know that object. Foo is a member of snapshot.

B

42 and the primary gets that and it sends it up the replicas then back and everything's happy internally ratos. The OSD looks up object, foo, that's about one disk IO, it says: oh, hey object, foo, isn't in snapshot 42, yet that I know about so I, better make a copy of its current state and say that that's snapshot 42, and so that's a a clone operation in in XFS. That's a full copy of the object and in blue store it. It's just.

B

A little bit of metadata gets scribbled down because blue stores controlling the block allocation, and so in XFS you're copying the four megabyte object and applying the the new right right in memory, and then we also need to record in the OSD a lookaside table that says hey. We have.

B

We know that there existed snapshot 42 that has object foo in it and also we have an object foo that is in snapshot 42 and you want the bi-directional look up so that we can do things like trimming which we'll get to in a moment graphically well.

B

Okay, sorry so graphically we have. The disks ADA's exist as we have this. This object foo and it's got sort of an X adder which contains its info and we say: hey I, want you to look up the untidy to look up the info. So please read this X at or out of XFS for me and we get it back from the client or not from the client, but from our file system. And then we say: hey X of s. We need you to copy object foo into this new location and I.

B

Think what it actually looks like is renamed yeah and I. Remember we copy it into a new location. We overwrite it. So we say: hey clone the object right, this new data to the newly cloned object and record the snapshot, and so that goes into the file system and it says: hey. We now have the foo snapshot version one and this object foo, which has the new overwritten data and also in a level DB instance that we use to provide a whole lot of things than we've written down these.

B

These two key value pairs from snapshot to foo and from food to snapshot, and that can get cool less mostly into one commit. If the file system feels like in time, it might be a couple more, it depends, and then we say: hey the file system did this, you can you can have it back now and you're done so that's sort of the local path and you'll notice. You know, depending on the file system, feels like it might be to iOS.

B

It could be more if, depending on how many folders it decides, it needs to look at it needs to go, do lookups in or to update or whatever at the time so at a higher level in our VD. From its perspective, let's look at writes in snapshots, ratos block device stores, virtual disks, you're, probably broadly familiar with it.

B

Visually you've got most of the time. You've got the Lib RBD library running inside of key mu and providing the disk services to its VMs. It might also be a kernel, client or whatever, and then that library, just talks directly to the OSD, is to do what it needs to do.

B

When you take a snapshot in our VD you're running a simple operation which I have later I, think it's our VD create snapshot ID on this object, then the client goes to the monitor and says: hey I need a staff ID, and the monitor goes to this concise here's, the staff ID and then the client needs to write down on what we call an RB header out, that's responsible for saying: hey, we have this object or this RV volume it exists. It is of size, 10, gigabytes, things, lend and it supports these features.

B

It also has a field in there for what snapshots exist in the in on the RBD volume, and so we write down on the RB header volume: hey you're, a member of I like 42, of snapshot 42, but it doesn't need to go out to any of the data objects later on when you're writing to the RB d volume.

B

For some other reason, then, you say hey by the way you're a member of snapshot 42 because it says so in the in your RB d header, and that goes through the same path as it does with Rados, but more, but importantly that can be in parallel, like there's no serialization or synchronization across the objects that requires any kind of ordering it's just every time we do an I/o in parallel or sequentially.

B

Every I/o has that by the way you remember object, 42 and the OSD is take care of it on their own, and so the right path looks the same, and it doesn't really do anything from the clients perspective. So it's pretty simple and stuff s not hugely different. We do have a metadata server that sits in between the OSD s and the client in order to provide file system.

B

Namespace operations like saying, hey, I, need you to create this directory or rename it or created or allocate a new ID, node number, and so the client goes to those to do that. But then, when it wants to write cloth actual file data, it just talks directly to the right OSDs for those with the objects that that, with the objects that are a part of that file, when you want to and so sort of graphically it says, hey MDS I want to open this. This file, Greg, slash, git config for right and the MDS says.

B

Ok here it is, and the client then writes out. You know the new version might get config out to an OSD if I want to make a snapshot of my home directory. The client says to the MDS: hey I, want you to make a snapshot in slash home, slash, Greg, dot, snap slash my snapshot and the MDS will has its own. You know MDS logs, that are journals too, and so it- and so it persists that hey there's now a snapshot in slash home, slash, Greg and then I respond to the client.

B

Ok, you've got a new snapshot, got snap, ID 42 and then, when I later on- or maybe it's happening at the time. But when I later on, say, hey I want to open and write the dot git config file in Greg's home tur, then the mbss, here's, the here's, how you open the file and then the client sends off the new the new data to the object and says by the way this object is a member of object. 42 and again that happens in parallel.

B

You go to the MVS to open files, and the MVS tells you that it's a member that the file is a member of whatever snapshots it's in and then, whenever you go talk to the OSD, is you just set that up so sequential or it's not? It's not serialized, it's just all in parallel with whatever files you have to be doing.

B

This could be you know one big file that has that you're writing the three objects at once on, because you're doing, oh dear, because you're doing sixteen megabytes streaming iOS, it could be three very small, four kilobyte files. It just all happens. Naturally, so that's how snapshots get created any questions? Oh one in the middle yeah be good. Sorry, I'll.

C

Go first sure you mentioned for when it's writing objects for Atos as a bi-directional thing, and it writes it to a database after right, where's that database to live so.

B

Every OSD daemon provides three different sort of data streams or Forks on an object. You've got X adders and the object, byte stream, and what we call OMAP or object map. It's a key value store in the OSD, that's implemented with leveldb rocks TV if you're familiar with those. So it's it's not it's not a sequel thing. It's just us, just a key value store that you can list and read it out of, and we use that for providing the O map implementation.

B

We use that for doing for some of our internal metadata, like this snap mapper thing in the normal course of doing business, on a write, you actually don't talk, don't do anything with it, but it is sort of a thing that's being worked on in the background all the time, and so there is a cost associated with writing into it. But it's it's sort of an ongoing thing. You pay it's not a for this OP. We created an I/o. It's like for these 50 ops. We created a 4 kilobyte write to disk.

D

E

You yep does replicating begin as soon as the OSD starts. Downloading data yeah.

B

So replication happens, sort of parallel with the local right to disk of what we call the primary list. He gets a right and it puts it through some processing and once it's sort of approved it and ordered it with respect to other rights, then it simultaneously sends it off over the network to its replicas and and gives it to its local storage too, to persist. Thank you, yep, okay, so we've seen the decree up one more sorry, yep.

E

Say at what time do you acknowledge the right? After all, the replicas are done or when the first one is written rights.

B

Are always acknowledged after it's been committed to every OSD in the system? That's important for our consistency results. Then it means that you never get into split-brain situations with stuff and.

E

The next question was that when any kind of a right is done, you write some kind of a journal and given the fact that you're using X FS, which is also a general in file system, so aren't you taking like a double head for writing journals for.

B

It yeah your request, yeah. We can talk about that afterwards. That's part of the reason that there's this new blue store thing I've been alluding to is that it handles the disk directly, and so we remove the double the double logging, but that's not something we can really talk about right now. Okay, so we've seen that creating snapshots is pretty cheap.

B

As with many systems, what you don't pay upfront you have to pay later on and in Ceph, then the cost is paid when deleting snapshots now I'm going to talk a lot about sort of sort of negatively about the cost here, I want to be very clear step is, is really very efficient about snapshots, it's clean it.

B

It defers, work and batches, it together reasonably nicely, but we because we are, we are data, light on the front end when we're doing the creates, we have to be a little data, heavier on the back end back end.

B

So when you delete a snapshot logically, the client actually just sends a message to the monitor, saying: hey, I want to delete snapshot, 42 and the monitor, writes that down and sums back in acts saying: okay snapshot, 42 is gone and the way it sort of persists that and shares that information with the cluster is in the OSD map that that records, what OSU's exist and various other cluster metadata and that everyone sees and it just has a a field, called deleted snapshots or something- and you add it into that- and it's a efficient representation.

B

It's not one for every for every snapshot. You do it's not an integer for every snap that you delete, but it's basically it's what we call an interval set, and so it goes in here and it says: okay all done now, that's just the logical deletion.

B

Once you delete it logically, you will never see data associated Samsa 42 again from the client side, but it's still sitting on the server's taking up space, which, obviously you don't want it to do, because you want that space back so on the OSD, it gets a new OSD map and it says: oh hey.

B

The zoasty map has a deleted snapshot, I'm, going to put that deleted snapshot into my queue of things to trim and then, as it works its way through that, through that snap trimming queue, it will list the objects that are in the snapshot. It will for each of the objects on length to clone for that snapshot in XFS.

B

It will update the the objects main info same info exeter that contains the metadata about the object and it will remove the the leveldb snap mapper entries for that objects in snap pair visually. Let's say: we've been a little more ambitious. We've now got three objects that are in snapshot. One and we've got an OSD map that says hey. You need elite snapshot 1, so this snap trimmer is now running through and it says alright. I need to leave snapshot 1. What's the next objects, that's in snapshot 1 and the answer comes back.

B

Oh, hey, it's foo and so the OSD says: hey XFS, I needed to remove this food. One object. I need to update the info on foo to say that it doesn't have a foo one object and I need you to remove the keys out of out of the snap map or leveldb instance, and so XFS and level. Do you change their instances, level D crosses out the entries. Xfs has a new info and it's removed foo 1 and it says: ok, I'm done and again snap trimmer goes hey.

B

What's the next snapshot, one object and this time it's bar will walk, do the same process for bar. Does it again what's the net? What's the next object, its Baz walk through the same process for Baz, and then we're done and we say hey: what's the next snapshot, one object and we go. Oh there isn't one and maybe we're now deleting snapshot too or maybe we're just done and the snap trimming can stop for a while.

B

So you know frequently that's about to iOS project, miss snap. You need to go, look up the X adder and you need to go and then you need to write out the etc and the unlink and the the snap map or leveldb entries are just getting coalesced into the sort of background work that's going on.

B

Sometimes it can be a lot more. Sometimes XFS doesn't have any of the metadata and entry in memory, so it needs to go through. Sometimes XFS has journaled up, but it's two unlinked 50 or 50 files, and then you give it a 50 first one. It's like oh, like no I need to go actually unlink things at about of out of the out of the folders that I have in having other places on the harddrive.

B

So it's a little unpredictable when scheduling this is a lot better in blue store, because all the metadata in blue store is just Cola scible into into the leveldb instance, and so it's sort of an amortized look up and then an amortized right out of the new instances of keys and in particular seth, has historically had problems with throttling these trim operations. Because of the way that work that we think XFS is done where it says: hey, alright, unlink.

B

This file has not actually done inside of XFS, and so it just pops up later on as much more work. So the snap trimming in XFS is in ray dos is very important. The ways that you control it and hammer sort of the classic and the classic version of stab trimming that people who have used it a lot of had some trouble with, but that had it's the first sort of rudimentary controls that were there were two main switches.

B

You could change the maximum number of snap trims that it would be doing at a time that is the number that every PG and the OSD would be giving to exit the number of files that every up PG in the OSD would be giving access to remove at once. And so you could say: oh hey, like you'll, have a lot of PGS but let's say up 30 PGS at your primary four, so you'll be giving your ex FS with what the defaults you'll be giving at 60 things to remove it once so.

B

That's a lot, but you know it's sort of okay and but some people found out that it wasn't okay at all. So we also had this thing called the snap trim sleep and with the snap trim sleep then, after every time the OSD gives XFS to things to delete.

B

Then it sleeps for this number of seconds it defaults to off, but a lot of people have turned it from have tuned it from 10 milliseconds up to like 5 or 10 seconds, even because they didn't have very many objects, but they just needed it to be very. Very background: in the jewelry lease we made a lot of improvements.

B

We move the snap trimming from its own sort of separate worker pool of threads, where it just contended in the disk layer with client IO into what we call a unified op q, where client IO goes in snap. Trimming goes in backfill and recovery go in all through the same all through the same set of threads and the same queue. So we can prioritize them and say: ok like given the cost of doing all these operations and their priority to the administrator.

B

What order we want to go in, and so with that, when you we can set the snap trim priority it defaults to 5, which is pretty low. Clients are 63 which is sort of the max. You can specify how expensive you want to consider a snap trim to be, and it defaults to being one megabyte of cost, which, which frequently is a little more expensive than it needs to be, but sometimes it's not quite enough.

B

You can still specify the concurrent snap trims and you can still specify the snap trim sleep, but this was really embarrassing because if you turned it on, then you actually block the OP thread that client I went through. Whenever you did it, and so you could set a snap trim sleep of 1/2 second and then no IO would happen for a second, including all your clients, and it was bad. So you shouldn't do that. But someone pointed out this bug and we did fix it.

B

It's in the upcoming 10.2 point release, and it also has a few new things. In addition to making snap trim sleep work properly, we also- and we also added a new configuration option that specifies how many PGS the OSC will at trim at a time and so I think most of the, and so with these options then use all the users that I'm aware of that have tried.

B

Them are really happy with the way that that trimming works, because previously, if you delete it, a snapshot that had a lot of objects in it your than euro, so you just go away for a while. We'll look at that in a minute about why that happened, but with these settings they managed to turn it down. So it wasn't a problem.

B

The upcoming luminous release has the same tunable as the previous one, so there are some consequences to map shots and the way they work every I/o to an object in the snapshot that hasn't already like been registered. As part of that snapshot copies the object when you're using when using XFS. So if you're benchmarking, random I/o, we occasionally have people come on list and say: hey, like I, took a snapshot, and now my random I/o FIO benchmark is running at a thousand to speed. It was before and we're like well yeah.

B

That's because you're copying every object if every every object on every access, because you're taking a snapshot every second and you're, never going to win that race in general. This is amortized across iOS. So you know as long as you don't take snapshots to fast, where your cluster can do and what workload you're applying to it.

B

Then you should be okay, we've seen people who really did try and take snapshots every minute or every five minutes on our Beauty volumes or something and that didn't work out well for you every day, probably we'll assume you have enough, like slack in your cluster, to do the trimming that you that you're going to want to do if you're doing that and snapshot trimming, cost sort of a little more than a client up for every object, with fresh data in the snapshot.

B

So again, it's amortized, but if you take a snapshot on our BD volume of a thousand objects and you write to every object and then you delete it. You've got about a thousand client eye, ops, maybe two thousand, and if you have ten primary OSDs with a hard drive that can do 100, I ops, then that's one second of cluster throughput to delete that snapshot. Now, assuming you've set up you're using the defaults or have set up the snapshot. Trimming tuneable as well that'll, be you know, not a whole second, but it'll be distributed.

B

But it is, you know, you're starting to think about it and in turn, in those terms, when you're doing your cluster capacity planning, you'd better, not create a cluster and then have an hour's worth of snapshot creates an hour's worth of snapshot trimming every day. If your cluster is running at full capacity for 23 hours out of the day, and you need to design the system for to do that.

B

B

Yeah, sorry, so if I haven't been cleared enough about that, the copy goes away with blue store. Blue store does block allocation. So you only write the fresh data into the in the blue store. When you delete the data, you only delete the data that isn't used by a current snapshot.

B

A blue store blue store, makes everything wonderful and is full of rainbows and unicorns and ponies, but most of you are are not running blue store and how can it and if you're running a cluster you're going to be using file store for a while, so know these things?

B

So that's how snapshots work in stuff of SNR VD. We also have this other thing called pool snapshots that I made in my first year or two and that I'm a little sad about the goal with pool snapshots was to make things easy for admins. I think that these might have existed before RVD was even a thing, but after we created the radio Skateway, so it was like you know.

B

Maybe we got want to make this thing so that admins can take copies of the current state of their cluster and we want to use the same implementation inside of the OSD, and so we're, like you know, a really easy way to do. It is to just put the snapshot in the OSD map and let it spread. There were some problems with that, though, unlike with our other snapshotting mechanisms, pool snapshots are not point in time.

B

If you have two RVD volumes attached to a to a VM, and one of them is your like database log and one of them is your database data and you do a pool snapshot. Those are not point in time consistent and if you do a recovery for your full snapshot, your database is going to be very angry at you.

B

The snapshot is just sort of spread virally, as OSD maps get pushed out between the OSDs and between the clients and their OSDs, and additionally, it makes the OSD map bigger for every every snapchat that exists in the system and, most crucially, it doesn't work at all with self manage snapshots.

B

So you can't use the real RVD snapshots that are per volume and that are used for some of the replication systems that people have built up in things and you can't use it on it: fool where you're using stuff of s and also because it's pool wide every object in the pool snapshot trimming is a lot more expensive than on a snapshot basis. Then then, when doing most snapshot removals, we throw out a lot more effectively now, so it's better, but it does mean that you know your pool. Sort of.

B

Has these giant consists, not consistency, points where when you do remove it, it's all it just used up a whole lot of data throughout the system, so you might have a use case for pool snapshots. There are some but they're unlikely to to be what you're, after, if you're looking at them, and so you should talk to a list about what your goals are and or you know, your support, guy or whatever about what your goals are and what the right way to accomplish them is. There are also a few pain points and stuff.

B

That's with snapshots. I should call out hard links and snapshots do not interact at all right now, if I create a file in my directory or my home door, and then I hardly get from somewhere else and then I take a snapshot of the somewhere else. It does not copy the current state of the file.

B

It's just the thing that hasn't been done, it's kind of hard. We know how to fix it, but it, but it's still queued up, because other things like the multiple active MDS has got prioritized the last planning round. There are also a few hard edges and some narrow bugs bugs when you have the have various combinations of features turned on, so snapshots aren't considered generally stable, yet I'm, not sure if the file system team is turning them on for Luminess or not, but as but they aren't a jewel.

B

Certainly, that's said you know they're coming along they're nice most of the time, and there are some good use cases for them, which I should have ordered next, but are instead the next slide. So an RB D, there's a there's, a web or there's a doc page about how to use snapshots and it's pretty simple. You run the RVD command and you say snap create this snapshot on this RB D volume and it takes snapshot of the image.

B

You can also clone the image from a snapshot, and so when you do that, you've got your image. Ruin your image bar and you start you've got your image foo and you've snapshotted it. You can make an image bar, that's in a different pool somewhere that might have you know different different speed requirements or different, consistent or different durability requirements, or something and or it might just be like you want to copy, and then you have this new image bar.

B

That starts off the same as our who was that it's edited snapshot, but that changes as as you do rights. There's some nice use cases associated with that. You can create a golden image and then every time anyone wants a new, a new volume. Then it's just an overlay of your old, an image you can take snapshots right before you do an OS or a big package upgrade and if it fails, you can clone the snapshot and just resume from that snapshot.

B

If you want to take backups for your clients without them noticing, then you can get a point in time, consistent on a point in time, consistent, harddrive image, that's not like it's not like an FS freeze and flush that that's safe, but it is its crash consistency and you can use that to back up somewhere else out of our BD or across to another, our beauty, cluster or something, and you can use it in various ways to transparently migrate, VMs around between pools or clusters and stuff. If s, it's a simple make sure by default.

B

Everyone on the on the cluster can create snapshots, but you can limit it by UID range. If you want to, you, can sort of use it for anything that you want read-only data for you can create point-in-time backups of a directory before making big changes that does work with open files. As long as the data has been put into SAP of s, the clients will flush it out correctly. You can use it as a poor-man's. Get that works. Ok with binary data.

B

You can use it as a basis for copying consistent data around you can take snapshots of the home directory every day to prevent user, to allow your to allow your users to ask you to recover files for them or to do it themselves and the project manila files as a service system in an openstack uses snapshots for whatever use of snapshots for, but this is for that, and we have come to the end of my slides a little early. So I'll take questions now or maybe I won't.

D

I think I saw a blueprint for a rattlesnake way, snap shots, but it never went anywhere. Is that something that people are asking for, so you can kind of back up. S3 buckets I.

B

Am not familiar with any requests for that I think there had been some requests poured in the past, but they added the s3 versioning interface, so you can do virgin to objects that just don't disappear and I think it pretty much died away after that, if people are interested you should put in those tickets, I don't know how they do it, but it definitely could be done in a couple different ways: I had.

D

Another question: you don't mind if you snapshot a directory tree with accidentally some humongous file in it. It is there any way. The reason about where did all my space go? You know I, it's hidden away in snapshots. Is there some way to kind of see in there? That's.

B

One of the rough edges in the file system. Right now there there are cepsa ports. A thing called are static thing that we call recursive statistics where usage information gets propagated up into directories, so you can look at a directory instead of it being four kilobytes, because that's the size of a block, it'll say: oh, hey, there's 10 gigabytes of data in my descendant and so I think probably will end up. Doing is hooking snapshots into that. To have like a snapshot are sad saying.

B

The snapshots underneath here have this much data, but it's not implemented yet all right. Thanks, very much yep Hey.

F

Hi excitation I have a question, so you mentioned that you use the RVD comment tool to create snapshot. Can you use things like and you bypass it? Rb 20s like live, set, thee or live artbeat RVD yeah also initiate that snapshot yeah.

B

I don't work with RBT too much so honestly, I just went to the snapshot page and looked for an example of how to do them, but yeah those are that's implemented in terms of live RVD. It's all programmatically accessible it just it's pretty simple. What's the point of that command? Thank you. Yep.

G

So the OSD map has a list of snapshots that a deleted basically stored as a interval set. He said: does that just grow with the number of snapshots, or is this some sort of trimming around.

B

So the interval set is a particular kind of data structure where and it's nice for snapshots, because if you've deleted all of the snapshot 0 to 100, it says alright, then that takes two integers to represent. It says it says, starting at 0 we've deleted. Then this set contains 100 entries. So as long as you delete snapshots from the tail moving forward, then it's a very small structure. If you have a more complicated backup system, it can grow, it can grow more. It hasn't been a big problem for users.

B

Yet, although yes, there are going to be changes post Luminess, how? How did the snapshot deleted? Snapshots are stored in the map? So it's.

G

Bounded by a number of cells left I mean delete.

B

It it's bounded by actually it's bounded by the number of holes in your deleted snapshot sets.

B

Are there plans for the ability to create consistent snapshot of multiple RVD volumes? I believe that's a blueprint in progress, but I can't talk about it very much. There's a mechanism yeah from mer, antis, save someone. It branches is working on a consistency. Groups feature for RBG volumes, Oh Oh, Jason's there sorry Jason. You should ask him about things like that.

B

Alright, I guess I'll, let you go early and feel free to come up and ask me or Jason or save some questions thanks so much guys.

B