From YouTube: Ceph: object storage, block storage, file system, replication, massive scalability, and then some (linux.conf.au 2013)
Description
https://twitter.com/LinuxConference
A: G'day, linux.conf.au Australia 2013. Well done, people. We have got two speakers: Tim Serong and Florian Haas, and they'll be talking on "Ceph: object storage, block storage, file system, replication, massive scalability, and then some!" As a very quick bio: Florian is a Linux high-availability and storage specialist. He frequently consults and conducts training on both OpenStack and the Ceph stack. Tim is currently employed by SUSE as a senior clustering engineer, working on SUSE Linux Enterprise High Availability Extension and the SUSE Cloud product, which is based on OpenStack. He now has doubts on whether there is such a thing as too many log files. With that introduction, let's give Florian and Tim a warm welcome.
B: Thank you for that. So, for those of you who came in on a kind of tight schedule, or kind of late: if you already have those virtual machine images set up, that is perfectly fine and you can follow along. If you choose not to follow along on your own virtual machines, that is perfectly fine too; you will take just as much out of this tutorial, and you will be able to retrace your steps later. We have actually made a point of making that easy for you.
B: If you do follow along, please make sure that, other than having those virtual machines installed and having run the install.sh script, you have virtualization enabled in your system BIOS. That is a feature that is usually found as Intel VT or AMD SVM. You also want to have your KVM module loaded; that is just a "modprobe kvm", and it will load the appropriate kvm-intel or kvm-amd module for you.
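As a minimal sketch of that pre-flight check (the grep pattern is the usual way to spot VT/SVM support; it is not from the talk itself):

    # check for hardware virtualization support (Intel VT or AMD SVM)
    egrep -c '(vmx|svm)' /proc/cpuinfo   # non-zero means supported
    # load the KVM module; the matching kvm-intel or kvm-amd module
    # is pulled in as described above
    modprobe kvm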
B: So, generally speaking, we do have sort of a theoretical introduction to the Ceph stack, and if you would like, you can keep setting up your machines during that time. However, if you're just starting now, I would actually suggest that you perhaps leave the following-along for later, so you don't miss too much of the talk here. Yes, we have a question.
C: [inaudible question]

B: The question was: are we going to be able to download the image later? No; you are able to download the image now, and you can do so just as well later on. It's on a Linux Australia mirror; I tweeted the info yesterday, and there is also an entry on my blog. So if you go to hastexo.com/blogs/florian, you will find all that information there, and I don't think the organizers are going to take that down any time super soonish. All right.
B: Okay, so we are going to talk about the Ceph stack. The Ceph stack is a storage stack that provides us with object storage, block storage, a file system, replication, awesome scalability and some other goodies, and we're going to walk through those one by one. This is a double slot, so we have a little more time than we usually have in a talk, which is delightful, and this will give us the chance to cover Ceph theoretically in a little more detail than we were able to do in the OpenStack talks.
B: Tim works at SUSE, and his email address is tserong@suse.com. Tim is actually based just outside of Hobart, and in this talk, or in this tutorial, he will be doing the real work, meaning actually working on systems. The handwriting that you see here is his, and he's also responsible for the cartooning that you will see halfway through the talk. With Tim doing real work, handwriting and cartooning, that leaves me with fake work, hand-waving and babbling.
B: So, with that out of the way, let's get started with Ceph. Ceph is really not one thing, but four things, four relatively distinct things all rolled into one, and we're going to go through them step by step. Before we do that, we're going to give you a quick overview of all of this. Ceph, fundamentally, at its core, uses object storage.
B
That
means
that
the
primary
interface
to
interact
with
data
is
not
files
is
not
blocks
is
objects
wine,
because
in
order
to
build
a
massively
scalable
distributed
data
store,
we
can
actually
reduce
that
to
relatively
simple
operations.
What
we
want
to
do
is
we
want
to
be
able
to
write
data.
Read
data
perhaps
delete
some
data,
but
we
really
don't
need
to
worry
about
what
is
our
block
bar
sector
address?
We
don't
want
to
worry
about
what
I
know
it
is
or
whether
we
have
directories
or
files
or
permissions
or
echoes,
or
things
like
that.
B
We
don't
need
any
of
that
to
achieve
massively
distributed
in
extremely
scalable
high
level
storage,
even
Zeph.
The
basic
unit
of
data
is
and
objects
and
objects
as
such
are
being
distributed,
replicated
kept
highly
available
within
the
distributed
cluster,
and
we're
going
to
look
at
in
a
certain
amount
of
detail.
What
that
means,
as
we
get
to
the
to
the
practical
stuff.
B
Sef
also
has
a
block
storage
interface
so,
rather
than
interacting
with
SEF
objects
directly,
we
have
one
abstraction
layer
that
allows
us
to
treat
a
great
number
of
Seth
objects
as
one
block
device,
as
we
would
with
any
other
old
block
device
in
Linux.
So
there
we
have
something
that
appears
as
a
virtual
block
device
and
we
can
write
to
it
at
a
certain
offset
or
read
from
it
as
at
a
certain
off
senate.
We
can
use
it
for
anything
else,
so
we
can
use
a
block
device
for
block
device.
B: Block storage in Ceph is thin provisioned, so it's very space efficient. It supports redirect-on-write snapshots, it supports cloning, and several other interesting things. The third thing that we're going to cover when we're talking about Ceph is RESTful storage. This is something for those of you who are familiar with things like Amazon S3 or OpenStack Swift.
B: This is object storage using RESTful interfaces, and that means we have clients that may be just an HTTP client, using standard web technologies like HTTP, HTTPS and JSON to retrieve data from the object store, and doing so in a very, very efficient and simple way. And then, finally, we have a distributed file system, which is the layer that adds all of those things that are interesting to POSIX, and only to POSIX, to the distributed storage stack. So here is where we get a distinction between files and directories; here is where the namespace actually becomes hierarchical.
B: Here is where we see things like file ownership and permission bits and those things, and, as you will see, that is actually a very, very thin client layer that talks to the object store, which makes the whole thing very, very interesting and elegant. Okay, so those are the four things that we're going to talk about, and we're going to look at them both from a theoretical perspective, what's behind it, and in practice.
B: Let's take a look a little bit at what is so special about this whole native object storage thing that is at the core of Ceph. Ceph is based on a distributed, autonomic (that means self-organizing) and redundant native object store named RADOS, and RADOS stands for Reliable Autonomic Distributed Object Store.
B: RADOS has a flat namespace, so that means we don't have anything like a directory hierarchy or anything of that nature; it's completely flat. This is something that is very common to pretty much all object stores. In RADOS, for each object we have a name that identifies it, we have a payload, or contents, which is pretty much arbitrary in size, and we can also stick onto a RADOS object any number of key-value pairs: attributes. So we can assign an object attributes at will.
B: We can obviously retrieve them as well, and those are relatively independent of the actual payload. Now, objects in RADOS are assigned to something that we conceptually refer to as placement groups, or PGs, and every PG, every placement group, has a list of object storage devices (or, depending on whether you're reading the older or the newer documentation, object storage daemons) where the contents of these PGs, so all the objects in a specific placement group, are stored in a redundant fashion. Now, why is this a list of OSDs?
B: The reason it's a list of OSDs that each PG writes to and reads from, in order, is that RADOS uses a primary-copy mode of replication. So as we're writing an object, it, along with any other object in the same placement group, is first being written to the primary OSD, which is the first entry in this list, and then this primary OSD takes care of where the other replicas go. And how many replicas we have is entirely configurable.
D: [inaudible question]

B: The question was: is the number of replicas configurable per object? No, it is configurable per pool. A pool is an administrative subdivision of the object store, and all of the objects in that pool have a certain number of replicas assigned to them. And the replication to these OSDs is actually synchronous, so we can make sure that when we define that we want, for example, three replicas of every object in a given pool, that holds as the object is being written.
B: The application, the client that is actually doing the write, does not get acknowledgement of that write until it has been completed on all of the replicas. Object placement is completely algorithmic, so there is no central lookup database, and there is also no such thing, really, as a distributed hash table.
B: Now, what is special about that? That concept of data storage is a little abstract, so I tend to like to explain it with a bit of a more concrete example. When we are checking into a hotel, right, that is a data storage problem: I, the traveler, am data, and I wish to be stored in an appropriate location, and preferably such that I actually get my own room and do not intrude on someone else's.
B: If the hotel is a little more modern than this one, then it may be a key card, but that doesn't matter: you get something to get to your room, and you get the information of where your room is. So, in terms of a data lookup, as in "I need to figure out where I, the piece of data, am to be stored", what that in fact is, is essentially a central database lookup, with an optimization, which is that they're actually telling me where my room is and I can memorize it.
B: So we've just cached the lookup, okay, and, as with all caches, the cache typically expires. My reservation, and the room that I get, is only for, I don't know, say four days, and if I want to extend my stay, I need to get back to the front desk and do a fresh lookup, and I might then be reassigned to a different room, or the same room may still be available and my stay is just being extended, but I still need to go through the front desk. So that is the lookup of the data.
B: The lookup, the mapping of the data (me, the traveler) to storage (a room), conventionally, the way we do it in a hotel, is: you go to the front desk, they tell you a room, you get a key, and boom, you've just done a central database lookup with caching; something that we would typically do in a storage solution that operates on the basis of a central metadata service. Now, that works just dandy for a small hotel.
B: But what if it grows? We could do something that's very, very simple: we could just add more front desks and hire more people, right? So, rather than having one front desk, we might have two, or five, or 12 or something. But that really doesn't work too well, because what that does is let us handle more lookups in parallel.
B: So if we have a larger group of travelers, we can now always handle, say, three at a time. What it doesn't solve is the problem of actual data assignment, because what might happen is that Tim and I both, coincidentally, travel and stay in the same hotel, and we both approach the front desk, and I'm being told my room is number 365, and then Tim, strangely, is also being told his room is 365. And then, eventually, what's going to happen to one of these transactions?
B: One of them is inevitably going to fail, right? So either the system is smart enough to detect the conflict and then kick it back, or we just meet at what is ostensibly our door, and we both say: well, we actually don't think our relationship is quite ready for this yet. So one of us returns to the front desk and says: hey, what the hell is up with you? Please give me a room that is not occupied. So that is the classic lock contention problem that we have in these database lookups. If we allow multiple clients to access the same central database at the same time, it might work relatively well normally, but if there is any conflict, then one of the transactions has to be kicked back, and that doesn't scale really well, because the larger the onslaught of travelers becomes, the greater the probability that we're going to get one of these conflicts.
B: So we could have several of these hotels right next to each other, right? Say we have 26 of them, and one says A, one says B, one says C, and so on, and then the assignment is based on the second letter of your first name. So now we have sort of partitioned the problem a little bit, because now we don't all have to access the same front desk. The lookup database can be just for those individual partitions, and then Tim goes to one building and I go to a different building, and we might actually be assigned the same room number in our respective buildings, but now we no longer collide, because the space is actually partitioned, right? So there are many, many, many hotels, and that helps.
B: Now, that's actually a pretty good approach for when we want to grow from, say, one order of magnitude to the next order of magnitude, and then perhaps one more. But what if our hotel was not small, was not large, was not huge; it was absolutely gigantic? A hotel, for example, with a billion rooms. Okay.
B: This creates some interesting challenges. The hotel with the billion rooms (and I've been told I should use the term "the hotel at the end of the universe" for this) creates some really interesting challenges. So, for example, the whole thing with the room number doesn't really work so well anymore, because if I have a room number like that, I might know how to get to the 156th floor of the hotel, but getting to the umpteen-thousandth room on that floor may be quite a walk. And besides, it doesn't really help for me to memorize that number, because it becomes essentially pretty meaningless. So that's one thing that doesn't really work so well; this whole lookup thing: not so cool. We're also having a bit of a statistical problem with the hotel with the billion rooms, because in a hotel with a billion rooms, it's relatively likely that at any given time about ten thousand rooms are probably going to be on fire, give or take.
B: Twenty thousand, thirty, we don't know, but a substantial number. And there will probably be about one, two, three hundred thousand rooms currently under some sort of maintenance, because they got a water leak, or they're having the walls repainted or something, or we might just be building them, or we might be tearing down a building; things like that.
B: All of that is actually relatively probable to be occurring at some place in the hotel, so that doesn't work too well, because we have to remember that, at a certain scale, something is always going to fail, period, end of story. At the same time, it's highly unlikely that everything is going to fail at the same time; that would mean, basically, that the hotel with the billion rooms, with maybe its thousand buildings, all gets knocked out by the same alien spaceship or something.
B: But then we have other problems, like the assignment of travelers probably failing. But it is perfectly safe to assume that, at a certain scale, something is always going to fail, and we have to build and engineer the system for this. So what is it that we really need in order to manage our hotel with a billion rooms? In other words, what is it that we really need in order to manage really, really, really big distributed, replicated storage?
B: Okay. Because we've already established that the thing with the room numbers to identify the room is pretty meaningless, we should really use something that we already know about ourselves to identify where we need to go. For example, that might be a fingerprint, or an iris scan, or something of that nature, and then we use these to identify ourselves. So when I get to the hotel, I put my finger on a fingerprint reader, and I know exactly where I need to go.
B: I know where my room is, and all of that has to be done by the system by itself. What I essentially want is to go somewhere where I have something completely automated that takes this information and automatically guides me to my room, so I no longer need to care where it is, because it could be anywhere, right?
B: So what I really want is one of these neat little automated robotic helicopters that reads my thumbprint, and then I get in and it airlifts me to my room. That's what I want. The airlift thing is just kind of nice because, as we already established, we have this bit of a walking problem on the 156th floor, right? So airlifts would be really nice; and then make that completely automated: put in my thumbprint, and there we are. So now we have sort of solved the allocation problem.
B
We
have
something
that
takes
something
that
we
already
know
about
ourselves
and
automatically
gets
us
to
where
we
want
to
go.
That
is
not
all
the
problem
that
we
need
to
solve.
So,
for
example,
we
also
need
such
that
we
can
still
enter
our
room
when
housekeeping
comes
in
and
does
our
room.
We
want
some
robots
that
automatically
move
all
our
stuff
when
we
can't
enter
a
room.
B
So
if
there's
housekeeping
in
there
or
mints
or
something
we
want
something
like
this
to
move
all
our
stuff,
completely
automated
to
a
completely
different
room
right,
including
the
umbrella
and
the
at
the
end
and
the
bag
and
the
kids
titty
beer
and
the
pillow
and
whatnot,
and
we
want
all
of
this
to
automatically
move
to
a
different
room,
because
the
system
doesn't
really
care
where
my
room
is
I.
Don't
care
where
my
room
is:
we've
already
established
that
the
rooms
are
magic
because
they're
all
completely
identical
and
they
have
the
same
view,
etc,
etc.
B
So
I
don't
need
to
care
where
my
room
is
and
if
there's
housekeeping
or
maintenance
or
the
walls
are
being
repainted
I,
don't
really
want
to
know
about
it.
Instead,
I
just
want
to
be
taken
somewhere
else
by
this
magic
thumb,
print
later
helicopter
system
and
then
I
need
something,
and
that
has
actually
got
my
stuff
there
before
I
arrived.
So
I
need
one
of
these
fancy
little
fast
robots
that
solves
the
maintenance
and
housekeeping
problem.
However,
fibers
are
typically
not
scheduled,
so
the
fire
problem
is
one
that
we
still
need
to
cover.
B: ...and close the door. I want to experience just a little bit of pushback, just a little bit of additional latency, because that's what it takes: that's all the time it takes to scan everything and replicate everything and then load it onto one of these fancy fast robots and move my stuff elsewhere. So that would be cool. So, in summary, what do we need?
B
Something
that
takes
a
piece
of
information
that
we
already
know
about
ourselves
which
automatically
moves
us
to
a
room
as
soon
as
we
read
this
piece
of
information,
something
that
automatically
moves
their
stuff
from
A
to
B
when
my
room
is
not
available
and
we
need
something
that
automatically
replicates
all
my
stuff
when
I
first
get
in
there,
so
I
don't
lose
myself
in
a
fire
right
so
far,
so
good.
The
nice
thing
about
is
as
far
as
data
storage
is
concerned.
B
Ancef
implements
an
algorithm
called
crush,
controlled
replication
under
scalable
hashing,
and
the
interesting
thing
about
crush
is
that
this
algorithm
is
known
to
pretty
much
everything
that
plays
with
the
Seth
cluster.
That
is
the
components
of
the
cluster
itself,
but
also
all
of
the
Seth
clients
are
aware
of
this
algorithm
and
because
this
algorithm
is
generally
available
to
everyone
and
everything
in
the
cluster.
The
only
thing
that
we
actually
need
to
distribute
in
the
cluster
are
the
parameters
to
this
algorithm
and
then
safe
speak.
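Because placement is purely algorithmic, any client holding the current maps can compute where an object lives without asking a lookup service. A quick illustration with the ceph CLI (the "ceph osd map" subcommand; the pool and object names are just examples):

    # ask where object 'hello' in pool 'test' would be placed;
    # prints the placement group and the acting set of OSDs
    ceph osd map test hello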
B
Now
we
already
mentioned
OS
these
object,
storage
demons
and
what's
cool
about
OS
DS.
Is
that
all
of
us?
These,
just
like
everything
else
in
the
cluster,
know
about
the
current
map?
The
current
crush
map
that
describes
object
placement
and
they
can
also
propagate
that
if
they
are
sort
of
endowed
with
the
ability
to
do
so
and
they
get.
B
Don't
necessarily
want
to
treat
you
to
his
whole
thesis
because
that
might
be
a
little
bit
long,
but
but
the
but
the
two
favors
from
two
thousand
six
and
two
thousand
seven
are
really
really
easy
to
read
and
explain
this
very,
very
succinctly
where
you
it's
one
of
those
technical
reading
experiences
where
you
go:
hey,
yeah,
that's
totally
logical!
That's
the
way
that
you
need
to
do
that
and-
and
it's
not
so
much
clever
as
it
is
really
smart,
because,
as
we
all
know,
cleverness
is
the
enemy
of
stability.
B: So the MONs use a distributed consensus protocol and algorithm which is based on Paxos. Paxos is a distributed consensus algorithm that I think was first described in about 1995 or 1996. Ceph is not the only distributed technology that uses Paxos in some way, shape or form.
B: So, for example, if you are familiar with the Pacemaker high-availability stack, there is an add-on to that called booth, which is built for site-to-site clusters, or multi-site clusters, and it arrives at consensus using Paxos. For those of you that are familiar with ZooKeeper (not the conference management software for LCA, but the other ZooKeeper): that uses an algorithm based on Paxos. So this is actually something that is fairly common, reasonably well understood in the literature, and in relatively wide use. Was there a question?
E: [inaudible question]

B: So, really, if you want to read up on those papers, they're really, really cool. Now, what's interesting is that both MONs and OSDs operate entirely in userspace, as do all of the Ceph daemons, really. This is a departure from stuff that we've seen in other distributed storage technologies: for example, the Lustre file system does a lot of its work in kernel, both client and server side.
B: The way to connect to these is shown at the bottom; I hope that is legible. You connect to these boxes with secure shell, as root, to 192.168.122.x: .111 is alice, .114 is daisy, .115 is eric and .116 is frank. The root password for these is "hastexo", all lowercase, and please feel free to put your SSH public key into the /root/.ssh/authorized_keys file. By the way, if you choose (you shouldn't, but if you choose) to take these virtual machines with you and plug them into a network that you own, on a somewhat internet-facing platform, be aware that my public key is in there for root, and Tim's, and potentially your own. So if you actually want to put this out on the internet, prepare for a visit from Tim and me.
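In concrete terms (IP addresses and the password as given in the talk):

    # connect to the lab VMs over SSH
    ssh root@192.168.122.111   # alice (the client node)
    ssh root@192.168.122.114   # daisy
    ssh root@192.168.122.115   # eric
    ssh root@192.168.122.116   # frank
    # root password: hastexo (all lowercase)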
D: [inaudible question]

B: You don't need to make any changes to these. I can tell you I created them on Ubuntu 12.10, and they should work there unmodified. If they're not: hey, it's LCA, you get to hack. All right, we are aware of two things that you need to change. On openSUSE you need to change the emulator line from /usr/bin/kvm to /usr/bin/qemu-kvm, and the same goes for Fedora: on Fedora the emulator is /usr/bin/qemu-kvm as well. And then on openSUSE, rather stupidly...
B: Okay, so, like I said: if you can get your stuff going, great; if you can't, just don't worry about it for now, because you will take more out of this tutorial if you just watch, rather than trying to scramble and get things done now and then catch up.
B: So the first thing that we're going to show you is that we have a running Ceph cluster on these boxes, and we can use a utility, which is creatively named "ceph", to check the current status of the cluster. So we do ceph, dash, lowercase w, and that should give us the current status of the cluster. This may look like Sumerian to you at first sight, but you actually learn to parse it relatively quickly.
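The command in question, for reference:

    # print cluster status, then keep watching for status changes
    ceph -w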
B: The first line is the overall health of the cluster; there might be a HEALTH_WARN or a health-critical state or other things. And then in the next line we get our current mon map, so that is the current monitor servers that we have in the cluster. What we did in the configuration for these is... let me break that here real quick; can you just Ctrl-C that, just so it doesn't roll over? Thank you. What we have here is three MONs: all three of the Ceph cluster nodes are also monitor servers, and that is something that you do for MON high availability.
B
So
what
you
can
do
is
you
can
have
a
single
mon
if
you
lose
that
you're
kind
of
screwed,
because
the
Mons
is
how
clients
actually
connect
to
the
cluster
and
that's
the
only
piece
of
information
that
we
need
from
the
client
to
connect
to
the
cluster
once
it
has
it,
a
single
mom
that
it
talks
to
it,
finds
out
about
all
the
other
bonds
in
the
cluster
we're
all
viewers.
These
are
etc,
etc.
B
But
if
you
have
one
and
you
lose,
that
you're
kind
of
in
trouble-
if
you
have
to
that's,
actually
a
really
bad
idea,
because
it
uses
a
consensus
algorithm
based
on
quorum
and
if
you
have
just
two
nodes
or
two
mons
in
the
cluster,
you
lose
one.
The
other
automatically
loses
quorum
and
is
also
unavailable,
so
minimum
number
of
bonds
that
you
want
to
have
in
SF
cluster.
In
order
to
be
highly
available
rate
and,
generally
speaking,
you
would
use
an
odd
number.
B
Ok,
we
have
an
OSD
map,
so
we
know
about
the
OS
DS
in
the
cluster
object,
storage
demons,
and
currently
we
have
30
s
DS
up
and
also
30
s
DS
in
so
up
means
that
the
OSD
demon
is
actually
running
and
is
responding
on
the
network
and
in
means
the
OSD
is
currently
eligible
and
available
for
data
placement
or
data
storage
and
we're
going
to
see
what
happens.
If
we
do
things
with
these
and
then
we
get
a
little
bit
of
information
about
our
current
placement
groups.
B: So we currently have 840 placement groups in the cluster, and they are all considered active and clean. That is to say, we have no degradation of storage anywhere; everything is wonderful. And something that I failed to mention up to this point: we also have in this cluster two MDSs, metadata servers. Those are only relevant to the Ceph file system, which we're going to get to at the end of the tutorial.
B: So the question was: what do the various e's mean in the output of ceph -w, the e1, e155, e66 and so forth? That's an epoch. All of these maps in the cluster are versioned, and those are just the version numbers, which increase. And it's actually really, really cool how it's done: no OSD and no component in the cluster ever sees any of these version numbers go backwards. That, again, is one of the things that you will totally geek out about in the papers when you read them. Okay.
B: So the next thing is what got us here: we have a single central configuration file, /etc/ceph/ceph.conf. There you go. And this is actually relatively simple and straightforward; there's nothing super-duper fancy in here. Ceph uses an authentication service called cephx, which we can use both for clients to authenticate to the cluster and for individual cluster daemons to authenticate to each other.
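To give a feel for the file, here is a minimal sketch in the style of the ceph.conf of that era (section names follow the Ceph documentation of the time; the host name is one from this cluster, everything else is illustrative, not the talk's actual file):

    [global]
        auth supported = cephx
        log file = /var/log/ceph/$name.log
    [mon.daisy]
        host = daisy
        mon addr = 192.168.122.114:6789
    [osd.0]
        host = daisy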
B: We can define a specific log file if we want to; in this case we're just logging everything to /var/log/ceph, and then cluster ID, cluster name and daemon ID. By default the cluster name is "ceph", and the daemon ID is the stuff that we have after the dot for the various daemons. We have defined three different MONs, on our hosts named daisy, eric and frank.
B: So the OSDs are actually the data storage workhorses in the cluster, and of these we might well have hundreds. It is a relatively standard design practice to have one OSD per physical storage disk that you have in the server, in the individual node, and it is relatively common for a Ceph storage server to have between about 4 and 12 spinners in it.
B: So if you're thinking "oh, I'm going to procure hardware for a Ceph cluster", the kind of box that would work really, really well for the dinosaur technologies of days of yore, with like 48 disks in it and whatnot: that is not your ideal Ceph or OSD box. OSDs are really, really smart in what they're doing, so they do consume a fairly significant amount of processing power and memory, and what you typically want to shoot for...
B
C
B
E
You
like
to
make
me
dance:
okay,
yeah.
B
B
C
C
B
B
B: If you run multiple SSDs, can you increase the density? Yes, you can. You can, of course, run a whole Ceph cluster, with like several petabytes, on all SSDs if you choose to; that will typically be limited by budget constraints. The nice thing that you can do with a Ceph cluster is build something that is really awesomely distributed and very, very fast using, say, 7.2k RPM SATA spinners with two terabytes each, which are really cheap disks, and you just invest a few bucks extra into the SSDs.
G: [inaudible question]

B: The question was: I said that the standard number is about 4 to 12 disks, and what change can we expect if we're using more or fewer than this? You just happen to hit different performance limits. It will still be fine in terms of performance and utilization and whatnot, but, for example, this is a standard thing that we come across relatively often when we do projects like this.
B: Someone buys a server with lots and lots and lots of spinners and one or two SSDs, and then what happens is what I said earlier: it's actually the SSD that becomes the performance bottleneck. So in that case you can still use the box, and what you would do instead is just put the journal on the file store, on the spinners themselves, but you may just have wasted a little bit of cash on a few SSDs. Bruno, you had a question?
D: [inaudible question]

B: So, how does Ceph deal with heterogeneous types of disks and arrays and things? In the CRUSH map you can define weights. So you can say: give preference to this type of disk, or this box has this much more storage than the other box, etc. So it deals with that.
D: [inaudible]

B: Even better; it's wonderful. Okay, alright, more questions? Yes.
C: [inaudible question]

B: The classic issue that you typically have in a standard-issue storage solution, which is that you typically get periods of relatively high locality of I/O, is just not happening here. Instead, in a sufficiently scaled-out cluster, you're hitting sufficiently many nodes that you're distributing the load nicely, so you're actually not getting that kind of locality.
B: As for the advisable SSD-to-spinner ratio: a good rule of thumb is generally about four OSD journals per SSD, give or take a little bit. So if you have, say, eight spinners, two SSDs would be perfect. Okay, all right. So we're going to skip the whole RADOS Gateway stuff, because we're going to get to that in a second. What we're going to do instead is show you, real quick,
B: ...what a Ceph OSD actually looks like. Here be dragons, be very afraid. Can we just see a "mount" real quick? So what we have here is a separate file system for the OSD, because it's on a separate disk; in this case it is an XFS file system that is mounted from /dev/vdb1 to /var/lib/ceph/osd/ceph-0. And we have two theoretically recommended file systems for Ceph OSDs; in practice...
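For reference, the mount line in question looks roughly like this (device and mount point as read out in the demo):

    # peek at the OSD's backing file system
    mount | grep osd
    # /dev/vdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw)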
B: ...I would actually recommend only one. The two that are theoretically recommended are btrfs and XFS, btrfs sort of being the perfect option, when it's ready. People have been saying nasty things about btrfs, like "it is two years from production, and always will be". I will not go that far, but for right now, arguably, it is still marked as experimental; you probably don't want to entrust your petabytes and exabytes of data to btrfs. Ceph OSDs actually do a few clever things with btrfs.
B: Such as, for example: when using btrfs, you can do the journal write and the file store write in parallel, which can greatly speed up the process, because if the journal write fails, you just roll back to the previous btrfs snapshot, which is kind of cool; you have a consistent state again. And there are various other...
B: ...neat little things that it does with btrfs. For right now, my general recommendation in Ceph projects is to use XFS, which is really, really fine and helpful as far as that is concerned. There is support for ext3 and ext4; those have some performance issues because of the way they handle user extended attributes, but, generally speaking, XFS should be a very safe bet.
B: And do an ls in there; or even, well, actually, go into "current" in here, there we go, and do an ls -lR in here. Yeah: a lot of it looks very much like a normal, regular file system, right?
B: This is very much optimized for Ceph's own purposes, for object storage, and it is one of the things that tends to scare people the most about Ceph: if you're running into a problem where essentially all of your MONs, all of your monitor servers, are completely unavailable and you can't connect to any of them, it's really, really hard to get your data out.
B: That is something where people usually get more of a warm fuzzy feeling when they use GlusterFS, because in GlusterFS all of that is very transparent. If I put data into GlusterFS, then I can look into the brick, the underlying local file system that GlusterFS exports, and in there is exactly the same file name, and, if I'm not using striping, the same file size and the same attributes, etc., and I can easily get my data out of GlusterFS even if I kill the GlusterFS service. With Ceph, not so much.
B: Alice is our client node for everything, but most of the stuff that we're running on alice you could also run on all of the other hosts, okay? And in /root on alice you will find, in total, four directories, and their names are 01-rados, 02-rbd, 03-radosgw and 04-cephfs, and that's exactly the order in which we're going to do them.
B: So we're going to start with RADOS, and the first thing that we can do is just get a list of the RADOS pools in here. And just so we don't waste a lot of time, and create a lot of friction, by us reading out or spelling out commands, we've decided to just put everything in neat little shell scripts that you can then review, to see what they're doing under the covers, and all of these run with "set -x", so you can actually see the commands as they're happening.
B: So the first thing that we're doing is getting a list of all the pools that we have in this Ceph cluster, and three of those are always available by default; they are named "data", "metadata" and "rbd". The data and metadata pools are for the Ceph file system, and rbd is for RADOS block devices. The others are there because we have created a RADOS Gateway in here for you already, and there's one that we're using for testing purposes, and because we're tremendously creative we've named it "test".
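The listing itself is a single command (the rados utility ships with Ceph):

    # list all pools in the cluster; data, metadata and rbd exist by default
    rados lspools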
B: What's the pool that we want to interact with? That's the -p option: test. We define what the client identity is that we want to use, and in this case it's the one that Ceph creates when we first install it, client.admin, and we define a keyring that we use to identify as that identity to the cluster. So in this case: rados, dash p, test, then the client name and the keyring options. By the way, we could leave all of that out except the "-p test", because the others are actually the defaults.
B: Okay, so by default you're connecting to the cluster with the admin identity, using /etc/ceph/keyring as the default keyring location. And then you put an object in there: "rados put", the object name, and then, what gets cut off at the end of the screen here, the actual content that we want to toss in there. So what we're doing is echoing "hello world" into this, and that gets piped in on standard input, and then we have created a Ceph object named "hello" in the test pool. And we can retrieve that again with, that's it, the get-object thingy. There we go again: rados, the usual options, get, the object name, and it just spits out the content of this object on the command line.
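Spelled out, the round trip looks like this (object and pool names as in the demo; a temp file is used here instead of the demo's stdin pipe):

    # store some content as object 'hello' in pool 'test'
    echo "hello world" > /tmp/hello.txt
    rados -p test put hello /tmp/hello.txt
    # fetch it back and print it
    rados -p test get hello /tmp/out.txt && cat /tmp/out.txt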
B: Okay, there are a few other things that we can do with objects. As we mentioned before, objects don't only have a name and some content; they can also have attributes. That's the next thing. So on this object there are two attributes: "foo" with the value of "bar", and "spam" with the logical value of "eggs". And you can do that with as many attributes as you want. You can define new attributes, you can set them, you can retrieve their value, you can update them, etc., etc.
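In command form (the rados xattr subcommands; attribute names and values from the demo):

    # attach two attributes to the object, then read one back
    rados -p test setxattr hello foo bar
    rados -p test setxattr hello spam eggs
    rados -p test getxattr hello foo     # prints: bar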
B: And then, finally, there is not only the possibility to put and retrieve an object; we can actually find out where a specific object is, and that is what we do with this map-object script. This is actually a two-step process; we've rolled it into one script. You retrieve what is called the OSD map, which is essentially the information about where all the OSDs are, and then you use the osdmaptool utility to figure out: okay, where is the object named "hello"? And then it tells us: okay, it's part of that placement group, and it is currently assigned to the OSDs 2 and 1, which happen to be the OSDs on frank and eric. Okay, so...
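The two steps, unrolled (the pool id here is an assumption; "ceph osd dump" shows the real one for the test pool):

    # step 1: grab the current OSD map from the cluster
    ceph osd getmap -o /tmp/osdmap
    # step 2: compute the object's placement from the map alone
    osdmaptool /tmp/osdmap --test-map-object hello --pool 3
    # reports the placement group and the acting OSDs, e.g. [2,1]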
B: So while I click in here and do a "ceph -w"... there we go. Currently, as you can see: 3 OSDs, 3 up, 3 in, everything wonderful and dandy. And then, after some time, we're going to see that one of those OSDs is going to be down. You killed it? Yep, there we go: 3 OSDs, 3 up, 3 in, for the moment.
B: "Up" and "down" just mean: is this thing actually alive, and is it responding on the network? And then we have "in" or "out", which means: is it available for actual data storage? What Tim did is kill one. Normally an OSD goes to the "down" status, and then it waits for about five minutes, and then it's actually "out", and then this thing that we call backfill and recovery happens. So what we can do now is...
B: ...lo and behold, this has been completely reassigned, right? So prior, it was on OSDs 2 and 1; now we killed one, and now it's on 2 and 0. Isn't that wonderful? So what Ceph does is: not only does it keep our data available as nodes become unavailable, but it actually restores the degree of redundancy that we have configured, completely automatically.
B: There are higher-level client layers that RADOS ships with, and that's what we're going to concern ourselves with in the last just-over-30 minutes. One of those high-level client layers is the RADOS Block Device, or RBD. RBD is a thin-provisioned block device that stripes data across multiple RADOS objects. So it's a block interface: everything that we write to this thing actually gets striped across multiple RADOS objects in the cluster, and then all of the distribution and replication and whatnot, all of that, happens at the RADOS level.
B: RBD doesn't have to care about this, which makes it a very, very thin layer on top of RADOS, compared to having to do everything, replication and HA, itself. RBD supports read-only snapshots; these snapshots are redirect-on-write, and they are super cheap, because everything is thin provisioned anyway, so we can use much the same facilities in order to provide snapshots, and that's really, really cool.
B: It supports cloning. Cloning means that we can designate a snapshot as a master copy of other, writable RBD images, which is cool, and with that it of course becomes very suitable for maintaining things like template-based virtual machines, which is why RBD is heavily used in cloud technologies like OpenStack and CloudStack and others; it's just very, very useful for this purpose.
B: Now, it actually comes in two flavors, not one. We have rbd, which is a kernel-level block device driver, and that made it into upstream Linux in 2.6.37. If I use an RBD that way, I just map it (it's called mapping), and then it becomes a virtual block device that pops up on my Linux box; it's /dev/rbd-something, and then I can use it just like any other old block device.
B: So let's take a look at this. Now, here is where the tutorial is a little tighter than it was going to be, because we uncovered a couple of interesting rbd bugs just yesterday, which Sage was nice enough to fix, but we just haven't gotten updated packages yet. So, next: chapter 02-rbd. There we go, and again we have a simple script there which we can use to just create an RBD.
B: There we go, very simple: we do "rbd", and then, and this is kind of nice, pretty much all (not quite all, but pretty much all) of the client userspace management utilities in RADOS support identical command-line options. So the -n for selecting the client identity is always identical, and the -k for selecting the keyring is always identical; the same thing is true for rbd. What we're doing here is creating an RBD image with a size of 512 megabytes, named "test", and then doing an ls of the same thing...
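The two commands at the heart of that script:

    # create a 512 MB, thin-provisioned RBD image named 'test'
    rbd create --size 512 test
    # list the images in the (default) rbd pool
    rbd ls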
B: ...to just show that it is, in fact, there. The block device as such is thin provisioned, which means that the space we're defining here is not taken up immediately; that is only the maximum space that it can eventually take up. So, for example, if I use the rados utility to list the rbd pool, so I'm using sort of a lower-level utility now to look under RBD's covers, what I can see here...
B
Is
that
there's
actually
just
two
radars
objects
that
have
been
created
in
this,
namely
the
RV
directory,
which
has
information
about
all
of
the
rbd
images
in
the
pool
and
a
header
object,
taste
rbd,
and
that's
it
only
as
I
write
to
this
device?
Do
we
actually
get
additional
objects
that
are
being
allocated?
Yes,.
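Looking under the covers, per the demo (format-1 image object naming):

    # only the directory object and the image header exist so far
    rados -p rbd ls
    # rbd_directory
    # test.rbd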
G: [inaudible question]

B: Does thin provisioning have a performance impact? No. Okay, and we can then map this thing; this is what we do with the 02-map.sh script. So here, again, we use the rbd command. In this case we actually have to modprobe the rbd module, and once we do that, it becomes available as a block device, /dev/rbd0, and also under /dev/rbd/...
B: Now, for example, we'll go ahead and make a file system on this thing. That would be mkfs -t xfs, yeah, like that, and then you just give it /dev/rbd/rbd/test, and it should hopefully happily create a file system for us. Did it? Done; the terminal is really slow, but that's okay. So we can use this as we would any other block device. The next step in the demo would have been a fancy snapshot, but that's where the bug is. Yes, we have a question?
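The sequence end to end (device paths as they appeared in the demo):

    # make the kernel driver available, map the image, then format it
    modprobe rbd
    rbd map test                    # appears as /dev/rbd0
    mkfs -t xfs /dev/rbd/rbd/test   # the /dev/rbd/<pool>/<image> symlink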
G: [inaudible question about mapping the same image from multiple clients]

B: Yes, I can; in fact, it is a bit over-permissive in that regard, in that there is actually no locking mechanism akin to "whoa, hang on" built into RBD. If you're putting something with its own lock manager on it, say an OCFS2 file system, you can perfectly well do that. What it doesn't do is something like SCSI SPC-3 persistent reservations; such a thing does not exist, and the question is whether it ever will, because RBD happens to be distributed, whereas all of the SCSI stuff is pretty much centralized. Okay, so...
B: So we have RESTful HTTP and HTTPS access to the object store, and Ceph does that through a FastCGI application named RADOS Gateway. RADOS Gateway itself uses the C++ API for RADOS, called libradospp, just in case you're interested, and RADOS Gateway runs in essentially any web server that supports the FastCGI interface. So you can run this with nginx...
B: ...if you choose to; you can run it with lighttpd; you can, if you're brave slash insane, run it with IIS, which I think supports FastCGI kind-of-sort-of; but the canonical way of doing this is, interestingly, with Apache and mod_fastcgi. For those of you who are Apache geeks: you will probably scream and shout at me that this is not the latest FastCGI implementation that's commonly recommended for use with Apache.
B: Instead, all of the data that it works with, even the stuff that it needs for itself to function, it stores itself in RADOS. So there is no local storage whatsoever, and that means that RADOS Gateway completely natively supports load balancing and scale-out. If we want to scale out over multiple entry points, we just add them: we just add one RADOS Gateway host, and then another, and then another, and then another, and we can use any sort of load-balancing facility that we would like to. One of the more popular ones...
B: ...is just plain round-robin DNS. You could theoretically use an IP load balancer, but then that might become a scalability choke point just the same, because now you're directing all your clients through one gateway that then load-balances across multiple RADOS Gateways; perhaps not the greatest idea. But if you're using DNS load balancing, so round-robin DNS, multiple DNS entries for the same name, then that is just perfectly fine. Now we're going to take a look at what that looks like.
B: So we have a RADOS Gateway set up on the same hosts, the same three hosts, that run the OSDs and the MONs. That's just because it's kind of convenient in this kind of demo setting; most typical setups in production would run RADOS Gateways on separate hosts, as many of them as you like. DreamObjects, which is about three petabytes of storage behind RADOS Gateway, the last I checked used, I think, four RADOS Gateway hosts; it's not like you need a huge number of them.
B: It's all HTTP and REST and whatnot; you can use whatever caching implementation you want to use. So if you want to do the same thing that, for example, Wikimedia does for their Swift stuff, which is that they're just strategically placing Varnish caches across the globe and they can run out of a single Swift repository that way: you could do the exact same thing with RADOS Gateway, because it's all just HTTP, and you can use the proxying facilities that we all know and love from HTTP.
B: So we're going to start out with some... oh: what we're doing to interact with this RADOS Gateway here is just a completely unmodified Amazon S3 client. It's using s3cmd, which is an Amazon S3 client that ships with Debian and Ubuntu, and presumably many other distros, and we can just use that to interact with the object store. So what we're first going to do is see: okay, what kind of buckets do we have on this thing?
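That first step is simply (s3cmd's standard bucket listing):

    # list all buckets visible to the configured credentials
    s3cmd ls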
B: And we deliberately turned on debugging for this thing, just so you can see what is actually coming down the wire. So, in this case, we are connecting to something that actually pretends to be s3.amazonaws.com. We do so with some cute little DNS tricks; no real trickery here, it's just that on alice we have a BIND, a named, that just hosts that zone, and then we go from there. And so we have three buckets here, and they are aptly named foo, bar and baz.
B: So what we're going to do now is check how we actually interact with this thing. There's a utility called radosgw-admin, and that's how we can define our users and permissions and access and things. So, in this case, we created a user with the beautiful name of John Doe; he works for example.com, which I hear is a multi-billion-dollar enterprise in the United States. And what we can define here are just regular old Amazon S3 access keys and secret keys.
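A sketch of that user-creation step (radosgw-admin flags per its man page; the uid is an example, not necessarily the one used in the demo):

    # create an S3-capable user on the gateway
    radosgw-admin user create --uid=johndoe --display-name="John Doe" \
        --email=john@example.com
    # the output includes the generated access key and secret key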
B: So, for all of you familiar with Amazon S3: that is how you interact with the storage space. And something else we've prepared: we have added credentials for John Doe for when he acts as a Swift user. So what we're going to do first is upload and create an object in there; we're uploading something to S3. Okay, so again with debugging and la la la, but what we did is just put something into the "foo" bucket, and it uses, essentially, bucket-based host names.

B: So there we go: we just downloaded our object named "spam". And as you can see, what we originally uploaded was the test.txt thing, and it said "hello world"; we uploaded that using Amazon tools into RADOS Gateway and Ceph, and we then retrieved it using a completely unadulterated Swift binary, doing a download request of that object named "spam" from a container named "foo", and, lo and behold, it has exactly the same content.
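Both halves of that round trip, roughly as run (the auth URL, host name and key are assumptions standing in for the demo's local setup):

    # upload via the S3 API...
    s3cmd put test.txt s3://foo/spam
    # ...and fetch the same object back via the Swift API
    swift -A http://daisy/auth -U johndoe:swift -K <secret-key> \
        download foo spam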
B: And you could do that from the other nodes as well, but that's okay. And then, finally, the stuff that you've all been waiting for, because most people tend to perceive Ceph as a distributed file system, which it is, among many other things. Now, in Sage's presentation yesterday, he listed basically all of the components in Ceph, librados, RADOS Gateway, RBD, as awesome, and the file system as "almost, but not quite yet, awesome". So this is considered experimental, which is kind of interesting, because it is what originally drove the development of Ceph.
B: I think it's fair to say the goal was to build a Lustre without its shortcomings, right? And there we go: a distributed file system on top of RADOS, and it's currently considered experimental, although it's been in the mainline kernel since 2.6.32. That in itself is no surprise, because btrfs has been in the mainline for quite a while, and it's still experimental. So that's okay, and you can use it.
B: That is just fine, and there are people that are running this in production, or at least claim to do so on the relevant IRC channels and mailing lists and whatnot. They are typically to be found in academia, which is no surprise, because they are also the typical Lustre users, and they're also looking for "Lustre without the suck", essentially. And Lustre has a few really painful shortcomings. Number one: it has a central data lookup; it has a metadata server, which is a scalability choke point, and it is a single point of failure.
B: You can do a little bit of high availability around that. It also does a few things in the kernel that Ceph does in user space, and a few other things. When I say "Lustre without the suck", that of course doesn't mean that Lustre completely sucks; quite the contrary, it's a very stable HPC file system, and it's been in use forever. It's just that it has a few of these shortcomings that people would like to address, and some of those people happened to be Sage and his team at UC Santa Cruz.
B: The Ceph file system is seventeen thousand lines of code, which is really, really tiny in comparison, and that is because all the others have to take care of lots and lots and lots of things by themselves that, in the Ceph file system, are just offloaded to RADOS. All the file data and the metadata itself live in RADOS objects, and to manage this metadata we have another type of daemon.
B
The
third
type
of
safe
demons
called
a
metadata
server
for
MDS
and
it
what
it
does
is:
filesystem
clients
and
fascism,
clients
only
no
RVD
clients,
nothing
that
uses
liberate
us
directly.
Nothing
that
uses
the
python
bindings
directly.
Only
the
file
system,
clients
actually
talk
to
this
metadata
server
and
the
metadata
server
also
caches
this
metadata.
For
for
clients
to
improve
performance,
it
runs
entirely
in
user
space.
The
MDS
that
is,
and
only
the
file
system
client,
runs
in
the
colonel.
So
we
only
have
two
components
really
insect
that
run
in
kernel.
B
One
is
the
kernel
rbd
device
and
one
is
the
kernel
set
file
system.
There
actually
is
also
a
fuse
client
force
F,
but
that
is
really
only
recommended
for
use
in
those
situations
where
you
are
on
a
system
that
does
not.
That
is
not
linux
and
therefore
does
not
support
the
file
system.
Client
that
does
support
fuse.
So
that,
for
example,
would
be
your
way
of
talking
to
a
safe
file
system
from
say
freebsd.
B
If
that's
what
you
want
to
do
now,
set
amounts
are
writable
from
any
client,
so
we
can
mount
them
from
as
many
clients
as
we
want,
and
they
are
of
course
readable
and
writable
from
them,
and
they
also
play
nicely
with
by
locking
all
the
file
locking
as
pretty
much
anything
in
Linux.
As
far
as
file
locking
is
concerned
is
advisory.
B
There
is
no
mandatory
locking,
so
applications
actually
have
to
ask
for
a
law
for
locks,
and
if
they
don't
get
it,
they
to
politely
wait,
but
if
they
just
don't
ask
for
locks,
but
just
barging,
then
okay,
that's
it.
But
that
is
how
how
file
locking
works
in
Linux
in
general
and
mandatory
locking
is
available
as
a
mount
option
in
just
about
any
file
system
and
just
about
any
file
system.
It
never
really.
B: And something that's really cool about Ceph, the Ceph file system, is that it supports arbitrary directory-level snapshots, and what that means is something we're going to get to in just a moment; it's a really nice way of doing copy-on-write directory trees. One thing that is currently unsupported is reflink. Reflink is a means of creating a copy-on-write duplicate of a single file. That is something that is supported in OCFS2, and it is supported in btrfs, where it's called cloning, and that is something that we can't do.
B: So we can't do snapshots of individual files in the Ceph file system, but we can do snapshots of directories, and we can also, obviously, do snapshots of the root directory of the Ceph file system, which means we're snapshotting the entire volume. And CephFS has really spiffy accounting and statistics that it exposes through virtual extended attributes. If you consider that you have, say, a three-petabyte cluster with hundreds of nodes, then finding out how much data is in a specific directory by running du is going to be a pretty intensive operation.
B: We are now going to reboot this thing, which, if you please, Tim. There we go; it should come back up, and when it comes back up... we know the drill, because we did this yesterday. So here is why the file system is experimental. We have several issues here: the Ceph file system development happens completely upstream, and it is actually sort of detached that way.
B: That is beautiful. So we should just show you that really, really fast: it has actually mounted, and now we're going to look at what's in there, in the /mnt directory.
B: ...how many files are in this entire volume, okay, all of them; and how many bytes are in there; and what is the most recent ctime in the entire tree; and so forth. So that's really cool, and it beats the hell out of doing du, or find, or whatever, over hundreds of nodes and petabytes of data. So that's a really kind of cool feature.
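Those recursive statistics are exposed as virtual extended attributes (attribute names per the Ceph documentation of the era; a sketch):

    # recursive stats on any CephFS directory, for free
    getfattr -n ceph.dir.rfiles /mnt   # number of files under the tree
    getfattr -n ceph.dir.rbytes /mnt   # total bytes under the tree
    getfattr -n ceph.dir.rctime /mnt   # most recent ctime in the tree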
B: We can also interrogate the file system itself about the underlying RADOS properties of things. In this case we're just doing a show_location on the /mnt directory node, and it tells us exactly what the RADOS object name is, and its size, and which OSDs it's stored on. And we can also ask similar things about the file layout, so it will tell us, okay, how many stripes we have for this particular file, and so forth, and so on.
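That is the cephfs helper utility that shipped with Ceph at the time:

    # RADOS-level placement and striping info for a file or directory
    cephfs /mnt show_location
    cephfs /mnt show_layout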
B: So that's really kind of neat. And what is our final thing? Oh, that's right, yeah, of course: snapshots. So what we also want to do is create a snapshot, and now this is really kind of interesting; hang on a second, sorry. So, what do we do in order to create a snapshot? In every directory...
B: ...we have this fancy little subdirectory called .snap, and if we just do a mkdir in that, we've magically created a snapshot. So what we can do now is remove everything under /mnt, which is going to take a few moments. There we go. Good, remove it. Yes. Hey, and now let's look at that again.
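The snapshot mechanics in full (the directory name is just an example):

    # taking a snapshot is a mkdir inside the magic .snap directory
    mkdir /mnt/.snap/before-cleanup
    rm -rf /mnt/*                    # now delete everything...
    ls /mnt/.snap/before-cleanup     # ...it is all still in the snapshot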
B: Come on, there we go... not quite there yet. Let's see; it should be rolled out of it. That's gone. That's one. That's gone.
B: In the box, Tim; shellinabox, yes, thank you. [inaudible exchange]
B: Come on... well, you know, it's four OSDs, I'm sorry, three OSDs, running on one laptop; it's probably not going to be fast.
D: [inaudible question]

B: No, that's not happening. You can't just magically migrate from RBD to the Ceph file system; it's completely different data, in different pools, and all sorts of things. If you're just using RBD, then you have no need at all for the metadata servers and whatnot. But if you then decide, with this beautiful Ceph cluster that I've built and run RBD and RADOS Gateway on, etc...
B: At any rate, you totally need to be geeked out now, because if you're not, you're positively soulless. Before we get to the questions and things, a bit of a thank-you section here. Thanks goes, obviously, to Sage and crew for Ceph. If you're wondering, because you had a question there: the presentation tools that we've been using are impress.js and shellinabox. Inktank was nice enough to let us use the Ceph logo, and all of the artwork is courtesy of Tim.
B: And the directory is kind of magical, because the .snap directory doesn't show up anywhere in the listing of the directory. And if you then did an "rm -r /whatever": uh-huh, oh, you'd have nuked the snapshot? No: the snapshots are actually read-only, so you can't just nuke them that way. If you want to remove the snapshot, that's an rmdir, and poof, the snapshot is gone.
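And removing a snapshot, correspondingly (same example name as above):

    # snapshots are read-only; deleting one is an rmdir of its .snap entry
    rmdir /mnt/.snap/before-cleanup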
B: Was there a question? Really?