From YouTube: Red Hat / Inktank Ceph Day Sessions, Jeff Darcy, Red Hat
Description
Ceph Day Boston 2014
http://www.inktank.com/cephdays/boston/
Okay, hi everyone. So I'm going to be talking about a kind of insane thing that I did immediately after the Inktank acquisition, which was actually an idea I'd had about two years ago, about a really different way to possibly combine the Gluster and Ceph technologies. And I have a couple of points that I want to make before I really start diving into it.

First off, this is science. This is not engineering; it's science in the sense that it's about discovering what the world is like, what things are possible.
It's not trying to apply that knowledge to serving any particular need, and it's not part of a roadmap. I particularly have to mention that, because there's some buzz going around, some FUD being spread by people who have vested interests in alternative technologies, about how, since the acquisition, we're going to take Gluster and Ceph, hack off bits of both of them, and leave them bleeding in the gutter. That's really not the case.
There's no particular plan, and my boss here can back me up on this, to merge the two technologies. That is kind of what I'm doing with this experiment, but it's just an experiment. What I'm trying to learn from this is a couple of things. I'm trying to learn something about the librados API, because I'm curious, and I'm trying to maybe tease out some information about which components within each stack are contributing to performance, to failure handling, and to other things.
One way to do that is to sort of mix and match the pieces and see what changes relative to their origins, and partly it's just that when you combine things in strange ways, you never know what information is going to fall out. So we actually did find a couple of fairly interesting things that deserve further investigation and which may end up ultimately benefiting both projects.
I did a little project at one point, while I was at Red Hat, just taking every distributed file system I could find and running it through the same set of workloads. Just a little side note: it was pretty sad. About half of them crashed, a few of them couldn't even build, one corrupted data, one hung. Ceph was actually one that managed to make it through the exhaustive test of trying to write ten files simultaneously and then read them back.
So it was pretty pathetic. Gluster was another, and I think XtreemFS was the only other one that actually managed to do that. Now, remember, that was three or four years ago, so they've gained some maturity; maybe they could write twenty files simultaneously now. So yeah, I'm just the kind of guy who's going to take this opportunity to try and mash together two supposedly disparate technologies. In fact, who here is familiar with both Gluster and Ceph to any appreciable degree at all?
So you people probably realize that we have a lot more in common than what separates us. Sure, we took some different implementation approaches, but they're both the same sort of scale-out distributed systems, and we're using some of the same algorithms, or at least algorithms that are cousins to one another. So it's actually not that weird to be combining them. Now, I know this is Ceph Day, not Gluster Day, but I'm going to want to explain what it is that I'm doing here.
I do need to explain a tiny bit about how Gluster works. The core concept in Gluster is the translator. As one of the founders of Gluster likes to point out, probably to his detriment, the idea actually came from the GNU Hurd operating system project. The idea is that a translator is a module that takes in an I/O request and then spits out an I/O request in exactly the same form: a create comes in at one end and a create goes out at the other end, so it's a kind of filter.
A translator can also fan out requests, it can do routing, it can do things like that. But the key thing is that the interface above and the interface below are exactly the same, so you can stack translators in all sorts of orders, move them across the server/client boundary, and so on. This is just how Gluster is implemented, and so most of the functionality that in Ceph you would see as one piece is actually split out into different translators in Gluster, which are loaded separately and in some cases developed separately.
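To make that concrete, here is a minimal sketch, not taken from the talk, of what a do-nothing pass-through translator looks like: the writev it receives from above is wound down to its child in exactly the same form, and the reply is unwound back up. The names follow the public GlusterFS translator headers, but treat the details as illustrative.

    #include "xlator.h"
    #include "defaults.h"

    /* Reply path: hand the child's answer straight back up the stack. */
    int32_t
    passthru_writev_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         int32_t op_ret, int32_t op_errno,
                         struct iatt *prebuf, struct iatt *postbuf, dict_t *xdata)
    {
        STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno,
                             prebuf, postbuf, xdata);
        return 0;
    }

    /* Request path: same operation in, same operation out, one level down. */
    int32_t
    passthru_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
                     struct iovec *vector, int32_t count, off_t offset,
                     uint32_t flags, struct iobref *iobref, dict_t *xdata)
    {
        STACK_WIND (frame, passthru_writev_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->writev,
                    fd, vector, count, offset, flags, iobref, xdata);
        return 0;
    }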
So what this diagram shows is that we have some translators on the left, which are the ones that inject requests into the system. Our primary one is FUSE; that's our native protocol. It's talking to the FUSE driver, taking requests, and turning them into translator requests.
NFS is another one that does this. Then we have something called libgfapi, which is a library interface that is also capable of generating all these requests and pushing them through the translator stack. So, for example, the Samba integration is moving towards using libgfapi, and the block device integration is using libgfapi; you can do all sorts of things. It's mostly like libcephfs, though there are some differences.
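For anyone who hasn't seen libgfapi, here is a rough sketch of what a caller looks like; the volume and server names are made up, and the point is only that these calls feed the same translator stack that FUSE does.

    #include <glusterfs/api/glfs.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main (void)
    {
        glfs_t *fs = glfs_new ("testvol");              /* hypothetical volume name */
        glfs_set_volfile_server (fs, "tcp", "server1", 24007);
        if (glfs_init (fs) != 0) {
            perror ("glfs_init");
            return 1;
        }

        /* From here on, requests go through the normal translator stack. */
        glfs_fd_t *fd = glfs_creat (fs, "/hello.txt", O_RDWR, 0644);
        glfs_write (fd, "hello\n", 6, 0);
        glfs_close (fd);

        glfs_fini (fs);
        return 0;
    }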
So those are all the ways of getting stuff into the system. Then you go through a bunch of translators, represented here by the dot-dot-dot, which are things like write-behind, read-ahead, locks, and all sorts of other stuff that's not particularly relevant here. And then the two big boys. DHT is our distributed hash table; it's how we do distribution across many servers, so it's basically performing a routing function: it gets a request in from the user and sends it to one of its children, which is one of the bricks.
One little thing that I'm never going to do again is use "advanced" in the name of any code that I develop, because five years later it's going to look silly; but at the time it was advanced. So this is AFR, our replication module, which takes a request in and sends it out to all of its children. In this case each AFR instance is sending it to two bricks, and the DHT instance is sort of round-robining, or rather randomly hashing, among those AFR instances.
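Just to illustrate the routing idea, here is a standalone toy, not DHT's real code (which assigns hash ranges per directory rather than taking a plain modulo): hash the file name and use the result to pick one of the subvolumes, each of which here would be an AFR pair.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy hash-based routing: map a file name onto one of n_subvols children. */
    static unsigned int
    pick_subvolume (const char *name, unsigned int n_subvols)
    {
        uint32_t hash = 5381;                       /* simple djb2-style hash */
        while (*name)
            hash = hash * 33 + (unsigned char) *name++;
        return hash % n_subvols;                    /* index of the chosen subvolume */
    }

    int main (void)
    {
        printf ("file.txt -> subvolume %u of 4\n", pick_subvolume ("file.txt", 4));
        return 0;
    }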
So this leads to a couple of differences between how GlusterFS works and how Ceph works. In Gluster we have one role in the I/O path: the brick. That's the only thing; there's one process on the servers doing one thing, owning its own data and metadata. In Ceph, in the I/O path, using the file system of course, you actually have two different roles, which may be distributed differently among the nodes in the system.
Then, of course, you have the mons, and on the Gluster side you have glusterd to do the management types of stuff. But it's a fairly fundamental difference, because it affects a lot of the performance characteristics, both good and bad, on both sides. There are some operations where having it all be one role is good: if you're going to do an operation that affects both data and metadata, it's actually really kind of nice.
So what did I decide to do? I decided to combine GlusterFS and RADOS. Not CephFS; that would be a much more challenging kind of thing to try and do. Just to see what happens to our data I/O performance and other behavior when we do that. So data I/O is the only thing that's really getting sort of snipped off from the Gluster world and shunted off into a RADOS cluster.
So what we see here is that we've got the same things on the front end: we've got FUSE, NFS, and so on. They all inject requests into the system, and those go through some of the higher-level translators. I actually had to leave a couple out, because the system wasn't really working correctly with them in place.
Then it comes down to glados, and all that's doing is asking: is this a read or a write of file data? Yes? OK, we're going to shove it off into this RADOS world, and we don't care what happens to it after that. Now, what happens to it after that, as you all know, is that it goes through all of the RADOS distribution, replication, possibly erasure coding, and so on. Anything else, any of your inode operations, your directory entry operations, and so on, is staying in the Gluster world.
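A minimal sketch of that split, with assumed names rather than the actual glados source: data fops get handed to RADOS, while metadata fops are wound down the normal Gluster stack untouched.

    #include "xlator.h"
    #include "defaults.h"

    /* Hypothetical helper standing in for the librados call; it is assumed
     * to unwind the frame itself once the RADOS operation finishes. */
    int32_t glados_write_to_rados (call_frame_t *frame, xlator_t *this, fd_t *fd,
                                   struct iovec *vector, int32_t count, off_t offset);

    /* Data path: a write of file data goes to the RADOS world. */
    int32_t
    glados_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
                   struct iovec *vector, int32_t count, off_t offset,
                   uint32_t flags, struct iobref *iobref, dict_t *xdata)
    {
        return glados_write_to_rados (frame, this, fd, vector, count, offset);
    }

    /* Metadata path: inode and directory operations stay in the Gluster world. */
    int32_t
    glados_mkdir (call_frame_t *frame, xlator_t *this, loc_t *loc,
                  mode_t mode, mode_t umask, dict_t *xdata)
    {
        STACK_WIND (frame, default_mkdir_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->mkdir,
                    loc, mode, umask, xdata);
        return 0;
    }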
So when we create a file in Gluster, we actually create the Gluster file and we create a corresponding RADOS object, so we can alternate between using those two. And then there's a little bit of strange stuff. For example, if you want to know the file size, well, that's not actually correct over in the Gluster world, so we have to go query the RADOS object. But basically, that's the idea behind it.
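A sketch of that size fix-up, again with assumed helper names: since the bytes live in RADOS, the size the Gluster brick reports is meaningless, so a stat has to ask librados for the object's real size.

    #include <rados/librados.h>
    #include <stdint.h>
    #include <time.h>

    /* Ask RADOS for the object's size; the caller would copy it into the
     * iatt it hands back up the translator stack. */
    static int
    glados_object_size (rados_ioctx_t ioctx, const char *oid, uint64_t *size_out)
    {
        time_t mtime = 0;
        int ret = rados_stat (ioctx, oid, size_out, &mtime);
        return (ret < 0) ? ret : 0;      /* negative errno from librados */
    }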
And, of course, I did it in Python. OK, not really, but I could have, actually, because in the Gluster world we have something called Glupy, which is a way of writing these translators in Python instead of C, and it's actually really nice for prototyping crazy ideas. I didn't actually do that, but it was a fun thought that came up at dinner with some of the Ceph guys. Here's a code sample of our incredibly ugly Gluster C code. This is part of glados, by the way; how many people recognize glados as an acronym?
So this is a pretty typical piece of Gluster code, except for the fact that it is calling out into a completely alien distributed file system. We call rados_read, so this is part of our read path; we check the return value and we do a couple of other things. Yes, we heavily use goto in our code base, I'm sorry. And then the STACK_UNWIND_STRICT is basically how we pass stuff back to the translator that called us. So, I mean, there's nothing terribly interesting about this.
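From that description, the read path on the slide presumably has roughly this shape; this is a reconstruction from what was said, not the verbatim slide, and the structure and variable names are assumptions.

    #include "xlator.h"
    #include <rados/librados.h>
    #include <stdlib.h>

    /* Hypothetical per-translator state; not the real glados structure. */
    struct glados_priv {
        rados_ioctx_t ioctx;
    };

    int32_t
    glados_readv (call_frame_t *frame, xlator_t *this, fd_t *fd,
                  size_t size, off_t offset, uint32_t flags, dict_t *xdata)
    {
        struct glados_priv *priv     = this->private;
        const char         *oid      = "<per-file object name>";  /* assumed */
        int32_t             op_errno = 0;
        struct iovec        iov      = {0};
        struct iatt         stbuf    = {0};

        /* Real code would take this buffer from the Gluster iobuf pool
         * and attach it to an iobref before unwinding. */
        iov.iov_base = calloc (1, size);

        /* Call out into the "completely alien" storage system. */
        int ret = rados_read (priv->ioctx, oid, iov.iov_base, size, offset);
        if (ret < 0) {
            op_errno = -ret;
            goto err;                    /* yes, goto; see above */
        }
        iov.iov_len = ret;

        /* Pass the data back up to the translator that called us. */
        STACK_UNWIND_STRICT (readv, frame, ret, 0, &iov, 1, &stbuf, NULL, NULL);
        return 0;

    err:
        free (iov.iov_base);
        STACK_UNWIND_STRICT (readv, frame, -1, op_errno, NULL, 0, NULL, NULL, NULL);
        return 0;
    }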
There was nothing terribly difficult about it either; I did most of this sitting between sessions at the OpenStack Summit, in fact. But this is kind of typical Gluster code. It's not all that different, and I didn't have to do major surgery on anything. It's just another translator that does something slightly different in one place, and it turns out that the librados interface was not hard to work with.
But of course, I was only doing the easy thing. I was only doing file data; there are all sorts of really hard things that I wasn't doing. I believe somebody wrote a PhD thesis about how to solve some of these problems. So metadata, and especially directories: I don't do anything with that. I just let Gluster handle it as Gluster has always handled it.
If anybody really did want to make this real, which I still think is kind of crazy, they'd have to solve that problem. There's a whole lot of server-side functionality in Gluster that gets basically bypassed when we do this, functionality that would normally be observing the I/O as it goes past; and now, since we've shoved the I/O off into RADOS, it doesn't see it. So that would be a problem. And then, of course, there are performance issues.
The blue line is Ceph, so Ceph won, yay, and the green line is Gluster, so it's trailing behind. And the first thing that leapt out at me is that the glados version, of which I had two variants, is significantly behind at the lower thread counts; it's delivering relatively little I/O, and I don't know why. It does eventually catch up at higher thread counts, but it seems to have a little bit of a start-up latency issue. Well, not start-up exactly, but somehow the single-thread throughput just isn't that great.
So that's something that's worth investigating. Now, the two variants here: the yellow one is using AIO, asynchronous I/O, so I'm issuing all these reads and writes through the asynchronous interface and not sitting around waiting for them. The other one, the orange line, is "multi", where I'm actually using multiple librados contexts, so I'm basically doing a sort of Gatling-gun, round-robin thing between them.
So those are two different ways to get a little bit of parallelism among all the requests, because just using one context with synchronous I/O was, you know, kind of awful, and that wouldn't have been very interesting at all. So this is a fairly interesting result.
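For reference, a sketch of what those two variants look like in librados terms; the structure and names are assumptions, not the measured glados code. The AIO path issues the read through a completion and returns immediately, and the "multi" path round-robins requests across several independently created io contexts.

    #include <rados/librados.h>

    #define N_CTX 4

    struct glados_priv {
        rados_ioctx_t ioctx[N_CTX];    /* "multi": several independent contexts */
        unsigned int  next;
    };

    /* AIO: issue the read and return; the completion callback fires later
     * (and is responsible for unwinding and releasing the completion). */
    static int
    issue_aio_read (struct glados_priv *priv, const char *oid,
                    char *buf, size_t size, uint64_t offset,
                    rados_callback_t on_complete, void *arg)
    {
        rados_completion_t c;
        int ret = rados_aio_create_completion (arg, on_complete, NULL, &c);
        if (ret < 0)
            return ret;

        /* Gatling-gun round-robin across the librados contexts. */
        rados_ioctx_t io = priv->ioctx[priv->next++ % N_CTX];

        ret = rados_aio_read (io, oid, c, buf, size, offset);
        if (ret < 0)
            rados_aio_release (c);
        return ret;
    }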
A much more interesting result was when I started looking at latency. Now, these are 4k synchronous random writes.
What's up with that? I don't know. So that's telling us something: the fact that it's mostly tracking the Ceph numbers means the glados numbers are dominated by the RADOS performance behavior, but there's some little difference in there that's also probably worth investigating. This is exactly the kind of thing that I wanted to tease out of this information: to see where things don't line up in a fairly obvious way. Here we've got this little anomaly where the low-thread-count numbers are just strangely low.
So let's look into that. You know, here we have numbers that are just a tiny bit higher than CephFS; why? This was all on Firefly, by the way. So it's kind of interesting to look at these things, and really, that's about it. It's a fairly young sort of thing that I've been playing with, and if anybody else wants to grab the code, I haven't actually gotten around to pushing it anywhere.