From YouTube: New features in Ceph Firefly
A: Okay, so hello everybody, welcome to the Ubuntu Online Summit. We'll spend the next hour looking at new features in Ceph Firefly. We have John Spray from Inktank, and John's going to be running the session predominantly. If you've got any questions, please feel free to ask them via IRC; if you haven't got my attention, ask John directly and I'll inject them into the presentation. Okay, over to you. Thanks.
B: Okay, good. So I'm going to start by introducing Ceph for the benefit of anyone who isn't already familiar with it, then I'm going to talk about what's new in Firefly and our recent open-source release of the Calamari management tool for Ceph, and give you a little demo of that, if it all works. So Ceph is a large-scale distributed storage platform. It's known for being low cost, for being open source, and also for providing enterprise-grade support from Inktank. Inktank is now part of Red Hat; we're continuing to provide Ceph under new management.
They are responsible for communicating with one another and presenting a unified front to the application itself. There are two types of server in the cluster. There are OSDs, which stands for Object Storage Daemon; they usually correspond to physical disks, so you can, in a simplified view, think about an OSD as just being a hard disk with some software running on top of it. There are also monitor servers; there is usually a much smaller number of monitors, and they're responsible for keeping track of the state of the overall system.
Each OSD sits on top of a local file system, on top of a disk or other block device. That file system is XFS by default. In the previous session Sage was mentioning that in Firefly we've got the option to switch out from a file system to a dedicated object store. And the clever part of Ceph, the reason it was worth Sage getting a PhD thesis for it, is the decision-making about where objects live on the cluster.
So this is the challenge that any distributed storage system has to solve: given an object in hand and hundreds of thousands of servers, how do I decide where to put it, and when I want to go and get it later, how do I remember where it was? Some legacy systems will use a metadata server for this: they'll have essentially a database of the location of each piece of data, and when the application wants something, it goes and looks up the location and then goes and retrieves it from that server.
It's not hard to imagine that this becomes a bottleneck quite quickly at large scale. So it would be nice if we could do something different to that: calculated placement, where we can take the ID of an object and apply some algorithm to it, usually hashing, to choose a bin from a range of locations. That can work quite well, but we can do even better than that: for Ceph we use an algorithm called CRUSH.
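The simple hash-based placement just described can be sketched in a few lines of shell. This is a toy illustration of "hash the name to choose a bin", not Ceph's CRUSH algorithm, and the object name and OSD count are made up:

```shell
# Toy calculated placement: hash an object name into one of n_osds bins.
# Every client computes the same answer, so no metadata lookup is needed.
n_osds=8
obj="rbd_data.1234"
# cksum produces a deterministic CRC of the name
h=$(printf '%s' "$obj" | cksum | cut -d' ' -f1)
osd=$(( h % n_osds ))
echo "object $obj -> osd.$osd"
```

The weakness of plain modulo hashing is that changing `n_osds` remaps almost everything, which is exactly the instability CRUSH is designed to avoid.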
In CRUSH, we map each object location to multiple OSDs, and that allows us to provide redundancy via replication. So in this diagram you can see that the data we have is split up into these coloured blocks, and each block appears more than once on the right-hand side. In this example each block appears twice, which is two-way replication of the data. CRUSH provides a number of really important characteristics. It's a fast calculation, so the burden of working out where data is is fairly low. Even more importantly, it provides a stable mapping.
So when an OSD server goes away and we have to come up with new locations for the data that was on that server, in Ceph the number of pieces of data we have to move will be more or less of the order of how many were on the dead server, whereas some simpler methods used by other storage systems would shuffle a lot more data than was actually necessary. And it's configurable.
So within CRUSH we get to define rules such as "put two copies in this data centre and two copies in that data centre", or "put one copy on flash and two copies on spinning disk", and we can define those rules within Ceph by defining the topology of the data centre: telling it, here is a data centre, here are some racks,
here are some servers within the rack. That forms what we call the CRUSH map within a Ceph system. So I'll zoom back out again to put that into context: in, for example, an OpenStack environment, the RADOS Gateway interface, the object store, corresponds to the Swift interface within OpenStack. It also takes advantage of Keystone for authentication.
Ceph 0.80, or more casually Firefly, was released just a few weeks ago, and it's a major feature release. It's also important because it will be the basis of our next enterprise-supported version of the software. Cache tiering and erasure coding are the big ones in this release, and I'll talk about them now. Cache tiering is more or less what it sounds like: you get to use one piece of storage as a cache for another piece of storage; in Ceph, these are pools.
You can configure the size of the cache, for example how long an object remains in cache before it gets flushed out to the backing pool, and the mode of the cache is also configurable, between being a write-back cache and a read-only cache. In a write-back cache, writing to the cache pool appears exactly the same as if you were writing to the underlying pool.
Erasure coding is an alternative to replication in Ceph. The cost of doing replication can be high if you have a large system: paying for two or three times as much capacity as you have data. Erasure coding does essentially what a RAID 5 or RAID 6 disk setup would do: it splits the data into data chunks and parity chunks.
The reason erasure coding isn't simply used always and everywhere is that it has a different set of trade-offs to replication. While we use less storage capacity, and commensurately less bandwidth because we're simply writing less to disk, other operations can become more expensive. For example, when doing replication, if I want to read a piece of data I just have to talk to one of the replicas and I can pull the whole object back there and then, whereas with an erasure-coded object I'll always have to talk to multiple servers, in the case of Ceph multiple OSDs,
in order to gather that data, and so that generates some overhead. Because erasure coding relies on chunking data as well, modifications become more expensive: in order to make a modification, we have to read back a whole chunk, modify it, re-encode it, and write the whole chunk back. For that reason, in the current implementation of erasure coding in Ceph, modification isn't actually supported on erasure-coded pools; you use them in conjunction with a replicated cache tier, which does support modification, and then the modified objects get migrated back to the erasure-coded pool.
Reading back data, modifying it, writing it back: that's one of the reasons that if you're using high-end RAID controllers you'll often get advice about keeping your I/Os four-megabyte-aligned for reads and writes, because the underlying hardware is having to do these read-modify-write cycles for erasure-coded data. The way this is implemented in Ceph uses a plugin for the computation of erasure codes. The Firefly release ships with a single plugin, but that interface will be extensible in the future.
In the previous session Sage was talking about the work that's happening at the moment, for future releases, to come up with even better, more efficient erasure coding schemes. So, to show what these features look like in action, there's a simple diagram here showing a hot pool and a cold pool, which are red and blue.
The cache pool, the hot pool, is using replication, which is costing us a two hundred percent overhead compared with the amount of data we want to store. The cold pool is using erasure coding, and in this example I've got three data chunks and two parity chunks, which leads to a sixty-six percent overhead.
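Those two overhead figures can be checked with quick shell arithmetic, taking the 200% figure to mean three-way replication (two extra copies), which is an assumption about the example on the slide:

```shell
# Overhead arithmetic: 3-way replication vs. k=3, m=2 erasure coding
replicas=3
k=3   # data chunks
m=2   # parity chunks
rep_overhead=$(( (replicas - 1) * 100 ))   # extra copies as % of the data
ec_overhead=$(( m * 100 / k ))             # parity as % of the data
echo "replication: ${rep_overhead}%  erasure: ${ec_overhead}%"
```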
What this looks like, if you're a Ceph administrator setting up a system, is a relatively short list of new commands within Ceph. Just running through this example: we create an erasure code profile, which tells Ceph the K and M values we want to use, and optionally we could also specify which plugin we want to use here. We create the cold pool using that profile we've just created, then create the cache pool; there are no special options there at all.
That's just a normal replicated pool of the kind that's existed in Ceph forever. And then the new tiering commands let us, firstly, add the cache pool as a tier of the cold pool, set the mode to write-back, and then set the overlay option, which means that any writes a client sends to the cold pool get transparently redirected to the cache pool.
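Those steps map onto a short sequence of `ceph` commands, roughly as follows. The pool names and PG counts here are illustrative, and the commands need a running Firefly cluster, so this sketch skips itself when no `ceph` CLI is available:

```shell
if command -v ceph >/dev/null 2>&1; then
  # 1. Erasure code profile carrying the k and m values
  ceph osd erasure-code-profile set ecdemo k=3 m=2
  # 2. Cold pool created with that profile; hot pool is plain replicated
  ceph osd pool create cold-pool 64 64 erasure ecdemo
  ceph osd pool create hot-pool 64
  # 3. Tiering: hot-pool becomes a writeback cache in front of cold-pool
  ceph osd tier add cold-pool hot-pool
  ceph osd tier cache-mode hot-pool writeback
  ceph osd tier set-overlay cold-pool hot-pool
  result=configured
else
  result=skipped   # no cluster here; commands shown for illustration only
fi
echo "$result"
```

After the `set-overlay` step, clients keep addressing `cold-pool` and the redirection to the cache is transparent to them.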
When Ceph distributes data across several OSDs, one of those OSDs will be acting as the primary at any given time, and that primary is the one that services reads of the data, in general. So what primary affinity lets you do is specify that you want certain OSDs to essentially act as the read servers for data.
So if you have some servers which have more network bandwidth or RAM, or just faster storage devices, you can give a hint to Ceph that you would rather use those for your read workload.
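A minimal sketch of giving that hint, where the OSD id and weight are example values (and note, as an assumption about the Firefly defaults, that the monitors may need `mon osd allow primary affinity = true` before accepting it):

```shell
if command -v ceph >/dev/null 2>&1; then
  # Lower the chance that osd.2 (say, a slower disk) is chosen as primary.
  # Primary affinity is a weight in [0, 1]; the default is 1.
  ceph osd primary-affinity osd.2 0.5
  result=applied
else
  result=skipped   # illustration only without a cluster
fi
echo "$result"
```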
As was mentioned before, the OSD backends now include object stores and key-value stores as well as file systems, and the RADOS Gateway now includes a built-in web server to simplify deployment, compared with integrating with Apache.
The other big new thing in Ceph recently is Calamari. That's something we built at Inktank for our enterprise product, and shortly after we joined Red Hat we were able to open-source Calamari, which we're really excited about. So Calamari is now available under an LGPL licence. Calamari is a high-level interface to Ceph: it includes a user interface and it also includes a REST API, and the idea is to make both using Ceph and integrating other systems with Ceph much easier and more accessible than it has been in the past.
Okay, so hopefully that's coming through. The screen you see right now is the dashboard. That's the first page you see when you connect to Calamari, and it gives a rundown of the status of the system. It's showing me the overall health of the system, which thankfully is OK right now; it's telling me that all of my OSDs are up and all my mons are in quorum.
It's telling me how many pools I've got, and in the middle, that sort of green lump is showing me my PG status; if any of my placement groups were not in a good state then we would be seeing differently coloured segments there. Along the bottom I have a plot of the rate of operations on the system. The graph on the bottom left is actually not block ops,
it's RADOS operations per second, so that's showing me the throughput on this cluster, which just has a simple write workload running against it. And there's a gauge with my usage and how many servers I have in the system. So that's your top-level status view of the system, the kind of thing you might put up on a screen in the NOC. We also provide a more detailed view of the OSDs.
So if you have placement groups which are in a bad state, you can use the switches here to filter the view down to the OSDs containing those placement groups, in order to do further diagnostics on those particular OSDs. Calamari also provides a graph view, both of cluster metrics from Ceph and also of the underlying servers. So, for example, I can click through to one of these servers and see the IOPS on a particular block device.
Similarly the capacity of devices, CPU load, all the kinds of things you would expect from a typical monitoring setup, but provided wrapped up and integrated around your Ceph cluster. This is a fairly early iteration of the software; we think there are lots of possibilities for doing interesting things here, such as correlating statistics: correlating the IOPS on your overall cluster with the IOPS on particular storage devices, or the throughput on the system with the CPU load on different servers.
What's actually behind this is a standard Graphite deployment, and anyone using the Calamari API can go ahead and query whatever stats they want. There are actually more stats here than are exposed in the user interface, and you can use that to hopefully do interesting new things with the system. The user interface we've provided is a starting point, and we're really excited to see what people do with the API.
There we go: it's now showing me that most of my placement groups are in a recovery state. Ordinarily in Ceph, if you had an OSD fail, you wouldn't expect so many of your placement groups to be affected, but this is just a very small cluster with three servers, so the failure of one OSD has quite a big impact across the system.
So it's only been open source for just under two weeks, but we're getting a fair amount of attention on the mailing list from people looking to get involved with the guts of it. So we provide this REST API, which means you can construct fairly intuitive URLs, using the FSID and the IDs of OSDs, to go and get detail about objects in the system.
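For example, a query along these lines. The hostname, FSID, and OSD id are placeholders, and the `/api/v2/...` URL shape is an assumption about Calamari's API rather than something stated in the talk:

```shell
# Hypothetical Calamari REST API query for one OSD's details.
host="calamari.example.com"
fsid="de305d54-75b4-431b-adb2-eb6b9e546014"   # placeholder cluster FSID
url="http://${host}/api/v2/cluster/${fsid}/osd/0"
echo "GET $url"
# Against a real Calamari server you would fetch it, e.g.:
#   curl -s "$url"
```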
This is a little bit like the output you would get from running a `ceph osd dump` command-line operation on a Ceph cluster, but it's providing a slightly higher-level view, and it's also providing a little bit more information. So in this example, we can tell the API consumer which commands are currently valid to be run against this object, and that's the kind of thing that's really useful if you're building a user interface and you need to know which buttons should be presented for a particular OSD, and that kind of thing.
For example, when you create a pool at the Ceph command line, it will receive the operation to start creating the pool and then tell you "success"; you'll go about your business and just hope that creating the placement groups for that pool actually happened in the background. Whereas in Calamari, when you ask to create a pool, it will monitor that operation for you and let you know if any of the subsequent downstream operations went wrong.
We also provide a more consistent view than you would get in a sort of hand-set-up, Nagios-type environment. When you do operations via the Calamari REST API, your modifications to the system won't be marked as complete until the monitoring data is updated to reflect the changes to the system.
That sounds like a slightly nitpicky attribute for a system to have, but it's actually really important if you're building a user interface or a third-party script that's going to, for example, create a pool and then go and look at the status of the system and expect that pool to already be there. It's very important for those two things to be in sync, so there's a big advantage to having an integrated management and monitoring platform like Calamari, rather than stringing together separate tools.
And finally, operations done through the Calamari REST API give you an asynchronous execution model, so you don't have to maintain an open connection: you're given a handle, and you can go and ask about the status of that handle subsequently. That's especially important if you're doing a long-running operation.
The API is fairly well documented; there's HTML documentation available at calamari.readthedocs.org. Taken together, the semantics of the higher-level operations provided by Calamari, the accessible REST API, which just gives you a simple JSON interface, and the documentation we've provided for it make creating management tools for Ceph much, much easier than it used to be. The screenshot here isn't actually of the Calamari user interface.
You could kind of already accomplish this using the interfaces directly to Ceph, but using the Calamari API is a huge time-saver, and we're hoping this will really grease the wheels for people to do new and interesting things with Ceph and its interactions with other third-party software. Under the hood, Calamari is a Python application. The user interface is built on top of the same REST API that's exposed to third-party applications, and that REST API is calling through to a Python service which is responsible for any of the stateful activities
on the system, for example keeping track of ongoing operations. That service is using SaltStack as an execution and orchestration mechanism across the cluster. SaltStack is best known as a sort of competitor to the likes of Chef or Puppet, but the strength of it that we see for Calamari is its message bus and remote execution framework. Any time you see something happening on a remote server with respect to Calamari, that's happening via Salt.
The REST API also sometimes calls directly into Salt: if there are synchronous operations which don't need any status tracking, we just call directly through. That allows the main Calamari service to stay really lightweight, and we can avoid any throughput bottlenecks there. The REST API also calls directly through to a Graphite service for all the time-series data. Down on each server we have the Salt minion, which is essentially the agent that SaltStack uses, and Diamond, which is a statistics collection package.
So we're quite pleased that we haven't gone and written our own agent here; we're just using existing third-party open-source agent software in the form of Salt and Diamond. And just to call out some of the technologies in use here: I just mentioned SaltStack, Graphite and Diamond; we're also making heavy use of ZeroMQ; there's a Postgres database in the Calamari server; and the REST API and user interface stuff is using a very typical Django and AngularJS stack.
The downside to the way Calamari is constructed from so many third-party components is that packaging is more painful than it would be for a more monolithic application, and that's an area where we'd be really happy to have any help anyone can offer. At the moment, packaging in Calamari is a little bit hairy.
Finally, on the extensibility of Calamari: I mentioned we hadn't written our own agent. That means that if you want to extend Calamari to get more data from the servers being monitored, you can write modules against the existing interfaces of Diamond and SaltStack to do that, so you don't have yet another plug-in interface to learn. Having written plugins like that, you would then go ahead and modify the central Calamari server itself to do something interesting with whatever data you were pulling back, and hopefully that would one day turn into a pull request against our upstream project.
So that's all I have for this session. I just wanted to note that the Ceph Developer Summit is happening in a few weeks, and anyone who's interested in getting involved in the development of Calamari should definitely come along to that, as should anyone who's interested in Ceph more generally. Calamari is a very young open-source project, having been out for less than two weeks.
A: Thanks John, excellent. I had a couple of questions actually that sprang to mind as we were going through, specifically around the new erasure coding and tiering functionality. So if someone's running an existing Ceph deployment and they've upgraded to Firefly, can they easily add an erasure-coded tier to their existing pools? Is that a supported operation?
B: I don't believe it is, no. I think that if you've provisioned some erasure-coded pools and some cache tiers on top of them, you would need to move the data from the old pools to the new ones, either at the layer of the application, so RBD or what have you, or you would need to do that directly at the librados layer. I don't think we currently support doing it transparently.
A: Okay, cool. And the other question I had: there's quite specific guidance around placement group sizing based on numbers of OSDs and so on for standard pools. Is there any documentation or reference for sizing the erasure coding functionality on the Ceph website? I haven't seen it, which is why I'm asking.
B: I think it's mostly up now; it's just been in the past couple of weeks that those bits have been getting updated. So the guidance for how many placement groups you should have in a pool, and that kind of thing, I think has very recently been updated, and if it hasn't, it definitely should have been, so that there's a new rule for how you pick your number of placement groups.
There's another limitation there, which is that when you're using a cache tier pool, you can't change the number of placement groups in a cache tier after you've started using it as a cache tier. That's usually less of an issue, because your cache tier would usually be a smaller slice of your storage, rather than the big pool that you were going to grow.
A: Cool, thanks John. Let me just check if there are other questions... I don't think there are. Okay, right, we can probably wrap up and grab 25 minutes back of our lives. So thanks very much for your time, John, really really good. And tuning in: I think we break in 25 minutes for lunch, and then it's back for a final couple of hours of sessions, including the [unclear] in the last hour. Okay, great, thanks very much. Bye bye.