Ceph Conferences, 11 May 2017

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Ceph Project Update

Description

A full project update on current work, future roadmap, and project contributions.

A

All right welcome to the final session of the Ceph date track here at open source days, go ahead and take your seat when we get started just a reminder, if you do have questions either in the middle or at the end, to make sure you use the microphones so that the recording that we have going for posterity will will pick you up as well as the rest of us in the room. Our our next and final speaker is the creator and current project lead force.

A

F sage well he's going to talk about some of the more recent work. That's been going into the upcoming release of luminous and a few other things that we have going on so sage. Thanks.

B

Can everyone hear me? Okay? Yes, all right! My name is sage while and the Ceph project lead I work at Red, Hat in the office of Technology and oversee Ceph development, and so on so I'm going to talk a bit about the release, that's about to come out luminous. What's in it, what's cool and then I'm going to talk about what we're working on after that and then a bit about contributor stats and so on. So just a level set set does a regular release cadence, we do normally do releases every six months.

B

Every other one is an LTS, which means we do backboards for bug, fixes and so on. Luminous is about to come out as supposed to be spring. It's kind of sort of another month or two before it's out, so it'll be I, guess early summer, we're a little bit behind the next release after that is going to be mimic which will be an on LTS in the fall or maybe winter.

B

If we continue our sort of laggard pace and then the one after that will be the end, release probably Nautilus, but the name isn't really finalized. Yet they're they're all names of cephalopods and they go by letters increasing, so luminous is luminous squid, which is beautiful.

B

You go look at Google, Images and wonderful, so on so lots of good stuff coming luminous, it's going to be really good release, I'm very excited about it. The biggest one, the biggest piece that I'm most excited about, because I worked on it primarily was blue store. Blue store is going to be stable and luminous, and it's going to be the default back in for the OSDs. So a big big milestone bluestar consumes a raw block device, in contrast to our sort of legacy file store, which consumes ekits XFS.

B

We use proxy B internally for metadata, but it's all sort of packaged up one big thing that we control it's very fast on hard disks, roughly twice as fast, both for large iOS and small iOS for regular SSDs. It's also faster than file store, maybe more like one-and-a-half times, but the varies on your workload for nvme. It isn't that different than file store, because the mdme isn't actually the slow part.

B

We have other issues to deal with the optimizing stuff itself, so it uses less CPU, but blue store is the future and that's that's sort of where we're trying to get get to. More importantly, it sort of gets rid of all this legacy stuff that we had with valve store. So all these weird performance anomalies that you wouldn't notice until you had strange workloads, try to go away because we are less stupid. Mostly so blue store has full data check zones on everything.

B

So every time you read any data from the disk, it gets checksum verified, so you won't get sort of bad errors. It also does in like compression, was the love or snappy, which is nice, and it's it's going to be the stable thing. Lots of people contributed to it as a group effort and we're very excited to finally have it done and out there I'm going to show a few quick performance plots.

B

These are showing large and small right, iOS random, writes on trying throughput and latency and the whole bunch of development branch is sort of the top. One is the one that all got merge. These are a little bit old, but it's roughly twice as fast for large and small. You sort of the takeaway same thing, similar picture when you mix reads and writes the reads: aren't necessarily twice as fast, so it's a bit of a blend there.

B

But, more importantly, if you look at sort of the the aggregate workload not just sort of a micro benchmark, things like radio Skateway that are doing index updates on the bucket indices that are stored in Ray DOS. There were all these weird annoying things that file store had to do in order to make that work properly and be consistent and safe. Well, the store does it much better, and so the performance improvement is more like 3x or 4x, depending on what your workload is, because the ditch just isn't has done this.

B

It used to be so we're very excited about that, and, as a consequence of all that, this also enabled us to do another big feature, which is a racial code, support for ratos block device. Finally, so that the key missing piece before was erasure, coated pools didn't support over rights of existing data. Now they do so, you can put a block device on top of those objects.

B

It requires blue store in order to perform, because we have to do a two-phase commit with the official to be able to rollback and that it's implemented on the file store, but it's horrifically slow and the other thing is that we rely on the checksums in blue store in order to do the deep scrubbing, and so, if you're, using file store with the easy overrides, you can't deep scrub. It doesn't actually verify anything. Let's just go read the data, so blue store and easy overwrites are to go together in luminous. So it's there, it's good!

B

It's a significant improvement in efficiency over 3x replication. Even if you use a very narrow code like a 2, + 2 or a 4 + 2, it's like a factor of 2x or a 50% cost improvement, much less storage, so you have to buy.

B

It's not perfect, though, with an erasure code, small rights are slower because you have to operate more update more devices when you update a full stripe with the erasure code.

B

Hopefully, we hope to mitigate that with blue store, all that we haven't done sort of the final testing that pits file store, 3x against blue store or racially coded, so that's sort of in progress. It will sort know with the final how we NetApp, but on the flip side, large rights are actually faster than replication, because you're actually doing less IO to your devices, which is also good.

B

The implementation is still doing sort of the simple thing when you do a small right: it's updating a full stripe, so we we want to do things that are more clever, but it's going to take a little more time before we were able to make those optimizations, but it works in luminous. It's there. It's ready to ready to go.

B

The other big piece of work that went into luminous was the new stuff manager daemon. It actually appeared in kraken, but didn't actually do anything useful. Yet in luminous it does lots of useful things. The main thing is that it offloads a whole bunch of work that the monitor used to do into a new demon, so the Oh monitor previously dealt with all the PG stats. It ended up with a whole bunch of data that was just turning through paxos and slowing down the monitor and limiting or overall scalability.

B

That's all gone manager is doing it. Instead, it's doesn't have any durable state, and so it's much faster and more efficient and we can focus just on the things that are important for keeping your cluster up and consistent in storing data.

B

So it's going to make Cephalon scale again and coincidentally, this morning, I just got word that CERN has like a 10,000 OSD cluster that we're going to be able to use for two weeks in a couple of weeks to do another set scale test, and so it's perfectly time to test all this new Luminous stuff to actually see how it does with these big, because we don't usually get to buy that much hard road Red Hat. Unfortunately, so that's that's going to happen in next couple weeks.

B

Very excited about that set manager also has a new REST API, we sort of took the caliber API adapted it. Its uses, the pecan framework. Now it's written, Python manager has this nice Python plug-in framework that you can use, so that's going to be there and there's also going to be a built in dashboard. It's super simple. It's basically like a set yes on the web, but it works. Here's a screen shot, it's pretty simplistic right now.

B

This is a 1 OSD cluster on my laptop, so it doesn't really show much industry, but it shows you know all your demons, all your health warnings, whether things health is good or bad.

B

It shows the log just really basic stuff like that, so not much in Luminos yet, but this is going to be a building block moving forward, eventually we're going to add all the all the metrics in there that the manager already has actually just doesn't present them through the GUI and that sort of thing so moving forward there there's a new network messenger implementation in luminous. It actually was new in Kraken, but it's lumen or jewel. It's new since jewel new implementation. It doesn't use up lots of threads, it's much more efficient.

B

It's event-driven fresh codebase, it's great! That's so much better! It also has a pluggable back in so there's an RDM, a back-end for the messenger that is built by default. It isn't tested very heavily but I'm having our team a gear in our in our community lab, but it's being used in a few places in production with good results, so it seems table, but it's not sort of officially supported and tested. Yet so your mileage may vary, there's also an experimental DB TK back-end that uses Intel's user space, acceleration library, stuff.

B

That also looks very promising. It's it's definitely in a prototype stage: it's not ready to go the codes there and it you know you can build it and play with it. If you want so very excited there, Mellanox has been an ex Kyra have been that many people working on the rDNA stuff and the other sort of nice thing is coming in jewel. Is that we're finally going to have perfectly balanced OSTs?

B

So any of everybody who operates a large cluster is dealing with a variation between the least utilized OST and the most utilize OSD and dealing with three weights and capacity planning. And it's just a headache. We finally have a bunch of new tools to actually make that essentially a perfect balance. So the two two tools are something called shoes args for crash, which basically is sort of a way to feed in all the specific parameters for a particular pool that tweak the weights and the ID.

B

So you can sort of get it to do exactly what you want. It's sort of a generic capability, but what it allows us to do is around a numeric optimization that just does a gradient descent and fiddles with all the weights to up so you actually get. The actual output is exactly what the weights are that you intended when you put it in so it solves that imbalance problem, but also addresses addresses something that we've been calling.

B

The multi pick anomaly, although it's a very imprecise mathematical term, but it basically is an issue with the way that the underlying mathematics of probability that crushes based on where, if you have a device that has a very low weight and a bunch of things that have larger weights, for example, if you have a bunch of racks- and you start a new rack with one server in it crush- tends to put too much data on those devices and overflow them.

B

And this it's it's a it's annoying math, but we don't even notice it for a long time, but we can act. We can correct with that as well by using adjusted probabilities for the second and third replicate choices than crush, so the good news is that the imbalance portion of that optimizing or that is actually going to be backwards compatible with older clients. If you want to correct for the multi pick part, then you have to wait to allow your clients or running luminous and understand the new stuff.

B

The other tool is something called PG up map, which is just like it's just the ability to put an explicit exception. Mapping in the OSD Memphis that says this PG is stored on these OSDs period, so it just overrides whatever process and says put it here, and so there's a really simple optimizer that looks at your distribution and says this: postie is one PG too many, and this one has one too few, so I'm just going to move there and does that. So both those tools are there, but that PGF map also requires luminous clients.

B

They won't really be able to use it at a production cluster until everybody is upgraded on the client side and a few other odds and ends for the rate aside crash has something new called device classes where you can just tag the OSDs in your system as being a particular class or type. So you can say these are SSDs. These are hard disks. These are in VMI's and then you can write a crush rule. That's just really simple that says: map to OSD or O's to use that are SSDs and map to hard disks.

B

Previously, if you want to do this, you had to create you have to manually edit your crush map and create two parallel hierarchies and futz with all the names and all the automatic crush manipulation. Stuff just kind of broke super tedious. Now it works out of the box. It's really simple! So that's nice, there's a streamlined disk replacement process. That's well documented, so set disk. You can replace OS D's for using the same IDs and it's going to be simple and it's actually going to work. It's all be nice!

B

There's a new configuration that lets you as an administrator just declare what the oldest client version is that you want to support and then the monitor will sort of gate. All of the other things that minimally like crushed in doubles and so forth. So you don't screw up.

B

So you can just say: I want to be able to compatible be compatible with hammer clients, and you tell the cluster that and it'll just prevent you from doing anything that would break break that constraint just to make operator slides a little bit easier, we're annotating and documenting all the config options in the code. So you can just do a dump and see all the config options, what they mean and what, whether you should touch them or not. They'll be marked as like experimental developer. Only do not touch.

B

First is something that you should adjust expert. Only that sort of thing there's a mechanism so that if a PG or object is stuck, then there's a back off mechanism, so the clients will stop sending requests which, in certain recovery situations, can bite people in the whatever so that they can't actually talk to the roasty. So there's some things like that that are fixed from having better eio handling appearing in recovery, speed, ups, new and crack, and actually but nuisance Jewel Saif is now in most cases. If analyst, he fails, it immediately notices.

B

You have to wait for a heartbeat timeout. So it's much faster failure, detection and cluster moves on so lots of good stuff. That's there's a ongoing list that just sort of random little robustness stuff, that's improving, so good things there I'm sort of moving out of radius into the raitis gateway.

B

We sort of have this high-level view that in the future, most data is going to be stored in object stores, so, while block is obviously very important, particularly for cloud workloads and hosting VMs, that's not actually where most of the data is going to be most of its going to end up in objects like s3 and Swift api's, and so there's a whole raft of features that we're looking at here. Things like a researcher coding, tearing multi-site, Federation and so on.

B

So new and luminous sort of the biggest new thing, that's most exciting is greatest gateway, metadata search. So we already have this mechanism. I might build kinda screwed up here this mechanism, where you can take SEF clusters and multiple data centers or the same data center, but you can have multiple zones, sort of quasi-independent rgw installations that are federated with each other, so they share a bucket namespace and user, and you can put a bucket in a particular zone. You can replicate across ohms.

B

You can do all the stuff with the new federation active, active, all kinds of stuff, but the mechanism that allows that that does actually the syncing is pretty pluggable, and so we have a plug-in that does stinking of just metadata and it dumps it all into elasticsearch and then there's a new set of api's in there, a toast gateway that our search api is that go query elastic search.

B

So if you set up one of these zones to index your object, gateway content, you can have an index either the default stuff or whatever headers you care about. Then you can go. Do search queries to find out what you're, storing you know what file types with headers are setting whatever you want to do so that's exciting and totally new. A bunch of other stuff with the Rado Skateway there's a new NFS gateway and it's actually been present in some versions of dual got back ported some of the downstream stuff.

B

I can't remember, if actually, if it was upstream before, but this is for the right escape way, there's a very simple NFS gateway that lets you mailed nfsv4 v3, the cop in data or copy out data, which is great for migrating. Existing workloads from sort of file based storage systems to object, as you make that transition, and it's not meant to be a full POSIX file system. It doesn't do small, writes and renames and truncates, and all that random crap. But for just copying data in and out it works great.

B

So that's that's big for a lot of users, so the biggest management operations, headache that we are resolving is dynamic, bucket index charting. So for our GW users, the bucket indexes, if you put too many objects in a bucket, the index would get big and there was a tool that you could do offline that would restart it or if you created a bucket, you could declare decide what the charting was up front, but it was kind of a headache and you had to plan ahead and it wasn't very friendly.

B

Finally, a luminous, that's just going to be automatic. So as the book it gets big it will restart on its own, and you want to do anything. It'll happen online it'll just not be something you have to worry about, just sort of. If you might there's a bit of a theme of not having to worry about annoying things that we're trying to chip away these things. So so that's good. There also a couple other sort of headline features that came into the gateway through team room rant, as did a bunch of great work.

B

There's inline compression, so rgw will compress the data as it comes into the cluster and write that compressed data to to ratos. So that's good and sort of happens transparently. There's also a bunch of encryption api's that were implemented. These are following the s3 encryption. Ups I can trouble with the name of that the whole categories, but you can set keys on the buckets on the users. It's a whole bunch of it's a big, complicated API that Amazon made up.

B

I'm gonna basically implemented it, so it's there um yep and then there's it's a whole bunch of stuff with the s3 and Swift API, that's been improved and added and updated and there's sort of always a constant flow of issues. There I get resolved. Those are sort of the big, exciting things on the rate of Skateway site on the rate of block device.

B

Also lots of stuff going on the biggest thing, obviously is the racial coding which I already mentioned, but I'm going to mention it again because it's a big deal, you can run rate of block devices on a richer coded pool and buy less hard disks and SSDs and everything else. So it's pretty simple. You just specify the data pool when you're creating the our buddy block device, and it puts just the data blocks in that in that pool.

B

There's also a lot of work went into the our buddy mirroring mechanism, so the our Bhatia mirroring daemons are now both they're, multiple ones of them and they're sharing the load and their H a and all that stuff, whereas in the ghoul there it existed, but it was just one demon. Now, it's a bunch of them and they scale out and all that's all that stuff, so lots of stuff there, I'm, mostly just around robustness and not so much around new feature capabilities, improved cinder integration.

B

Noise is ongoing, stuff OpenStack a lot of work going into I scuzzy. This has sort of been a multi-year journey of various false starts and attempts to use different kernel interfaces that get tanked by upstream kernel whatever, but the final, the new, the latest I scuzzy approach is based on l, AST, CMU Runner, which is basically a user space pass-through.

B

So the ice cozy kernel target and the kernel is going to pass through to user space to lebar BD, which is nice because you get the full level everyd feature set on the latest and greatest there and the performance penalty of doing that. Pass-Through is very modest. Actually, so that's good. It's going to be a full H, a solution that does failover it scuzzy reservations and all that stuff.

B

So that's that's coming and on the kernel side there have been lots of our beti improvements, keeping up with the crush and OSD and closer protocol changes that have happened. That's all there I'm already, specifically the exclusive blocking stuff is in the kernel now upstream and support for the object map stuff, which are both kind of old features, but they're now in the chronal. So if you're using the native chrono block device, you can get that stuff when finally set of s.

B

So if you've seen in my talks last few years or last year- I guess you've seen this before, but we used to talk about SEPA, fests and stuff, saying that all these parts of stuff were awesome. But sefa Fest was nearly awesome because it wasn't ready yet and yadda yadda and finally, now Stefan Fest is production-ready. It's stable.

B

It is now fully awesome, yay, so I said I said exactly the same thing at the open sack talk six months ago, except it's at 2016 instead of 2017 the new the latest fully awesome part that is now fully awesome is that luminous will have support for multiple NDS is active, active, which also has been a long time coming, and despite my there, okay.

C

B

Hopefully not long ago, all right, let me see, I, can catch up. I blamed the car.

C

B

There we go okay, there we go yes, so multiple active MVS is finally and there's them a bunch of stuff to go along with that. So the multiple NDS is have this load balancing framework. That's all heuristic based and tries to understand your workload and move things around, but it's hard to understand what client workloads are doing so there's also this manual mechanism. Is you can just go in and say this subtree, this directory I'm just going to pin it to that NPS.

B

So, if you want to, you, can just manually just enforce whatever the subtree partition is that you want? If you don't want to rely on the automatic thing to do its thing, which it might do right, it might do wrong. It's sort of ongoing work, so there's that um directory fragmentation is finally on by default. This is stuff if s dealing with very large directories, it'll break them up into little pieces and put them in separate objects and multiple DS's and all that stuff.

B

It was off by default for a long time just because we didn't have the test coverage it's finally on, so that's good, there's been so many tests written and so many bugs fixed. It's just that this FS team has just been kicking their butts, really getting confidence in the stability and so forth.

B

So that's all there and a lot of work also going in on the on the kernel client side, also keeping the chrome client up to date with all the changes in user space and improve fixing bugs and so on so group effort here, mostly from redhounds user developers. But it's been it's been good, so we're really excited about some of us. Ok,.

C

B

If you saw the user survey also, you might have seen that on the manila, south FS driver is like hugely popular number one yeah, that's pretty cool, so I'm, ok, yeah open source! So that's that's mostly it for all the stuff.

B

That's coming aluminous we're sort of wrapping up the development cycle as we sort of there a couple last features that were finalizing and getting version of the tree and, in the meantime, we're also focusing on what sort of the low-hanging fruit for just usability and making things easier to manage and deal with and less confusing that we can squeeze into the release, because that's sort of you might have noticed, there's a bit of a theme of trying to make seth less hard and so we're trying to get as much of that luminous as we can, but then after luminous there's more stuff.

B

So what's what's coming next, the next release is going to be mimic. It's named after the mimic optic octopus I strongly encourage all of you, a google mimic octopus on YouTube, but look at them they're like super super amazing. It's definitely the coolest F WA pod. It's Google.

B

They're awesome anyway, so mimic lots of stuff again the sort of the sort of that highest. The main motivation. I say the main priorities are like you know, makes F faster performance, but mostly as far as features, we're feel like we're in pretty good position. The main challenge I think that a lot of both the stack users are facing or dogapus access users are facing is around usability. It's just hard to manage.

B

It's complicated, it's hard to set up, and so we're just trying to make it easier unless confusing and a lot of it yeah there's a lot of stuff that we can do that. Just is sort of needlessly obtuse and opaque, so we're trying to try to improve that, but there's also performance as a big thing too.

B

So, on the rate aside, the biggest thing that's going to happen post luminous is we have sort of a big RIF factor clean up, optimization exercise planned so there's some peering stuff to win if except but the main thing is that the main I/o path where messages come in and they get processed with the thread pool and then they get handed off to the objects whatever it needs to be: refactored, cleaned up, restructured to use, more asynchronous state driven model for programming for plugging use, their futures and all these fancy language features and whatever.

B

But it's going to be painful, but it's really important, because the current structure of the code is hard to maintain because it's gotten so complicated and it doesn't perform as well as it needs to so, as our storage devices get faster and faster, we really need to address this sort of elephant in the room in order to make progress. So that's going to happen. Ongoing work on blue FS in rocks v2. There are some sort of tactical items that we are dealing with there, but really we're limited by that OSD piece.

B

We've done a lot of optimization on the messenger side, you saw the talks earlier with our DMA and so on. That's getting much much faster, getting stuff into an out of Ceph and blue store is much much better. I mean getting stuff on disk and off disk. We sort of eliminated the main issues there, but it's really everything in between that needs to be fixed up. So so that's the big thing. That's going to keep at least some piece of our team pretty busy, but there's some other exciting, Rheda stuff, that's coming too.

B

So one of the efforts has been going on for quite a while now, but sort of has been a secondary. Priority has been working on quality of service. This ongoing background development around the DM clock. Algorithm I was published several years ago in an academic conference. It's distributed quality of service. That gives you two things.

B

You can give minimum reservations to clients or classes of clients, and you can do priority based waiting for everything above that, so you can guarantee so many iOS to certain clients and then whatever is left over, will be proportionately shared with weights among the other clients, and the idea is to have a range of policy, so we can use this just to prioritize types of traffic like client area versus recovery.

B

Oh, we can prioritize pools so that certain pools will get more be faster than other pools on the same hosti or just do it based on client, so that this client gets a minimum reservation. This one gets whatever is left over, but the problem is it's just a complicated problem, especially when you talk about distributed systems and things that replicate so you're actually signing up to do I/o and other people's nodes it's complicated, but despite that, our initial testing has actually shown pretty good results.

B

So we're encouraged that, despite not having sort of a complete solution, necessarily it's actually actually seems to be working pretty well. The main thing missing right now is really having any kind of management framework. So we have a lot of the underlying doing stuff done, but we don't know how it's going to be configured and how what the user experience is. Gonna, look at that sort of all work TBD, but the initial results are promising. So this is a.

B

This is an example of a test run a month or two got back where you have a couple: clients that have minimum I observations of 100 or no 50 100 apps, and then the third one had a very high priority. So everything that was left over beyond that ended up given to the third client and not the first two, and you can see that it actually, you know, does what it says.

B

It's supposed to do so so it's exciting we're sort of getting pieces that merged and it's sort of coming together, but it'll, probably a couple releases before it's actually a complete usable thing. The other thing that's going on is there's more work in the tearing department. So once upon a time, we did this thing called cash steering and it worked okay, but not great, and so we sort of stopped talking about it and doing much with it. This the new tearing stuff is, is coming and it's it's based on some pretty simple primitives.

B

So the basic idea is the concept of a redirect. So you have a ratos object. That's basically assembling to another lab object, but from the clients perspective, you don't know you just talk to the OSD and it proxies it through to wherever it is so, instead of you'd be able to move from a cache cheering type model where you have sort of a sparse set of objects, that may or may not be there in the cache tier and if you miss, you goes through the base tier to the new model.

B

Where you go straight to the base tier, which is essentially an index, it knows what all the objects are and either they're either there or there's a pointer to where they are, and then they can be either in you know, one slow pool or different slow pool or wherever. So it's a bit more flexible and it enables us to do other things.

B

So deduplication is a project that the folks at SK have been working on for a while and we've been helping out a little bit with and it sort of builds on this basic concept of a redirect. So the idea is to generalize a pointer to somewhere else with a manifest that says, you know this part of the object is over in that piece, and this part of the object is over in that piece, so you can have fragments that are stored in other pools, so we break objects into chunks.

B

We can store those chunks in content, addressable pools where you hash the content, and so your DD Bing, based on the content and the reference count those chunks and then you can have these manifests that point to a bunch of different chunks. So it's how the basic idea, how all the you know due to Bing storage systems, work with you know, chunking things and so on, but we're it's scale out in the sense that that that's sort of that base ray dosed here is acting as the index. That says you know.

B

This is the name of the object. You look up the name and that pool, and it tells you what the chunks are and where they're stored. So that's that's the basic idea and that's the direction we're going in there's a lot that sort of still to be determined. Is this going to be inline, chunking and storing, or is it going to be post processing? Is that going to happen inside the OSD or by an external agent? Is it going to be?

B

What are the policies that are going to control when you chunk things when you don't?

B

These are all sort of TBD when we're just sort of getting the the core underlying functionality in place, lots more work going into the set manager and it's sort of been built as a new place to do new stuff in Ceph, and so there are a lot of things that we want to make it to one of the first things is metrics aggregation, it's already slurping up all the performance counters and all the demons into the manager, but they're not really going anywhere yet so in the short-term, we want to have sort of the manager provide time series data just in memory out of the box with a short history, the last few minutes or whatever.

B

So you can get a little eye, ops graph without any additional work, but eventually will I want to have that stream off to external platforms. So if you have like Prometheus or zabbix or whatever your big thing is, you can also just turn on the firehose and send it all there. So that's that's coming it might even it's even possible. Maybe that will get the Prometheus stuff in for Luminess, but maybe I, don't we'll see, that's I know fast. They are, but there's also the intent of manager.

B

Is that it's a good host for other management functions. So it has this whole Python runtime environment. So you can just add all these Python plugins, so you can add new stuff to it pretty easily.

B

So it's going to be where we do the automatic crush optimization, where we're automatically balancing your cloche, your crush weights and stuff they'll tap another box, but we can also do other things like automatically identify which devices in your cluster are slow and migrate, workload away from them, Stewart Iowa from those devices or even do device failure prediction so that if we're pulling smart data off the disks, we can run a prediction model that decides which ones are about to fail and preemptively copy data off of them.

B

So it lots to do there and it's kind of exciting, because it's that there's, because you can write things in Python- that are more policy based than it brings in a whole new pool of contributors that can that can write that kind of code. So that's good, there's also stuff on the architecture front. So there are arm 64 builds I've mentioned this before, but we're still trying to get enough hardware in the lab, where we can actually do these on a regular basis and get them into the C ICD pipeline.

B

So we have some of the hardware we're still waiting on a few more boxes, but the intent is that going forward, all the new releases are going to have arm 64 packages for both CentOS and for a boon they're. Also much of patches coming in recently on PowerPC, adding support for that as well, and we're talking about getting PowerPC Hardware in the community laughter. Do those builds so they power a while back. We did some work with arm. 32 builds because we built this 500 node cluster 4 petabytes.

B

Out of these little micro servers, it's essentially a hard disk with an arm server on the hard disk speaking use a net, so you're running those T's on the hard disk, literally no boxes hosting them. That was pretty fun. That was with WD labs and they're, actually doing an update to their platform. They're doing a that was agenda to drive they're, doing a gen 3 Drive that has a 64-bit arm, yay and more RAM and better networking and all kinds of stuff, so we're working with them. It's exciting!

B

If you're interested in that project- or these things seem interesting to you- you should contact Jim Wilshire at WDC, they're, looking for POC people to work with so exciting, good stuff and then finally, a client, caching sort of across the board, so they're on the greatest gateway side. There's a project with Boston University and until I think worked on it, where they added a persistent cache for a rate of gateway to support their big data workloads over rgw, and they it worked great they're putting stuff on nvme they're at like saturating the N'diaye they're.

B

Getting really good performance I mean it didn't, sacrifice consistency because of the way that it was architected with they're doing immutable objects only. So that was great and the students a couple of the students who worked on that are now interns for the summer at Red Hat, so we're playing us to get all that code, cleaned up and merged into the tree, so that's exciting and on the RBD front, we're also very interested in doing client-side.

B

Caching, if you saw the talk earlier with Jason and Tushar we're looking both at immutable caching, so that you have, if you have snapshots that are the basis for clones, then that sort of immutable parent can be cached and then also write back cache. So you get sort of low latency writes that then gets streamed back to the cluster. Then seth FS are actually already has a persistent client-side cache.

B

If you're using the kernel client, it's been there for awhile there's a generic kernel infrastructure called FS cache that plugs into Seth FS, so I run the set aside. You already have client caches at least a read-only cache. It's not doesn't do written data, but yes, so client caches are good and that's that sort of it. For my whirlwind, tour of all the new development stuff I'm going to talk a little bit about all the people who are helping us do it. So these graphs are a little bit old.

B

I didn't get updated once, unfortunately, but lots of people are contributing to Ceph. The number of contributors is increasing or it's great. We love it. um It's it's a challenge for us to keep up with all the pull requests and reviews. So apologies that you've submitted a pull request and you feel ignored just keep painting us and we're busy, but we we want you to keep doing it and the community is broadened and expanding. So these are the most. These are the top contributors from since jewel.

B

So it's updated a bit since my last talk, but you'll see that there's sort of a broad set of people here, so you can see all of the OpenStack vendors on this list. With the you know, the Linux people and easy stack unite stack. You see a whole bunch of cloud operators, a bunch of them and in a pack and also in Europe public cloud, private clouds all across the board. Not all of them are using OpenStack, although I think most of them are.

B

You also see hardware and solution, vendors that are selling software products based on self SEF or in some cases, hardware, products based on SEF, which is very exciting people, like quantum twice in fact, that you don't usually see on these lists and a couple of the people that actually I don't really know what they do. I guess I could have googled it, but but it's exciting it's exciting to see the breadth, I guess of a contribution. So there are lots of ways to get involved. There's mailing list. We do a set developer monthly.

B

So every month we have a developer video call. We alternate a pack friendly and amia friendly times, and we just talk about whatever development issues are pending, it's all virtual IRC, whatever so so you can join and then, if you want more events like this, of course, they're set days, this one's awfully convenient cuz it's at OpenStack summit, but we do them all across the world about once a month.

B

You can go see the schedule there is the next ones are all, and in Asia next few months, their meetups that you can search for in various locales. We also do SEF tech talks. I think this is like the last Thursday of every month or something there's always an on. It's like a YouTube blue jeans thing where somebody just does some technical presentation of sub check on some subject related to SEF those tend to be debris, interests and all of this stuff is recorded and ends up in a set YouTube channel.

B

So if you go there, there are like hundreds of talks recorded talks over the past many years, that you can go, watch and learn all kinds of good stuff. And of course you can follow us on Twitter. Okay, that's it and don't forget to Google the mimic octopus because it is awesome.

B

And I'm happy to take any questions on aluminous. What's coming after on old stuff cephalopods, gotta use the microphones, though, because they're trying to record yes I.

D

Have one quick question about you know we're really happy to see the work on QoS coming? Does that also cover like per client metrics in the manager? Yes,.

B

That is the intention, so the way that damn clock works, the EM clock piece is the actual cue prioritized queue that does waiting. That does both minimum reservations and waiting and the D part is the distributed part. So there's metadata that's shared between iOS by the clients across those DS, so you get a global reservation and not just a local one. So the intention there is that you can tag.

B

This client is going to get this reservation and even though it's talking to lots of devices it reaches that okay, that goal, but what's again what's missing, is how you're going to configure it and how you store what the policies are and how that, let's, we have to figure that out.

D

We're interested in both sides would both be able to limit people and also being able to find the people who are abusing yes.

B

So just measuring usage is one of the thing I should have mentioned that that we want to do in manager, so have the have all those T's in the system sample the request streams and send that information back to the manager, so they can build like a top a top view, basically of who's, doing all the IO in the system. Okay, thanks.

D

B

E

You have any plan on CPU optimization, because now self itself consumes occasionally most EU than disks. Yes,.

B

Yes, we have many plans to optimize the CPU utilization, so that's largely what the the OST ver factor, one of the main things that's looking to address, but also just across the board, we're doing a lot of profile and we're trying to figure out where we're wasting CPU and data structures and whatever else that just shouldn't be done. The way it does so, if you're a if you like, optimizing and profiling whatever then like we'd love to have you be involved.

B

But yes, there's lots of work going on there that stuff it's a known issue, especially as the devices get faster and faster. We have to reduce the amount of CPU that were using yeah hi.

C

I'm is there any improvement about the cross region, rgw.

B

Improvements for the.

C

Gwl cross via a cross.

B

Region so in joule, the the multi-site federation for our GW was almost completely rewritten, so there's a whole new way to configure zones and zone groups and replication bi-directional stuff, and there are ongoing improvements to that bug, fixes, but there's no new feature per se in Luminess, except for the metadata indexing. So but it's changed since hammer and Firefly. So if that was last time you looked at it, it's it's newer and better and more robust and all that man, two.

F

Brief questions I was wondering if there's a way that we're going to be able to detect the list or the version of all the clients that are connected to a cluster. Because it's a bit hard when you're doing upgrades. And if you forget one of your clients, yep.

B

I mentioned usability a couple times for building a Trello board with all the annoying things, and that was one of the cards. I just added get a day, I think it's a simple enough thing: that's going to make it into an luminous so that we can also gate changing that minimum required client on whether those clients are connected. So it'll prevent you from saying require luminous. If you have a more clients that are talking to the cluster or something yeah.

F

And then the other thing is: is there kind of work around figuring out if you have some OS D's? That are example have high latency, because usually those will cause a lot of issues in a cluster and are hard to find? Yes,.

B

You're not so the the OSDs already report, just their sort of average latency metric to the monitor and there's already a command called set, postie perf that you can pipe the sort Kay and whatever and you can. You can see it, but it's it's annoying. You have to go, do it so one of the ideas is that the manager will be able to now that it met it. Has all those metrics. It can do its do that automatically, and we can. You can write. You know easier to understand code in Python.

B

That just looks for the slowest OSDs and then write some policy around that like set the primary affinity on those devices, so that they're not primaries, and so the reads go to other devices and you mitigate that or preemptively fail or whatever it is. You want to do yes,.

C

B

Yeah thanks any other questions all right. Thank you very much.