From YouTube: Ceph Days NYC: State of the Cephalopod
Description
Presented by: Neha Ojha & Josh Durgin
In this talk, we'll provide an update on the state of the Ceph upstream project, recent development efforts, current priorities, and community initiatives. We will share details of features released across components in the latest Ceph release, Quincy, and explain how this release is different from previous Ceph releases. The talk will also provide a sneak peek into features being planned for the next Ceph release, Reef.
So today I wanted to talk a bit about the state of the Ceph project: what's been going on recently in the latest releases, what we're planning for the future, and some of what's happening overall. It's been several big years of growth as a community.
In general, we've changed to a new model for our technical governance. Instead of having a single person in charge, as we've had for most of the life of the project, we've moved to a council model, as Mike mentioned, with shared leadership among the leads of the different components of Ceph. The leads elect the council every couple of years, and the council is responsible for making sure everything happens and for taking care of the community and the Foundation as well.
A few major focus areas for us in general have been quality, performance and scalability, and usability. We've been doing lots more testing at large scale, trying to detect things before users hit them, and improving our processes, like creating RC releases that users can test out at large scale before the final release hits the ground.
We've also been focusing much more on performance testing and performance improvements in Ceph. A big part of that is the Crimson project, which we'll talk a bit more about later, but it also means trying to run at much larger scales, on more varied hardware, on a more continuous basis, and with more realistic workloads.
Finally, we're also looking to learn more about how Ceph is used and to get more continuous feedback from the community. Part of this is the telemetry effort, which we'll talk more about later. This is all opt-in data reporting that you can share anonymously: how big a cluster is, how many OSDs it has, these sorts of information. Right now we have somewhere north of 800 petabytes of Ceph clusters reporting, so there's a massive number of users out there, and you can inspect some of the data that's out there in the public dashboards, which we'll talk more about later. For the developers there are some private dashboards where we can go and see, for example, if we just did a release, how many crashes are happening, which clusters it's affecting, whether it's only a one-off incident or something widespread, that sort of thing.
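For reference, enabling and inspecting telemetry is a couple of commands. A minimal sketch, assuming the `ceph` CLI and an admin keyring are available on the host; the report itself is printed as JSON, so you can review exactly what would be shared before opting in:

```python
import json
import subprocess

def ceph(*args):
    """Run a ceph CLI command and return its stdout as text."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Preview the report that would be sent; the output is JSON.
report = json.loads(ceph("telemetry", "show"))
print("top-level sections:", list(report.keys()))

# Opt in; the license flag acknowledges the data-sharing terms.
ceph("telemetry", "on", "--license", "sharing-1-0")
```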
So in terms of releases, we're midway between Quincy and Reef. Originally our plan was to release Reef in March, but due to some lab outages this year we've pushed that back to June this time.
As before, we're continuing with the yearly release process: doing a major release each year and deprecating releases after three years, so once Reef is out, Pacific will become deprecated.
We're also trying to do more scale testing in our upstream lab, which is not as large: we can't run thousands of OSDs and use them at full capacity, but we can run thousands of OSDs with a much more limited amount of resources and space. So we're developing some methods to actually test out the things you see at scale without needing the full scale of data.
A
Yeah
finally,
there's
also
more
focus
on
performance
as
well
and
the
performance
it
can.
It
can
be
a
very
interesting
thing
to
to
test.
There
are
many
different
aspects
of
this
one
of
them.
A
lot
of
the
things
you
can
test
for
performance
actually
end
up
showing
up
even
at
smaller
scale,
so
we
have
a
smaller
much
smaller
scale,
a
set
of
high
performance
nodes
in
our
RF
stream
and
CPL
lab,
and
we're
always
improving
and
and
looking
for
more
help
with
the
bunny
tests.
Currently I think we have, what is it, something like 16 or...
Right, so we've got a few different clusters with a mixture of Intel and AMD processors and a mixture of fast NVMe devices from different manufacturers, and we're continuing to try to improve that testing and maybe even make it into more of a continuous process rather than a kind of one-off manual test effort.
Within the RADOS layer, which is the base layer for everything in Ceph, there's been a lot of focus on stability and reliability. One of the major things there is quality of service: being able to maintain the performance of the different clients while different things are happening inside the cluster, like lots of scrubbing going on to check for data integrity, or lots of recovery from a failure.
A
If
you
want
to
keep
your
clients
happy
and
keep
the
latency
and
throughput
going
as
you
expect
so
in
Quincy-
and
this
was
made
in
the
default,
we
only
have
a
new
scheduler
called
m-clock
which
implements
this
quality
of
service
for
background
operations.
We
made
further
environments
to
it,
they're
coming
up
in
brief
and
right
now.
It's
only
for
background
operations
compared
to
clients,
but
the
in
the
future
we're
going
to
extend
it
to
be
supporting
different
classes
of
clients,
so
different
clients
can
have
different
reservations
for
amount
of
iops.
A
They
want
that
sort
of
thing.
A
Let's
try
to
make
this
to
the
easy
to
use,
because
stuff
has
many
many
options
for
tuning
these
kinds
of
things
today.
So
with
with
the
m
clock
scheduler,
instead
of
setting
tens
of
options
to
figure
out
how
fast
you
want
recovery
to
go,
you
just
choose
a
single
profile.
If
you
want
recovery
to
go
faster,
do
you
want
clients
to
go
faster,
or
do
you
want
something
in
the
middle?
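As a concrete illustration, switching profiles is a single config option. A minimal sketch, assuming the `ceph` CLI is available; the profile names are the ones mClock ships with (balanced, high_client_ops, high_recovery_ops, custom):

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Check which mClock profile the OSDs are currently using.
print(ceph("config", "get", "osd", "osd_mclock_profile").strip())

# Favor recovery/backfill over client traffic during a large rebuild...
ceph("config", "set", "osd", "osd_mclock_profile", "high_recovery_ops")

# ...or favor client I/O, or go back to the balanced middle ground.
ceph("config", "set", "osd", "osd_mclock_profile", "balanced")
```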
It's always possible to go deeper and configure things at a much lower level than that, but we're trying to keep things simple to manage and maintain. So if you don't need to, you don't have to carry around all those nitty-gritty details of thousands of configuration options.
This also applies to some of the other improvements we've made in RADOS, in terms of how we report health errors and how we report slow operations. We're trying to make it easier for folks to diagnose performance and stability issues in the cluster sooner, and you'll see more improvements there in the future as distributed tracing gets more embedded in the different protocols of Ceph. Reef will be the first release where that's present and easily deployable.
Also coming up in Reef, there are lots more improvements to BlueStore for performance purposes, and also to address different kinds of performance problems that can crop up, like fragmentation. This has been a major cause of issues in BlueStore in the past, and there are some efforts there on improving that.
In Reef, this also includes balancing across primaries, because when Ceph is doing reads and writes, reads always go to the primary OSD. So if you have a skew in how many OSDs are primary for a certain number of objects, you're going to see hot spots. Now the balancer will take that into account as well, so you get more even performance.
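To see the kind of skew this addresses, you can count how many placement groups each OSD is the acting primary for. A rough sketch, with the caveat that the exact JSON layout of `ceph pg dump` varies a bit between releases:

```python
import collections
import json
import subprocess

# Dump PG stats as JSON; depending on the Ceph version the PG list is either
# the top-level array or nested under a "pg_stats" key.
raw = json.loads(subprocess.run(
    ["ceph", "pg", "dump", "pgs_brief", "--format", "json"],
    check=True, capture_output=True, text=True).stdout)
pgs = raw.get("pg_stats", raw) if isinstance(raw, dict) else raw

# Count how many PGs each OSD is the acting primary for; a heavy skew here
# means some OSDs are serving a disproportionate share of reads.
primaries = collections.Counter(pg["acting_primary"] for pg in pgs)
for osd, count in primaries.most_common():
    print(f"osd.{osd}: primary for {count} PGs")
```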
One of the major focus areas for Ceph and performance is the Crimson project, which is a re-implementation of the OSD in a new shared-nothing architecture, meaning it's designed to avoid any kind of cross-CPU or cross-thread communication, so data can go at maximum speed from the wire to the disk.
A
This
is
in
the
experimental
stage
right
now,
where
it
supports
the
basic
operations
of
RBD,
but
it's
still
undergoing
stabilization
and
has
a
long
way
to
go
in
terms
of
optimizations
before
being
fully
production,
ready
and
being
able
to
be
used
for
all
workloads.
In Reef we have some initial support for multiple reactors and for snapshots, but we're continuing to build out the test suite there and to improve the functionality, to make it fully usable for RBD in the next release, probably. A big aspect of that as well is SeaStore, which is a new object store backend in the OSD designed specifically for high-performance devices like NVMe disks and other sorts of very fast media.
So, as I mentioned earlier, we have this telemetry system, which is all opt-in reporting, and we're trying to make it easier to use and easier to update, so we've introduced a number of changes around how we manage the data there. I think Yaarit is going to talk about this in more detail later, so I'm going to skip through it pretty fast, but essentially we're gathering metadata about clusters, crashes that happen, and also information about drives that fail, so we can feed that back into models that predict failures before they occur.
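The crash reports telemetry sends are the same ones the cluster's crash module collects locally, so you can review them yourself before anything is shared. A small sketch, assuming the `ceph` CLI is available:

```python
import json
import subprocess

def ceph_json(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.run(["ceph", *args, "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

# List crashes the crash module has collected but not yet archived.
for crash in ceph_json("crash", "ls-new"):
    print(crash["crash_id"], crash.get("entity_name", ""))

# Full details (stack trace, daemon version) for one report:
# ceph_json("crash", "info", "<crash_id>")
```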
A
So
in
Quincy,
it's
much
easier
to
manage
things
like
expanding
the
cluster,
adding
new
host,
removing
them,
and
it's
gained
support
for
a
number
of
different
protocols
like
monitoring
protocols
and
adding
proxies
to
be
able
to
make
things
h.
Reef is expanding that support across different functionality, in RBD and RGW especially, and trying to make things easier to monitor and easier to diagnose by integrating centralized logging as well, so you can have all your logs go to a standard Elasticsearch-based platform to inspect and sift through.
cephadm is the current way to deploy Ceph clusters, and it's designed to be as easy to use as possible. It deploys everything via containers and systemd, but manages it all through the Ceph orchestration commands, which are common across cephadm and Rook, the deployment method on top of Kubernetes. cephadm itself has gained support for a number of different things in Quincy, and in general we're trying to make it easier to use and easier to deploy in the flexible ways that people would like.
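For context, a typical cephadm workflow looks like the sketch below; the hostnames, IP addresses and placement counts are placeholders, and the same `ceph orch` commands apply whether cephadm or Rook is the orchestrator backend:

```python
import subprocess

def run(*cmd):
    """Run a command on the admin host and echo it for clarity."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Bootstrap a one-node cluster (the IP address is a placeholder).
run("cephadm", "bootstrap", "--mon-ip", "10.0.0.1")

# Everything after bootstrap goes through the orchestrator interface.
run("ceph", "orch", "host", "add", "node2", "10.0.0.2")        # grow the cluster
run("ceph", "orch", "apply", "osd", "--all-available-devices") # deploy OSDs
run("ceph", "orch", "apply", "mds", "cephfs", "--placement=2") # add MDS daemons
```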
So in Reef, we're also making it simpler to upgrade more piecemeal, so that you don't have to upgrade the whole thing at once; that was a little bit more difficult in the past. We're making it simpler to set up multi-site deployments for the RADOS Gateway and for mirroring with RBD, and we've also added support for automatically rotating the authentication keys for daemons, which keeps your clusters more secure.
Rook is the main way to deploy Ceph on top of Kubernetes, and Rook is always catching up with and keeping up with new features in Ceph and new features in Kubernetes. It now supports much easier ways to troubleshoot a Ceph cluster; I believe there's even a new project, a little bit separate from Rook, to gather some of these troubleshooting commands together in one place, to make it simpler to diagnose and inspect a cluster deployed in a Kubernetes environment.
Currently, work is ongoing around the NVMe-oF gateway, which Jonas is going to talk about after me, I believe. In the future I think that will be a large and important protocol for Ceph, because it's becoming a very widely supported standard and a very easy way to connect to all kinds of operating systems and hosts.
A
There's
also
some
performance
improvements
within
RBD
at
the
RVD
itself,
support
for
a
caching
Daemon
that
does
right
back,
contributed
by
Intel
and
designed
to
take
advantage
of
very
fast
local
storage
if
you
want
to
have
rights
going
through
a
single
discount
locally
before
being
pushed
back
to
the
main
cluster
and
finally,
we're
focusing
a
lot
in
RBD
on
the
multi-site
capabilities
there
asynchronous
mirroring,
namely
between
multiple
clusters.
Interesting research for the future is ongoing from a group at Northeastern University into a log-structured format for RBD, which has the potential to drastically improve performance for random I/O in particular, but it has some restrictions on what workloads it applies to. I think it's a very interesting effort for the future.
In the RADOS Gateway, the S3 and object storage interface for Ceph, there are a number of ongoing efforts. A lot of these are around supporting AI and analytics types of workloads, like S3 Select, being able to run SQL queries against your objects. This is a particularly interesting project because it's able to run standalone as well, so you can easily develop your queries against a local file and then run them against your cluster.
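As an illustration of what an S3 Select call looks like from a client, here is a sketch using boto3 against an RGW endpoint; the endpoint, credentials, bucket, object and column names are all placeholders:

```python
import boto3

# Point boto3 at the RGW S3 endpoint (all values below are placeholders).
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8000",
                  aws_access_key_id="ACCESS", aws_secret_access_key="SECRET")

# Run a SQL query server-side against a CSV object.
resp = s3.select_object_content(
    Bucket="analytics",
    Key="trips.csv",
    ExpressionType="SQL",
    Expression="SELECT s.passenger_count, s.fare FROM S3Object s "
               "WHERE CAST(s.fare AS FLOAT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result comes back as an event stream of record chunks.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")
```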
Another big aspect for the RADOS Gateway is multi-site support. It has supported different kinds of multi-site replication for a long time now, but there are some performance issues that we're looking to address in Quincy and Reef, namely around parallelizing more of the synchronization work among multiple gateways and balancing that load across all the gateways.
There's also ongoing work towards being able to debug and understand what's going on, namely with Jaeger and OpenTelemetry tracing, which is now quite easily deployable via cephadm, and I believe via Rook pretty soon.
This will let you debug where operations are getting stuck or where performance is slowing down, basically at each step of the way. So if something is being synced from site A to site B, you can see: is it getting stuck, is it slowing down at site A or at site B, or which part of it is going wrong, if you have a problem.
There's also a lot of work going into the RADOS Gateway's Zipper backend, which is a kind of pluggable backend that lets you put different things behind RGW for different purposes. If you wanted to run it in a very constrained environment, for example, you could back it with a simple file or a simple database.
Within CephFS, there's a lot of work on the cephfs-top utility, again for introspecting what's going on within your system from a performance perspective, so you can very easily see what clients are doing, and if there's an issue, go in and look more deeply at it.
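Getting cephfs-top running takes a couple of steps, since it reads per-client metrics from the manager's stats module. A small sketch, assuming the `ceph` CLI and the cephfs-top package are installed:

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command."""
    subprocess.run(["ceph", *args], check=True)

# cephfs-top reads per-client metrics from the mgr "stats" module.
ceph("mgr", "module", "enable", "stats")

# Create the client credentials cephfs-top uses by default.
ceph("auth", "get-or-create", "client.fstop",
     "mon", "allow r", "mds", "allow r", "osd", "allow r", "mgr", "allow r")

# Then launch the curses UI to watch per-client IOPS, latency and caps.
subprocess.run(["cephfs-top"])
```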
There's also work towards cloning and snapshotting support, similar to what we have for the block device, where you're able to snapshot your files and easily make copy-on-write clones of them.
The last thing I want to touch on is testing and quality in general; there are a few different aspects to this. One new effort that we're working to focus more on this year is the developer experience. If it's very difficult to run the tests for a system, it makes it much harder to add new tests and keep them maintained.
So if we can create a way to run these tests locally, instead of having to wait hours for builds and push them through a queue for days, if you can run them on your own laptop or your own machine, it becomes much easier for people to contribute and to maintain those tests.
In addition, an easier local developer environment can make it much easier and faster to develop, if we can base it on incremental builds instead of always building everything from scratch. That's something you're going to see more improvements on coming up.
As I mentioned earlier, there was a bit of a large outage in our lab that slowed down our release process this year, so we're making some improvements to our lab infrastructure to make it more resilient, and we're also looking at how we can run these kinds of tests in other environments in case there's a problem with one.
Another aspect of that is engaging with other organizations that have hardware they'd like to help the community with, and enabling them to contribute by running tests themselves, on RC candidates for example, or maybe by providing some backup hardware in case our Sepia lab has an issue and we need to keep things going to be able to merge code and get releases out the door. And finally, as I mentioned earlier, we're looking into measuring performance on a more continuous basis and monitoring it via a CI system.
Yeah, so the question was, for the QoS, will you be able to make a fine-grained decision for a particular image? And the answer is yes: the idea is that you'd be able to apply a policy for how many IOPS this image gets, or you could make it on a per-pool basis, or at several other granularities, but the underlying support would be there even at the image level.
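Separately from the OSD-side mClock work described here, librbd already offers client-side rate limiting per image or per pool today; a small sketch with placeholder pool and image names:

```python
import subprocess

def rbd(*args):
    """Run an rbd CLI command."""
    subprocess.run(["rbd", *args], check=True)

# Cap a single image at 500 IOPS (librbd throttles on the client side).
rbd("config", "image", "set", "mypool/myimage", "rbd_qos_iops_limit", "500")

# Or set a default limit for every image in a pool.
rbd("config", "pool", "set", "mypool", "rbd_qos_iops_limit", "1000")
```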