From YouTube: Dynamic multi-cluster management with Rook for cloud native IaaS providers for the private clouds
Description
Presented by: Joachim Kraftmayer | Clyso
Over the last few years, we have been gaining experience with Rook in production. One of our challenges was to implement dynamic resource management between 50+ Ceph clusters. Kubernetes events dynamically and fully automatically distribute loads and capacity between Ceph clusters. This is done by removing single or multiple Ceph nodes from Ceph clusters while ensuring data integrity at all times. In the next step, the released Ceph nodes are integrated into other Ceph clusters as needed.
Okay, first I have a short introduction for myself. I mean, it's the first time for me in the U.S. in a long time; I'm more focused on Europe. I'm the Ceph ambassador for the DACH region and a member of the Ceph Foundation board. I was previously a SUSE Enterprise principal consultant, and I also worked for Red Hat as an enterprise consultant in Europe.
Sorry, is this better now? Okay, good. So, the company was founded in 2010, focused on infrastructure and platform as a service, based in Munich, and we are proud to be a member of the Linux Foundation, the Ceph Foundation, and also the Cloud Native Computing Foundation. So: dynamic multi-cluster management. First of all, I want to give some key facts, because this project already started in 2020, and I think it will go on for an additional two years.
So the main key points are: no vendor lock-in, open source, scalability, and so on. Pure IPv6. Everything in this project is Kubernetes-driven; microservices are the only workload, also in the core. And it's a project that's coming from the public cloud and going back into the on-premise private cloud. At the moment it has somehow a bit more than 200 petabytes of raw footprint in production, but it's also one of the projects where, at the end, we don't count the usable storage capacity; we just orient on the raw capacity, on how much the production environment will consume or already consumes.
So, the motivation for 50 or even more Ceph clusters, and why we need this dynamic multi-cluster management solution: we have different project requirements. We have exclusive use cases, we have shared use cases, we have failure domains, we have different availability zones, data centers, regions; we have security isolation. I mean, in the talk before... we spent a lot of time investigating how the Ceph CSI is doing encryption, and how Rook is doing this. And the underlay is completely built on, or relies on, Ceph: for the virtualization stack, for out-of-band management, for all the services you need to run a cloud in the Kubernetes way. Yeah, we have so many different requirements. Even this nice database as a service is something; that was a very, very good talk, and I think you could spend at least an hour on it.
So, when we come to the point that the project or even the user requirements are constantly changing... I think we talked about this today: can a client explain to you what kind of performance and what kind of capacity he really needs, and what the outlook is for the next two years? It's quite hard for them even to report:
"Okay, at the moment we have these IOPS or throughput needs, and that will grow", or perhaps it will shrink next year. Even when some of them update their software stack, they also don't have real information about how that will impact the back end. And out of this, it comes to a point where we can do some estimations, where we claim resources for specific customers, for specific projects, and even for the internal infrastructure as well. But we are never right; it's just an assumption, and then we have to adapt it afterwards again. And for this we implemented this dynamic multi-cluster management, so that we can somehow create an elastic back end that can adapt dynamically to the demands of the application stack above. So what does it mean?
One second: what does it mean? What we do is, we have different resource pools that are completely managed by Kubernetes, so we have no control over this. We get different events that tell us: okay, we need resources from this pool, and here one of the Ceph clusters is selected to give some capacity back to a common pool, because someone else is already somehow at the limit. And so we start processes to add and remove Ceph nodes from one cluster and introduce them to another one, completely controlled by Kubernetes. Rook and Ceph have the control over the data and how it should be managed, but the events that are created to distribute the data, or the Ceph nodes... yeah, that's one part.

Yeah, so once again: Kubernetes is informing us what Ceph should do; Ceph takes care of how it should drain Ceph nodes, and informs Kubernetes whether it will be possible, or even not possible, to drain them. Then it gives the nodes back, Kubernetes takes them into account and brings them to the next cluster again.
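As a rough illustration of that event-driven loop, here is a minimal sketch using the Kubernetes Python client; the event reason and the handler are hypothetical placeholders, not the project's actual controllers:

```python
# Minimal sketch: watch cluster events and react when the resource
# pools ask a Ceph cluster to release a node. The event reason
# "CephNodeRequested" and the reaction are hypothetical examples.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for item in w.stream(v1.list_event_for_all_namespaces):
    ev = item["object"]
    if ev.reason == "CephNodeRequested":  # hypothetical event reason
        pool = ev.involved_object.name
        print(f"pool {pool}: start draining a Ceph node for release")
        # -> here the real controllers would kick off the Rook/Ceph
        #    drain workflow described above
```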
So, I think that's not on this slide, but I will explain it later with an example. We also changed some parts of Rook; we also refactored Rook
at a certain point. We also asked: okay, who is responsible for what kind of management? And that's just one example, where we said: okay, we should delegate the responsibility for scaling the Ceph monitors back to Kubernetes. Because when Rook was introduced, there was only the Deployment available to manage replica sets, and I think two years later the StatefulSet was introduced to Kubernetes, and this one can take over, or can guarantee, that the Ceph monitors will be scaled one by one, in the right way.
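A minimal sketch of what that delegation could look like with the Kubernetes Python client, assuming, as in this refactored setup (not upstream Rook, which runs one Deployment per monitor), that the monitors live in a StatefulSet with a hypothetical name:

```python
# Sketch: let Kubernetes scale the monitor quorum. A StatefulSet
# creates and removes pods one at a time, in order, which is exactly
# the guarantee a mon quorum change needs. Names are assumptions.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

apps.patch_namespaced_stateful_set_scale(
    name="rook-ceph-mon",            # hypothetical StatefulSet name
    namespace="rook-ceph",
    body={"spec": {"replicas": 5}},  # grow the quorum from 3 to 5
)
```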
So, and this is the part I was talking about before: we get a Kubernetes event for adding nodes. We add the nodes to the CRUSH staging root, do the OSD tests, do the benchmarking, do the classification, because we have different fast and slow device or hardware storage classes in Kubernetes there, and then we wait, or check, whether the cluster load is below a specific threshold. And then we move the one node, or the multiple nodes, in CRUSH from the staging root to the production root, so that the customer experiences no difference in latency, IOPS, or throughput delivery.
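That staging flow can be sketched with the stock ceph CLI; the bucket and host names here are examples, and the real tooling is driven by the Kubernetes events described above:

```python
# Sketch of the CRUSH staging flow with plain `ceph` commands.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# One-time: a separate CRUSH root that no pool's CRUSH rule uses,
# so OSDs placed under it hold no production data yet.
ceph("osd", "crush", "add-bucket", "staging", "root")

# A newly handed-over node lands in staging; the OSD tests,
# benchmarking and device classification run against it here.
ceph("osd", "crush", "move", "node07", "root=staging")

# Once the tests pass and cluster load is below the threshold,
# promote the node into the production hierarchy; CRUSH then starts
# rebalancing data onto it.
ceph("osd", "crush", "move", "node07", "root=default")
```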
A
And
also
example,
how
this
is
implemented.
Is
that
how
you
remove
a
node
from
from
our
existing
root
cluster,
that
you
say,
Okay
I
have
the
oval.
The
kubernetes
controllers
told
you
okay,
we
need
a
fast
or
slow
note
from
this
cluster.
Please
free
it
up.
Then
Rook
is
verifying
the
user
capacity,
the
use
capacity
and
is
say:
okay,
we
have
enough
capacity
in
this
cluster.
We
can
give
it
back
to
to
you
then
the
next
step.
It's
very
do
the
verification
of
quality
of
service
requirements.
We also have here an extension to Rook, so that every RBD image has an IOPS and a throughput definition. Before we introduce the cluster, we do a basic test, or load test.
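Per-image IOPS and throughput limits of this kind can be expressed with librbd's built-in QoS settings; a small sketch, where the image name and limits are examples (the talk's extension lives in Rook, this just shows the underlying knobs):

```python
# Sketch: pin an RBD image to 200 IOPS and 50 MB/s using librbd's
# per-image QoS configuration options.
import subprocess

image = "rbd-fast/customer1-disk0"  # pool/image, example name

subprocess.run(["rbd", "config", "image", "set", image,
                "rbd_qos_iops_limit", "200"], check=True)
subprocess.run(["rbd", "config", "image", "set", image,
                "rbd_qos_bps_limit", str(50 * 1024 * 1024)], check=True)
```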
We try to find the boundaries of the existing cluster that we initially deploy, and then we can somehow create an estimation of how many RBD images of a specific storage class, say 100 IOPS, or 200 IOPS and 50 megabytes of throughput, we can deliver. And when we remove a node in this case, then we also have to reduce the number of images that we can deliver with a specific quality-of-service requirement. In the next step, we also verify that we can still handle the failure domain.
If you have, for example, a cluster deployed across multiple availability zones, then capacity-wise you could remove a node or multiple nodes, but perhaps you could not compensate for the failure of one whole availability zone anymore. That would also be taken into account in this process, and perhaps the removal would be denied.
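A toy version of that check; the policy shown (survive the loss of any single availability zone after the removal) is my reading of the talk, and all numbers are examples:

```python
# Toy failure-domain check: after freeing a node, the data must still
# fit in the surviving AZs if any one whole AZ fails.
def can_remove(az_capacity: dict, used: float,
               az: str, node_capacity: float) -> bool:
    caps = dict(az_capacity)
    caps[az] -= node_capacity  # capacity left after freeing the node
    # usable space in the worst case = everything minus the largest AZ
    worst_case = min(sum(caps.values()) - c for c in caps.values())
    return worst_case >= used

# Example: three AZs with 100 TB each, 150 TB used. Removing a 30 TB
# node from az1 still leaves 170 TB outside the biggest AZ -> allowed.
print(can_remove({"az1": 100, "az2": 100, "az3": 100}, 150, "az1", 30))
```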
Then, in the same way, we also say: okay, you perhaps have high usage on this cluster at the moment, and it would not be a good idea to put even more load on the cluster at this point; so we just wait.
Perhaps a day, or even a week, or we have a specifically defined maintenance window; then the load goes down, and then it's a good point to remove the nodes from this production cluster and, at the end, when they are drained, give them back to Kubernetes. And drained means that we really move the data away, so that the replication is always at three, or the erasure code is always completely fulfilled, so that we never end up in a situation where perhaps a node reboots, or something like this, and then you can only... yeah.
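In stock Ceph terms, draining a node and knowing when it is safe to hand it back can look roughly like this; the OSD ids are examples, and I am assuming the usual reweight-and-wait pattern rather than the project's exact tooling:

```python
# Sketch: drain a node's OSDs, then wait until every PG is
# active+clean again before releasing the node to Kubernetes.
import json, subprocess, time

def ceph_json(*args):
    out = subprocess.run(["ceph", *args, "--format", "json"],
                         capture_output=True, check=True, text=True)
    return json.loads(out.stdout)

for osd_id in (12, 13, 14):  # OSDs on the node being released
    subprocess.run(["ceph", "osd", "crush", "reweight",
                    f"osd.{osd_id}", "0"], check=True)

while True:
    pgmap = ceph_json("status")["pgmap"]
    clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
                if s["state_name"] == "active+clean")
    if clean == pgmap["num_pgs"]:
        break  # replication/EC fully restored; node holds no data
    time.sleep(30)
```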
So, what we also did in this time: we extended Rook around this question of data awareness, or, as I've always said: who is the leader at the end, who makes the decision about whether and how you can manage a Ceph node in Rook. We also removed some features, like the PG autoscaler, or disabled them.
Together with the customer, we also agreed at the beginning that they should not use CephFS, if possible. I think that will perhaps change in the next two years; the requirement will come up again and again that they will need CephFS. And we also agreed that they will handle the distribution of objects via RGW on the higher layer, not in the back end of Ceph itself.
What we also did: we extended the health states of Ceph. At this point, we have to differentiate between data replication, data health, and, for example, the status of the services. For this, we also have to suppress some health states of Ceph. Just a simple example is the crash warning, which we have to suppress because it's not meaningful in a setup like this. So what we did is, okay:
we set the warning for the crash topic to ignore, to suppress it, and just send the events to a central facility that will analyze what happened in each Ceph cluster.
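With stock Ceph, suppressing the crash warning while still collecting the reports could look like this; the central facility is the project's own tooling, so it is only stubbed out here:

```python
# Sketch: mute the RECENT_CRASH health warning cluster-wide, but keep
# shipping the crash reports to a central analyzer.
import json, subprocess

def ceph(*args):
    return subprocess.run(["ceph", *args], capture_output=True,
                          check=True, text=True).stdout

def forward_to_central_facility(crash):
    # hypothetical stub -- the real setup sends this to a central
    # service that analyzes what happened in each Ceph cluster
    print("forwarding crash", crash.get("crash_id"))

ceph("health", "mute", "RECENT_CRASH", "--sticky")

for crash in json.loads(ceph("crash", "ls", "--format", "json")):
    forward_to_central_facility(crash)
```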
We also extended Ceph, improved it with additional recovery options, so that we can also interact with the OSDs if there is really a problem. And we also extended the RBD metadata in Ceph itself for this project, so that you can store different metadata that's not in the direct context of Ceph, or even of RBD.
An example for the metadata: if you want to use it for a virtualization environment, and you want to add RBD images to a virtual machine, and you always want to have them again in the right order, you need something like a world wide name, so that you can really identify: okay, that's my disk at this position in the virtual machine.
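A small sketch of that idea using RBD image metadata; the key name, the value format, and the image name are my examples, while the talk's actual extension stores this in Ceph itself:

```python
# Sketch: tag an RBD image with a WWN-style identifier so the
# virtualization layer can always attach disks in a stable order.
import subprocess, uuid

image = "vms/customer1-disk0"            # pool/image, example name
wwn = f"naa.{uuid.uuid4().hex[:16]}"     # illustrative identifier

subprocess.run(["rbd", "image-meta", "set", image, "wwn", wwn],
               check=True)

# Later, the consumer reads it back to map the image to its slot:
out = subprocess.run(["rbd", "image-meta", "get", image, "wwn"],
                     capture_output=True, check=True, text=True)
print(out.stdout.strip())
```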
And yeah, like mentioned before, we already improved the RBD encryption for the Ceph CSI, but I think last week we decided to completely re-implement it, because we don't think that the passphrase to decrypt and encrypt the keys, the LUKS keys for each image, should be stored in the Ceph CSI.
So at the moment... or perhaps, no: last year we did a partial production setup that was around about 300 Rook clusters with between 16 and 35 OSDs each, so it's quite small. We tested several improvements in the setup, and it's still running in production.
Three new regions are being built up at the moment, and because of this, the data center failure domain is also still in progress. And also, now, after around about two years, the decision was made that we will not support virtual machines via the standard Kubernetes PV/PVC stack, because it makes the handling much more complicated, and the Ceph CSI is also somehow a limiter on using all the Ceph features that are really available. And I think with the SmartNICs...
yeah, it's still in progress. We had a stronger focus on this, especially also with NVMe over TCP for security isolation.
The one idea was to implement, or to directly use, the NVMe over Fabrics client of the SmartNIC and connect it directly to Ceph with NVMe over Fabrics as a security isolation, and also to hide even the information of the Ceph client from clients that have direct access to the hardware as well.
So what we achieved is that, from a client perspective, you have a hardware node and you see disks in the hardware node, but in reality it is just a mapping by the SmartNIC on this hardware node, where you have an RBD client; the config and the key live on the SmartNIC, which is taking care of this as well. It's one way to go, but it's not decided yet, and it's not 100% production-ready. Good, and yeah.
If you have questions... perhaps I did it a little bit wrong at the beginning, but it's also a quite complex project at the moment, and I have been involved in this for more than two years now, and it's really, really hard to compress this into 30 minutes. So if you have any questions, then please.
Audience: Sorry, maybe I missed this. You said you're not using the CSI driver for the virtualization stack?

As software for a normal Kubernetes cluster, so for client Kubernetes clusters, we will use it, or we are using it. But we also have a virtualization stack that's consuming the Ceph cluster, and the decision, I think, was made last week that we will not use it there anymore, that we have to re-implement it in the end.
We are talking about tens of thousands of virtual machines, and it brings no benefit for the underlay to handle it like this. And there are also, I mean, a lot of other services as well, and one is already taking care of the quality of service, and of how many images are allowed to be created on one single Ceph cluster, and this one can also take over the part that the Ceph CSI normally does. I mean,
from a general overview, it's quite complex: all the Kubernetes events, even how the decision is made where an image will be created in a Ceph cluster. And it's also not that transparent to us where this comes from and what the real use case for it is; so we're a bit blind in the underlay.
Yeah, correct; yeah, partially. I mean, we have to remove a lot of stuff from this, and we also have to extend our services there. I mean, it's not the main part for us as storage consultants; it's more or less the Kubernetes experts who make the decision, and we just support them in how they should do it.
Yeah, I think they tried to stay with Kubernetes the whole way and follow all of this, and then they realized after a while that it is not sufficient, and we have different other problems as well: how many pods you can deploy on one physical server, to get rid of these limitations too. But everything there is more or less related, yeah.
Any other questions? I mean, if you want, I can show you what we are also doing there as well, if you're interested. So, like I said, the NVMe over TCP SmartNIC support with 100 gigabit; and then we also have topics like ECN and DCTCP.
That's all part of the pure IP stack, the IPv6 stack. I think what I also did not mention is that the Ceph clusters and everything else just run with BGP, so no layer 2 anymore. Then we are working on the implementation of ublk with io_uring for RBD, and on backup, daily backup.
It's at exascale. We will improve Rook with the Ceph day-2 operations. We also still have a look at SeaStore and Crimson for the next OSD generation; in parallel, we try to improve the performance of BlueStore as well.
We are thinking about how we can integrate backup strategies into the technology stack, specifically for Kubernetes again. Because if you think about it, all or most of the backup solutions come from the Kubernetes side and always try to back up your whole PVC and all the metadata of your Kubernetes cluster. There are also some better approaches, where you can use the features in the underlay, in Ceph, to make it a bit more efficient, with data export for RBD as an example.
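At the RBD level, the efficient underlay export alluded to here is typically incremental; a small sketch with stock rbd commands, where the snapshot names and the target path are examples:

```python
# Sketch: incremental RBD backup. `export-diff` between two snapshots
# ships only the changed extents, instead of re-reading the whole PVC
# from the Kubernetes side every day.
import subprocess

image = "vms/customer1-disk0"  # pool/image, example name

subprocess.run(["rbd", "snap", "create",
                f"{image}@daily-2024-01-02"], check=True)

with open("/backup/customer1-disk0.diff", "wb") as out:
    subprocess.run(["rbd", "export-diff",
                    "--from-snap", "daily-2024-01-01",
                    f"{image}@daily-2024-01-02", "-"],
                   stdout=out, check=True)
```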
Yeah, and we are still on this topic with the Ceph namespace implementation in Rook upstream; I think that's been open for three years. So we will see if this... sorry, I mean the ticket for this is, I think, three years old, something like that. The last time we requested this, or asked for this, they just asked us: which customer is it, who wants to have this? Yeah.
Yeah... I mean, from what I know; but I will double-check this. Thank you very much, thank you for this. And also, yeah, storage as a service; and we're also working on some CephFS improvements. And we are really interested in the s3gw project as well, in what's going on there, especially what SUSE is doing with this S3 RGW; Patrick is the right name from the... yeah, yeah. Okay, yeah, I just saw it because of last summer at Foster, yeah. So that's it. Good.