Ceph Ceph Days NYC 2023, 17 May 2023

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Ceph Days NYC 2023: NVMe-over-Fabrics support for Ceph

Description

Presented by: Jonas Pfefferle | IBM Research

NVMe-over-Fabrics (NVMeoF) is a widely adopted, de facto standard in remote block storage access. Ceph clients use the RADOS protocol to access RBD images, but there are good reasons to enable access via NVMeoF: to allow existing NVMeoF storage users to easily migrate to Ceph and to enable the use of NVMeoF offloading hardware. This talk presents our effort to provide native NVMeoF support for Ceph. We discuss some of the challenges, including multi-pathing for fault tolerance and performance.

https://ceph.io/en/community/events/2023/ceph-days-nyc/

A

All right so I'm going to talk about damn Envy me over fabric support for Seth. Actually, the whole effort started almost like I think three years ago now um um we were basically asked to um see if we can support Seth with um dpu hardware and we actually tried it out. We ran it and at least back then you know the typical smart Nicks that you could buy, have very low power arm cores. So we actually saw very slow performance on on the dpus and that's why we decided.

A

Okay, um let's introduce Emmy, mirror Fabrics to Seth. You know why introduce it in the first place. I just gave an example, but you know you have the rbd2 for remote block storage.

A

um It's proven right, it's um it's reliable object, access to Charlotte and replicated or Erasure coded storage in the back right. So you have all these features. You have fault, tolerance and so on. So why do we need to introduce a new one? Well, for one thing is Envy mural Fabrics, as just mentioned, is more and more widely used.

A

Everyone is using it it's kind of the de facto standard to access remote block storage and if you already have a storage environment where you run enemy over Fabrics right, it's kind of natural to have other systems to integrate other systems like SEF star systems and with nvme over Fabric, and it also makes it easier to switch between those right and once you have a system like this in place and as I said before, this was the whole motivation for the work and how we started.

A

You can take advantage of dpus that have typically Envy me over fabric offloading, which probably you won't see, a dpu with Seth offloading, because that's that's not an easy task. um Okay, so what have we done?

A

um We basically looked at what has been done in the past, with iSCSI in how to integrate this and I think we have basically similar designs. So this is the overview. So, basically you have the ceph cluster on one side, then it talks the Raiders to SF enemy or a fabric Gateway, and the Gateway is just basically a bunch of processes.

A

I'm going to talk about some details later and then on the other side, you have nvme of fabrics in the form of TCP, RDMA or fiber channel, not sure if anyone is using fiber channel but and then you have the clients on the other side as initiated right and you, you can run this Gateway service. You can run this everywhere. You can run this co-locate it with your osds. You can run it on a standalone machine right.

A

The thing that you have to keep in mind is with with this Gateway in between you're, essentially adding a network hop, um that's just the way. It is because in just how Envy me or Fabrics is designed, it's a point-to-point storage protocol right.

A

um So let's see how this looks like so to give you some Envy me Basics right so um for the rest of the talk.

A

Typically, what you have an image, or a volume in nvme, is a namespace represented by a namespace and which we map to an RBD image right and the RBD image is mapped to object in your self storage and namespaces are logically grouped into subsystems, and each subsystem typically has can have multiple listeners attached to it and as soon as a client connects it, it creates a controller or there might be controller already there, but it connects to controller, and the controller is basically responsible to issuing the actual IU to executing your operations against the namespace, which should be mapped then to an RBD image Okay.

A

So what we've done is we introduced this um so I split this here in the view into a data path and a control path, so the control path is actually a little python process that runs that exports the grpc service, and we provide a CLI where you can make like a configuration run. Configuration commands again against the control Daemon and the control demon is basically there to set up the whole mapping right. So you can tell the control demon to do stuff like um a map.

A

This RVD image to this subsystem and this namespace ID create this listener. Where you want the subsystem to listen at so people can uh connect to it things like this. The control demon is responsible for Okay, so to basically have the configuration per system. What we do is we store the configuration in a um in a safe object actually, and we store the um the configuration itself in the home above the object.

A

So this the object map is basically key value store that comes with every object and Seth um that allows us to basically shut down the Gateway, bring it back up and restore the configuration right, but we use it also for something else which I'm going to show you in a second.

A

um But first, let's talk about the data path. What do we do for the data path? um We actually looked a bit in how we want to support this initial ideas was to build something ourselves or look for what's out there. You could also use the kernel for this right, but in the end we decided to go for spdk.

A

This is an open source project. It's called storage platform, development, kit, I believe, and it was it's mainly driven by Intel and it basically, the initial version of spdk was meant for user space access to local nvme storage, but it has been extended to be much more flexible and now it also supports nvme over fabric since a while- and they have this nice abstraction that you can create basically a Target and map it to a block device um abstraction in the back, and there are multiple implementations and they already had a RBD block device.

A

um Abstraction was very limited, was not very performant, but it worked so we could actually when we started this effort, you could actually already set this up and kind of show that that kind of the whole thing works um so be expanded on this and how it works is that our python Control process, basically issues rpcs against spdk to set up this mapping. So you issue rpcs against the control Daemon, and the control demon then sets up spdk accordingly right, okay, so um next thing right.

A

So, as I said before, the way how Envy mirror Fabrics is designed is a point-to-point protocol right? It's not! It's not doesn't support like a distributed star system in the back right. It assumes the path you connect to has all the storage information like the all the data behind it. So if you do IO against it all the data is behind that.

A

um That means is, we don't have um like you know, eraser coding or anything like this natively supported, so we have to introduce a multi-passing for fault torrents or uh to to improve your performance. um Obviously, Envy mirror Fabrics has features for for multi-passing and which we leverage. So what you basically see here is we have a grouping concept, so um a Gateway group is essentially multiple gateways that share the same configuration, meaning they export the same subsystems.

A

They support the same namespaces right and, and things like that, so the only thing different is obviously that the listeners on these two have different IPS and they can also have different ports, um and um so the way we do this is we basically sync on the configuration that we store in SEF and we have two ways to do this.

A

The first thing is that we use the watch, notify mechanism in SEF to update each other, and we also have a polling mechanism in the case that one of them crashes, while they do a watch, notify right to keep them consistent, and the cool thing is I. Think also that you can issue ipcs to any of the gateways in a Gateway group and they will sync with each other to make sure they are in consistent state right as through this uh or map uh in in the object.

A

Okay. So um now you have set up like multiple paths. We also support stuff like or be in the process of implementing this um it's natively supported by spdk. So it's basically just for us to add the rpcs for it that you can specify which of the paths are optimized which are non-optimized. So basically, this has a hint for the client which pass it should use to um to access a particular um Gateway. In this case.

B

A

Okay, of course, you can also use the same feature as I mentioned before, uh to do load, balancing or scaling, so you can decide right and with the AC asymmetric namespace access. What I showed you before? You can also steer um particular clients um of a particular subsystem to a Gateway and others to other gateways right, so you are you're pretty flexible in what you can do there Okay so right. So most of these features that I talked about except a a is already available right.

A

You can download this today and try it out um fingers crossed it works. uh You know how it is um so feel free and let us know, of course it's all ongoing work right. We are kind of we know. We need a lot more testing and things like that, so we are still at the beginning, um but we it's it's there, it's running. So it's not just me talking about it.

A

um Let's talk a bit about performance so um for performance. The goal was, of course, to come as close as possible to a non-gateway performance right. Yes, if you would access um basically directly your your RBD image from a host directly to the cluster now, as I said before, right and with the Gateway we're introducing basically an additional hop, and even if you co-locate the Gateway, with uh um with an OSD right.

A

That means that only for that particular OSD, your local, but for all the others, you would still be remote, so it will still be two hops um and we have a little. This was run on a little Seth on a little lab setup with the three note ceph cluster, um and this is on the left side is where we started on. The right is where we are now. So what you see here is that you have um on the on the y-axis you have.

A

Thousands of iops um might be a bit small, but hope you can see it and on the x-axis you have volumes number of volumes, so in increasing and um the blue bars are basically native access. We use fire with lip RBD, so from user space instead of the Chrono um and because to make it fair, because spdk also uses slip RBD right and on the right side we have spdk with an NVM uof Target and um the um talking to lip RBD on the back.

A

So how this looks like is you have spdk here um and you have a um so. We run spdk with NVM uof, connecting to our Gateway with also runs NVM spdk, which translates to basically rather's RBD in the back, um and this is how we started. So we were quite far away from our goal in having almost native performance, and this is where we are now I- think it's like uh I think we're eight percent away from native performance.

A

uh Obviously this is kind of small uh setup rights, only 32 volumes- and this was Envy me back and it's a 16 kilobyte ISO size. um It's it's kind of you know quite low uh in terms of what our three note cluster can do in terms of iops.

A

That's why we switched in this graph here um to a ram disk space, osd43, note setup and we use multiples of client instances in spdk to improve our performance and the access now is again thousands of iops, and then we have numbers of volumes per Sev client instance, and we run the total of 128 volumes um and having multiple self-client instances helps because the self-client instances have their own like kind of IO, threads and and operation threads um in in lib IBD. So what I? What I really want you to focus on?

A

I know it's a lot of bars, but the one with on the red is basically is the best performance in. In this scenario, you can achieve with eight volumes per Sev client instance. Again we have 16 kilobytes um and the total Q rep is 256 and around I think 16 yeah. These are number, of course, the different um bars that you see in around 16 um cores. We actually achieve line speed essentially, so this is a hundred gigabit Nic.

A

If you make the calculation 16 kilobytes divided by 100 gigabits, it's around 700 000 bit more then. Okay, there are some protocol overheads right. So I, don't think yeah, I I think I'm pretty comfortable with this number, at least for for this setup.

B

A

um And we are currently just going back to this again. We are currently in the face of actually testing to bigger scales like 256, 512 and and even more and trying this, of course, to run on a bigger cluster ninja star three note lab setup.

A

Okay, so let's talk a bit about what we've planned so um for um our planning. Is we have, of course, if you talk enemy or Fabrics, we want the discovery service and um for Discovery there is a native nvme over fabric discovery which current, which currently there are some there's technical previews to um to add, centralized Discovery, um which we want to do, but for now we're starting with just having a discovery service in the first place and Inter is helping us here and trying to introduce it.

A

Essentially, what we are planning to do is have this as a standalone service, um again synced via the config like we have for the gateways, and then these ones these two. In this scenario they advertise to the clients which pass they can use to get to a particular subsystem right.

A

That's all the discovery service does, so you can basically add gateways, remove gateways and the discovery service um will um tell all the clients that are connected to it, that there were updates and that there are New Paths available which passed they are supposed to take the next one is um it's also, of course, a big one, especially if you want to deploy it as an Enterprise setting right um identification and encryption support.

A

um So this one, what we had in mind for now um is that essentially you can. We want to have a plug-in architecture where you can add this to the to the python Daemon, your own identification or encryption method, and we as an example, wanted to add this, and essentially what we had in mind is: if, if you have a subsystem, what I said before each subsystem typically has its own listeners, if you restrict your subsystem to a particular IP port pair and also add some credentials for ipsec to it right it.

A

Basically, if a client connects to it, you um it's it's kind of authenticating against the subsystem. Obviously, it's not because it's ipsec right, it's um it's against the IP port pair, but it comes as close as possible without implementing nvmb inbend identification which we're going to have in the future, or at least we're planning to have. But for now we I don't know at least of a single implementation that implements nvme invent identification. The kernel doesn't have it, spdk doesn't have it so um yeah. This is the state we are at and I think this.

A

We can pretty easily Implement without a lot of effort, because you know ipsec, all this stuff is already there. We just have to plan it together and I. Think it's it's already like a usable approach.

A

Okay, so where are we now? As I said before? um This is ongoing work um you can try it out today. I showed you uh that the link, um so we want we plan the release initial release for uh Reef. It will be like a scaled down version from what I showed you. So um what we plan for now is have a single Gateway um with the grpc control and you will have the CLI. We will have to see, save and restore configuration, but we will not have yet the multi-gateway stuff.

A

It just needs a bit more testing to to refine and for the future besides the stuff that I talked about Discovery and so on. We also working on zeph, ADM integration, so I think for iscari. It's already there. You can basically say hey, give me iscos Gateway on this node and we want something similar for nvbof and we are working with Intel on something called adnn, and the idea here is what I said before. Maybe let me go sorry for the switching but I think he had this one.

A

So um the idea with adnn is that the problem is right, as I said before, if you, if you put a Gateway next to an OSD- and you always have additional Network hops right, because obviously your data is typically um spread among all these osds and the idea of adnn is to have a mechanism that they are also trying to get into the enemy specification, where you run basically the crush algorithm um in the Target to figure out which, um which of the OS sorry on the client.

A

Not on the target, you run the crush algorithm on the tag on the client to figure out which Target you need to go to where basically, your um your data is. That means you need to co-locate with each OSD um one Target, and then you basically do the same thing but and it's in a it's, the idea is to make it like lazily update the crash right.

A

So if you go to a wrong node use you, you still will be able to do the extra Network hop, but in most of the cases you will go to the node, where the actual data is so this is kind of an optimization. Intel is working on um and hopefully it it.

A

This will be added to to the enemy me spec, that that makes it more applicable to to implementing systems like this go back to back to the last one, all right, um of course, the whole thing wouldn't be possible with tons of people working on this, so this has been uh collaborative effort between all those companies listed and all those people listed. I hope, I, didn't forget anyone um so with this. Thank you very much.

A

um I hope. I see you one of the weekly calls or download the stuff. Try it out and let us know if it works or what issues you run into and I'm open for questions.

B

Kind of an easy one. So can you walk me through what the operational maintenance would look like, for example? What changes do you need to make to the Gateway if you have an nvme disk go down or if you have mtbf hit several disks in a cluster right.

A

So you mean like, if you have a disk down in the in in SEF or um a physical disk, so I mean that would be basically, you would deal with this on the safe side. Right since I mean we rely on. Basically um you you have the reliability from from. You have a normal RBD image right, so it's the same as before in in that sense right, so you need to replace your disk. You need to make sure uh your your objects are redistributed, these kind of things right it doesn't.

A

You don't have to reconfigure the Gateway, because the Gateway just is basically a safe client right it. It works like any other client, so it shouldn't notice actually that this went down or anything like that. Yeah.

A

The only thing where this might change is: if, if you have something like adnn, what I talked about right and now, maybe your backend search somewhere else that you need to basically update the crush and then on the client side. But again, this is future things so.

C

I had a question similar um I. You might have covered this, but it's a future question. Probably is there thought of and maybe Josh kind of hit on this pudding nvme natively in the OSD, because you're adding a network hop and a translation layer, you're going natively from one protocol to another right? Is there a thought to try and move to nvob natively in the osds and dual protocol like RBD and nvme as a performance, kind of improvement.

A

I think yeah. The problem is, as I said before, is that nvme, like from the spec perspective, is a point-to-point protocol right? It doesn't deal with. um You were talking with multiple paths, and the path is actually means one pass only as part of the image available. The other path has other part of the parts available right in. In that sense, you cannot natively integrate it other than you build something like this.

A

You you, you basically extend the nvme um spec to allow these kind of things right to have like client-side decision to know where to go to right. I.

C

A

That's the at least I see the only way forward to kind of make it more native in quote-unquote and to to avoid the network hop right.

D

Yeah, so if is there, is there any advantage to using nvme over fabric from a Linux client that already supports krbd or live RVD?.

A

I mean it's, it's it's more of a um if you are in the in the ecosystem, right you're already, maybe running nvme over ferries, at least in my opinion, right. Otherwise, if you're fine with running krbd, obviously um it might give you better performance. You might not have to deal with setting up the whole Gateway infrastructure, and things like that. So.

D

Yeah and I wanted to know like what they're is it technically possible to run the Gateway on the client side? Oh.

A

Yeah yeah, you can you you can just run. um You can just run this Gateway on the client and do loop back to it that that works.

D

Yeah then you avoid.

A

The Hop, actually, um while you mentioned this, there have been actually some people talking about this for uh having kubernetes support, because I think the problem at the moment is with krbd you're missing a lot of feature that lib RBD has that haven't been pushed into the kernel and one way to to basically run lip RVD and still have a kernel block device. You could do this or you do MVD right there, a bunch of options, but this was something that was discussed. Yeah.

E

Hey uh so a question for you on the performance side made more efficiency side, um so both like Target, CLI and RBD MBD suffer from a lot of memory, internal memory, copies and slow or actually High, CPU usage due to the the internals and I, was wondering if you guys have had to deal with much of that as you've been developing. This.

A

Yeah so I mean I, don't want to lie in the Gateway. Obviously, is heavy in terms of a number of CPUs I. Guess you need to push stuff around I mean if you want to push. If you want to run the whole self protocol and the mme ORF Target and want to push 100 gigabits right, you need a you need some course, and you also need memory right um so as you've seen before.

A

I think this one yeah, so this one achieved line speed with 16 cores, basically running at full blast in spdk I'm, not saying that this is the final thing my you might, there might be some improvements to it, but yeah you. You need to spend some of the performance, and then we looked into memory consumption, so there might be some ways to improve that in spdk and one other thing is also per default. Spd case just polling all the time on all cores.

A

um You can turn this off. They have a new feature where you can basically say: there's like it goes into a interrupt mode, the quote unquote, but yeah. These are definitely concerns um that that we are looking into and trying to like optimize this as much as possible to keep it down. Yeah.

E

Let's talk more later sure yeah.

F

I was going to comment on Dan's question um about. If you have RBD client.

C

Whether it be would.

F

There be any reason to you know, use nvme um and I would say: one potential benefit is on the client side. It could potentially be less less CPU as consumed. I mean it's really moving it to the you know the Target right, but you could also have a hardware hardware, nvme initiator, and then you know the CPU consumption on the client side would be. You know much lower.

A

Yeah, that's actually a good point. I mean the nvme client-side protocol is actually pretty slim, so it's much more efficient than running the full blown, uh let bar video krbd, but obviously it needs to be it's it's clear right from what I said before. One is point to point. The other one does the whole thing with figuring out where to go all these kind of things right.

A

A

I think there's one right here.

D

Doesn't anyone but want to put a native RBD client into a NIC into a smart Nick.

A

Yeah so so we talked yeah, we talked to silence um they, they have one, but I, think they. uh They are still a long way out to having actually something that's performant and that you can run on like on 100 gigabit, at least from from what I've heard.

G

Looks like you got a lot of the performance through concurrency, just adding more and more uh volumes were there other tunings that you've done to get the performance up to that, like 92 percent, roughly.

A

Right so that was mostly stuff we did with an spdk, so there were some issues with IO have like, as I said before, they had this RBD block device, but it was kind of just a showcase thing. It was not actually used and, for instance, all the I was funded through a single core, um so we had to fix that. Then there was a lot of like um how um Affinity of lip IBD thread was handled in spdk.

A

um We, we kind of fixed that. What we now have is that actually, when you create a client instance, you can tell it, which course the lip RBD threads should be running on and before it was like in the very first version, it would run on the same course that the target would be running which is obviously pretty bad, um and and now we have this option that you can actually move them, move them out.

A

So with these kind of improvements, there were a few more that had to get busy to this uh uh eight percent off for what we had at the moment here,.

A

um Yes, yes, yeah.

C

A

We do all right seems like, what's all the questions. Thank you. Everyone.