From YouTube: Ceph Month 2021: RBD Update
Description
Presented by: Ilya Dryomov
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
Hi everyone, my name is Ilya, I'm the tech lead of the RBD team. I will be giving an RBD update, going into a little bit more detail on the features that shipped in Pacific and also covering what you can expect in Quincy.
The advantage is that the image becomes available for use as soon as the migration is prepared, which is basically instantaneous, because it just creates an empty image and makes a record of the data source and the data source format in the image header. Supported data sources include a file, both local and remote, served over HTTP or HTTPS, and an object in any Amazon S3-compatible object store. That file or object has to be in either raw, QCOW, or QCOW2 format, but note that the advanced QCOW and QCOW2 features, such as compression, encryption, cloning (referred to there as "backing file"), external data file, and some others, are not supported in Pacific. So once the link to the data source is set up, the image, even though it is still empty, can be opened as if it was already fully imported.
A read on an uninitialized area is redirected to the data source, and a write on an uninitialized area checks if it overlaps with something initialized in the data source and, if so, triggers a so-called deep copyup, which pulls the data into the image that is being imported.
Of course, one thing to keep in mind is potentially high latency if your data source is remote, because the image can be literally anywhere. Here are some examples of how you might import a QCOW2 image from a remote HTTP server. The first one is the most straightforward and also the simplest: we just fetch the file, convert it from QCOW2 format to raw in a separate step, and then call the rbd import command.
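A rough sketch of that first method (the URL, pool, and image names are placeholders):

    # fetch the QCOW2 file from the remote HTTP server
    $ wget http://example.com/disk.qcow2
    # convert it from QCOW2 to raw format in a separate step
    $ qemu-img convert -f qcow2 -O raw disk.qcow2 disk.raw
    # import the raw file into a new RBD image
    $ rbd import disk.raw mypool/myimage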
So the image can be opened and used right after the rbd migration prepare command, that is, right after the first step. Then the rbd migration execute command can be invoked at any time to hydrate, or populate, the image by forcing those deep copyups to happen in the background, and this can happen while the image is being actively used. Finally, once the image is fully imported, the rbd migration commit command can be used to disassociate it from the data source, and at that point the migration is finished.
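A sketch of that three-step flow (the source-spec JSON follows the format documented for Pacific; the URL, pool, and image names are placeholders):

    # step 1: create the empty image and link it to the remote data source;
    # the image is usable as soon as this command returns
    $ rbd migration prepare --import-only \
        --source-spec '{"type": "qcow", "stream": {"type": "http", "url": "http://example.com/disk.qcow2"}}' \
        mypool/myimage
    # step 2: hydrate the image in the background, at any time, even while in use
    $ rbd migration execute mypool/myimage
    # step 3: disassociate from the data source once fully imported
    $ rbd migration commit mypool/myimage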
The second feature definitely worth highlighting, even though it did not get completed in Pacific, is built-in encryption.
The need for encrypting data on the client at image granularity, with per-image keys, comes up more and more, and not just from financial or other similarly regulated industries, but from small regular users as well.
If you don't want that, you basically end up having to abandon thin-provisioned clones in favor of thick copies, and to solve that, support for LUKS-based encryption has been incorporated within librbd.
Right now this is what LUKS defaults to on regular devices, and as far as the implementation goes, there are no reinvented wheels here, at least not yet.
We use libcryptsetup for working with the LUKS header and the OpenSSL library for the actual encryption. Both LUKS1 and LUKS2 formats are supported, and for LUKS2 we set the sector size to 4 kilobytes for better performance across the board. But this can hurt some workloads, because any writes smaller than or not aligned to 4 kilobytes will incur an expensive read-modify-write cycle.
So if you have a 512-byte workload, you might want to stick to LUKS1. Clone images inherit the parent's encryption profile and key, and unfortunately only flat, that is non-cloned, images can be encryption formatted in Pacific.
This means that the limitation I just mentioned is still there in Pacific, but it will be resolved in Quincy. Here is an example of encryption formatting an image and mapping it with rbd-nbd. The rbd encryption format command generates the LUKS header; sorry, generates a LUKS master key, adds the supplied passphrase to key slot 0, and then writes out the LUKS header. And because it is a standard LUKS format, the standard cryptsetup tool can be used to add additional passphrases or perform any other maintenance operations that you would typically do on a LUKS volume. Another cool consequence of using LUKS is that, as long as the image is not an encryption-formatted clone, which is coming in Quincy, its layout is understood, for example, by dm-crypt. So you can see an example here of mapping the same image with krbd, which doesn't know anything about librbd encryption, and we just open the LUKS container on it and it works.
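Roughly, those two examples look like this (the image name and passphrase file are placeholders; the option spelling follows the Pacific documentation):

    # format the image: generates a LUKS master key, adds the passphrase
    # to key slot 0, and writes out the LUKS header
    $ rbd encryption format mypool/myimage luks2 passphrase.txt
    # map it with rbd-nbd, with the encryption loaded in librbd
    $ rbd device map -t nbd -o encryption-format=luks2,encryption-passphrase-file=passphrase.txt mypool/myimage
    # or map it with krbd, which knows nothing about librbd encryption,
    # and open the standard LUKS container with cryptsetup
    $ rbd device map mypool/myimage
    $ cryptsetup open /dev/rbd0 myimage-plain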
Next up are two performance items. The first one is a vast improvement in small I/O performance, thanks to work that started in Octopus and was completed in Pacific. A single librbd client often topped out somewhere between twenty and thirty thousand 4-kilobyte IOPS, mainly due to issues with the internal threading architecture.
You know, one-off threads and thread pools within librbd and some in librados. This work resulted in generally lower latency and up to a three-times increase in IOPS for some benchmarks on all-flash clusters. Going forward, the switch to an asynchronous reactor model may also allow librbd and librados to integrate more tightly with SPDK.
In an ideal world, we would have a single reactor handling everything from SPDK, or maybe even QEMU in the very far future, all the way through librbd and the Objecter and down to the messenger layer in librados. That would be really cool and eliminate a lot of overhead along the way.
The second performance item is a client-side persistent write-back cache. For many different use cases and workloads, setting up a persistent write-back cache on the client is a sensible choice and an acceptable trade-off.
But the problem with doing it today is that layering something like dm-cache on top of librbd is too risky, because dirty cache blocks are flushed to the backing image out of order.
This is done according to whatever cache policy, but none of the supported cache policies can implement in-order flushing. And so with the current caching solutions, at least those that are widely used and easily available out of the box, you're sort of stuck at one of two extremes. Either the cache is not write-back and a synchronous write gets acked only when persisted in the cluster, which does not help the latency problem; or a synchronous write is acked when persisted on the cache device, but the write-back is not ordered, which leaves you with a backing image that is most likely corrupted from the file system or user application point of view, because it is essentially a mix of old blocks and new blocks, with old blocks potentially being really, really old for offsets that correspond to hot spots, such as the file system journal. As soon as that is off, even by a couple of bytes, the file system is gone.
They share a common core: the flushing logic, the I/O dispatch code, things like that. But they differ in the on-disk format and in how the cache device is accessed. In the persistent memory mode, which is referred to as RWL for historical reasons, the on-disk format takes advantage of the byte addressability of persistent memory, and it is just very simple head and tail pointers in the pool root structure, a contiguous log entry table, and a contiguous data area. The SSD mode is more complicated, because updates are done in 4-kilobyte blocks.
The log entry table is spread throughout the cache device in the form of a linked list, and there is a certain amount of coordination that needs to happen when the data is written, the log entry is committed, and then the root is updated. So it is more involved.
Additionally, everything gets zero-padded to 4 kilobytes, so you should expect some cache space wastage if your writes are smaller than that. For accessing the cache device, the persistent memory mode uses a library from Intel's Persistent Memory Development Kit, and the SSD mode reuses Ceph's block device abstraction, which was written for BlueStore and is basically just libaio plus the O_DIRECT flag to bypass the host page cache.
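For reference, enabling the cache in Pacific looks roughly like this in ceph.conf (a sketch based on the Pacific option names; the path and size are placeholders):

    [client]
    rbd_plugins = pwl_cache
    rbd_persistent_cache_mode = ssd             # or "rwl" for persistent memory
    rbd_persistent_cache_path = /mnt/pwl-cache  # where the cache device is mounted
    rbd_persistent_cache_size = 1073741824      # cache size in bytes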
The latency improvement is dramatic, particularly for 99th percentile latencies. Some reports claim an almost two orders of magnitude reduction in write benchmarks.
Latency reduction, I should say, which is huge for latency-sensitive workloads. But it currently has some rough edges. There is a cache reopen issue that affects both modes, the fix for which is expected in the 16.2.5 stable release, and, unfortunately, several stability and crash recovery issues that affect the SSD mode have been discovered after Pacific shipped. Most of them are already fixed in master, but a couple are still pending, and those fixes will be backported to future Pacific stable releases.
This is a couple of new RPC messages that allow coordinated snapshot creation. This comes up, for example, when creating scheduled mirror snapshots, which are initiated at the cluster level, so neither the hypervisor nor the user knows anything about them, even though ensuring that the file system, and possibly the user application, is checkpointed before taking a snapshot is always a good idea. By default, if any client fails to quiesce, the snapshot is not created; this behavior can be changed to ignore the error or to skip the attempt to quiesce entirely.
In Pacific, this has been wired up in rbd-nbd: if the --quiesce flag is supplied, the daemon attempts to freeze the file system mounted on top of its device before a snapshot is taken, and this is done from a simple shell script, ensuring application-level consistency. The goal is to integrate this into krbd and QEMU in the future.
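A minimal sketch of how that looks with rbd-nbd (the pool, image, and snapshot names are placeholders; the hook script can be replaced via the --quiesce-hook option):

    # map with the quiesce hook enabled; by default the hook is a shell
    # script that freezes the file system mounted on the nbd device
    $ rbd-nbd map --quiesce mypool/myimage
    # creating a snapshot now triggers quiesce before and unquiesce after
    $ rbd snap create mypool/myimage@consistent-snap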
But at least in the latter case it is somewhat challenging, because the quiesce ends up being initiated very low in the stack, in the block device driver, and it is a bit of a layering violation, because this information usually goes the other way in the stack.
Quite a lot of work in the Pacific cycle has gone into the kernel client. Support for the messenger 2.1 protocol was added in kernel 5.11 and is controlled by the new ms_mode mapping option. One caveat here is that currently there is no separate option for affecting only monitor connections.
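For example (mode names per the kernel client documentation; pool and image names are placeholders):

    # map over msgr2.1 with CRC integrity only
    $ rbd device map -o ms_mode=crc mypool/myimage
    # or over msgr2.1 in secure (encrypted) mode
    $ rbd device map -o ms_mode=secure mypool/myimage
    # prefer-crc and prefer-secure express a preference while letting
    # the cluster choose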
Also, the original messenger 2 protocol from Nautilus is not and will not be implemented, because it is deprecated and shouldn't be used. As a consequence, the kernel client requires a messenger 2.1-enabled release of Nautilus or Octopus, and you can see the particular versions on the slide. And, of course, the legacy messenger 1 protocol is still supported and will be for many years to come.
In fact, it is still the default in the kernel. We should probably change that.
So if your client is in or close to data center dc1, you would provide that CRUSH location string when mapping the image, and reads would be served by the OSDs located in dc1 for the PGs that have one.
There are some corner cases, though, where a read cannot be served by the replica, and in that case it is redirected to the primary OSD. This is the reason for needing Octopus OSDs for this to work; otherwise, you might run into data consistency issues. Also note that this feature only applies to replicated pools, because there was some confusion in the community about this.
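A sketch of such a mapping (the location string is a placeholder matching a hypothetical CRUSH hierarchy):

    # serve reads from the closest replica, as determined by comparing the
    # supplied client location against the CRUSH map
    $ rbd device map -o read_from_replica=localize,crush_location=datacenter:dc1 mypool/myimage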
The primary use case here is clusters that are stretched across data centers or cloud availability zones, where the primary OSD may be over a link that is not only higher latency but also higher cost, because inter-availability-zone traffic in the cloud is usually much more expensive. And finally, there is support for compression hints, which also landed in kernel 5.8.
You can use this to enable compression on an image that resides in a pool with the compression mode set to passive, or vice versa.
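For example (pool and image names are placeholders):

    # hint that this image's data is compressible, so that OSDs in a
    # passive-compression pool will compress it
    $ rbd device map -o compression_hint=compressible mypool/myimage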
In Pacific we have native Windows support. The Ceph client code was ported to Windows, so we now have librbd.dll, librados.dll, et cetera.
I/O requests go through to user space via device I/O control, which is much more efficient, and in user space the rbd-wnbd daemon picks them up, transforms them, and calls into librbd. This is similar in concept to rbd-nbd on Linux.
This daemon can also run as a Windows service, and in fact that's usually how it is run. In that mode it manages the mappings in a way that they're persisted across reboots, with the information stored in the Windows registry, and it also provides proper boot dependency ordering.
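From a PowerShell prompt, the flow is meant to look much like it does on Linux (a sketch; pool and image names are placeholders):

    # map an image; the I/O is handled by the rbd-wnbd daemon
    PS> rbd device map mypool/myimage
    # list the current mappings
    PS> rbd device list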
So RBD is really a first-class citizen on Windows now. There's also support for Hyper-V and, I believe, other integrations. Alessandro covered this in much greater detail in his Ceph on Windows talk earlier this month; the recording is already available, so I encourage you to take a look.
Starting with the usability and quality bucket, I already touched on encryption-formatted copy-on-write clones and persistent write-back cache improvements. This is the continuation and, hopefully, the completion of the work started in Pacific.
So that is, you know, thinly provisioned copy-on-write, as you would expect from a regular RBD clone.
Of course, it would not be compatible with standard dm-crypt anymore, so there will be a flag or a magic string to prevent the LUKS container open example that I showed on the previous slide from working.
As for the persistent write-back cache improvements, we already went over those: our crash recovery testing is going to be vastly expanded, and we also need to make the cache status easy to get and easy to interpret. Because, you know, tuning your cache... well, there aren't any tunables, but just seeing if it actually helps, and how much is dirty and how much is clean, is very useful.
Making the rbd_support manager module handle full clusters should allow removing images via the manager command interface when the cluster is full. This does not work today: the manager just hangs, and not necessarily inside of the rbd_support module.
So this is a wider issue that spans outside of RBD. Replacing the rbdmap script with a systemd unit generator would generate a systemd unit per device, allowing other units to depend on it and making systemd handle file system mounts and unmounts, including fancy things such as automounts that are triggered by accessing the mount point.
This is not a good user experience, so there will be a new option added to the rbd import and rbd export commands, and it would support both full and incremental updates.
Export and import of consistency groups is somewhat related: it should allow exporting and importing consistency groups in a manner similar to individual images. Again, both full exports and export-diff incrementals are meant to be supported, and this would introduce a new export file format for export files that can contain consistency groups.
On the multi-site front, we're looking to improve mirroring monitoring and alerting. The goal is to come up with a unified mirroring metrics schema, at least across RBD and CephFS, and expose these metrics to Prometheus.
Currently only journal-based mirroring is supported, and this would probably tie into the improved monitoring item, with perhaps fancier Grafana dashboards based on the exposed metrics. And snapshot-based mirroring of consistency groups should allow mirror-snapshotting consistency groups and then mirroring those consistency group snapshots, again similar to individual images.
For Quincy, on the ecosystem front, the big-ticket item is an NVMe-over-Fabrics target gateway. At a high level, this is similar to the existing iSCSI gateway, but hopefully more performant and scalable.
This would be used in the Kubernetes environment, where the rbd plug-in pod or container, which is responsible for mapping the image and which actually contains the rbd-nbd daemon responsible for the image, can be restarted or upgraded at any time, and we need a way to make that reattach safe.
It is relatively safe there because the environment is highly constrained, but to expose this functionality more widely, it is critical that there are some stopgaps in place, because currently you are, you know, one command, one keystroke away from corrupting the image.
A separate set of credentials, and so on. Improving the volume density, which again comes up in Kubernetes environments, is something that we're looking at for Quincy. And on the Windows support side, it would be nice to get a sustainable CI infrastructure set up, preferably upstream.
We do not have any test coverage there, and to add it we need somewhere to pull from and, you know, have them be up to date. And I think this is the last item: the QEMU block driver should receive a facelift in the upcoming 6.1 release of QEMU.
The most important thing is probably rbd write-zeroes support, which has been available since Nautilus, but the QEMU driver wasn't updated to take advantage of it. This should allow us to close some holes in our sparseness story, because in some cases, when importing an image (if you remember that example with the fancy qemu-img command that I showed on one of the first slides), that actually breaks sparseness, and this should help resolve that.
The second item is support for loading the encryption profile, so that an encryption-formatted image can be opened. I believe libvirt integration is planned as well, but that is blocked on this landing in QEMU first. And finally, there is a switch to the QEMU coroutine infrastructure from the legacy asynchronous I/O emulation that is currently in use.
Hopefully it made it into the recording.