From YouTube: Ceph Month 2021: Crimson Update
Presented by Samuel Just
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
All right, I'm going to give a bit of a rundown of some of the recent Crimson work and a more detailed dive into the ongoing SeaStore work. First, a bit of a refresher.
So what is Crimson? Crimson is an effort to replace ceph-osd with a new implementation, crimson-osd, which is better suited to the demands of next-generation storage hardware by requiring less CPU overhead per IO.
The aim is to improve throughput per core. The classic OSD is capable of driving a great deal of throughput; it just requires a lot of cores to do so, and we'd like to improve that. There are a few ways we're looking at doing this: we're using the Seastar framework to try to avoid context switching.
Let's talk a bit about what's been happening lately. Recent Crimson work has focused on a few broad areas: implementing RADOS features, data durability and reliability, visibility and debugging, and stability.

Recent work has focused on getting RBD workloads up and running, for a couple of reasons. One, it's in a way the most straightforward workload. Another is that RBD workloads tend to be the ones most sensitive to CPU overhead in the core write path, so it seemed like a good place to start so that we can start getting good performance information.
So, as Crimson is a drop-in replacement for ceph-osd, it also needs to implement all of the relevant data reliability and durability features that we rely on. Radek managed to merge the backfill implementation last year, with some initial testing, and Ronen did some work to refactor scrub to enable code sharing between the classic OSD implementation and Crimson, and the initial version has been ported to Crimson. Backfill should be the last piece needed to enable Crimson in Quincy to do appropriate failure recovery and rebalancing, and should allow Crimson to survive teuthology failure-injection testing. Scrub gives us a way to ensure that backfill is doing its job.
As Crimson starts to approach a state where a wider audience might be interested in testing and benchmarking, it's important to improve debugging and performance visibility. To that end, the team has been working on wiring up something like perf counters using Seastar's metrics framework, and on exposing these counters through a Prometheus endpoint for cluster-level statistics aggregation. Work has also been done to improve backtraces on crash, although more work will need to be done there.
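As a rough illustration of what wiring counters through Seastar's metrics framework looks like (a minimal sketch, not the actual Crimson code; the group and counter names here are invented), a counter registered this way is picked up by Seastar's Prometheus endpoint:

    #include <seastar/core/metrics.hh>

    namespace sm = seastar::metrics;

    class client_request_stats {
      uint64_t ops = 0;            // bumped on every client operation
      sm::metric_groups metrics;   // keeps the registration alive
    public:
      client_request_stats() {
        // Hypothetical group/counter names; anything registered like this
        // is exported through Seastar's Prometheus HTTP endpoint.
        metrics.add_group("osd_client", {
          sm::make_counter("ops_total", [this] { return ops; },
                           sm::description("client operations processed"))
        });
      }
      void on_op() { ++ops; }
    };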
The main focus lately, though, and going forward, has been stability. Crimson can now be tested with teuthology and has an initial set of tests in the crimson-rados QA suite. Tons of work has gone into expanding that set and fixing the crashes it exposes.
SeaStore, the new object store being developed for Crimson, is designed to avoid CPU-heavy metadata designs like RocksDB, and it's intended to exploit emerging storage technologies: ZNS, persistent memory and, in general, fast NVMe devices.
A
So
I
mentioned
zns
zns
is
a
new
nvme
specification
intended
to
address
challenges
with
conventional
ftl-based
flash
designs,
traditional
ssds
implement
what
is
essentially
a
log
structured
file
system
internally,
where
writes
are
performed
to
free
regions
of
the
disk,
with
a
dynamic
logical
to
physical
mapping
being
updated.
In
the
background,
doing
random
writes
to
one
of
those
regions
of
the
disks.
First
requires
relocating
the
data
and
erasing
it
because
you
can't
do
a
random
write
on
an
ssd.
You
need
to
do
writes
in
erasure
block
size
chunks.
The resulting write amplification is particularly a problem for new quad-level cell (QLC) flash technologies, which have relatively low write endurance. Where ZNS differs is that it changes the interface to the drive by dividing the drive into zones, which can only be opened, written sequentially, closed, and released.
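As a purely conceptual sketch of that zone lifecycle (not a real kernel or library API; the type and its fields are invented for illustration):

    #include <cassert>
    #include <cstdint>

    // Illustrative model of a single ZNS zone of `capacity` blocks.
    // Writes may only land at the write pointer, and the zone must be
    // reset (released) as a whole before it can be reused.
    struct zone {
      enum class state { empty, open, full };
      uint64_t capacity;
      uint64_t write_pointer = 0;   // next writable block within the zone
      state st = state::empty;

      explicit zone(uint64_t cap) : capacity(cap) {}

      // Sequential, append-only writes at the write pointer.
      void append(uint64_t nblocks) {
        assert(st != state::full && write_pointer + nblocks <= capacity);
        st = state::open;
        write_pointer += nblocks;
        if (write_pointer == capacity) st = state::full;
      }

      // Releasing the zone erases it wholesale for reuse.
      void reset() { write_pointer = 0; st = state::empty; }
    };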
A
The
other
major
technology
that
we
want
to
address
here
is
persistent
memory.
A
It
has
almost
dram
like
read,
latencies
right,
latency,
right,
latency,
dramatically
lower
than
flash
and
very
high
right
endurance,
so
it
seems
like
a
good
fit
for
persistently
caching
data
and
metadata,
particularly
data
metadata,
with
high
update
rates
that
would
otherwise
have
an
impact
on
the
right
endurance
of
an
underlying
qlc
device.
So the approach that's being discussed (this is a little ways out, hopefully this year) is to keep the caching layer in persistent memory and to maintain a copy-on-write extent mapping, maintained via the write-ahead journal, so that we can rebuild the cache mapping on restart.
So the main characteristics are that the store needs to be transactional; it's composed of a flat object namespace; object names may be large, and they may be under the control of the user, in the case of RGW; and each object contains a key-value mapping as well as the data payload. The key-value mapping is used for RGW bucket indices and CephFS directories, among other things. We also need to be able to support copy-on-write object clones for a lot of features, including snapshots and RBD snapshots, and we need to support ordered listing of both the omap and the object namespace.
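To make those requirements concrete, here is a hedged sketch of the kind of interface they imply (hypothetical types and method names, not SeaStore's actual interface):

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Illustration of the surface described above: flat object namespace,
    // per-object omap plus data payload, CoW clone, ordered listing,
    // with everything applied atomically via transactions.
    using object_name = std::string;       // flat namespace; names may be large
    using bufferlist  = std::vector<char>; // stand-in for the data payload

    struct transaction {
      // All of these take effect atomically when the transaction commits.
      virtual void write(const object_name&, uint64_t off, bufferlist data) = 0;
      virtual void omap_set(const object_name&, std::string key, std::string val) = 0;
      virtual void clone(const object_name& src, const object_name& dst) = 0; // CoW
      virtual ~transaction() = default;
    };

    struct object_store {
      virtual std::unique_ptr<transaction> create_transaction() = 0;
      virtual void submit(std::unique_ptr<transaction>) = 0;   // atomic commit
      // Ordered listings of both namespaces, as scrub and backfill require.
      virtual std::vector<object_name> list_objects(const object_name& after,
                                                    unsigned max) = 0;
      virtual std::map<std::string, std::string>
      list_omap(const object_name&, const std::string& after, unsigned max) = 0;
      virtual ~object_store() = default;
    };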
So the really high-level logical structure of SeaStore is that we have a root block pointing at an onode index, mapping ghobjects to onodes, each of which contains a pointer to an omap tree for that key-value mapping and a set of contiguous logical LBAs for the actual object data.
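A rough sketch of that logical layout (field names invented for illustration; the real SeaStore structures are more involved):

    #include <cstdint>
    #include <vector>

    using laddr_t = uint64_t;                            // logical block address

    struct extent_ref { laddr_t laddr; uint32_t len; };  // one contiguous logical extent

    struct onode {
      laddr_t omap_root;                  // root of this object's omap tree
      std::vector<extent_ref> data;       // logical extents holding the object data
    };

    struct root_block {
      laddr_t onode_index_root;           // root of the ghobject -> onode index
      laddr_t lba_tree_root;              // root of the logical -> physical mapping
    };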
A
So
why
do
we
need
an
lba
interaction,
it's
about
garbage
collection
and
relocating
data.
So
if
we
have
some
internal
structure
on
disk,
with
three
extents,
a
b
and
c
with
a
referencing
b,
referencing
c,
if
these
are
physical
mappings,
then
if
we
relocate
c
to
c
prime
here,
we
need
to
update
b
and
transitively.
Therefore,
eventually
update
a
once
b
gets
relocated.
A
By
contrast,
if
these
references
are
logical,
then
all
we
need
to
do
is
update
the
logical
mapping
for
c
the
logical
mapping
only
needs
to
maintain
these
are
the
sort
of
basically
64-bit
64-bit
mapping,
so
it
has
considerably
higher
fat
or
yeah
higher
fan
out
than,
for
instance,
the
o-mapper,
oh
no
trees,
so
there
should
be.
We
should
be
able
to
trade
extra
reads
in
the
lookup
path
for
lower
right
amplification
during
garbage
collection.
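Sketching that difference (hypothetical code, just to show the bookkeeping): with a logical mapping, relocating C during cleaning touches one LBA-tree entry instead of rewriting B and, transitively, A:

    #include <cstdint>
    #include <map>

    using laddr_t = uint64_t;   // logical address stored in referencing extents
    using paddr_t = uint64_t;   // physical location on the device

    // The LBA tree: roughly a 64-bit -> 64-bit map, hence the high fan-out.
    std::map<laddr_t, paddr_t> lba_map;

    // Garbage collection moves extent C from its old location to new_paddr.
    // B's reference to C is the logical address c_laddr, so B (and A) are
    // untouched; only the mapping entry changes.
    void relocate(laddr_t c_laddr, paddr_t new_paddr) {
      lba_map[c_laddr] = new_paddr;
    }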
So the on-disk layout of the journal looks something like this. Each journal unit is called a record, and a record contains a header with all of the usual checksum and length information; a set of deltas, which are mutation records for existing on-disk extents; and a set of new extents, either logical or physical blocks, where logical blocks are data, or things on the left side of the tree diagram really, and physical blocks are really just the blocks comprising the LBA tree itself.
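Schematically (invented field names, simplified from the real on-disk format), a record looks like:

    #include <cstdint>
    #include <vector>

    // Simplified sketch of a journal record as described above.
    struct record_header {
      uint32_t crc;            // usual checksum...
      uint32_t length;         // ...and length information
      uint64_t sequence;       // position in the journal
    };

    struct delta {             // mutation of an extent already on disk
      uint64_t target_laddr;   // which existing extent it applies to
      std::vector<char> bytes; // type-specific encoding of the change
    };

    struct fresh_extent {      // newly written block carried in the record
      bool is_logical;         // data / onode / omap blocks vs. LBA-tree blocks
      std::vector<char> bytes;
    };

    struct record {
      record_header header;
      std::vector<delta> deltas;
      std::vector<fresh_extent> extents;
    };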
A
So
here,
new
blocks
b
and
a
get
written
and
they're
sort
of
in
magenta
here
with
a
being
part
of
the
lpa
tree
and
d.
Prime
and
e
prime,
are
the
new
representations
for
the
d
d
extents
after
these
two
deltas
are
applied.
A
This
is
important
for
dns,
because
we
are
able
to
logically
mutate
extents
without
actually
mutating
them.
We
don't
need
to
actually
rewrite
dnd
or
go
back
to
where
they
are
on
disk
and
rewrite
them.
We
can
just
write
down
a
record
explaining
how
the
the
extent
should
be
modified,
there's
another
wrinkle,
which
is
that
with
zns
devices
we
don't
necessarily
it
might
be
possible
to
predict
where
this
record
will
show
up
on
disk,
but
it's
more
efficient
if
we
don't,
as
it
would
increase
concurrency.
A
So
these
internal
arrows
here
from
e
prime
to
a
and
from
a
to
b,
these
are
expressed
in
terms
of
relative
addresses,
so
that
when
we
read
this
record
back
off
of
disk
there's
a
step
where
we
need
to
adjust
any
of
the
in
extent
addresses
relative
to
their
own
base.
Basic
addresses.
This
way.
These
records
work,
regardless
of
where
they
actually
end
up
on
on
disk.
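A toy version of that replay-time fixup (hypothetical; the real code distinguishes several flavors of relative address): pointers written inside a not-yet-placed record are stored relative to the record start, and get rebased once we know where the record actually landed:

    #include <cstdint>

    using paddr_t = uint64_t;

    // A pointer inside a fresh extent, stored record-relative on disk.
    struct rel_ptr { uint64_t offset_in_record; };

    // On replay (or once the device tells us where the write landed) we
    // rebase every record-relative pointer to an absolute device address.
    paddr_t resolve(rel_ptr p, paddr_t record_base) {
      return record_base + p.offset_in_record;
    }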
A
There
are
two
major
divisions,
the
things
above
and
below
this
transaction
manager
concept,
transaction
manager
supplies
a
transactional
interface
in
terms
of
a
logically
addressed
set
of
blocks
used
by
data
extents
and
metadata
structures
like
the
gh
object
to
o
node
index
and
no
map
trees.
A
So
now
I'm
going
to
sort
of
talk
through
some
of
the
pieces
that
have
been
merged
so
far,
so
the
first
is
the
fl
tree
or
the
odon
manager
fl
tree
implementation.
A
If
you
look
in
the
source
code,
a
gh
object
is
really,
I
believe,
a
three
four
five,
a
seven
tuple
with
a
fixed
size,
prefix,
shard
pool
and
key
a
variable
size,
middle,
the
object
name
and
name
space
and
a
fixed
suffix
snapshot
in
generation.
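Roughly, the key the tree orders on looks like the sketch below (simplified and illustrative; the exact field names and types are worth checking against the source):

    #include <cstdint>
    #include <string>

    // Illustrative sketch of the onode-tree key described above:
    // fixed-size prefix, variable-size middle, fixed-size suffix.
    struct onode_key {
      // fixed-size prefix
      int8_t      shard;
      int64_t     pool;
      uint32_t    key_hash;   // the "key" component of the prefix
      // variable-size middle (may be large; user-controlled for RGW)
      std::string name;
      std::string nspace;
      // fixed-size suffix
      uint64_t    snap;
      uint64_t    gen;
    };
    // Lexicographic comparison over these fields, in this order, gives the
    // well-defined listing order that scrub and backfill rely on.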
A
The
ordering
here
matters
because
the
traversal
order
for
this
index
is
is,
is
used
for
making
scrub
and
backfill
behave
uniformly
across
nodes.
So
we
actually
need
the
listing
behavior
to
be
well
defined
here.
To save space, internal nodes drop components that are uniform over the whole node. So if you have a long extent of internal nodes that all map to the same pool, they can omit the shard and pool portions of the key, and so on. Internal nodes can also drop components that aren't needed to distinguish adjacent keys, so that if you have a sequence of nodes high in the tree, well above the point at which the object name comes into play, they don't need to include a name for their child blocks, which also helps to save space. This work was done mainly by Yingxin at Intel.
A
The
next
bit
is
the
v3
omap
manager.
This
is
also
a
bee
tree,
essentially
a
b3
structure,
and
it's
used
for
storing
the
omap
data
for
each
object.
It's
a
fairly
straightforward
string
to
string
b
tree,
contributed
mainly
by
chad
may
at
intel.
A
The
object
data
handler
is
the
component
responsible
for
actually
mapping
object,
data
to
logical,
logical
extents.
It
uses
the
lba
manager
to
avoid
needing
to
maintain
a
secondary
extent.
Map
clone
work
is
still
or
clone.
Support
is
still
a
to
do
and
requires
the
ability
to
relocate
logical
extent
ranges
which
is
future
work,
so
this
transaction
manager
is
the
sort
of
intermediate
layer
between
everything
I've
mentioned
so
far
and
the
lower
level
implementation
details.
Users of this interface can create dependent delta data types which can be included transparently in the committed journal record. This enables, for instance, the B-tree omap manager to represent the insertion of a key into a block by encoding just that key-value pair, rather than needing to encode a full new copy of the block, or everything after the point in the block where that key shows up. So it allows more compact encodings. Components using the transaction manager can also ignore extent relocation completely.
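As an illustration of the idea (hypothetical types, not the actual interface), an omap insertion delta only needs to carry the key-value pair plus the logic to replay it against the block:

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical sketch of a type-specific delta: inserting one key into
    // an omap leaf is journaled as just the pair, not the whole block.
    struct omap_insert_delta {
      std::string key;
      std::string value;

      // Replay: re-apply the logical mutation to the in-memory block image.
      void apply(std::vector<std::pair<std::string, std::string>>& leaf) const {
        auto it = std::lower_bound(
            leaf.begin(), leaf.end(), key,
            [](const auto& kv, const std::string& k) { return kv.first < k; });
        leaf.insert(it, {key, value});
      }
    };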
The cache component manages a set of in-memory extents, mainly including any dirty extents whose current representation differs from the on-disk representation due to deltas that have been written between when the extent was last written out and now. It is also the component responsible for detecting conflicts. Finally, the journal is responsible for actually handling the atomic write and replay of journal records.
A
The
segment
cleaner
is
the
component
responsible
for
garbage
collection.
It
tracks,
it
also
tracks
the
sort
of
usage
status
of
segments.
It
runs
a
background
process,
though,
within
the
same
reactor
thread
that
sort
of
in
a
loop
chooses
a
segment
relocates
any
life
extents
within
it
and
then
releases
it.
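In sketch form (a toy standalone version of the loop just described; in SeaStore this runs asynchronously inside the Seastar reactor, and the selection policy and I/O are stand-ins here):

    #include <vector>

    // Toy sketch of the segment cleaner's background loop.
    struct extent { /* a live extent within a segment */ };

    struct segment {
      std::vector<extent> live;          // extents still referenced
      bool empty() const { return live.empty(); }
    };

    struct segment_cleaner {
      std::vector<segment> segments;

      segment* choose_segment() {        // e.g. pick a segment worth cleaning
        for (auto& s : segments) if (!s.empty()) return &s;
        return nullptr;
      }
      void relocate(extent&) { /* rewrite the extent into a fresh segment */ }
      void release(segment& s) { s.live.clear(); /* mark reusable */ }

      // One pass: choose a segment, relocate its live data, release it.
      void clean_one() {
        if (segment* s = choose_segment()) {
          for (auto& e : s->live) relocate(e);
          release(*s);
        }
      }
    };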
So the status is that we have very initial OSD support just merged, in that there's a vstart --seastore command-line option that will allow you to start, and almost immediately crash, an OSD once you do any actual IO. In addition to crashing, it also doesn't support snapshots, among other things.
A
The
ongoing
work
will
be
to
stabilize
this,
to
the
point
where
you
know
all
of
the
features
that
exist
actually
work
and
then
to
start
working
on
optimizing
performance,
particularly
the
garbage
collection,
I
think,
will
be
an
important
focus
in
the
near
future
and
then
there
we
need
the
ability
to
remap
logical,
extents,
to
support
clone
and
then
slightly
further
out
we're
working
on
in-place
mutation
support
for
fast
nvme
devices.
A
What
I
described
here
is
so
far
is
focused
on
zns
and
slower
ssds,
where
functional
rights
are
helpful,
but
for
faster
devices
like
optane.
That's
not
really
helpful,
so
we're
working
on
adapting
the
internal
interfaces
to
enable
direct
mutation
for
devices
where
that's
efficient
and
then
we're
also
working
on
adding
support
for
persistent
memory
somewhat
further
out.
We
want
to
add
support
for
tiering
so
that
we
can
combine
fast
small
devices
with
lower
higher
capacity
devices,
for
instance,
persistent
memory
with
qlc
or
obtained
with
well
also
qlc.
Questions. There's a question in the pad: can you explain the role of SPDK? Is it layered above or below SeaStore?
A
Spdk,
if
it
comes
into
play
at
all,
will
be
below
it
that
so
the
the
low
level
I
o
details
need
to
be
plugged
through
c
store,
z,
star
rather
sorry.
So
that's
how
that
would
work.
It
would
be
an
alternative
to
the
posix-based
accesses
we're
using
so
far.
A
Oh
question
crimson
osd
going
to
support
bluestora.
Initially
it
actually
does
right
now,
so
c-store
is
a
whole
different
thing.
This
is
a
different
way
of
doing
storage.
Right
now,
crimson
osd
already
supports
blue
store
via
sort
of
a
queuing,
separate
threads
system,
so
yeah
crimson
sd
will
support
blue
store
and
does
support
blue
store.
A
Iops
performance
increase
of
crystal
osd
ssd,
with
blue
store,
not
directly.
No
sorry,
particularly
c
store
c
store
is
really
immature.
We
don't
really
have
data
on
that.
A
Though
it
didn't
so
the
design
of
c
store
didn't
really
come
from
a
university
research
project.
It's
pretty
similar
in
a
lot
of
ways
to
butter,
fs
and
other
log
structured
file
system
implementations,
and
it's
not
wildly
dissimilar
from
ffs
either.
I think the ZNS thing is something manufacturers are promoting because of demands from the market; it would enable higher-capacity, cheaper devices with less on-disk, or rather on-device...
There's one other question in the pad: is there integration with the NVMe-over-Fabrics gateway? No.
A
Do
you
mean
for
crimson
in
general
or
for
c
store
to
both
questions
at
the
moment?
The
answer
is
no
in
the
future,
possibly
depending
on.
[...] deals with the Seastar reactor and threading model correctly, and that is likely to be of use to the NVMe-over-Fabrics gateway as well. And there may be future work, or future things, where NVMe-over-Fabrics may be useful in Crimson, so there may be some overlap there.
Yeah, the goals would be a lot more stability, a lot more of the tests supported. I think that's the main progress marker for SeaStore; it'll look like OSDs don't crash anymore. That'll be good. And starting to evolve real performance information, to get a sense of which things are really important to attack next. But SeaStore actually has work being done on a lot of fronts, so it'll sort of depend on where different people put their focus.
My understanding is that the current crimson-osd is single-core. Yes, that's true. So what's the current thinking there?
I think the current thinking is still that stability is the more important component, but I do expect that to be addressed at least partially in the next year. It's going to be a fair amount of work, though. There will be a fair amount of underlying work to modify the messenger-OSD interface and the OSD backing-store interface, at least with BlueStore.
I think the main thing is that one of the biggest danger points of going to multiple cores is messing up the op ordering and correctness details, hopefully in fewer of the ways than we managed to in classic over the years, and having a decent test suite filled out will really help with that.