From YouTube: 2016-JUN-21 -- Ceph Tech Talks: Bluestore
Description
A detailed update on the current state of the Bluestore backend for Ceph.
http://ceph.com/ceph-tech-talks
So this is an updated version of a similar talk I gave at Vault about a month ago that covers BlueStore, a new, faster storage backend for Ceph. A little outline of what I'm going to cover: I'm going to spend a fair amount of time just giving some background on what part of Ceph this actually is, where it fits in, what we used to do and currently do with FileStore, and why the current approach we've had doesn't work.
I'll talk a little bit about NewStore, which was a first attempt to do something different, and then move on to BlueStore, which is the current effort: a totally new backend for the OSD. I'll talk at a high level about how it's structured, how the metadata is stored and handled, and how the data path works.
So I'll start by just motivating why we care, with a little bit of background. Ceph, obviously, is an object, block, and file storage system providing all those interfaces in a single cluster. It's designed so that all components scale horizontally: you can just keep adding nodes to get more capacity and performance.
It's architected to have no single point of failure, to be agnostic to the kind of hardware you deploy on — usually commodity hardware — to be self-managing wherever possible, and, of course, it's open source under the LGPL copyleft license. So Ceph is great. If you look at our early papers and the documents we used to write about Ceph, we would describe the system using phrases like "a scalable, high-performance distributed file system."
That's what the original Ceph paper called it, and we would usually say that Ceph is designed to provide performance, reliability, and scalability all in the same system. I mention that because performance has often been sort of a challenging piece, at least when you compare it to the raw performance you could theoretically get out of a piece of hardware, and a lot of that is due to the way that we're ultimately storing that data on disk.
Good, OK — the BlueJeans screen doesn't show it. Okay, so Ceph provides these three interfaces. The RADOS Gateway gives you an S3- and Swift-compatible object interface; RBD gives you a virtual disk device, a virtual block device, and it's used extensively in OpenStack; and CephFS is a distributed POSIX file system akin to NFS, but scalable and distributed and all that good stuff. All of this sits on top of RADOS, which is the piece that actually replicates all your data, distributes it across lots of nodes, makes sure that it's safe, and moves it around.
When nodes are added and removed, RADOS heals the system, so RADOS is sort of the key piece here. A RADOS cluster is structured as a series of hosts. Each host typically has a whole series of object storage device (OSD) daemons, each of which sits in front of a hard disk. So the hard disk is plugged into the system.
There's a file system sitting on top — normally it's XFS, but we can also use Btrfs or ext4 — and then the OSD daemon sits on top of that and writes files into that file system. So that's how it's deployed today. In reality, there's a module within the OSD called FileStore that's responsible for actually writing that data to the local file system sitting on that disk.
That particular piece is what we're looking to replace. FileStore implements an interface that we call ObjectStore. This is an abstract interface that describes how each OSD daemon stores data on its local disk; Ceph, the larger system, is responsible for replicating across multiple OSDs, but this particular part is just how to store data on one disk.
Originally we implemented a file system called EBOFS and a backend called FileStore, so we had two implementations of this interface. The ObjectStore interface is built around a couple of basic abstractions. There are objects, which are sort of like files: they store data, a bunch of bytes.
And there are collections, which are like little directories. The other key property of this local storage interface is that all writes are transactions: the interface has to make sure that whatever you hand it to do is applied atomically, consistently, and durably — it will actually be there when you lose power. We don't worry about the "I", the isolation, in the ACID sense of things, because that part is provided by an upper layer, so we don't worry about conflicting transactions.
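The all-or-nothing contract described above can be sketched as follows — a toy Python illustration with invented names, not Ceph's actual C++ ObjectStore interface. A transaction batches operations, and applying it either lands every one of them or none:

```python
# Hypothetical sketch of the ObjectStore-style transaction contract.
class Transaction:
    def __init__(self):
        self.ops = []  # queued (op, args...) tuples

    def write(self, obj, offset, data):
        self.ops.append(("write", obj, offset, data))

    def setattr(self, obj, key, value):
        self.ops.append(("setattr", obj, key, value))


class MemStore:
    """Toy backend: applies a transaction to a staged copy, then swaps it in."""
    def __init__(self):
        self.objects = {}   # name -> bytearray
        self.attrs = {}     # (name, key) -> value

    def apply(self, txn):
        import copy
        staged_objs = copy.deepcopy(self.objects)
        staged_attrs = dict(self.attrs)
        for op in txn.ops:  # any failure here leaves self.* untouched
            if op[0] == "write":
                _, obj, off, data = op
                buf = staged_objs.setdefault(obj, bytearray())
                buf[len(buf):off] = b"\0" * max(0, off - len(buf))
                buf[off:off + len(data)] = data
            elif op[0] == "setattr":
                _, obj, key, value = op
                staged_attrs[(obj, key)] = value
            else:
                raise ValueError("unknown op")
        # "commit point": both maps are replaced together, never partially
        self.objects, self.attrs = staged_objs, staged_attrs
```

A failed transaction leaves the store exactly as it was, which is the durability/atomicity half of ACID the talk is describing; isolation is the upper layer's job.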
They're always non-conflicting and isolated, but we do have to make sure that they're atomically applied to disk, so that if we lose power you get all of it or none of it. The first implementation of this ObjectStore interface was EBOFS, a userspace extent-based object file system we wrote back in like 2005, 2006, 2007, somewhere in there. It was a copy-on-write, B-tree-based extent file system in user space that implemented our customized interface.
Part of the reason we had transactions was that we had full control of the stack, and it was the most natural interface to accomplish what we needed to accomplish given the requirements of building the larger system. In that sense, I think it was great that we wrote it. In the end we got rid of it and instead switched to writing files into Btrfs around 2009, because Btrfs was just coming onto the scene.
It had all the features that EBOFS had and more, a whole community of people were excited about it, and we expected that we wouldn't have to write this part of the system ourselves — we could just rely on other tools. That didn't really pan out, but it brought us to where we are today, where everything is written using the FileStore implementation of this interface, so called because we write objects as files.
So each of these placement groups, or collections, maps to a directory in the file system, and each object maps to a file. We also have an instance of LevelDB that we store other stuff in: sometimes the attributes that we put on the object files are too big — each file system has its own peculiar limits — so to avoid hitting those we put many or large attributes in LevelDB instead, and we also store the omap key/value data in there.
Originally, the FileStore implementation was there just for development, so we could write code and run it in our home directory on a random machine without having to have dedicated disks that could be formatted and so on, but we sort of morphed it into something that we actually use for production in the real world. The structure is pretty simple: the OSD directory has a directory for each placement group, with a bunch of objects in them.
So the first problem area is the fact that our interface wants us to provide atomic transactions. We need this because the OSD is very carefully managing the consistency of all the data it stores locally, so that if it fails, it can very quickly recover and resynchronize with the other replicas. And so we need that transactionality.
Our initial attempt to get this atomic transaction support was to hook into Btrfs: we would mark when our transaction started and do all our work inside the transaction. This would prevent Btrfs from doing a commit while we were halfway through our work, so we would either get the whole thing or none of it committed atomically to disk as part of its internal checkpoints. That got us most of the way there.
We'd mark when the transaction started and when it ended, and bracket all our operations with those two calls, so that Btrfs would never do a checkpoint and commit everything to disk with only half our writes — we had to get all of it or none of it. So that got us part of the way there. We also had a mount option to make sure that when it did do a checkpoint, it would flush everything out.
But what would happen if the OSD crashed and we didn't finish writing our full transaction to the file system? Btrfs would have seen a transaction start and a bunch of writes — it would have done a bunch of stuff, scribbled all over the page cache and possibly the disk — but there would be no end, and we would never get the second half of the transaction, because the OSD process died. The only real way we could think of to get around that
would be to add this other, really very horrible mount option that would basically make Btrfs deliberately wedge itself — crash — if you didn't close a transaction. That was necessary because internally there was no support for rollback in this transaction machinery; Btrfs and other file systems aren't really meant to be transactional in that way. They're just trying to manage their own internal consistency, not to provide a higher-level transactional concept, and it's very hard to shoehorn that in later. So that didn't really work.
So instead, what we did is write a write-ahead journal: every transaction handed to this ObjectStore interface, we would serialize into a sequence of bytes and write to a journal on the disk. Once that fully committed, the transaction was stable, and we could take that same data and scribble it again across all the objects and metadata and so forth — write it back to the file system.
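The write-ahead journaling scheme just described can be sketched like this — a minimal, hypothetical Python illustration (invented names; the real journal is an on-disk ring of binary records):

```python
# Sketch of write-ahead journaling: serialize each transaction, append it to
# a journal, and only apply it to the backing store once the record is durable.
import json, hashlib

class Journal:
    def __init__(self):
        self.records = []  # stands in for an on-disk sequence of records

    def append(self, txn_ops):
        payload = json.dumps(txn_ops).encode()
        csum = hashlib.sha1(payload).hexdigest()
        self.records.append((csum, payload))  # "durable" once this returns
        return len(self.records) - 1

    def replay(self):
        for csum, payload in self.records:
            if hashlib.sha1(payload).hexdigest() != csum:
                break  # torn/corrupt tail record: stop replay here
            yield json.loads(payload)

journal = Journal()
store = {}

def submit(txn_ops):
    journal.append(txn_ops)      # step 1: journal write + flush
    for obj, data in txn_ops:    # step 2: apply lazily to the file system
        store[obj] = data

submit([["obj1", "aaa"], ["obj2", "bbb"]])
```

After a crash, replaying the journal reproduces any transaction whose record fully committed, which is exactly the stability point the talk describes.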
On Btrfs we could be a little bit clever. We would periodically take a snapshot of the file system, which does a full checkpoint, and then, after we did a checkpoint, we could trim journal entries from the past. Then, if the OSD ever restarted, we would just roll back to the most recent snapshot and replay the journal from that point forward. So we'd have a nice consistency model in that sense. On non-Btrfs systems, it wasn't so elegant.
We still did periodic syncs and trimmed journal entries, but on restart we just replayed the journal blindly, which meant that we might be repeating certain operations — and unfortunately, the operations that the ObjectStore interface supports aren't all idempotent. We have things like renames and clones and so on, and so there's a whole bunch of really ugly hackery in there to make sure that we don't replay those operations twice and don't scribble old events over new data and corrupt things. It's kind of gross, but it works.
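One common way to make blind replay safe — sketched here as a simplified, hypothetical illustration rather than FileStore's actual mechanism — is to stamp each journaled transaction with a sequence number and persist the last sequence actually applied, so replay can skip non-idempotent operations that already reached disk:

```python
# Sketch: sequence-numbered replay so repeating the journal is harmless.
class Backend:
    def __init__(self):
        self.data = {}
        self.applied_seq = 0  # persisted alongside the data in a real system

    def apply(self, seq, ops):
        if seq <= self.applied_seq:
            return False  # already applied before the crash: skip
        for name, value in ops:
            if value is None:
                self.data.pop(name, None)   # e.g. the source of a rename
            else:
                self.data[name] = value
        self.applied_seq = seq
        return True

def replay(backend, journal):
    for seq, ops in journal:
        backend.apply(seq, ops)

b = Backend()
# A "rename" of a -> b is a delete plus a write: not idempotent on its own.
journal = [(1, [("a", "x")]), (2, [("a", None), ("b", "x")])]
replay(b, journal)
replay(b, journal)  # replaying a second time changes nothing
```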
Obviously it works, because this is how every current Ceph system is deployed, using this layer of code — but it's awkward, hard to maintain, painful, and not particularly efficient. The main thing is that because we have this full data journal, everything we write, we write twice: we write it first to the journal, then we write it again to the file system, which roughly halves the available disk throughput. So that's sort of unfortunate.
The other area where POSIX is really getting in our way is enumeration. Ceph objects are distributed in a pool based on a 32-bit hash value, and we do enumeration of those objects in hash order. We do this in lots of different cases — we do it for scrubbing.
We do it when we're doing backfill, syncing objects across OSDs, and we also do it when you request an enumeration via the librados API and you're just listing objects in a pool. The problem is that using POSIX readdir to list these object files on the underlying file system doesn't work: readdir order is totally random — there's some internal hash function that's used, and it varies between different file systems and so on. So that's problematic.
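The contrast with a sorted key/value store becomes obvious when you encode the 32-bit hash as a fixed-width, big-endian prefix of the key: the store's native key order *is* the hash order you want. This is a sketch with an invented key layout (not BlueStore's real on-disk format) using crc32 as a stand-in hash:

```python
# Sketch: hash-ordered enumeration falls out of sorted keys for free,
# unlike readdir(), whose order depends on the file system's internals.
import zlib

def object_key(pool, name):
    h = zlib.crc32(name.encode()) & 0xFFFFFFFF  # stand-in 32-bit hash
    # fixed-width pool prefix, then big-endian hash, then the name
    return b"%08d." % pool + h.to_bytes(4, "big") + name.encode()

names = ["rbd_data.1", "rbd_data.2", "rbd_header.1"]
keys = sorted(object_key(1, n) for n in names)
# decoding the hash field back out shows the keys sort in hash order
hashes = [int.from_bytes(k[9:13], "big") for k in keys]
```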
We also need the ability to take a given collection and, in a fixed, constant amount of time, split it into halves or quarters or whatever. As part of the process of scaling Ceph clusters up, we need to repartition our data collections — our placement groups — and you can't do that with POSIX. You can't take a directory of a million files and in fixed time split it into two separate directories; that just doesn't happen.
So in practice, what we do in FileStore is build this sort of ugly tree of directories and then files, where the directory names are based on the prefix of the hash for that particular file. You get the sort of deep nested structure that might look familiar from what lots of other projects do, and then, when we have to do an enumeration, we can do it in a fixed order.
It's not particularly efficient, though, because you have this complicated directory structure, and when you hit thresholds — a certain number of objects — you suddenly have to split all these directories into smaller directories, and you get all this extra I/O to the disks, which some people notice as they're filling up the cluster. So it's definitely far from ideal, and we decided it was time to do something different: POSIX was causing more trouble than it was worth, and we wanted to do something else.
So we wanted to make a new implementation of this ObjectStore interface with several goals. We wanted more natural transaction atomicity. We wanted to avoid all these double writes, so we get better efficiency. We wanted object enumeration to be efficient. We wanted clones to be efficient: internally, we frequently have to take an object and make a clone of it — that's copy-on-write for snapshots — and we want to do that without actually copying any data.
On Btrfs we can do that, but on XFS we literally copy objects when you first touch them after a snapshot, which is less than ideal. We were targeting current-generation storage devices — hard disks, SSDs, and NVMe cards — and not really worrying about persistent memory, because we think that's going to be pretty different and the hardware isn't huge yet anyway. We wanted to make sure the code was structured so that there was minimal locking and better parallelism.
So we could, you know, go really fast on SSDs. We also wanted to finally implement the things that we were hoping the file system would do and that never really got delivered. We want full data and metadata checksums on everything we write — Btrfs does this, but we don't use Btrfs in production, for stability reasons — and we also wanted inline compression, which would be great because it lets you store more data.
Instead, an ordered key/value database is sort of a perfect match, because we have objects that have a very well-defined order, we want efficient enumeration and fast lookup, and there are lots of these databases out there. So NewStore was a combination of RocksDB — which we sort of picked semi-randomly — to handle all the metadata, and then the actual data for the objects was still stored in POSIX files with simple names. The idea was that you plug in your key/value database.
RocksDB was what we were targeting, but LevelDB would work — any key/value database wired into that sort of abstraction would work — and then the actual data for an object would just be written to a simple file with a very short name, in nice, big, efficient directories on XFS or whatever, to keep it very simple. So that's the idea with NewStore. It didn't really work very well. The main issue is that RocksDB has a write-ahead log.
That log is a journal RocksDB uses to manage its consistency, and the file system it's sitting on also has a journal that manages the file system's consistency. This whole journal-on-journal thing — you'll find papers written about it — has a very high overhead, because each journal is essentially managing only half of the overall consistency of the system, and so you pay the overhead twice.
You'd hit the device twice: NewStore would try to update the metadata for an object, so it appended a record to RocksDB, which would append to RocksDB's log file and then fsync it — so that would be another two I/Os, one to the RocksDB log file and then again to the file system journal to update the metadata for the log file. You'd end up paying like four I/Os when you really only want to pay two — four flushes instead of two flushes.
The real solution is to put everything in one big journal that manages the consistency of the whole thing. The other problem is that we still need atomicity for being able to do overwrites within the system. In POSIX, you can't take a file that already exists, with data in it, and overwrite some of that data as part of a larger transaction, because POSIX doesn't understand transactions and such.
You could make each object map to a whole bunch of different files with some weird mapping structure, but that gets complicated and inefficient. And so we sort of ended up again where we started, with write-ahead logging: we would log the data to RocksDB in a WAL record that says "I'm going to write this overwrite," commit that atomically with the metadata, and then asynchronously go overwrite the data in the file system. As a general approach that works fine — except that with NewStore it didn't.
With NewStore, the double write of the data ultimately doesn't pan out — which brings us to BlueStore, which is what we're actually doing. BlueStore is so named because it's a combination of "NewStore" and "block device," and we decided that spelling it "BlewStore" would not go over particularly well, so it's BlueStore with a "u." The basic idea here is that we consume raw block devices — a raw disk, /dev/sdb or whatever.
Allocation was previously something that we were picking up from ext4 or XFS or whatever we were using, and now we just have to implement it ourselves — but in exchange we get full control over the I/O path, and some things actually get much, much simpler. The key challenge here is that we have to share the block device with RocksDB, because RocksDB normally writes a bunch of files — its SST files and a log file — and we need to make that sit on top of the same block device.
We actually just want to get XFS or Btrfs or whatever completely out of the picture, and we do that by implementing our own RocksDB backend. RocksDB has a nicely abstracted Env class that captures all of the platform-dependent stuff it has to do, including file I/O, and we implement a very, very simple file system in user space called BlueFS.
So BlueFS, as I said, is a really simple file system, as you'll see. All metadata for BlueFS is stored in RAM: when you start up BlueStore and BlueFS, it loads the metadata and just keeps it in memory — super simple. Which means we don't have to store a free list, because we have all the forward pointers, so we can regenerate it as we mount. It uses really coarse allocation units.
One-megabyte blocks, just to keep things really simple, because RocksDB only ever writes big files and never writes small files — there's no reason to deal with small blocks. And all metadata has a single place it can be written on disk: everything lives in a journal. So the idea is that you just write to the journal.
RocksDB's write-ahead log — its WAL log file, its journal — is written into one directory, and its SSTs into one or two different directories, and we can map those to different block devices, so that the RocksDB log, for example, might go to an SSD or NVRAM, whereas all the other metadata might go to a slower device. The last thing is that BlueStore and BlueFS communicate, so that as BlueFS runs out of space, BlueStore gives it more, and as BlueStore runs out of space, BlueFS gives some back.
RocksDB was written to use these log files that are essentially just a journal, and it would just write a new log file every time, but that results in a pretty inefficient I/O pattern, because it has to append to the log file and then fsync it, and that has to update the metadata for that file. So every RocksDB log commit is at least two I/Os, for the data and the metadata. In contrast, every file system and database in the world that does logging uses a circular buffer.
A circular log just overwrites the same disk blocks over and over again, and it can detect where the end is based on checksums or some other scheme. So we implemented that in RocksDB: it recycles previously used log files and overwrites them, so that a commit is one I/O. That works with RocksDB on regular files and also on top of BlueFS.
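The "detect the end" trick for a circular log can be sketched like this — a simplified, hypothetical illustration, not RocksDB's actual recycled-log format: records carry a rolling sequence number, slots are overwritten in place, and a reader replays until the sequence stops being contiguous.

```python
# Sketch of a circular log: a fixed region of slots reused forever, with
# replay finding the logical end from the sequence numbers alone, so no
# separate metadata update is needed per append.
class CircularLog:
    def __init__(self, slots):
        self.slots = [None] * slots  # stands in for a preallocated region
        self.seq = 0
        self.head = 0

    def append(self, payload):
        self.seq += 1
        self.slots[self.head] = (self.seq, payload)  # overwrite in place
        self.head = (self.head + 1) % len(self.slots)

    def replay(self):
        entries = sorted(s for s in self.slots if s is not None)
        out, last = [], None
        for seq, payload in entries:
            if last is not None and seq != last + 1:
                break  # sequence break marks the end of the live log
            out.append(payload)
            last = seq
        return out

log = CircularLog(4)
for i in range(6):   # wraps around: slots are reused, no new "files"
    log.append("rec%d" % i)
```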
It helps our workload and also benefits other RocksDB users, and that's been upstream for probably three or four months now. So that's good. I sort of mentioned this before with BlueFS, but I'll make it explicit: BlueStore is designed to deal with multiple devices, so you have a couple of different scenarios.
The simplest is that you just have one device — just a hard disk or an SSD — and it puts everything on that one device: RocksDB is there, your object data is there, the journal's there, and it's just really simple and it just works. A slightly more complicated deployment is two devices. In this case, you could have a very small SSD or NVRAM device, put just the RocksDB log there, and then have the main device for everything else.
This is more or less equivalent to how people currently deploy Ceph FileStore with a journal device and a large device, at least as far as what it's doing internally, except that in BlueStore's case it's only a metadata journal, and so the journal device can be much, much smaller — 128 MB is generally enough. So you could have a single, very small, very fast SSD in a system and have lots and lots of disk devices sharing it, or maybe some NVRAM — 128 MB isn't that much.
A third option is three devices: the RocksDB log on the fastest device, the warm RocksDB data on an SSD, and then the really cold RocksDB data and the regular object data on the slow device — so there are several different options. The one thing that we don't support is BlueStore automatically tiering actual object data onto a fast device; we're only using the fast devices for metadata, but that's something that we may explore in the future.
So that's, at a high level, what BlueFS is and what it does. I'm going to talk a bit about BlueStore internally: how it represents metadata and how its internal data structures look. As I mentioned, BlueStore stores all of its metadata inside that key/value database. We partition that namespace into a bunch of different sections. The superblock section is just metadata about the whole system.
What the block size is, what configuration options you chose, and stuff like that. There's a section that we use for block allocation metadata — essentially keeping track of which portions of the disks are free and unused. There's a section we use for stats, just for counting up, you know, how many bytes are written, how many compressed bytes.
How many objects there are, that sort of thing — things you'd see in df. And then the interesting pieces: there's a namespace for collections — this holds the metadata that denotes the collection (placement group) information — and then a larger section that has the mapping of object names to object metadata; that's where most of the data goes. There's also a section we use for write-ahead log entries — I mentioned that, like NewStore, we do write-ahead logging for data.
Collection metadata is stored in a cnode, and in practice there's actually only one field in the cnode that matters: the number of bits. Each collection represents a shard of the overall pool namespace: all objects in a pool have a 32-bit hash associated with them, and a collection is sort of a fraction of that overall 32-bit space.
So basically, the name of the collection is the value of the hash, and the bits indicate how many bits of that hash are significant — how many have to match the object's hash in order for the object to be considered part of that collection. That's represented graphically here: you'll notice the placement group has this particular prefix, which maps to a hash prefix, and everything beneath that prefix belongs to it.
Where those 19 bits match, the object is in the collection, and the remaining 32 minus 19 bits can be different. This has a couple of nice properties. Obviously, we get ordered enumeration of objects: we carefully construct the key for each object pair in the database so that it sorts in exactly the order that we want Ceph to sort objects, and you'll notice the objects are in hash order.
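The cnode "bits" idea reduces to a prefix comparison on the hash, which is a sketch worth making concrete (simplified illustration; field names invented). Notice that splitting a collection is then just bumping `bits` by one — membership of every object is recomputable with no per-object work:

```python
# Sketch: a collection owns every object whose 32-bit hash matches the
# collection's hash in the top `bits` bits.
def in_collection(obj_hash, coll_hash, bits):
    shift = 32 - bits
    return (obj_hash >> shift) == (coll_hash >> shift)

coll = 0b1011_0000_0000_0000_0000_0000_0000_0000
a    = 0b1011_1111_0000_0000_0000_0000_0000_0001   # top 4 bits match coll
b    = 0b0011_1111_0000_0000_0000_0000_0000_0001   # top 4 bits differ
```

With `bits=4`, object `a` is in the collection and `b` is not; after a split to `bits=5`, `a` falls into the sibling child, which is exactly the constant-time repartitioning POSIX directories couldn't give us.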
That's a nice property that previously we had to do a lot of work to accomplish, and it's sort of trivial with the simpler key/value model. (And let me make my phone stop making that noise.) Most of the interesting stuff is actually in the onode. The onode stores per-object metadata, and it lives directly in a key/value pair: the key will be, roughly, the name of the object, and then the onode is all the metadata about it.
It serializes to hundreds to thousands of bytes — it might be a few kilobytes, depending on whether you have checksums enabled; we're doing some tuning there to make it smaller. But the main pieces of information in the onode are the size of the object in bytes — the logical size — and the attributes associated with it. Remember that an object has sort of small inline attributes, like "version=2", that sort of thing, that are stored inline with the metadata.
It has data pointers that indicate where the byte data associated with that object is stored on disk, and this is a two-level mapping. You'll have a logical extent that maps a range of the object to a range of a blob, and a blob maps to a particular range on disk and may or may not be compressed or have different checksums associated with it.
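The two-level mapping can be sketched as follows — a simplified illustration with invented field names, not BlueStore's actual encoded structures:

```python
# Sketch: logical extents carve up the object's byte range and point into
# blobs; each blob records where its (possibly compressed) bytes live on disk.
from dataclasses import dataclass

@dataclass
class Blob:
    disk_offset: int   # where the blob's bytes start on the device
    length: int

@dataclass
class Extent:
    logical_off: int   # offset within the object
    blob: Blob
    blob_off: int      # offset within the blob
    length: int

def resolve(extents, offset):
    """Map a logical object offset to a disk offset (None for a hole)."""
    for e in extents:
        if e.logical_off <= offset < e.logical_off + e.length:
            return e.blob.disk_offset + e.blob_off + (offset - e.logical_off)
    return None

blob = Blob(disk_offset=1 << 20, length=0x20000)
extents = [Extent(0, blob, 0, 0x10000),
           Extent(0x10000, blob, 0x10000, 0x10000)]
```

The indirection through the blob is what lets several logical extents (even from different objects, after a clone) share one region of disk.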
So onodes have extents that map to blobs, and blobs map to some region on disk. And then, finally, there's a field, the omap head, that indicates a prefix in the omap keyspace for all the key/value pairs associated with that object. So if you have user omap data — sorted key/value data — this tells you where to go find it.
So that's the onode. There's one other structure I haven't mentioned yet, called a bnode, and the reason it exists is that we also need to store the metadata about those blobs. You'll notice that the onode has the mapping of the object space to logical extents, which map to blobs, but it doesn't actually contain the blobs themselves.
Usually we store the blobs next to the onode, so the key/value pair will be the encoded onode with a bunch of blobs sort of appended to the end of it. But occasionally we have blobs that are referenced by multiple objects. This happens when we take an object and clone it — for example, for a snapshot — and then we'll have two separate objects that both have logical extents pointing to the same blob. In that case, we can't put the blob in the onode.
But regardless of whether those blobs are stored in the bnode or the onode, they're the same thing: just a map from an identifier — 1, 2, 3, 4, whatever — to the blob metadata. And when you point to a blob, a positive value means the blob is in the onode; a negative value means it's in the bnode.
But that means that if you have a bit that gets flipped on disk, there's some window where it's wrong before we actually detect the error, which is unsettling, because maybe you read the object before you notice that it doesn't match its replicas, which is unfortunate. And even if you scrub and do find the inconsistency, you might not necessarily know which replica is the wrong one — you might know that there are two copies that are the same and one that's different.
If you have three replicas, that's a pretty good indicator, although not necessarily a guarantee, because maybe there was some recovery or migration and you just copied the bad copy to another location — so you're never really sure. So with BlueStore, we want to validate a checksum on every read: we store a checksum for everything we write, and whenever we read something, we always check the checksum to make sure that it is actually what we meant to get.
That means that BlueStore blobs have to store more metadata than just where the data is stored — they also store the checksums for that data. We can use multiple checksum algorithms; the default is crc32c, which is sort of the industry standard. The only real problem is how much checksum metadata you end up storing.
It's doable, but it's sort of a lot, and there are lots of cases where we don't have to store that much and we can have larger checksum blocks. So, instead of checksumming every 4K, we could checksum 16K blocks or 128K blocks, whatever it is. We could also use smaller checksums.
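The granularity tradeoff being described can be sketched as one crc32 per chunk (using Python's zlib crc32 as a stand-in for crc32c): larger chunks mean proportionally less stored checksum metadata, at the cost of having to read and checksum a whole chunk to verify any small portion of it.

```python
# Sketch of the checksum-granularity tradeoff: one checksum per chunk.
import zlib

def checksum_blob(data, chunk_size):
    return [zlib.crc32(data[i:i + chunk_size]) & 0xFFFFFFFF
            for i in range(0, len(data), chunk_size)]

def verify_chunk(data, csums, chunk_size, index):
    chunk = data[index * chunk_size:(index + 1) * chunk_size]
    return (zlib.crc32(chunk) & 0xFFFFFFFF) == csums[index]

data = bytes(range(256)) * 1024          # 256 KiB of sample data
fine = checksum_blob(data, 4096)         # one checksum per 4K: 64 entries
coarse = checksum_blob(data, 131072)     # one per 128K: 32x less metadata
```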
What we do is sort of a matter of policy, but because we control the whole stack, we have a bunch of hints now that we can use to drive these choices. So, for example, if the client hints that this object is going to be written sequentially and read sequentially — for example, it's written by the RADOS Gateway and we never have small overwrites — then we might as well have large checksum blocks, so that we have compact checksum metadata on the side of the object.
Since we can't overwrite a small piece of something that's compressed anyway. And in the end, the plan here is just to have policies that you define on a per-pool basis. So you might say that this pool is used for RBD, it's going to have random 4K I/O, and so I'm going to have maybe smaller checksums, but at fine granularity, so I can do efficient overwrites, that sort of thing — and maybe this other pool is used by the RADOS Gateway.
It's all sequential, and so I'll have other hints that indicate to use different checksum policies. So that's the plan there. And then there's compression. 3x replication is obviously expensive: you have to buy three times as many disks as the amount of data that you're storing, and in fact, even besides that, anything scale-out is just inherently expensive, because you're buying a lot of something. At the same time, lots of the data that we're storing is highly compressible, so it seems like we could do better.
So BlueStore implements inline compression: it'll sort of magically compress things before they get written to disk, so it uses less space. It's a little bit tricky to actually implement it efficiently. We need largish extents on disk to get a compression benefit: you can't take a 4K write, compress it to, you know, 2K, and then write 2K, because the block size of the disk is 4K — it can't really get smaller than that anyway.
So we take larger blocks — say 64K or 128K — compress those down, and then write them in less space, which means you have sort of largish blobs, chunks, that are compressed into smaller pieces. And that works fine if you're just writing data and objects in their entirety into the system. The trick is when you need to support overwrites, so hopefully this diagram helps.
B
hopefully it makes sense to people. You have this logical mapping of an object, which starts on the left and ends on the right. Maybe initially you write the object sequentially, which translates into two big chunks that get compressed. These gray blobs indicate the uncompressed regions of the data, and then each gets compressed down to that blue thing, which gets written somewhere on the disk. Maybe it does that twice.
B
You have two big chunks, and then say later you come back and overwrite certain parts of it; maybe you overwrite a little bit over the first region. What BlueStore basically does is it just says: oh, you're occluding, you're overwriting something that was compressed before. We can't really touch the compressed thing, because it's all compressed, so we're just going to write that data somewhere else on disk and we'll logically point to it.
B
So effectively, part of the compressed data is obscured: it's still on disk, but you logically can't read it, because it's been overwritten by something else. And it might even be that we write something really small, smaller than the unit of allocation; say we write ten bytes that are inside a 4K block. We might still allocate that full 4K block for the data and then only point to that small logical region. That is the other reason BlueStore does this.
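The occlusion behavior can be modeled as a toy extent map where reads resolve to the newest write covering each byte; the class and method names here are illustrative, not the real C++ structures:

```python
# Toy model of extent occlusion. Each write appends a logical extent;
# reads resolve to the newest data covering each offset, so older
# (possibly compressed) blobs become partially "occluded": the bytes
# are still on disk, but no longer logically reachable.

class ExtentMap:
    def __init__(self):
        self.writes = []  # (logical_offset, data), in write order

    def write(self, offset, data):
        self.writes.append((offset, bytes(data)))

    def read(self, offset, length):
        out = bytearray(length)
        # Replay oldest-to-newest: later writes shadow earlier bytes,
        # exactly like a newer small extent occluding a compressed blob.
        for woff, data in self.writes:
            lo = max(offset, woff)
            hi = min(offset + length, woff + len(data))
            if lo < hi:
                out[lo - offset:hi - offset] = data[lo - woff:hi - woff]
        return bytes(out)
```

A small overwrite on top of an earlier extent leaves the old bytes in place but unreadable, which is the diagram's gray-under-blue situation.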
B
It has a bunch of heuristics to try to keep the resulting structure as simple as possible, and then the idea is that if it starts to get too complicated, where you have too many layers of occlusion, then it'll flip a little trigger that tells the OSD: this is getting too crazy, I'm just going to read the data and rewrite it in a more efficient format.
B
In the general case, you get relatively efficient I/O patterns and layouts, but if it starts to get too crazy, then we'll force a compaction, effectively, and write it more efficiently. I think in practice, most cases where you're going to enable compression are going to be sequential, and so you really won't trigger any of this weird layering and overwrite stuff.
B
So that's how BlueStore represents the mapping of an object to logical extents to blobs, and then the blobs map to disk. You'll notice that this blob structure tells us, you know, which type of checksum algorithm we're using and the checksum metadata, and if the compression flag is set, it will also tell you which algorithm was used to compress the actual data, so we know how to read it. Right now we implement zlib and snappy; you can plug in whatever other algorithm you want.
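A rough sketch of the per-blob metadata just described; the field names are assumptions on my part, and plain crc32 stands in for the crc32c that BlueStore actually uses:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import zlib

@dataclass
class Blob:
    """Illustrative per-blob metadata: where the bytes live on disk,
    how they are checksummed, and optionally how they were compressed."""
    disk_offset: int
    disk_length: int
    csum_type: str = "crc32c"          # checksum algorithm in use
    csum_block: int = 4096             # granularity of each checksum
    csums: List[int] = field(default_factory=list)
    compression: Optional[str] = None  # e.g. "zlib" or "snappy" when set

def checksum_blob(blob, data):
    """Store one checksum per csum_block of data."""
    blob.csums = [
        zlib.crc32(data[i:i + blob.csum_block])
        for i in range(0, len(data), blob.csum_block)
    ]

def verify_blob(blob, data):
    """Recompute checksums on read and compare with the stored ones."""
    expect = [zlib.crc32(data[i:i + blob.csum_block])
              for i in range(0, len(data), blob.csum_block)]
    return expect == blob.csums
```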
B
It's very easy to do that. So, the data path is what I'll cover next. This is basically how the code flows when we're taking data off the wire in the OSD and actually trying to write it to disk. There are a few basic concepts. We have the notion of a sequencer at the object store layer, which basically represents an independent stream of transactions being fed to the object store that need to be ordered with respect to each other.
B
So normally there's one sequencer per placement group, and each placement group is emitting an ordered set of transactions for that placement group that are updating an object and adding an entry to the PG log. But you have lots of placement groups on your OSD, so you have lots of these sequencers; you typically have like 50 or 100 independent streams of transactions that you can be working on concurrently, and only some of the actual transactions have to be ordered with respect to each other.
B
Each transaction is represented in memory inside BlueStore with what's called a TransContext. This is sort of a transaction in progress and all the state describing what it is currently doing. Then, at a high level, there are two ways that BlueStore will write data. Most of the time, we'll just do a new allocation; at its heart this is a copy-on-write, write-anywhere type file system, so any write that's larger than the min_alloc_size goes to a new, completely unused, unwritten, freshly allocated region of disk.
B
So we just find some new empty space on disk and we write the data directly out, and then we have to get the metadata to point to that new region of the disk. Once the I/O completes, once it's all done, we can commit the RocksDB transaction that actually points to it. So that's all fine and good in general. Sometimes you have small writes, though: writes that are smaller than the min_alloc_size.
B
For those we do a WAL-style, write-ahead-log update, where we commit the transaction that updates all the metadata, and part of that transaction will be an entry, a temporary entry, in RocksDB that says: I promise to overwrite this data over on these blocks of disk. Then after that commits, it'll asynchronously go and actually do that update to that previous location. So this is effectively data journaling; it's sort of what we used to do with NewStore and with FileStore. But you'll notice
B
we only do it when the write is very small, and the idea is that you'll have a knob, essentially, that you tune, so that if it's faster to do the write-ahead log, then do that, and if it's not faster, then you'll write it to a new place first and then update the transaction. And which is faster sort of depends on the properties of your storage device.
B
On a hard disk, you know, generally, anything under 64K, it makes more sense to do the write-ahead logging. On an SSD, currently we only do it for something less than 4K, but I have a feeling that even stuff that's larger than 4K, maybe 8K or 16K, might still be a win to do the write-ahead logging, and we need to do some testing to actually find out, because we don't all agree on whether that's the case. But it might be.
B
The main nice thing about write-ahead logging is that you have a single I/O to commit the transaction, and then you can acknowledge the write; it's stable once it's committed along with this write-ahead promise, and you can ack it back to the client. Whereas, if you're doing a new allocation, what you have to do is a write to the new space, wait for that to be durable, and then do the transaction commit, and then wait for that to be durable.
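The choice between journaling a small overwrite and allocating fresh space can be sketched as a simple threshold test; 64K and 4K are the numbers quoted in the talk, and the function name is hypothetical:

```python
def use_wal(write_len, device="hdd",
            hdd_threshold=64 * 1024, ssd_threshold=4 * 1024):
    """Return True if a small overwrite should be journaled in the
    key/value DB (one I/O, ack immediately, apply asynchronously)
    rather than written to freshly allocated space (data write,
    wait for durability, then commit the metadata transaction).

    The thresholds are tunable per device type; the talk notes the
    SSD value may well move up to 8K or 16K after more testing.
    """
    threshold = hdd_threshold if device == "hdd" else ssd_threshold
    return write_len < threshold
```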
B
B
So basically the idea is that each TransContext starts out in a prepare stage, where we prepare all the updates that we're going to make to the metadata: we figure out where we're going to write the data, we choose our disk blocks and everything. If there is data that's being written to disk first, then we'll initiate some I/O and then we'll go into the aio_wait state.
B
At that point, in the queued state, we might actually have to wait for a while, and the reason is that you'll have multiple TransContexts within a sequencer that have to commit in order, and maybe the one in front of us was doing I/O and it's waiting for its I/O, and we come after it, and we can't commit until the one in front of us does; it also has to commit first. They sort of are in a chain and they have to go in order.
B
So on the right you sort of have a picture of this, where you have a request that's in the kv_queued state, and the one in front of it is in aio_wait, waiting for its I/O, and so we're sort of blocked. But once we have a bunch of stuff in the queued state that isn't waiting on I/O, then they all go into the committing state; we give them to RocksDB to commit, we wait for that to actually happen, and if that succeeds, if we're lucky, then we're just done.
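The per-sequencer ordering can be sketched as a queue that only drains its ready prefix; the state names loosely follow the ones in the talk, and the class shapes are mine:

```python
from collections import deque

class TransContext:
    """A transaction in progress, with a simplified state field."""
    def __init__(self, name, needs_aio):
        self.name = name
        self.needs_aio = needs_aio
        self.state = "prepare"

class Sequencer:
    """Transactions within one sequencer (one PG) commit in order."""
    def __init__(self):
        self.q = deque()

    def submit(self, txc):
        txc.state = "aio_wait" if txc.needs_aio else "kv_queued"
        self.q.append(txc)

    def aio_finished(self, txc):
        txc.state = "kv_queued"

    def commit_ready(self):
        """Pop the prefix of the queue that is ready to commit. We stop
        at the first transaction still waiting on I/O, because nothing
        behind it may commit before it does."""
        batch = []
        while self.q and self.q[0].state == "kv_queued":
            txc = self.q.popleft()
            txc.state = "kv_committing"
            batch.append(txc)
        return batch
```

A later transaction with no I/O of its own still waits behind an earlier one that is in aio_wait, which is exactly the blocking case described above.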
B
It's actually not too bad; there are really just sort of two queues. Caching: so BlueStore implements its own cache in user-space memory. Remember, it's sitting directly on top of a block device; all of the I/O it does to that block device is now using direct I/O, so it's not using the kernel for any caching whatsoever. That's mostly the case; there are a few bits of code where we're in the process of removing that, but that'll be the end result, at least.
B
There's a structure called an onode space that caches a mapping of object names to the onode metadata. These are onodes that we've recently touched, that are in memory, that are all decoded; they're ready to use, we have everything ready to go. There's also a buffer space structure that's a mapping of object offsets to buffers, and that is attached to each onode in memory. Actually, it's at the blob level: each blob in memory has a buffer space that might have some buffers associated with it, so we sometimes cache actual data, too.
B
So both buffers and onodes have life cycles that are linked to another structure called the cache. It has a couple of different implementations: we have one implementation that is a trivial LRU, and we have another one that implements the 2Q cache replacement algorithm, which is much better than LRU; it's resistant to sequential scans, preventing those from pushing out hot data. So the cache manages the overall life cycle and trimming, and the basic idea here is that the cache is sharded for parallelism.
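A minimal sketch of the 2Q idea, assuming a probationary FIFO plus a protected LRU; the real 2Q algorithm, and BlueStore's implementation of it, have more detail:

```python
from collections import OrderedDict

class TwoQCache:
    """Simplified 2Q replacement: new entries go to a probationary
    FIFO, and only entries touched a second time are promoted to the
    protected LRU. A one-pass sequential scan therefore churns the
    FIFO without evicting the hot working set, which is the failure
    mode plain LRU has."""
    def __init__(self, fifo_size, lru_size):
        self.fifo = OrderedDict()
        self.lru = OrderedDict()
        self.fifo_size = fifo_size
        self.lru_size = lru_size

    def access(self, key):
        if key in self.lru:
            self.lru.move_to_end(key)           # refresh hot entry
        elif key in self.fifo:
            del self.fifo[key]                  # second touch: promote
            self.lru[key] = True
            if len(self.lru) > self.lru_size:
                self.lru.popitem(last=False)
        else:
            self.fifo[key] = True               # first touch: probation
            if len(self.fifo) > self.fifo_size:
                self.fifo.popitem(last=False)

    def __contains__(self, key):
        return key in self.fifo or key in self.lru
```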
B
So BlueStore might have many, many cores that are processing these transactions, and so we basically take the collections and we shard them and map them to different cache shards, and then effectively those will have some affinity to different cores or CPUs in the system. And we use the same mapping that the OSD also uses, a layer up, in its work queue.
B
So the OSD already shards requests across multiple cores by their collections, and then we use an identical sharding scheme lower down, so that the same CPU context will do all the OSD-level processing of the request and then the request to the object store, which will do a bunch more processing and, you know, twiddle the buffers' positions in the LRU or whatever, all within the same CPU context. So you won't have cache lines bouncing around between CPUs. We think that's going to work pretty
B
well. We haven't done extensive performance testing on that yet, but we think it'll work. The one thing that hasn't been done yet is that the I/O completions currently all happen in a second thread, and those might do some updates as well. So we may end up sharding the I/O completions also, so that the completions also happen on the same core, but we're not sure; we'll see what happens. I think that'll only matter on really fast devices like NVMe.
B
Let's see. There are a couple of other things that happen there. There's a FreelistManager, which is sort of a module that keeps track of the space on disk that is unused, that's in the free list; it's responsible for having a persistent representation of what parts of the disk aren't being used. The initial implementation was just based on extents, so you have a bunch of key/value pairs in the database that each have an offset and a length: a region of the disk that's not being used.
B
It would have an in-memory copy, so it'd know which keys to delete and update when you do an allocation or deallocation. The problem with this approach was that it enforced an ordering, because you had to delete the old keys and insert new keys, and if you reordered those transactions it would corrupt its representation. So you ended up, with this older implementation, having to serialize all of those allocations and deallocations in a single thread.
B
We replaced this with a bitmap-based approach, where we basically have an offset on the disk mapped to a bitmap that represents a bunch of blocks starting at that offset, and these are relatively small key/value pairs. So a key might only cover 128 blocks, which, divided by 8, is only 16 bytes of actual bits for a particular region.
B
We have a bunch of these keys, and then we leverage the merge operator in RocksDB, which sort of does a deferred XOR of the operands when RocksDB does its compaction and so on. So instead of doing a put into the key/value database, we do a merge, which basically just tells it which bits to flip, either because they were allocated or deallocated, and RocksDB does that efficiently in the background later, when it goes and compacts things. The nice thing about that representation,
B
this new scheme, is that there's no in-memory state and there's no ordering constraint, as far as having to serialize all the transactions in the system. It might sound kind of weird that I say there's no in-memory state, because it's our free list and we need to know what is or isn't used. That's because this is the FreelistManager module, which is just responsible for persisting the free-list representation.
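The merge-operator trick can be modeled like this: allocations and frees both queue XOR masks, which commute, so no ordering and no in-memory free-list state is needed. The class and sizes here are illustrative, not the RocksDB API:

```python
# Toy model of the bitmap free-list update via a merge operator.
# Instead of read-modify-write (which forces ordering), each allocate
# or free just records "flip these bits"; the store XORs the queued
# operands together lazily, so the operations commute and need no
# serialization.

class MergeStore:
    def __init__(self, blocks_per_key=128):
        self.db = {}        # key -> committed bitmap (as an int)
        self.pending = {}   # key -> XOR of queued merge operands
        self.blocks_per_key = blocks_per_key

    def merge_flip(self, block):
        """Queue a bit-flip for one block. Alloc and free look the
        same: both toggle the bit, which is why XOR works."""
        key, bit = divmod(block, self.blocks_per_key)
        self.pending[key] = self.pending.get(key, 0) ^ (1 << bit)

    def compact(self):
        """What RocksDB would do in the background: fold merges in."""
        for key, mask in self.pending.items():
            self.db[key] = self.db.get(key, 0) ^ mask
        self.pending.clear()

    def is_allocated(self, block):
        key, bit = divmod(block, self.blocks_per_key)
        val = self.db.get(key, 0) ^ self.pending.get(key, 0)
        return bool(val >> bit & 1)
```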
B
We have a separate module, called the Allocator, that's responsible for deciding where we should allocate new data, and that obviously does still need to have state, because it needs to know what parts of the disk are in use. So it's also an abstract interface; we can plug in different implementations.
B
The first implementation was affectionately called the stupid allocator. It was extent-based; it would sort of bin free extents by how big they were, and when you allocated something, it would try to get something that was big enough, but not too big, sort of nearby to wherever you hinted for the allocation. It works pretty well. Unfortunately, though, it has a very variable memory usage if your device gets fragmented.
B
So there is a new implementation, called the bitmap allocator, that SanDisk wrote, that uses bitmaps to indicate which parts of the disk are in use. And it's not just a single bitmap where it's one bit per block; it also has a hierarchy of indexes layered on top of that to indicate whole regions of blocks that are either completely used or completely unused. So if you're looking for a large extent, you can look at the higher-level indexes to find big chunks more efficiently.
B
And the nice thing about this implementation is that it has a fixed memory consumption per terabyte of disk space: it uses about 35 megs of RAM per terabyte, which is predictable, and you can plan around it, you can just budget for it, and it's relatively compact, pretty reasonable. So that's sort of the new default; that's what's going to happen. I mentioned that the allocator is pluggable, so we also have an intern, a Google Summer of Code student, working on adding native support for SMR hard disks.
B
These are these new, annoying devices that the manufacturers are producing that write data on disk in an annoying way that prevents you from doing arbitrary overwrites. The disk is separated into all these zones, or bands, that have to be written sequentially; not necessarily all at the same time, but you have to sort of write in order. If you go back and overwrite something, it'll sort of corrupt what comes after, so you have to write them in these sort of stripes.
B
So there's a library, libzbc, that lets you query the zone layout and manage the write pointers and so forth, that we're going to be using. The current crop of prototype devices that we're experimenting with are host-aware, which means that the disk will sort of let you do whatever you want, but it'll get really inefficient as it sort of tries to fix things up behind the scenes.
B
But the goal is to make BlueStore work with either host-aware or host-managed drives, so we'll sort of carefully make sure that we're writing in an appropriate way, so that we're using the disk efficiently. The SMR allocator will become very simple: we'll just need to keep track of the write pointer per zone, and we know that everything after the write pointer is unused, because of the way that we're forced to write to these disks. That is nice, because the allocator has almost no state.
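That near-stateless SMR allocator can be sketched with just a write pointer per zone; the zone size and names here are illustrative:

```python
# Sketch of an SMR-style allocator: each zone keeps only its write
# pointer. Everything past the pointer is known to be free, and
# allocation within a zone is always sequential.

class SMRAllocator:
    def __init__(self, zone_size, n_zones):
        self.zone_size = zone_size
        self.write_ptr = [0] * n_zones   # the allocator's only state

    def alloc(self, length):
        """Place a write at the write pointer of the first zone with
        room, advancing that zone's pointer. Returns a disk offset,
        or None if every zone is full."""
        for z, ptr in enumerate(self.write_ptr):
            if ptr + length <= self.zone_size:
                self.write_ptr[z] = ptr + length
                return z * self.zone_size + ptr
        return None

    def reset_zone(self, z):
        """Reclaiming a zone rewinds its pointer; the zone must then
        be rewritten sequentially from the start."""
        self.write_ptr[z] = 0
```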
B
So this is sort of a work in progress; we'll see how it goes. These devices are going to be cheaper and bigger, and so it'll pay to support them. They're never going to be as fast as sort of more normal hard drives or flash, obviously, but they're good for capacity plays, where you're building archival clusters, where you're just shoveling in as much data as possible, and you're probably compressing it, and so on.
B
So that's BlueStore; that's how the data and metadata paths work, at a high level. Now I'm going to talk a little bit about performance. These graphs were produced a month and a half or more ago, with an earlier prototype version, so they're pretty preliminary and they're not super detailed, but they give you a taste of what we expect the final version to look like. This is just looking at sequential writes on a standard spinning
B
hard drive. You'll notice that for large I/Os we're exactly twice as fast, which is basically what you'd expect. Notice that FileStore is blue and BlueStore is red, so that's sort of annoyingly inconvenient. Alright, so in the large I/O case, FileStore has to double-write everything, to the journal and to the disk. We don't do that.
B
We write it just once and then update the metadata, so we're twice as fast. For small I/O we're not quite twice as fast, but, like, you know, 60% faster, and much more predictable, and the latency is, yeah, it's better. So we're pretty happy with this. Random writes look good; they're also about twice as fast, almost twice as fast. On the left is streaming throughput, and on the right is IOPS, so you can sort of see detail on both ends.
B
The one sort of interesting thing here is that there's this little kink between the 32K and 64K writes. That's because that's where we transition from doing the write-ahead logging, where we'd sort of journal the overwrite update and then go do it asynchronously, versus just writing to a new region of disk. And you'll notice that it's around this region that there's a trade-off, and we set it around 64K because we don't want to have highly fragmented objects on the disk.
B
B
B
Those were all on hard disks. We did do experiments with SSDs and NVMe; unfortunately, I don't have graphs for them. On NVMe, random writes are good; they're much faster. Our testing when we were doing this was seeing some anomalies because of the kernel; there was like a weird issue with the driver on the CentOS kernel we were using on the machine, and it was confusing. That got sorted out later, but I didn't end up redoing the tests. But it's faster; it's just not as fast as we want it to be.
B
It's similar to the hard disk result, in that BlueStore is basically two times faster, except that on the SSD the small-I/O benefit was more pronounced. So, whereas with hard disks BlueStore is a little less than two times as fast, with the SSD BlueStore is a little bit more than two times as fast for small I/Os; for large I/Os it's like exactly 2x. Yep, so there's more work to do there, but overall we're pretty happy.
B
Just sort of 2x across the board, as a very approximate improvement, is pretty compelling. So, current status: lots of the stuff is done. We have a fully functional implementation; it's in the master branch. The I/O path works, it's stable, it does checksums and inline compression; it all works. There's an fsck that you can run either explicitly, or you can set an option
B
so that it does it every time it starts up and shuts down, which we do during QA. And we have these new bitmap-based allocators and free lists that are implemented and that work. Our current development efforts are focused in a few areas. The main thing right now is we're focusing on making the encoding of the metadata for our onodes more efficient.
B
Once we added compression and checksums, suddenly we're storing a lot more metadata per object, and we need to make sure we store it very efficiently in RocksDB, or else RocksDB gets big, and when it does its compaction, it just generates lots of I/O. So we're doing a lot of that performance tuning. There's also an effort underway at SanDisk to take ZetaScale, which is a key/value database that they designed specifically for flash and recently open-sourced,
B
and have that plug into BlueStore as an alternative to RocksDB. So ZetaScale is a B-tree-based implementation; it sort of generates I/O that is friendly to SSDs, whereas RocksDB is this log-structured merge tree that sort of lends itself better to spinning disks. So, assuming that works well, and their initial performance tests are very promising, the plan is that when you create an instance of the key/value database, it'll look and see:
B
is it an SSD or a hard disk? And based on that, it'll decide whether to use RocksDB or ZetaScale for its back-end database, and you just sort of get performance that maps to whatever the best choice is. There's also some implementation work to do, still, around making sure that when we have these compressed blobs and we overwrite data, we sort of prevent the metadata from getting too complicated; I sort of alluded to that earlier.
B
That's in progress right now. And then, what's coming after that: we want to add per-pool properties to RADOS, so that, as a policy, for a pool or an entire cluster, you would set options that say this particular pool should have this type of checksum, or should use compression, or should not be compressed, based on the type of data that you're storing there. That'll inform BlueStore to do whatever it needs to do, compress or not compress, that sort of thing. Lots more performance optimization is coming, and stabilization.
B
This native SMR support I mentioned is in progress, and there's also a patch set, most of which is actually merged, integrating SPDK, which is an Intel library for kernel bypass for NVMe devices. It puts the driver for the NVMe card all the way in user space and talks directly, sort of over memory-mapped PCI, to the device, to have very fast access to NVMe.
B
So most of that's there, but it's sort of awkward to use and test. But the goal is that, eventually, if you're doing NVMe cards and BlueStore, you'll have sort of a full user-space stack that's very efficient and fast for those devices. Where can you get BlueStore? So, it is still experimental. It should not be used with production data. It will lose your data, most likely; it's early days still. We do have an experimental implementation in Jewel, which was just released two months ago.
B
You have to enable this scary option, 'enable experimental unrecoverable data corrupting features = bluestore rocksdb', and we mean it: you will lose your data, most likely. It's stable enough to do benchmarking and, like, some basic testing, but the disk format has already changed, so if you then upgrade your cluster from Jewel, you won't be able to read the data.
B
So don't put anything important there. ceph-disk in Jewel has basic support for BlueStore, but it only supports the full-device mode, where everything is all on one disk, and it doesn't automatically set up multiple partitions to put the RocksDB WAL on a different device and so on; you have to do that kind of manually, and tediously, unfortunately. The Jewel implementation also predates checksums and compression; it doesn't do that stuff yet, but it does have sort of the overall I/O flow, so you'll get the ballpark of the performance that we expect to see.
B
If you pull the current master code in git, then we have the new disk format, and that does do checksums and inline compression, and it works. It's still changing, so again, if you write data and you then upgrade, you won't be able to read it back; probably, definitely I should say. But it's looking pretty good; again, it's pretty stable and functional. The goal for BlueStore is to have a stable version for Kraken.
B
So that's due to be released in October of this year: a version that you can deploy on a production cluster, that will have a stable disk format, and that will not eat your data. That's the goal. We've done most of the development as far as all the feature work; it's really about optimizing the on-disk data structures and then doing lots and lots of testing, and performance testing, and optimization, and so on. And so I'm feeling pretty good about being able to meet that goal.
B
But that's what we're aiming for. And then the secondary goal is that, by the next release after Kraken, which would be Luminous in the spring of '17, it will be the default backend for the OSD. So if you stand up a new cluster, or add OSDs to an existing cluster, it will use BlueStore by default instead of FileStore, and we can finally sort of deprecate all the legacy stuff.
B
So that's the plan. So, in summary: obviously, Ceph is great. Originally we built it as software-defined storage on top of POSIX file systems, but that was sort of a poor choice. It works, but it has huge performance disadvantages, and there's lots of complexity in working around things that don't do what we need them to do. So we built a new backend called BlueStore. We embedded RocksDB, which is great; RocksDB rocks.
B
It's easy to embed. BlueStore is cool: it does full data checksums and inline compression, and it's fast; it's roughly twice as fast as FileStore. That's what I have. Any questions about anything I've talked about? You can ask a question either verbally or post it in the chat. Let's see, I've got a couple of things.
B
How is cache reclaiming performed in the case of tight memory conditions on the OSD server side? So, there is a tunable that you set on the OSD that basically tells BlueStore how much memory it's allowed to use, and that's how much memory it uses. Usually you want to just set it at, you know, a few hundred megs, probably. I think the default is going to be set at 512 megs, but you can set it to whatever you want.
B
If you set it really small, data that's in the cache while it's still being written, that hasn't been committed to disk, is effectively pinned in memory, because the transaction might be doing a read of stuff it wrote previously, whatever. So it should behave okay if you set it to a really low number, but we haven't tested that yet. And yes, the kernel page cache is no longer used for BlueStore; it is in the Jewel version.
B
We do rely on the kernel in Jewel, but in the master version and going forward, we won't be using the kernel cache at all. On-disk fragmentation does happen, because we are a copy-on-write system. Right now there's no defragmentation, except sort of implicitly if you rewrite the data. It's unclear what we're going to do there. It might be that we don't do anything at all; that probably isn't the case.
B
It might be that, if the data is fragmented, we just let it stay fragmented until you read it, and then, if we read a bunch of data and we notice that it was fragmented, we just sort of, in the background, queue a rewrite to defragment it, sort of opportunistically, as you read fragmented data. That way there's no background read cost; there's only a background write cost. But we'll see. What algorithm is used for compression? We use snappy and zlib; those are the two that are currently plugged in there.
B
There are a couple of others out there that look interesting; there's something that looked cool, I forget exactly what it was called. It claims to be very fast and as small as zlib, but who knows. The nice thing about zlib is that it's, like, inflate/deflate, which have been around for a million years, and there are versions of it that are optimized with, like, special CPU support on Intel processors. Once that stuff is wired up, then hopefully it'll go super fast; it'll be basically free, which will be nice.
B
But we do let you configure the allocation unit and the size at which we allocate new space versus overwrite old space, so that's sort of the only knob you have right now. Yeah, we'll see. Let's see... oh, is there deduplication in BlueStore? There's no deduplication, and there are no plans for deduplication.
B
The expectation is that dedupe in Ceph is going to be implemented at the RADOS level, as part of the tiering infrastructure, because RADOS is sort of randomly spraying objects across OSDs. If you write the same data twice, it's usually not going to land on the same device, and so you're not going to be able to dedupe it. So any dedupe you do on a single OSD is going to have a very limited benefit; the plan is to have sort of an indirection layer.
B
B
Is there a tiering feature in BlueStore? Only sort of. BlueStore tiers across multiple devices just for metadata versus data, and RocksDB sort of does it with its metadata: the colder metadata goes on the slower device. BlueStore doesn't tier the data itself yet; we're unsure whether we're going to do that or not. You can do that in the block layer using dm-cache or flashcache or bcache, and those seem to work reasonably well.
B
So the real question is whether those are going to work well enough, or whether we think we can do better and it's worth the additional complexity to do tiering in BlueStore itself. In theory, we might be able to do better, because we know more about the workload and the data than you can sort of discern from the block layer, and so I think eventually we'll probably end up doing it, but initially it's not sort of on the table just yet. There's a question in the chat about memory requirements.
B
B
So I set the default amount of memory to use for the buffer cache at 500 megs; I just sort of picked that randomly out of a hat as a reasonable number per OSD, given what hardware people usually deploy today. But you can make it smaller or bigger. If it's smaller, then you'll have more cache misses and you'll do more I/O and it'll be slower; if it's bigger, then, well, you can sort of choose. For the OSD itself, independent of the object store, the memory utilization, you know, seems to fluctuate between,
B
you know, 150 megs to, like, three hundred megs or more, sort of depending on how many PGs you have and how much I/O you're doing and that sort of thing, and so the BlueStore overhead will be part of that. It's going to appear as though the OSDs are using more memory than they did before, because with FileStore, all of the cache is managed by the kernel.
B
It's not part of the OSD process, and so the OSD would appear to be, like, 200 megs, and then the kernel would have, you know, like 16 gigs of page cache memory that the OSD is taking advantage of. With BlueStore, it's all going to be part of the process memory, so the OSD process is going to look bigger and the kernel is not going to have any cache, basically; so it'll be a little different.
B
There's a question about EC pools: it will be compatible. The EC pools are going to set all the hints to tell BlueStore that the I/O is sequential and has large blocks and all that stuff, so that BlueStore will trigger everything that it can: it'll be more likely to compress the data, it'll be more likely to do larger checksum blocks, which will make the metadata smaller, all that good stuff. So it'll work, it'll work fine with EC. Is there BlueStore support for trim on SSDs? Currently, no, not yet.
B
That is still on the to-do list. It's a little tricky, because before SSDs pay attention to your trim, the trim needs to be big enough to sort of align with whatever the unit of trimming on the SSD is. So this will have to be supported by being wired into the allocator, the bitmap allocator, so that when it sees that an entire large chunk gets freed up, then it will issue the trim. And it needs to be easily turned on and off, because some SSDs just do bad things when you issue trim.
B
Like, trim is not efficient on them, and it, like, stalls the I/O pipeline, or sometimes even, like, corrupts things, so we have to be careful. But eventually, yes, it will be supported; it just isn't yet. I should note, though, that at the RBD level, trim is supported; that just sort of logically throws out that part of the object. At the object store layer, when you do a trim, you're just punching a hole in the object; it's just dropping the logical references to that data.
B
We don't have to deallocate it right away; we just drop the lextents, so trim, from BlueStore's perspective, a layer up, is sort of trivial. It just twiddles this metadata, which is kind of nice. Let's see, Steve Taylor asks: are there tests that might elicit the performance differences in RBD clone writes using BlueStore?
B
Yes, so this is going to be a big difference. Right now, if you run Ceph OSDs with FileStore and you snapshot a block device and then you do a write, that first write is going to trigger a literal four-megabyte copy on the OSD before the write happens, and with BlueStore that's going to be free, essentially; it's a copy-on-write.
B
It's just a metadata update. So we're trying to think of what a good way to trigger this would be. I mean, you could just take a snapshot of a block device and then go into that VM and then, like, write a small file and do an fsync, or overwrite an existing file.
B
Let's see: does the OSD-side cache support different strategies, write-back and write-through? The cache is always write-through, because everything the OSD ever writes has to be committed before it can reply to the client, and so a write-back policy on the OSD just doesn't make sense, because the OSD is a server, and you never want to say you wrote something when you didn't, because you'd lose data. So it's always write-through.