Ceph Ceph Days NYC 2023, 17 May 2023

From YouTube: Ceph Days NYC 2023: SQL on Ceph

Description

Presented by: Patrick Donnelly | IBM

Ceph was originally designed to fill a need for a distributed file system within scientific computing environments but has since grown to become a dominant *unified* software-defined distributed storage system. This talk will cover the new development of an SQLite Virtual File System (VFS) on top of Ceph's distributed object store (RADOS). I will show how SQL can now be run on Ceph for both its internal use and for new application storage requirements.

https://ceph.io/en/community/events/2023/ceph-days-nyc/

A

All right, hey folks, I'm Patrick Donnelly. I work primarily on CephFS for IBM, although that's not the topic of conversation today, but feel free to stop me outside if you want to chat about CephFS at all. Today I'm going to be talking about SQL on Ceph, which is a provocative title to get people in the room. I'm really going to be talking about SQLite on Ceph, which should also be exciting, but maybe it's not the distributed SQL you may be thinking of. We'll talk about that more later.

A

So here's a brief outline of my talk, but while people skim that, because I don't need to read it to you: who here has used RADOS? Okay. Who's actually used librados? Okay, just a few people. And who here has used SQLite? Okay, so lots of people. So we should all be pretty excited about that.

A

This slide is fairly famous, because it's in almost every Ceph developer talk. Very exciting. We've got our three pillars of Ceph. They all talk to librados; that's what they have in common. But did you know the Ceph manager also talks to librados? Very exciting. Yeah, so the Ceph manager does many operations on librados: anything that needs to be persisted. Some things are also stored in what we call

A

the monitor config store, which we use as a super reliable store that has to operate even when RADOS is inoperable, especially in the early bootstrap stages of a cluster when we don't even have an OSD yet. But for some things we can just persist directly to RADOS.

A

So let's talk about the Ceph manager design. Dan talked about it earlier. We now have this magical way to run Python in Ceph, as part of the daemon that manages cluster operations. We have these different modules, so everything's abstracted into a Ceph manager module. These modules take care of things like orchestrating daemons and orchestrating upgrades; cephadm is partially a Ceph manager module.

A

We have a dashboard now. We have a way to monitor the device health of our OSDs. These are just some of the several dozen manager modules we have, and you can enable or disable them as needed. Some are on by default and can't be disabled.

A

In general, the manager has no dependency on CephFS, RGW, or RBD, and that's a deliberate decision, because we want the Ceph manager to operate even if those services are not available.

A

That actually does talk to CephFS, but generally it's a standalone daemon. So what does it provide? It provides basically a glue for Python and Ceph. It lets the manager modules talk to things like the mon client and get access to the monitor maps and OSD maps, and some of the modules even need to use librados for persistent state. What's an example of such a module? We have the device health module.

A

So that's going to slurp up all the SMART data from all the OSDs and persist that out to RADOS, and that's basically all it does. It provides a CLI interface for other modules, namely disk prediction, to allow us to predict future failures of devices.

A

So it's a fairly simple module, but one of the things it was doing is that it was basically using librados to store all of this data. And as anyone here who has used librados knows, it's not really the most programmatically friendly library. I mean, it's a good library for doing distributed storage, but it forces you to think about rather difficult problems, things like consistency or how do I scale out.

A

These are not really things that you want to spend time doing in a language like Python. So this is where SQLite enters the picture. For those of us who have not used SQLite, it's an application library that allows you to store a SQLite database as a regular file, so a single blob of data. Especially if you go to their website, it's known for being one of the most used SQL database engines in the world. It's on everybody's phone; it's on almost every computer that's deployed in the world.

A

It has a very rich C API, allowing us to embed it and also extend it. Another highlight of SQLite, which was one of the reasons I was looking at this at all, is that it has an easy binding for Python. And of course, now we have Python in Ceph in the form of the Ceph manager.

A

So let's see if we can bring these two worlds together. The way we actually do that is through the SQLite VFS, which at the beginning only provided access to what's called the Unix VFS: access to the Unix file system, or any local file system.

A

So what does that do? It's just a part of SQLite that abstracts all the opens, creates, reads, writes, and syncs into a single library that can be swapped out for something else.

A

So if you don't provide a VFS, it defaults to Unix. So if you use SQLite on the command line, that's what it's going to be doing: putting a regular file on your system.
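To make the VFS selection concrete, here is a minimal Python sketch using the standard `sqlite3` module. It assumes a POSIX host, where SQLite's built-in VFS is registered under the name `unix`; the helper `open_with_vfs` and the file name `demo.db` are made up for illustration and are not from the talk.

```python
import os
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional


def open_with_vfs(path: str, vfs: Optional[str] = None) -> sqlite3.Connection:
    """Open a database, optionally naming the VFS in the URI (hypothetical helper)."""
    uri = Path(path).resolve().as_uri()  # e.g. file:///tmp/demo.db
    if vfs is not None:
        uri += f"?vfs={vfs}"  # without this parameter SQLite picks its default VFS
    return sqlite3.connect(uri, uri=True)


# Demo: write through the default VFS, read back through an explicitly named one.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")

    db = open_with_vfs(db_path)
    with db:
        db.execute("CREATE TABLE t (a INTEGER)")
        db.execute("INSERT INTO t VALUES (1)")
    db.close()

    vfs_name = "unix" if os.name == "posix" else None  # assumption: POSIX host
    db = open_with_vfs(db_path, vfs_name)
    rows = db.execute("SELECT a FROM t").fetchall()
    db.close()
```

The application code is the same in both cases; only the URI changes, which is the property the rest of the talk builds on.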

A

What does it actually look like? It actually ends up creating two files, the journal and the database. The journal is usually deleted by default, so you don't ever see it if you do an ls after interacting with your database, but there are actually two different files that we care about. So enter libcephsqlite. libcephsqlite is a VFS that allows you to stripe a SQLite database over RADOS. The real highlight of libcephsqlite is that it doesn't require any application modification in order to use it.

A

You just have to load the libcephsqlite dynamic library and specify an alternate URI for your database. So again, the journal and the database are automatically striped over the OSDs. You don't have to think about it at all. You just use your SQLite database like normal.

A

How this actually works is on a new library called SimpleRADOSStriper. This is just a simple interface that allows us to stripe a variable-size blob onto a number of objects. This is something we're already fairly good at doing with Ceph: CephFS files, RBD images, RGW objects. These are things that get striped over multiple objects.

A

Similarly, we're going to do the same thing for the libcephsqlite database. This is conceptually based off of a CERN library (where's Dan? Oh, Dan), libradosstriper, which was developed by CERN. Unfortunately, I couldn't use it because it does all reads and writes synchronously by acquiring a lock. I was very sad about that.

A

So, instead of ripping the rug out from under them by trying to modify it, I just wrote a very simple version of it that I called SimpleRADOSStriper, tailored for libcephsqlite, mostly to allow for very asynchronous writes.

A

So again, SimpleRADOSStriper provides all these primitives that we use (open, read, write, sync, close), and it stripes all the data over these objects. You can actually use the rados CLI client with the striper option to read and write these databases out of RADOS, so it's actually compatible with the libradosstriper that's already in Ceph.
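The striping idea can be sketched in a few lines of Python. This is only an illustration of mapping a byte range of one blob onto fixed-size objects, not the actual SimpleRADOSStriper code: the 4 MiB object size follows the figure mentioned later in the talk, and the object naming scheme in `object_name` is a hypothetical choice.

```python
from typing import Iterator, Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size (4 MiB)


def object_name(blob: str, index: int) -> str:
    """Hypothetical naming scheme: blob name plus a zero-padded hex index."""
    return f"{blob}.{index:016x}"


def extents(blob: str, offset: int, length: int,
            object_size: int = OBJECT_SIZE) -> Iterator[Tuple[str, int, int]]:
    """Yield (object name, offset inside the object, byte count) for a byte range."""
    end = offset + length
    while offset < end:
        index, within = divmod(offset, object_size)
        chunk = min(object_size - within, end - offset)  # stop at the object boundary
        yield object_name(blob, index), within, chunk
        offset += chunk
```

A read or write that crosses an object boundary simply becomes several per-object operations, for example `list(extents("a.db", OBJECT_SIZE - 10, 30))` yields one 10-byte extent in the first object and one 20-byte extent in the second.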

A

So if you want to use libcephsqlite, what do you have to do? Again, I told you, you don't have to modify your application. I wasn't lying. You do have to tell SQLite to load the VFS. That's done through a special sqlite3 command, .load.

A

You just specify the libcephsqlite library and away it goes; it loads it up. Then you open the URI associated with libcephsqlite, which involves specifying a pool ID, a RADOS namespace in the pool, which is optional, then the name of your database, and then the VFS, which would be ceph. That's all there is to it. Depending on your situation, there may be some environment variables you need to set, namely where the ceph.conf is or any additional arguments you want to specify.

A

If you want to use a particular credential in your keyring for libcephsqlite, you would use CEPH_ARGS for that. So the Ceph manager, as of Quincy, is now using libcephsqlite. You probably didn't notice, I hope, unless something crashed, and then I'm sorry.
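The load-then-open steps described above might look roughly like this from Python. Treat it as a sketch under assumptions: the URI layout `file:///<pool>:<namespace>/<dbname>?vfs=ceph` and the library file name `libcephsqlite.so` are my reading of the libcephsqlite documentation rather than something spelled out in the talk, the helper names are invented, and the environment values are placeholders. `open_on_ceph` needs a reachable cluster and a Python build that allows loading extensions, so only the URI helper can be tried locally.

```python
import os
import sqlite3


def ceph_sqlite_uri(pool: str, dbname: str, namespace: str = "") -> str:
    """Build a database URI for the ceph VFS (layout assumed, see lead-in)."""
    return f"file:///{pool}:{namespace}/{dbname}?vfs=ceph"


def open_on_ceph(pool: str, dbname: str, namespace: str = "",
                 library: str = "libcephsqlite.so") -> sqlite3.Connection:
    """Load the VFS library, then open the database through it (not run here)."""
    # Optional environment mentioned in the talk; both values are placeholders.
    os.environ.setdefault("CEPH_CONF", "/etc/ceph/ceph.conf")
    os.environ.setdefault("CEPH_ARGS", "--id sqlite-demo")

    # A throwaway connection is used only to load the library, which is
    # assumed to register the "ceph" VFS for the whole process.
    loader = sqlite3.connect(":memory:")
    loader.enable_load_extension(True)
    loader.load_extension(library)

    return sqlite3.connect(ceph_sqlite_uri(pool, dbname, namespace), uri=True)
```

With these assumptions, the database used later in the demo (pool `a`, namespace `b`, name `a.db`) would be addressed as `ceph_sqlite_uri("a", "a.db", "b")`.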

A

So the way this works is that the Ceph manager now has a uniform interface for interacting with a SQLite database that's associated with every manager module, so every manager module can just automatically use the SQLite database. It doesn't need to be tailored at all to do that. We also now have a .mgr pool, which some of you may have noticed. It was actually in the previous talk. You can thank me for that.

A

It used to be called the device health pool, but we just generalized it to be the .mgr pool so it can be used by all the manager modules. Hopefully, your upgrades went well when that rename took place. The actual access within the Ceph manager, which you can see on the right in that code snippet, is fairly simple.

A

You just specify your SQL statement, as you normally would when using SQLite within Python, and then you have to acquire a database lock in case you have multiple threads within a manager module accessing the database.

A

You have to synchronize on a lock which is provided to you, and then actually create a transaction within the database by specifying the database in the with statement, and then finally you just execute the SQL. Then I can do things like read out the data. This is a simple generator which is doing a select of b from foo where a equals question mark; a is provided to the function f, so we're reading all the b columns in that table.
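Since the slide itself is not part of the transcript, here is a hedged reconstruction of the access pattern just described: take the lock, open a transaction with the `with` statement, execute the SQL. `ModuleSketch`, its attribute names, and the `foo` table are stand-ins invented for this sketch; a real manager module would be handed its database and lock by the manager, whereas this version uses a local in-memory database so it runs without a cluster.

```python
import sqlite3
import threading
from typing import Iterator

SCHEMA = "CREATE TABLE foo (a INTEGER, b INTEGER)"  # illustrative table only


class ModuleSketch:
    """Stand-in for a manager module that owns a database handle and a lock."""

    def __init__(self) -> None:
        # check_same_thread=False because the lock below serializes access.
        self.db = sqlite3.connect(":memory:", check_same_thread=False)
        self.db_lock = threading.Lock()
        with self.db_lock, self.db:
            self.db.execute(SCHEMA)

    def put(self, a: int, b: int) -> None:
        # Lock first, then `with self.db` wraps the statement in a transaction.
        with self.db_lock, self.db:
            self.db.execute("INSERT INTO foo (a, b) VALUES (?, ?)", (a, b))

    def f(self, a: int) -> Iterator[int]:
        """Generator over column b for every row whose a matches."""
        sql = "SELECT b FROM foo WHERE a = ?"
        with self.db_lock, self.db:
            for (b,) in self.db.execute(sql, (a,)):
                yield b
```

One design consequence worth noting: because `f` is a generator, the lock stays held until the caller finishes iterating, so callers should consume it promptly, for example with `list(module.f(1))`.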

A

Let's see what else. So everything else is abstracted away. Where is the data? In all of these manager modules, the database that's associated with each module is just called main.db. There's also a main.db-journal file; as I said, SQLite has a journal file associated with each database. And then in the .mgr pool, there's a namespace named for each manager module, and then a number of objects for each given database blob, main.db or main.db-journal.

A

So that's what you'll find there if you did a rados ls there. The device health module: now the schema is fairly simple, but what I didn't talk about earlier, forgive me, is the old device health module. What it would do is create an object in RADOS named after the device that it was collecting health for, and then it would use the omap key-values to store a series of SMART data dumps in the omap.

A

So now this is all just transferred into a very simple schema for keeping track of the devices and device health metrics. So we've got a single table which gives us the device ID; it's a very simple table. And then the device health metrics, which gives us the timestamp of when the sample was taken,

A

the device ID associated with that sample, and then the raw SMART text. The primary key for that table is the device ID and the timestamp, so I can do complex queries to look up a series of SMART data dumps. And if I'm trimming the database as part of routine garbage collection,

A

I can do very simple delete statements as well. And this is an actual snippet of real code from the manager module. I just took out a few keywords for horizontal space, but otherwise this is basically exactly the code. If I'm putting a device's metrics into the device health metrics table, I am going to first create the device ID if it doesn't already exist, create an epoch associated with the timestamp, and then the third step, json.load: that's where I'm loading the data dump to make sure it's valid.

A

It should be; otherwise it's going to assert. And then I'm going to insert it into the table. It's really that simple.
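The schema and the insert path described above can be approximated as follows. This is a sketch written from the spoken description, not the module's actual source: table and column names, the use of `INSERT OR IGNORE` and `INSERT OR REPLACE`, and the helper names are all assumptions, and an in-memory database stands in for the module's database.

```python
import json
import sqlite3
import time
from typing import List, Optional, Tuple

# Assumed schema: one table of devices, one of samples keyed by (devid, time).
SCHEMA = """
CREATE TABLE Device (
    devid TEXT PRIMARY KEY
);
CREATE TABLE DeviceHealthMetrics (
    devid TEXT NOT NULL REFERENCES Device (devid),
    time INTEGER NOT NULL,
    raw_smart TEXT NOT NULL,
    PRIMARY KEY (devid, time)
);
"""


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")  # local stand-in for the module database
    db.executescript(SCHEMA)
    return db


def put_device_metrics(db: sqlite3.Connection, devid: str, raw_smart: str,
                       now: Optional[int] = None) -> None:
    with db:  # one transaction; rolled back if anything below raises
        db.execute("INSERT OR IGNORE INTO Device (devid) VALUES (?)", (devid,))
        epoch = int(time.time()) if now is None else now
        json.loads(raw_smart)  # raises ValueError if the dump is not valid JSON
        db.execute(
            "INSERT OR REPLACE INTO DeviceHealthMetrics (devid, time, raw_smart) "
            "VALUES (?, ?, ?)",
            (devid, epoch, raw_smart),
        )


def get_samples(db: sqlite3.Connection, devid: str,
                since: int = 0) -> List[Tuple[int, str]]:
    sql = ("SELECT time, raw_smart FROM DeviceHealthMetrics "
           "WHERE devid = ? AND time >= ? ORDER BY time")
    return db.execute(sql, (devid, since)).fetchall()


def trim(db: sqlite3.Connection, cutoff: int) -> int:
    """Garbage-collect samples older than cutoff; returns rows deleted."""
    with db:
        cur = db.execute("DELETE FROM DeviceHealthMetrics WHERE time < ?", (cutoff,))
    return cur.rowcount
```

Compared with the earlier omap layout, both the time-range lookup and the trimming step reduce to a single SQL statement each.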

A

So it looks like normal SQLite code in Python. I don't know why the GIF already started, but it did. So this is libcephsqlite in action. Wisely or not, I tried to do the tutorial and then record it. So this is a ceph status here. I'm just listing the pools; you can see the .mgr pool. And then I'm purging it, so I'm deleting everything in it, and I'm just verifying that

A

there's nothing in it by doing an ls with the all keyword, which looks at all the namespaces in the .mgr pool. There's nothing there.

A

It's just setting the stage for the next part. So here I'm going to actually run the sqlite3 command-line tool to put a database in Ceph. I'm doing this within a developer environment, so I'm setting a number of environment variables, namely where to find the libcephsqlite library. It's in my build directory.

A

I have to tell it where the ceph.conf file is, which keyring I'm going to use, and then which credential I want, which is the admin credential, which obviously, as a developer, I will tell you: don't use that in routine operation.

A

You can also specify some debugging flags in the CEPH_ARGS if you want to look at some of the debugging. So here I'm just loading the libcephsqlite library; that's the first command to sqlite3. And then I run another command to open the database file.

A

So I'm going to put it in the pool a, in namespace b, and the database name will be a.db. I'm creating a simple table with a single integer column, and I'm inserting one value into that table: one. So I'm just going to dump it and verify that the dump of the SQLite database is exactly what I would expect. Leaving SQLite now, I'm going to do a rados command to ls all the objects in that pool. You can see in namespace b the single database object, a.db.000.

A

And then finally, as I promised you, you can use the striper command to actually slurp the database out of RADOS if you want to, which I did, and you can just have a look at it. It's eight kilobytes, which is why it only took a single object in RADOS. And then I ran sqlite3 on that database, which is now local, and verified everything is as it should be.

A

So in the next one, we have basically the same thing here. Forgive me, it's copy-paste because it's just a lot of text, but this is basically a fancy way to create an infinite loop, limited by the number of rows, to insert a number of random integers into this table. And you can see that at this point the table is large enough that

A

we have a database that spans four objects, and as a user, as an application writer, I didn't have to think or care about the consistency of having several objects associated with my data set. That's the selling point. All right.
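The "infinite loop limited by the number of rows" is most likely a recursive common table expression with a LIMIT; that is an assumption, since the exact statement from the recording is not in the transcript. The sketch below runs the same idea against a local file and estimates how many fixed-size objects the file would span, again assuming 4 MiB objects. The names `fill_random` and `objects_needed` are invented for illustration.

```python
import os
import sqlite3
import tempfile

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size (4 MiB)


def fill_random(db: sqlite3.Connection, rows: int) -> None:
    """Insert `rows` random integers using a recursive CTE bounded by LIMIT."""
    with db:
        db.execute("CREATE TABLE IF NOT EXISTS t (a INTEGER)")
        db.execute(
            "WITH RECURSIVE c(x) AS ("
            "  SELECT 1 UNION ALL SELECT x + 1 FROM c LIMIT ?"
            ") INSERT INTO t (a) SELECT random() FROM c",
            (rows,),
        )


def objects_needed(size_bytes: int, object_size: int = OBJECT_SIZE) -> int:
    """Number of fixed-size objects a blob of this size would span (at least 1)."""
    return max(1, -(-size_bytes // object_size))  # ceiling division


# Demo against a local file; on Ceph the same SQL would run through the ceph VFS.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.db")
    db = sqlite3.connect(path)
    fill_random(db, 1000)
    row_count = db.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    db.close()
    span = objects_needed(os.path.getsize(path))
```

Raising the row count grows the file, and `objects_needed` shows when the database would cross from one object into several.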

A

So, unlike my FOSDEM version of this talk, I have a little bit of time to actually talk about some performance notes. A lot of this is already dealt with in the Ceph manager, but if you're thinking about using this yourself and you want to play with it, there are some caveats performance-wise to actually squeeze out the most performance you can with this library. It's all documented in the documentation, which is linked at the bottom. Hopefully, the slides will be available at the end of this half day.

A

If not, you can always email me to get a copy of the slides. But basically, there are a number of things you have to control in order to get the maximum performance. One is the page size. The default page size in SQLite is very small, which results in excessive reads and writes to the backing OSDs. I recommend raising that as high as possible.

A

I believe I use 64K in the manager, keeping in mind that a single object can be four megabytes, so you want to reduce the number of IOPS involved with your use of SQLite. Sorry, that screen is too small for me to read. You want to use a larger cache size to avoid reading from RADOS.

A

So there's no cache in SimpleRADOSStriper, deliberately; instead, use the SQLite cache, and the way you can get more cache is telling SQLite to use more cache. You want to persist the SQLite journal, that is, avoid deletion of the objects associated with the journal for the SQLite database, if you can. That would be why we would persist it: to avoid those unnecessary operations.

A

You want to use exclusive locking, if possible, of the SQLite database; that can reduce your transaction latency from five to three operations per transaction. Normally, when I enter a transaction in SQLite, I have to lock the database, then do some reads and writes, and then finally unlock the database. If I enter an exclusive mode of operation, SQLite locks the database once at the start and unlocks it at the end, when I close the database, so

B

That can save you a lot of ops

A

whenever you're doing a transaction. And then another optimization you can do is use the WAL journaling in SQLite, which normally requires shared memory communication between clients of the database. But if you're using the exclusive locking mode, then you don't need to use shared memory; SQLite will allow you to use a WAL journal without that. Obviously, if I'm using a SQLite database in RADOS, there's not going to be any shared memory communication between the clients, which could be anywhere.

A

So if I do that, then you get the most optimal behavior of one operation per transaction, because I've got the database open exclusively and I'm just writing to a WAL journal in the general case. So that's going to get you two to five millisecond latency, depending, of course, on your cluster. If it's Dan's cluster, it could be up to 155 milliseconds.
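The tuning knobs from this part of the talk map onto ordinary SQLite PRAGMAs, so they can be sketched with the standard `sqlite3` module. The 64K page size comes from the talk; the cache size value is only an example, and the helper name `tune` is invented. The demo runs on a local temporary file, because an in-memory database would not exercise the journal mode.

```python
import os
import sqlite3
import tempfile


def tune(db: sqlite3.Connection, wal: bool = False) -> str:
    """Apply the settings discussed in the talk; returns the resulting journal mode."""
    for pragma in (
        "page_size = 65536",        # must be set before the database is first written
        "cache_size = 4096",        # in pages; example value, not from the talk
        "locking_mode = EXCLUSIVE",  # lock once, keep it until the connection closes
    ):
        db.execute(f"PRAGMA {pragma}").fetchall()
    # PERSIST keeps the journal instead of deleting it after each transaction;
    # WAL is the alternative that exclusive locking makes possible without shared memory.
    mode = "WAL" if wal else "PERSIST"
    return db.execute(f"PRAGMA journal_mode = {mode}").fetchone()[0]


with tempfile.TemporaryDirectory() as tmp:
    db = sqlite3.connect(os.path.join(tmp, "tuned.db"))
    journal_mode = tune(db)
    with db:
        db.execute("CREATE TABLE t (a INTEGER)")
    page_size = db.execute("PRAGMA page_size").fetchone()[0]
    db.close()
```

Setting the locking mode before choosing the journal mode matters for the WAL variant, since SQLite only skips shared memory when exclusive locking is already in effect at the first WAL access.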

A

Reads are synchronous; that's really unfortunate. That's just baked into the design of SQLite's VFS. I've complained about it. While it's not on this slide, there's a link on the next slide. There's not really anything I can do about it. If SQLite does a read through the VFS, it's a synchronous read out to RADOS. Ideally, we'd be able to do asynchronous reads and then SQLite would do a gathering at the end.

A

I think I know what you're talking about, but it's not the right thing, not for reads. But we do do asynchronous writes out to RADOS. Yeah, so the retrospective on the SQLite VFS. The link at the end of this slide is where I ranted to the SQLite developers about the VFS.

A

I mostly had positive things to say, but there were a few gotchas that I found along the way. Right now we're using it in the snap schedule and device health modules.

A

It's limited use so far in the other modules. Hopefully we're going to make improvements to that, maybe by publicizing it more. For example, it would be nice to use in telemetry to keep track of reports that a cluster has issued in the past.

A

We did have a few bugs. Most of them were related to packaging. There was one GCC static compiler error. I added a destructor to the library to avoid a bug we found in unit tests. And then there is one outstanding bug which has tripped up a number of clusters.

A

It's not really debilitating, but if we lose the lock on the database, perhaps because there was a short network partition and so the library was not able to renew the lock in the background, then it automatically blocklists itself to protect the integrity of the database, which means that whatever the module was doing, in this case device health, it can't continue doing, and the logic wasn't yet built into it

A

to try to reconnect. So I'm actually working on a fix for that, which I'm hoping to release soon, so that the libcephsqlite library, at least within the Ceph manager, can reconnect to RADOS, and then the module can pick up where it left off.

A

So, as far as future work: right now, the library does everything under an exclusive lock on the database within RADOS, so there's only ever one reader, one writer. That's not an architectural limitation; it's just something I need to address by adding support for multiple readers. And then read-ahead performance, which is related to the limitation in the SQLite VFS

A

only doing synchronous reads. The only way to approach that problem from our end is to add read-ahead, and that's going to require exploring how read-ahead works within the Unix context, to see if I can reproduce the right read-ahead behavior, which will give us somewhat similar performance in that regard. We do have a libcephfs SQLite library, which was developed a year ago by a GSoC student.

A

It does require a little bit of cleanup. It hasn't been merged yet, but that's another library binding we would like to add. The main reason you'd want to use that over, say, putting SQLite on CephFS mounted normally is that then you don't have to actually mount CephFS. It's just SQLite on libcephfs on Ceph. All right, I'm very confused why the slides are here.

A

Well, someone edited my slide deck. Sorry, those were appendix slides. There's my thank you and my contact information. There's a blog post concerning libcephsqlite, if you'd rather see this talk again in text form, and the documentation, which also talks about all the performance notes I mentioned earlier. Any questions?

A

B

A

A back end on top of SQLite on Ceph? Sorry, I'm mostly joking. There's a SQLite back end for RADOS Gateway now, and so you could, you know, wire the two together in an unfortunate way. But I feel like Matt was behind this. Matt, is this your [inaudible]?

A

There's no omap used by this; it's all just blob storage.

B

A

And there's no chance of having objects that are too big either, because the objects are limited in size by the striper.

B

Yep. If I'm not mistaken, SQLite is pretty scalable in terms of how large a database can be. What's the largest you've tested it with?

A

So I believe the architectural limit in SQLite is 480 terabytes. I've gotten up to a few terabytes before I, you know, stopped. I would not advise getting databases that size, because when you actually want to read out of the database you're going to start hitting performance issues, namely because of the lack of asynchronous reads. So I would by default recommend, at least at this time,

A

databases on the smaller side, and maybe more databases rather than one large database. Once you start wanting to go larger, I would start looking at things like Postgres, which may someday end up on Ceph.

A

Any other questions?

A

Cool, thank you for your time.