From YouTube: Ceph Performance Meeting 2020-09-02
A
All right, I want to get started on PRs here, and then we'll move on to discussing the BlueStore onode. So I saw two new PRs this week. One was from Josh, to speed up caps and cap updates in the monitor. Josh, I saw that Joao had reviewed it; it doesn't look like we changed default behavior, right?
B
That's right, yeah; the wording is confusing, but the default behavior doesn't change. It's only a niche use case, so we just have an option to avoid parsing in that case.
A
So I'm not super familiar with this PR, but are we primarily limited by all the object creation and deletion using that Boost framework, or what? What exactly are we seeing now?
B
Well, right now this option just skips the parsing entirely, when it's enabled, for the OSD capabilities. I'm not sure if it's the object creation necessarily that we're limited by, but that's my intuition; a bunch of different calls within Boost Spirit showed up in perf. But without the parsing we seem to be limited more by RocksDB, perhaps, or some kind of locking in the monitor. I looked at the monitor with your script to analyze the logs, Mark.
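For illustration, one general way to dodge repeated Boost.Spirit work is to memoize parses keyed by the caps string; this is a hypothetical sketch, not what the PR does (the PR simply adds an option to skip parsing outright):

```cpp
// Hypothetical sketch, not Ceph code: cache parsed caps by their source
// string so identical cap updates skip the expensive grammar entirely.
#include <string>
#include <unordered_map>

struct ParsedCaps { /* grammar output would live here */ };

// Stand-in for the expensive Boost.Spirit parse.
ParsedCaps parse_caps_slow(const std::string& /*text*/) { return ParsedCaps{}; }

// Single-threaded sketch; a real monitor would need locking around `cache`.
const ParsedCaps& parse_caps_cached(const std::string& text) {
  static std::unordered_map<std::string, ParsedCaps> cache;
  auto [it, inserted] = cache.try_emplace(text);
  if (inserted) it->second = parse_caps_slow(text);
  return it->second;
}
```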
B
I wonder if there's some kind of tuning we're running into there that we could change; I think it's, like, the write-blocking behavior or something.
A
How often were we seeing... sorry, I don't remember: on the monitors, when we ran that script for looking at the RocksDB compaction statistics, were we hitting compaction very often, or was it kind of intermediate, sorry, intermittent?
B
A lot of... well, go ahead. Yeah, I guess we're pushing a surprising amount. I mean, the database itself is pretty small, on the order of 30 megabytes or something, but I guess those levels are initially small enough that we're still running into a lot of compaction.
A
It could be interesting just to try an experiment where you create much larger buffers in RocksDB and the write-ahead log, the same way that we do in BlueStore, and see if that reduces the amount of write amplification coming in.
A
That's why we have such huge ones in BlueStore; it's kind of ridiculous how big they are, but it seems to be how we get around having, you know, lots and lots of keys end up being put into L0 that are probably then deleted really soon afterward.
A
Maybe we're doing something similar with the mon.
A
Okay, interesting. Okay, let's see what's next: change the default value of osd_async_recovery_min_cost. I don't even remember that option. So yeah, Josh, do you know how that affects things? I don't remember it.
C
Yeah, that's basically the threshold at which async recovery kicks in, and the PR proposes that we reduce the threshold so that we can do more async recovery. This is based off of some internal testing that was done by some of the folks at Red Hat, and they saw that with the default value of 100...
C
They saw a lot of async recovery happening, especially for RGW workloads; a lot of async recovery happening on replicated pools, but not as much, almost zero, on EC pools. Now, it's totally possible that the workload they're using, or, you know, the setup they have, is not able to induce async recovery...
C
The cost is not able to cross the threshold of 100. But their proposal is that, since there is no harm in having more async recovery, we should just reduce the threshold. There's been some discussion on the PR; I think Xie earlier proposed, I mean, he said, that they're using a min cost of one in their cluster, and that's been the case because they want more and more async recovery happening.
C
But I do agree with his latest comment that there is a caveat: during upgrades, or not just upgrades, when you have mixed clusters running different versions with different min costs, it is possible that choose_acting does go into a loop, where we might just not want to go there.
C
We've seen issues like that in the past, so we should only be changing the cost when we have only one version running in the cluster. So I have to think about that a little more, and then we can see how we can implement it; but in general it's a good idea to reduce the threshold.
A
Right, let's see; those are both of the two new ones that I saw. We had one closed this week from Majianpeng that was enabling RocksDB pipelined writes. I still think that's actually not necessarily a bad idea in general, but it turns out that when Majianpeng did more testing, it looked like it was not really having much effect; it wasn't actually affecting a whole lot of I/Os, so doing that was not as impactful as he thought it would be, I guess.
A
Anyway, he closed it; maybe someday we'll revisit it, but for now it's, you know, fine. We have plenty of other things to worry about. So let's see, updated PRs: Majianpeng also has another PR for reducing bufferlist rebuilds. Radek reviewed that first, and I think he was maybe even looking at a more general solution for it, but he approved the PR, and Igor also approved it.
A
It did go through a round of testing, and it looks like there were some failures, so it might need some fixes; it's back in Majianpeng's hands now to see what's wrong. Also, optimizing the lock in the BlueStore writing process: oh, this is the one that no one wants to touch, but Kefu actually said in the PR that he was going to.
A
So my hat is off to him for taking the time to look at it; hopefully Kefu will be able to make a determination of whether or not we really want to take this on. Beyond that, there's lots of stuff in the movement category. The ones that keep coming back, that we need to look at, are for reducing onode memory usage: both Igor's PR, potentially, and then also the thing I wrote a while back, almost a year ago now, to separate out the block cache into multiple block caches per column family, just to allow us to handle the block cache for onodes and the block cache for OMAP separately, and also extents and all the other BlueStore metadata.
A
So those still need to be worked on, maybe after I'm done with this ML work, if I can get back to it. That's it as far as I saw. Anything I missed, guys?
A
All right. Well then, Adam or Gabi, would either of you like to talk about the things that you've been looking at in terms of onode refactoring? Adam, I know you had a document that you had sent out earlier; maybe we talk about that first, and then, Gabi, you can talk about your stuff.
D
I just constructed a proposal for how we can do that, and it's just what it is. I mean, I'm not even proposing that we definitely should do it, but I'd like to put it out there, at least as a point, to show that this point in the space of possible solutions exists. There is a cost: the metadata will be much larger. There is a benefit: right after you get access to your metadata, either by getting it from RocksDB, or reading it from disk, or whatever, you can access it as any other data structure.
A
So, you know, at different times in the past we've talked about things like completely ditching encoding and just writing the bufferlist directly into RocksDB, or, potentially, trimming things like varint encoding and anything that takes a lot of time, I guess. Or, you know, maybe a more exotic thing might be storing an encoded form of the data in memory, and then only doing decoding when we actually need to access something.
A
I guess the thing that I have seen keep coming back, over and over again, is that we're being asked to reduce CPU usage.
A
People want to be able to run OSDs with one or two cores and have the rest of the system available for hyper-converged processes, other things, you know, user processes, whatever. Right now, with NVMe drives, a BlueStore-backed OSD can easily take between 10 and 14 cores, from what I've seen, when faced with a very, very heavy random write workload.
E
Can I jump in? (Yes, absolutely.) Yeah, so I came from a different kind of system which was heavily focused on performance. Unlike Ceph, where the algorithm allows you to grow, that system could not grow beyond 16 nodes; 16 was the absolute maximum you could use. So we were always forced to squeeze as much performance as possible from each node, because you could never say, you know what, take another 100 nodes; each one of them was extremely expensive.
E
Every time you add another node, it's hundreds of thousands of dollars; not very common. One thing I noticed again and again when you look at the code, and it's very different from the way I'm used to, is that everything here is strings. I'm used to seeing stuff in binary format.
E
So I think an easy target would be to move away from strings and concentrate on binary format wherever possible. One example I give is the onode attribute format: in every onode we have a map of attributes, which is strings. The way I'm used to seeing things like this is: if the attributes are coming from a predefined set of attributes, then you enumerate them, and instead of writing the attribute name you just have an index saying it's that attribute. The reader code shouldn't care about the name; we should just use a binary compare, a binary operation. The strings are only for the users, and if you need them for debug purposes, then there's always a debug function which can access that, right?
E
That's the easy case. If your strings are coming from a much bigger pool and you don't know the definitions, then you can create a translation table on the fly, and I suggested doing this one per thread: every time you see the client coming in and passing in a string...
E
You check if you have it in your hash table. If you do, then you immediately translate it to a binary index, and that's what you see inside the OSD; the OSD should never see the string afterwards. If you don't have it, then you create a new entry, get back an index, and you start using that. So that way you get a few things: I mean, you shrink the size of your data, the compares and searches are cheaper, and you don't do dynamic allocation.
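A minimal sketch of that interning scheme, with assumed names (nothing here is existing Ceph code); Gabi's version would be per-thread, which is why there is no locking:

```cpp
// Hypothetical sketch, not Ceph code: translate attribute-name strings to
// 2-byte indices once, at the message boundary; everything past this point
// compares indices instead of strings. One instance per thread, no locks.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class AttrInterner {
  std::unordered_map<std::string, uint16_t> to_idx_;
  std::vector<std::string> names_;   // kept only for debug printing
 public:
  uint16_t intern(const std::string& name) {
    auto it = to_idx_.find(name);
    if (it != to_idx_.end()) return it->second;   // seen before: reuse index
    uint16_t idx = static_cast<uint16_t>(names_.size());
    names_.push_back(name);                       // new entry on first sight
    to_idx_.emplace(name, idx);
    return idx;
  }
  // "Strings are only for the users": debug-side reverse lookup.
  const std::string& debug_name(uint16_t idx) const { return names_[idx]; }
};
```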
E
Okay, so for example, in the onode, all the attributes are using a hash table. I'm not very familiar with the way C++ is doing hash tables; I know many years ago people suggested that when a hash table has very few entries, it should just be using an array, because hash tables are efficient when you have hundreds of attributes, but if you've got just a few of them, then you should use an array. I don't know if this is what C++ is doing, but if it's not the case, I would suggest just creating an array.
E
I mean, how many attributes are you going to have? You're going to have 10, 20 of them? It's probably cheaper just to scan the array looking for yours than running the hash table, and you save the hash function in performance. The hash function also tends to have this ugly property that the memory location is unpredictable.
E
Because of the way we like to randomize the placement, you tend to miss the CPU cache and go out to DRAM, which is very expensive. And say the index is a two-byte index and you have 32 entries in your map: you could easily create an eight-byte value with four copies of the index and then run the compare in eight bytes. So you could scan 32 entries very quickly; you're just doing eight compares, and then inside you do the internal compares.
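A sketch of that packed scan (illustrative only; the function name and the 2-byte interned indices are assumptions carried over from the sketch above, not Ceph code):

```cpp
// Hypothetical sketch: scan a small attribute table of 2-byte indices by
// comparing eight bytes (four indices) at a time, as described above.
#include <cstdint>
#include <cstring>

// Find position of `needle` in `idx[0..n)`; n must be padded to a multiple
// of 4 with 0xFFFF sentinels. Returns -1 if absent.
int find_attr(const uint16_t* idx, int n, uint16_t needle) {
  uint64_t probe;
  uint16_t four[4] = {needle, needle, needle, needle};
  std::memcpy(&probe, four, sizeof probe);
  for (int i = 0; i < n; i += 4) {
    uint64_t block;
    std::memcpy(&block, idx + i, sizeof block);
    // XOR makes a matching 16-bit lane zero; the classic "has-zero-lane"
    // bit trick then flags the block cheaply (it may over-trigger across
    // lane borrows, hence the scalar verify loop below).
    uint64_t x = block ^ probe;
    if (((x - 0x0001000100010001ULL) & ~x & 0x8000800080008000ULL) != 0) {
      for (int j = 0; j < 4; ++j)
        if (idx[i + j] == needle) return i + j;
    }
  }
  return -1;
}
```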
E
So, I don't know, maybe 12 compares at most, and that's a big one. So try to use more of that. Maps are easy; they're the most natural thing to use and we love them, but performance-wise they tend to take more memory, because you need some space: you keep the key, you keep the data, you do dynamic allocation in small sizes, so there's always some overhead for them.
E
So those are the first things coming to my mind. Then the other one, for the onode: I was suggesting that we identify the portion of the onode that is fixed. We know all onodes have a fixed form, a fixed part, and then there's a dynamic part which can grow and shrink, so we could allocate onodes from a pool of that fixed size.
E
So all of them are just going to have the same amount of space, and then the last entity is a pointer to a dynamically allocated area, which you could allocate using a buddy system. Then you allocate the part which is changing, but the part which is not changing you don't have to copy again and again, because that you can do in place. So if you change one byte, you don't have to copy the full entity. And another thing to do here:
E
If all your data structures are simple in-memory data structures, you don't need encode and decode; you just map the thing into binary format and dump it as-is. For the last thing, the pointer, you need to do something more interesting, but the first part, which is constant, you just write as-is; you don't need any formatting.
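A sketch of that fixed-plus-dynamic split with a memcpy-style "encode" (hypothetical field names; this is not the real bluestore_onode_t, and raw dumps like this are not endian- or ABI-portable, which is part of the trade-off being discussed):

```cpp
// Hypothetical layout, not Ceph code: a trivially copyable fixed part that
// can be dumped to disk as-is, plus one pointer to the grow/shrink part.
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

struct OnodeFixed {            // "encode" == memcpy for this part
  uint64_t size = 0;
  uint64_t flags = 0;
  uint16_t attr_idx[16] = {};  // interned attribute indices (see above)
  uint32_t var_len = 0;        // bytes of variable tail that follow on disk
};
static_assert(std::is_trivially_copyable_v<OnodeFixed>);

struct Onode {
  OnodeFixed fixed;            // every onode: same size, pool-allocatable
  std::vector<uint8_t> var;    // dynamic part (extents etc.), buddy-allocated
};

void encode(const Onode& o, std::vector<uint8_t>& out) {
  out.resize(sizeof(OnodeFixed) + o.var.size());
  std::memcpy(out.data(), &o.fixed, sizeof(OnodeFixed));
  if (!o.var.empty())          // only the tail needs real serialization work
    std::memcpy(out.data() + sizeof(OnodeFixed), o.var.data(), o.var.size());
}
```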
A
So, Gabi, this is a little old at this point, but it's the one that I was able to quickly find: I pasted a wall-clock profile of the OSD under a 4K random write workload.
A
This is kind of a standard case where we really stress certain parts of the OSD, specifically the kv_sync thread, and it's not necessarily definitive about, you know, where bottlenecks creep up, but so far our profiler seems to have done a fairly good job of pointing out areas where we can potentially make improvements.
A
So, I don't know... this is basically time spent in different parts of the code under a 4K random write workload, you know, of a specific OSD. So, like, if you go to line 415...
E
A thousand samples?

A
Yes. So you can see in here that, for instance, we're spending a fair amount of wall-clock time in the inline skip list in RocksDB, doing key comparisons.
A
So in this case, this is because we have such large buffers in RocksDB, where they're filling up with so many keys, that it takes a lot of time for RocksDB to actually walk through and do comparisons, figuring out where keys should live for ordering purposes.
E
Mark, sorry, a question: is this code trying to sort all the key data, so it can be staged sorted? Is this trying to create a sorted table from what we have in memory, before we write it to the disk?
A
So, yes; see how, earlier on, this is during, you know, this column family insertion, and we see a memtable add called on line 430...
A
So that's basically where RocksDB is iterating through all of these, doing key comparisons to do that add into the memtable, because everything is sorted.
E
I don't know who builds the classic RDBMSs now, but OLTP systems are used to writing small entities, and they don't need to sort them. I mean, we don't care about sorting; we don't need all the metadata around all these things. The thing about key-value and LSM entries...
E
They are great if you have big objects and those objects are not changing very often. But onodes are extremely small and they keep changing, so you keep creating more and more versions, and when you read them you have to visit that many levels until you reconstruct them; and it's a tiny object.
E
They're all very small; they take some changes, and then you stage them. Did we try comparing RocksDB to MySQL, just for the onode? Not for the object itself, but for the onode state.
A
So Sage isn't here to really give what he was thinking during this, but I can give you my perspective from when this was all being written. I think that Sage, at the time BlueStore was being written, really wanted to be able to get all of this data in one transaction going into the write-ahead log in RocksDB, especially for the possibility of being able to do deferred writes.
A
So, you know, back then there were SSDs around, but hard drives were still really popular, right? So we were trying to figure out a way to support both use cases, and for hard drives, having the advantage of being able to do a single transaction, with both the onode metadata and, potentially, if it's a small I/O, also a small amount of data, written into the write-ahead log, was really...
E
Attractive. But I think this is what we're trying to do with SeaStore, right? I mean, that makes sense to me: you just stage the metadata with the data. That's perfect, it made sense. But separating the entities on every entry, I'm not sure it's the best fit.
A
I'm not sure that RocksDB, in the long run, really is a great fit either for a lot of what we're trying to do. But having said that, you know, with Crimson we're changing that direction a lot, right? Yes, I think the question right now with BlueStore is: given what we have, without completely changing the RocksDB backend... because we've actually looked at a couple of things; we did try replacing it with LMDB.
A
I actually wrote an updated version of an LMDB backend that someone else had made a while back, and it was slower; we couldn't make it as fast as RocksDB. It probably was not fully optimal; there are probably things we could have done to improve it. But the early prototypes and early tests didn't really show a significant improvement with an easy implementation.
A
So maybe we could have done better if we had kept on working on it, but it certainly was not a dramatic improvement with just a basic implementation. MySQL or PostgreSQL or something else: you know, we could try something like that. I think, though, if we were going to go down that road, I'd advocate we just write our own thing; specifically, that we write our own write-ahead log, and whatever we have behind it is, you know, its own thing, but...
A
Well, this is something I haven't followed recently, so yes, probably; but I haven't talked to Sam in a while, so you may know more than I do at this point about what Sam is working on with it.
E
So that's SeaStore: it's writing a log-structured file system, so everything is flowing there; the metadata and the data are going there. So you could still do the sequential write, or move forward with zoned namespaces. But at least I hope that we're not going to do a separate log for the onode.
A
So there was a prototype from Intel a little while back, from Lisa, where she was trying to take specifically the PG log and just write it out to BlueFS directly, without going through RocksDB; so not exactly what you're doing, Gabi, but... Unfortunately, it didn't handle dynamic sizing; it just allocated a very large block and then, hopefully, the PG log entry fit inside it, and that was as far as that one went.
A
She didn't think that she saw a significant improvement with it, so it kind of just ended up being abandoned after that. But I'm not convinced that we really gave it a fair shake, or at least gave the idea of trying to change how we write PG log entries a fair shake. That's why I'm very excited about what you're working on, because I think there's still potential and possibility there to really improve...
A
...what's going on. Whether or not that translates into something even bigger, in terms of looking at how BlueStore stores data and how we do the write-ahead log and how we actually write transactions out, I don't know; maybe it's too much work at this point. But certainly this is, I think, a rich area for performance improvement, from the traces and wall-clock profiling that I've done.
A
But that's why I posted that profile: it might give you kind of an idea of at least where we're spending wall-clock...
E
Time. And actually, sorry, one thing, coming back to something I said in the beginning: a big part of the workload there is the sorting, right? I mean, before you create a run, before it's staged by RocksDB onto the LSM, you must sort everything. So if you shrink the keys, the sorting is going to be faster.
E
I mean, it's still the same algorithm, but at least the constant could shrink. If we're talking about RGW, for example, a key could be hundreds of bytes of string, and comparing hundreds of string bytes is expensive; but if you make it a binary format, if you compress it on the client side, you will never see the full key. Maybe a 512-byte key could shrink to a 64-bit binary compressed format; then comparing them could be much faster, and you could also use...
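To make the comparison point concrete, the standard trick is fixed-width big-endian encoding, so memcmp order matches numeric order and one bounded compare replaces a long string compare; how a long RGW name would be compressed down to such an id is the hard part and isn't shown (illustrative only, not Ceph code):

```cpp
// Illustrative only: fixed-width big-endian keys compare correctly with
// memcmp, which is essentially what an LSM comparator ends up calling.
#include <cstdint>
#include <cstring>

// Encode a 64-bit id so that memcmp() on the 8 bytes sorts numerically.
void encode_be64(uint64_t v, unsigned char out[8]) {
  for (int i = 7; i >= 0; --i) { out[i] = v & 0xff; v >>= 8; }
}

int compare_keys(const unsigned char a[8], const unsigned char b[8]) {
  return std::memcmp(a, b, 8);  // one 8-byte compare vs. hundreds of chars
}
```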
A
Yes. I also think what you're doing with reusing keys is going to be very important, because I believe that if we do that, so we no longer have tombstones, we can shrink the buffer size for the memtables, so that we're doing compactions more often with smaller amounts of data, and I think that's also going to help.
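A sketch of the key-reuse idea using stock RocksDB calls (the key format and the 3,000-slot window are made up for illustration; this is not the actual work being discussed): overwrite a fixed ring of key slots so old entries are superseded by newer Puts instead of Deletes, and no tombstones accumulate.

```cpp
// Hypothetical sketch: recycle PG-log key slots instead of put+delete.
// A real version must also store `seq` in the value so replay can tell
// live entries from stale ones.
#include <rocksdb/db.h>
#include <cstdint>
#include <cstdio>
#include <string>

void append_pglog(rocksdb::DB* db, uint64_t seq, const std::string& entry,
                  uint64_t window = 3000) {
  char key[32];
  int n = std::snprintf(key, sizeof key, "pglog.%016llx",
                        static_cast<unsigned long long>(seq % window));
  // Overwriting the slot supersedes the old value with a newer Put;
  // compaction never sees a Delete tombstone for these keys.
  db->Put(rocksdb::WriteOptions(), rocksdb::Slice(key, n), entry);
}
```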
A
So all this stuff is kind of tied together, right? Like, onodes and how big they are and whether or not they're encoded; the PG log and how many PG log entries we have coming in, and how often they're tombstoned and when that happens; and how we have transactions hitting the write-ahead log. All this stuff is interrelated in really complicated ways, but I think, you know, with all this work that you're doing, and the work other people are doing...
A
If we can start experimenting and finding out what things affect what in different ways, I think it will help us learn what's useful and what's not, and what we should be doing.
B
I think the idea, like Gabi was talking about, of potentially having onodes outside of RocksDB is pretty interesting, because it's not just the CPU overhead there; it's also the amplification, meaning the NVMe write amplification.
B
It can just matter a lot on NVMe and hard disk, and when we have these tremendous numbers of objects, like for RGW, where they have billions and billions of objects, that's tons and tons of onodes, which don't necessarily need to be all sorted together like that, when RGW is maintaining its own index, and for the OSD's purposes listing onodes doesn't necessarily need to be a low-latency operation.
B
It might be worthwhile to try to figure out if there's some kind of perf test we could do with onodes outside of RocksDB, and see what effect that has.
E
If we're using zonefs, my understanding is that as long as you keep writing forward, writes tend to be fairly quick, even on QLC drives. So just use a simple log and flush the data forward; always move forward, don't try to optimize access. I mean, the reason the LSM is doing the sorting is because they need to build the memtable and do the sorting immediately, because you need to merge levels.
E
But if you keep writing forward, and it's just a write-ahead log where, every time you've written something out safely, you can put a checkpoint and discard, and it can be cyclic, then you could just do something simple. But again, it's not going to work with spinning drives; as you said, with spinning drives you're going to kill them doing this. But zoned namespaces might be a friendly... a friend, yeah, if you have...
A
A dedicated zone that's just for this. I think the hope with RocksDB and the BlueStore write path was that the overhead for this wouldn't be so bad on flash, that, you know, we could still do a reasonable job with it; but you're absolutely right. You know, the BlueStore design is still kind of trying to have the best of all worlds, right? Like, let hard drives still be fast, but make NVMe faster than it was in the FileStore days; and it is, right?
E
Because there is nothing the same... all the complications come from retrieving these objects, but if we're using a write-ahead log, you're not going to retrieve them; in most cases you're never going to read them again. You just write them, and, you know, once you've destaged your data elsewhere, you're going to forget them. You don't need to build fancy indexes for them; you don't need all kinds of sorting, because in the failure-scenario case, okay, you'd go back and you would read them one by one, and nobody cares if it's going to take minutes.
E
Maybe that's okay! Nobody is going to say, now go fetch me what you have on disk; you don't need it, you have it in memory. So we just need a write-ahead log; we don't need an LSM. And there might even be write-ahead log implementations we could borrow.
B
In the past, like for FileStore, we wrote our own write-ahead log, essentially, for the same reasons. I don't think that part is necessarily that complicated.
B
Yeah, yeah. But I guess I do want to point out that, with BlueStore, I'm not sure we should be thinking so much about, like, zoned namespaces or super-fast devices.
B
That's kind of where SeaStore is going to be coming in, in, say, a year or two, once it comes out stable, and that's going to be fully optimized from the ground up for that use case. So I'm not sure it makes so much sense to try to shift BlueStore in that direction, so much as to improve its efficiency for, like, slower SSDs and hard disks.
E
Right, for SeaStore. And so, I mean, there's a tricky thing without zoned namespaces: if you're writing short writes one after another, you're going to create crazy write amplification. Zoned namespaces, sorry, the design there, was built to support this kind of write log, because nobody is trying to collect things; nobody's trying to do anything.
E
The only thing I've tried: many years ago, like 10 years ago, when these were still young, we tried to do the translation layer, tried to bypass the translation, so you'd be able to do that. So I don't know if this is a valid option: can you say, you know what, if we have the firmware...
E
Maybe you can disable the flash translation layer, and then you could do the write-ahead log, and there's not going to be any write amplification, because nobody's going to do garbage collection. The problem for us is, if we do small writes one after another, the garbage collection is going to kick in and it's going to amplify the writes, which is not what we want.
A
Just in terms of your PG log stuff that you're doing, though: that's, I think, a very valuable thing to be looking at, because at the very least it's something that might reduce the number of keys going into the database significantly, and delete some tombstones; very specifically tombstones. And the PG log is kind of nice from the standpoint that, you're right, you don't go read it during normal operation; it's only read when you have a case where you actually need to read it, after an OSD reboot.
E
We'll need to compact; we'll have more and more stuff to compact. Because naturally, if we don't recycle keys, then usually, once you've written the onode a few times, it's not going to change again; but now we're just going to change them again and again and again. Hopefully they're going to be changed in memory, but I don't know if you're going to create extra rounds of them, and in every level of the LSM you're going to have another copy of the object.
E
So maybe we could allocate enough memory for them to stay; like, I don't know, your buffer space should be big enough to hold 3,000 keys in memory, so that you can always have them all in the same level.
E
You have to divide it by 512: how many PGs... 512, I know, not all of them are going to be active at the same time. But what's the number of active PGs?
E
...the system, by just keeping writing everything again and again and again; maybe that's the best thing to do. I think I need to see how I recycle the numbers. Maybe the best thing is to always recycle the last one first, or the first one; I don't know, like whether you need to do a FIFO to see which one of them it is.
A
This is why it's very seductive to just use, like, on-disk ring buffers, right? Yeah, just get RocksDB out of the picture entirely and just have, you know, some continuous space on disk that you write into for this.
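A minimal sketch of such an on-disk ring (all names assumed, not Ceph code): a fixed region written cyclically, with a small header tracking the next write offset and how far the log has been destaged and can therefore be overwritten:

```cpp
// Hypothetical on-disk ring for PG-log-style records.
#include <cstdint>

struct RingHeader {
  uint64_t head = 0;           // byte offset of the next write in the region
  uint64_t committed_seq = 0;  // entries <= this are destaged, reclaimable
};

// Reserve space for a record of `len` bytes; returns the offset to write
// at. Wraps to offset 0 rather than splitting a record across the end.
// The caller must ensure data being overwritten is already committed.
uint64_t ring_reserve(RingHeader& h, uint64_t region_size, uint64_t len) {
  if (h.head + len > region_size)
    h.head = 0;
  uint64_t off = h.head;
  h.head += len;
  return off;
}
```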
E
But how do you get away from write amplification? So I'm again going to say this; I know I've said it a few times: zoned namespaces might be our best friend here. Just slide forward; there's no write amplification, and the write speed is good enough. I mean, they can easily do twenty thousand IOPS, like twenty thousand items per second, even if they're small ones. So it's okay to do all these small, stupid writes.
A
One thing that we have moved toward, though, is telling people that, even with BlueStore, we expect that if they care about performance, they should have some kind of flash in the system for the write-ahead log.
E
I think now they have, like, 64 megabytes of battery-protected memory for cache. So if you're using just one of them, and you keep sending it small writes, they finish immediately, and the drive can put everything in its local cache, organize it, and then send it out; and we never move back, we just move in one direction, and they can be staged.
A
So the big thing there is, right, battery-backed memory, you know, RAM, if you have it. And even down on flash drives we've seen that historically: in the older days, sometimes you would have a supercapacitor-backed cache, where you can issue a flush request to the disk, or to the drive, and it can immediately respond saying it's fine...
A
You know, on the assumption that if it loses power, it can write this stuff out anyway. But we even saw flash SSDs where they didn't have this, and, you know, a flush request could take a significant amount of time; it was very slow. So in terms of the user experience, and the kind of support requests we got, sometimes it very much would depend on whether or not the drive actually had this kind of battery-backed cache.
A
So if we're going to require it... or even right now, I guess, you know, we see a very, very big disparity between hardware that has this and hardware that doesn't.
B
How can we kind of reduce the memory needed, like, for onodes, or for BlueStore metadata in general? Igor or Adam, you probably have some input here, because you've looked at a lot of the different things in practice.
G
Well, first of all, in my fio plugin for BlueStore I have the ability to simulate PG log load, and I did some investigation on that a couple of years ago; well, it's still possible to do the same right now. But I recall I saw around 20 to 30 percent of a performance improvement when disabling PG log load.
G
If we talk specifically about BlueStore onode structures, then probably the way to go is to simplify them; that is, to reduce the amount of supported features, reduce the flexibility, and provide maybe more dedicated solutions, say, for immutable objects, or things like that. Well, a very, very general overview, but that's what I...
G
Yeah, and the same might apply to small objects; maybe plenty of other stuff. Well, again, support for deferred operations, which is redundant for flash.
G
And that's probably one of the reasons why RocksDB, or whatever KV store is used, provides that.
B
And so, I mean, even without reducing some of the flexibility, we could add more information to the store in terms of, like, the hints it's getting from clients. Like, RBD is already sending hints about its object size being, like, four megabytes, for example.
B
For S3 it's similar; it could perhaps be sending more hints of that same kind: what its expected I/O profile, or at least object profile, is.
B
Maybe we could plan for next week; I think we're going to talk a bit about that, and one of the papers, next week as well, but that could probably take less time.
A
Sure. So I did link Igor's onode diet PR in the chat window, for folks that are really interested in this; that might be a good thing to look at and look over for the future discussion. I can also link the double-caching fix PR that I had mentioned; I'll link it here too.
A
I do think that probably even more important than this is figuring out how to reduce CPU usage. If we can do both, it's, you know, an even bigger win; but I think it will be easier for us to argue for more memory than it is for us to argue for more CPU. Just a general feeling I've gotten.
A
We're supposed to be getting some; we've been supposed to be getting them for the last six or eight months. Theoretically they may show up at some point here, but COVID makes it tough. As soon as we get them, though, this is exactly the kind of thing that you're thinking about doing with them.
G
Well, I have got some experience with NVDIMMs over the last year, well, half a year, and I've been trying to implement, well, a sort of RocksDB replacement using the DIMMs.
G
But I can't say... well, I did a bunch of benchmarks, maybe not very extensive, but some of them, and I can't say... well, I have some additional thoughts on how to redesign my current implementation and things like that. But what I can tell for now is that using these DIMMs is not that straightforward in terms of achieving great performance, so right now I don't see much benefit of using the DIMMs compared to RocksDB on fast Intel devices, or Intel flash drives.
G
The DIMMs might be used with SeaStore, but right now there are pretty high requirements on the hardware which supports them. That will probably change eventually, but for now it's definitely not a common solution, and it's pretty expensive.
D
Guys, we do intend to spill this discussion over to next week, right? (Yeah, yeah. We should probably wrap it up.) No, no, I'm not pressing on that; but is there a limit to the flexibility of the targets we want to achieve? Because in the same discussion we talked about spinners, NVMes, Optane PMs; and, I mean, what is the target I should think about for next week? Because now it's too many...
D
Options. Reducing CPU usage, fine, that always works. But what is it? Do we reduce functionality, trim some unnecessary logic we have in onodes? Do we have some idea?
B
As we're discussing a lot of these things, though, a lot of them, like in terms of onode structure and format, and kind of whether we could make things more specific for certain use cases: a lot of those same ideas could apply, in general form, to SeaStore's conception of onodes too.
A
One thing I've been wondering about for a while, and we should wrap this up soon, is whether our concept of storing data structured in an onode actually makes sense, especially for Crimson; or if we should be thinking about groupings of data, and trying to apply, like, SIMD operations on multiple parts of what we think of as an onode.
A
At the same time, like, you know, batching batches of I/Os, trying to operate in ways that make the most effective use of the hardware's abilities. It seems like right now we think about things in terms of objects: creating objects and deleting objects and intermediate objects, and translating data in different ways, copying data in different ways. But maybe we're trying to fit all this into this kind of idea of how it's supposed to work, and it's not really the way that is most efficient anyway.
A
Adam, you had mentioned, you know, different things to think about regarding targeted hardware for BlueStore, and how to think about this stuff. Do you want to update the document that you have been working on with kind of what we've been talking about, and, you know, the ideas and questions, for next week?
A
Okay; maybe, if we do a new document, maybe we could create a new thing in the etherpad for next week. I've gotten very bad at actually updating discussion topics with all the relevant points, but maybe we could put together something in here that kind of summarizes all the questions and what we should go through next week.
A
So: continued onode discussion. All right, so yeah, anyone that wants to jump in: we've got that stub for next week on the etherpad, so feel free to update it and we can go through it next week.
A
Well then, this was a really good meeting; this is really exciting, the onode discussion. I think, with more people that are more familiar with BlueStore now, we maybe have a chance of actually changing some of these things, both in BlueStore and then, hopefully, you know, the maybe more exotic and more interesting changes that we can make in Crimson. So excellent job, everyone; looking forward to talking more next week.