Ceph Performance Weekly, 11 Mar 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Ceph Performance Meeting 2021-03-11

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

All right, let's see I'm going to pull up the ether pad and we can get moving. I figure people might slowly trickle in, but it looks we do have core people now. So, okay, uh let's see I have got two new pr's this week. That I saw one is from me that is looking at uh doing, trying to make a really quick and dirty omap. Benchmarking. Implementation in our uh object store store test suite.

A

So this is really just taking the like simple omap test that we already have and more or less duping it up a little bit with the ability to write to multiple objects, do lots of keys and then more or less do the same test. That's already doing and timing them.

A

I imagine that there's going to be room for improvement here for sure, but this is just kind of a first pass attempt at it. um I'll talk about this a little more later, but uh it it's definitely showcasing some interesting things.

A

um The next new pr is from gabby, and this is uh his pr to remove allocations from roxdb and and josh and igor both have reviewed it. It looks like there's, maybe a couple little things that could maybe be worked on or improved, but uh uh they've they're, both on top of it and working to to help make it good, uh or at least ready, not not, that it wasn't already good but ready for for uh uh being merged.

A

um Okay, um I did not see any closed peers this week, but I just realized. I don't think I looked through the list of um uh prs that were submitted this week and closed so it's possible. I missed something, uh please feel free to speak up if there's something that you know that clothes that I missed here um updated this week.

A

We have a pr from. uh I hope I say it right shuhan that is kind of like two separate prs in one um uh scattering aliens store threads across separate cpu cores, but then also has a a separate thing kind of bolted on as well so kifu, and I talked about that this morning. I think we just need to separate that into two separate prs. The first is pretty straightforward uh for actually distributing threads across course. That looks great. I think, that's just fine.

A

We should merge that the other piece is more complicated and, and maybe uh isn't uh uh quite what we want. So uh more discussion needs to happen on that part.

A

uh Let's see, there's another pr from shohan uh optimized crimson it was these client request process parallelism. um I think that was reviewed and it's in testing now.

A

uh Let's see what else? uh Oh actually I I am wrong. I had this in the wrong spot adam's pr to distinguish between buffered and direct io in blue fs. uh That was a bug fix actually, more than anything they weren't reporting the correct uh state to roxdb. That's been fixed and was merged by kifu.

A

And then there's the seth volume retrieved device data concurrently, david galloway, uh just commented on that.

A

I don't know what exactly is the hold up still on this one, but um in any event, uh that is still being discussed and worked on, ah let's see so no movement for this blue fest, buffered io, but just before the meeting last week um or kind of yeah, it was really just before it uh adam, and I had done a bunch of work uh going through the rocks db code for about three hours to understand uh the route that we take when doing iteration all the way back to blue fs and reading uh data for roxdb.

A

So I can go over that a little bit today adam's. Not here, I guess so um we can. We can discuss it at least a little bit.

A

um Otherwise I don't think there was a whole lot of other things going on that I saw um sam has a whole bunch of stuff happening with c-store, but I don't know if it's really exactly performance-related yet so I didn't include it in this list, but uh definitely there's there's a ton of work going on there right now.

A

Anything I missed.

A

Guys all right well, then,.

A

For omap performance, this pr that I've got for doing, object, store or test on that bench, uh I'll, post or I'll paste it in the chat window.

A

Here we are so I don't really trust this yet. To be honest, um the file store results, look really really good, and I I don't totally know why it's possible that file store just uses so much less memory. That rocksdb can utilize the page cache, given a small container memory limit effectively, and it's still just doing really really well, but it even in other tests is doing surprisingly well.

A

So I I need to understand this, but uh the gist of it is that in blue store, we're we're seeing really really slow, iteration overall map and seemingly maybe slow, omap put and get performance as well.

A

um It's really nuanced and hard to figure out the rock cd code that that is involved in all of this has changed dramatically uh over the last couple of years.

A

So there are multiple code paths involved depending if you're doing director buffered io there's a lot of of different things that could impact uh what's going on, given what I'm seeing in my tests that I don't totally trust, um I think we're seeing a really really obvious performance hit when the block cash and rocks db is being heavily contended when other things like oh nodes or maybe allocation data or other stuff is cycling through the block cache it.

A

My guess as to what's going on is that that's basically forcing um the sst files that were being read for iteration out of the cache we're doing going back, doing prefetch, it's all very slow reading from disk and the iteration step is, um we might even be iterating over like a long time period where, in between uh other stuff, is forcing those ssd files out of the cache and and we're actually reloading them. I don't know yet, but that it sure seems like that's what's going on, is you're really nasty?

A

If that's the case um in that pr, though, if you look at the last side test from like an hour ago, you can see there's this huge disparity in blue store uh direct. I o iteration performance when we have a smaller osd memory target, so a smaller rock city block cache a bigger memory target and thus a bigger block cache.

A

um It's like a 10x difference, so that is kind of what I'm going to try to focus on is understanding exactly what it is. Doesn't look to me like right. Look to me right now like this is a bug exactly in rocks tv, it's not like when we're doing, pre-fetching um or or reads for iteration that we're not putting stuff into the blog cache. I've seen some evidence with wall cloud tracing that we are, in fact loading data into the block cache.

A

So I don't think that's what's going on it's that we're just thrashing it really badly. This may just be an indication that blue store is trying to use rocks to be for too much. We already kind of suspected that anyway, um so that's kind of what we've been working on. If people are interested, we can go through like a a walkthrough of the rock cd code, but it's you know pretty esoteric. I don't know if folks want that or not um that's. That's kind of high level uh questions, thoughts.

B

It's really interesting mark what you brought up by the difference between file store and booster. I wouldn't expect that.

A

I don't totally trust it yet, but yeah it's a little disconcerting.

B

Yeah yeah and maybe maybe you're right- maybe it's an artifact of the memory situation, but um it.

A

Might be if you could imagine right that, if with file store, we just have enough memory in the page cache to be able to to keep those sst files in cache for iteration, you can see a dramatic difference right.

B

Yeah, that's true because as soon as you're going out of cash, it's a tremendous hit.

A

Yeah, I think I'm going to try even limiting the c group limit even more and see if I can make files form look bad just just to see if I can prove it. No, no.

B

No, I I could very well be part of it. I mean, since I've tried stories, people take advantage of the page cache a lot more. It's kind of implicitly getting a lot more memory from the system than booster is.

A

With c group limits, I I think, but I'm not sure that that also limits your your amount of page cash. You get right, um I'm not.

B

Sure for that I don't think it's necessarily accounted for on a per process basis, since the page cash is kind of shared system-wide.

A

Well, that's I I I keep going back and forth on this because that's kind of what I originally thought- and I saw some very ad hoc knowledge of you- know various forms claiming it did and I am seeing with blue star a change in the performance of iteration with buffered. I o when the c group limit is increased, even if the osd memory target is left the same.

B

A

B

We don't understand about the model.

A

Yeah exactly right, there's something going on here. I don't understand what uh yeah.

A

So yeah um anyone that's interested in this stuff. Please please look at the benchmark code. I don't really trust it. It's very or um it's basically just taking the existing test and throwing counters around it.

A

I haven't played a whole lot with collections or any of this other stuff. Yet so I'm guessing that I could probably improve this dramatically but um yeah. That's. I guess you know it is what it is.

A

Sam, I I don't mean to pick on you, but do you remember with file store since you worked on it so much for so long? uh Would you expect really fast, oh map performance with it? Does it make sense to you, given the.

C

A

C

Seems good, you should look at that.

A

So what was that.

C

The memory theory seems good. You should look into that.

A

A

So yeah that's that's kind of where we're at with this. um I think there I include a couple of wall clock profiles of an early version of the benchmark in the uh pr there's. Definitely some crazy stuff going on in here. um Oh, I forgot. I still need to run these tests with um telling tc malek that it can have more hard cash uh because we don't.

A

This is not kind of uh implicit when running tough test object, store individually.

A

That will probably help with tcml a little bit since we saw that show up in the profiles yeah, otherwise, just a lot more lots, more investigation. That needs to be done. Hopefully we'll get to the point here, though, where blue store is um as fast as file store seems to be.

B

Thought it was interesting that the um hashing was showing up behind your profiles for um booster just to be calculating that the hash for which cash chart to access it looks like.

A

B

A

I do kind of remember seeing that too, uh let's see yeah there, it is.

A

Oh, are you talking about the uh the that? That's for the roxdb block.

B

A

B

A

And their version of the their block cache doesn't use. This doesn't use our jenkins. We, we tooled that in to use uh when we made our own version of it. We replaced their hash with the ceph1.

A

I don't actually know is our jenkins any slower than any of the other implementations like um like. uh What's it the the twister, I'm not sure, um okay shoot. Sorry, I'm I'm yeah anyway. Yeah go ahead.

B

I was just surprised to see that um it showed up so high in the profile like that.

C

Yeah, so mercedes twister is a cryptographic versus a high quality hashtag. This is a low quality one. I would assume the merchant twister is slower. Okay,.

A

I know that back when I used to do ray tracing work. It was one of the faster implementations that we used for it, but I don't remember specifically why we would use that versus r jenkins because we don't really care about high quality, but.

A

A

So also in here, there's the red black tree, the standard map, uh time being taken.

A

I didn't include a trace that was deep enough to look at where that what that's actually doing but.

B

Yes, some of that.

A

Might seem indicated.

B

With the tc malik, um increased uh central uh heap size too be curious to see what the profile looks like after that.

A

Also, the current version- the benchmark works really differently. Right, like this, those traces were from doing like a single object with um like a million keys on it. The new version is letting you specify, like a number of objects to create with you know, a smaller number of keys per object, so the trace might look a lot.

A

B

Yeah I'd imagine we might yeah end up hitting more pieces of blue store, rather than only rocks db. In that case,.

A

Yeah yeah, quite possibly.

A

But the good news is that we still can do that right, like we could go back with the new version of the benchmark. You just specify one object that you want um in the collection and then you know a ton of keys on it, so we can still do those kinds of tests.

A

What about multiple collections?

A

B

um I don't think it makes much of a difference in multiple collections versus in the collection. Okay, also something I'm missing about serialization there, but you're going directly through the object store interface.

B

You have control over the like object.

A

Sequencer, which I haven't done anything with at all right.

B

Right right which mean that you're getting like the most parallelism, you can.

A

Is there any thing else here? I should be that missing. Do you think that that would be good to include or look.

A

B

uh What kind of like db size are? You are you testing with, like so far.

A

I have no idea literally, I just started like throwing. You know, objects and and keys into things. um So I mean, like you know I, I know what the test does right like this is like a hundred thousand uh objects with 100 keys per object and the key length is like 64 bytes and the value length is 256 bytes, so I mean you can roughly kind of figure out what box db would see prior to basic amplification.

B

Okay, so that's like maybe a gigabyte or two roughly.

A

I didn't actually work it out.

B

I think if I heard you, I think that adds up to roughly that so it's probably worth trying with like larger data set sizes too.

A

Yeah, I know that memstor uses like 30 gigs. When I run it, that's something.

B

Okay, I'm not sure that that tells us too much but other than how an efficient amp store is.

A

Yeah, let's see okay, so if I do this, I get 320 bytes per per key yeah 100 000 keys per object. There's a hundred thousand objects with a hundred of these per key.

A

What does that end up working out to be that's like.

A

3.2 gigabytes, I think, okay prior prior to space, amp.

B

Right right, that's not too unreasonable, then, but we've certainly seen even like order magnitude bigger in practice.

A

Yeah, it was enough that I actually started starting to see cache effects right. That was what I was really going for was I wanted that increasing the size of the blue store or the osd target? You know memory target and increasing the block cash actually made the difference for for blue store, so that was really all. I was focusing.

B

B

Yeah, I guess, for the file store case. You could even like run this the system in a little memory mode, where you don't let it use um like with the kernel with a low memory mode. So you can't even use more than the um data set size to force it to be able to go to disk. Sometimes, if you wanted to avoid that, like the page, cache effect entirely.

A

How do I do that? I mean, I think, that's going to go ahead.

B

Because the kernel parameter you can set when you boot, to determine your respective memory um to a smaller amount than what you actually have physically available.

A

Oh, oh yeah sure so like globally. You could do that.

B

A

B

Then you don't have to worry about, like the c group, um how that's playing with a page cache exactly.

A

Sure I'll try to verify one way or another that how c groups actually work, because I'm still fuzzy on that yeah.

B

Yeah, it's really useful to to understand that in general,.

A

Yeah, especially since that's what we do for containers right like that's, that's exactly what our real production stuff is doing. So we should understand.

A

It um one one other thing I wanted to ask here, igor you, you looked at like collection, prefetch and collection like caching, I think a while back, and I think the thought was that prefetch was going to be good enough. Do you do you remember?

A

D

Well, actually, this was obsoleted by fast pg removal.

D

So the only place where suffered from on collection listing is generals.

D

I just replaced the procession.

D

I was using the previous position for removal.

A

Did did you ever prototype like a cache.

D

uh Yeah, it was implemented in some way. uh I don't remember the details, but it's still pr.

D

Yeah, it was sort of kept prefetching.

A

Yeah yeah, I saw a reference to this. uh I think. Maybe it was this thing uh just.

D

Let me check, I can try to find.

A

uh This is, this is the one I saw earlier there. um I just posted in chat.

A

D

Yeah yeah, that's.

A

A

So I haven't, I haven't, looked at exactly what your code does here.

A

Not really in detail anyway, but yeah, it looks like maybe maybe you've made it. So you can read from cash.

D

Well, the description is pretty straightforward. I.

A

Suppose yeah I mean collectionless prefetch.

A

Yeah, I I would have interpreted that I guess maybe I was mixing it up with your other prefetch work or prefetching in ruxdb right there yeah, but yeah this. I see here in the code. It's like collectionless cache.

A

I wonder if this might be worth revisiting.

D

The beginning of the discussion, what.

A

Well, so this is not really being tested in the benchmark right now, right like doing uh a collection list itself, but given the other things I'm seeing in this benchmark, I wonder if it's going to be very slow in blue store.

D

All right you mean your this new pr for, for benchmarking, right.

A

Yeah I mean, if you look at it, I'm not doing collection lists right, but there's a bunch of other stuff and it's.

A

It seems like any time we have a lot of data.

A

The the roxy cb block cache ends up becoming kind of worthless.

A

So at least for iteration, so I was wondering my I guess my general high level theory is that the more we can catch that at the blue store level, the better off we might be in many different situations, one of which being the one that your pr is. Your old pr here is addressing right.

D

Well, the question is: when do you need this sort of caching for for collection listing?

D

Well, the the exact cases.

D

D

C

D

I I I was pretty lucky with uh repetitive uh pg listing uh power, technical, additional caching. Once I I started to use this position, I'm I I can't see.

D

That huge delays or collection listing, as I did before, okay well from uh the last case.

D

Started to be around two milliseconds or even 200 microseconds, to get 30 entries for pg removal versus up to several seconds, uh originally removed stuff.

D

And also all on, on the same, uh this was compared on the same note on the same data set.

D

um Well, it was something like that about perfect io or blueprint buffet, I use set to false it's getting around eight seconds and I set it to true dropped to one or two seconds and then upgrading to octopus 15.2.9, which has this new remotely dropped to. Let it be second.

A

How many seconds.

D

Two meters, second.

A

Two miles so we went previously with buffered io. We had one to two seconds but with direct I was 80 and now it's back down to like point two is it? Am I correct.

D

Well, it was eight seconds with uh yeah. I used two false around one. Second, all of this buffer trigger set to true and yeah. That's why it's anymore? Okay, two milliseconds.

A

Wow, okay, so that's basically already fixed. Then.

D

And that well definitely that that was the case when, when bulk removal uh happened before so for regular usage, the difference is not that large, but that that was. That was the case when it was degraded after bulk removal.

D

Auto hd, but forever.

A

For bulk removal, you mean like um due.

D

To it, it was removal caused by remapping or something but.

A

Okay- and that was uh not caused by the rocks to be tombstone issue that we saw also right.

A

Where, like bulk removal, causes significant slowdown due to like.

D

As far as understand, currently we produced such top stones case of large removals in a single shot, and yes, that's not the case for our regular pg removal. It takes just 30 entries and then removes them, so I'm not applied, but it looks like roxdb.

D

Might be still inefficient when a bulk of records before yeah.

D

But again it's not that efficient, since uh it depends on your access pattern. If you use iterators uh and save the previous position, you as it works pretty good anyway, without compaction, incredible things like that, it's uh it's a first search which might be inefficient.

D

So uh well uh what's happening when you are removing pg on each collection listing to retrieve text items to to remove it has to to run through a longer list of removed entries. I mean the original behavior. It always started from from the beginning, so on the first iteration it retrieves. The first entries then remove them on the next one.

D

It has to bypass somehow this just removed entries and find the next portion, and so on so forth and the it looks like the the farther it goes, the the worse, the behavior.

A

Yeah that sounds right.

A

I my inclination is to want like an omap cache.

A

I don't know if it's maybe more work than it's worth.

D

Well, I I I suppose we should start with the use case, so just just find these use cases which suffers from.

A

Right right now, we've got a couple of other people.

A

We've got a couple of people, or at least one person right now, working on trying to uh take all of the different customer cases that we think may be related to flow map, performance and solo, iteration um and and kind of get them together, so that we can. We have that um it seems like it does. Whatever this this is. It's maybe affecting many different use. Cases, though.

D

Yeah, just just one one thing to mention that actually this patch for removal doesn't drop roxdp in in in any case, so one might have uh some different scenarios when, when there are a bulk of uh bunch of multiple removals in in roxbb, and currently we are still- uh we might still suffer from from from this yeah. Well, one of the keys is snap mapping records.

D

Definitely so, if you case a lot of snapshots might get wet performance as well.

D

Maybe a bw now example, but not sure I I've heard about each suffering from it, but but again so I I presume that anything about removal might cause be slow down and yeah.

C

D

Agree currently, we have just this one one scenario: six, uh my second, my my second pr, which, unfortunately, don't have enough time to proceed, work on.

D

uh Well, it improves.

D

Some stuff, but well it allows to run, uh allows to to fix bulk or map removals by using uh by using range, deletes subsequent compaction. uh Yeah again, that's probably not a 100 percent pollution.

A

If you force the compaction after the range delete, though that at least maybe makes it more viable.

D

Yeah and actually compaction is good in in any case, if you applied it or not, this uh workaround to apply compaction with totally degraded trunks db. It's a little was a workaround for for a while.

D

It looks like we can can run.

D

We can run online compactions on a subset of just removed records and doesn't impact to the performance. So that's what I am doing in my second vr, but.

D

Still still, an open question is how to to handle uh bulk removals on unsorted records.

A

So, in the the test case, I've got here in this this pr, I linked there's no removals happening until the very end, so we're not even touching any of that that's going to just be even worse. I think this is just touching the cache and how effectively we can. We can iterate your memory constraint scenarios.

A

But you imagine that if you start doing deletions, then I imagine that these numbers would get even worse.

A

Maybe I think so.

A

Damn not to bother you again, but if I remember right file store, does something kind of strange with omap um when it's is submitting to roxdb or previously leveldb, it's not as straightforward as this bluestore. But do you do you remember how that works or is there? Is there something? Is it easy to sorry? What was that.

C

Yeah, if you're not using cloning, it's a prefixing scheme.

A

That's it: okay,.

C

Pretty much yeah! Okay, are you asking how that could possibly result in higher performance? It couldn't.

A

Yeah, that's um I I. The only thing I can think of is that it's getting a bigger cache.

C

Memory theory is again still a really good theory. File store is using the kernel page cache. It's probably evading your c group limits and it's quite probably quite a bit larger.

A

Yeah, it's probably not using a significant amount of block cache because I don't think file store changes the default, so I think it's like whatever we set the default block cache size to be. I don't know it's like 500 megabytes or something, um but it doesn't really matter because it's all buffered I o going to whatever file system it's running on.

A

I believe bird reads.

A

So whatever page cache is left, if assuming the c group is actually limiting it, then if file store's not using a whole lot of memory, you might have a couple of gigabytes available for rocks to be paid to use in the page cache for for this and that's good enough, and it just works all right. Well, this is all I've got guys um any any other thoughts or questions or comments all right. Well, oh sorry, were you saying something.

D

A

Okay, um I'll keep working on this, and let you guys know how it goes. um Anyone have anything that they want to bring up before we wrap the meeting up. All right well, then, have a great week, everyone and see you again next week, thanks guys thanks.