From YouTube: Ceph Performance Meeting 2023-03-09
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A: Hey guys, I'm sorry I'm late. I got into talking about defragmentation with Josh in the core standup and lost track of time. Hopefully we can maybe continue that conversation here. All right, so obviously the core standup is still going on; hopefully we will get some of those folks soon. Oh, hey Josh! You came.
A: That's good. All right, well, let's start this thing then. Oh, I gotta go find my web browser tab for the performance meeting, give me one second. All right, so I confess again that I was not great this morning and didn't make it all the way through the PRs, but I got through at least a decent number of them.
A: This is basically just for the Crimson performance tests, to be able to collect a whole bunch of background system-level information. I guess in the QA suite we didn't collect all of that data before, so this PR should fix that. I approved it; I didn't test it. It looked fine just from a casual glance, though, so I figure that if there are issues we'll figure it out after this is applied. Not super worried about it.
A: Okay, and the next PR was also a Crimson PR. This is to add fine-grained caching. This is from one of the developers at Intel. They also conveniently provided a benchmark in the PR, which I am always very happy to see, and in the four-megabyte prefill case they saw a really huge increase in performance, which honestly actually surprises me just a little bit.
A: Oh, I'm sorry, this is a 4K random read performance test after doing a four-megabyte prefill; that's what they mean here. So again, it actually surprises me a little bit. I didn't see anything quite as bad the last time I looked. Oh, this is SeaStore, though. Anyway, sorry, I'm getting lost in this. It looks like a good improvement, so I think once Sam has a chance to look at that, hopefully we'll merge it pretty quickly. But definitely, definitely good things.
A: Those were the only two new PRs that I saw. I didn't make it fully through the PR list, but I didn't see anything obvious that had closed that was new, anyway.
A: I did see a couple of updated PRs. There have been some additional reviews on the QAT batch PR; that's from Intel, I think. There is Igor's PR, or sorry, Cory's PR, for setting the RocksDB iterator bounds for collection listing; I think that has a couple of new fixes in place.
A: The other PR that I did see that was updated, though, was Igor's PR for not resetting the prefetched buffer while doing multi-chunk, I think it's probably supposed to say "reads" at the end there, but basically the BlueFS prefetching behavior. I think the hope is that this might help us with RocksDB when it's trying to do prefetches, and if it does, there's a chance that we might be able to re-enable, er, disable bluefs_buffered_io again.
A: If this works well, we'll see, but generally speaking, that was the big reason that we had to revert the switch to direct IO: we're not reading from the block cache in RocksDB properly, or RocksDB isn't... there's some issue there, and we're entirely relying on the page cache, the Linux kernel page cache. And if we can somehow improve things at the BlueFS layer, then maybe we don't need to do that anymore.
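For reference, a minimal standalone sketch (not Ceph code; the file path is made up) of what the buffered versus direct IO distinction means at the syscall level: a plain read can be absorbed by the kernel page cache on repeat access, while an O_DIRECT read bypasses it and needs aligned buffers.

```cpp
// Illustration only: buffered vs. O_DIRECT reads on Linux.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>

int main() {
  const char* path = "/tmp/testfile";   // hypothetical file, not a Ceph path
  const size_t block = 4096;

  // Buffered read: the kernel page cache can serve repeated reads of the
  // same block without touching the device again.
  int fd = open(path, O_RDONLY);
  if (fd >= 0) {
    char buf[block];
    pread(fd, buf, block, 0);
    close(fd);
  }

  // Direct read: bypasses the page cache entirely; buffer, offset, and
  // length must be block-aligned, and every repeat read hits the device.
  int dfd = open(path, O_RDONLY | O_DIRECT);
  if (dfd >= 0) {
    void* aligned = nullptr;
    if (posix_memalign(&aligned, block, block) == 0) {
      pread(dfd, aligned, block, 0);
      free(aligned);
    }
    close(dfd);
  }
  return 0;
}
```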
C: This PR fixed that specific issue. BlueFS was already using the prefetch buffer, and lately I discovered that it could result in RAM usage growth while RocksDB is performing bulk data reads from the DB, for instance while iterating over every record. It doesn't close all the SST files along the path, and if we do not trim their buffers as well, we could get gigabytes of RAM occupied. So the latest update improved the trimming as well.
A: Nice. Remind me: can we still use the readahead at the BlueFS layer if we're using bluefs_buffered_io = false? Will that still work?
A: Okay, so I mean, the behavior it seems like we see is that when we set bluefs_buffered_io = false, we're not properly reading from the RocksDB block cache for whatever reason, and somehow we're not doing any kind of prefetch, and so we end up just re-reading the same blocks over and over again when we do these kinds of listing steps, where we re-list or re-evaluate some kind of iteration. And we also are not doing any kind of prefetching very well.
A: So we basically just end up relying on the page cache in the kernel to save us, but there are probably multiple ways that we could avoid this at the BlueFS layer, I suspect, while still doing direct IO against the kernel.
C: It actually does, well, at least for some sequential scans. It reads the SST file using a prefetch call, which attempts to grab a larger chunk, and then it goes through that block using small reads.
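As a rough sketch of the prefetch behaviour just described (assumed semantics, not the actual BlueFS code): one large pread fills a chunk buffer, and the small sequential reads that follow are served out of that buffer until they fall outside it.

```cpp
// Sketch of a prefetch buffer serving small sequential reads.
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

class PrefetchReader {
 public:
  explicit PrefetchReader(int fd, size_t chunk_size = 512 * 1024)
    : fd_(fd), chunk_(chunk_size) {}

  // Read `len` bytes at `off`; refill the prefetch buffer with one large
  // pread whenever the request falls outside the cached chunk.
  ssize_t read(uint64_t off, size_t len, char* out) {
    if (off < buf_off_ || off + len > buf_off_ + buf_len_) {
      buf_off_ = off;
      ssize_t r = ::pread(fd_, chunk_.data(), chunk_.size(), off);
      if (r < 0) return r;
      buf_len_ = static_cast<size_t>(r);
      if (len > buf_len_) len = buf_len_;
    }
    std::memcpy(out, chunk_.data() + (off - buf_off_), len);
    return static_cast<ssize_t>(len);
  }

 private:
  int fd_;
  std::vector<char> chunk_;   // the prefetched chunk
  uint64_t buf_off_ = 0;      // file offset the chunk starts at
  size_t buf_len_ = 0;        // valid bytes in the chunk
};
```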
A: Do we have this pattern where we will iterate to a point and then we might do, like, a deletion, and then we do, like, a ring scan: start at a certain offset again and reiterate until we hit another point where we do a deletion, and we reiterate over the same range over and over again. And when we have bluefs_buffered_io on, we do the reads from the page cache, but when we don't, with direct IO, we end up doing that over and over again, and we do small reads against the disk.
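To make that pattern concrete, here is a hypothetical sketch against the RocksDB C++ API; the key prefix, deletion condition, and pass count are made up for illustration, but it shows why the same key range (and therefore the same SST blocks) gets read repeatedly.

```cpp
// Hypothetical illustration of the "scan, delete, rescan" pattern.
#include <rocksdb/db.h>
#include <memory>
#include <string>

void rescan_pattern(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  std::string start = "prefix/0000";            // made-up start key

  for (int pass = 0; pass < 100; ++pass) {      // repeated passes over the range
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
    for (it->Seek(start); it->Valid(); it->Next()) {
      // Placeholder condition standing in for "found an entry to remove".
      if (it->value().empty()) {
        db->Delete(rocksdb::WriteOptions(), it->key());
        // Restart roughly where we left off; the next pass walks the same
        // range again and re-reads the same blocks.
        start = it->key().ToString();
        break;
      }
    }
    // With buffered IO the kernel page cache absorbs these re-reads; with
    // direct IO they show up as repeated small reads against the disk.
  }
}
```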
A: But we don't really want it, right? Like, it's slower. In other parts of the OSD it's faster to use direct IO, and we don't really want to mix them. So I guess what I was wondering is: can we do direct IO against the kernel, but still have some kind of prefetch buffer that we keep ourselves at the BlueFS layer?
C: The issue with RocksDB is that it attempts to read from the same locations multiple times, and...
A: Yeah, and you'd expect that if you did a reread of the same region, it should be read from the block cache, but for whatever reason it doesn't seem to. I don't know if that's because it's just failing, or because maybe we're triggering the eviction of the block cache. I do sometimes see that RocksDB, like, evicts the entire block cache, maybe due to SST files becoming invalid. But it's definitely not working the way I expected it to.
C: Benchmarking... well, collecting some latencies and numbers on operation latency. I didn't see any overhead here.
A: But if we showcase a really big example, maybe we could.
A: I think the bigger issue, though, is that we have to make sure that our iteration performance remains good one way or another. It'd be nice if we had some guarantee that we could do it in our code, rather than relying either on RocksDB or on the Linux page cache, right? Like, I don't like being reliant on either of those things.
C: Asynchronous or maybe parallel access to data. So right now it looks like RocksDB, well, at least for large scans, performs reading in a sequential manner, and maybe if we do prefetch more smartly, at least at our level, and trigger the disk reads beforehand...
C: This might provide some benefit, because right now, as far as I can see for these sequential scans, we underperform from a disk bandwidth point of view, like way slower than what the disk is capable of, and...
C: ...that's for a SATA drive; I am afraid that for NVMe drives it might be even worse, and...
C: ...if we read the data in parallel, we could benefit from that.
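A minimal sketch of that idea, using plain pread and std::async rather than anything Ceph-specific: while the caller is consuming the current chunk, the read for the next chunk is already in flight, so a sequential scan can keep the disk busier.

```cpp
// Sketch of overlapping consumption of one chunk with the read of the next.
#include <unistd.h>
#include <cstdint>
#include <future>
#include <vector>

std::vector<char> read_chunk(int fd, uint64_t off, size_t len) {
  std::vector<char> buf(len);
  ssize_t r = ::pread(fd, buf.data(), len, off);
  buf.resize(r > 0 ? static_cast<size_t>(r) : 0);
  return buf;
}

void sequential_scan(int fd, uint64_t file_size, size_t chunk = 512 * 1024) {
  auto next = std::async(std::launch::async, read_chunk, fd, 0, chunk);
  for (uint64_t off = 0; off < file_size; off += chunk) {
    std::vector<char> cur = next.get();           // chunk for this iteration
    if (off + chunk < file_size)                  // kick off the next read early
      next = std::async(std::launch::async, read_chunk, fd, off + chunk, chunk);
    // ... consume `cur` with small in-memory reads here ...
    if (cur.empty()) break;
  }
}
```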
A: This was, I think, testing that other people at Red Hat had done.
A: I think I had done some as well, but the stuff that's in the PR is from other people on that team, I think.
A: I think that case was using hard drives, probably with an NVMe DB/WAL, yeah, yeah.
A: But anyway, take a look. We can think about it more, but I think it's a good thing for us to figure out. If we can figure out a way to go back to bluefs_buffered_io disabled, I think it's a good idea, at least to try.
A: Yeah, I think we just need to figure out how to deal with this super irritating corner case that actually ends up being kind of a big corner case, but I very much would like to see if we can move back to disabling it.
B: Something I actually started to think about, I don't think I wrote any code for it, though, is: technically we don't have to prefetch at the BlueFS level, we just need to cache, right? Because if the issue is that RocksDB is basically asking for the same data over and over again, if you just give it that data from memory, you're good. What I wanted to do, though (and I've had zero time to do this for years now), was figure out why it's asking for the same data over and over again when it should be caching that internally.
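A minimal sketch of the caching idea described above (assumed semantics, not real BlueFS code): keep a map of recently read extents keyed by offset and length, so a repeat request for the same extent is answered from memory instead of going back to the disk.

```cpp
// Sketch of a simple extent cache for repeated reads of the same ranges.
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <map>
#include <utility>
#include <vector>

class ExtentCache {
 public:
  explicit ExtentCache(int fd) : fd_(fd) {}

  ssize_t read(uint64_t off, size_t len, char* out) {
    auto key = std::make_pair(off, len);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      std::vector<char> buf(len);
      ssize_t r = ::pread(fd_, buf.data(), len, off);
      if (r < 0) return r;
      buf.resize(static_cast<size_t>(r));
      it = cache_.emplace(key, std::move(buf)).first;
    }
    // Repeated reads of the same extent never touch the disk again.
    std::memcpy(out, it->second.data(), it->second.size());
    return static_cast<ssize_t>(it->second.size());
  }

  void trim() { cache_.clear(); }   // a real version would need eviction

 private:
  int fd_;
  std::map<std::pair<uint64_t, size_t>, std::vector<char>> cache_;
};
```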
A: I did, like, a huge walkthrough of the RocksDB code, and it's changed dramatically in, like, the last three years; it's very different for Nautilus versus Pacific. So that is probably going to take multiple independent code walkthroughs to figure out what each version of RocksDB is actually doing. Yeah.
C: Sorry, just my comment: I disagree about caching rather than prefetching, since prefetching is still helpful for long scans, and RocksDB behaves in a way where it issues a call to the file system saying, please find a larger block to retrieve, and then scans it through using small reads, and if we do...
C: These secondary reads are not repetitive, they just...
A: Taking care of this one use case where we see the awful behavior, I mean, like, everything I was seeing with bluestore, sorry, bluefs_buffered_io off looked better, except in this one case. So if we take care of that, I think we can probably reevaluate this, like, for the third time.
A: But we've mostly landed with it off, yeah. That was where I was at, and then people got burned, and I was like, oh. I think Dan van der Ster wrote another PR to disable it, and... or rather to turn it back on, turn bluefs_buffered_io back on again, and then we just kind of went, yeah, yeah, I think we need to do that, because too many people are hitting this awful case. But I really wanna get back to being able to switch it back to direct IO, if we can.
A: All right, I don't think that... ah, yes, okay. Those were the updated PRs. Did I miss anything, guys? I didn't make it quite through the list today, so was there anything I missed?
A: All right, if not, then I don't actually have anything listed for today for discussion topics. Does anyone want to jump into the fray?
A: If not, I do have one thing I could actually bring up, I think. Were you the author of the PR for CBT to disable the check for an existing results directory?
A: One second, I can pull it up.
A: Yeah, sorry, it was Nissan's PR. Someone said that Nissan was gonna come to the meeting to talk about it. I couldn't... it's confusing; I'm just terrible with names, and I couldn't remember who did it, so I apologize. But I guess I did want to talk about this, possibly.
The goal with that particular line of code that's there is to prevent the re-running of tests that were already run, and that particular check has saved me many, many times in the past from overwriting existing test data when you've run it against an existing directory.
A: So it sounds like, for some reason, in the teuthology tests that use CBT, it's triggering this code, causing, I guess, a results structure to be created in the second loop of a benchmark, and it will skip that. But I don't see this in any of the testing I've done in, like, the year or two that this code has been there.
A: I'm just not sure I want to get rid of that check completely, so anyway, that's the status of that, I guess.
A: Well, I'll try to follow up with Nissan and see if we can just hash it out and figure out what's going on here, but that's, I guess, the status of this thing. I'm hoping we can just figure out a more nuanced fix than completely disabling the existing directory check.
A: Well, that's all I have, so unless anyone else has anything, maybe we wrap up a little early today.
A: All right, I don't see any sign that anyone's got anything, so enjoy the rest of your day, guys, and I'll see you next week. Have a good one.