From YouTube: Ceph Performance Meeting 2021-09-11
A: All right, let's see — I'm just gonna start doing this, and maybe they'll join when we actually get to the discussion topics. So, okay, let's see: this week's closed PRs. There was a Prometheus PR that just lets you disable the cache. I think that's pretty simple, so that merged.
A: I'm sure it was basically to eliminate some kind of corner case. Though, yeah — we shouldn't use a single per-onode transaction for such an upgrade. When... oh, who knows. "Omap list is huge"... yeah, high memory consumption. Okay, so it's basically just making it so that we don't end up consuming tons of memory when there are, like, huge omap lists on a particular onode.
A: Or in a single per-onode transaction... okay, so it splits it up. Okay, yep! So, basically, just not letting you blow up the OSD when things are awful.
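(As an aside — this is not from the PR itself — the usual way to keep memory bounded when an object carries a huge omap is to page through it in fixed-size batches instead of materializing the whole listing in one call. A minimal librados sketch of that pattern; the object name and batch size are hypothetical, and error handling is elided:)

    // Illustrative only: list a potentially huge omap in bounded batches,
    // rather than fetching everything in one shot.
    #include <rados/librados.hpp>
    #include <iostream>
    #include <map>

    void list_omap_batched(librados::IoCtx& ioctx, const std::string& oid) {
      std::string cursor;          // resume key ("start_after")
      bool more = true;
      const uint64_t batch = 1024; // caps per-call memory use
      while (more) {
        std::map<std::string, librados::bufferlist> vals;
        int rval = 0;
        librados::ObjectReadOperation op;
        op.omap_get_vals2(cursor, batch, &vals, &more, &rval);
        if (ioctx.operate(oid, &op, nullptr) < 0 || rval < 0)
          break;                   // error handling elided
        for (auto& [key, val] : vals) {
          std::cout << key << "\n";
          cursor = key;            // continue after the last key we saw
        }
        if (vals.empty()) break;   // defensive: avoid spinning on empty batches
      }
    }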
All right — in the MDS, there's a PR that merged that switches the mds_lock to a fair mutex, to avoid starvation issues.
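(For readers unfamiliar with the term: a fair mutex grants the lock in arrival order, so no waiter starves behind a stream of later arrivals. The sketch below is a classic ticket lock that illustrates the idea only — it is not the implementation the PR uses:)

    // A classic ticket lock: threads acquire in FIFO order of arrival.
    #include <atomic>
    #include <cstdint>
    #include <thread>

    class ticket_mutex {
      std::atomic<uint64_t> next{0};    // ticket dispenser
      std::atomic<uint64_t> serving{0}; // ticket currently allowed in
    public:
      void lock() {
        const uint64_t my = next.fetch_add(1, std::memory_order_relaxed);
        while (serving.load(std::memory_order_acquire) != my)
          std::this_thread::yield();    // wait politely until it's our turn
      }
      void unlock() {
        serving.fetch_add(1, std::memory_order_release);
      }
    };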
A: So, okay, that looks good. A bunch of updated PRs: this one from Adam — we'll talk about that, I think, a little bit later; he's not going to be able to make the meeting, but the gist of it is that this PR does direct I/O for writes and then buffered I/O for reads. And so the concern there — well, the benefit there first — is that it's kind of giving us the best of both worlds in terms of what we've seen for performance, potentially.
A: But the concern is that this could introduce consistency issues if you're trying to do things like read from cache while, simultaneously, you have direct I/O writes going on. So that would need to be very carefully audited, I think. Okay — what else?
A: OSD compression bypass after RGW compression — looks like that made it into Eric's testing branch, so maybe that's possibly even going to merge if it passes. More ongoing work with this TTL cache implementation for the manager module; I think it had another review.
A: The optimized PG removal PR — that had failed tests at one point, and I think it just got another review. I don't remember who looked at it recently, but it was reviewed again, so more ongoing work there.
A: The Ceph messenger header2 decoding optimization — that's gotten updates; Ilya previously had reviewed it. I think there were test failures at one point; not sure if that's still the case or not. And then the other big one for me — the one I've been excited about for a long time — is this MDS "remove subtree map from the journal" PR. I wasn't sure if that would just languish, but it looks like Zhang has updated it again. So — really, yeah.
A: He did — I'm not sure exactly what he did there, maybe just rebased it to current master — but he's been actively, periodically keeping that up to date, so hopefully we'll be able to get that in. That looks like a really good PR, if it works. And that was it for updated PRs that I have. Is there anything I missed, or anything anyone wants to discuss related to those PRs?
A: All right, moving on then. So, I've presented some of this data to the core team already, but I'll share it here as well.
A: We had been worried that there were some mclock changes for QoS that we were a little concerned may have introduced a regression, but it doesn't really seem so. Maybe a little bit: there was some fluctuation that I didn't really dig too deeply into, but it's not clear there was anything other than just random variation.
The real big source of regression — well, there were two that we saw. One we already knew about and fixed; that was 7e3ece, on line 16. That was a change that Mark Kogan actually discovered first, where we ended up basically doing debug builds by default, and that had a real nasty impact — but that got fixed, so it was no longer a problem. The other change — the one we kind of suspected, and also kind of already knew could potentially cause some regression in this kind of workload — was when we switched to using bluefs_buffered_io = true by default. Again, we've gone back and forth on this a couple of times.
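(For reference, the option being discussed is the real BlueStore/BlueFS setting bluefs_buffered_io. It can be set in ceph.conf or flipped through the config database — though, depending on the release, it may only take effect at OSD start:)

    # ceph.conf: applies when the OSD starts
    [osd]
    bluefs_buffered_io = true

    # or via the monitor-backed config database:
    #   ceph config set osd bluefs_buffered_io true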
A: The reason that we switched that back to being true again is that, for some reason we still don't totally understand, RocksDB will not read blocks that you've recently read during iteration — like iteration of a collection listing — from cache. For some reason, it seems to go back to disk again when we have this kind of pattern. It may be that we did something, like deleting something, that caused invalidation of the cache. I've seen that in RocksDB before, where, potentially, you do a compaction and the entire cache is invalidated.
A: It's kind of weird. Maybe something like that is going on, or maybe their behavior is just kind of broken, but what we've seen is that RocksDB will often — maybe always — go back to disk when you re-iterate over something that you just iterated over, rather than using the block cache.
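(A small standalone way to check the behavior described above, using stock RocksDB APIs: iterate the same range twice and compare block-cache miss counters between passes. The database path is made up, and whether the second pass hits the cache is exactly the open question:)

    #include <rocksdb/db.h>
    #include <rocksdb/statistics.h>
    #include <iostream>
    #include <memory>

    int main() {
      rocksdb::Options opts;
      opts.create_if_missing = true;
      opts.statistics = rocksdb::CreateDBStatistics(); // enable hit/miss tickers

      rocksdb::DB* db = nullptr;
      rocksdb::DB::Open(opts, "/tmp/iter_cache_test", &db);

      auto scan = [&] {
        rocksdb::ReadOptions ro;
        ro.fill_cache = true;   // ask for read blocks to be put in block cache
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->SeekToFirst(); it->Valid(); it->Next()) { /* touch every block */ }
      };

      scan();
      uint64_t m1 = opts.statistics->getTickerCount(rocksdb::BLOCK_CACHE_DATA_MISS);
      scan();   // re-iterate the same data
      uint64_t m2 = opts.statistics->getTickerCount(rocksdb::BLOCK_CACHE_DATA_MISS);

      // If the second pass were served from the block cache, m2 - m1 would be ~0.
      std::cout << "pass1 misses=" << m1
                << " pass2 extra misses=" << (m2 - m1) << std::endl;
      delete db;
      return 0;
    }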
A: So what happens is that the buffer cache ends up doing all the heavy lifting. If you do direct I/O, you'll end up just doing all of your reads from disk; but if you have a significant amount of buffer cache, then you'll end up doing those reads from memory instead.
A: So that's why we made the change: because people were noticing this in real deployments during, you know, any kind of operation that would require collection listing or deletion or other things, and it was bad. We also were sort of able to replicate this ourselves through a benchmark I wrote called omapbench. That's just part of — well, it's not yet.
A: There's a PR for it, but we haven't merged it yet as part of the make check / gtest suite. And you can see some of those results here, lines O and P: with buffered I/O we're quite fast across the board, and in line P, when we have direct I/O on, you can see that for omap get and omap remove the time it took to do those tests increased fairly dramatically.
A: So, all this is to say that right now we're in kind of a hard place where, on one hand, direct I/O for us is faster for things like RBD 4K random writes, but it's potentially significantly slower for omap operations through RocksDB — for things like collection listing.
A: The good news here is that — if you go back to the other spreadsheet and look at the bisection results — Gabby had, earlier during this release (or during this development cycle), implemented a PR that moves all the allocation data for BlueStore out of RocksDB, and that was a significant performance win for things like 4K random writes. In that case, I think his PR is probably worth about a 20 or 25 percent improvement, and it basically makes it so that, even when using buffered I/O, we're actually faster than we were when we started Quincy. In those results you can see that, in fact, by switching back to direct I/O for RBD 4K random writes it's even faster — we're gaining like another 10 or 12 percent — but we don't strictly need it; we're still faster overall than we were when we started Quincy.
A: So, there is one other thing in here. I mentioned Adam's PR about doing direct I/O writes in BlueFS and buffered reads. That got us back to the 4K random write performance we're seeing in direct I/O mode, and it's possible — even fairly likely — that we'll see the better kind of performance in the omap read and deletion tests with that PR too. Maybe it's the best of both worlds.
A: The downside with this — there's some discussion in the PR you can see — but the real downside with this is that it could be very difficult, potentially, to guarantee consistency when you're doing reads from the Linux page cache while simultaneously doing direct I/O writes. We will need to be extremely careful, I think, if we go down that path, and make sure that we're not introducing any kind of inconsistency in corner cases. But, just for the sake of argument, it looks like there's potential there, so it may be worth further investigation.
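(A minimal POSIX sketch of the hazard being discussed: O_DIRECT writes bypass the page cache. The kernel tries to invalidate overlapping cached pages on a direct write, but open(2) explicitly discourages mixing the two modes on the same ranges, because concurrent buffered reads can race the invalidation and observe stale data. File name is hypothetical; error handling elided:)

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstring>

    int main() {
      const char* path = "/tmp/mixed_io_demo";
      int wfd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
      int rfd = open(path, O_RDONLY);        // buffered (page-cache) reads

      // O_DIRECT requires aligned buffer, offset, and length.
      constexpr size_t kAlign = 4096;
      void* buf = nullptr;
      posix_memalign(&buf, kAlign, kAlign);
      memset(buf, 'x', kAlign);

      char rbuf[kAlign];
      pread(rfd, rbuf, sizeof(rbuf), 0);     // may populate the page cache

      pwrite(wfd, buf, kAlign, 0);           // direct write: cache not updated

      // Defensive invalidation before trusting a buffered read again; this is
      // the kind of ordering a careful audit would have to guarantee.
      posix_fadvise(rfd, 0, kAlign, POSIX_FADV_DONTNEED);
      pread(rfd, rbuf, sizeof(rbuf), 0);     // now forced back to the device

      free(buf); close(rfd); close(wfd);
      return 0;
    }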
A: I see everyone from core joined, so, yeah, I'll open it up. Gabby — I want to specifically mention how great your PR is looking. Any concerns or comments?
B: Okay — okay, can you hear me now?

A: Yes.

B: Yeah, sorry — my mic keeps jumping between different devices. So: I saw in the doc, in the tables, that the commit levels seem very far apart. Did you just compare the code from before my commit and after? There was a big difference in between, with many other commits coming in between.
A: Yeah — and, honestly, it's all messy, right? Because that was during a severe regression that was introduced because we ended up doing debug builds by default.
B: [inaudible]

A: Yeah, I would look at line 27 — that result — and compare that to probably line 15. It's not exactly apples to apples, but — oh, sorry, no: line nine, actually. That would probably be the more — sorry, which lines? Nine, and compare that to line 27, approximately. It's not exact, but it's, you know, very close.
A: There's a lot of difference. The problem that we have right now is that we don't have any direct comparison, right? Basically, before line 10 — when we enabled bluefs_buffered_io by default — we were seeing fairly consistently around 66 to 68 thousand IOPS
for this particular 4K random write test. Once we did that, we were dropping down to, like, 58 to 61, I guess. And then, when we introduced the regression that made us compile debug by default — of course, then it was significantly worse.
A: So I guess, really, the significant changes over master that we saw — if you exclude this debug issue — were when we disabled bluefs buffered I/O, and when we introduced your PR. So those are kind of the really important parameters to look at. Line 27 is the average of those three runs on master head as of, like, a week ago, roughly.
A: That's just whatever the state of master is, with your PR. Assuming that your PR does that, then — yes.
[crosstalk]

A: Yeah — exact same test, exact same configuration. There will be some random variation involved, but the bigger thing — you're right — is that, I mean, on those lines it's just debug builds. So, I mean, you can't directly compare, right?
[crosstalk]

A: Oh, sorry — 10 is buffered I/O true; 9 is buffered I/O false.
[crosstalk]

B: Yeah — so by disabling buffered I/O, you remove a bottleneck which was part of the work needed for column family B.
B: So that was making them slower: without buffered I/O, the I/O was just slower, and the column-family work was affected by this. When you remove this bottleneck, then you get an improvement for the column family B work — but removing that code makes for an even better improvement. Okay, so I take it: 20 is with buffered I/O, and 27...
D: [inaudible]

A: Yes, yes — with a relatively high queue depth, and the data-set size tuned so that all onodes fit inside the cache.
B: And 24 to 27 — what's the difference between them?
A: Let's see — 24, 27... 27 is just the average of those three separate runs.
B: [inaudible]

A: Yeah, exactly, exactly. Because I didn't want to do three runs for every single commit in the bisection unless I needed to — it was really clear this time, so I didn't really need to gather more samples.
A: Yes, yes, exactly. Yeah — your PR is looking great and, like I said, it's the only reason right now that master head is faster than at the Quincy launch. So, kudos: that's a really good result.
A: [inaudible]

B: There are two that we're still investigating. One of them Adam thinks might be unrelated to my changes and maybe in his domain; the other one I have no explanation for, but I was able to see that, with my changes disabled, we are unable to get these failures — and, with my changes enabled, that failure happens on teuthology boxes. I'm still unable to get this thing to fail on my box. Maybe I'm doing something different from teuthology, but
the environment is different. But this thing doesn't fail on my box, which is very annoying. Again, none of them seems to be — it's not going to change the way we do things; there's probably some uninitialized something somewhere. So I don't expect that we'll need to do major changes once we find the root cause.
A: [inaudible]

B: It might be. And the other one — I don't know, I'm still thinking. Maybe I'm doing something too fancy for BlueFS in the way my file is stored; I'm doing too many truncates and such, which nothing else seems to be using. Maybe BlueFS doesn't like truncation.
[crosstalk]

A: So, the only other question here that I was kind of thinking about is — as Adam was talking about, you know, with his PR for doing direct writes and buffered reads — maybe this is the time to start looking at io_uring again, and potentially running io_uring in buffered mode and seeing how that does.
A: We have some code for io_uring, but I don't know if it's even working at this point — that whole interface has gone through some churn in the last year — so this may be worth reinvestigating. Gabby, have you ever looked at io_uring?
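("Buffered mode" here just means submitting I/O through io_uring against a file descriptor opened without O_DIRECT, so reads can be served from the page cache and writes go through it. A minimal liburing sketch; the file name is hypothetical and error handling is elided — build with -luring:)

    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      struct io_uring ring;
      io_uring_queue_init(32, &ring, 0);   // 32-entry submission/completion rings

      int fd = open("/tmp/uring_demo", O_RDWR | O_CREAT, 0644); // no O_DIRECT
      char buf[4096] = {0};

      // Queue one buffered read at offset 0 and submit it to the kernel.
      struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
      io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
      io_uring_submit(&ring);

      // Reap the completion; res is the byte count or -errno.
      struct io_uring_cqe* cqe;
      io_uring_wait_cqe(&ring, &cqe);
      printf("read returned %d\n", cqe->res);
      io_uring_cqe_seen(&ring, cqe);

      close(fd);
      io_uring_queue_exit(&ring);
      return 0;
    }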
C: I wrote an io_uring application in the past — not here, but we made use of io_uring — so I remember a lot of what I did there.
A: So — Ronen — we have code that came in from an external contributor that kind of hijacks our AIO path to support io_uring, and, I mean, even at the time it felt a little hacky to me, but it more or less worked. But I wonder if it might be time to start thinking about io_uring as a first-class, you know, kind of citizen for BlueFS.
A: Yeah — and I think, actually, it's all self-contained in the BlueStore directory. Let me look... well, "self-contained" — that's not entirely true, but —
A: Somewhere in here... now, I thought that it was actually a separate file, but it might not be. Yeah — let me look it up and I can send it to you. I actually don't remember exactly what the state of it is, so maybe I'm misremembering something, but I thought that we supported it in BlueFS.
A: Well, anyway — yeah, take a look, tell me what you think. You're probably one of the few people that's actually, you know, done anything with io_uring, so I'd definitely be curious to see what your thoughts are.
A: Yeah, I doubt anyone's even really tested it much. I once tested it a long time ago, and that was it — and that was even before CentOS supported it by default, so you'd have to use a custom kernel. Now, maybe it's different.
C: [inaudible]

A: All right — I think that's all I've got. So... oh, I'll open it up: any other topics that people want to talk about today?
B: [inaudible]

A: We sort of did, at one point — code was implemented and it more or less worked, I don't know, a year or two ago — but it wasn't showing any particular improvement. There was very little difference compared to using the AIO interface.
A: Theoretically, it may have shown some benefit, in that we were actually, at the time, able to make io_submit in the AIO path stall — which, you know, it's not supposed — well, theoretically it's maybe sort of not supposed to do that, but in reality you can make io_submit stall in the AIO path if you have too much backed-up I/O. So I think io_uring might help in that case; but in reality it was very — it was really similar.
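(For context: with libaio, io_submit() is nominally asynchronous, but it can block the submitting thread when the block layer's queues back up — which is the stall being described. A minimal sketch of that submission path; the file name is hypothetical and error handling is elided — build with -laio:)

    #include <libaio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstring>

    int main() {
      io_context_t ctx = nullptr;
      io_setup(128, &ctx);                  // kernel AIO context, 128 in-flight

      int fd = open("/tmp/aio_demo", O_WRONLY | O_CREAT | O_DIRECT, 0644);
      void* buf = nullptr;
      posix_memalign(&buf, 4096, 4096);     // O_DIRECT needs aligned buffers
      memset(buf, 0, 4096);

      struct iocb cb;
      struct iocb* cbs[1] = { &cb };
      io_prep_pwrite(&cb, fd, buf, 4096, 0);

      // This is where the stall can happen: if the device queue is saturated,
      // io_submit() can block the caller instead of returning immediately.
      io_submit(ctx, 1, cbs);

      struct io_event ev;
      io_getevents(ctx, 1, 1, &ev, nullptr); // reap the completion

      io_destroy(ctx);
      free(buf);
      close(fd);
      return 0;
    }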
A: It didn't show a whole lot of performance improvement in the tests that we looked at, so it's just kind of sat there for, like, two years.
D: I think it might show more benefit with Seastore in the future, rather than BlueStore — more likely to be hitting those kinds of bottlenecks, I'd expect, with much faster devices.
A: One thing I was curious about, Josh, is whether or not — I mean, it looks like Jens spent a little bit more time thinking about buffered I/O in the io_uring context than they did with the AIO path — I was wondering if buffered I/O may actually be closer to the direct I/O level of performance, without having to, you know, do this kind of mixed direct-writes, buffered-reads kind of thing that I was proposing.
D: [inaudible]

A: I don't know — I don't actually know why we're seeing the performance regression with buffered writes that we do. I mean, it's not entirely unexpected, but I don't know if it's specifically because of syscalls.
D: I guess my intuition is that it's unlikely to be syscall-related; it's more likely to be, like, the RocksDB behavior that you were talking about — well —
A: No — I was thinking more: why do we see BlueFS writing so much faster with direct I/O than with buffered I/O? That was the question that I have.
A: So my question is: given that it looks like Jens maybe actually thought a little bit more about buffered I/O in the io_uring context, maybe we could do buffered I/O with io_uring and get closer to direct-I/O levels of write performance, while still having reads coming from the page cache in a safe way.
D: I see — that's what you mean. Yeah... I would expect that to be more related to going through multiple memory copies in the kernel, in the buffer cache, yeah.
A: [inaudible]

D: Yeah, exactly — these are much higher numbers than what we saw even a few years ago. Yeah — speaking of the Linux storage stack, there's a paper that I ran across a few weeks back that was very interesting; it might be worth discussing in a future meeting, maybe. It's about reworking Linux internals to get much lower latency.
A: Yeah — that would be the 23rd. Looks good. I'm giving a talk on the 21st, I think, about CephFS performance stuff, so I might be, maybe, cramming and reading this on, like, the 22nd — but I think I can do that.
[crosstalk]

A: Oh — let's move it, then. Let's change it then. What about the 30th?