From YouTube: 2020-05-21 :: Ceph Performance Meeting
A: All right, let's see, what else do we got here? This one optimizes the locking of the BlueStore writing process. I should probably dig into that; I don't know that anyone else is going to, and I'm scared of the locking used in BlueStore. But theoretically that offers a nice performance improvement, if it actually is safe. I think I can punt on this for a little bit, because it first means driving through teuthology, but if it passes teuthology then I probably should actually dig in and see if it really looks safe or not.
A: Lots of stuff with no movement. I actually didn't make it through all of them, but usually the stuff at the end doesn't get updated anyway, so I don't anticipate there was a whole lot there that we missed. I know there were closed PRs for performance that I saw this week, but it's a little bit of a slow time at the moment, though that's not terribly surprising.
A: Okay, no problem. So, lots of users are reporting on the mailing list that they're seeing BlueStore onode memory growth, and I think, Adam, you are also seeing memory usage growth during tests with compression enabled. And it looked to me, from the work that you just did yesterday and today, like you were seeing that when objects are pre-created but not yet filled, the onodes use more memory. Or is this incorrect? That, I guess, is my question for you.
C: Yes, exactly, that's the issue. The objects start with very low overhead, but then the overhead grows very quickly. The other question, correlated with this for me for the next investigation, is: why do we need about one megabyte to store the metadata for a 32-megabyte object? That's crazy stuff, but I'm working on this now.
A: Cool, that's excellent work, Adam; this is a very, very good thing to find out. So, Adam, you have your proposed fix, which is very simple and very convenient: just do the trim in the mempool thread. I'm a little nervous about doing it there; that's kind of what we used to do, we trimmed periodically in the mempool thread.
A: Potentially that's maybe why we didn't used to see it. Now we trim every time a new thing is entered into the onode cache, and that lets us trim while we are already holding the lock, and also trim in the different tp_osd_tp threads rather than in a single thread like that. The reason I'm a little nervous about doing the trim in the mempool thread is, one, I think that the way your PR would implement it, it would get rid of the memory eventually, but you can still have temporary spikes. And the other thing I'm a little bit nervous about is that if we were to increase the frequency of that, we might be back in the same situation we were in before, where we were grabbing that lock and doing those trims very often and it impacts performance.
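(For context, here is a minimal sketch of the two trim strategies being contrasted, assuming a simplified LRU cache behind a single lock. The names are illustrative; the real BlueStore onode cache and mempool thread logic are considerably more involved.)

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <list>
    #include <mutex>
    #include <string>
    #include <thread>

    // Illustrative only -- these are not the actual BlueStore types.
    class LruCache {
      std::mutex lock_;              // stand-in for the cache shard lock
      std::list<std::string> lru_;   // stand-in for cached onodes
      std::size_t max_size_ = 1024;

     public:
      // Current approach: trim inline on every insert, while the
      // inserting tp_osd_tp thread is already holding the shard lock.
      void insert_and_trim(std::string key) {
        std::lock_guard<std::mutex> g(lock_);
        lru_.push_front(std::move(key));
        while (lru_.size() > max_size_)
          lru_.pop_back();           // evict the coldest entry
      }

      // Proposed approach: a dedicated thread (like the old mempool
      // thread) wakes periodically and trims. Memory can spike between
      // wakeups, and raising the wakeup frequency reintroduces lock
      // contention on a hot cache.
      void periodic_trim_loop(const std::atomic<bool>& stop) {
        while (!stop) {
          {
            std::lock_guard<std::mutex> g(lock_);
            while (lru_.size() > max_size_)
              lru_.pop_back();
          }
          std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
      }
    };

Either loop keeps the cache bounded; the disagreement above is about which thread pays the cost of taking the lock, and when.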
C: First thing: I guess that what you say is actually incorrect, because, okay, if you add some object to a concrete shard, then the watermarks will be tested and the trim can occur. The problem stems from the fact that there was no traffic on some onode cache shards, and then they just stayed big; I mean bigger in size, but still with the same amount of objects. So that is the case. But that may be a problem with efficiency when doing periodic trimming.
A: Does that make sense?

C: Well, yes, it makes sense, but this defeats the original purpose of not actually touching onodes when you only operate on data. So if you were operating on, say, the shared cache, the data cache, the buffer cache, then you would suddenly have to jump to the onode and do some poking there, I guess.
A: Yeah, I haven't looked at where it would be convenient to do something like that; you've probably looked at it more recently than I have. But if we have an opportunity to do something along those lines, I think it might be worth trying, as opposed to doing it in the mempool thread like this. But we can see. Okay.
A: Okay, so the other topic I want to bring up today is, I think, good information. There's been a lot of discussion right now inside Red Hat about whether to run multiple OSDs on one device, and whether we should be building tools for doing partitioning in such a way as to make that very convenient for an end user, all these kinds of things. So one question that came up is whether or not it's still really beneficial to do this, so over the last weekend I ran a number of tests.
A: In Nautilus we saw a very significant gain from running multiple OSDs on an NVMe device, two OSDs on one NVMe device. In all the tests it was better, but the results ranged from 10% better up to, in one case, around 80% better for small random reads. In Octopus and master it's much more mixed.
A: Our single-OSD case is much faster than it used to be, to the point where in many cases now, in these tests anyway, it was actually faster to run a single OSD on an NVMe device than it was to run multiple OSDs on an NVMe device, and there are a couple of reasons for that. I think the cache refactor from last summer played a big role in some of this, and then Jianpeng had a very good PR in master.
A
That
is
very
simple,
but
it
we
do
notify
all
instead
of
notify
one
in
the
Shara
dock
work.
You
I
think
that's
a
large
part
of
the
reason
why
a
small
random
read
performance
continued
to
increase
and
master
the
point
now
where
it's
is
faster
with
one
OSD
vs.,
2
OS
T's.
We
previously
just
were
not
making
good
use
of
all
the
different
threads
we
weren't
breaking
them
all
up.
A
We
were
only
waiting
up
one,
so
it's
is
interesting,
I'm
getting
a
feedback
from
from
from
Intel
that
they're
not
seeing
the
same
kind
of
improvement
that
we
are,
though
we
do
need
to
do
some
more
investigation
into
why
there
they're,
seeing
regression
actually
with
both
octopus,
but
at
least
here
I'm
fairly,
confident
that
these
results
are
real.
So
the
good
news
is
that
I
think
we
have
made
some
some
real
improvements
here.
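(A minimal sketch of the wakeup behavior being described, using a generic condition-variable work queue. The names are illustrative; this is not the actual ShardedOpWQ code.)

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>

    // Illustrative work queue, not the real ShardedOpWQ.
    class WorkQueue {
      std::mutex lock_;
      std::condition_variable cond_;
      std::deque<std::function<void()>> items_;

     public:
      void enqueue(std::function<void()> item) {
        {
          std::lock_guard<std::mutex> g(lock_);
          items_.push_back(std::move(item));
        }
        // notify_one wakes at most one sleeping worker per enqueue, so
        // a burst of small requests can leave idle workers asleep even
        // though work is queued. notify_all wakes every waiter (at the
        // cost of some spurious wakeups) and lets them race for work.
        cond_.notify_all();
      }

      std::function<void()> dequeue() {
        std::unique_lock<std::mutex> g(lock_);
        cond_.wait(g, [this] { return !items_.empty(); });
        auto item = std::move(items_.front());
        items_.pop_front();
        return item;
      }
    };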
D: Okay, good. So, this is Danny Abdul Kalam; I've just joined midway through this call, and it's the first time I've joined, so I'm probably missing quite a lot of context, but these test results look really interesting. I was wondering: is there somewhere online where I could find documentation or information about the process you took and the tools you used to run these tests and benchmarks?
A: Sure, I can tell you this was run using CBT, so that's kind of our general test framework for running benchmarks that we use anyway, and CBT can run a couple of different benchmarks. In this case it was fio. Are you already familiar with fio at all?

D: I am.
A: Okay, so all CBT really is, is a tool that goes through and runs a combination of different benchmarks with different parameters. You define in the YAML what different parameters you want to run with, and it will run through a bunch of different tests that way. You know, fio can already do some of this itself, but CBT is kind of a more generic way to do this across multiple benchmarks.
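(Purely for illustration, a rough sketch of the kind of CBT YAML being described. The key names here are approximations from memory, not a verified schema; consult the CBT repository at https://github.com/ceph/cbt for the real format.)

    cluster:
      user: 'ubuntu'
      head: 'node0'
      clients: ['node0']
      osds: ['node1', 'node2']
      mons: ['node0']
      iterations: 1
    benchmarks:
      librbdfio:
        time: 300
        vol_size: 16384
        mode: ['randread', 'randwrite']   # CBT sweeps each listed value
        op_size: [4096, 16384]            # ... and each combination
        iodepth: [16, 32]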
A: So there's two different things you can do: CBT can launch and run COSBench, but COSBench is really almost its own framework, right? It's very, very complex, and it does a lot. So some people have done that and that can work, although I think a lot of people will just run COSBench on its own, which is also totally fine. I also wrote a benchmark called hsbench ("hot sauce").
A: The original authors have moved on to other projects, so I don't know how much maintenance it's getting these days. So yeah, feel free to try CBT if you'd like. You know, it's not the most documented tool in the world, but people are actually using it; the Crimson team has now been using it, mainly for regression testing. So it's getting a little bit more love from multiple people now, so hopefully we can make the documentation look better. Cool, yeah.
E: Okay, so just a general FYI for everybody.
E: Yesterday we got a couple of complaints from users on the mailing list about data corruption in the write-ahead log of RocksDB. Both cases happened after an upgrade from Octopus .1 to .2, and at the same time Neha reported a similar issue in a run on master. Finally, I managed to reproduce it locally as well, and after some investigation, which is not completed yet, it looks like two recent modifications triggered this issue.
E: As I said, these are preliminary results, but I can't explain why I am getting 32K of zeros instead of valid data in the log. I checked whether we pass a wrong buffer and that's why it's like that, but that's not the case. But I can definitely see multiple running I/Os for each reproduction, so I think that's the cause, but I haven't proved that 100% yet.
A
A
So
so
they're,
the
original
reason
why
we
disabled
blue
store,
buffered
I/o,
was
because
it
was
causing
significant
problems
with
rgw
performance,
where
we
were
actually
seeing
that
we
were
digging
into
swap,
oddly,
even
when
there
was
significant
memory
available,
and
we
didn't
really
understand
why
so
we
switched
back
Igor.
Do
you
think
we
should
revert
that
change
and
go
back
to
an
overnight
on
no.
C: This is true. I have a feeling I would have to revise it, because I don't remember correctly, but it came out as a fix for problems I had when actually changing the RocksDB sharding multiple times on the same volume; without that fix, extending it almost always failed. So I would be cautious about disabling that part.
C: The thing was that I had a stress test that kept changing the RocksDB sharding schema, and the problem was that if the device was not fully zeroed, but just reused as it was, then sometimes after a restart of BlueStore, basically, RocksDB couldn't pick up its write-ahead logs and continue appending there. So that was the reasoning, but I could never repeat the problem if in all tests I always had the same set of column families.
C: The original reasoning for that was to speed things up, to avoid adjusting the size with each write to a new write-ahead log. So we had a larger write-ahead log and just appended the data, never bothering to update the BlueFS log that contained the metadata with the size; that way we were a bit faster. It also makes it more like what happens in normal RocksDB operation, when RocksDB rotates and reuses previous write-ahead logs and just appends data from the beginning.
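(To illustrate why recycling a write-ahead log without zeroing it is only safe with careful record framing, here is a minimal sketch. The names and layout are invented for the example and are not RocksDB's real record format; the point is that replay must be able to tell where the new log's records end and stale bytes or zeros begin.)

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Invented record header, loosely modeled on the idea that each
    // record carries a checksum and the number of the log incarnation
    // that wrote it.
    struct RecordHeader {
      uint32_t crc;         // checksum over the payload
      uint32_t length;      // payload length in bytes
      uint64_t log_number;  // which log incarnation wrote this record
    };

    // Decide whether the record at 'p' belongs to the current log.
    // In a recycled file, stale records from the previous incarnation
    // fail the log-number check, and unwritten regions (zeros) fail
    // both checks; that is how replay finds the logical end of the log.
    bool record_is_valid(const uint8_t* p, std::size_t avail,
                         uint64_t current_log_number,
                         uint32_t (*crc32)(const uint8_t*, std::size_t)) {
      RecordHeader h;
      if (avail < sizeof(h)) return false;
      std::memcpy(&h, p, sizeof(h));
      if (h.log_number != current_log_number) return false;
      if (avail < sizeof(h) + h.length) return false;
      return h.crc == crc32(p + sizeof(h), h.length);
    }

If the metadata recording the log's size is not updated on each append, replay depends entirely on checks like these to avoid reading stale data or zeros as valid records, which is consistent with the corruption symptoms described above.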