From YouTube: 2018-May-31 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
A: So, gosh, I guess he isn't here, but my personal take on this is that I don't think it really matters that much. I mean, I think it may marginally help in situations where you end up with a bunch of stale stuff in a high level that you don't really need to be there; it could be in, like, level three, even though it's stuck in level four for a while. So you get some additional write amplification. I don't know, I could be totally wrong, so kind of a question mark there.
A: One thing I would like to bring up with my PR is that it was failing these CephFS tests and I had no idea why. And I still have no idea why the changes I made actually fixed that; it's now passing the make check and I don't know why. I mean, I know I fixed a bug, a really weird bug that I spotted after seeing those failures, but I have no idea why that fixed it. Yeah.
C: I had Adam review the BlueFS stuff I'd changed. I haven't heard back on this one yet, but it looks like he liked it, that's good! So that mostly just needs QA. And the I/O throttler Eric extracted: he's finished extracting it, I guess, or is planning on extracting it out from all the other dmclock stuff.
A: That's just me making the assertion that someday we'll be able to say "here's how much memory the OSD should use" and try to stick to it. So yeah, all of this memory work from the last week or two that I've been looking at, on how much memory we use, and then also this work towards dynamically adjusting memory use, or memory pool assignments, based on priorities: I think all of that kind of flows together into this, you know, really.
A
What
we
want
is
to
say
yeah
try
to
stick
to
this
I,
don't
think
we
can
guarantee
it,
but
we
can
at
least
say
try
to
stick
to
this
memory
usage
kind
of
adjust
things
to
decrease
until
we're
close
to
that
memory,
usage
or
right
around
it,
and
then
you
know
do
our
best
I
think
we
can
do
that,
though
I
think
we
can.
We
can
we've
maybe
got
enough
data
to
be
able
to
say
you
know:
here's
what
we
want
to
target.
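A minimal sketch of the kind of feedback loop being described here, with entirely hypothetical names (the real mechanism was still under design at this point): compare current usage against the target and nudge the cache budget toward it.

```cpp
#include <cstddef>

// Hypothetical sketch: shrink or grow the bytes handed to the caches so
// that overall memory use drifts toward the user's target. These names
// are illustrative, not from the actual pull request.
size_t adjust_cache_budget(size_t mapped_bytes,   // memory currently in use
                           size_t target_bytes,   // user-requested target
                           size_t cache_budget) { // bytes given to caches
  if (mapped_bytes > target_bytes) {
    // Over target: shave a fraction off and let the priority-based
    // balancer redistribute what is left.
    cache_budget -= cache_budget / 16;
  } else if (mapped_bytes + target_bytes / 10 < target_bytes) {
    // Comfortably under target: grow slowly back toward it.
    cache_budget += cache_budget / 32;
  }
  return cache_budget;
}
```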
C: There's going to be a point where we add a configuration option that's like limiting memory, and then everything else sort of gets subsumed by that, but we're not, I don't think we're there yet. You know, we don't have all the pieces in place. The one thing that I think we do need in the short term, that I was planning on doing directly on top of your pull request assuming that merges, is just something that adjusts based on the observed overhead that the allocator seems to be adding in. And I'm not sure how to do that.
C: I tried to avoid the heap profiler because I thought that was an optional thing that we wouldn't want to turn on, probably, and so the interfaces I found were mallinfo and so on. Basically there's a malloc_stats, er, tcmalloc stats, which prints everything to standard out, so it's absolutely useless. And there's a tc_mallinfo, which returns this mallinfo structure, which has like ten fields with really confusing names. And then, if I remember correctly, there's this comment that says it's only for arena zero, and there are like fifty arenas or something like that. So.
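For reference, gperftools also exposes named counters through MallocExtension, which avoids both the stdout dump and the mallinfo struct; a minimal sketch of reading the generally useful ones:

```cpp
#include <gperftools/malloc_extension.h>
#include <cstddef>
#include <cstdio>

// Read tcmalloc's generic counters instead of tc_mallinfo.
void report_tcmalloc_usage() {
  size_t allocated = 0;      // bytes the application has allocated
  size_t heap_size = 0;      // bytes tcmalloc has claimed from the OS
  size_t pageheap_free = 0;  // free bytes held in the page heap
  size_t unmapped = 0;       // bytes already released back to the OS
  MallocExtension* ext = MallocExtension::instance();
  ext->GetNumericProperty("generic.current_allocated_bytes", &allocated);
  ext->GetNumericProperty("generic.heap_size", &heap_size);
  ext->GetNumericProperty("tcmalloc.pageheap_free_bytes", &pageheap_free);
  ext->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);
  std::printf("allocated=%zu heap=%zu free=%zu unmapped=%zu\n",
              allocated, heap_size, pageheap_free, unmapped);
}
```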
A: One option that we have, given that for the cache balancing we already do this on a really coarse-grained basis: it might not be terrible to just grab a heap profile every 5 seconds or, you know, whatever, right, and then use that to figure out if we're close.
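In other words, something like a coarse polling loop (hypothetical; report_tcmalloc_usage() is the sketch above):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

void report_tcmalloc_usage();  // from the sketch above

// Hypothetical: sample allocator stats on the same coarse cadence as the
// cache balancer, rather than instrumenting every allocation.
void poll_allocator_stats(std::atomic<bool>& stop) {
  while (!stop.load()) {
    report_tcmalloc_usage();
    std::this_thread::sleep_for(std::chrono::seconds(5));
  }
}
```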
A: I'll try to follow up with Adam tomorrow or on Monday, see if he's made any progress on looking at this stuff, yeah. Okay, the other thing that would be really useful to get is fragmentation information, right: how fragmented has memory become? Because we don't even know that yet. We suspect, but yeah.
A: I also want to design an interface for this that can reasonably take into account scaling factors for different things, right? Like, when we assign priorities: if allocating 50 megabytes to something really results in an 80 megabyte allocation, whereas for something else allocating 50 megabytes results in a 55 megabyte allocation, the priority-based mechanism for dealing with this stuff should be able to take that into account when assigning memory for the different things.
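That is, each consumer would be charged its observed cost rather than its nominal request; a tiny hypothetical sketch:

```cpp
#include <cstddef>

// Hypothetical: budget each component by what its allocations have been
// observed to really cost, so a 50 MB request that maps to 80 MB of
// allocator usage is charged as 80 MB.
struct Component {
  const char* name;
  size_t requested_bytes;    // nominal request
  double observed_overhead;  // e.g. 1.6 when 50 MB requests cost 80 MB
};

size_t effective_cost(const Component& c) {
  return static_cast<size_t>(c.requested_bytes * c.observed_overhead);
}
```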
C: Okay, well, it sounds like that's the next step. Well, I think we just, you know: we want to get Mark's PR reviewed, tested, and merged (yep), and then, on top of that, add in something that looks at these tcmalloc stats and, based on the disparity between what it has claimed from the OS and what it's actually allocated, if there's a 30% difference, it'll scale the BlueStore cache size or whatever (yep) based on that. I think that's what we want.
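A sketch of that disparity-based scaling, assuming the tcmalloc counters shown earlier (names hypothetical):

```cpp
#include <gperftools/malloc_extension.h>
#include <cstddef>

// Hypothetical: if tcmalloc has claimed 30% more from the OS than the
// application has allocated, give the caches proportionally less.
size_t scale_cache_target(size_t configured_target) {
  size_t allocated = 0, heap = 0;
  MallocExtension* ext = MallocExtension::instance();
  ext->GetNumericProperty("generic.current_allocated_bytes", &allocated);
  ext->GetNumericProperty("generic.heap_size", &heap);
  if (allocated == 0 || heap <= allocated)
    return configured_target;  // no measurable overhead
  double overhead = static_cast<double>(heap) / allocated;  // 1.3 = 30%
  return static_cast<size_t>(configured_target / overhead);
}
```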
C: Yeah, I think the only question is: if it's above some threshold, and we haven't done it in a while, should we automatically do the tcmalloc heap release, whatever, thing? I know that we used to do that in production situations where, remember, memory got out of control and we would have to poke it, and that sort of seems to not be necessary in newer versions of tcmalloc. But I was never involved in those support cases, you know. Do you happen to remember this, Greg? (I do.)
A: I did a test recently with the RGW case, which was where the OSD was using the most memory, and in that case it was about 10%. It was like six and a half or seven gigs of OSD RSS memory usage, and it was like 600 megabytes of, you know, memory that could be freed back to the OS.
A: One thing that I want to do is add, like, PG log memory that's pinned, into the pinned memory in the cache balancer, so kind of including things like that. Like, some of the memory that's allocated is going to be reserved for the PG log, or it's going to be reserved for other things. Right now, one of the things that's reserved is memory for indexes and filters in RocksDB. Mm-hmm.
A
So
so
there's
have
this
priority:
zero
level
of
reserved
space
for
things,
and
if
we
can't
hit
it,
then
we
actually
assert
right
now.
Initially
by
saying
you
know,
we
can't
support
this.
It
doesn't
for
or
rocks
B's
indexes
and
filters
at
that
point
it
will.
It
will
try
to,
but
it
won't
successfully.
A
It
might
make
sense
that
we
say
the
user
specifies.
I
can
only
have
one
gig
of
memory
for
the
OSD
and
they
specify
enough
PGS
per
OSD
that
they're
gonna
exceed
that
up
front.
We
might
actually,
in
that
case
one
has
a
certain
tell
them
you,
you
don't
have
enough
memory
assigned
to
do
this,
do
something
to
tell
them
yeah.
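Put differently, an up-front check of pinned reservations against the target; a hypothetical sketch:

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical start-up validation: if priority-zero (pinned) reservations
// such as PG logs and RocksDB indexes/filters already exceed the user's
// memory target, say so instead of asserting later.
bool validate_memory_target(size_t target_bytes, unsigned num_pgs,
                            size_t pinned_per_pg, size_t other_reserved) {
  size_t needed = other_reserved + num_pgs * pinned_per_pg;
  if (needed > target_bytes) {
    std::fprintf(stderr,
                 "memory target %zu too small: %u PGs pin at least %zu\n",
                 target_bytes, num_pgs, needed);
    return false;
  }
  return true;
}
```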
D: I've just got one small thing. (Yep, go ahead.) Was there any change in the locking during scrubbing between Jewel and Luminous? I'm only asking because I upgraded a cluster to Luminous, and when I've got scrubbing going on now, in sort of a RADOS bench where I was getting about one or two milliseconds average write latency, I'm now seeing it in the sort of 30 millisecond range, and putting in scrub sleeps and stuff like that seems to help bring it down.
A: Do you know which lock, or which locks, would be involved?
D: I haven't managed to look into that deeply. It was just sort of one of the last clusters we upgraded to Luminous, which is where we've probably got the most monitoring. You know, we've got a collectd daemon which just does a RADOS bench every 60 seconds, and I noticed that it went from sitting at about one millisecond up to about thirty post-upgrade.
D: It might also be normal scrub; I don't think it had as large an effect, but it was still there. But if I run, like, a fio on an RBD and do 4K sequential writes, it's spending a lot of time per PG effectively, and you would see, when it hits one of the affected PGs, suddenly it would just go to almost zero I/O for a couple of seconds, and then, as it gets to the next one over, it spikes back up.
C: Yeah, there was a major change that happened after Luminous, for Mimic, where if you're scrubbing an object and an I/O comes along, instead of the I/O waiting for the scrub to finish, the I/O preempts the scrub and the scrub will get repeated afterwards. That should give a significantly reduced RADOS latency. But we didn't actually backport that to Luminous, though I'd be kind of curious what you would see if you went to Mimic on that cluster. But since you just moved forward to Luminous, you're probably okay for a while, yeah.
C: Go ahead. I think you can do a clear of the historic ops on an OSD, and then maybe do the bench, get some of those sort of slow writes, and then dump them. There might be a clue in there: the dump_historic_ops will show basically your long-tail requests, and that might tell us something. Okay.
D: Okay, I mean, I assume, from a very high level, what's happening is that these are 7.2K disks, and it's probably trying to read, like, a chunk of data to put in the scrubber, you know, maybe four megs or something, which is probably taking, yeah, 40, 50 milliseconds, and that's just basically holding all the I/O until that goes through. I don't know why that would be any different to what Jewel was doing, but...
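As a rough sanity check of that figure (assuming roughly 100 MB/s of sequential throughput and about 8 ms of average seek time for a 7.2K drive):

$$t \approx t_{\text{seek}} + \frac{4\ \text{MB}}{100\ \text{MB/s}} \approx 8\ \text{ms} + 40\ \text{ms} \approx 48\ \text{ms}$$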
D: Yeah, I mean, it sounds like pretty much what it's doing now would be correct, what you would expect. You wouldn't want the data possibly changing in that PG whilst it's checking it, I guess. The problem is, depending on the amount of data you read, you're going to have that blocking the I/O from the disk for that amount of time.
C: Yeah, I don't have a good answer. I'm guessing that some variation of that is what's happening, and that Mimic should mitigate it by preempting that, letting those operations preempt the scrub, but I think... I don't know, yeah. If you can reproduce it in another environment that we can play with, that would be okay.