From YouTube: 2016-SEP-28 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A: Let's see, there's this first one, about garbage collection for partially overlapped blobs. I'm curious to see how much that helps, since that's basically what's been slowing down reads over time, and it's good that we've got that in there. What else is in here? Oh, this isn't BlueStore, but this is Sage's PR for less frequently updating certain PG and OSD stats.
A: So originally we had been looking at just encoding the whole thing, and, you know, that could be a kind of brute-force, easy fix, but I think this is actually going to be better. The conditional checks it's doing here are probably lower overhead than the encoding was, so yeah, that looks good. I'm excited to see how much that helps.
A: Speaking of that, that one's kind of going through testing right now. It looks like, in the very most recent version of that pull request, we may be seeing a regression, specifically with aged random reads, or random reads on an aged system. So we're kind of trying to track that down. It's a little bit annoying, but hopefully we'll figure that out soon.
A: So let's see, what do we have here? I guess we've already talked in the discussion topics about some of these pull requests and things. Sage, is there anything you wanted to add to any of that?
B: Looks like you're cutting out. Yeah, I mean, I think this seems independent of them using the slab allocation for, like, small_vector and slab vector and so on.
B: Yeah, we lost him. All right, I'll have to come back to this. Yeah, I don't know if it's possible to just instantiate an existing standard allocator, std::allocator, in a class, or if you want to use one of the boost ones and try to optimize it for the type, I don't know, but I think simple to start, because I think the biggest thing I want is just the memory accounting, so that I can make the backpressure stuff trim caches based on actual memory used instead of these hard-to-configure counts.
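The "simple to start" memory-accounting idea described here could be sketched as an STL-compatible allocator that wraps `std::allocator` and maintains a global byte counter for backpressure to consult. This is a minimal illustration, not Ceph's actual implementation; `g_accounted_bytes` and `accounting_allocator` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical global byte counter that cache backpressure could consult
// to trim based on actual memory used rather than element counts.
static std::atomic<std::size_t> g_accounted_bytes{0};

// Minimal STL-compatible allocator: delegates to std::allocator and adds
// byte accounting on every allocate/deallocate.
template <typename T>
struct accounting_allocator {
  using value_type = T;
  accounting_allocator() = default;
  template <typename U>
  accounting_allocator(const accounting_allocator<U>&) {}

  T* allocate(std::size_t n) {
    g_accounted_bytes.fetch_add(n * sizeof(T), std::memory_order_relaxed);
    return std::allocator<T>().allocate(n);
  }
  void deallocate(T* p, std::size_t n) {
    g_accounted_bytes.fetch_sub(n * sizeof(T), std::memory_order_relaxed);
    std::allocator<T>().deallocate(p, n);
  }
};
template <typename T, typename U>
bool operator==(const accounting_allocator<T>&, const accounting_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const accounting_allocator<T>&, const accounting_allocator<U>&) { return false; }
```

A cache could then trim whenever `g_accounted_bytes` exceeds a byte budget, which is the backpressure behavior described above.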
B: What I'd do is change the Ceph configurables so they're defined in terms of a fraction of the total cache size that we allocate to RocksDB, so that there's one knob: somebody says, "I want the OSD to use one gigabyte of RAM," or whatever, and then we figure out how much to give RocksDB, and so on. Yeah, but I think that can come later; for now it's all just sort of crammed into that RocksDB tunable string, yeah.
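The "one knob" idea, a single total budget with per-consumer fractions derived from it, could be sketched like this. The struct and field names are illustrative only, not actual Ceph config options.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: the operator sets one total OSD cache budget, and
// each consumer's share (RocksDB block cache, BlueStore caches) is derived
// as a fraction, instead of each being tuned independently.
struct CacheBudget {
  uint64_t total_bytes;       // the single knob, e.g. 1 GiB
  double rocksdb_fraction;    // share handed to the RocksDB block cache
  double bluestore_fraction;  // share kept for BlueStore's own caches

  uint64_t rocksdb_bytes() const {
    return static_cast<uint64_t>(total_bytes * rocksdb_fraction);
  }
  uint64_t bluestore_bytes() const {
    return static_cast<uint64_t>(total_bytes * bluestore_fraction);
  }
};
```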
B: Yeah, no, thanks. So probably the hardest thing the OSD does, where it's going to be the most sensitive, is that it has these PG logs that are a bunch of omap entries. The inserts are spread out over time in different parts of the namespace, and if you have an OSD that goes down and then comes back up, if you do backfill, you have to read these logs in... no, wait, I take that back, that happens on startup, so it should be okay, yeah.
B: ...needs it. I mean, everything, even outside of BlueStore, needs it. Like, this has always been annoying: figuring out how much memory the OSD is going to use, and tuning it to use that much memory, is a black art. So maybe we solve the general allocator problem and we'll be able to clean up a lot of this stuff. Actually, the MDS is pretty bad too. Yes, okay.
B: Anyway, I think the next step is to continue the discussion on the list. I don't think we have anything we need to go through right now. Sure, let's see, the fast info stuff is just queued up for testing; I don't think there's anything else to do there. I could do some performance tests, I guess, to see how much it helps on FileStore and on BlueStore. It should reduce the metadata load quite a bit, but to see what the performance impact is, we'll just have to do some testing there.
B: Actually, let me... I'm going to keep rebasing that branch as I fix things, and I'll pull it up to date with master. So I'm going to make sure it's bisectable, and once I do that, then I think we could retest on that, yeah, and figure out where we broke it. Because there are a few things I have changed, but I don't really understand why they would have affected it. Yeah.
A: Me neither. It seems to be fairly clear, though, because I've gone through like seven or eight different full tests at different points, and it seems like I'm pretty consistently seeing one range and another range of results that don't really overlap. So, you know, there's variation within each one, but they seem to be distinct clusters. So...
B: I think it's because, at least before, you had a thread per connection that calls into the fast dispatch, so if it's a little bit slow, it only affects that connection. Now it's a pool of threads that's handling all of the reads and writes off the socket, and so if fast dispatch slows down, it kind of slows down the whole thing. Okay, I'm guessing, I'm guessing that's why. For me that's high-level and hand-wavy, I guess, but yeah, I think it may even explain our suicide timeouts; there are some blocks.
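The coupling being described, where a shared worker pool lets one slow fast-dispatch call delay unrelated connections, can be illustrated with a minimal shared work queue. This is a sketch of the general pattern, not the actual messenger code; `DispatchPool` is a hypothetical name.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal shared-worker pool. With pooled dispatch (AsyncMessenger-style),
// every connection's work funnels through a few workers, so one slow job
// ties up 1/n of total dispatch capacity for everyone; with the old
// thread-per-connection model it would only stall its own connection.
class DispatchPool {
  std::queue<std::function<void()>> q_;
  std::mutex m_;
  std::condition_variable cv_;
  std::vector<std::thread> workers_;
  bool stop_ = false;

 public:
  explicit DispatchPool(int n) {
    for (int i = 0; i < n; ++i)
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> job;
          {
            std::unique_lock<std::mutex> l(m_);
            cv_.wait(l, [this] { return stop_ || !q_.empty(); });
            if (stop_ && q_.empty()) return;  // drain before exiting
            job = std::move(q_.front());
            q_.pop();
          }
          job();  // a slow job here delays all queued connections
        }
      });
  }
  void submit(std::function<void()> job) {
    { std::lock_guard<std::mutex> l(m_); q_.push(std::move(job)); }
    cv_.notify_one();
  }
  ~DispatchPool() {
    { std::lock_guard<std::mutex> l(m_); stop_ = true; }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }
};
```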
A: As long as you have, like, 128, it seems like a lot of times SimpleMessenger should be faster than AsyncMessenger. It might not hold consistently all the time, but it looks like that at least with jemalloc, which gives similar results to what you see with tcmalloc and a larger thread cache. In that case, that's where we see Simple higher than Async. It's kind of like, once the memory allocator is out of the way, however you achieve that, then Simple's faster than Async, yeah.
A: The other thing is, my tests were with BlueStore and his were with FileStore, so it might be that either BlueStore's faster and we're hitting this bottleneck sooner, or maybe BlueStore is just making it worse somehow. I don't know. Yeah, maybe.
A: One question I had for you, Sage, speaking of perf: in those traces you were getting recently, did you see anything related to the bitmap allocator? Or do those still show almost nothing?
B: Well, I still don't trust my traces, because they look completely different from yours and they show no detail. My only theory was that maybe my compiler version is more aggressively inlining, and so I don't see a lot of the tail ends of the call chain, but I don't know. It's really weird: when I look at your profile, I'm like, oh, I can fix this and this, and then I look at mine and there's nothing to do.
B: I don't know. I think let's get the encoder stuff sorted out and then I'll spend more time on that again. I have a couple of things, okay, we've got to fix first, but yeah.
B: ...about getting Kraken out: hammering out upgrade tests and getting other stuff cleaned up to merge. Yeah, I do want to get the exciting stuff in, yeah.
C: There's some other kind of low-level stuff, but, you know, I think a lot of it revolves around being able to save the container code that we've got. I think once you sort of treat that as a prerequisite, you know, then you're dealing with an STL allocator and...
C: You know, and, you know, they're not that hard, fundamentally.
C: ...the reason we have that. So, you know, all I care about is some identifier with some affinity to the CPU, so what I was going to propose is basically the base of the stack frame, which is what I think the pthread... I forget if it's the pthread ID, or pthread_self, or something like that. But it's a cheap thing that you can get.
C: ...you know, as long as you only do that infrequently, then it's not too bad. You know, I was figuring 32 or 64, but I think that's probably plenty of disambiguation. It probably only really needs to be, like, maybe two or three times the core count that this OSD is scheduled against. You know, what it really says is you probably want to partition your cores against your OSDs, yeah, yeah.
C: Looking at, like, you know, an atomic increment after some shifting and masking to pick one, you know, with some affinity to the current core. I'm guessing the net of it should be, you know, a dozen instructions or something like that, times whatever the effectiveness of the sharding is on the cache misses. That's...
that's.