From YouTube: 2018-Jun-7 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
A: All right, maybe we should just get started here. Oh good, there's Josh. All right. I actually don't have a whole lot this time; I have to apologize, I was trying to do like three things at once this morning and forgot to go through the PRs, so next week I guess we'll have two weeks' worth. But there are a couple of different things going on right now. Maybe I'll highlight Igor's continued work on the bitmap allocator, and I think that Adam has been reviewing that, which is good.
C: In the profiling work, I can see that we have a lot of memory allocations made in a way that is not friendly to tcmalloc. The issue is that we are enqueueing an op in the messenger while dequeuing in the op work queue. This means that at times the deallocation path in tcmalloc needs to employ its slow path to shuffle memory between per-thread caches.

The reason for that is that the weighted priority queue is implemented in a way that calls allocate many times for enqueueing a single op, and moreover we do that under a shard lock. So it has even more coarse granularity, even in comparison to the PG lock.
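To illustrate the allocation problem being described, here is a minimal sketch (not the actual Ceph WeightedPriorityQueue; the `Op` and `ShardQueue` names are invented) of how an intrusive list keeps enqueue under the shard lock allocation-free, because the link hook lives inside the op itself:

```cpp
#include <boost/intrusive/list.hpp>
#include <mutex>

struct Op : boost::intrusive::list_base_hook<> {
  unsigned priority = 0;
  // ... payload ...
};

class ShardQueue {
  std::mutex shard_lock;
  boost::intrusive::list<Op> q;  // does not own or allocate nodes
public:
  void enqueue(Op& op) {
    std::lock_guard<std::mutex> l(shard_lock);
    q.push_back(op);             // pointer splice only; no malloc under the lock
  }
  Op* dequeue() {
    std::lock_guard<std::mutex> l(shard_lock);
    if (q.empty()) return nullptr;
    Op& op = q.front();
    q.pop_front();
    return &op;
  }
};
```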
C: It means that if the userspace part of the mutex is contended, then you are calling the futex syscall. glibc offers some kind of adaptive mutexes, but you need to tell it explicitly that you want to use them; by default, at least in the two versions I took a look at, they are not spinning.
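For reference, a minimal sketch of opting in to glibc's adaptive mutex explicitly, as described above (PTHREAD_MUTEX_ADAPTIVE_NP is a real glibc extension; the helper name here is invented):

```cpp
#include <pthread.h>

// glibc will not spin by default; the adaptive type has to be requested
// explicitly when the mutex is initialized. PTHREAD_MUTEX_ADAPTIVE_NP is a
// GNU extension (g++ defines _GNU_SOURCE by default).
void init_adaptive_mutex(pthread_mutex_t* m) {
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);  // spin briefly before futex
  pthread_mutex_init(m, &attr);
  pthread_mutexattr_destroy(&attr);
}
```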
C: Yes, I see two ways. First of all is just to prepare and initialize the mutex differently; it could be done in such a way that we only touch our abstraction of mutexes. The second thing is to change or alter the way we take, the way we manage, the lock: we could just replace some unique_locks, or our lock guards, with a try-lock guard.
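A sketch of that second option, replacing an unconditional lock with std::try_to_lock (the surrounding function and fallback are hypothetical):

```cpp
#include <mutex>

std::mutex shard_lock;

void submit_op(/* Op& op */) {
  std::unique_lock<std::mutex> l(shard_lock, std::try_to_lock);
  if (l.owns_lock()) {
    // fast path: acquired without blocking, enqueue directly
  } else {
    // contended: e.g. stash the op in a thread-local batch and retry later,
    // rather than sleeping in the futex slow path
  }
}
```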
C: Another thing, another degree of freedom, is support for Intel TSX transactional memory. In glibc, enabling that for mutexes is something the distro vendor can turn on by default. However, for the read/write locks of glibc, of pthread, the TSX code is enabled by default, and this can actually have some impact.
C: It depends on the size of the critical section, and also on the pattern of memory accesses over the critical section. If you are touching things that are not logically related but sitting on the same cache line, I guess you can get unnecessary cancellations of the transaction; basically false sharing, right.
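A small, generic illustration of that cache-line point (not Ceph code): two logically unrelated counters sharing a 64-byte line will abort each other's TSX transactions, and ping-pong the line even without TSX; padding each to its own line avoids the false sharing.

```cpp
#include <atomic>
#include <cstdint>

struct Counters {
  alignas(64) std::atomic<uint64_t> reads{0};   // own cache line
  alignas(64) std::atomic<uint64_t> writes{0};  // own cache line; no false
                                                // sharing with 'reads'
};
```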
D: It relates to what he's calling an IO engine, or AIO engine, for making it simpler to interface the Seastar and non-Seastar code, so that we can start converting some pieces that aren't performance-critical, like the MonClient or the objecter or the [unclear] cache, to continue running in non-Seastar threads as well, while we have the main OSD IO path running in Seastar.
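A conceptual sketch of that bridging idea, not the actual Seastar/crimson API (all names here are invented for illustration): legacy code keeps running in an ordinary thread and hands completions to the reactor side through a queue the reactor polls.

```cpp
#include <deque>
#include <functional>
#include <mutex>

class ReactorBridge {
  std::mutex mtx;
  std::deque<std::function<void()>> completions;  // drained by the reactor
public:
  // called from the legacy (non-reactor) thread
  void post_completion(std::function<void()> fn) {
    std::lock_guard<std::mutex> l(mtx);
    completions.push_back(std::move(fn));
  }
  // called from the reactor's poll loop
  void drain() {
    std::deque<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> l(mtx);
      batch.swap(completions);
    }
    for (auto& fn : batch) fn();
  }
};
```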
A: All right, I'm gonna talk a little bit about this then; this is stuff I've been working on and I'm pretty excited about it. So the first thing I will mention is that there are PRs, both for RocksDB and for Ceph, for the first piece of this, which is implementing the priority-based scheme for assigning memory to the caches. The RocksDB one is in review, but movement there has slowed; we do have it in a branch in our own fork of RocksDB that we can target.
A: The Ceph PR is basically there. There was a bug that we found in a corner case where, basically, we didn't create a RocksDB cache if the cache size was set to zero, so we just had a null pointer, and my code was expecting that it would be there; that was easily fixed. So that's not a problem, but there are a number of other issues that have prevented it from merging. In my own defense, those don't appear to be related to the PR.
A: There was a bug in the BlueStore cache implementation related to a uint64_t to int conversion that wasn't safe; that is fixed in the PR now, there was an added commit for it. And then there's a long-standing bug that our QA suite only picks up maybe one out of 30 times running a particular test, and I just happened to hit it when testing this.
A: So that's actually about a year old in master; we haven't yet figured out what it is, but it does not appear to be caused by my PR. It just happened to be picked up in the objectstore test run that I did. So hopefully that will merge soon; I think it's hopefully safe at this point. But the real guts of this is in this other commit, on a separate branch, for age-based binning of the caches.
A: So the idea here is we basically have a circular buffer of counters, where each counter represents an interval of time. By default this is five seconds, and we're keeping 720 of them, so one hour's worth, and that's separated out into six different priorities, one through six. Zero is kind of a special case. But the idea here is that the current implementation has a five-second bin for super-high-priority stuff, then between five seconds and 30 seconds, 30 seconds and five minutes, and five minutes and 60 minutes.
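A minimal sketch of the structure being described, with illustrative names and the cutoffs mentioned (5 s, 30 s, 5 min, 60 min); the real implementation lives in the BlueStore cache branch:

```cpp
#include <array>
#include <cstdint>

constexpr int kBinSeconds = 5;    // default interval length
constexpr int kNumBins    = 720;  // 720 * 5 s = one hour of history

// Circular buffer of per-interval counters.
struct AgeBins {
  std::array<uint64_t, kNumBins> bytes{};  // bytes touched in each interval
  int head = 0;                            // bin for the current interval

  void tick() {                            // advance once per interval
    head = (head + 1) % kNumBins;
    bytes[head] = 0;
  }
  void add(uint64_t n) { bytes[head] += n; }
};

// Age -> priority with the cutoffs described; the data cache is shifted one
// priority lower (larger number) than the onode/KV metadata caches.
int priority_for_age(int age_sec, bool is_data_cache) {
  int pri;
  if      (age_sec <= 5)    pri = 1;  // super high priority
  else if (age_sec <= 30)   pri = 2;
  else if (age_sec <= 300)  pri = 3;
  else if (age_sec <= 3600) pri = 4;
  else                      pri = 5;
  // e.g. 5-second data lands at priority 2, hour-old data at priority 6
  return is_data_cache ? pri + 1 : pri;
}
```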
A: And then, if it's the data cache, it's actually offset one priority lower. So five seconds is actually at priority two instead of priority one, and 60 minutes is at priority six instead of priority five, the idea being that, generally speaking, we want to prioritize onodes and the KV cache over the data cache.
A: So the kind of neat thing here is that in this prototype implementation, the effect of doing the priority-based allocation of memory to different caches, combined with this age-binning scheme, makes the balancer dynamic, in that it will shift memory around based on what's currently happening on the cluster.
A: So in some very, very initial tests of this, when filling up an RBD volume full of 4-megabyte writes for preallocation of the volume, it was keeping the data cache to around 75 percent of the total memory available and assigning about 25 percent of the cache to metadata, so about 50% of the onodes were being cached by the time I finished. It was a 256-gigabyte volume.
A: It got to the point, maybe halfway through, where it hit the total amount of memory available for cache as it was balancing. Oh no, I'm sorry, it was much earlier than that. The data cache actually spiked way up, but over time it slowly gave metadata more. It didn't give it everything, just because it was seeing such a large ingest of recent data, but soon after it finished the four-megabyte sequential writes, it started to shift.
A: The KV cache never exceeded about one to two percent, because it was never really needed. There was no real need for the KV cache, because the only thing that was really being accessed was onodes, which were already in the BlueStore cache.

So this I'm super excited about, because it's doing exactly what I was hoping it would do, which is targeting the current use, kind of optimizing for whatever the current use case is, right?
A: But once we have that, and then once we start accounting for other things, like the PG log data usage, the write-ahead log buffers in RocksDB, the row cache in RocksDB, and potentially a couple of other things, we should be able to get a really good idea of where we're using memory everywhere, and control where we're using memory everywhere. And especially if we're watching the assigned memory in tcmalloc, we should be able to then start dynamically changing the amount of memory we have available to work with.
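A hedged sketch of that watching-and-adjusting loop (the property strings are real gperftools MallocExtension properties; the budget heuristic and function names are invented for illustration):

```cpp
#include <gperftools/malloc_extension.h>
#include <cstddef>

// How much memory tcmalloc has really mapped for the process heap.
size_t mapped_bytes() {
  size_t heap = 0, unmapped = 0;
  MallocExtension::instance()->GetNumericProperty("generic.heap_size", &heap);
  MallocExtension::instance()->GetNumericProperty(
      "tcmalloc.pageheap_unmapped_bytes", &unmapped);
  return heap - unmapped;
}

// Run once per balance interval: nudge the aggregate cache budget so the
// OSD converges toward (but cannot hard-guarantee) the user's memory target.
size_t adjust_cache_budget(size_t cur_budget, size_t user_memory_target) {
  if (mapped_bytes() > user_memory_target)
    return cur_budget - cur_budget / 16;  // over target: shrink the caches
  else
    return cur_budget + cur_budget / 64;  // under target: grow back slowly
}
```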
A: And we'd try to, probably not, we won't be able to guarantee it, but try to keep the OSD memory usage within some boundary that the user sets. So my goal with all of this is to make it so that we are keeping the OSD to a user-assigned memory value, and then not requiring the user to define anything else; we just automatically assign memory for everything. I think it's doable.
A: These results look really encouraging to me in terms of being able to have the OSD make smart decisions in real time about where memory should go, so I'm super excited about it. I know it's hard to be excited when you're not, you know, kind of in the midst of it like this, but I think it has a lot of promise.
A: And it's almost kind of false, though, to be honest, because really what it's doing is just giving RocksDB, or actually BlueStore, memory for onodes, because that's the thing with RBD: once you are not caching onodes, performance can go way down, right? So, you know, kind of
a test where you're not doing a random read over the entire volume or the entire disk, but you're just doing hot reads, like, you know, a what's-it-called distribution, a Zipfian. (Yes, yep, exactly, yep.) I suspect that's where it will really shine, because then you'll see most of the hot data residing in the data cache and you can do those reads just from cache.
D: Something just popped into my head: I had some RBD on EC pools. I know this is still gonna be, probably not, I'm guessing this is not going to be as effective there, just because the bottleneck there is mostly going to be over the network, having to do the read-modify-write cycle a lot of the time. But maybe if we can get some of that data cache going in, maybe it would help there as well. The Zipf distribution would be interesting to try out, though.
D: Yeah, essentially. Not necessarily; you could still have the objects themselves being the same size. I guess you're saying that there are more shards of the object, so that there are more... yeah, okay, yep.
D: Do people ever do that? I don't actually know. I've heard of people doing that occasionally; one guy, for example, was doing like 32-meg objects. Okay, wow. But for the EC case, it's probably more important there that we consider what an optimal size would be for a given encoding as well.
A: The workload on the key/value database, that's kind of where, for a small random I/O workload, that's the thing that you always see really getting hit hard. And it would be really interesting to see, if you start changing that kind of thing, how it changes the amount of data that's necessary to cache, like KV data versus onode data in BlueStore.
A: I don't think that we can ever really know in advance, or guess in advance, what these kinds of ratios should be set to, to cover all cases; I very much believe that. But I think we can make something that's relatively smart about doing exactly this, balancing it out dynamically in real time. Yeah, I'm very, very convinced this is the right way to go, but we'll see.
F: Yeah, so I had a question. As I remember, I was looking at it, I think it was in the pull request description or an email you sent or something, where you were showing you were binning things by time in the different caches, and you were skewing the time series, so that shorter, more recent data in RocksDB was the same priority as older data in BlueStore.
A: I remember thinking about that; I spent about a morning, a couple of hours, thinking about it, and I wish I would have written it down, because there was something I was really, really worried about. I felt like I was worried it was not gonna work well if we just did that.
F: See, you have to wait for things to fall off the cache; it'll be slower to respond because things have to fall off the cache, I guess. But the thing I worry about is the prototype we built: it's basically having all these additional allocations for the tracking objects that are attached to everything that's in the cache. It's just adding all these allocations and refcounts to every object, and I'm just worried about the overhead that that's going to cause. Sure.
A: So say we did the trivial thing. I'm trying to work through the logic I was thinking of when I was going through this; maybe let's just try to retrace it now and see if we can come up with what the behavior would be. So, okay, we wait for it to fall off, and that gives you kind of the maximum age of the cache, right?
F: Some ratio, like, if you think that the data cache is less important, then you could target, say, one tenth the age of the data cache for the metadata cache, or whatever. Because if you're doing the binning thing and you're skewing the bins slightly, then at the end of the day, what falls off the cache is like off by one bin, right? Which is like, if you have ten bins, it's like ninety percent or 110 percent. But however you look at it, it feels like the same thing. So...
A: So, like, in this particular case, the really interesting behavior here was in, like, priority one and priority two. That was where things rapidly changed, like the switchover from a four-megabyte workload to a 4 KB random-write workload. All of a sudden, the number of onodes that were in, like, priority 2 or even priority 3 all immediately shot into priority 1, and the amount of, you know...
A: Now, as well, they're hard-coded intervals. So, like, you know, currently the interval length is 5 seconds, so the cache balancer will run every interval, and that's user-defined; and then these, right now, are hard-coded, but, you know, that can be changed, that's not a big deal. So, like, priority 1 for the KV cache and for the meta cache, or sorry, the meta cache and the KV cache, is just one interval, whereas the, you know... so the first bin and...
A: So, like, when you're doing four-megabyte writes, you'll have lots of onodes in priority one, two, and three; they'll be spread across them, because you're not getting at onodes fast enough that they all stay at a high priority. But once you switch over to, like, 4 KB random writes, all of a sudden you're accessing onodes constantly and they all shoot up into priority one.
A: You end up with, like, a ton of super-hot data now. Because you're doing 4 KB random writes, you end up with some hot data, but the amount of data that's in the cache ends up kind of migrating down to 2, 3, 4, whatever; but the onodes are all super hot, so the stuff switches around. But if we just looked at, like, the maximum age of the cache, it wouldn't really...
F: The thing is, we don't actually... none of this matters until you start trimming things, right? Like, it doesn't matter that you have a lot of priority-two onodes and you don't have a lot of priority-two data until you actually start trimming onodes, or trimming data that could be used. So it doesn't actually matter until you start... it's something if you start trimming onodes that are, like, less than 5 minutes old; that's when you start to worry, right?
F: None of that matters until we actually get to the point where we would have trimmed an onode that is recent, that we shouldn't have trimmed, right? Like, the only decision any of this ever affects is the actual thing that you're trimming at the very end. (Mm-hmm.) And so we notice that, because the thing that we trimmed right before it is, like: oh, it's only five minutes old instead of 30 minutes old; I need to, like, dump a bunch more memory over there to get outside that window, and start pruning my data cache sooner.
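A sketch of the signal F is describing, with invented names: rather than inspecting the priority bins, compare the age of the entry a cache just evicted against that cache's target age; evicting entries that are still young means the cache is under pressure and should get more of the budget.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

struct CacheStats {
  Clock::time_point last_evicted_atime;  // access time of last evicted entry
};

bool wants_more_memory(const CacheStats& c, std::chrono::seconds target_age) {
  auto evicted_age = Clock::now() - c.last_evicted_atime;
  // Trimming entries younger than the target (e.g. five minutes old when we
  // wanted thirty) is the signal to shift budget toward this cache.
  return evicted_age < target_age;
}
```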
F: So the hypothesis is that it's the same regardless of what your cache is. The only actual effect of changing the cache sizes is whether you trim an entry or not; that's actually the whole effect of all this. And so it's whether this entry that's N minutes old gets trimmed or doesn't get trimmed, and you can base that decision either
on this weird curve of priorities, or you could also just look at the age of the thing right before it, which is going to be approximately the same age as the thing trimmed. And so my thinking is that you could make the same decision based on that. Because by the time you get... let's say your priority two blows up because you're blowing out a bunch of onodes; suddenly the things that are falling off the onode cache, instead of being, you know, four minutes old, are three and a half, then three, then two and a half, then two, then one and a half, right? (Mm-hmm.) That ramps up; you're like, oh my gosh, I've got to start giving them more memory, and then, you know, it equalizes at two and a half and you, like, actually just stop trimming them entirely. I mean, even if they were at five, as soon as they go to...