From YouTube: 2017-FEB-15 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A
So, what's going on this week: Sage decided to dig in and rip part of the fast dispatch code apart. From talking with him, I don't think we were able to rip out quite as much as we wanted to in all cases, but he was able to get a lot of code out of the normal path, when you're not dealing with an interesting cluster.
A
Hopefully we'll see some gains there. Previously there had been an attempt to remove a lot of the locking in there, and unfortunately we weren't actually seeing any gains from it; in particular cases, actually, a regression. So hopefully this attempt will be better, but we'll see. Also, some really exciting stuff here: a couple of patches from Igor. He is working on reducing the memory usage for blobs in BlueStore when doing partial overwrites, and I asked if anyone had time to take a look at the PRs.
A
We also talked a little bit about Igor's patch, and Sage is looking into various things like the performance issues they found. I talked to him; he was dealing with random hardware issues on his test setup. I think he'll get those results, though: even though he was handling that earlier this week, he'll still look at it later this week. And then I've been doing quite a bit of RGW testing and running into some interesting things.
A
I do have some results; I think maybe I'll hold those for next week. Right now I'm running into issues with different clients when there's really high concurrency against one RGW instance: certain client processes are slowing down or stalling their rates while others aren't, even with a really high RGW thread count and also with higher throttles for the objecter inflight ops and bytes options.
A
Things were pretty reasonable with replication, but with erasure coding for the data pool we're seeing some scaling limitations. My feeling is that we end up more bound by latency than we are in the replication case, and then once we try to increase the intensity there are always other problems showing up. So anyway, those are the big items.
A
That's all I have. It looks like a couple of people have added things in here for this week that they want to talk about, so maybe start with the first item here: write performance with Kraken on RBDs.
C
So, let me just share my screen.
C
Right. So, just in case you're watching and wondering what this is about, I'll cover a very brief history. We were seeing, during the day, very, very high spikes in latency, and it was a bit confusing as to what was going on. So I looked into it, and it appeared it was the snapshots: when they're being trimmed after they are removed, you see very big latency spikes, and literally all I/O across the cluster just drops to zero. In some cases I've had OSDs die on me and even get corrupted.
C
So this is causing quite a bit of a problem, and we wanted to look a bit further into it. Now, in previous versions you were able to use the sleep after the snap trimming, but since Jewel that's moved into the main I/O thread, so it actually just blocks I/O as well. You can see that in the green line.
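The sleep being discussed is an OSD config option; a minimal sketch of setting it (option name as commonly used in Jewel-era releases, so treat the exact name and value as an assumption):

```ini
# Hedged sketch: throttle snapshot trimming by sleeping between trim ops.
# In older releases this ran in a dedicated thread; once trimming moved
# into the main op thread, the same sleep stalls client I/O as well.
[osd]
osd_snap_trim_sleep = 0.1   ; seconds to sleep between snap-trim transactions
```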
C
That line was from when I tried to set the sleep on the snap trimming, and although it stops things for a time, it actually makes the I/O problems even worse. So, to test the patch Sam created, I managed to recreate it essentially on my test box with 7.2k disks: I made a snapshot, generated about 50k dirty objects, and then during the removal I ran a 4k random read fio test.
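The reproduction described above can be approximated with an fio job along these lines. The pool and image names and the run length here are our own placeholders, not the exact setup from the call:

```ini
; Hypothetical fio job: 4k random reads against an RBD image while the
; cluster trims a removed snapshot with ~50k dirty objects.
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
direct=1

[randread-4k]
rw=randread
bs=4k
iodepth=16
runtime=600
time_based=1
```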
C
So Sam made this branch; it gives you a configurable trim limit. He limited it to two PGs trimming per OSD, and you also now get a status in the ceph status display, which is really handy, so you can actually see how far along in the trimming process you are. So it's better, but very, very spiky. Just look at the fact that there are still several seconds where I/O is pretty much zero.
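The per-OSD limit in Sam's branch corresponds to an OSD option; a sketch of tuning it (the exact option name in that branch is our assumption):

```ini
# Hedged sketch: cap how many PGs may trim snapshots concurrently per OSD.
[osd]
osd_max_trimming_pgs = 2
```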
C
So then I tried to play with the writeback throttle, because it looked like, looking at iostat, that when these periods of zero I/O happen there's a big burst of writes across all the disks, because it's flushing out the buffers in the kernel. I'm guessing what's happening is that these operations, even though they've been throttled to one or two at a time, are getting immediately acked by the journal, then building up, and then when they get flushed down to the disk...
C
...it saturates the disk with writes and it can't perform any reads. So I tried dropping the writeback throttle from 500 and 5,000 down to 250 and 500, and that had quite a dramatic effect: it removed those drops to zero, and the remaining latency spikes are a little bit smoother.
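The values mentioned (dropping from 500/5,000 to 250/500) map onto the filestore writeback-throttle options; a sketch, assuming an XFS-backed filestore:

```ini
# Hedged sketch: tighten the filestore writeback throttle so journaled
# writes are flushed to the data disk sooner, in smaller bursts.
[osd]
filestore_wbthrottle_xfs_ios_start_flusher = 250   ; was 500
filestore_wbthrottle_xfs_ios_hard_limit    = 500   ; was 5000
```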
Then I played with the snap trim cost parameters, but I think this problem is happening below the queuing, so it all comes back to that same level again, the writeback throttle. And then, really, I'm out of ideas.
D
My supposition is that, while we're only allowing two PGs to trim, the trim is hitting the journal and then the page cache and then being considered complete, but the work hasn't actually been done yet, right? So it's building up. And that's why the sleep parameter worked in the past: it artificially rate-limited the number of requests to something that the disk could actually do. Okay.
D
We could, at the queue level, introduce something that does essentially the same thing the sleep parameter did, but without blocking the thread. That's pretty unsatisfying, though, because you'd be able to produce the same effect by sending the same queue depth of writes to the cluster normally. Yes.
D
...which is what the writeback throttle was meant to fix. Because it did, yeah, I think.
C
So yeah, I have been doing some other testing, with backfilling and other stuff, and I'm getting the feeling now that the writeback throttle, like you say, has been tuned for getting the maximum write performance. But actually, in most cases, read latency is probably more critical.
D
Oh, it is. That part was a deliberate trade-off, and that's why there's a knob. So if you wanted to play with that and report back at the performance meeting, which you can now, that would be good.
D
That's not the problem. The problem is that by the time the filestore considers the I/O done, almost no work has actually happened. That's fundamentally the problem, and so we can't even block it across the syncs. We have to make some random guess as to how long it will take the filesystem to actually get around to doing the work, and that's little better than guesswork. Okay.
D
If you're working on this, the correct base is master: the backoff throttle allows you to set a curve, if you look at the documentation.
D
Cool. Thanks for looking at this, by the way; I know there are other people who see this problem. The original sleep solution was really unsatisfying because, while it kind of worked for the original case, there was other stuff in that thread, too, that got blocked when those sleeps were happening. So it was never a complete solution.
A
All right. It looks like the Nokia folks have some test results they'd like to share. Are you guys here? Yes?
E
Yes. Sorry I couldn't attend last week, but you had asked us to write up some information about how we gather the test results. We are basically using our own I/O stack; this presentation's file name and link are on the pad. Our tool is used for gathering all the I/O stats, and then all the post-processing is done with Excel. Some time ago I had doubts about the latency and performance impact of the selected protection scheme, the erasure coding.
E
We compared erasure-coded 4+1 and 3+2 pools against 3x replication. The write performance is more or less similar, and on writes this is probably dominated by the write amplification. We also ran the same test on SSD disks, and performance was comparable. What we see from our real-life workloads is a latency penalty on writes, but we see a similar effect on reads. In this histogram I put in some numbers, just for comparing the latencies that we get.
E
And this second chart is bandwidth, comparing erasure-coded 4+1 and 3+1 pools against replication, using SSD disks. Clearly we get more or less double the bandwidth using a single SSD disk, judging from the raw data rates. But then, after several considerations, we...
E
...finally moved to the new release, Kraken, and we have a big showstopper here with memory: we're running out of memory. We passed this test previously, but with this new version we run out of memory. We have five nodes with this rig. Then we started looking into the changes between the previous version and this version, and we finally found a change in the configuration of the cache: in the previous version the cache was around 500 megabytes, and now it's five gigabytes.
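The cache being discussed is the BlueStore cache; a sketch of pinning it back to the old roughly 500 MB value. The option name here is our reading of the Kraken-era code, so verify it against your release:

```ini
# Hedged sketch: cap the BlueStore cache at ~512 MB.
# Per the later discussion, the value is divided across shards internally,
# so it is meant to be the total per OSD, not a per-shard figure.
[osd]
bluestore_cache_size = 536870912
```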
E
Multiply that out and you get about five gigabytes per OSD. That's my guess; I don't know why, but it's a fact. If you look at the results we're getting here after running fourteen hours of the test, you can see the growth: we were using six to seven gigabytes of RAM per disk.
E
Eventually the OOM killer kicked in and killed a process, so the whole cluster, and our test, stopped, because it decided to kill one OSD daemon. That's the problem we're showing in this presentation: probably the cache is computed in a different way now, and at least for us it is, on its own, using much more memory.
B
So I just checked the code, and it is dividing that number by the shard count, so it's supposed to be the total, not the per-shard value. If you're using that much, my guess is that it's a coincidence and there's some other reason why it's using more memory than it's supposed to. There's an admin socket command, ceph daemon osd.N dump_mempools, for whatever N, and that will tell you how much...
B
You have to set it in the conf before the OSD starts, but if you set that option to true, get it back into that state, and then do the dump, you'll get a detailed breakdown of where all that memory is being used, at least hopefully by most of the system, and that will hopefully help us figure out what's going on. My guess is that there's actually a subtle bug: there's a buffer, probably, that we're not rebuilding, and so it's effectively leaking a bunch of buffer memory. We'll see what it shows, but we should check.
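The admin-socket command and the debug switch mentioned above look roughly like this. The osd.0 target is a placeholder, and the mempool_debug option name is our reading of the discussion, so verify both against your release:

```
# Per-pool memory accounting for a running OSD:
ceph daemon osd.0 dump_mempools

# For the detailed breakdown, enable debug accounting in ceph.conf
# *before* the OSD starts, then dump again:
#   [osd]
#   mempool_debug = true
```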
E
Anyway, that test ran for fourteen hours. We then re-ran the same test, with the only change being lowering the cache size to something like 100 megabytes, to prevent the OOM kills. What's more, with the current parameters we were able to get results now that all the RocksDB and data is going to the same partition on the disks, and the result is better than what we got in the previous version.
E
My only concern now is that, by comparison, at least against the previous version of Ceph, we are seeing lots of reads on the data partition on BlueStore. I mean a lot of reads, like 2,000 reads, on the data partition, and I don't find any reason for that, and I wonder whether that is where the lower write performance comes from: read activity on the data partition.
B
Yeah, I guess that's my question, because usually, in the past at least, I think in general there should be a bunch of reads coming from the RocksDB database partition. Basically, after RocksDB does a compaction it invalidates all of its cached data and then has to fault it all in again. So every time it compacts an SST file it then has to read it back in, which is really irritating, but those reads should be coming from wherever RocksDB is storing its data, so usually the DB partition.
B
That's everything, I guess, on the list. I can give a quick update on the fast dispatch stuff. The short report is that the pull request that does all the preliminary work is basically ready to merge. The one that actually changes the fast dispatch code is the last piece, so once the other stuff is merged I need to do testing on that, but I think it's pretty close. I did go back and look at the last bit that I was worried about.
B
After all the review changes, that was the get-map-reservation and release-map calls, because those used to be called heavily during fast dispatch. Now they're only used in the legacy client handling, and I'm not so concerned about that, but it turns out they're also used every time you send a message: the OSD has to get a reference to a map and then release it. So, regardless of fast dispatch, that's just a hot path.
B
Yeah, because we have all these threads running that are sending messages to other OSDs, and we need to make sure they don't send messages to an OSD that we've just marked down, so we have to flush all of those threads out before we start marking things shut down.
B
So I think it's still worth optimizing that. I went back and looked at how it works, and I think it makes sense. It basically amounts to incrementing and decrementing an atomic in the fast path, which is pretty good. We could probably tune it a little bit better with RCU, but I would rather take that step first, because RCU sets up all sorts of weird threading requirements, like registering all threads and so on. I think.
B
They could, but it doesn't matter that much. One of the things the fast dispatch change does is alter the waiting-for-map handling so that it blocks only... oh, it did that before, actually, never mind. Basically it only slows down the one client: if it sends something and the OSD doesn't have the map, only that client's requests get blocked, and everybody else can still continue.
D
The map trimming part is a separate piece of this problem. What's different is that anything we do that causes a map change, even an unimportant one, causes an unnecessarily large hiccup in client I/O, because clients can get the map before the OSDs do.
B
Now would be the time to do it. It actually wouldn't be that hard to add a little bit of state in the Objecter to track the last primary sent to, or whatever. In fact, I had a patch that did that for the backoff stuff and then dropped it because I didn't need it. So now would be a good time to take a stab at it.
B
You know what the client could do? It could have a time bound, like five seconds, and keep in memory those few maps up to five seconds old, and if it's sending any requests and it has older maps, it could just repeat the calculation on those too. That way it can calculate a lower bound in just the cases where it is likely to matter.
B
Looking at this OSD node: the BlueStore cache is right at one gig, so that's actually as it should be, at least for the metadata, but it's got three gigs on the data side. I bet that there's a buffer-referencing bug; I bet it's one of these.
B
There's a whole class of bugs here: there's a bunch of code for reference-counting buffers in Ceph that can sometimes be too clever for its own good. What's probably happening is that some object in the BlueStore metadata is referencing a few bytes out of a larger memory allocation, and that larger allocation can't be freed because we're still using a small piece of it, so the usage is still effectively bounded by the size of the BlueStore cache...
B
...but it's like 10x that or something, which is why you see that reducing the BlueStore cache size brings it down. So we just have to figure out which buffer that is and go fix it. I'll see if I can reproduce this, it's on my list, and look at what's going on with BlueStore's memory utilization. My guess is that if you were using replication it wouldn't happen, because the I/O pattern is slightly different.
B
On the caching in librbd side, I'm not sure; that is a question for Jason. I think there are other things being prioritized for Luminous, like the multi-site mirroring agent and making that robust, but I don't know the timeline for that feature. Okay, yeah. It was originally something that the Intel engineers were going to work on, and they aren't, and Jason was going to do some of it, but he has other stuff. So my guess is that there just isn't a person on it.