From YouTube: 2018-JAN-04 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
D: Looks like a nice review. Okay, there's that. I added, I have a pull request that adds some minimal tracing to BlueStore and the OSD. This is in order to generate a trace to share with the CachePhysics folks, who are building like a SaaS-type thing that does what that paper we saw at FAST several years ago did, Josh and Greg. It's the one that does a cache miss reliability curve; was it cache MRC, miss rate curve?
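The miss-rate-curve idea can be sketched with a classic LRU stack-distance pass over an access trace. This is a simple exact Python model for illustration; the actual CachePhysics-style approach uses sampled, approximate stack distances to keep overhead low.

```python
from collections import OrderedDict

def miss_rate_curve(trace, max_cache_size):
    """LRU miss-rate curve from a block access trace (Mattson-style).

    A re-access at LRU stack depth d is a hit for any cache that
    holds at least d blocks, so one pass over the trace yields the
    miss rate for every cache size at once.
    """
    stack = OrderedDict()              # LRU order: most recent at the end
    depth_hits = [0] * (max_cache_size + 1)
    for block in trace:
        if block in stack:
            keys = list(stack.keys())
            depth = len(keys) - keys.index(block)   # 1 = most recently used
            if depth <= max_cache_size:
                depth_hits[depth] += 1
            stack.move_to_end(block)
        else:
            stack[block] = True        # cold miss: misses at every cache size
    total = len(trace)
    curve, hits = [], 0
    for size in range(1, max_cache_size + 1):
        hits += depth_hits[size]
        curve.append((total - hits) / total)   # miss rate at this cache size
    return curve
```

With a curve like this you can read off how much cache buys what hit rate, which is exactly the "size your caches" use mentioned in the discussion.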
D: So you can size your caches, and it's, you know, all improved and all that stuff. So there's that; that actually was interesting. The way that they're setting it up is, you set up something that will feed traces to an agent that feeds them back to their SaaS hosted-service thing or whatever, and then it gives you back all the nice information, which is nice, because licensing issues sort of go away.
D: Their preferred way of collecting traces is based on OpenTracing, and the whole conversation led me to... so they basically want to have these probes where they're sampling frequently initially, to build up the initial data set, and then they sample less frequently over time, so in sort of the steady state you're actually sampling at a very, very coarse rate. They're looking at OpenTracing and this project called Jaeger, which is in the CNCF now; it's sort of the new version of...
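The sample-heavily-then-back-off scheme described above can be sketched as a sampler whose probability decays with the number of events seen. This is an assumed scheme for illustration, not the one the agent actually implements; the parameter names are hypothetical.

```python
import random

class DecayingSampler:
    """Sample frequently at first to build an initial data set, then
    decay toward a coarse steady-state floor rate (illustrative model)."""

    def __init__(self, initial_rate=1.0, floor=0.01, half_life=1000):
        self.initial_rate = initial_rate   # probability for the first events
        self.floor = floor                 # steady-state sampling probability
        self.half_life = half_life         # events until the rate halves
        self.seen = 0

    def rate(self):
        # exponential decay in the number of observed events, clamped below
        decayed = self.initial_rate * 0.5 ** (self.seen / self.half_life)
        return max(self.floor, decayed)

    def should_sample(self, rng=random.random):
        r = self.rate()
        self.seen += 1
        return rng() < r
```

Early on nearly every event is traced; after enough traffic only about one event in a hundred is, which matches the "very, very coarse rate" steady state.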
D: But as far as being able to get to the point where there's something attaching to a running process and collecting trace data, sampling it, and doing something with it: I don't know, what do we do, do we do anything? Is it just using LTTng to do that, or is there some library that's more specific? I don't really understand this space, so, but it kind of reopened that question for me. I'm not sure.
D: All right, so anyway, that pull request is probably going nowhere, but it was enough to generate a trace to just send to them, to see what data they generate. So I did that. Let's see, there's something that was merged with caches; I never know what Adam's pull requests actually do until you go read them. A cache improvement, that's a...
D: There's a change to the rados bench command that I mostly don't care about, so it's like repo chatter. So that's good. Reverting the approximate size: so it sounds like the takeaway is that we're stuck with an O(n) list size function for the lifespan of RHEL 7, because of ABI issues with the build toolchain?
D: There's that sharded-up thing, where I don't think we're going to merge that, and I'd forgotten about this one from a long time ago: Jianpeng did a thing that will optionally use a single callback for both op commit and op applied, instead of two callbacks. But we already merged the thing I did that makes the on-applied a synchronous callback, so I'm not sure that one makes sense or is going to be helpful anymore.
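The single-callback idea can be modeled like this (a hypothetical sketch, not the actual ObjectStore transaction API): one completion object either fires two separate callbacks, or a single one once both commit and applied have happened.

```python
class TxnCompletion:
    """Completion for a write: either separate on_commit/on_applied
    callbacks, or a single on_both fired once both events occurred.
    Illustrative names only, not the real Ceph transaction interface."""

    def __init__(self, on_commit=None, on_applied=None, on_both=None):
        self.on_commit = on_commit
        self.on_applied = on_applied
        self.on_both = on_both
        self._seen = set()

    def _event(self, name, callback):
        if callback:
            callback()
        self._seen.add(name)
        # single-callback mode: fire once when both events have arrived
        if self.on_both and self._seen == {"commit", "applied"}:
            self.on_both()

    def commit(self):
        self._event("commit", self.on_commit)

    def applied(self):
        self._event("applied", self.on_applied)
```

The single-callback mode saves one completion dispatch per write, which is the sort of overhead the original pull request was after.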
G: So, as a result, we are also doing asynchronous reads: we are starting the asynchronous machinery even in situations where our data are actually available in the BlueStore caches. In such a case it's expected to have a performance penalty, right? I'm trying to address that by delegating the decision whether to go synchronously or asynchronously to the object store implementation. I've pushed a branch yesterday on my GitHub; at the moment it consists of changes made at the BlueStore layer.
G: ...to have the async machinery started, and this decision is made in BlueStore as well. Today I've got results from Mark, and it seems we got some performance regression in writes, in a hundred-percent pure-write scenario, which excludes the possibility that the AIO thread is busy with handling the read completions instead of doing the write-related things.
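The delegation Radoslaw describes can be sketched with a toy Python model (illustrative only; the names do not match the BlueStore code): a read whose data is already cached completes its future inline, while a miss goes through the async machinery.

```python
import threading
from concurrent.futures import Future

class ToyStore:
    """Object store that decides per read whether to complete the
    future synchronously (data already cached) or asynchronously."""

    def __init__(self):
        self.cache = {}

    def read(self, key, slow_fetch):
        fut = Future()
        if key in self.cache:
            # fast path: complete inline, no async machinery,
            # no extra context switch for cached data
            fut.set_result(self.cache[key])
        else:
            def worker():
                value = slow_fetch(key)     # simulated media read
                self.cache[key] = value
                fut.set_result(value)
            threading.Thread(target=worker).start()
        return fut
```

The caller still gets a future either way; only the store knows whether the completion cost a thread hop, which is the point of pushing the decision down into the implementation.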
A: Those graphs are kind of screwed up, actually, compared to mine; I'm not sure why. Let me share my screen. Okay.
A: It appears that if we look at both the yellow and the green lines, which are for buffered (basically using buffered reads and writes in BlueStore), versus disabling those, the maroon and light blue lines, there's a much bigger effect from that than from using your branch. Do you see? Yep. But actually the reverse is true at small sizes: the lines cross, essentially, so using unbuffered is actually faster for small writes.
A: One question I had for you, Radoslaw: in this random read case, when we have your branch enabled, for very small I/Os there's a pretty clear trend where the performance drops, which we don't see, well, specifically with unbuffered, when we disable buffered I/O in BlueStore, compared to the master branch. Any thoughts on that?
A: Yeah, agreed, agreed. I'm going to try to rerun some of these tests and also look at it with the wallclock profiler, to see if I can tell if there's anything interesting going on. But both in the buffered case and in the unbuffered case, especially with small I/Os, it looks like the async read PR is slower.
A: One thing: when we were talking at stand-up about the 512K random write drop, that's what I'm talking about right there. That point where, with the 512K I/O size, we see this regression compared to FileStore; there's this kind of drop and then it recovers. I'm very curious if that's due to the blob size being the same, but I'll have to run more tests and try to diagnose what's going on there.
A: But anyway, the mixed results, I think, are probably sort of similar to the other ones. For sequential mixed I/O, the sequential reads kind of drag everything down a little bit, but your branch is doing better than the other things, I think. And then again we're faster in BlueStore for random I/O, again with this same drop at 512K. So anyway, that drop at 8K, that's below the min_alloc_size, right? Because the min_alloc_size for solid state...
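The 8K observation lines up with a simple allocation-unit model: writes smaller than min_alloc_size still consume a whole allocation unit, so effective write amplification grows as the I/O shrinks. This is a rough sketch only (real BlueStore small writes also involve deferred writes and read-modify-write), and 16K is used here purely as an illustrative SSD default.

```python
import math

def alloc_write_amp(io_size, min_alloc_size=16 * 1024):
    """Bytes allocated per byte written when every write is rounded
    up to whole allocation units (rough model of min_alloc_size)."""
    allocated = math.ceil(io_size / min_alloc_size) * min_alloc_size
    return allocated / io_size
```

Under this model an 8K write on a 16K allocation unit allocates twice what it writes, while I/Os at or above the unit size approach an amplification of 1.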
A: The one thing I will say is that during this testing it became really apparent, unfortunately kind of late, that the amount of memory in these nodes was really having a big effect on the performance results, because each of these nodes has 64 gigs of RAM, and with multiple nodes that's enough that even fairly large RBD volumes have a significant amount of cached data for random reads.
A: So I manually had to go through and tell the kernel on the nodes to only use 8 gigs of RAM; that was really the only way to restrict the amount of buffer cache that FileStore could use. So if you're doing any of your own testing like this: there's a pretty substantial effect from caching, even in cases where you're using quite large volumes, so just be aware of it if you're doing your own performance testing.
A: Well, I think there's probably still plenty of investigation to do, but maybe the only other thing I'll say is that there are really wide performance swings here, right? I mean, in the sequential read case we're talking about a swing of almost 200 percent, and some of the other cases are fairly large too. So we're really looking at important, critical parts of the read and write path, I think; I guess the read path in this case, but it's the right thing to be looking at.
D: Piotr, I just rediscovered your CRC cache patch; that one looks just much simpler. I'm tagging that one for QA. Cool, thank you. I'm trying to think if there are other patches that are sort of floating around that we want to look at, that we've forgotten.
D: Sure, yeah. One of the things that came up last night was that perhaps we should, it might be a little bit early still, but at some point make a MemStore implementation that uses those futures. I don't know how valuable that'll be, because it'll never block, but we could probably make a mode where it'll sort of do a synthetic block, where it just says this will take this long.
D: It might be premature, because we don't have any of the other stuff ready to go. It might be that we're still better off focusing on the messenger side of the equation and some of the other infrastructure, but at some point I think it's going to make sense.
D: It may be that it's just a useful exercise to sort of lay out what the interface would look like, because all the read methods will basically have a future for completion, like getattr and exists and collection_list. So maybe that makes sense, so that when we do sit down and think about how the OSD code is going to change, we actually have an interface to line up against on the other end.
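A futures-based read interface could be laid out roughly like this, using Python asyncio as a stand-in (hypothetical names, not the actual crimson/Seastar interface), including the "synthetic block" mode where a store that never really blocks still completes its futures after a configurable delay.

```python
import asyncio

class FutureObjectStore:
    """Every read method returns an awaitable for completion.
    synthetic_delay models 'this read would have taken this long'
    for a store that otherwise never blocks. Illustrative only."""

    def __init__(self, objects, synthetic_delay=0.0):
        self._objects = objects            # name -> {attr: value}
        self._delay = synthetic_delay

    async def _complete(self, value):
        if self._delay:
            await asyncio.sleep(self._delay)   # the synthetic block
        return value

    def exists(self, name):
        return self._complete(name in self._objects)

    def getattr(self, name, attr):
        return self._complete(self._objects[name][attr])

    def collection_list(self):
        return self._complete(sorted(self._objects))
```

Writing the consuming code against awaitables like these gives the OSD rewrite a concrete interface to line up against, even before a real non-blocking store exists.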
D: I mean, we slowly ripped out all the direct read operations, because everything is simpler if everything goes through a transaction, and in practice everything should go through a transaction. So for the reads inside a write transaction, the question is: what would that accomplish? Nobody is there to do anything with them, because by the time the transaction is submitted, you already know what you're going to write.