From YouTube: 2016-DEC-07 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A: Alright, I think I'm just going to go ahead and start, and you can catch up when he gets here. So, what we have going on this week: there are a couple of new PRs that came in adding tracepoints for critical functions in the IO path. That's always welcome. I'm hopeful that we will, in general, be able to start doing better tracepoint analysis soon; I think that would actually help us quite a bit. A couple of different things going on.

There's a new ObjectStore performance benchmark from SanDisk. There were some questions as to why we would want to use that versus something like fio with the ObjectStore backend, and it sounds like it's mainly for ease of integration into test suites and also potentially lower CPU overhead. So there's a good discussion going on there; if you're interested in ObjectStore benchmarking it's a good one to look at.
B: Those are really old pull requests; there's a newer version of the code.

A: Oh, fantastic.
B: Yeah, it's orthogonal. His pull request is just splitting it in two, so that there's one thread doing the submissions and another one doing the completions, and the multi-threaded work is doing n submission threads, so we could have n times two, I guess, but I'm not sure yet whether that's going to be a good idea or not. We'll see.
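(Editor's sketch, for readers of these notes: a minimal, hypothetical Python illustration, not Ceph code, of the pattern being discussed above: one thread submitting transactions while a separate thread handles the completions, instead of a single thread doing both. All names and the queue contents are invented.)

```python
# Toy illustration of a submitter thread plus a separate completion thread.
# Not Ceph code; transaction layout and callbacks are made up.
import queue
import threading
import time

submit_q = queue.Queue()    # transactions waiting to be submitted
complete_q = queue.Queue()  # finished transactions waiting for their callbacks

def submitter():
    """Drain the submit queue and issue the (simulated) I/O."""
    while True:
        txn = submit_q.get()
        if txn is None:              # shutdown sentinel
            complete_q.put(None)
            return
        time.sleep(0.001)            # stand-in for the actual KV submit
        complete_q.put(txn)

def completer():
    """Run completion callbacks without ever blocking the submitter."""
    while True:
        txn = complete_q.get()
        if txn is None:
            return
        txn["on_commit"]()           # e.g. ack the client

threads = [threading.Thread(target=submitter), threading.Thread(target=completer)]
for t in threads:
    t.start()
for i in range(8):
    submit_q.put({"id": i, "on_commit": lambda i=i: print(f"txn {i} committed")})
submit_q.put(None)
for t in threads:
    t.join()
```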
A: Alright, moving on to discussion topics. There were two I had today that I just want to quickly mention. We just started looking at RBD with EC overwrites testing, and the results are still coming in, but so far the gist of it seems to be that this is going to be really good for sequential write performance.

So at least for, you know, four-megabyte sequential writes, both on hard drives and on NVMe, it looks really, really good. For basically all other cases there are kind of mixed results, with, you know, lots going on, and in some cases it's quite a bit slower, but you'd expect that in some cases just given the extra overhead that's required. So next week I'll have more, but for right now the gist of it is that for large sequential writes it's looking really good from a performance perspective.
C: Mark, I have a quick question. You said that the reads did not perform well, and I can understand why small random writes are going to perform worse with EC, but for reads, it seems to me I can't think of a reason why they should. Can you explain that?
B: So there's a sort of long-standing feature on the roadmap that would let the clients read directly from the various shards — opportunistically at least try to do that before falling back to reading from the primary — because in the case where there are no racing writes that'll work just fine, but that hasn't been implemented yet.

C: Okay, thank you. Nice — we'll come back to that later, yep.
A: So the other thing I wanted to hit before we move on is that our friends at SanDisk have been doing that kind of work and have BlueStore on ZetaScale available for people to test as of a couple of days ago. They have the initial branch here that uses a single KV sync thread, and then a new version, I think, was just released maybe very early this morning — all of this is experimental — but an even more experimental version that uses multiple KV sync threads.
E: He's here too — he actually did the ZetaScale side of it as well — but yeah, I'm writing this multi KV sync thing that I actually mentioned on the mailing list. The point is basically parallelism, which is what we actually need from BlueStore, and I am seeing promising early results with that, but I still need to do a lot of benchmarking there.
E: A lot of optimization still needs to be done on the shim layer, so we are doing that as well, and hopefully in a week or the next two weeks we should be able to present to the community the entire difference between the ZetaScale-backed BlueStore and the RocksDB one, and hopefully it will
do well. The difference is mostly on the bigger volumes: as we know, if there is less metadata RocksDB will be great, but as you actually put more data down, RocksDB will start degrading because of the compaction and whatnot. So right now I am trying to find out what the crossover point is.
E
For
example,
we
can
say
that,
okay,
if
for
the
volumes
times,
for
example,
lower
than
100
gig
so
rocks
like
you
will
be
probably
very
performing
better,
but
bigger
than
hundred
dig,
it
will
be
get
a
scale,
will
be
always
better
because
a
more
volumes
be
cut
the
volume
sizes,
so
the
performance
more.
So
so
that
crossover
point
I
am
trying
to
find
out.
So
actually
it
will
be
helpful
for
the
for
everybody,
and
also
we
are
working
like
as
part
of
the
complete
implication.
We
need
to
run
virtually
rocks
TV.
We need to come up with a solution for that — performance is the first thing; if we see a significant performance benefit all the way through, then the next thing is how to integrate it for a production deployment, by sharing the space in between. So that's what else we've been working on. Hopefully next week — I will target next week's performance meeting — we should be able to provide detailed benchmarking results. Yep, see ya.
F: So last time, when we presented the performance data to Mark and Sage, we saw ZetaScale with a larger dataset performing better than RocksDB, but even in that test we saw that there was IO bandwidth left on the system, on the flash drives, because a single thread is submitting the bunch of transactions in the ZetaScale layer and they get applied one by one; so, not having enough parallelism, we are not able to saturate the flash bandwidth.
F: So we think that by applying these transactions through multiple threads in parallel — and we have seen that with various other ZetaScale tests — so we are experimenting with multiple threads: if we can push these transactions in parallel, we should be able to saturate and use all the bandwidth left in the system, in the device, and expose that bandwidth to the client. That's the expectation, and early results suggest we are getting something from it, but we need to do more runs and performance analysis to make sure that it's explainable.
B: With the pull request as it is now, I think we can do this a little bit more efficiently — things are sort of being divvied up across the threads on the receiving end, and I think we can just separate it out into n different queues — but we can clean that up later. I think the key is showing that this is actually helping and then going from there.
H: So that's what I'm actually showing on the screen for those two tracepoints. The second one is — so this gives us the function-level latency, and we are also looking into end-to-end latency. If you follow the flight path of an OID from the beginning of the operation submission, whether it is a read or a write, you should be able to track the different places it passes through and what the latency is. So that is essentially the focus of the OID tracing.
H: The OID tracing has two sets of LTTng events. One is a generic OID event that has essentially the name of the object ID and a tag that I normally overload with different things for a given event — so it is an event, or it could be something else — and then it has some information about what file this event is originating from, what the line number is, and so on.
H: I also added one additional thing called OID elapsed. There are certain places where I really cannot tag an event, whenever the event is conditional. If you look at the AsyncMessenger, I am only tagging the MOSDOp and the MOSDOpReply, and I won't be able to know which one it is until the message is actually decoded. So in that case, what I do is take a timestamp right before the message arrives, and when the decode completes — when it really is the MOSDOp or the MOSDOpReply — then I take the timestamp and dump it into the OID elapsed event. So there are several events where you may want to just use the OID elapsed as opposed to marking an event with a certain timestamp and a tag in it, yeah.
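(Editor's sketch: a rough, hypothetical example of how per-OID LTTng events like these might be post-processed from a babeltrace text dump into per-object latencies. The event name, field names, tags, and line format below are invented for illustration; the real Ceph tracepoints and the speaker's own scripts may differ.)

```python
# Hypothetical post-processing of a babeltrace text dump of per-OID events.
# Event/field names ("oid_event", "oid", "tag", "submit", "reply") are assumptions.
import re
from collections import defaultdict

# e.g. [12:00:01.123456789] ... oid_event: { ... oid = "obj_7", tag = "submit" ... }
LINE_RE = re.compile(
    r'\[(?P<h>\d+):(?P<m>\d+):(?P<s>\d+\.\d+)\].*oid_event.*'
    r'oid = "(?P<oid>[^"]+)".*tag = "(?P<tag>[^"]+)"'
)

def ts_seconds(m):
    return int(m.group("h")) * 3600 + int(m.group("m")) * 60 + float(m.group("s"))

def latency_per_oid(path):
    """Return {oid: reply_ts - submit_ts} for every OID that has both tags."""
    first = defaultdict(dict)
    with open(path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            # keep only the first occurrence of each tag per OID
            first[m.group("oid")].setdefault(m.group("tag"), ts_seconds(m))
    return {
        oid: tags["reply"] - tags["submit"]
        for oid, tags in first.items()
        if "submit" in tags and "reply" in tags
    }

if __name__ == "__main__":
    for oid, lat in sorted(latency_per_oid("trace.txt").items()):
        print(f"{oid}: {lat * 1e6:.1f} us")
```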
H: Okay, so — if you look at those events once you capture them, I have essentially three sets of Python scripts. What they do is skim through the entire set of function trace events and essentially recreate the stack from them, based on the events that are happening within each thread and when a function is entering and when it is exiting.
H: So if you look at the profile of the output, what you will normally see is a text tree. For example, if you take aio_operate as the function call, you get a detail of which function, in what file and at what line number, it is originating from; that function is calling op_submit, which is calling a method in the Objecter, op_submit_with_budget, and then it goes through a series of calls in the different stages, and I can actually compute the latency of each one of them.
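(Editor's sketch: a simplified illustration of what such a post-processing script might do — rebuild the per-thread call stack from function entry/exit events and accumulate per-function latency. This is not the speaker's actual script; the event tuple layout is assumed.)

```python
# Rebuild call stacks from (timestamp, thread, enter/exit, function) events and
# accumulate inclusive latency per function. Event layout is an assumption.
from collections import defaultdict

def per_function_latency(events):
    """events: iterable of (timestamp_us, thread_id, 'enter'|'exit', func_name),
    assumed sorted by timestamp and properly nested within each thread."""
    stacks = defaultdict(list)       # thread_id -> [(func, enter_ts), ...]
    inclusive = defaultdict(float)   # func -> total inclusive time in us
    calls = defaultdict(int)         # func -> number of completed calls
    for ts, tid, kind, func in events:
        if kind == "enter":
            stacks[tid].append((func, ts))
        else:
            entered_func, enter_ts = stacks[tid].pop()
            assert entered_func == func, "unbalanced trace"
            inclusive[func] += ts - enter_ts
            calls[func] += 1
    return {f: (inclusive[f], calls[f]) for f in inclusive}

# Tiny synthetic trace: aio_operate calls op_submit on thread 1.
trace = [
    (0.0, 1, "enter", "aio_operate"),
    (5.0, 1, "enter", "op_submit"),
    (45.0, 1, "exit", "op_submit"),
    (60.0, 1, "exit", "aio_operate"),
]
for func, (total_us, n) in per_function_latency(trace).items():
    print(f"{func}: {total_us:.1f} us over {n} call(s)")
```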
H: You can selectively turn it off just by disabling that specific function trace, and the way you disable a function trace is just a very simple macro. If you really want to trace a function, you just add a macro called FUNCTRACE; that essentially instantiates an object that takes a timestamp at the entry point, and then, when it goes out of scope, it takes the timestamp at the exit, and that's how you essentially get the trace.
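(Editor's note: the macro described above is effectively scope-based, RAII-style timing — an object records a timestamp when it is constructed and reports the elapsed time when it goes out of scope. Below is the same idea as a Python context manager, purely as an analogy; names are invented.)

```python
# Scope-based timing analogue of an entry/exit trace macro.
import time
from contextlib import contextmanager

@contextmanager
def functrace(name):
    start = time.perf_counter()            # "constructor": record entry timestamp
    try:
        yield
    finally:                               # "destructor": record exit and report
        elapsed_us = (time.perf_counter() - start) * 1e6
        print(f"{name}: {elapsed_us:.1f} us")

def do_write():
    with functrace("do_write"):
        time.sleep(0.002)                  # stand-in for the traced work

do_write()
```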
H: There's a shell script to comment all of those out and then selectively turn them on if you want to, but at the end of the day, once you get the trace, you can essentially paint the stack of all the operations and then look at a breakdown of the latency for each one of those operations. The intent of actually submitting the pull request for the different function traces is that, based on the testing we did, there are certain critical functions where we really want to do that tracing, a couple of layers down.
H: We want to look at the breakdown, but once you do that analysis, if you really want to zoom in — because it captures gigabytes' worth of data, you really don't want to do this tracing when you want to run a test for a longer period — once you hone in on a specific set of functions where you see the bottleneck, you want to filter out everything else except those few functions where you want to collect the trace, and then you can run it at volume, so that you are able to compute the latencies without doing this indiscriminately. So you really want to look at this detailed trace as the starting point, then start trimming it down to the functions that you really want to focus on and optimize, and then just collect those at volume.
B: To give a sense of what the overhead is like: if you have a trace on an outer function and you look at what your average microsecond measurement is, and then you add a bunch of traces in the functions that it calls, have you looked to see what the effect on the outer function's runtime is?
H: No — you mean just the LTTng tracing overhead itself? Yeah.
H: Okay, so that's essentially the list of functions and the latency breakdown — you get very minute detail — and the OID tracing is the one that I normally use as a way to figure out how long things are taking. There are two specific sets of events that I normally focus on when I look at the OID tracing. One is: if I submit a message from the client, how long does it take from the client to the OSD that actually picks up that message, and how long does it take to reply to that message? Those two are the latency vectors that I normally care about from a network latency perspective, so I use the OID event as a way to track that. The second one is thread switching: there are different places where you do a thread switch, and the function-level latency breakdown is not going to give you the thread-switching latency, so I use the OID event to figure that out — you submitted, say, a message into the dispatch queue, and then you dequeue that operation, and there is a thread switch in between.
H: So this one is actually coming from an OID event; essentially it's saying how long — let me go through this tagged one. This is the rados bench write sequence: you are submitting a write operation through the aio_operate call — that's your entry point — the AsyncMessenger picks it up and writes the message, and then there is a delay: how long it takes from the moment a message gets submitted on the rados client to the time the OSD picks it up.
H: These two things overlap a little bit, because I am overlaying the OID events as well as the function events together, but essentially there is the dispatch function latency — this is the breakdown of that function's latency — and then the dispatch-to-dequeue-op thread-switch latency.
H: That is where the thread switch happens, and the question is how long it takes, and then how long the dequeue_op function takes, and then, once you do the dequeue_op, there is a thread switch from dequeue_op completion to op_applied. Of course there is a series of things that happen in between, but through the Finisher it comes into the op_applied function.
H: It's a very similar pattern here for the read. As you can see, the major component is the rados-to-OSD network latency. As you dial up the queue depth, you are going to see that pattern repeat quite a bit: if you look at queue depth 1 all the way to queue depth 16 with a single client, you can see the latency just keeps going up and up, and the same thing on the reverse — you will see the latency going up and up on the rados side, on the receiving end,
when the response gets back. So you will see that pattern as the queue depth goes up, and this is one of the reasons that, when the full-blown scale-out performance data is shown, with more clients and increased queue depth, you can see a little bit of this latency pattern. It may be useful to clarify that this measurement was done on a single node, with a single NVMe.
H: Everything is contained and isolated locally, because I can get the LTTng trace events from rados and the OSD with the same timestamps, in the same sequence, so I don't have to do acrobatics to sort them out, and I can control the number of parameters and keep the noise level to a bare minimum, just to look at the latency of the OSD and rados layers. Typically, what you want to do, once you do this latency analysis across—
H: Yeah, this is not a completely maxed-out scenario. I'm kind of torn between the number of LTTng traces I have to collect — it starts crapping out depending on how long I need to run the test — but in general the point I was making at the beginning was that once you get a handle on it, you really need to shortlist the functions you want to focus the latency analysis on.
H: Then you can actually go full-blown — multiple OSDs on NVMe SSDs — and stretch it to the maximum and look at where the bottlenecks are. That's the right thing to do, but with the number of events that you are capturing — I'm really only running for five or six seconds and the amount of data is around 20 gig or so — it gets mind-boggling from an analysis perspective. But really, that's the next step.
G: What I'm saying is — I suspect librbd is only opening one rados context, and that is where the bottleneck is. We saw exactly the same problem in RGW, and we went in and modified RGW to allow it to have multiple rados contexts, and that basically eliminated this problem. It's trickier here, maybe, because of the sequential/parallel interlock issues that you have to deal with — consistency is simpler when you only have one pipe — but I'm wondering if that's really what this is showing.
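(Editor's sketch of the idea G describes, using the librados Python bindings: open several independent cluster handles and IO contexts from one client so that requests are not all funneled through a single rados context. The pool and object names are made up, and whether this actually helps librbd is exactly the open question being discussed.)

```python
# Hypothetical: spread writes across several independent librados cluster handles
# instead of a single shared one. Pool/object names are invented.
import rados

NUM_CONTEXTS = 4
POOL = "rbd"

clusters, ioctxs = [], []
for _ in range(NUM_CONTEXTS):
    c = rados.Rados(conffile="/etc/ceph/ceph.conf")
    c.connect()
    clusters.append(c)
    ioctxs.append(c.open_ioctx(POOL))

completions = []
for i in range(64):
    ioctx = ioctxs[i % NUM_CONTEXTS]            # round-robin ops across contexts
    completions.append(ioctx.aio_write_full(f"bench_obj_{i}", b"x" * 4096))

for comp in completions:
    comp.wait_for_complete()                    # wait for all in-flight writes

for ioctx in ioctxs:
    ioctx.close()
for c in clusters:
    c.shutdown()
```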
B: I'm not sure. I guess my question is: does it matter? In the case where you have a single client — is that actually a case that we would care to optimize? Or are we more concerned that, because in general an OSD is going to be sharing its work between a bunch of clients, you're going to get the parallelism there anyway?
H: So, one interesting point before I actually wrap up and hand the floor over: this is probably a philosophical discussion more than anything else. If you look at the core IO into the media and exclude everything else — take out the locking, the network latency, and everything — it's a very interesting data point. You are essentially
H
Around
you
know,
maybe
around
200
microseconds
again,
this
is
for
the
right
place
will
be
a
little
bit.
You
know
faster
right,
even
if
I
increase
in
the
queue
depth
and
increase
in
the
number
of
clients
it
is
ranging
anywhere
between.
You
know,
200
to
1
50.
It
keeps
changing,
but
look
at
that
as
three
hundred
microseconds
for
the
time
being,
and
if
you
look
at
the
into
and
latency
you
know
you
are
essentially
looking
at
probably
the
core
I
was
taking
our
twenty
percent
of
into
and
latency
the
rest
of
that
is
pretty
much.
H
You
know
the
the
whole
context,
management
network,
latency
and
threading
management.
To
me,
that
is
really
really
high.
If
you
really
want
to
look
at
for
the
envy
mes,
LTS
and
optimize
it
that's
where
we
need
to
be
focusing
on
from
an
optimization
perspective,
the
core
I
was
taking
longer,
then
it's
a
different
problem
and
that's
probably
the
case
with
the
hard
disk
drives,
but
it
the
problems
which
is
completely
with
the
SSD
drives.
H: Okay, so there is a pull request. I do want to drop in the Python scripts as well, the ones that parse the function traces. I don't know, Mark — let me know what would be the right place for them. It doesn't look like they belong in the actual Ceph repo; it looks like they should go somewhere else. If you let me know, I can drop in those scripts and maybe put in a nice README file so people can use them.
B: One quick note on that last comment: you were looking at the issue_repop piece, basically, and adding that up — that's 200-some microseconds — but that's actually only the front half. That's basically preparing all the metadata, getting everything ready, and queueing it to the device; it doesn't actually block waiting for the IO in any of those functions.
B: I think the actual I/O number that you're seeing is line 41, where you have the dequeue-op-to-applied thread-switch latency, because there some other thread is going to pick up the I/O completion and then wake up and trigger it, so that includes all the context switches too, but the actual cost of the I/O is in that number, I believe.
B: But I think there's work to do there in a couple of different places. One is that we can spend less time preparing the I/O, if we can optimize that path, and then we probably need to narrow in more on the latency between when the aio is queued to the device and when we get the completion, and try to separate that from the various thread context switches that happen along that path.
B: I think the other ones get at the things that worry me — like, you know, line 26, find_object_context: that's 71 microseconds just to get the object that we're going to operate on. And prepare_transaction is somewhere we can probably make some improvements, and things like finish_ctx — all of these are adding up, right, a few microseconds here and there.
H: Of course, in the lab environment I have all the CPUs I can use at my disposal, so I basically put in whatever happened to fit. But yes, it is too high. Okay — but again, keep in mind, I'm really looking at this as just a means to get a latency breakdown and an optimization focus.
H: We need to look at the BlueStore latency, from the time the op is submitted to the time it completes — which I don't have here — to really look at the I/O latency; that's another optimization point. And then I think the third one is what Sage and Sam were talking about: going after each one of these, you know, find_object_context and the few operations in the critical dequeue_op flow that we could potentially optimize, I think.
H: The setup here is essentially one OSD and librados clients running on the same host machine, just to keep it at a micro level — there is absolutely no replication, nothing, just one OSD. You are going to see more and more latency breakdown once you start expanding the scope to, let's say, two OSDs on two different nodes, where the client is running on a different host than the OSD; that is essentially a broader
scope, where you're going to see a lot more other problems and you'll get lost in the system. Yeah, okay — we're going to run out of time, and I do want to give some time to the others too. Yeah, sure — so, basically, what we have shared in the past — and especially on the question about cluster size — we actually now have results on a larger cluster.
H: In the past we have been sharing data with multiple OSDs per NVMe on an all-NVMe node, and, as we had discussed with Sage, we do want to move away from that model, which is sort of a band-aid. So we actually ran some data, literally last night, on the latest Ceph master as of yesterday, on a five-node cluster. Let me see if I can put up the picture here.
H: It's basically a five-node cluster with six NVMe OSD drives each — P3700-class SSDs — with one OSD per NVMe; the data, WAL, and DB partitions are separate on each NVMe, and we have seven clients with ten RBD volumes each. And basically, with this test setup, you know, for reads:
H: Yeah, so the chart is basically a latency-versus-IOPS chart, with the markers being the queue depth, and these are the queue depths as seen by fio RBD from the client side, so they do need to be translated to queue depths at the device level. The way the configuration is set up, at, say, a queue depth of 16, the queue depth at the OSD is right about 32.
H: But the point here is that, from an efficiency standpoint, if you look at a typical P3500-class NVMe SSD today — which is in fact on the lower end of performance when we look at the 2017 range of SSDs — it's about 450K 4K IOPS, and the IOPS per NVMe that we are seeing are right around the 10 to 12K range. So we are
right at about three percent utilization today from a read standpoint, and we basically just wanted to give you guys this snapshot of where we are today. As you can see, after a queue depth of 16 — and in fact some analysis has already been done on that; do you want to tell them about the five- and ten-client experiment? Yeah.
H: So we're right at about forty percent CPU utilization somewhere between that queue depth of eight and sixteen, and at 128, as you can see, the CPU stays under fifty percent. And from an NVMe utilization point of view, these IOPS numbers per OSD — or, sorry, per NVMe, which is the same thing with one OSD per NVMe here — do correspond to the backend read utilization.
H: You know, with one OSD per NVMe and a six-NVMe system today — basically the dense configuration that we are looking at for 2017, which a number of people have expressed interest in — in this class of cluster we were not able to scale beyond a queue depth of 8, and I think the effective queue depth at the OSD level is right around 16.
A: Okay — that is a very low IOPS-per-OSD number compared to what I've seen on P3700s, but we can talk about that; that's fine, yeah.
H: Yeah, I think we should definitely compare notes. But, with the background that was provided, this is where we're marching: the goal is to really get the performance up in this case, where you're hosting one OSD per NVMe, right.
H: Yeah, some of it is — yes, I think we have tried a few configs, but what we are showing is basically the defaults right now. We have tried, I think, Mark's config, which I think performed similarly, so the data that you are seeing right now — some of it is actually with the RocksDB default settings.
H: Correct, yeah. And in fact, this past week we ran into some issues — some OSD crashes when we changed the settings — so that exploration is still pending. We actually started looking at the various options that are being shown here, but we've run into some problems; you can probably talk about the behavior you saw, yeah.
A: Okay, okay — and, I don't know, this is probably going to be difficult to do off the top of your head, but do you happen to know approximately how much of that was due to parallelism?
E: It's a brand-new microphone, but somehow the audio isn't great. So, you say that you correlated those numbers with what's coming from the device — so that means there is no read amplification going on from the DB on top of the data, and you're still getting this number? Have you actually checked that? Is there any compaction or anything else going on, or is it only data reads, so that we—
A: Okay — everyone, feel free to join us in half an hour for the monthly Ceph developer meeting, if you're able, and we'll reconvene again here next week, hopefully with more ZetaScale BlueStore testing done and also more data on RBD EC overwrites. See you in half an hour, guys.