From YouTube: 2016-OCT-12 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
B: There are just two new, kind of interesting ones this week, but both are good. One is that Ramesh has been working on the MemDB store interface, basically storing our key/value data in an in-memory structure, and he's switching from a b-tree implementation to generic maps, just because the b-tree implementation was actually not performing very well. So it'll be interesting to see how much that improves things, but yeah, he's working on that.
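As a rough illustration of the kind of switch being described (a hypothetical sketch, not the actual Ceph MemDB code), an ordered std::map already provides the sorted iteration a key/value store backend needs, which is what makes a "generic map" a drop-in alternative to a custom b-tree:

```cpp
// Minimal in-memory ordered key/value store backed by std::map, the
// "generic map" alternative to a custom b-tree implementation.
// Hypothetical illustration only; not the actual Ceph MemDB interface.
#include <map>
#include <optional>
#include <string>

class MemStore {
  std::map<std::string, std::string> kv_;  // red-black tree, keys kept sorted
public:
  void set(const std::string& k, const std::string& v) { kv_[k] = v; }

  std::optional<std::string> get(const std::string& k) const {
    auto it = kv_.find(k);
    return it == kv_.end() ? std::nullopt
                           : std::optional<std::string>(it->second);
  }

  // Sorted range scan: the operation that makes an ordered map (rather
  // than a hash map) the natural fit for a KV-store backend.
  template <typename Fn>
  void scan(const std::string& lo, const std::string& hi, Fn fn) const {
    for (auto it = kv_.lower_bound(lo); it != kv_.end() && it->first < hi; ++it)
      fn(it->first, it->second);
  }
};
```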
B: The other one is that Dan Lambright has a first attempt at using RCU for, I think, the PG info (he was doing the PG map), and that's really exciting, because I think once he gets that working, there are a number of places we might be able to use it. The big thing that I'm hoping for is that we'll be able to speed up fast dispatch with this. But I guess we'll see.
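For context, RCU (read-copy-update) lets many readers walk shared state without taking locks while a writer publishes a fresh copy. Here is a minimal sketch of the pattern using std::shared_ptr atomics; the names are hypothetical and this is not Dan's actual patch (real RCU libraries such as liburcu also manage grace periods explicitly):

```cpp
// Read-copy-update in miniature: readers take a snapshot with no lock;
// the writer copies, modifies, and atomically publishes a new version.
// Illustrative sketch only, not the Ceph implementation.
#include <atomic>
#include <map>
#include <memory>
#include <string>

struct PGInfoTable {
  std::map<int, std::string> info;  // stand-in for per-PG metadata
};

std::shared_ptr<const PGInfoTable> g_table =
    std::make_shared<const PGInfoTable>();

// Reader: one atomic load, then lock-free access to an immutable snapshot.
std::shared_ptr<const PGInfoTable> read_snapshot() {
  return std::atomic_load(&g_table);
}

// Writer: copy-on-write update published with an atomic store. Old
// snapshots stay valid until the last reader drops its shared_ptr.
void update_pg(int pgid, std::string v) {
  auto cur = std::atomic_load(&g_table);
  auto next = std::make_shared<PGInfoTable>(*cur);
  next->info[pgid] = std::move(v);
  std::atomic_store(&g_table,
                    std::shared_ptr<const PGInfoTable>(std::move(next)));
}
```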
B: That's 11393, if anyone is interested in playing around or testing it, although I'm not sure if it actually works yet or not. So those are kind of the two new interesting ones that are there this week. A couple of different things closed; most of it was BlueStore related. There's been a lot of work on, excuse me, reducing the size of the in-memory structures for blobs and extents and other things like this. This can be a really important thing for us to be able to do before kind of unleashing BlueStore on the masses, just because it's pretty tough right now, based on kind of the variable size that, you know, these data structures can have, depending on things like the min alloc size. So yeah, just lots of work in that area.
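To make the memory concern concrete, here is a hedged sketch of why per-extent bookkeeping multiplies: with a small allocation unit, a single 4 MB object can carry a thousand or more extent records, so even a couple of dozen bytes per record adds up. The struct below is hypothetical, not BlueStore's real onode/blob/extent types:

```cpp
// Hypothetical illustration: shrink the allocation unit and the number of
// in-memory extent records per object grows inversely, so per-record size
// matters. Not BlueStore's actual data structures.
#include <cstdint>
#include <cstdio>

struct ExtentRec {        // one mapped range of an object
  uint64_t logical_off;
  uint64_t disk_off;
  uint32_t length;
  uint32_t flags;
};                        // 24 bytes before any container overhead

int main() {
  const uint64_t object_size = 4ull << 20;      // a 4 MB object
  for (uint32_t alloc : {4096u, 16384u, 65536u}) {
    uint64_t nrec = object_size / alloc;        // worst case: fully fragmented
    uint64_t bytes = nrec * sizeof(ExtentRec);
    std::printf("alloc unit %6u -> %5llu extent recs, ~%llu KB of RAM\n",
                alloc, (unsigned long long)nrec,
                (unsigned long long)(bytes / 1024));
  }
  return 0;
}
```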
B: Let's see, the other one to look at for updates is 11213; I'm hoping that we can get that merged soon. That's really nice: it reduces the size of PG info, or at least reduces the amount of data that we write every time, by, I guess not lazily, but more seldomly updating a lot of fields that don't really matter that much. So that's really good for BlueStore; that actually improves random small write performance.
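A hedged sketch of the general technique being described: split metadata into hot fields persisted on every update and cold fields flushed only occasionally, so the common-case write is small. All names and the flush threshold are hypothetical, not the actual pg_info_t change:

```cpp
// Pattern sketch: persist frequently-changing "hot" fields every time, but
// flush rarely-needed "cold" fields only occasionally, shrinking the write
// issued on the common path. Not the real Ceph pg_info_t encoding.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct HotInfo  { uint64_t last_update; uint64_t log_tail; };
struct ColdInfo { uint64_t stats_seq;   uint64_t history_epoch; };

struct FakeKV {                                  // stand-in KV backend that
  uint64_t bytes_written = 0;                    // just counts bytes
  void put(const void*, std::size_t len) { bytes_written += len; }
};

struct PgRecord {
  HotInfo hot{};
  ColdInfo cold{};
  uint32_t since_cold_flush = 0;

  void persist(FakeKV& kv) {
    kv.put(&hot, sizeof(hot));                   // small write, every update
    if (++since_cold_flush >= 64) {              // cold fields written rarely
      kv.put(&cold, sizeof(cold));               // (threshold is illustrative)
      since_cold_flush = 0;
    }
  }
};

int main() {
  FakeKV kv;
  PgRecord pg;
  for (uint64_t i = 0; i < 1000; ++i) { pg.hot.last_update = i; pg.persist(kv); }
  std::printf("bytes written for 1000 updates: %llu\n",
              (unsigned long long)kv.bytes_written);
  return 0;
}
```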
B: Yes, just... okay, good deal, though. The biggest issue I see right now with BlueStore, in terms of how we're doing relative to FileStore, is with sequential reads, small sequential reads. And you know, this is one of those things where hopefully you don't see it that often: you know, if programs are doing small sequential reads now, they're probably not that well written. But it looks bad from the standpoint that, you know, if someone does benchmarking, you can look a lot worse.
B: So here, this is kind of the percentile difference graph I've got, showing kind of where FileStore was at Jewel, where BlueStore was at Jewel, where each of them are now, and then what happens when we change the min alloc size. This is with NVMe devices, but when we increase it... and you can see here that we're not doing so great with BlueStore, and we're doing quite a bit worse, actually, than we were with Jewel, in master. If we increase the min alloc size, it does help pretty dramatically.
B: The problem is... well, that's exactly it, right? I mean, if it's fragmentation, then it might be that just increasing the min alloc size will help this and make it better. And one thing in BlueStore in Jewel is that the default min alloc size for SSDs was 64K. So that might explain why, in Jewel, we were seeing higher sequential read throughput in this test.
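For anyone who wants to test the fragmentation theory themselves, the allocation unit is tunable in ceph.conf. A minimal sketch, assuming the BlueStore option names of this era (the values just mirror the numbers discussed above, they are not recommendations, and the setting only applies to newly created OSDs):

```ini
# ceph.conf sketch: raise BlueStore's minimum allocation unit on SSDs back
# toward the old 64K Jewel-era default to test the fragmentation theory.
[osd]
bluestore_min_alloc_size_ssd = 65536   # bytes; 4K/16K were the values under discussion
```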
B: It's on the list of things to test again, to go back and look at that. But I think that might make... if that's the case, right, then that means it's probably fragmentation. But, you know, potentially there might be other things going on too, so I don't know. What were your thoughts, Alan?
A: Well, I think that the history of storage devices suggests that we're going to end up having to go tackle compaction, or, you know, defragmentation, whatever you call it, eventually. Yeah, you know, I mean, even with XFS, which is a pretty advanced file system, if you talk to the users, they'll tell you that you really end up having to defragment the volume over time, and I suspect that we're going to have the same issue with BlueStore at the end of the day, you know, whatever we do with the allocation size.
B: Agreed, but at the same time, right now it's looking like with a small min alloc size we're generating so much metadata, even with all of the improvements that we've made, that we're just hammering RocksDB, kind of to the point where doing the extra write-ahead log write is actually cheaper than dealing with all the metadata. So, you know, there are multiple aspects to this, right; there are multiple trade-offs that we have to think of. Well, do we know that that's what's actually happening?
B: What we've seen in the past is that, with RocksDB, when you have a small min alloc size, there are so many blobs and so many extents getting shoved into it that the write amplification and compaction overhead just kind of start killing things. There are lots of reads on the DB partition; there are lots of writes on the DB partition.
E: And basically I am seeing around 2.5x more writes if I have, like, a min alloc size of 4K versus a min alloc size of 16K. We have, like, one giant chunk write of like 4K that is going to the WAL, so if you don't consider that, what is leaking into the SST files... the amount of writes going to the SST files is really smallish, around 500 bytes, five or six hundred bytes, versus probably 1.5K.
B: Yeah, well, I mean, to be honest... it might be that we need to consider key/value stores that have less write amplification and less compaction overhead, and that might be what we'll see. Maybe we can tweak RocksDB in ways that we don't understand yet to avoid some of this, and maybe we can continue to shrink the amount of data that is in the onodes, right?
B: You know, maybe, with all of these things together, we can get to the point where we're actually better off not doing the write-ahead log write, as opposed to, you know, the increase in metadata. But it doesn't seem like we're there yet; just based on kind of what I've seen, it looks like we're actually better off eating the extra write-ahead log write. But...
A: That's one way of doing it, but clever usage, you know, of allocation strategies might do the same for you, in the sense that if you have multiple streams where you're fetching or where you're storing data, if you sort of leave space, so to speak, before them, under the assumption that that's what was going on, you would achieve the same thing.
A: Yeah... I mean, again, it's another trade-off. If what you do is you say, okay, I'm going to spread the writes around and leave a lot of space around them, in the hope that whoever wrote this is going to come by later on and do another sequential write and I'll have a place to put it, so that I don't have to expand the extent structure, that will dramatically improve... this is running on flash.
A: I think, you know, there are other potential solutions for this problem. So, for example, what you could do would be to go ahead and allocate large chunks of space, do direct writing, and keep track of the extras. And, you know, if the extra space wasn't consumed in some appropriate interval, you'd go back and reclaim it. There are all sorts of strategies you can think of, yeah.
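Here is a hedged sketch of that idea: over-allocate for each write stream, let a later sequential write extend in place, and reclaim whatever is still unused once the stream goes idle. All names are hypothetical; this is a strategy illustration, not the BlueStore allocator:

```cpp
// Strategy sketch: reserve slack after each stream's extent so a later
// sequential write can extend in place; reclaim reservations that go stale.
// Hypothetical illustration, not the real BlueStore allocator.
#include <cstdint>
#include <map>

struct Reservation {
  uint64_t start, used, reserved;   // offsets/lengths in bytes
  uint64_t last_write_tick;
};

class StreamAllocator {
  std::map<uint64_t, Reservation> by_stream_;
  uint64_t next_free_ = 0;          // bump allocator stands in for a real one
public:
  // Place `len` bytes for `stream`, holding back `slack` extra bytes.
  uint64_t write(uint64_t stream, uint64_t len, uint64_t slack, uint64_t tick) {
    auto it = by_stream_.find(stream);
    if (it != by_stream_.end() &&
        it->second.used + len <= it->second.reserved) {
      Reservation& r = it->second;        // sequential case: extend in place,
      uint64_t off = r.start + r.used;    // no new extent record needed
      r.used += len;
      r.last_write_tick = tick;
      return off;
    }
    uint64_t off = next_free_;            // fresh extent plus slack for later
    next_free_ += len + slack;
    by_stream_[stream] = Reservation{off, len, len + slack, tick};
    return off;
  }

  // Give back reserved-but-unused space from streams idle past a deadline.
  void reclaim(uint64_t now, uint64_t idle_ticks) {
    for (auto& entry : by_stream_) {
      Reservation& r = entry.second;
      if (now - r.last_write_tick > idle_ticks) r.reserved = r.used;
    }
  }
};
```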
A: I like that; I mean, that's kind of the inverse of the write-ahead log, which is, sort of, you know, instead of leaving the data there, you leave an annotation that "I've allocated more space than I need for this." And, you know, if you limited the number of those, you could go back and clean that up.
B: I guess the thought I have, though, is that, as we progress here with storage technology, we're starting to see things like NVRAM become more accessible to people, and, you know, other really fast, small amounts of memory that persist. So is the write-ahead log really a bad solution if you potentially have a small pool of really fast persistent memory you can target for it?
B: That's fair enough! Well, let's continue through some of these numbers so that you can actually see some of this other stuff. So, for sequential writes, we're doing beautifully right now, at least in the tests I've seen, I mean both large block writes and small sequential writes; basically all sequential writes with recent BlueStore, in all the tests I've done, look good.
B: Okay, yeah, no, no worries. So yeah, sequential writes, we're doing really, really good. In Jewel we were bad for various reasons, but now we're better than FileStore, and for large sequential writes we've always been better, just because that's kind of the whole benefit of the way BlueStore writes data.
A: But, you know, you should be getting pretty damn close if your writes are large enough. If you do four-megabyte writes, and in the end the metadata overhead is, you know, three or four 4-kilobyte chunks, you're dealing with 0.3 percent, right? Yeah.
A: I wouldn't be at all surprised to find out that there were serializations that occur because of replication that are indirectly preventing you from obtaining enough parallelism to saturate the CPU. You know, do you have enough queue depth, or do you have enough parallelism at the various stages of the OSD? And if that's what... you know, if the single-OSD experiment shows that it gets to the 99 percent that you would expect, and the other one doesn't, but you're not network- or CPU-limited, then we've got code to go fix.
A: Okay, you know, I mean, realistically, in the competitive world nobody's going to care about how you do relative to FileStore; they're going to care how much out of the hardware you extract. Yeah, I'm sure, you know. And what these graphs tell me is there's still some kind of plumbing problem internally in the OSD. It might not be BlueStore, it might be elsewhere, yeah, okay, but there's some kind of plumbing problem, because, you know, the shape of this graph... you know, if you think about it, as the block sizes increase, it should...
A: I'm afraid so. Instead, you know, if you're just going to count the IOs, and you're going to say that the small random ones count more, if they're two or 3x the cost, which is not unreasonable, okay, now you're at 95 percent; but you're struggling here to get to sixty and seventy percent.
B: Just Jewel, you know, the Jewel release, and this is the work-in-progress release based on a recent master from last week. So basically the WIP version of FileStore here is using the async messenger, whereas the Jewel one is not, and there's probably a bunch of other random stuff that's in the work-in-progress one, although the async messenger is probably the biggest difference between the two.
B: Alright, let's move on then to random reads. So in the random read case, what you're seeing here, this big dramatic drop, is entirely due to async messenger. It happens both with FileStore and with BlueStore, so BlueStore itself is not really causing much of this. We saw in the Jewel era that BlueStore was doing a little worse, maybe, than FileStore, but in fact with BlueStore... well, there again, this might be async messenger... we're actually doing a little bit better at like 128K to maybe like 2048K reads.
B: The good news is that this can be mitigated quite a bit by setting send-inline to false in async messenger; that helps quite a bit. Probably the other thing that needs to be done here is speeding up fast dispatch. So if nothing else is done, we're probably, even with send-inline false, going to be seeing maybe a ten percent performance regression for random reads at small IO sizes.
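For reference, the knob being described is an async messenger option in ceph.conf. A minimal sketch, assuming the option names as they existed around this release (verify them against your version before relying on this):

```ini
# ceph.conf sketch: queue sends to the async messenger's event loop instead
# of sending inline, the mitigation for the random read drop discussed above.
[global]
ms_type = async              # opt in to the async messenger
ms_async_send_inline = false
```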
B: It's not going to be as bad as what you're seeing here, but we may still see one. We still may see a bump here at like 128K-plus IO sizes, though, so it's a little offset. But it'd sure be nice if we didn't see that small random read regression, so we'll see, we'll see if anything else can be done. In the meantime, Dan is working on the RCU, this RCU...
B: That was kind of the old... so, like, in the sequential read, starting out up here, there's a...
D: The better, yeah; it either goes in the better direction or the worse direction at 32, right? If you see sequential read and random write, from 32 you start to increase, right, but for random read, from 32 you decrease, and then for random write, from 32, you go down...
B: In fact, it's hard to remember exactly everything that's gone in since Jewel, but kind of the goal in the last couple of months here has been to eliminate problems like this one that you're seeing with the random write performance, where we're dipping way down and then maybe we're improving, but maybe not. There have actually been some intermediate releases here where the random write performance actually looked really, really bad. But the good news here is that, especially with random writes, if you look kind of here at the IOPS graph, you can see that for BlueStore, no matter if we're using like a 4K min allocation, like in this green line, or using a 16K min allocation, which is kind of our proposed...
B: ...maybe the idea for the release here, we're actually consistently above FileStore. There's kind of our dip at 8K here, but overall we're doing better. That's good, and that's what we were hoping to do over the last couple of months here, which is basically to really shore up the random write performance. The things that we really need to look at here now, going forward, are going to be...
B: Otherwise, we may want to think about the trade-off that we're taking versus simple messenger, or at least instruct our users to think carefully about that. And then also the sequential read issue; that's the one that looks the worst, right? And even though small sequential reads probably aren't something that you'd hope a lot of applications are doing...
B: ...that's the one that, you know, if you're benchmarking, really sticks out. So, for the kind of last two here, the sequential mixed read/write graphs, it kind of tends to follow the first one, the read performance: the sequential read performance regression for small IOs just kind of sticks out here. It's not as pronounced, but it still happens; but our sequential write benefit is helping at wider IO sizes. So that's why it's...
B: This kind of looks almost like a mix of the two: we see the large writes really helping out at large sizes, and then we see the small sequential read performance really hurting us at small sizes here. And random read/write, again, kind of follows the others: because of the issue with async messenger, we're seeing small IO kind of dropping down, with the writes helping pull us out. So anyway, that's kind of where we're at with it. My personal take on this is that we need to be looking carefully at the allocation size that we use.
B: We need to figure out if there's some kind of, you know, async support for pulling in multiple extents at once from the metadata structure, or doing it as a series; in general, I'm not sure what Sage exactly wants to do there, but I think he's got ideas. So that's another big thing, and then kind of anything that we can do with fast dispatch or async messenger to kind of gain back that last little bit, I think that would be very good. So those are kind of the priorities I'd see.
B: The other thing I guess I had to talk about, then, was that we are also looking at how to tune RocksDB's write-ahead log. Unfortunately, our test lab is sort of broken right now, since Dreamhost has been having some problems that have affected the DNS server and some other things, but I did get some initial results, and here's kind of what I'm seeing so far.
B: The trend I'm seeing is that basically just increasing either the number of logs or the size of the write-ahead logs in RocksDB seems to help kind of stabilize the long-term performance in the tests I did. Now, these are only one-hour tests, as opposed to like the 10-hour tests that Somnath has been doing, so it may be that we need to make sure that doesn't hurt us in really long tests.
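A hedged sketch of the kind of tuning being described, passed through BlueStore's RocksDB options string; the option names are standard RocksDB, but the values here are only illustrative, not the ones from these tests:

```ini
# ceph.conf sketch: give RocksDB more and larger write-ahead-log headroom.
# RocksDB option names are real; the values are illustrative only.
[osd]
bluestore_rocksdb_options = write_buffer_size=268435456,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,max_total_wal_size=1073741824
```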
B: And actually RocksDB will record its own writes, and so you do get that. I think the trick here, though, is that when we are doing write-ahead log writes, based on kind of some of these parameters that you have tuned, you may potentially be leaking them down into level zero, at which point they may even get compacted into level one, depending on kind of how short-lived versus how long-lived they are, and whether, potentially...
B: ...you know, the moon and stars line up regarding compaction. So that's something that I think we need to understand better: when you have a small, like a 4K, min alloc size, right, you're not really doing WAL writes anymore for, you know, most of these kinds of IOs anyway, so you're not going to have any leakage, but you now have much bigger metadata structures to deal with. You have many, many more onodes to deal with... well, not more onodes, but many more extents to deal with. So...
A: ...log entries. I mean, I agree with you; I mean, if we are not diligent at pulling write-ahead log entries out of RocksDB, there's a chance that it will merge them into its SST files, and you'll see, you know, an increase in write amplification. There's no question about that. But I suspect that... we know what we're putting in the front and we can see what comes out the back; if you're leaking these write-ahead...
B: The good news is it hasn't been entirely fruitless, because, I mean, the trend that seems to be showing up is that we do better with larger write-ahead logs, which kind of makes intuitive sense, right? The trade-off, then, is the question of, you know... that uses more memory. So, you know, do we trade the memory for having a larger log and the better, more consistent performance?
A: You say you'll do better... you say you do better, but in which test cases? I mean, random... yeah, so, like, primarily random writes, yeah. So I would question that: as long as you assume that the write-ahead log entries are purged from RocksDB reasonably quickly, you know, in well less than the size of a log file, then I don't understand why there's any correlation between the size of the log file and the performance.
B: They do get larger and longer. But if you are to the point where you're in the middle of a flush, and you have filled up any log files that are not... oh, anything that's not flushing, right. If you've gotten to the point where anything that was available to you is now full, and you're now flushing all the other ones, now you've blocked writes, and that's the behavior that we see: we see these weird stalls.
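For anyone digging into those stalls, RocksDB exposes the thresholds that gate when foreground writes get slowed or stopped. A minimal C++ sketch using real rocksdb::Options fields (the values are illustrative, not what BlueStore ships):

```cpp
// Sketch of the RocksDB knobs that decide when foreground writes stall.
// Field names are real rocksdb::Options members; values are illustrative.
#include <rocksdb/options.h>

rocksdb::Options make_options() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.write_buffer_size = 256 << 20;         // each memtable: 256 MB
  opts.max_write_buffer_number = 4;           // memtables that may pile up
  opts.min_write_buffer_number_to_merge = 2;  // flush once two are full
  // When every memtable is full while flushes are still in flight, or L0
  // files back up past these triggers, RocksDB first slows and then blocks
  // incoming writes: the stalls described above.
  opts.level0_slowdown_writes_trigger = 20;
  opts.level0_stop_writes_trigger = 36;
  return opts;
}
```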
A: Well, certainly the SSDs could be garbage collecting on you; you might get some excursions there, and again, by looking at the physical latencies we should be able to see that. But the theory with RocksDB is that it never stalls the front end for the back end. Okay, yeah... and that's not true, right? It potentially can; you have to run out of something, okay, and that's its behavior. You know, we may be driving it to that state.
A: I suspect they're going to find, like with a lot of these things, that the knobs that are there, and the documentation of them, are subject to interpretation, and just because it's never happened with any of their stuff... you know, I'm going to grant them the same discretion, okay. You're going to find that until you have a deep understanding of what these knobs do, you really don't understand how it works at all. Yeah, I...
B: Yeah, well, maybe, maybe that'll be the next thing that we can focus on here, then: eventually getting this stuff instrumented. The good news, though, is that, that aside, the work that you and Sage did getting all of that memory usage instrumented in mempools is going to be really, really useful. I mean, that's fantastic. So we are making progress in this regard; it's just... maybe, maybe we could do a...