From YouTube: Ceph Performance Meeting 2023-02-02
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A
I suppose we've got a decent number of people here, so let's just get started. All right, I see two new PRs this week. One is from Igor: "don't use the real wholespace iterator for prefixed access." Igor, do you want to talk about that a little bit?
B
Yeah. That's not the case which was originally brought up, the one about unbounded iterators. During my experiments I realized that we iterate over every column family for prefixes which do not belong to any specific column family: so, for instance, shared blobs and statfs, which are accessed once at startup or only for fsck.
B
That path uses the merged wholespace iterator, and in some cases, like my sandbox, that might take, well, tens of minutes if the DB is in that degraded state with tons of tombstones.
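A minimal sketch of the distinction Igor is drawing, written against the stock RocksDB C++ API; the helper is illustrative, not the actual Ceph code. Iterating a single column family directly avoids the merged wholespace iterator that has to visit every column family:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/iterator.h>
#include <rocksdb/options.h>
#include <memory>

// Illustrative: when a prefix lives in its own column family, iterate just
// that CF. A wholespace iterator would instead merge across all column
// families, which in a degraded, tombstone-heavy DB can take a long time.
std::unique_ptr<rocksdb::Iterator>
prefix_iterator(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  rocksdb::ReadOptions ro;
  return std::unique_ptr<rocksdb::Iterator>(db->NewIterator(ro, cf));
}
```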
A
Very good, very good. All right, next there's this PR from Adam K for BlueStore, improving the fragmentation score metric. Igor, I think you reviewed that. I haven't looked at it yet, but I think it's...
A
All right, let's see, moving on then: three closed PRs this week, all by the bot, unfortunately. So two different MDS PRs were closed. The first is this: removing the subtree map from the journal. I'm sad to see that one closed, but it's not surprising.
A
It's a really complicated PR, and I think, Greg, you might know more, but my understanding was that it was just too much. Maybe we couldn't really review it very effectively, and we ended up deciding to go a different route. Is that approximately right?
A
Sorry, this is, yeah, this is like an old one from Zhang to remove subtree maps from the journal. I think he was getting... I think he did maybe like...
D
I don't remember. Okay, so, Zhang has many PRs which haven't merged, that might get revisited as bandwidth and need open up, for subtree journaling in particular. There are like four, or possibly we're up to five now, that fit together in various ways. So I think we're probably going to go a different direction there, because Patrick and Xiubo both have some that I think work together, and that should be better, or at least simpler to understand and maintain. Okay, so that one, I'm guessing, is gone, but yeah.
A
And that actually leads into the next PR here, which is: Patrick has a PR that the bot closed, for skipping inode-with-cap iteration for empty directories.
A
Cool, all right. And the last one that the bot closed was "make _do_write_small never do buffered writes." I don't know who reviewed this one last. Igor, did you look at that at all?
A
Yeah, yeah, I know he was working on it quite a bit, but yeah.
A
Well, for now, since he was not here, I'll let Adam know separately when we talk, and see if he wants to reopen it or not. All right, for updated PRs this week: the RocksDB upgrade PR. I guess the author is having some issues with our Ceph rocksdb repo and needs additional help. I just briefly looked at it this morning, but I told him I'd try to figure it out, so one way or another we'll get the upgrade.
A
I think we really want that in for Reef. Next, adding primary balance scores: hey, do we have Laura or Josh Salomon today? Oh, you are... I don't even... where are you? Oh, there you are. Sorry, go for it.
E
We're just trying to get the balancer feature in for Reef, so we're mainly just waiting on the lab to be fixed. I think there are two PRs on there. One is for the read balance score, which is an addition to the OSD map that calculates a score for each replicated pool on how balanced the reads are; and the other one is mine, which is the overall balancer feature, and they're linked together.
E
They each depend on each other, but they're ready to go into testing as soon as the lab can handle it.

C
Awesome.
A
Yeah, yeah, absolutely, very cool. All right, let's see here: I think we took care of the last PR in the updated bunch, but the last one that I need to talk about is... oh, Igor, your faster BlueFS allocations in the AVL and hybrid allocators. I know Adam did an initial review of that and was not sure about it. I have not looked at it, and I apologize. Oh, if Adam can't help, I'll try to make time to look at it.
A
I have a bunch that fell off too, but yeah, I will try. Igor, if I don't, please feel free to email me and just, like, hound me until I do.
A
Okay, sounds good. Okay, let's see: lots of stuff in the no-movement category. I don't think there's anything super exciting here to talk about this week, more rocksdb stuff, but after we hear from David Orman's group we can maybe decide how much of this is still necessary or not. All right.
A
Well then, actually, on that topic: David or Cory, do you want to take over and talk a little bit about some of the experiments that you guys have made?
F
Yeah, that sounds great. Cory, I'll hand it over to you, but I'm happy to present details from our internal monitoring system, so feel free.
G
Yeah, so we had the cluster that had the issues we talked about a little over a month ago now, with PG movement and all the deletes in RocksDB causing iterators to be extremely slow, and then the OSDs crashing due to suicide timeouts after up to like 500 seconds. These OSDs, or this cluster, do have some spillover of the DB onto the spinning disks, and that was part of it. But, just over the past...
G
We started upgrading some of those nodes and playing with some ideas to work around that. One of the things we deployed to the OSDs was Mark's compact-on-delete filter options being passed through, and the other thing was something I...
G
...think I talked about a little bit a couple weeks ago, which is basically passing through max skippable keys, the RocksDB read option for that, from the BlueStore layer down to RocksDB when we're doing a collection list, so that I could bound the number of tombstones being iterated over; and then, at the OSD layer...
G
...just backing off and retrying whenever we see an excessive number of keys being iterated over, which was what was causing big latencies and, ultimately, the OSDs to crash. The results of deploying those two things in production were really good. We basically saw the compact-on-delete stuff, in and of itself, taking care of the issue, which we were really happy about.
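A minimal sketch of the bounded-iteration idea, against the plain RocksDB C++ API; the helper name and shape are mine, not the actual patch Cory describes. max_skippable_internal_keys is the real RocksDB read option that makes an iterator give up with Status::Incomplete instead of skipping tombstones indefinitely:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/iterator.h>
#include <rocksdb/options.h>
#include <memory>
#include <string>
#include <vector>

// Illustrative only: list up to max_results keys under a prefix, but bail
// out early if the iterator has to skip too many internal keys (mostly
// tombstones) rather than stalling until the OSD suicide timeout.
std::vector<std::string> bounded_list(rocksdb::DB* db,
                                      const std::string& prefix,
                                      size_t max_results) {
  rocksdb::ReadOptions ro;
  ro.max_skippable_internal_keys = 1000000;  // bound on skipped tombstones

  std::vector<std::string> keys;
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek(prefix);
       it->Valid() && it->key().starts_with(prefix) &&
       keys.size() < max_results;
       it->Next()) {
    keys.push_back(it->key().ToString());
  }
  if (it->status().IsIncomplete()) {
    // Too many tombstones skipped: the caller can back off and retry
    // later, as described above for the OSD layer.
  }
  return keys;
}
```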
G
Compactions were extremely effective in that scenario. We tested essentially by just re-weighting one of the OSDs down to 90% and letting like 10% of the PGs move off of it, and when the delete work happened, we did see those compaction filters work properly to clean up those tombstones pretty much immediately, and we never had any high latencies at all. So, one really positive result there.
G
That, I think, will be good, since that's already in main, so I think that will be a good point for those moving forward.
G
So, beyond that, the other thing we've realized recently: at the time it happened, we had set the TTL compaction setting down to like an hour, I think, trying to get things cleaned up faster, and we had left that there for, like, a month now, I guess. We recently realized that it was really causing a lot of bad performance issues for us with the spillover; David's got the graph up there showing since we set that down, and we've just removed it now.
G
What else... we've also been playing with this: we noticed that the NVMe volumes on these OSDs are like 60 gigabytes, and BlueFS was only using like five gigabytes of that, and the rest was all spilling over.
G
That was a very ineffective use of those resources, and we ended up finding that the volume selector had a few tunables we could adjust to make it use more of that; in particular, the BlueFS volume selection reserved factor that is pasted in the chat. That has a default of two; we set it down to one, and that works well for pushing more data onto the NVMes.
G
What we realized there is that the default logic of the volume selector basically assumes that BlueStore, or rather RocksDB, is doing compactions according to the normal level-compaction settings, and that each of the levels is filling up to its requisite amount, when it does its calculations to determine how much it can move over to the fast device.
G
Even if the whole rocksdb level doesn't fit on the fast device. And in our case, with both the TTL compactions and, even now, with the compact-on-delete filters, it turns out that we aren't getting anywhere close to filling up those other levels before things are compacted down to the last level, which is L4 in our case, and so it's choosing not to prioritize moving anything over to the fast device.
G
For that reason, there may be some other things we can do to be smarter for cases like this, detecting that scenario rather than having people adjust the reserved factor, but we talked about that a little bit with Mark in chat, and I think we need to think through that more.
B
Something that maybe at least somehow prevents the spillover from happening, you know, earlier.
G
Well, so we have the normal RocksDB level settings, I think 256 megabytes for level one, which makes level three twenty-five gigabytes, and we have, what, 100... yeah, I guess 104 gigabytes total. So we have a lot of spillover regardless.
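To make the spillover arithmetic concrete (a hedged calculation, assuming RocksDB's default level size multiplier of 10, which matches the 256 MB / 25 GB figures above):

```latex
L_1 = 256\ \mathrm{MB}, \quad
L_2 = 10\,L_1 = 2.56\ \mathrm{GB}, \quad
L_3 = 10\,L_2 = 25.6\ \mathrm{GB}, \quad
L_4 = 10\,L_3 = 256\ \mathrm{GB}
```

The first three levels together hold only about 28 GB, so a roughly 104 GB DB necessarily has most of its data in L4, which can never fit on a 60 GB fast device; hence the spillover regardless of tuning.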
B
So it would be interesting to get more details on this issue offline. Definitely.
G
Yeah, I'll share some of the stuff I was mentioning related to the actual calculations, and it would be nice to brainstorm what might be a better approach for some of these edge cases like ours.
G
And then the last thing I was going to mention that was really interesting for us: we tried compression, LZ4 compression, and it was really effective both for the OSDs for data and also for the rgw index OSDs, so mostly omap data. We got, I think, slightly more than 50 percent storage savings, which for us, with the spillover, was really significant, and it seems to have been a win-win from our perspective so far.
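At the RocksDB level, enabling LZ4 is a one-line column family option; a minimal sketch (Ceph actually carries this in its own rocksdb options string, which is omitted here):

```cpp
#include <rocksdb/options.h>

// Illustrative: compress SST data blocks with LZ4. LZ4 is cheap to
// compress and decompress, which is consistent with the negligible CPU
// impact discussed below; the ~50% space saving depends on how
// compressible the data (here, mostly omap/index keys and values) is.
void enable_lz4(rocksdb::Options& opts) {
  opts.compression = rocksdb::kLZ4Compression;
}
```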
A
Have you guys seen any negative performance impact with it, in any cases?
F
Yeah, no, I'm happy to answer that. If anything, we actually saw a positive improvement; we see more actual IOPS going to the devices we've applied these changes to. Now, keep in mind we have applied multiple changes, so it's not just LZ4 in isolation, but with the combination including LZ4, which positively impacted the amount of spillover, we've seen relatively consistent latency (actually, a little bit better on reads afterwards), and we're seeing more actual operations completed with the reduction in latency, if that makes sense.
F
So certainly it's been positive, but that could be a product of the situation, with the amount of spillover we had onto a slow drive; the compression advantage may outweigh the performance implication, if any. And as far as CPU and such go, we're nowhere near even coming close to touching the capability of a relatively meager Xeon CPU.
F
So it doesn't seem like the compression or decompression has had a negative impact there. Now, it might be a different story on NVMe-based storage, just because it's so much faster, but at least with rotational storage it seems like it's pretty much a win all the way around.
F
Yeah, it seemed like it was kind of a win-win all the way around for us. I'm trying to think, if it was 24 and 25, it might have balanced out, because it was pretty amazing on our index disks. Let me go look, actually.
F
The utilization: so these are basically the index pool only. When we enabled compression on these, I mean, it was an enormous change. We were near 60% of, you know, I think these were 700, 750 gigs, and we dropped... it's almost, yeah, it was more than half, just because of how much omap data is there. So anyway, that was all kinds of win on those things too, and again, you can see when we enabled it.
F
We actually see, if anything, latency is slightly better, and that's on NVMes.
A
So, Casey, I saw the best compression with this when I was doing testing last fall with rgw. I don't know if there's anything rgw writes that's just ridiculously compressible, but for whatever reason, it seemed like with rgw workloads we saw, or at least I saw, a really, really good improvement.
C
Yeah, that's great. I mean, I know that there are strings in the index, but I don't have a good sense of how much is string versus other fields. But potentially a lot of the other fields are, like, integers that default to zero, so that could help too.
F
You know, when it's NVMe... we don't have any pure-NVMe clusters to really mess with at this point in time, so it's going to be hard for us to tell. Now, just so everybody's aware of what our process is: we had to get all this stuff out of the way because we basically had OSDs crashing anytime...
F
...we tried to do any data shifts. We're going to take what's on this cluster, which, let me go look real quick how much data this is... it's right around 1.5 petabytes currently stored, so 2.1 petabytes used, because we have erasure coding, 8 plus 3. The intent is: we've just built out two new 21-node racks' worth of servers, and those have multiple 6.4-terabyte NVMes on them, so those will become the new DB/WAL devices.
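As a sanity check on those numbers: 8+3 erasure coding stores k+m = 11 chunks for every k = 8 chunks of data, so

```latex
\mathrm{raw\ used} = \mathrm{stored} \times \frac{k+m}{k}
                   = 1.5\ \mathrm{PB} \times \frac{11}{8}
                   \approx 2.06\ \mathrm{PB}
```

which matches the roughly 2.1 petabytes used quoted above.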
F
So we should have more than sufficient space on the DB/WAL side to have everything on NVMe, and we're going to add those into the cluster and shift all of the data off of this existing 21-node rack onto the other two that have the new device situation. We wanted to preempt that with all the fixes and mitigations, to let us do it without the OSDs going into those nasty little crash loops we ran into; we had a host down for three days. So that was kind of the plan.
F
The next step, after we finish this upgrade, is to verify and validate that everything looks good post-upgrade. And Mark, of course, we can share whatever information you'd like, and you're more than welcome to look at the logs and anything else that would be helpful to you; maybe you can glean some information that will help make the decision about what you switch to as a default more clear. And then, when we do the migration, we'll collect data during that process as well.
F
So we can see what it looks like as the PGs are being purged off of the source OSDs, and then we'll most certainly have better data available on the destination cluster, where our intent is to keep LZ4 enabled. So we'll be able to collect some data there with the proper mix of DB/WAL on NVMe to hold the entirety of level four.
F
Of course. Oh, CPU cores, let's see: so we have 48 threads, 24 actual cores, and we have 24 OSDs on these devices.
F
Yeah, no, well, I guess if we count the index, it's 26 OSDs: two NVMes that we use for the index pool, so that's two, and then 24 data OSDs, which are serving the data pool for rgw.
F
And if you want to look at our CPU consumption, keep in mind that we're in the middle of an upgrade; we have a fair bit of, I guess, spare CPU.
A
Right now I have the opportunity to do some testing on some very limited machines that are CPU bound, not at any kind of real scale, but I could do some low-level tests. So I think I'll try that out. But I suspect that the win in terms of reducing traffic to the hard disks is probably worth it. That setup has no NVMe, just pure hard drives.
F
Yeah, exactly. Again, we are in no way, shape, or form even remotely close to CPU bound with rotational drives. I'm not going to say we're idle, because this cluster is upgrading, so the customer traffic is not nearly as high as it can be, but we've never had high CPU load on this; like, I would be very surprised if we could saturate a CPU.
F
Even if the OSDs were at full 100% capacity, yeah. And we could probably look... I don't know, am I sharing the terminal, or am I sharing the graphs right now? The graphs, okay. Well then, you didn't see me putting all the information in the terminal showing our CPUs are currently at approximately four percent utilization per core. I think we've got some... we may have some metrics.
F
This was prior to the upgrade that puts the image we're talking about in place, and, oh my gosh, that's in the process of upgrading. So keep in mind, when our index NVMes go down and come back up (Cory's trying to dig into what's going on there), they're basically reshuffling all the data, so they're running at like a hundred percent CPU utilization non-stop for, you know, an hour or two while they shift data. But even with that, that's what we see.
F
I'm sure, sure, but I guess, again, my point is: from the perspective of rotational drives, I would be very surprised to see CPU consumption get so out of hand that the LZ4 compression really made a material difference from a CPU perspective. Yeah.
A
Yeah, I think probably the only cases are where people are, like, crazy, running, you know, 60 OSDs on like four cores or something ridiculous.
F
Yeah, exactly, I mean, if someone's super oversubscribed. But these are Xeons, and they're decent Xeons. What are these? They're 4214s. These are by no means super high-end, vertically scaled CPUs; these are more just lots of reasonable cores. So I would be surprised if there's a massive problem.
A
Yeah, yeah. David, Joshua was asking in the chat window if you guys have a list of the things that you did, the backports that you made or other fixes that you...
G
Yeah, I will have to look back through. I think there are quite a few patches on top of 16.2.10 that we have, but the ones jumping to the front of my mind are specifically related to improving index OSD performance. I mean, getting rid of the TTL that we had in place for TTL compactions definitely seems to have been an important step in improving performance there, and I'll look back through the list of patches that we do have on top of this.
F
I think a lot of this is really about mitigating the stuff that was causing us pain in terms of shifting data crashing OSDs, and it all kind of had knock-on effects. So as we were trying to address the root cause, we just saw that, oh hey, we can't have TTL-based compaction on when we have this deletion issue that causes the OSDs to lock up to where they hit the suicide timeout.
F
So, as we've implemented patches that have allowed us to... we started by reducing the TTL, or I should say increasing the TTL compaction interval: we were at every one hour previously, then we went to every six hours, I think it was, and now we're just turning TTL compaction off. So a lot of the performance increase is because we've been able to remove things that were actually punishing the OSDs rather than helping them, but were necessary in order to prevent the cluster from eating itself, if that makes sense. Yeah.
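For context, a hedged sketch of the knob in question, assuming the "TTL compaction setting" maps onto RocksDB's per-column-family ttl option (Ceph would carry this in its rocksdb options string; the exact plumbing is not shown):

```cpp
#include <rocksdb/options.h>

// Illustrative: RocksDB's ttl option schedules an SST file for compaction
// once it is older than this many seconds, even if nothing else triggers
// one. The progression described above: 3600 (1h), then 21600 (6h), then
// turned off entirely.
void set_ttl_compaction(rocksdb::Options& opts, uint64_t ttl_seconds) {
  opts.ttl = ttl_seconds;
}
```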
F
Yeah, and there were some changes we made, you know, that might have some performance implications, certainly, but a lot of it, I think, with Mark's patch...
G
...I don't think we even changed the other settings. We just backported the compact-on-delete stuff, and we did set those compact-on-delete settings a little bit lower than what you had, for now at least, while we're trying to move data around. I think we have both of them set at 512, so anytime there are 512 consecutive tombstones in an SST file, it marks it for compaction right away. We'll probably bump that up a little bit after we play with it a little more.
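The RocksDB facility behind that backport is the compact-on-deletion table properties collector; here is a minimal sketch with the 512/512 values Cory mentions. This is the raw RocksDB API, and Ceph wires it up through its own option handling, which is not shown:

```cpp
#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

// Illustrative: flag an SST file for compaction as soon as a sliding
// window of 512 consecutive entries contains 512 deletions, i.e. 512
// consecutive tombstones, matching the settings described above.
void enable_compact_on_deletion(rocksdb::Options& opts) {
  const size_t sliding_window_size = 512;
  const size_t deletion_trigger = 512;
  opts.table_properties_collector_factories.push_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(sliding_window_size,
                                                    deletion_trigger));
}
```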
A
Sure. I just kind of picked those out of the blue, based on what seemed reasonable, but, you know, testing is much more important.
A
Guys, this is really great. I think Reef could potentially be one of the best releases we've made in quite a while, and in large part that's thanks to this testing you guys are doing, so I really, really appreciate it.
F
Yeah, and to be clear, we're running Pacific, right? Like, I think this is all great for Reef, but we should definitely give a lot of thought to where this lands when backported, because a lot of people would stand to benefit from this too. And I know it's really nice to be able to sell the story of, like, "hey, if you upgrade to Reef, it's gonna be way faster and way better," which is otherwise harder to sell.
F
You know, our intent is to move to Quincy next; I think actually after the next release is kind of the plan currently. But we certainly want to see all of this, at least as much as makes sense and is prudent, backported. Now, changing the defaults, that's probably a different discussion, and that might be something you could do Reef-specific or what have you, but I think a lot of users would definitely benefit from this just in general.
F
No, we're currently running the version that's distributed with Ceph; we're not running the new one. We're just running the one that's already in Pacific, so it's just Pacific plus, yeah, I think there's like 10 or 15 patches or something that we're maintaining for various things.
A
Yeah, they've got something in the new release that helps improve the behavior of tombstones in the memtables. That thing I did was just for the SST files; it doesn't help you at all with memtable tombstone accumulation, which, you know, maybe we don't hit, I don't know. But it sounds like the new version of RocksDB is definitely worth getting in, if we can, yeah.
F
I mean, I read through the changelog, and there are, like, so many thousands of bug fixes and other things, some of which sounded relatively disastrous, and, I know, strange behavior that we've probably seen in the past. So, you know, like everything, there's a little risk, but I think the reward on that might be worth it.
A
Yeah, we just need to get it baking in teuthology as soon as we can, so that for Reef we feel confident. And then, for the Pacific and Quincy backports, that's scarier for me, because they changed a lot of stuff in RocksDB. So it's maybe, maybe, but...
F
For Reef, with the rocksdb upgrade; and then, you know, now that we've got some data, and we'll continue to collect data and share whatever is useful to you, maybe we could look at just tuning some of the settings, or giving people the option to upgrade and tune the settings, to get at least some of the benefit. Yeah.
A
Exactly, exactly. I mean, it looks like with stock RocksDB and just a couple of these fixes, you're seeing dramatically better behavior in Pacific. So, right, I think it's worth it.
A
All right, well, Joshua, I moved your stuff up because it's more important than what I wanted to talk about, so why don't you take the floor.
H
Yeah, I mean, I can talk about it; it's almost more Igor's story to talk about now than mine. But thanks to insights from Igor last week, he and us have been digging further, and at this point I'd say we're fairly certain.
H
We know what the cause of the write amplification is, maybe not down to the very precise change, but basically what Igor has found is that there are gigantic inodes in BlueFS, like a 600-megabyte inode, for example (whoa; yeah). So the extent list for this inode is just gigantic, and because of the way the BlueFS log works, every single time that extent list extends, it's rewriting...
H
...obviously, the entire inode to the BlueFS log, and so the log is just being hammered, both by writes and also just by compactions, because the log gets so big. And the inode that's getting that big is one of the RocksDB WALs. In the particular case I looked at, it was for the L column family (I don't know which that is in BlueStore), and that's actually expected, because the Pacific setting says don't let the WALs exceed one gig in size.
A
So each buffer is supposed to only grow up to a maximum of 256 megabytes in Pacific; you're saying that inode was like 600 or 700?
B
I have a feeling that it might be relevant to a different column family subject; I think this was backported from Nautilus. Could be; let's skip it. Maybe that's somehow relevant, but it's just a hypothesis.
A
That's interesting.
B
But anyway, given 250 megabytes in the writer and a 64k allocation unit, it might trigger pretty large BlueFS log updates as well. In the main branch we use incremental updates, so that's not the case anymore.
H
Yeah, so Igor had recommended two experiments. Obviously, one is testing 16.2.11, which we are planning to get into the lab, hopefully in the next couple weeks or so; we don't have a hard time on that yet, but this might move it up. The other one was just an interesting one: what happens if we crank the BlueFS shared alloc size from the default 64k (when it's on the same device) to one meg? And so I ran that experiment this morning, and it does seem to make a pretty big difference.
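Back-of-the-envelope for why the allocation unit matters here: BlueFS rewrites an inode's whole extent list to its log whenever the list grows, and in the worst case (one extent per allocation unit, i.e. maximal fragmentation) a 600 MB WAL file gives

```latex
\frac{600\ \mathrm{MB}}{64\ \mathrm{KiB}} \approx 9600\ \text{extents}
\qquad \text{vs.} \qquad
\frac{600\ \mathrm{MB}}{1\ \mathrm{MiB}} \approx 600\ \text{extents}
```

so a 1 MiB shared allocation unit shrinks the extent list, and therefore each log rewrite, by roughly 16x in that worst case.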
A
I'm curious whether the new tunings that we currently have in main would help too, where we have a lot more buffers of smaller size, and we deal with the write amplification in the database in a different way, by having L0 and L1 more closely sized to each other. Yeah, I'd be curious if that changes it as well; it might be justification, in Reef, for leaving the new tunings in place rather than reverting them.
A
I suppose it's because each column family allows accumulation of 256 megabytes, and then you compact when you hit a gigabyte; I bet that's what's going on. Like, it used to be, right, that with one column family you would write into one memtable for everything, and then once you hit 256 megabytes, you would start the compaction process and start writing into the next one. And, you know, on an idle cluster like this, you'd compact it really fast.
A
Then you're writing to the next one; you'd never, in practice, exceed like 256 megabytes. But with column family sharding now, I imagine (if I remember right; I'm a little fuzzy on this) you have the ability to write into buffers for every single column family, and you might allow yourself to get very close to that one-gig limit without ever compacting any individual memtable, until you hit the global limit, and then you're like, "oh, crap." Gotcha. I think that's kind of how it works.
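A hedged sketch of the two limits Mark is describing, in raw RocksDB options; the values are examples in the spirit of the discussion, not Ceph's exact tuning:

```cpp
#include <rocksdb/options.h>

// Illustrative: per-column-family vs. global memtable accounting. With
// many sharded column families, each memtable can sit just under
// write_buffer_size, so total dirty data can approach db_write_buffer_size
// before any single memtable flushes: the "get very close to that one-gig
// limit" behavior described above.
void sketch_memtable_limits(rocksdb::Options& opts) {
  opts.write_buffer_size = 256ULL << 20;    // 256 MiB per memtable
  opts.db_write_buffer_size = 1ULL << 30;   // 1 GiB across all CFs
}
```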
A
Okay, do you remember Sage had that PR for RocksDB WALs to be able to reuse existing files, and that got blown away because it wasn't safe for the write-ahead log?
B
Again, that's not a big deal if we use incremental updates in the BlueFS log. Yeah.
A
Yeah, it's kind of interesting to me that it's not completely unrelated to, or different from, the issue in CephFS with encoding the subtree map for every single journal segment, right? Like, this feels like a repeated problem.
A
All right, well, Corey, or, sorry, Joshua: were there any other things there to talk about before we wrap up?
H
I guess, for myself: yeah, we'll report back on the ticket to say what 16.2.11 gives us in terms of improvement here. I mean, overall, despite this, I think we saw an improvement from a Pacific install, and this was probably across, like, two dozen production clusters or something like that.
H
I think we saw improvements in pretty much all of them except one. Good, though: like, despite this, in Pacific it's still really good, or better, from a latency perspective, and so hopefully, once this is solved, maybe that last one will come in line, and then it'll be just a straight-up improvement across the board. But we'll see.
A
Good, good. I'm very much hoping that when you guys look at Quincy, you'll see another improvement. There's a couple of things in there, but especially Gabi's work with... yeah, I was thinking of Gabi's.
A
Cool, all right. Well, I don't think we have time to talk about the hard drive stuff here, which is fine since Adam's not here anyway, so we'll push that out past next week. But any...
A
All right, well then: thank you, Joshua, and thank you, David and Corey. This was an excellent meeting; I'm really, really excited to see things getting fixed here. I think Reef could shape up to be really good. So... oh, I see David had one last comment.
F
We were in a pretty nasty place, and that tool made it a lot easier to deal with manual rebalancing to get ourselves out of hot water. So, seriously, that was like a massive help; just want to say thank you.
H
Awesome, I'm so glad to hear that. Yeah, we built that tool internally to get ourselves out of massive piles of hot water, like stuff I can't even talk about, unfortunately. But once we were done, we felt like it was worth open-sourcing, so I'm so glad to hear that, and I've been happy to see on the mailing list, too, that people are using it.
H
Yeah, so it's interesting you say that, Mark. I actually thought once or twice about rebuilding pgremapper as, like, a ceph-mgr plugin, essentially. But ultimately (and I actually had a Cephalocon presentation lined up for this last year, but with all the shuffles I just didn't get on the schedule) what I actually want to do is: I have a list of, I think, three or four points...
H
...that, if we fix those, then you actually don't even need the tool anymore, and most of it has to do with backfill scheduling. So, anyway, that would be an interesting topic. Building it into the distribution, sure, we could, but I'd almost much rather just say: hey, can we spend the time to fix the things that cause us to need pgremapper in the first place? If that makes sense.
A
Yeah. Are you coming to Cephalocon?
H
That seems unlikely. Like many places, travel budgets are tight, yeah. But I don't know; maybe I'll resubmit the talk and see if I can get some budget scrounged up for one of them.