Ceph Performance Weekly, 31 Oct 2019

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 2019-10-31 :: Ceph Performance Meeting

Description

.

A

All right well, who knows, maybe won't get more people who are. Maybe it's still going on, but uh David we can. We can start but there's Casey, but maybe we'll get more people, but we can get this moving.

A

All right, hey Casey, good morning.

A

The the core stand up is going over, so we might get a couple more people possibly coming in, but uh they were still in the middle of discussing bugs and stuff, so might be a little bit, but I will get started on pr's here. Oh there's, this new orbitty cache replicated right. Log I think it's part of a previous one that had been submitted, maybe earlier, but I think Jason want to have this broken for the smaller parts. So this is apparently the first part of the previous one.

A

Let's see, there's been rosanell has been submitting a whole different PRS for Numa affinity, updates and enhancements. There was one for the OSD and I think there was another one for our GW.

A

That's the this crimson law, global 1 by ki, foo I. Don't really know him, assuming that's just trying to to update there. There a lot even not to be as impactful or something that's Eric's, just allowable bucket index short counts for dynamic, restarting that one merged. Apparently.

A

A

Adams one that fixes some of the right head log behavior that was kind of in anticipation of his Roxie, be shorting PR or was necessary for that that that merged. So that was good.

A

Let's update his stuff here, I actually have not gotten through all the old viewers, but I think got through mostly ones that matter. So, let's see what has been updated, my pair for pinning own ODEs in a separate list or the own cache, a new store was tested, approved merged, and then we unmerged it. We were heard of it because it was breaking a couple of tests that I'd missed.

A

It's not clear to me why it's broken, though I asked Josh finally to take a look at it as well, just to make sure get another set of eyes on it, but um I am a little suspicious that maybe it's we have some broken behavior somewhere else, and this is just exposing it, but one way or another. We need to figure out. What's going on before we try merging it again.

A

Let's see I've got a PR for increasing the default number of our JW bucket shards easy. We can discuss this more if later on, but um my recommendation I guess, would just be to let's test it and see what happens.

A

Let's see, we've got Eric's PR, that's getting some updates for reducing bucket listing overhead with multiple shards. There's been lots of discussion on that one.

A

Regarding how to deal specifically with kind of a case where you, you may have very few entries per shard, so so say if someone requests a very small page size in terms of how many entries are returned at once, or if you have a large number of shards, but very few objects actually in the bucket, then that this might be the case and just count how we deal with that.

A

Let's see what else MBS auto treating of MBS cache memory limit that needs to be rebased again. The out work is still ongoing. Patrick continues to review that Adams brach CB, sharding, PR I, tested that a couple weeks ago. Adam is also doing a bunch of testing on that. Oh and Adam you're here. So if you maybe after we get to the PRS, if you want to talk a little bit about what you've been seeing, what you've been testing that'd be great.

A

Let's see who else we have here fix, broken and use calculations and blue store? I, don't remember what that is, but I think it's was buggy, so it's being fixed by you or reviewed that oh, can you you're you're here so yeah I, don't actually know anything about that one. Is there anything interesting going on there right? What does it even fix.

B

Well, actually, the ticket is the PR is on my plate. It's not passing a moment.

C

B

Offer it to me she doesn't have enough time so I'll take a look. Okay,.

D

A

Let's see Adam his I object, PR is is continuing to move along. It looked why I just looked at it looked like it was passing a two thumbs around. So that's exciting, I think. Maybe Josh also took a look at that and approved it and, in addition to other people also have been taking a look. So that's really good. Hopefully that gets moved merge soon.

E

My understanding.

A

Is that there's.

E

Only one blacker in this ffs suite and he's been furiously working on that one.

A

Fantastic, that's great to hear all right. Let's see the IOU ring io engine and blue store, though we did have some discussion a while back about that, and maybe like kind of refactoring, a bunch of the code, that's related to a IO and blue sore, but that's can be able to work and I. Don't think anyone has time to do it now, so the gist of it is.

A

It looks like it's working I think Keith Moon looked at it and thought it was fine, so I also approved it this morning, we'll get it in there's no real easy way to test it right now, because it requires newer kernels than any of our test machines have, but it's not going to hurt anything I, don't think, and at some point maybe we'll look at cleaning up the different IO paths to be more independent, but right now, let's get it in so hopefully now, assuming that it passes this stuff and keep his happy with it will merge it.

A

Let's see, working progress, blue store, speed-up removal- you were this- is your your PR anything going on with that.

B

Yeah well, the idea behind is that currently, when I removed in pools or placement groups, we have pretty large overhead on collection distance. We run multiple small tasks, which is just 30 records in another loop. So for huge placement groups it might.

B

It might log or the boost or significantly not saying it's pretty small, and here is the attempt to introduce a sort of caching or prefetching for such collection listing the studious overhead requires.

B

Well, actually, ready for in you, age has a novelty are based on this one and also I have.

B

Additional idea about making collection beasts in a synchronous, because current implementation might spark for a while, under some circumstances, and actually I have seen how this costs us aside time, outs or design of collection is to be improved. My opinion, well anyway, this part which implements perfection, is ready for review about a synchronous collection listing my needs on discussion. Oh.

B

A

And then last one is Adams. Objection sustained and, like I, think some updates over the last two weeks. But I don't know where, where that one's at right now easy.

F

There is one more bug fix to go and then we can merge in.

A

Awesome as exciting, alright, the rest of the stuff has not been updated. So I'm not going to go over all right. Let's move on then to discussion topics which I have not added in here yet, but maybe the first thing we could talk about what I put in the email.

A

A

Actually be shining Adam, do you want to talk about some stuff you've been doing in it.

B

C

Not really but yes, I'm.

C

Basically, the sharding has been done by using pollen families of frogs, DB and shouting.

C

A specific chart is done by calculating some harsh and this solution was extensively tested by me for performance. There is a presentation that was given on Saturday this week in Poland, and the results are as follows: it seems not to give any reduction to latency tail latency significant reduction. There are some some test cases when this tail latency can be observed, but in general latency in became greater force.

C

From the positive side, we have a much better write, amplification and much shorter compaction times like two times even up to two thousand also write amplification up to two times better, but beside that it I would rate its performance, its unsatisfaction re, but I just simply refuse to acknowledge that this solution sharking, would somehow fail for rocks DB if it was so successful. In other cases, I will just roll null in task to roll out.

C

Why why we have higher latency in typical cases, I see this slowdown from Perth in rocks DB, so maybe it's just that rocks DB handles : families less efficient than usual usual columns.

C

I have to confirm that if so, then second attempt shardene could be to actually invoke multiple instances of rock TV, but that would be hugely difficult as additional sharing of low FS and or something else would have to be done. So, basically, that's it, though, from from the my point of view, only the first of the three full requests is actually useful. The one that makes integration was columns, families into our kv database and allows us to properly assign.

C

Property share data not just by by single graphics, but have it more more complicated. So that's that's it basically.

A

That out of my a I think it sounds like the results I saw when I was testing also can match what you are seeing. I can link those again I, think you've seen them already, but is not not higher performance, not significantly worse, but you know maybe a little tiny bit worse, but then much much lower compaction, overhead and a much lower right, amplification, but still still good I think so worthwhile.

C

Job has to be discussed. I personally would not pay the cost of higher latency in typical. We.

F

Don't why it's giving us work attacking times.

A

The hypothesis is that, by not descending as far down in the hierarchy, by having multiple parallel hierarchies that only have like level 1 and level 2 versus level 3 level, 4. Whatever that you are reducing the amount of write amplification, an additional.

F

Reason why that worries me is: you could get the same effects, probably by increasing the number of key files for level right.

F

B

You have effectively.

F

Multiplied the size of level 0 level, 1, etc by a factor of whatever you are partitioning it by all you're doing is delaying it by some constant factor or growth. The.

A

Difference there are those that you could have multiple level 1 level. 0 compaction is happening concurrently when you have multiple hierarchies right.

F

Ok, it's also possible. The increased latency is a direct consequence of having a much much wider level. 0.

F

If, if it's commit more of your data is in level 0, then looking for david level, 0 is less efficient. What's important level 1, so you might.

C

Okay, I mean even number.

F

Of level 0 F of C files.

D

F

D

F

Factory, if you drop that by a factor of whatever the partitioning is that might recover, is the way disease.

C

No, it does not absorb by a factor of partitioning it jobs by a factor Wyatt like 1.2 1.3,.

F

C

F

You if you change the config settings to drop the number the target number before compaction, but you have the same levels: 0 files, number of level 0 files, whether or not you have to pack whether or not you have them starting turned on. Then you might recover the latency. If that's the reason.

A

F

Of having 10 trees of the same size, you have ten trees as one template size each at each level.

F

Particularly level zero.

A

Tim one one thing I've been a little worried about with us is that I've seen some opinions will say on various mailing lists and things that that when you have a large and that's somewhat undefined number of column, families that you can potentially hit some locking issues with the right head. Log I'm wondering I.

F

Believe it that's true that that would also make sense that it would show up in perfect.

A

Locking wouldn't unless I guess you were looking for if you're doing the the weird life you tax logging that therapy you can do it birth, but but you could use Adam, you could use your wall top profiler or mine. Maybe anything to do I could to do.

A

Have you tried testing it in a situation where compaction is a major contributor to slow down like the fall and slow down? Oh.

C

My bigger job by injections.

A

What in there are some scenarios that we saw from customers where they were seeing like 20 or 30 second compactions, and it looked like maybe the right ahead log was was- was starting to throttle because it couldn't keep up with the you know. The the compactions were so slow, but it was triggering, but the righthand log was was getting into this slowdown state and then the the interest rate for writes was really love was lower. I was curious if your PR may help in that situation.

C

Well, that that what you explained could be a explanation for like sometimes having latency of operations like 13 seconds or even more so I guess in some tests. I do have I might have that, but I did not observe it directly. I mean I did not have an air and tool kit set to actually confirm the find having that okay.

A

Yeah, if we can, if we can't craft one of those scenarios where it's you know really taking a long time to compact under the current master code, I I still suspect that your PR may make it much better and.

C

You really think about. You did really care about a compaction here taking a long time or you just care for new operations being stalled during compaction, probably both right yeah, because I I don't care for how long compaction takes for as long it does not eat too much space for.

C

F

This logical booster, that blocks and coming rights if compaction gets too far behind I think that's what we care about so compaction throughput, really at the important further.

A

And in addition to that, Sam there's logic inside rocks TV itself, whereas search throttling rights, the redhead log, if compaction, is if the the number of of buffers right buffers have have already kind of filled up, which I think would basically be the same kind of idea. Right.

F

But neither case it's not that you can hear about how long that you think that you care that the compaction never gets too far behind so a compaction rate really yeah.

A

Sure sure you can tap one giant compaction happening I! Guess that like doesn't matter as much or you have tons of little ones but yeah, it's the rate right.

C

But at least I seem to have a test cases where I have repeatedly a pure, very long write operations for like more than 10 seconds, so I will dig out my my test cases to extract. If this is not the case, yeah.

A

Anything besides performance right I mean your PR is going to help on Drive we're significantly for the flash drives yeah and that's that's a big win. Even if it's there's no performance win that by itself. If we can make people's you know, flash drives last four times longer. That's as.

C

A big and one more note about why compaction in that case may have a smaller awry data produced it's because if you have some pool that has just all maps- and you add it a lot and there is a trigger for compacting all maps, then it doesn't really touch any other objects in any columns. So that would also have impact for reducing write, amplification, yeah.

A

Any anything else, so we should discuss on maxtv, charting.

A

Alright, let's move on the next thing I had, and this was boost our trim, update, I, guess I already can't covered it. Just if it is is that we broke master briefly I'm trying to figure out why it doesn't appear that there's anything really obviously wrong with the PR I am still suspicious that maybe we're we're exposing some mother bug in blue store, but well, maybe have time to take a look at it, and then we can figure out.

A

What's going on so moving on from that ok, rgw, bucket, charting and chart counts, and all this stuff we don't have Eric but Casey you're here so I've got a PR where I was advocating. Once we have the the bucket list efficiency, they are merged that we up the number of shards I'm, advocating for a very high number Casey I think you were advocating for slightly something a little bit more conservative, I guess.

A

I would advocate that we just tried testing a bunch of different prime numbers of shards and see kind of with, with all the work that we've done and then Plus Eric's work. You know Pat, where we end up with that. Does that seem reasonable.

E

Well, I mean: what exactly are you testing for? It seems like you're benchmarks so far, I've just been for single bucket throughput right there. It's clear that more.

D

E

Are going to give higher throughput.

A

So so, the from from what I'm seeing is that the advantage of more shards primarily is going to be with a single bucket or a small number of buckets, because otherwise you just you get the parallelism, any move, lots of buckets and the disadvantage primarily is the the cost associated with bucket listing that it makes it slower.

A

So those are kind of the to me. It seems like those are. The two things is the offset or the the the improvement that you get in single bucket right, throughput versus the performance penalty of of you know what it does to bucket listing throughput is that does that seem reasonable or desisting right.

A

Generally, yes, are there other things to worry about I.

E

I mean there's bucket creation and deletion. More objects would take longer, but I mean that's not not too important unless your workload is constantly creating and deleting buckets sure.

A

I suppose there is the with more shards. You get better deletion performance within the bucket right.

E

um In the same way that you get more right through pusher,.

A

Yeah, so so I guess the trade-offs are with more shards listing, slows down to some extent, which we probably should retest to see how it is now. Potentially, the creation and deletion of buckets themselves is slower and then the trade-off. The the benefit potentially with more shards, would be faster, single bucket right through a faster single bucket, object, deletion and don't cry merrily be it. Is that sound right.

E

Yeah but I feel like one thing that that we're not considering here, is how we always scale with or adapt to the workload with restarting and I are tunable. There is basically the hundred thousand keys per per shard before we look to split, sure and and I think I'm very interested in looking at that and figuring out.

E

If it's giving us the right, the right scaling, and so when we're talking about picking a good default for or new buckets, I think that we just want to fit that into the model for how we adapt and make sure that that the curve I guess, looks right. So that's kind of why I'm hesitant to add a ton of shards, because then it's going to take forever before before we start rewriting and adapting to the workload.

E

Right because if we pick seven or eight shards, you could put almost a million objects in the bucket before it starts adapting.

A

And what would what would be the downside of waiting to adapt until you hit that like they? You know one one point if we were say doing like seventeen shards, what would be the the downside of waiting until 1.7 million objects before you increase the number of shots again.

E

Well, I feel like buckets that bigger or not are probably in that common either. You have couple really big buckets or you have more smaller buckets.

A

Is it would there be like a penalty that we'd impose if people had lots of buckets that all had seventeen shards versus lots of buckets? They all had like seven charts.

E

Not especially I mean you just have more objects and ratos the metadata for that, but doesn't sound extreme.

A

And when you came back to what you were asking about the the number of of objects per shard, you know right now, I haven't set it to a hundred thousand. Do you have a feeling like what would be a better default or kind of what we should be looking at.

E

You know I really don't we kind of just picked that number on intuition.

E

Couple years ago, I guess and we'd never really looked into it scientifically. So I I feel like it's. It's probably not right, I, don't know if we actually want something. That's kind of linear like that.

E

So it I do think that this needs a closer look to see kind of when and how we should adapt. The growing buckets.

A

The the downside of having like too many would be that you would be pounding a smaller number of osp's right.

A

Like if you look like each shard had like a million objects or something instead of a hundred thousand or two hundred thousand or whatever, it is right now.

G

E

That that, and like larger map issues to many keys on a single OSD, definitely impact performance. I.

A

Imagine that probably else would depends, then, on kind of what you how your OSD is configured I would think that if you had more more memory and faster devices in the node, you probably could handle more more objects per shard and I. Think so. Yeah.

A

But really maybe it's not even a parameter that can be easily guessed at upfront.

E

Right and I think that's why it's it's hard for us to just rely on intuition, for this is.

G

There anything in the telemetry data that we could look at what's coming in and see if there's any information about objects per bucket and things like that.

E

Currently, no I don't think we're collecting anything like that.

G

That's too bad.

E

You see Matt joined we're just talking about reciting and how we scale with more objects. Talking about whether the hundred thousand use per Jared is sensible and all.

F

That kind of yeah.

E

And how that influences their decision of the default number of shirts for new buckets? Okay,.

H

Yeah I mean I think it was intended to me other than when I came up with that it was the others intuition /. What was safe at first winning discs, /.

A

Would it be terrible to look at the latency of operations and try to base it based on like how quickly the OSD is actually like returning stuff I.

H

Don't know we're also looking at other strategies to represent bucket indexes, but but in terms of this I mean that might be correct. I mean I, think that I think I mean I apologize for missing the first part of this and I'm not going to bladder a lot I'd like to talk more with Adam in this context and learn more about what, whether that's the true, whether that's a likely assertion to him. If we're seeing we're seeing if we're seeing a response, you know I'm saying first see if we're seeing a behavior.

H

That was just that without that, could that that would work or if we have a way of investigating that um it sounds it's not it's novel, it might be, it might be true, I, don't know well.

A

I guess I guess one question is so right now we.

H

Over time, and how are they how they interact you together when they're all that yeah.

A

Right right now, I guess you know the behavior right is that we pretty quickly shard out. We, if you get several million objects, you'll fairly quickly start ramping up the number of shards up to some maximum. Was it like a thousand or something okay?.

H

Separately, yes, yeah! Well, your honor, well, Your, Honor I think we're on the right, nay right track. Here. We need to do I do believe. We need a different type of growth curve for for sharding. It appears, like you, probably want to jump to sharding quite quickly, but not necessarily very wide and then grow less frequently, but but more.

I

To me yeah so hey, this is Prasad from Flipkart won on the topic of numbers, we've had clusters which were statically sharded to 32 shots, buckets which were shorter to 32 and had a cap of 1 million objects per bucket, and that has served us quite well. But when we on another cluster, which was running luminous, dynamic sharding, we had this one. You know a hundred thousand objects Charlemont and we were have seen you know occasional, slow requests coming in.

I

Frankly, because the total number of objects itself was high, but you know no matter who does a delete. You know there used to be a lot of tombstone entries on the rocks, DB and so there's one lakh itself. You know seemed not quite helpful. So to me, the 1 million divided by 32, which comes to roughly about 30k objects, seem like a nice number. I, don't think 100,000 that we have is to address what needs to be. You know increased or you don't. Okay do.

H

I

Find the time you.

H

Should be observing I think when you use a when you use an even, for example, a non prime number of Sheridan switch, which we adjusted in upstream just recently um that wasn't the intent of the original designs. It's just a simple: they used modular arithmetic, so it shouldn't be doing that so I think again, I think you could a significant imbalance. Even then,.

I

I, don't idea, we were like wondering why I'm probably missing something basic pardon me for that, but we were thinking why there was no 4pg rocks DB instance. Why is it that we had to have like one big rocks TB for all the PGS on a given OSD and no matter who does a delete on mapping to whichever PG it affects the primary.

A

So in blue store was written, I think I mean sage would be the one to directly ask but I think some of the intent was that all.

A

When you, when you do a write, you would be able to have the entire transaction. Only the redhead log is a single operation right, very simple transaction that you have instead of spreading out like lots of rocks TB instances and lots of logs over a hundred with you know, potentially lots of random I/o happening. Then you hit one log with you know all of the things necessary to operation.

A

I

H

You're dealing sailing right into the use case or they end up where the value a research plan of Adams work. So as I understand it.

C

There is intend to there is intend to actually make a lot with attempt or sharding that we that we do today the step two might be to actually employ multiple rocks DB instances, but it's much more difficult, since we have them to take care of atomicity of operations outside and have some some hot sign mechanism that will hello up that.

C

What we currently have for free, which is that in one transaction you write him object, allocate data for it and all that atomicity is provided by Rob's DB features, and if we split it, we have to rein moment it outside. So it seems we will.

I

Be split it on up both PG bases. Don't we have that atomicity carried over.

C

You have that on placement group level, but you don't have that on level of single disk and accessing storage um on that level. So on that level, do you have to recreate it.

I

Thanks for letting me know that somebody has been thinking about it, it seemed like that's how it should have been, because our the primary thing is my rocks. Tv is just too bloated on a given SSD.

I

The blue FS partition, we move to DB, dot, slow and then still spelled over and it's just too heavy with a lot of slow requests coming in, and then we had to keep adding more and more OS DS and then making the size of the rocks TVs much thinner and the compaction times we're like really on a per device. T basis.

A

So so Prasad, you know right now. Kind of the the hacky way to get around this right is to actually put multiple as these on a single. You know, flash device which is kind of like you know a poor, a poor way to to do the exactly what you said right, yeah.

I

Yeah, the software, just you know I'm already running on a VM. On top of you know,.

I

A

That it might be that, at least in the short term, maybe it's a combination of Adams approach and that that could you know get us at least somewhere and then Adam, you know, maybe maybe potentially we could actually do charting at the at the database level.

A

There was something sage had mentioned when I talked to him about this, a while back where we can't just shard, implement multiple deep databases: / PG there there's something else, I forget what it is that makes that a little difficult inside the OSD but I, don't I, don't remember the details, so it's not as simple as just saying: okay, let's you know, have a hash of PGS that map to different databases.

A

Isn't this kind of, like you know, throw stuff at them and then, if they're a transaction at them and then let that take care of that PG. We have to do a little bit more than just that, though no I don't know we'll see how it goes. I guess.

A

So Casey had to go, Matt have do we know or do we have any idea of, but I guess we know that before we implemented any kind of bucket charting if a particular bucket got too big with too many entries, we started running into like weirdo Matt problems right.

H

Yeah, it simply was untenable um and they first and for that reason that was like, when index index was index, those buckets were created, I mean it simply, it simply had a would simply hit a wall and as you as you'd expect, how big was the logarithmic wall.

H

I I think maybe I reckon I reckon system the Sun equipment three years ago, but maybe three hundred thousand objects. The number I remember: okay,.

A

Was that was that.

H

I think those I think was on file. So of course it was a yes of course, certainly a file store and it was initially under a law and leveldb and it would have been objects and then the workloads I'm familiar with, maybe at fifteen sixty K and hundred K.

A

Okay, so really we we don't have a real great idea of how many objects per per shard. We could maintain now with blue store, and you know, and.

H

A

Think, although to.

H

Presents experience playing some intuition, though it would be that done, Sun, flasher and via me we should be. We should be Emily help.

A

The problems before were there any ways that we could identify encode that we're starting to run into those kinds of problems like with latency, be a good metric or their other metrics that we could look at. That would start. Oh yeah.

H

And we didn't write any code to do that, but but certainly I think I do think that yes, you're on the right track to think that measurement of latency is okay.

A

Sensible did we were just doing maybe instead yeah, maybe instead of asking the user to specify objects, I.

H

Mean we call Bator, he would hear he might present a counterpoint. I mean Amazon, just added user-visible sharding to the AWS scheme.

H

If you know you know, you know access assurance by constructing a bucket with us for the weather, unix-like /, shard name, / prefix, and you and you can- and you were ensured- that it that it's or prop or strongly implied that you can, that you get that. Show that you, because you that you get a shirt affinity if I doings up by doing this.

A

Interesting, why did they have to do that?.

H

Well, they rented issues I believe they ran into issues similar to ours, but they would they want. At the end, the idea was to make it user visible, so the FIR, so that so that users can construct work. They wanted to get any wanted to construct some amount of get some type of guarantee, guarantee or partial guarantee that the users would know that they probably were asking like. How do we get well? How do we get? How do we care and get guaranteed parallelism row for a work?

H

You know when they work live, and then that was a wage to get it plus plus make it visible to applications and the way that was that did matter to get to customers? Okay,.

A

The assumption here is that I'm.

H

A

That we want, we want to do something.

H

Like this, we don't I mean I, think I, think I think transparency are doing is the way to get is I, don't think we're gonna be perfectly okay using this for sharding I. Think I think I think that excusing a simple you know using a hash space. Selector, that's it to place. Everything is a great way to get uniform distribution, but it has the issues it has I think we're gonna moving out of that.

H

It may always be a good strategy for some more clothes, I, don't know which ones but I think you know other things we're looking at doing have they've added three others with it. The splitting based approach that you know that then I'm just I, figure out this days. What that, if I just wanted the plus tree approach, we've got working out of design and annotation for that.

A

Yeah in my ideal world there would be no splitting for most people that we could get to the point where we can have a default number of shards. That's not significantly impacting any performance at all, and then you know it's fast enough that that for most people they don't care about having more shards.

A

H

Yeah I mean people. Are we gonna say that the new thing, which is it well? It's not so much the starting to say that you know it's not so much the shard structure, especially after your Eric's improvements to scale and the numerating things, and it was unexpectedly good that made that may that, may that may short-circuit the need for a different way of doing splits. You know that aren't fully autonomic, but he said that people can still say well.

H

It's charged, we see is charged to the overhead from sharding is high, so it's so in the right so and the ramp up to a net. You know in a workload where we're ramping up to a natural just.

H

Sorry, distribution, it's supposed to work, what's where it's, where it's fluctuating I know a lot I, don't know! If the letter is all that common to be honest, I think there's a common work. Let me hear about where we're starting, creating a new bucket ever have a stream and we're flooding and things are flooding in we're getting we're getting to the natural charm. You think we should be at at any rate for that for that for the for the for the for the steady-state, you know size on the bucket, but there it has.

H

It's been spit spits, but the peak size it rid it reached and it's the same workload saying and then we should. We should try to avoid doing this, we're doing the starting over and over as we as we bounce through a bunch of intermediate points. It feels like to me yeah.

A

If we can, if we can get rid of a lot of the restarting at low low object, counts, I think that will at least kinda make a lot of the problem kind of go away right.

H

We've now seen, if we not seen that, for example, with your improvements and Eric's improvements, whatever that's, they owe that bid the bidding stuff. If you just do this.

G

H

G

H

Is much much better than anybody and then listen Lena that I predicted much much better? That means that if you dress, if you started eight shards or something or sorry, seven shards then look really these are things. Are things are probably pretty good for a liver for a lot for a long run to come, Yvonne says I have won more times to get to multiple millions of objects and that that would be tolerable.

A

My um my hope and I know Casey and Eric have expressed a lot of concern with this, but my hope is that when we actually start testing some of this, we might see that the impact of even more shards on bucket listing isn't very significant, like we might be able to get up to 17 or even like 31, and have very little impact on my I. Think.

H

I think I think I think those aren't big numbers, so I I, don't think. That's crazy, so I think we should look at those local cars pick up pick a good number, that's alone and then the smallest and the small size bigger you know airing and the larger size is better.

H

If we don't see a big variation variance, so I think that's I think that Mister is correct, but if they pick a small tycho number and Pickers the car, a small good number around, if we could pick it higher than 7- or maybe you pick it- this 17 or 31 yeah I- think it's good. That's.

A

H

But then but then don't bounce, then don't bounce through this constant.

H

So we should be good. We can fix that. We can fix that without we can fix that without studying without without measuring latency and then guessing we can. We could be three week, we can, you can take a curve and run it I.

A

Wouldn't be surprised to if we have like a camel by modal distribution about its numbers, fabric counts per bucket like it's either gonna be like under a million or like you know, five hundred million or something well.

H

Yes, that's that's, probably pretty common and there's some other weird ones, but yeah, but I thought I. Maybe it would be useful to actually measure such distributions and Eric's some of Eric's work and what he called Olympic. Don't love you buckets. He was looking at that nning any collected some upstream information. If other folks have upstream people have upstream Apostle Peter publicly it you know real, viewable or or a few skateable bucket listings. um This way this could be. This could be useful to us.

H

I mean it would be useful to preface to perhaps track that and to track the put the structure of a bucket distributions and learn and learn from them.

A

When users have problems like in the past that you've seen, has it usually been that they they try to keep like tons of objects in one bucket, or do they have lots of buckets that they history stuff over.

H

Both both we see, we've seen both and people and people have an applications, heaven infer things about. What's why they would do each one. Sometimes it's just random reasons and other times, I think it's one's gonna be faster than the other or in either it. Maybe it is or it isn't if we have their also work. Let's arguable construct and destroy buckets for the same name, frequently producing that's a great data sets or recycle. Our data sets that's that are like snapshots or but we're being replaced by new and from you know, by new information.

A

So so, presumably wonder.

H

But a common one is just like it's it's it's like it's like: it's like it's like a FIFO and then and it reaches some enormous in size, and it's just turning.

H

There is oh there's.

A

A large entity that.

H

Pushes do stories.

A

So some KC did mention that the creation of buckets and deletion of buckets rather than deletion of objects and buckets is something that we want to look at here too, which it sounds like maybe you're but you're. Yes,.

H

It's needed for some for some use cases application instead, they'll want to do this. Obviously I can see why you wouldn't, but it's yeah, but but people aren't very mature, but you get used to this.

H

You know to that to the idea that it has three data centers just sitting out there and a lot of times it is but but there are folks there did that, do that and we've added the we've added the ability to move buckets around and rename them that is and and then that will probably well probably because people realize they can do that they will probably do more of it sure.

A

We had a question in the chat window here from Prasad about prime numbers met I.

H

Don't think I know nothing about prime numbers in Krush. They may be a maybe a demean or you do mark, but and but the bucket charting it wasn't it was. It was supposed to be using them and it wasn't.

I

If there are prime number number of shots for.

H

Buckets absolutely it does because it's basically we're basically just taking number of shards modulo the shared count to get to the place in Underland and there's a retiree taking that we're taking up we're taking out we're taking some random over other than the structure structure, collection based on the name of the of the object and modulo. The third count should place it. You know visiting the shard, so.

I

Number of shots is what is recommended. Oh no,.

H

At that's sort of the worst, probably four buckets.

A

H

Buckets are together. This is nothing to do with crush I, guys guys guys this. This is all Willa. This is all well understood in crush in and fresh experts.

I

H

Be the only ones I speak about crush.

I

H

All this nothing I said was I was not speaking about this crush, come in and nothing about.

H

How it should should always have been, prime or, and and and so that, so this so that the maiden in the name we, the hash we constructor than of the pocket name- is relatively prime to the number of chars.

H

This will automatically happen. This will automatically happen in new versions of sir.

A

All right, well, I, think we're out of time guys, though any last-minute things before we wrap up all.

A

Right well, if I have time this week, which I'm hoping I will, though, we'll see how this similar stuff goes at Matt, I'm, gonna attempt I think to run some tests using Eric's, PR and and look both at the bucket creation and deletion times along with listing times and then read. Earth's I write, throughput and deletion throughput about use within a bucket with different prime number accounts, just to kind of get an idea of where we're at okay.

H

A

Like your idea, try.

H

Starting at a bigger, bigger number: okay, I, don't think you're, gonna notice, much difference, but I think it's gonna behave. It should be a hitch like you say it should behave better as we know this because they scale up if we could find it better and in general, about better curve. You know free shard points, yeah, I, think that'd.

E

H

The simplest one of the simplest ways to fix it would have something that an equation that grows that grows better and sometimes.

A

H

Get up to the to the peak yeah.

A

So so maybe another dimension of this is also varying the number of objects per per shard, and if we get all that multi-dimensional matrix, then you know we can be consumed. The KC and Eric think I know that they're still there so a little bit concerned about going with well.

H

I mean yeah: we should you should go with what the numbers say: I mean they meant that they observed that it said that it's. What did you see because I think you tested already? You tested a variety of right. You like to charge once around around to eight and there and around 32 and so forth. Those if you didn't, if we didn't see you know that, like you say they well yeah, let's talk offline, but I, don't know what their concern is to them. Yeah.

D

H

D

It used to be, but the.

H

Results were getting for listening anyway, are so good.

A

Definitely in master, we would not want to go with that Manisha, but with with our SPR, it I think it's a lot more attractive. So anyway yep we can. We can talk more offline and let this alright.

H

Thank you very much for that all.

A

Right see, you later guys have a good week.