Ceph Performance Weekly, 11 Nov 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Ceph Performance Meeting 2021-11-11

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

All right pull requests this week. I did not see a whole lot at all. um There was one new pull request that looks really interesting from radic. This is a introducing huge page based, read buffers, so um if you're interested in that it looks interesting, I'm not sure under what circumstances. This is going to end up helping, but it definitely could so um yeah there's this one. Otherwise I didn't see anything new or closed. um Besides that there are a couple that got updated.

A

This week, though um adam's retry of bluefs fine grain locking this went through some qa this week and it was passing cases that previously apparently failed.

A

I don't know if that means this is ready to merge yet or if we still need to run this through more tests. um Certainly that code is is complicated, so um you know this is this is a touchy one and it's it's.

A

Unfortunately, it's also kind of um this is adam's attempt to fix what was wrong in a previous pr from majiang peng that also we merged that also broke after so uh right now we're zero for two, maybe uh hopefully the third third, try we'll get it, but um but this is this is tricky code to get right um all right, let's see uh beast optimizations for request timeout. I love this pr.

A

There's tons of benchmarking data in it lots of discussion about uh what makes sense to include, or what doesn't um oh casey you've been reviewing this any anything else. You want to add.

B

um Since last time, let's see we got a comparison with and without the custom allocator piece which was kind of complicated and we found that it didn't really help. So I was happy to to rip that piece out.

B

Mark is satisfied with the performance, so I ran it through toothology and see some valgrind issues that are there. So um we talked about it in the in the bug scrub. This morning we have a plan for that.

A

Cool cool yeah, it looks great, I mean, definitely looks like an improvement. So that's fantastic.

B

Yeah, the um the piece of um the beast library that we were relying on for the timeouts is kind of complicated in doing extra stuff. So getting rid of that, I think is, is the main thing.

A

Okay, excellent excellent: do you know um there was another pr that was kind of hurting rgb performance? It looked like. I need to go back and look at which one it was, but.

B

A

Are you guys looking at that too.

B

um Not really so the the pr was um so the the abuse front end is spawning core routines for each request and.

C

B

And there's a stack allocator for the co-routine stacks and originally it was just using memory from the heap and it was sized pretty small. So we were seeing a lot of valgrind issues just over running the stack so that pr switched the allocator to use a map and m protect so that we would actually crash if we overran the stack, and so I mean that adds a couple system calls for every request. So we kind of expected a performance hit there.

B

um So I I don't think that we're planning to revert that, but maybe there's something else we could do sure. But I think that the this time out thing was the was the much bigger issue.

A

A

Well, cool excellent sounds great all right. Let's see next up, there's this pr to set the min alex size to the optimal I o size with the underlying device. um There's some discussion eventually, both that was approved by both sage and igor. I believe now it's just it's ready for testing.

A

There's also this pr from igor to make sure blob, fsck much less ram greedy um that needed a rebase. It's gotten an update since then from igor, but I'm not sure if he did anything besides just rebase on master, so um I think that one may be ready for review and and testing um another one from igor. This is this old pr for optimizing, pg removal.

A

um It's been under lots of review. There were some failures, uh testing, more review and discussion, um I'm not without eager here, I'm not sure what uh what what the status of that is yet, but um anyways still being actively worked on. So that's good um lots of stuff in the no movement category this week. I know I've got a couple that are in there, but I just haven't been touching them. um My big one, I guess, is this priority cache. um It was more or less working in performance tests.

A

Well, it was working performance tests, I'm just not showing a whole lot of improvement really, um but then it's like faulted during testing and I haven't gotten back to it. Yet I'm yeah it. There are some benefits to it, just not what I was hoping there would be, so um it's still probably worth doing just it's not a big improvement. It's much a much more uh nuanced minor kind of thing, um but it does give us a lot of insight into what the cache is doing. So maybe that's the big one. um Otherwise,.

A

We still have the mds optimizations here, which I think are really can be good to get in, but um I know patrick have you do you know? um Is anyone still looking at those the the ones from you, colonel.

D

No not yet um look at that in the next few weeks.

A

um For uh kind of, interestingly, I've gotten a lot of people recently uh asking about like hpc and io 500 with ffs uh so they're. There just seemed to be a lot of interest.

A

At least 10 gently right.

D

Yeah um well, we'll see, I I'm not sure if the approach is right, something we want to support long term. So you know yeah hard to say whether or not those will actually get in.

A

Okay, all right! Well, let's see um yeah I mean there. They have likewise on on this optimize object memory allocations using pools, um ronin uh kind of oh you were in here. um uh You know you had it raised some issues about that pr and, I think they're very legitimate. um That's another kind of question of it could be a big performance when uh doing something like this, but we have to figure out the right way to do it.

A

I don't think we saw a response back from the author of that pr, so um still kind of waiting on that.

A

All right well, otherwise, I think that's about it for prs um anything I missed or anything. Anyone would like to discuss.

A

All right, then, moving on um okay, so uh first thing this week I wanted to bring up. Is that um there's been a lot of discussion, uh mostly inside red hat, uh regarding how long it takes to build packages? um Building stuff itself is actually not horribly slow. If you have enough cores on our, you know really fast um development machines.

A

We can build stuff in probably 10 to 12 minutes, which you know it's not great, but it's okay, um but our debug or our actual package builds take a lot longer and there's interest in trying to figure out why um so? The two big things that came up um there's was interest in trying to parallelize builds across multiple nodes, and david galloway has been doing a lot of testing on that um right now he's trying uh it's some kind of uh closed source tool.

A

uh I forget the name of it, but um the gist of it is that it's faster, but maybe not as fast as um well. It uses a lot of nodes or a lot of cores to get an improvement um and dc may kind of be similar in terms of this improvement.

A

So he's testing that as well, but the other big one that I wanted to bring up this week for people is that sage had mentioned that he saw a lot of time spent in dwz, which is basically a tool that lets you extract symbols from packages or from from uh the code to build debug packages.

A

So um it looks like uh from what stage has said that that's basically single threaded, we're we're spending a lot of time in one core during package, build uh just doing that step. So um I haven't looked at it really closely, but I posted the link for the dwz source code in the etherpad.

A

um It's you know basically a c program. It is being uh actively maintained uh there. There appears to be uh someone at red hat- that's working on this, but perhaps this is something that could be parallelized. um I wanted to open it up uh for anyone that is interested in this. um Anyone has anyone used. Dwz knows much about it. um Look at it at all.

A

All right that was, that was somewhat the response I expected um so okay, I've looked at this just a little bit now. um I don't know how difficult it would be to actually do anything with this. I haven't looked at it closely enough to really get a sense of it. There is some comments right at the top of the c source talking about trying to optimize multi-file cases.

A

So you know the author definitely has been thinking about things. You know performance at least a little bit. um I may try to reach out to him and see if there's any opportunities for us to to help, but um in any event, uh I think that that's probably the better way to go in terms of trying to make our our package builds faster, uh at least at first rather than trying to parallelize um and and use multiple nodes, and we already can make the parallel parts fairly fast.

A

Just by using a big node with lots of cores, um it seems like the single threaded parts are the ones that really are are hurting us right now so uh anyway, that's that's it for that um all right! Next, here, um josh and gabby. um uh Do you want to continue a quick summarize, the the um the discussion from core this morning about fast shutdown.

E

I can start my my part, so with no column b, we require a a step before shut down a step where the stage all that location table and that step must be done when the system is fully questioned.

E

If there's anything happening, if there's com sorry compaction ongoing, if there is still uh ongoing rights, then this thing would be bogus and next time we start it's going to be bad, and so there are two solutions. The first solution is just you know. What, if you do fast shutdown, then don't store the location file and the next time you start, you will do a full resto, a full recovery.

E

Full recovery can take minutes. If the system is very big, it can take 15-20 minutes. I believe that you, you should expect five to ten minutes on every reasonable system, but it could get as much as 20 minutes on the fully configured system.

E

That's one solution: a better solution is on the first shutdown: try to do minimal, set of operation to get the system into quiescent state and then stage that location and shut down the step we could skip are all the cleanup. We do for memory and stuff that we don't really need, like um you, don't need to drain memory pool you don't need to.

E

I don't know, free up all the memory and let malok a new uh free and sorry new and free uh deal before the fragmentation. You can just skip that state, and so I've been working on this for some time now and today I I start profiling the step to see how much I'm saving and I was very disappointed to see on my system that, with the minimal set of operation, I can go down to like one second, and if everything is done, it's going to be 10 seconds.

E

So I know it sounds like it's a factor of 10, but it's just 9 seconds and still when I do this one second and I think I'm in good shape, but still I'm skipping out of step, which I might be missing something. The question is: why do we do the fast shutdown?

E

My intuition is that we don't do it because we try to save the shutdowns uh time. I don't think it's very long and I suspect people need fast shutdown in case some bug put us on deadlock or some kind of a condition preventing the system from ever going to shut down. So it's just okay kill the system, but it's my uh it's something. I suspect I have no, no knowledge about. That being the case.

E

F

E

How much time do we spend on shut down if the system is in good shape.

C

It's like, I don't know about the second part, but I can tell you about the history of why it's implemented. um It's actually a fairly recent addition.

C

Only in the last few years that we had faster done and it was implemented in order to make upgrades in particular go faster on large clusters when there's heavy io loads going on so when you're having to restart each steaming across the whole tire cluster.

C

um You want that to happen pretty quickly so that they get back up and running, and can you a serving request?

C

um I think one of the aspects that you may not be seeing as well in your testing on the official analysis like nasa, is because they're so fast and have these very fast and nvme devices is the flushing of data in flight is going to be pretty fast in those nodes which wouldn't be the case. If you had slower hardware like a hard disk or a much lower cpu, I think that maybe one aspect um where why we'd we'd see longer shutdown times in some cases.

C

um But that's not to say that we have to do fashion or we have to. um I think it is worth trying to like figure out where the time is actually being spent. What is worth optimizing or not?.

E

G

Go ahead: hey thanks! I just wanted to add the use case too. If a site, for instance, is wanting to shut down for an outage, let's say they have 100 000 osds and then one or two percent of them are slow.

G

That could completely hold their whole plan where they could be governmental, could be financial losing tons of money waiting for this note to shut down, so there could be a a big, a big difference of what a fast shot dom could do and being newer. I still know all the details of it and what we're looking to achieve, but there could be a case where something just never shuts down, because hard drives like to still be operational, but be very finicky, especially the more commodity type hard drives. I just wanted to add that.

E

So it's another sorry yeah. There is another um solution that we discussed, which was do the shutdown but cap it at five minutes. Ten minutes.

E

If you cannot do it in five minutes, then kill the machine and then you will do a full recovery afterwards, but the full recovery should probably not take you more than 10 minutes. So there's no reason to wait: 10 minutes to save another 10 minutes, so you should never wait one hour or two hours, but if we could maybe make it enough just to do this thing plus there there are few other simple changes which don't change the semantic.

E

There is anything which is not starting the execution, but it's in the queues. So when the system is slowing down, I expect a lot of stuff is being queued uh on the external queues, but we didn't start executing them executing them. So something that I'm already doing in this solution is I'm I'm stopping all the cues from accepting new tasks, the external and the internals. So nobody is going to start anything. Only the stuff which already start execution is going to conclude. The question is: how many tasks can we have in flight?

E

Everybody knows we must have some limit for the in-flight stuff.

C

But that's another area where I think we've made a recent change there, where for a long time we actually had no limit, and so we could have very, very large buffers of data that was incoming to the osd and uh we finally enabled that again it's the osd um message cap, which is um what do we set that to like 100 or something 256.

E

We got 256 external messages being processed on the osd at one time.

C

E

Not what I mean, even if we assume that every iotec 10 20 millisecond, you could still finish just the I o in I would say five seconds like the slowest sata drive, would finish the I o itself in five seconds. Then there is the processing that we do make it to be 10 seconds. So that's not where we spend time. I don't think this is where we spend time.

E

I suspect that there's some cases that so so one one one thing would be: maybe the queues got thousands of of pending io that we didn't start executing executing, which we still keep bringing in. That thing is easy to stop and I'm already doing that, I'm not going to add anything new to the execution, but 256 ios. That's not something that would cause us to wait. Minutes like one minute should be enough to complete everything and with huge margins.

A

That 256 io limit happened after a fast shutdown was implemented.

C

Yeah, so that could have been that we were just seeing the effects of not having a limit more than anything.

E

So else that was the reason that people so so that faster than was needed if there was no limit and by the way, even today, there is still a similar thing, because you could have think skews be queued before they start executing. I don't know how much we can have in that queue. The message queue on the osd. Everything arriving from the clients is the limit to how much we could keep.

C

Yeah, that's what that's! What.

C

Execution queue the messages coming off the wire from the client. That's that's! What's capped at 56.

E

Oh so we kept it just on the pending, but not on the executing. That's actually.

C

E

Thing to do because I would expect that pending don't consume a lot of resources while executing dual consumer resources. So I would put a much lower cup on execution and bigger cup on on pending, because pending is just a ah sorry.

E

The reason is that there is solicited the right data so that the host can send the whole data together with the request. That's why we must put the cap.

C

Yeah, so you might have like each forecast being four megabytes of rates and it's taking up a whole lot of memory.

E

Okay, okay, okay, so even without any of my changes, there is only 256 possible ios. We should be able to complete in few seconds plus. There is cleanup that we do, assuming that there is no bug and there is nothing stopping us. Then I would expect that one minute should be more than enough for any system.

A

Josh, do you remember, did anyone when fast shutdown was implemented, actually look at why it was needed.

H

Or where time has been spent, I that's the busy.

F

I just pasted here so that's where the context comes from is coming from uh rook. If I remember correctly- and I also- I am sure gabby has already looked at it, but this is the pull request from sage where this was implemented.

F

um Yeah, that's uh so upgrades, as josh pointed out, was the reason and like 25 to 30 seconds for osd was not acceptable. Is what I'm understanding just by reading through the initial comments.

A

We avoid the compass clean shutdown process, which is historically a source of bugs that's inspiring.

F

Yeah, but I guess um with gabby's stuff some of the details in this pr don't hold right. I mean like the only reason why we are doing um clean shutdown. Now uh we'll have more motivation right. We have other stuff to do during shutdown, um so it might be worth revisiting um whether we can afford to have clean shutdown.

F

Given that we have the client cap message message: sites gap.

E

Do we have a system, we could try slow system, big and slow. We need something big and slower.

C

Alarm running cluster is large and slow. Well,.

F

Yeah, do you need uh it to run your code gabby then? No. I want to.

E

See no, I want to see and not fast shut down with system doing I o in full capacity, and I want to see how long it takes, because if we can see that that takes, I don't know one. I don't know five minutes at worst case scenario, I don't even think that'd be the case. I think one minute should suffice to any reasonable system. Then I don't think there's much to gain on the fast shutdown and we only risk creating inconsistencies.

F

I think that should be doable. If you I mean it's just a matter like we have long running cluster, we also have uh scale clusters and this it's just a matter of disabling our shutdown.

E

Okay, and actually, I think at the moment, fast shutdown is disabled in safe.com.

E

It's something other than.

F

Again again, I think we talked about this in stand-up, it's just disabled and restart by default. It is enabled, if you see options.cc or the new file name, it is enabled yeah.

E

My system is, it's always disabled,.

F

Because you're running restart right, yes, but we start using a separate configuration file.

E

Oh okay, okay, okay, so try to see the time and, as I said in the beginning, a very simple solution to this is to perform the full shutdown and just put a timer if it doesn't complete within. I don't know how many minutes, one two three four five then kill the machine.

E

So in worst case scenario, we could still do the full recovery for the recovery on a big system is 10 to 20 minutes. There is never a reason to wait, 20 minutes to shut down, and if we think that few minutes should be enough to get the proper shutdown, then there's nothing to optimize.

E

The code on my system seems to be working fine, I keep running and killing the machines and running fsk and it gets fine, but I can always find that there is something I missed something some transient data which I didn't capture and that's what scares me, because it's really hard to prove that this thing is.

C

I think we do have to be a little bit concerned, even if it's a it's a minor increase for osd, because when you're trying to restart the entire cluster and during upgrades now you're keeping things online, you're you're serializing things a bit so that you don't disrupt activities so you're going like node by node, maybe you're starting a few osd's at a time. So you don't make the entire cluster unavailable and that kind of scenario it does add up. When you only have at that sequential delay.

C

Even if it's like relatively fast, I think it might still be worth optimizing, but I think I agree. We should check out how much time it actually does take on something like the lrc and um if it's like only one second or something there, then it's probably not worth looking at. But if it is like a difference of 10 seconds 20 seconds, then there's more significant and adds up over a large cluster.

E

So what I've seen on the official answers 10 seconds without the optimization in about one point, some seconds with the optimizations.

E

But, as you said, the official is a very fast machine.

A

Was that one ost gabby.

E

No three of these okay, but you know what now I think about it. I think the ios were. I was using random right, it might have been small. I can see the size of the random right.

E

uh 4 to 64 kilobytes of rights, so maybe if there would be a megabyte like eight megabytes, if each right would be eight megabyte, maybe it would be slower.

E

So maybe I should also test with a very big rights.

C

Yeah, that would be the worst case scenario on giant rates. Late and necros has a good point about other kinds of scenarios where you care about the whole cluster too. Shutting it down entirely, at least in that case it it is, can be done in parallel, so the extra time isn't so bad. It's it's really the long tail that you're worried about there, but your that's an idea.

C

Make sure there's a maximum.

E

So what size, what's the maximum size that we allow the host to write to us unsolicited.

C

uh Typically, it's four megabytes.

E

But is there anything.

C

E

Doing one gigabyte of right without asking us.

C

Yeah there is a hard limit on on a single operation size. I forget what the that's currently set to. I feel like 64 megs, 128 megs.

E

So doing 256 io of 256 meg each that would probably stress the system to its limit.

C

Yeah certainly.

E

And okay, I don't know if, if io could do that much actually, because I think you have to keep everything in its buffers or doesn't it because if it needs to do right, follow by read? Oh, but maybe maybe they don't. I have to keep any of the data or if they just throw it away.

C

I'm not sure either or it might even be using the same data depending on how you configure it.

E

In scanning, typically, the initiator keeps the right data until you get acknowledged from the host from the from the from the server.

C

Yeah, that's nature within the client-side library, the rpd, regardless of what fio's doing.

E

So libra pd is sending the arc immediately to the host. What does it wait for the host for the arc from the server.

C

It'll still have it in memory so that it can be sent if it needs to until it gets the ack back from the server.

E

So probably you cannot. I cannot use one client to do that. Much of an io.

C

Yeah but again in practice it's going to be a much lower. I mean like, like four megabytes, is gonna, be the uh usual worst case, maybe 32 for more aggressive setups, but that's pretty rare.

E

Because even four megabytes, if I do 256 of them it means one gigabyte of cash on the client side, but the client is able to do that. Much.

C

um You might start hitting some bottles at different layers, yeah.

E

You could run.

C

Multiple clients more easily, probably.

E

Yeah, so I probably need multiple kinds, I don't think one kind, but do that much mark. Could you please help me run the setup or we could do it afterwards when you have the time how to set up double quotes to do big size, I o and using a slow machine.

A

Low machine, the difficult part we have the insertive nodes which have hard drives.

A

Yeah, but there are four four of them are being used by um the jenkins and the other ones I think are mostly checked out. um Those might be tough. uh We have the ancient mirror machines, josh, they're, very slow.

C

Those are incredibly slow, that'd be a good worse case.

A

Yeah so so gabby that might actually be the. If you want worst case scenario, that would probably be the worst case scenario for everyone in your machine, um so those are toothology or in the standard lab. So you'll need to check one out um using that. I don't know if you can.

E

Send me the names and for running multiple, a fire can just keep opening more and more fires, and is there any um synchronization happening between them? Can they all write to the same osd.

A

Sorry, what was the question gabby.

E

Can I open multiple fio on the same post.

A

Yeah yeah, absolutely you can run uh multiple jobs from one fio. You can just multiple independent fios uh hold up the q depth for each one um so for for each one you'll want. If you want to have a high, I o depth for each fio process, you'll use, lubeio and direct and pump the q depth up and just do like large, large writes and whatever you want to set it to.

A

Yeah I mean that'd probably be the way to go. I would think um and you'll probably melt the mirror machine, because I mean these things are like 10 years old, so they'll they'll be yeah, it's gonna be super slow, probably, but you'll you'll learn something maybe.

E

Yeah, because also for now, one thing I notice is that um that the stage part is not what is spending the time most of the time, I'm spending all the stuff I'm skipping is just increasing inter cluster synchronizations, it's not in io, it's like getting everybody to agree on some steps and stuff like that talking to the manager talking to all other components, but I don't see the I o itself as something taking too much of my time.

E

Sorry, I'm saying before before I try to optimize when I optimize, that's the only thing left, but the nine second out of ten there all been some kind of synchronization happening, like a very big chunk, was even spent inside uh service prepare to stop, which is essentially communication with managers and others.

E

I think we spend half the time just there.

A

I'm curious what that will look like on mira, because mira has everything slower, lower discs, slower cpus it'll be interesting to see what how that time spent changes.

A

Josh, do you know how many of those machines do we have left.

C

Maybe 50 and I'm not sure how many of those actually work still so.

A

Yeah who's the right person to talk to about getting those david. Maybe I don't know if he still deals with you know the hardware there or not yeah like he said, maybe just to randomly try to grab some.

C

Yeah, I might do you might just have to randomly try to grab a few and if they don't work, mark them down and keep going like do.

A

We have any do we have any documentation on reserving nodes without running stuff through technology. I forgot how to do it. I haven't done it in a long time.

C

Yeah it's in the technology, docs there's a pathology, lock command. You can use to do that, but you can see here, there's a uh that's the general page, but um I'm trying to link to the popular page without the mirror machines there's a bunch of them that are free. So in theory those those ones that are are free are still working.

A

So gabby, you probably need to grab some and then, however, you want to run your tests like set, accept the cluster and test I mean toothology could do it right? You could set the cluster that way, but um otherwise you know you could you could use restart or you could use cbt once you grab the nodes, I used to do that.

E

A

Yeah yeah, so yeah just grab the notes from restart well compile stuff or you know whatever you want to do.

E

See how long just the drain step would take, because at the moment the drain step phase is not where we spend most of the time at least not an official analysis and not with smaller height. I can try to know fish analysis with big iron. I mean that's an easy thing to do, but.

E

Yeah, I wonder how many io can they get.

E

So the things which are beneficial is to disable roxdb compaction, all kind of stuff.

E

I try to disable also all kind of unnecessary activity, unnecessary activity.

E

E

The the l column family, where we take all the small I o and try to stage them, so this stuff is really good and do we even count them to the 256, so it's they could be. Is this the limit of how many small io we could store in works to be.

C

There could be.

E

C

Like generated by um activity is just the the message limit um coming in the client side, there could be more like io operations internally generated from those offer. Those client ops.

C

That's what you mean.

E

But the stuff that we stole temporarily in rob city in column, family, air or the small island- are they counted towards 256 or it's their own counter.

C

uh It's a separate piece. I guess there would be only like you're talking about the log entries right.

E

There's the small rights.

C

Yeah, that's separate, that's not counted in the same way, but for deferred rates. You don't necessarily have to um do them during the shutdown process.

E

Yeah, I'm disabling them at the moment. Even so I never see them being activated, so I think I'm disabling them, but I never saw anything happening on my machine. I don't know why are.

C

They only used in specific circumstances.

E

Because of the third right, I don't see, that's something that should save us some time and everything to do with compaction should be mark. Please don't put it here. This thing is going to go away as soon as the meeting is over. Oh just copy and paste.

A

E

A

Okay, I can even there's nothing super interesting though this is just if you want to use the rbd io uh fio engine. I mean something like that right. You want to just do big rights with uh maybe a high. I o depth and.

A

You'll probably want to do like so, okay. One of the reasons I brought put it in here, though, is because we should talk about this a little bit um with nun jobs if you increase that what will end up happening is that you'll start out at the same offset on the image at the same time, so I don't typically like increasing the number of jobs with fio for like sequential rights.

A

If you did random rights, it wouldn't matter, so you could do that too, but um with the control rights, you probably want to hit different rbd images.

E

No, I prefer to do random right because that's the worst case scenario right, sequentials.

A

Yeah yeah, okay, so then just change that nicer.

E

A

E

Try to exercise work as an area to be random right, big random rights.

A

Yeah and then you can change. If you want to you, can just use more jobs. Then then it doesn't matter.

A

Otherwise you can run multiple fio processes, parallel against different rvd images, or you could just make one rbd image and then do like you know: num jobs equals whatever you want to set it to, um and then the io def per fio process, whatever you want that to be set to with the rbdm engine, you don't have to worry about like using, um like you know, direct one and um all the other garbage that goes along with that lubio and everything um so yeah.

A

I mean something like that: just change it to rand right and then you know, choose what you want for the other settings.

A

I can send an email too, so um I mean I assume that you want to just use like the rbd with fio, rather than going through the whole process of, like you know, using setting up kernel, rbd and writing a file system to the image and all that stuff.

E

I'm just wondering how much.

E

What's going to be the limit on data buffers on the client side, because when we.

F

E

Format the file, if we ask the client to keep very, very deep queue. Eventually, it's going to run out of memory space.

A

Okay, I sent an email with that in it yeah. I don't. I think that should be fine, just using liberty and big big, ios random, and then you know, however many jobs and, however much io depths. Do you want to throw at it.

E

Another question unrelated something which bothers me a long time ago: why do we lack clients to do unsolicited right?

E

I mean it's okay, if it's four kilobytes, maybe 64 kilobytes, but unsolicited of megabyte of four megabytes, 64 megabyte, that's something which is thrown upon. Usually we don't want the client taking your resources.

E

I mean if you look at a normal nvme server, an nvme drive.

E

The way you deal with it is that you send a request and say I want to do a write, but you don't send the data and then the drive would come to you and say: okay, give me your data.

E

So as soon as you send a request, you need the client need to allocate to have the buffers ready on each side, but it cannot move the data to the the storage to the target unsolicited. That's was always the case with scuzzy, with nvme and before I realized that there is an optimization.

E

If you do small write and then you can do you send the right and the data in one step and you could save the round trip, probably very good, optimization for four kilobyte, but when your io is getting into megabyte, that's become less of optimization and more of a hassle for the storage system.

E

Because what we do, if the client is going to do one gigabyte, I assume we're going to ask him to give the data in intervals. So we could do the same. If he's going to do four megabytes.

E

Was there ever a discussion trying to set what's the correct size for the for the data that we know.

A

I don't know if there's a good answer that I have for why things were designed the way they were back when they're written this is. This is pretty old right josh I mean this is how it's worked forever.

C

Yeah, I think that's pretty much. uh It's been that way. It was started, so I'm not sure exactly, I'm not sure if that was a consideration originally or if that, if we just thought about that in that way, at any point, but that's kind of how the protocol was originally created.

E

The reason data servers don't allow you to give them unsolicited data is because it allows them to control the way they behave and they know how much resources they have so usually they're going to allocate that which offers that request and they can accept a request. But then they will move the data on your own good time when they have enough buffer free for the data, because usually the request itself is just 64 bytes of I, the I request block is usually 64 byte.

E

I think scasi fcp nvme, that's how much they give you and when you want to push the real data, you have to wait for the server to tell you, okay, you can give me now 64 kilobyte, you can give me one megabyte, but the server needs to tell you how much to give it and that way the server is always in control of the resources and it could use recycled buffer and so on and so forth and.

E

I never heard about this being a performance limitation, because you could see people walking in almost widespread with this behavior and again, even if there is I, I can see that there is possible optimization for a very small right if it's four kilobyte around it. But when you do megabyte of right, there is never a reason for you to push data unsolicited and hammer on the server which it might be doing other stuff and you're just consuming space.

A

So I don't know all the details, but I I remember hearing stories that early early on there were many many iterations of the messenger, and there was a lot of like uh uh changes going into a lot of this stuff early on. um I I think at one point like we had a messenger that was entirely based on mpi uh in the early days when it was to look like you know, uh kind of focus on being like a luster replacement for supercomputers um josh.

A

Maybe you know more, but the impression I get is that there's kind of a lot of chaos around this stuff early on, and I don't know exactly how we ended up where we are now but.

C

Yeah, I'm not familiar with that really early interview myself on semester layer, it seems like I mean one of the things I can. I imagine I might be that um it's a little bit more difficult to like re-pre-register your intents and then um execute on them in a system like stuff where you can move around across nodes, and you don't necessarily know exactly where your request is going to be going and until you get around to sending it.

C

You could have some extra latency, I mean I mean it sounds like you're describing the system's able to maximize throughput, but it sounds like that. Doesn't matter our kind of protocol, I have some inherent latency to it as well.

E

But you could maximize throughput if it's a small eye, because then the overhead of the round trip is going to be big if it's just four kilobytes of data, if you're going to send four megabytes of data, then doing an extra round trip. It's no big deal because I mean even tcp is going to keep sending messages back and forth you're not going to move four megabytes in one buffer internally.

E

It's going to open windows, close them and keep asking you to give them the more data, so the overall saving, also the relative improvement in four megabytes going to be. I, I don't think it's even going to give you one one percent extra. If it's doing 4k, maybe you're going to get 10 extra um response time, you could cut a response time by something but four megabytes.

E

I would expect that you wouldn't see a huge difference in response time. I would say single digit percentage, not even.

A

I mean it almost kind of sounds like you want the default behavior to be what you describe gabby with the ability to embed the data. First, if it's small in the original request and say hey, you know if you can just just take this and you know write it. If it's, you know 4k or smaller, something.

E

I'm really unfamiliar with anybody doing unsolicited data. I know it's like since beginning of time. It was always um the privilege of this of the server to control the flow because they they are doing more and they don't want few strong clients to kill them, and they don't even know if these clients is in high priority whatever so they always refuse. It was never allowed to post data.

E

Some people was trying to do it if the data is very small, you could piggyback and such things, but I never heard about anybody doing megabytes of piggybacking data, because what's the round trip that we do for four kilobytes nowadays.

C

Yeah, I get your point like it could be like a few um milliseconds for your bit. My regular.

A

Might even be less depending you know, some of them are pretty fast. These days.

E

And that means that you could create an adversary which could kill your system by sending that much data. If we allow 32 megabyte, then I'm going to shut uh 256 requests each of 32 megabytes and your sister perform is going to suck.

I

I think, from from the traditional system, they were optimized to use um the cues as they are fixed and that's why you need to just ask for whether you could send something down or not. um But the idea, I think, is not that bad um to have some boundary where we just don't allow to send the data straight away and we could gain perhaps some decoupling from those um really big um data streams uh and have a better handling just get library or just at least improve. uh Also the things around um qos.

E

Yeah, so this is the big killer here, because if you have somebody doing small io and is high priority and somebody's doing, these huge buffers is going to consume all the resources.

A

And that's what we see gabby, it's it's legitimate. I mean we have kind of fairly big us issues. Maybe this is something we should be thinking about. More, don't know how hard it would be to change, though.

C

I'm not sure exactly how how big of an issue it is with like once you have m clock, enabled as well the mark. um It does try to be able to handle that, but yeah that doesn't mean that you end up having potentially more like system resources in terms of the queueing space used by these large requests.

C

E

Yeah, exactly immediately what you'd usually do for big requests? You would put them in a different queue and you would pull one of them every end request.

E

If you want to do um any kind of qos you could say you know what after I've done, 100 of four kilobytes, I'm going to be willing to do one of your one megabyte request.

E

Unless you set yourself to be top tier- and I don't know whatever, but it's unfair to have me doing one kilobyte and you're doing 100 megabytes and each of us get one entry in the queue.

C

Right, right and.

C

But I guess what it doesn't do. Is it doesn't um stop you from taking up that cube space in this in the server in the first place? So I guess we make the difference with.

E

This kind of mechanism, you only send the request, you don't send the data, we don't accept data yeah.

C

Yeah, I'm saying that's the advantage of that kind of system compared to just that. Just the m clock implementation that we have in chef today is that uh it it does. Allow you to push back on that and control the buffer um queueing space, as well as the just order of requests.

E

Even suspect that once the osd gets big requests, it will prioritize them because serving one big request, free episode of resources, so you would even try and um benefit the the biggest offender.

E

The more space they'll ask for the higher the priority you'd get because you want to get rid of them. I don't know if this is what we do, but it's that's.

C

Not what we do today but um could be done. I guess, but that's not that's not how.

E

Because, usually when you have a lot of stuff, you try to get rid of the bigger one, but if there is no limit on how much they can push, if you do that for them, they will give you another one.

C

C

I think this might be a good idea to investigate and it's not just um the client requests as well. It's also like the intro osd request in some cases like recovery or for um those kind of things can be quite large as well or replication. Traffic, too.

E

Another thing that we might be able to do is if we do the request in the sequence of operation. Actually probably we're not going to do that, because you don't need to allocate the data buffers.

E

You could just set up the request, um update, robs db pg log and everything, but then the data itself you'd accept in smaller chunks.

E

Okay, give me 64k, you write it down, give another 64k, you write it down, so everything is linked together, but but you're not going to allocate very big buffers because it's I don't know if this is what we do if we allocate four megabytes of buffer, so it's sitting there ideal until we get all the data and only then we could start staging it, which again it's expensive, it's cheaper to take them in chunks and each time you get something you stage it, but then you need to build.

C

E

That you know how to link stuff.

C

Yeah, I guess it gets into that kind of little messenger messenger protocol, which is where this I think gets complicated to implement, because today we do need to like read in is the header of the message before determining like the payload size, and then we allocate a buffer for the whole payload. Let me read that in so that we can decode the payload.

C

um We do that we do to delay the decoding aspect of that, as we don't necessarily fetch that all immediately, but we do treat that as a single large buffer.

E

Yeah, so I'm just wondering if it's possible just to be kid, you call the header, do all the work that it not needs to be done and then start. Then you need to have an active thread, pulling the data and staging them in chunk. But you don't need all the buffers just you could do double buffering of 64 kilobyte and just get the whole data instead of getting one make a big one megabyte. I don't know what happened. If you do, somebody gives you 128, kilobytes a megabyte.

E

Are you going to allocate to 128 megabytes of space.

C

E

And then so, this thing is going to be sitting highly for very long time, because until this thing is complete, then this thing cannot be touched and it's not a very good. You usage of resources resources is going to be sitting either most of the time just waiting for anything to happen. While you could start staging things so in effect, you're actually slowing them down. So I would even suspect that doing a multiple 64k loops would be faster than giving you unsolicited one megabyte or whatever.

C

How would the um multiple 64k loops work when you need to write down the entire right as a single update to the disk, because I, I think you're sending that off to the object, store the single transaction.

E

It should be a single transaction you're going to start the transaction, but you don't mark it as completed until all the whole data is safely on drive.

E

You first got to the log, and you put a record on the log to say that you start doing that and once you did it, then you can start pulling for data from the client. Once you complete it you're going to create a second entry in the log string. That thing was done.

C

Sure sure yeah I see how that could work. That's a very large.

E

C

E

The system works today, I'm not bringing any great design. I thought of I'm just it's. It's a very.

C

Interesting style, yeah.

E

A

I've had these kinds of similar thoughts, not exactly in this case, but with other things and and it's just the the weight of of changing something like this is so big, and maybe it's maybe once you get into it. It's not so scary. I don't know, but that's always been the the thing. That's um kind of held me back from trying to do stuff like this. It's just such a big change.

E

Yeah, I agree, that's a very, very big change because it means the whole staff is going to be touched to get this thing working, I'm not saying the simple thing, I don't even say we could be doing that, but it's.

C

Yeah, the idea of at least having clients.

C

Have to make sure that there is space in the server before presenting a large right. That's something that's more easily added the protocol doesn't need. You touch every layer of the stack.

E

Is there a reason why clients like to do big writes because anybody.

F

E

Have to speak eventually they have to split so they do have the code to deal with splitting big rights, because if they do one terabyte, they have to split it can.

C

E

Instead of making all the change on our side, can you tell the clients which already have the code for splitting, don't give us anything bigger than let's start with one megabyte.

A

Well, like it does, is it josh doesn't like rgw already do that? I don't think that rgw will give you few traits.

C

Yeah, this comes down to really like how you configure the system and in most cases, you're configuring, the system to add the box. That's going to use 4 megabytes for the default size for everything as the largest size that it's going to send from rgw or from rbd or ffs.

C

E

C

E

What they use, not one so 128 is, is it's.

C

Like a hard limit right.

E

Okay, but our clients don't use it because we control the clients.

A

Exactly yeah exactly, I think the biggest concern would be if you had some user. That was just writing their own custom. Like you know, software that used liberators right.

C

Yeah, that's that's a more niche case. um I think that that qs aspect, though, when you have um multiple writers, who are using very different kinds of workloads and getting very different results because of that may be more relevant yeah.

E

You could have that.

C

Users with like one three writing very small objects. One writing very large directs um they're into this kind of situation.

E

Did we try to profile the difference between four megabytes, two megabytes, one megabyte smaller mark? Did you ever do something like this? I mean when we set four megabytes as the default size for rgw was that done after somebody concluded that that gives them the best performance.

A

Yeah, um it's not like there's best performance, there's best performance in specific situations, right like say for rbd. um There are cases where you're better off with a larger size, and there are cases where you're better off with like a smaller object, size right.

A

E

A

E

Allow you to write a complete object at once. If you uh save object rather subject, if you.

F

E

A new pile and each time you give us four megabytes, it means we're going to do one-to-one mapping we're not going to have a lot of overhead of metadata. So maybe that's why four megabytes can but.

F

E

Sequence of one megabyte one after another, are we not going to just put them in the same space.

A

um It's been a long time since I've looked at any of this and it's quite possible that things have changed now, but at least back when we did this and really four megabytes has been the default for a long point. It was. It was kind of the the middle of the road option at that time, um not not always best, not always worse, but but reasonably good in all scenarios from what we saw.

A

That's why I remember, but anyway from we did a bunch of the testing on it, but that for me my number even predates that testing. It's josh. I don't even remember when we decided that was default for rbd, but it was a long time ago.

C

I think it just happened when rpg was created, that was the size somebody picked out and that's kind of what we ended up using for other things too.

A

And honestly from when I looked at it, it was reasonable, like it was actually not a bad default for for where it was at the time.

C

Yeah, I think the idea was that it was. I got a little large size to do a single I o, to a hard disk with yeah without getting too large for the buffers at the time.

A

You know the funny thing right is that hard disk, even at like 512k, is, is reasonably good, like it should be able to do 54k rights. You know pretty well um if you're doing like full full object rights, but we still saw benefit going up to four megabytes from what I remember it was. It went beyond that.

C

Like it's like, I guess I might start getting into like the metadata overhead too.

E

So in iscsi, for example, they allow you to do up to which is again tcp protocol, similar to what we do. So I think the max unsolicited they accept is 256 kilobyte.

E

I think it was something between it was less than a megabyte, but more than four kilobytes. So I maybe 128, maybe 256 kilobyte.

E

But I don't think you even can send anything more than that amount in iscsi and ice, because again they fold the same thing. They come with this concept of unsolicited data, because iscsi length distance could be very long.

E

I think in theory you could move iscsi over uh kilometers of data. So that's why they wanted unsolicited data. I don't know that that we are doing the same thing.

A

Gabby beck, I remember back um when I was doing a lot with luster.

A

There definitely were advantages when talking directly to the block layer. You have the ability to do like um uh rights that were like two megabytes to the underlying disk array. A lot of times, optimizations that we did you know, are same with mac sectors, kb to do one megabyte or two megabytes or whatever the driver. Let us do um so at least back then, on hard drives, there seemed to be advantage to being able to like be you know, doing big rights to the array.

E

um I would probably just my I suspect that that was because you compared single non-q, not overlapping. Sorry, you didn't use multi io.

E

I had this discussion long long time ago. I think it's like my first post in the linux kernel had this argument. Somebody said exactly this thing and I showed him that using two buffers of 64 kilobyte, you could do much better than doing than using.

E

um I think he was using one megabyte by just sending the first sending of uh sending two requests and as soon as one of them come back, you send the next one. So you just you're keeping a pipeline of 64 kilobyte. It will give you better performance than using one buffer of one big buffer. Because of this thing I discussed before you have to wait until the buffer is full and you don't take any action until that. So there is a big window between two operations when you're doing overlapped. I o not of a.

E

Multi, um if you have a queue- and you send few asynchronous io, and that there is an active queue, you don't benefit from a very big I o, I I I'm sure I can I, if you'd use fiona and ask it to use q, 64 kilobyte and use a q dep of four versus one megabyte in q type of one. You would see that the 64 or even 16 kilobytes would be faster.

C

You also get the advantage of extra memory being freed up to be used for caching and other kinds of metadata for other requests too.

E

And by the way, when you send big requests to the disk, you don't send them the data.

C

E

You are the one wasting your space, but you tell them two megabytes.

E

You allocate a buffer on your on your memory space, you send a request and then the server the disk is going to pull it for your data. So it's doing this for you, but it's even this is less.

E

It could be more efficient if you're giving 64k, because you your um you would be taking less time to feel the data. You always have something working you're going to create a pipeline, but there is never a case in which you push that much of data. The only case I'm aware of is ice, causing because the beginning of time ice, kaze people thought about it about being something you write from east coast to the west coast of your eyes over tcp and with that kind of distance.

E

Having unsolicited data was a benefit, and I think they put, I don't know 64k 128k, I don't know, what's the number 256? Definitely not one megabyte, it was measured in kilobyte outs. I.

I

um The big um questions of data, and especially last year, used to raise, and the thing was there, that you were always were looking for a full stripe right. um That's why you ended up with um such big ios, that being the optimum one just in the race, large block uh starts at four meg or 16. Make depends on on what the firmware said and because of that, so then you go for streamers for just sequential io and then deal with the data. Just write it straight to this.

I

A

A really good insight.

I

A

I forgot about that, but you're right. I think.

E

E

Default is 64 kilobytes. You can go as high as 128 kilobytes.

E

And that's what I think: that's which.

I

I

I think that's um that might be something for di uh for the direct hd media um that uh to understand what uh the optimal or just an optimal amount of data that could be written, perhaps in a single track um just with one revolution or something like this, and to get the best um performance for just large ios but yeah anyway. For for flash, it might depend on the media.

A

I mean for hard drive right, it's I, I think. Historically, it's what like between 256k and 512k, where you probably see it kind of hitting that limit somewhere in that range.

I

It was changing all the time a little bit, but anyway.

E

If you look at oracle oracle likes to do one megabyte right where they pack a lot of information, and then they send you one megabyte that thing exists, but of course they cannot send you the data unsolicited. They tell you that they want to do one megabyte and then you're going to port for the data. So it's different.

F

E

Basically, we allow you to push the data unsolicited, but the default is 64 kilobyte and if you negotiate and both side agree, you could go as high as 128 kilobytes.

J

I

Yeah, the pure scuzzy protocol doesn't allow that that much amount of data for single white requests- that's even um with so I don't remember what it is, um but you know you have a bigger portion than your very own, but such a big amount just not divided, so you have to split it up on the driver level already.

E

I mean the fuel scuzzy, f and fcp is also like fcp scuzzy over fiber channel she's meant to work on a very short distance few kilometers at most, but there is no unsolicited data iscsi. I remember done doing that. The guy which led ice kaze used to sit next to me in the office.

E

I think corrie could remember him as well, so he explained to me about this and at the time he was thinking unsolicited, because he was thinking east coast waste cost.

E

He thought that ice cause is actually going to be done over just just internet which nobody is really doing, but at the time he envisioned something like this, but 128 kilobytes is the most you're allowed to send.

A

So so what can we do like josh had mentioned? You know the one change that potentially might not be super disruptive. Is there? Is there anything else that we can do here, or do we really need to like dig into this and and look at changing the way that we handle requests.

E

I think we don't have an easy way to deal with this. I would suggest, if possible, to limit the max size internally from 128 megabyte to 4 megabyte, because 4 megabyte is what the client is using, but don't allow anybody to pass it if, anyway, you don't expect this thing to happen. There is no benefit and then check on the client to see if one megabyte four megabyte, two megabytes where's the sweet spot.

E

If four megabytes give them some benefit, then there's nothing. You can do for now changing the whole protocol, and always they it's a it's a big task. I don't think anybody would do that. Just for this.

A

Yeah I mean, I think, practically we don't see, 128 megabyte requests anyway. I know it's the upper limit, but I I josh, I can't think of the last time I heard of anyone doing anything like that. I I think it happened once or twice, but.

E

A

E

I don't think it's it's it's common. I would say to this energies because you don't want to see some customer machine misbehaving and then eventually you find out it's something that you don't test.

E

Just going to stress the system, is there any reason to keep more than four megabytes.

C

E

C

Folks, like like mark, was talking about like that before it was like a sweet spot at the time, but I think some folks in some systems have seen like the larger sizes, like 32 megabytes being better for them, especially on ssds.

A

There was someone I know of years ago that was doing 16. That was the biggest I think I remember hearing but yeah.

C

Yeah, but that's also not not using that as the I o size, usually that's using that as the object size. So that's that's yeah.

A

But it was like rbd right so there I think they're saying rbd to 16 and then, if you theoretically, if you're doing big writes, maybe you could actually start seeing 16 megabyte rights. 260 megabytes objects right.

E

Then right there can, you do do. Do we support pre-allocation.

C

Probability, we do yeah.

E

So, instead of doing huge, ios just pre-allocated space and then do reasonable size, io.

A

Yeah josh, something you said earlier, though I think is- is very valid. This might be a metadata overhead thing right. They might be seeing better performance because there's less metadata.

C

C

I think that I meant to be. I think, one of the reasons why we wouldn't want to do like much smaller obvious sizes. That's that's! That's pretty clear, like what the how much basic number takes. um Yeah.

C

But I guess a question that keeps me to me is like how how much of a practical difference would this make for like us purposes if we did like the um rearranged, buffers or pre um processing the metadata, without actually storing the data on the server side?.

E

It would apply to a bigger change to the system in which you stop doing dynamic allocation, because everything is done in a preset values. So you don't need to allocate uh one megabyte four megabyte you're just going to allocate some size of buffers. I don't know you're going to have a pool of 64k, maybe 128k and you're just going to recycle buffers on there.

A

Yeah you you allocate uh like a space for, however, many requests that you're um right, you know asynchronous and handling, and then you've got another little buffer that lets you pull in from that right. Whenever data is actually being sent.

E

Yeah, the request themselves should never do dynamic allocation. If you set your limit to 256 requests, you have a template of 256 entries and it's one for one. That's you stop doing dynamic allocation and creation of a new task is a very quick process, but then for buffer on location.

E

Again, you don't need to start searching you're just going to have 64 or 128 kilobytes and you're, going to reuse two of them and using in double buffering to get the push later, and you could use less smaller amount of memory if we allow big rights and if people are doing four megabytes and we're doing 256k requests each of them doing four megabyte, then that means one terabyte of memory space, one gimbal, sorry, one gigabyte of memory space just for holding the data which in reality we could get a loan with ten percent of that you could cut it by ninety percent and you wouldn't see any performance difference.

E

So it's going to be a memory, saving cost.

A

E

Because when I first saw the osd, I was surprised by how much memory I can use at first. I was really trying to pack everything and then I realized there's so much memory, I'm not used to having so much memory back in the symmetrics. There was one megabyte one one gigabyte, one director which would be the equivalent of a noisy, not always disorder, because one director can manage um 128 drives.

E

It seems that we do a lot of things in parallel, but in reality we just do it one after another. So you finish something and the resources go to the next one and you don't hold stuff in memory for no reason you only put the stuff in memory if you're just going to use them now.

A

Josh, how horrible would it be to add like a fast or not fast path, but like a um an async option to rados, like I'm going to tell you I want to do this and then, but let the osd, you know tell you, I'm ready to have data come in, it's not the primary path, but just as a secondary path.

C

I think it really depends on how in depth you want to go like what kind of saying about um making it go, all the way down to the uh objects or layer where you're able to kind of take this into small chunks. That's really like the.

C

I could see a lot further towards minimizing memory and allowing you to cache a lot more of other things and I'm not sure anything like just adding a single like asynchronous way to like supply. A day later, when the osu says it's, okay, that's probably not too hard.

C

um I'm not sure how much benefit that would give us.

A

It seems like there's a number of things that could use it right.

C

Yeah, I guess it's like whenever you have a mixed uh small object, workload or higher low priority in general, where one one low priority client is taking up lots of buffer space today, right.

A

Yeah, I was thinking about memory usage. I guess like if we have a lot of io piling up. Maybe then this is a case that would let us move towards that bottle in the worst case scenarios. But but honestly, I don't know a big big rights aren't where we see memories skyrocket.

A

It's fragmentation due to small, I o is when we are memories, just really balloons. So I don't know.

E

C

E

You never do memory fragmentation, because everything is pre-allocated. You can just.

A

But, but it's not even any of this that really makes our memories. It's memory, fragmentation due to like a ton of nodes being in memory at the same time that are themselves have all these dynamic structures inside them and another. You know, h-object t and everything, it's just the whole mess of it all makes things horrible.

E

Yeah, it's a general reaction, so you start here and then everything.

B

E

Yeah, I mean it's a completely different programming model, so I don't know if it's visible to think about it. I don't know if just keep in mind next time, we do something new, maybe that brand.

C

E

Be designed in a different way.

C

Yeah, I think, that's a good point. It's kind of similar to an idea we talked about when we were starting out with crimson as well, trying to pre-allocate all the structures needed for a request up front. So you could view that kind of pooling and not need to do a dynamic allocation, and I o path but doing the um taking it a step. Further and doing incremental after after wire reads um increases that remember. Saving me from further yeah.

E

J

E

A

And that probably would have been a good design decision like 15 years ago right when this was still written had at the university at santa cruz. But um but that decision early on was uh was made. And now we have what we have.

C

Yeah, it's very challenging to put that uh switch system to work like that when it it's designed the opposite way,.

J

C

We can certainly think about that, as we like to think think about new new things, we're implementing like the c-store and um mpmeof.

A

uh And actually on that note, I talked to sam about this exact topic last week and he did mention that um dynamic memory allocations are one of the things he's not really thinking about much right now, he's just trying to get correctness right, but that they are using a lot of dynamic memory. So this is an area that you know, maybe wouldn't be bad for. Someone else to come in and think about as sam is trying to simultaneously make sure correctness is right, you know, are we are we making?

A

um Are there areas where we could be doing a better job of not using dynamic memory, while this is being.

H

E

And so you need to start with the leaf the leaf. I don't know if there's if it's cyclic, that there is everything is connected to everything else, but if there are some leaf, maybe you could try there.

A

You know the root of all of it is gh object, t and h object, t right, that's kind of like where, in my mind at least that's where it all starts, you see, uh you know, you've got um strings being used for dynamic dynamically allocated um or things like object, name and everything else, and then it just kind of grows from there. In my mind,.

E

If we allow up to 256 requests that mean how many objects can we have open, because every request, I assume you- can have multiple objects if you're doing something.

C

Oh yeah, I guess we were talking about um because potentially creating snapchats, then you have metadata from from like the head address, as well as the snapshot, so you could say double that number.

E

Ideally, like in a system you'd like to set a limit on how many requests you're allowed to accept, and then everything should derive from there, because you say every request could use that many resources and then you pre-allocate everything, and you know how much sorry you do not the cue but the the in-flight object. That's why I was. I was surprised to see that the cap is on the queue, but that's because of the unsolicited data.

E

That's the reason for the cap, but usually what you care about is the in-flight, because everything in the queue it's just 64 bytes, so you could afford to have 1k 2k of them. It's it's really nothing, but in-flight should be a smaller number. 64 128. I don't know.

C

E

Then everything else should derive, because you know how many objects you have based on how many in flight you have and the number three I mean. It's surprised to see that you don't need much of anything because you use stuff you complete, and then you recycle it immediately.

A

Josh, I assume, there's probably some cap on metadata size when you actually figure out what the maximum object name length is and other things. I don't know that we ever explicitly figured out anywhere, but there probably is some.

C

A

C

There's definitely some kind of maximum. You could figure out um the object. Themes can be quite long, especially for rgw, since we end up taking those from like user input, but there's there's a maximum length. There.

A

So, theoretically, we could figure out a buffer size that be statically allocated to allow one object's worth of metadata to fit in.

C

Yeah, okay, but theoretical: oh.

E

No, that has how many objects could be in flight. At the same time, then, you could pre-allocate the whole thing and just recycle them. Yep.

C

Yep yeah at the end of the earlier like in bluestar, we do have more throttles throttling going on to control like how much data is inflate uh and there's a bite total, and maybe an operations throttle too, um but that's kind of the um what's. What's controlling capping the I o resources uh from being overwhelmed.

C

And then the q and the osd is what m clock is acting on, so to be able to reprioritize with within the at danny requests.

A

Well guys go ahead, josh.

C

A

As you say, once you're done the same way, we're going to say I'm going to go because I want, but.

C

Yeah yeah it's getting late.

A

Yeah, okay, bye.

C

All right see you.

A

Sorry, okay, see you later guys thanks, bye.