From YouTube: Ceph Performance Meeting 2022-02-03
A
So, all right: there was very little in pull requests this week. I didn't see any new or closed PRs, which is not terribly surprising given that everyone is very focused on Quincy. There was one that was updated that I saw: the "set tracing compiled in by default" PR that folks on the RBD side have been looking at. I don't think we have Deepika this week, but she reviewed it, and it's gotten a couple of different updates this week.

A
So it looks like it's still under active development; we'll see what the outcome of that is. Coincidentally, one of the PRs that appears to maybe have caused a performance regression a while back that we didn't notice was also related to tracing, so keeping a good eye on how much of a performance regression this PR causes, even when disabled, is probably going to be important.

A
Beyond that, I did not see anything else really going on on the performance front this week for new PRs.
B
Oh, go ahead. Absolutely! Yes, and I have something that I'm waiting on. You don't see it yet because I'm waiting due to the work on Quincy, and I'm not sure it's exactly performance work. There is a PR I'm going to do about the balancer, for the very rare cases in which the existing balancer, the calc_pg_upmaps code, gets really stuck. It's not an infinite loop, but it keeps working; I have an example where it works for more than 10 minutes.

B
Actually, it's really one function call: it goes into some huge calculation. I have a very simple fix that reduces this to less than 20 seconds, with very, very small code changes and only minor changes in the results.

B
It changes the result of the balancer a bit. Probably when the Quincy feature freeze quiets down, I'm going to push this PR. It's maybe 12 lines of code in calc_pg_upmaps, but it at least fixes the examples that I have of this huge performance issue.

B
Again, I don't think it's an infinite loop; I think if I gave it enough time it would finish. But the fix solves all the use cases, all the examples actually, that I have for this.

B
It's not the PG autoscaler, it's the balancer, the calc_pg_upmaps calculation. Okay, so there is one case I have, actually. I have dozens of different configuration files from different systems, with pools which are worth balancing because they are large enough. I don't have the exact statistics, but something like 30 large pools are worth balancing, because all kinds of small pools are not really interesting, and out of them...

B
I have one pool, with one configuration, that triggers it. In a matrix of multiple parameters I see six examples, all on the same pool, where the computation doesn't stop. So I give it a timeout, and the largest that I gave is 600 seconds, 10 minutes, and it still doesn't complete, and after my fix it completes in less than 20 seconds.
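A rough way to reproduce this kind of measurement offline, without touching a live cluster, is to feed a captured OSDMap to osdmaptool, whose --upmap mode exercises the same calc_pg_upmaps code the balancer uses. This is only a sketch under assumptions: the map is assumed to have been saved with `ceph osd getmap -o osdmap.bin`, "bigpool" is a placeholder pool name, and the 600-second cap simply mirrors the timeout mentioned above.

```python
# Sketch: time the upmap calculation for one pool against a saved OSDMap.
# Assumptions: `ceph osd getmap -o osdmap.bin` was run beforehand, osdmaptool
# is installed, and "bigpool" is a hypothetical pool name.
import subprocess
import time

MAP = "osdmap.bin"
POOL = "bigpool"
TIMEOUT = 600  # the 10-minute cap mentioned in the meeting

start = time.perf_counter()
try:
    # osdmaptool --upmap runs OSDMap::calc_pg_upmaps and writes the resulting
    # `ceph osd pg-upmap-items` commands to a file without applying them.
    subprocess.run(
        ["osdmaptool", MAP, "--upmap", "upmap-cmds.txt", "--upmap-pool", POOL],
        check=True,
        timeout=TIMEOUT,
    )
    print(f"calc_pg_upmaps finished in {time.perf_counter() - start:.1f}s")
except subprocess.TimeoutExpired:
    print(f"calc_pg_upmaps still running after {TIMEOUT}s")
```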
A
Oh, excellent. Did you say there is a PR yet for that, or is it still just in a branch?

B

A
You can submit a PR and just tag it with "do not merge" or something if you're not ready to merge yet, just to get more eyes on it. Okay, okay, cool, excellent. So were there any other PRs this week that were new or closed or updated that I missed, guys?

C

A
Did you put it in the pad or the chat window, or...?
A

D
Yeah, so we have been testing a workload in another team, a DFG Quincy test, and Mark, remember I talked with you; it's the same tracker. So let me provide some background to this one. What we did was take Quincy and deploy it on a 192-HDD cluster. It's hybrid: the OSDs' DB is on flash, NVMe, and the data disks are HDD, and we run a fill workload first.

D
We fill the cluster, and it's small objects: the objects range from 1 KB to 256 KB. The histogram is already provided in the tracker. It's a COSBench workload, S3, a completely RGW workflow, and we have 30 COSBench drivers, which means around 160 workers; we were running around 2100, but we slowly reduced it. We were thinking the client was putting on a lot of load, so we reduced this to 1680, we disabled the autoscaler, we pre-sharded the buckets; nothing helped.

D
So what is happening is that reducing the count of the COSBench workers helped us to pass the one-hour hybrid workload, but not the 48-hour aging. The fill is going fine, though it's not adding all 50 million objects to each bucket; we write around 50 million objects per bucket and a few objects are missing from each bucket, but that's still okay, because it's just a fill, not a write. In COSBench terms, for a fill the client doesn't wait for a write to be successful.

D
It just keeps writing data to the bucket, and after that we do a hybrid workload, which is a one-hour hybrid, followed by a 48-hour hybrid. Whatever we have tried, the 48-hour hybrid was not successful at any point.
A

D
And the cluster health has always been fine: no saturation or anything, no OSDs dying or flapping. Nothing.

A
Have you tried, after it gets to the point where it stalls, sending any new I/O to the cluster via other mechanisms?

D
No, no, no. Like a command-line put bucket or create bucket, something like that? Right.

A
Casey, do you have any sense of whether anything in RGW might not be responding, or do you think this might be lower down in the OSD?

D

D

C
Okay, yeah. I assigned Mark Kogan to this, so you might reach out to him and offer assistance, but he's also been tracking some memory growth in RGW, so it'd be interesting to see if we see the same thing in this workload.

A
Thanks, yeah, that's good to know. Lots of changes in RGW, but also some changes in the OSD. I have not seen anything like this on the tests that I've run, but I have not been running three-to-four-hour, long-running RGW stress tests, so I would not have seen this.
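One way to act on the question of whether the stall is in RGW or lower down, sketched here with hypothetical daemon names, is to compare what an OSD thinks is in flight with what the RGW's internal RADOS client is still waiting on. Both are admin-socket queries, so they have to run on the host where the daemon lives, and the field names are the usual JSON output of those commands.

```python
# Sketch: check whether ops are stuck at the OSD level or only inside RGW.
# Assumptions: run on the daemon's host; "client.rgw.gateway1" and "osd.0"
# are placeholder daemon names.
import json
import subprocess

def asok(daemon, *cmd):
    out = subprocess.run(["ceph", "daemon", daemon, *cmd],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

# Requests the RGW's internal RADOS client has sent but not yet completed.
rgw_reqs = asok("client.rgw.gateway1", "objecter_requests")
print("rgw outstanding rados ops:", len(rgw_reqs.get("ops", [])))

# Ops currently being processed by one OSD; long-lived entries here point
# below RGW, while an empty list with stuck clients points at RGW itself.
osd_ops = asok("osd.0", "dump_ops_in_flight")
print("osd.0 ops in flight:", osd_ops.get("num_ops"))
```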
A
All right, well, yeah. Thank you guys, thanks for looking at this. Anything else?
B

A

B
One point regarding the discussion that we had last week about RocksDB: we talked to the people from Speedb. They also saw the recording, and they are committed to opening the source. We talked about this; they are willing to be here next week if we put it on the agenda, and let's see how they can discuss with the people from DigitalOcean.

A
Whether... I'm not sure if next week is going to work. Intel is going to be giving a presentation on Open CAS and their work comparing it to dm-cache, either next week or the following week. I don't know for sure; I gave them the option of either and they're going to try to figure it out, so we might be able to do it next week, but that's still a maybe.

B
So keep me in the picture and I'll also update the people from Speedb, and let's plan for next week and the week after; we'll do both. Let's see which one Intel decides on, and on the other one they will bring the people from Speedb. I'll talk with them and make sure that they know we may change it, because we have a prior commitment.
A
Yeah, I do want to point out, too, that we're happy to talk to them, happy to bring them in to present anything that they'd like to talk to us about. But until it's open source, and especially given that BlueStore is, you know, very quickly heading towards more of a stable...

A
...implementation while we develop Crimson, major changes like replacing RocksDB are, you know, definitely...

A

B

B

A
So, Adam, you weren't here last week, but the folks from DigitalOcean raised some issues that they were seeing regarding RocksDB, and I think Josh was thinking that maybe the Speedb folks had some insights or thoughts regarding that. Josh, do you want to talk a little bit about what you were thinking? And I know Adam has already looked at Speedb a little bit, so maybe it'd be a good discussion to foster.
E

B
Yeah, and from my side I just did pattern matching between what the people from DigitalOcean said and what we heard from Speedb in the past. I didn't fully understand their problem, but it was clear to me that they had a problem with tombstones and with the delete process within RocksDB, and I know that Speedb claims they improved that significantly.

B
They need to prove it, and that's why this whole thing started. I'm not sure that the problem is mainly in Speedb, or... I didn't even have a feeling for how confident the people from DigitalOcean were that the problem is only in Speedb, sorry, only in RocksDB, and not in other places. But if it is there, maybe we have a solution. If it's not there...

F
It sounds like we are planning to invite them. Why don't we get the DigitalOcean folks into a future meeting on the same forum as well, and I'd also like to hear their plans about open sourcing Speedb. I think that'll...

G

B
Yeah, they committed. They already told us that they committed to a customer, I don't know which one, that they're going to open source it within three months, so they're in the process of open sourcing it regardless of us.

B
We have the problem now, and I think it's good to see... Josh.

B
I think that if we have a solution for DigitalOcean... they came with a specific problem that they claim is with RocksDB. If you don't have a solution for them, you leave them with the option of waiting while Speedb works on opening the source; you wait until it's published in order to start tackling a problem that exists now for an important user.
F
That's the biggest problem, right? I mean, let us assume that Speedb solves the problem. How do they even consume Speedb at this moment, unless it's open source? That's where the problem lies. If they have open sourced it and they come to this meeting and talk about how they have done it, they present a solution that somebody else can try.

F
That would be a useful thing to do, but if there is something that is far-fetched and we make promises... I want to get to a resolution, and again, like Mark said, we don't want to invest too much in BlueStore, given that SeaStore is the future. So I think it'll be better to talk when it is open source, and then we can have somebody try their solution.

A
Adam, the gist of it is that when they went and were regularly compacting in the background on a schedule, and it gets deeper than this, but at a very high level, when they were regularly compacting, it dramatically improved performance for them in some situations, rather than just letting compaction happen on write, essentially.
B
Josh, Josh here is from DigitalOcean.

H
No, no, I've been enjoying the conversation. Yeah, so the issue, essentially, is that tombstone and overwrite build-up causes such a degradation in list performance that we start to have significant index performance issues, leading to even OSDs going down from time to time.

A
I don't remember... oh, go ahead. Yes, sorry! No, no, you go ahead. I was just going to ask Adam: do you remember, when you were testing Speedb, did you look at iterator performance or anything else where tombstones were having, you know, a big effect on RocksDB, as in previous things that we've seen?
E

A
I don't remember. When we talked to the Speedb folks before, I wasn't super involved beyond the very beginning, but I don't remember them saying that they had a solution for that problem. I think I even brought it up specifically that we wanted to be able to reduce the impact of tombstones on iteration performance. I didn't think they said that they did that better, but I could be wrong. That was just my vague, probably poor, recollection of that conversation.

A
Back on this topic: yep, yep, okay, so let's move on. I will just very briefly talk about Quincy performance testing. In the etherpad there's a link to a spreadsheet that folks can take a look at if they want to, but the high level of it is that, compared to previous releases, read performance is very much in line.
A

A
However, on these AMD nodes that we just got this year, so we haven't been testing them very long, there's kind of a clear difference going back from Nautilus all the way to Quincy. In some cases we're seeing an almost progressive degradation of performance in these tests. In other cases, like at the very bottom, if you look at column Q, row 93, you'll see a chart for 4 KB random writes, and there's this kind of situation where Nautilus was really good, then from Octopus through Pacific we weren't doing as well, and then in Quincy we kind of clawed it back.

A
So I went back and have been doing bisects all week, trying to figure out what happened there. The reason Quincy is looking good compared to Pacific is due to Gabi's excellent PR changing the allocator behavior and getting a bunch of stuff out of RocksDB. That's what's giving us that win, and we're actually faster than Nautilus, which is good, but we could be faster yet. I think if we go back and look at Nautilus, what happened is that initially in Nautilus we did not see that good performance.

A

A
So that really helped, but it turns out that change was backported to Nautilus. When we implemented it in the pre-Octopus time frame, it turns out that right around the same time we made that change, we also introduced a number of PRs that were actually hurting write performance. It's a little hard to tease all of them out, but the one right now that's standing out is this PR where we introduced changes to tracepoints.

A
This is PR 29674, which I will link here. Specifically, inside that PR there are something like 10 commits, and it appears to be one of the two commits I've got listed in the etherpad that are doing it. That was kind of the biggest, most straightforward regression that I saw. I think there are others, but unfortunately, you know, we're talking...

A
...maybe a percent or two at a time. So the 4K min_alloc_size change was a big win, but then we also had a bunch of smaller PRs that were regressions. This is exactly the kind of situation that's very hard to tease out, especially when they happen close to each other in time.

A
So the good news was that, because we didn't backport a lot of those other regressions to Nautilus, we only saw the big win there, and that's why it really stuck out in the graphs or charts that I showed. My hope is that we can win back some of that, and we'll even see Quincy doing a little better in these write tests than it's doing now.

A
You know, either getting back to Nautilus in some of the larger write tests, or being an even bigger win in the small random write tests where Gabi's PR is providing even better wins. So that's what I've got on those tests; any comments or questions there?
A
Okay, if not, then...

F
One quick question before we move on: have you tested any RGW workloads yet?

A
I have. I didn't provide them here because I haven't graphed out the results. I've been so focused on these RBD results that I wanted to see if I could identify the regressions and see if there are any easy wins there, and then go back and retest RGW. RGW is more complicated because we also saw some regressions in RGW itself.

A
You know, that has big performance impacts in RGW right now. In RGW we know that back in the Pacific time frame there were a couple of regressions that were introduced. It's probably going to be harder to tease out whether those are OSD or RGW regressions specifically, but we do have some results for it. I probably won't dig them out for this video, but...

F
That's fine, that's fine! I was just curious because of the stuff we started off this meeting with, the Quincy issue that Vikkad was talking about; I was just wondering if you've seen something similar or not. But when you have the results, let's talk about it then. Let's move to the next topic.
A

F

F
The tracker? No, no, it's good stuff, it's good stuff. So, in general, I think folks already know: Gibba is a scale cluster, mostly a logical scale. You've got close to 1000 OSDs running with very limited resources, especially memory, and Mark has been suggesting and even tuning the cluster to behave, or at least hold up.

F
I would say not behave, but hold up, in such conditions. So one of the things that we wanted to do was to use this cluster to run some kinds of workloads across the board, and Mark, you've already installed a bunch of stuff that can help us run CBT on this, right?

A
Yeah, CBT is close to working there; it's not a problem. The only thing you might need to do: David had suggested that the way he wants people to handle SSH keys here would be to use the forwarding capability, and to do that with pdsh.

A
There needs to be an environment variable set. It shouldn't be hard to make CBT work that way. I'm not sure if we can do it easily at the moment, but a quick PR would take care of that. So we may just need to make a slight change to CBT for that to work, but otherwise I don't see any problem with running CBT workloads on this; it's at least straightforward.
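A minimal sketch of the environment-variable approach being described, under the assumption that CBT drives the remote nodes through pdsh's ssh module and that SSH agent forwarding is the capability David wants enabled. PDSH_SSH_ARGS_APPEND is pdsh's standard hook for passing extra ssh options; the node list and YAML file below are placeholders.

```python
# Sketch: make every pdsh/ssh hop that CBT performs request agent forwarding.
# Assumption: CBT shells out to pdsh with the ssh rcmd module, which honors
# PDSH_SSH_ARGS_APPEND for extra ssh flags.
import os
import subprocess

env = dict(os.environ)
env["PDSH_RCMD_TYPE"] = "ssh"
env["PDSH_SSH_ARGS_APPEND"] = "-A"   # forward the ssh agent through each hop

# Quick smoke test, then hand the same environment to cbt.py (paths are placeholders).
subprocess.run(["pdsh", "-w", "node[001-016]", "hostname"], env=env, check=True)
subprocess.run(["./cbt.py", "--archive", "/tmp/results", "mytest.yaml"],
               env=env, check=True)
```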
F
What
I
would
like
for
us
to
do
is,
I
think,
ashwarya
and
sridhar
they're
both
on
this
call
they've
been
actively
working
on
qos
background
qos
for
quincy,
and
this
will
be
an
opportunity
for
them
to
test
some
of
the
qs
related
settings
and
parameters
that
we
have
applied
for
quincy,
especially
around
how
this,
how
q,
how
m
clock
behaves
with
scrubbing
recovery,
while
io
is
running
on
this
cluster
at
this
scale,
what
they
have
done
so
far
is
tested
at
smaller
scale
with
ssds,
and
now
there
is
a
test
plan
that
we've
created
kind
of
to
replicate.
F
J
Okay,
I
can
go
ahead,
so
what
we've
been
working
on
is
currently
the
background
tasks
like
recovery
and
scrubbing
and
seeing
how
m
clock
handles
it
with
client
and
how
the
different
amp
clock
profiles,
work
and
we've
been
doing
this
on
the
official
analysis
nodes,
but
it
would
be
great
to
do
it
with
thousand
osds,
and
our
tests
currently
run
client
and
recovery
together
and
there's
a
new
test
coming
in
that
runs,
client
recovery
and
scrub
together
so
and
we
collect
some
stats
on
recovery
and
scrub
or
from
pg
dump.
J
So
we
would
really
like
to
see
it
on
a
larger
scale.
That's
basically
what
we
want
to
do
with
the
gebar
cluster.
F
So
I
think,
there's
a
test
plan
that
has
been
linked
here,
but
it's
a
private
document,
so
I
might
need
to
convert
that
into
an
ethernet.
F
You
can
do
that
offline,
but
I
guess
yeah.
You
had
something.
K
I
know
I
was,
I
was
just
gonna
say
that
you
broadly
outlined
the
the
test,
steps
for
for
each
of
the
tests
that
we
have
identified,
for
example,
the
recovery
test,
the
the
scrubbing
test
and
the
combination
of
the
scrub
plus
recovery
test.
So
these
are
high
level
steps
that
that
cbt
currently
does
so
once
the
one
cvt
is
up
and
running.
What
we
have
tested
so
far
is
at
a
scale
of
about
1000
objects
in
the
recovery
pool.
K
With
different
with
different
profiles,.
F
Do
you
wonder
for
folks
who
are
not
aware
of
what
these
different
m
clock
profiles,
or
do
you
want
to
quickly
describe
what
they
are
and
what
they're
meant
to
do.
K
Yeah, essentially, in mClock we have defined three profiles. The default profile is called high_client_ops, which gives more reservation, or preference, to the client operations while still giving adequate reservations to background recovery and scrub-related operations. The idea is to get a baseline first on this machine and then switch the profiles around. For example, one of the other profiles we have is called the high_recovery_ops profile, and there's also a balanced profile.

K
So with the high_recovery_ops profile, for example, we could run the same test and see if it actually helps in giving higher preference to recovery ops, while still allowing the client ops to keep a decent balance without getting affected too much. So yeah, these are the three basic profiles that we have defined, and the plan...

L

K
...is to establish the baseline first, then test the different test cases with these profiles and extract the numbers to see how mClock behaves.
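For reference, the profile switching being described maps onto a single OSD option in Quincy, osd_mclock_profile, with the three built-in profile names mentioned above. A sketch of flipping it between test passes follows; run_one_test() is a placeholder for whatever CBT invocation the test plan uses.

```python
# Sketch: run the same recovery/scrub test under each mClock profile.
# Assumption: Quincy's osd_mclock_profile option with the three built-in
# profiles named in the discussion; run_one_test() is a placeholder.
import subprocess

PROFILES = ["high_client_ops", "high_recovery_ops", "balanced"]

def run_one_test(tag):
    # Placeholder: kick off the CBT client + recovery (+ scrub) test pass here.
    print(f"running test pass: {tag}")

for profile in PROFILES:
    subprocess.run(["ceph", "config", "set", "osd", "osd_mclock_profile", profile],
                   check=True)
    run_one_test(profile)
```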
F
So, Mark and everybody else, how does this sound? And if there are any thoughts, ideas, or incorporations that we can make into this, I'd be curious to know what those are.

A

M
No, no, yeah! I'm happy about this, and...

F
I'm sure the scrubbing... this scrub test makes you rather more happy.

M

A
We should... we've got lots of PRs sitting out there that we should get into CBT, but we should definitely get both any kind of scrub testing and the mClock changes in, if there's still anything outstanding. It's neat that you guys have figured out ways to do this.
M
I
have
a
related
question
about
that
just
to
understand.
Currently
there
are
a
lot
of
weights
etc
within
the
crop
code,
some
of
they
are
disabled.
When
the
m
clock
is,
is
the
scheduling
method
meter
that
is
chosen,
but
do
we
envision
a
time
when
we
can
remove
the
need
for
any
any
specific
manual,
scheduling,
manual,
delays
and
just
assume
we
have
mdm
clock.
K
Yes,
I
think
the
the
idea
is
to
remove
all
the
manual
configurations
that
we
add,
for
example,
the
scrub
delays
and
all
that
the
idea
is
to
dis
disable
all
of
them
and
let
m
m
clock
based
on
the
profile.
Let
it
do
the
the
allocations
and
and
let's
scrub
go
as
per
the
setting
that
we
have
so
the
long
term
of
sure
we
we
want
to
eliminate
all
those
scrub
sleeves,
for
example,
that
we
have
defined
and
clock
do
its
work
without
any
of
these
settings.
K
Well,
currently,
the
way
the
way
the
code
and
the
code
things
have
been
made.
Even
if
we
add
those
settings,
the
m
clock
code
overwrite
those
back
to
the
to
zero
yeah.
Well,
basically
yeah
it
disables.
It
essentially.
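A quick way to see the override behaviour being described, assuming the sleep options named below (the long-standing WPQ-era knobs) and that `ceph config show` reports the value a running OSD is actually using rather than only what was set:

```python
# Sketch: confirm that a running OSD reports the WPQ-era sleep knobs as 0
# while the mClock scheduler is active. Option names are the usual
# osd_*_sleep settings; "osd.0" is a placeholder daemon name.
import subprocess

SLEEPS = ["osd_recovery_sleep", "osd_scrub_sleep",
          "osd_snap_trim_sleep", "osd_delete_sleep"]

for opt in SLEEPS:
    val = subprocess.run(["ceph", "config", "show", "osd.0", opt],
                         check=True, capture_output=True, text=True).stdout.strip()
    print(f"{opt} = {val}")
```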
F
So I guess the question is about when that new code lands, right? It's a matter of maturity. This is the first release where mClock is going in as the default. If things look good and we get much more confidence that mClock is doing the right thing and we don't need... you know, these sleeps are all associated with WPQ, which is the old OSD op queue that we used to use.

F
At some point we can just say, okay, we don't care about WPQ, because mClock is far better, and at that point we don't need those sleeps anymore. So it all boils down to that new code. If it is targeted for something, let us say the R release, sure, we probably still need those sleeps implemented and accounted for, just in case...

F
...somebody needs to switch to WPQ for performance reasons. But if you're talking about, you know, further down the line, maybe not. So I guess it's about timelines.

M

F
Anybody else have any other thoughts? I know there are other folks who are doing similar tests at different scales, with different kinds of workloads, and I'm sure there's a lot of learning we can get out of each other's experiments. So I was hoping that this could be our... I don't think CBT has ever been used at this scale, at least to my knowledge. So if we can get CBT to run on a thousand-OSD cluster, maybe that can be the workload generator that we can use very easily across the board.
A
Yeah,
I
think
the
biggest
I've
ever
done
is
probably
four
or
five
hundred
I've
never
done
a
thousand.
I've
used
pdsh
on
a
thousand
notes
before,
but
not
cbt,
so
that'll
be
interesting
to
see
how
it
goes
milestone,
right,
yeah,
yeah
and
it'll
be
easy.
This
is
almost
a
little
bit.
Money
then
doesn't
actually
have
to
deploy
the
osd's
spawning
clients-
let's
basically
just
invoking
pdsh
but
yeah
that'll
be
interesting.
D

F
Yep, that's an interesting point. So we are trying to, you know, spread the load across different groups so that we can get a good idea of how mClock is doing across all kinds of operations, the background operations that the OSD does. What Sridhar and Aishwarya are going to be focusing on is more like client I/O versus recovery and scrubbing, and Vikkad and team are doing some PG deletion performance evaluation, also with mClock.

D
General concurrency, and then PG deletion... and then tell me one thing: what is the default profile? Is it high client I/O or is it balanced?

D
Do we have enough documentation upstream for this feature? I mean, what these profiles are?

F
This is one thing I can say yes to, because every time we have merged an mClock change or mClock PR, there's documentation going with it. While we continue with the meeting, I can paste some of the links in the chat for reference. So maybe, Sridhar, you've written a lot of this documentation, maybe you can...

F
...and for Sridhar to take it on, let me know, and then we can, you know, execute some of the test plan that they have.
A
Sure
sure
that
sounds
good
I
can.
I
can
try
to
get
that
taken
care
of
so
that
they're
not
blocked
it
shrieker
or
aishwarya.
Do
you
do
you
have
a
particular
workload
that
you're
you're
interested
in
fio
or
beta's
bench
or
what?
What
do.
A
That's
right
here,
so
I've
got
fio
installed
on
most
of
the
nodes.
There
were
a
couple
that
were
down
and
one
that
was
stuck
on
on
on
rail,
eight
and
there's
not
really
on
on
centos
eight
instead
of
sent
us
stream,
and
that
was
causing
problems
with
the
young
rippo,
but
other
than
that.
I
think
fio
should
be
on
all
the
nodes.
Now
do
you
have
a
specific
amount
of
clamp
workload
that
you
usually
try
to
invoke
or
is
it
you
know?
K
Generally,
we
we
have
right
now.
The
way
we
have
tested
is
just
using
one
one
client.
So
I
I
guess
we'll
have
to
just
do
some
experiments
to,
like
you
say,
keep
the
cluster
busy
on
one
pole
with
client
tops.
While
we
are
triggering
we
have,
while
we
are
triggering
the
recovery
and
the
scrub
operations
and,
for
example,
some
other
food.
A
Oh
sure,
for
scrub
up
and
recovery,
typically
in
cbt
for
recovery
operations,
I
cpt
at
least
from
where
I
remember
it's
been
a
while
since
I've
done
it,
but
I
think
we
need
to
own
the
osd
to
do
it.
A

K
As part of the testing, we did introduce a new kind of recovery test, so that's the test that we're going to trigger; it's already there in CBT. Actually, there are two types of recovery tests: one that you already had, Mark, and then one that we newly introduced to test with mClock.

A
And for the new test: will that work on an existing cluster, with the use_existing flag? Or, when you test it, are you testing it on a cluster that CBT deployed using the ceph cluster class?

L

A
That might be an interesting thing to see. I don't believe my existing recovery tests work when you use use_existing, because it's not aware of the OSDs. When use_existing is set, the OSDs are basically just assumed to be there, and it doesn't try to touch them. So that is actually an interesting question; that could be a little bit of a wrinkle here, with testing when the cluster is already pre-deployed.
F
What
does
it
do
differently?
I
mean:
what
can
you
describe
for
me?
What
is
it,
what
does
that
recovery
test
do
like?
Let
us
say
if
we
induce
recovery
by
just
bringing
osds
down
manually
so.
A

F

A
Oh yeah, if you brought them down manually, that'd be fine. Like I said, it's been a long time since I've looked at this code and since I've thought about this, but vaguely I remember that things in the ceph cluster class, like the recovery state machine, assume that CBT has knowledge of the topology of the cluster, and that happens when it did the deployment itself: then it knows about the OSDs and knows where they are.

A
It knows how they were deployed. When use_existing is set, that is all basically empty; there's nothing there. It doesn't know anything about how the cluster is deployed; it's just running some benchmarks against it. I don't know that the stuff I wrote for recovery will work when use_existing is set.

A
I don't think it does. I could be wrong on that, and it might be possible to change it so it does, but it is primarily due to the fact that CBT doesn't have any knowledge about how many OSDs there are, how they were deployed, and what it should do. It's possible, though, that it might work, and if it doesn't work, it's possible you might be able to change it.
F
Yeah
yeah,
that's,
I
think,
that's
what
I'm
thinking.
Maybe
that's
something
breathing
for
you.
You
can
verify
in
your
local
setup
how
this
works,
but
even
like
I'm
thinking
in
the
same
direction,
there's
nothing
that
stops
us
from
just
manually,
injecting
failures
and
let
the
the
let
cbt
just
you
know,
collect
the
stats
or
or
even
see
how
long
the
pgs
are
taking
to
recover
and
all
that
kind
of
stuff
that
they
have
added.
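A sketch of the "inject the failure manually and just collect the stats" idea, assuming the standard ceph CLI and the usual pgmap fields in `ceph status --format json`. The OSD id is a placeholder, and the poll loop simply measures wall-clock time until everything is active+clean again.

```python
# Sketch: mark one OSD out, then time how long the cluster takes to get all
# PGs back to active+clean. Assumes `ceph status --format json` exposes
# pgmap.pgs_by_state; OSD id 12 is a placeholder.
import json
import subprocess
import time

def pgs_not_clean():
    out = subprocess.run(["ceph", "status", "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    states = json.loads(out)["pgmap"]["pgs_by_state"]
    return sum(s["count"] for s in states if s["state_name"] != "active+clean")

subprocess.run(["ceph", "osd", "out", "12"], check=True)   # inject the failure
start = time.perf_counter()
while pgs_not_clean() > 0:
    time.sleep(5)
print(f"recovery completed in {time.perf_counter() - start:.0f}s")
subprocess.run(["ceph", "osd", "in", "12"], check=True)    # restore
```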
A
Yeah, it might not even be a hard change, right? I think I just didn't... When use_existing was created so that people could run on existing clusters, it was primarily for running a benchmark on, like, a partner setup. So, say, if Supermicro had some machines in their lab that they wanted to test, to verify that they saw good performance on them.

A
That was kind of the idea: that someone could just deploy stuff and then run CBT against an existing cluster like that. Probably the biggest issue, as I say, is I don't think it's ever been tested, so that's probably the first thing: just try it and see if it works or not.

K
Sure, we can try to run it on our local setups and see if it works, and if not, see if we can get it working with the flag. Yeah.

A
Yeah, and hopefully fairly quickly I should be able to get an example configuration running on Gibba, so that you guys can do testing there too. I'll probably just set up something like 16 client nodes that you can use, if you want, for the workload; that should be sufficient to really saturate the cluster, I think, given the speeds we're talking about here.
A

F

A

A
I don't know, but I didn't want to bother with all that, so I recreated the Gibba full-memory configuration that we kind of set up, on one of our performance nodes, and just ran some CBT benchmarks there, both looking at the settings that we set and also trying the tcmalloc thread cache at 256 megabytes instead of 128 megabytes, and the overall memory usage looked very similar to Gibba.
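For context on the two numbers being compared (128 MiB vs 256 MiB), a sketch of where those knobs usually live follows, with the caveat that the tcmalloc thread-cache environment variable is only read by the OSD at startup (it is typically set in /etc/sysconfig/ceph or the service unit), while osd_memory_target drives the priority-cache autotuning. The numeric values below are illustrative, not the settings used in these tests.

```python
# Sketch: the two memory knobs discussed above.
# TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES must be in the OSD's environment when
# ceph-osd starts; osd_memory_target can be changed at runtime.
import os
import subprocess

# Illustrative only: this would need to be exported in the OSD's service
# environment (e.g. /etc/sysconfig/ceph) before ceph-osd starts.
os.environ["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"] = str(256 * 1024 * 1024)

# Cap the OSD's autotuned caches (illustrative 4 GiB target).
subprocess.run(["ceph", "config", "set", "osd", "osd_memory_target",
                str(4 * 1024**3)], check=True)

# Compare accounted mempool usage against the process RSS seen in `top`.
subprocess.run(["ceph", "daemon", "osd.0", "dump_mempools"], check=True)
```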
A
We ended up right around a gigabyte before the thread cache changes, and a little over 925 megabytes, I think, when we reduced the tcmalloc thread cache, and it was fairly stable through 4K random writes, which is usually a fairly decent test to invoke OSD memory growth.

A
There are situations where we can use more, for sure; especially, we found one around the PG log, or sorry, PG splitting. That's a situation that could potentially invoke significant OSD memory growth. But this got us really close to what we were seeing on Gibba, which is great. There's a little more to it that I was able to investigate on OSD startup.
A

A
The RSS memory usage after we told tcmalloc to release memory was around 50 megabytes. So figure at startup that's kind of where we're sitting, in that ballpark of maybe 30 to 50 megabytes of RSS memory usage, somewhere around that. Once we invoked an RBD pre-fill workload, which is essentially 4-megabyte writes, OSD memory immediately shot up to around 450 megabytes and then progressively grew from there as we started filling in stuff, peaking somewhere around 480 to 500 megabytes of memory usage, RSS memory usage specifically.

A
After the pre-fill finished, the test started a 4 KB random write workload, and memory again shot up to about 900 megabytes to a gigabyte, depending on the tcmalloc thread cache setting. So we saw this significant growth as soon as we started doing small I/O, and this is all very much in line with what we've seen in the past, especially in relation to the onode cache in BlueStore. We see that the onode cache especially seems to cause a lot of memory fragmentation, but it's not the whole thing.
A
There's
definitely
other
stuff
in
here
where
we
see
memory
growth
well
beyond
what
we
see
based
on
the
mempool
counters
and
based
on
some
other
things.
It's
it's
pretty
interesting.
So
I
I
went
back
and
tried
to
look
at
using
the
tcml
keep
profiler
after
we
had
kind
of
done.
Some
of
these
different
workloads
and
the
results
are
in
the
ether
pad
there.
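The profiler runs referred to here can be reproduced with the tcmalloc heap profiler built into the OSD. A sketch follows, assuming the usual `ceph tell ... heap` commands and the default dump location under the OSD's log directory; the exact path and the pprof executable name vary by distribution and are assumptions.

```python
# Sketch: capture a tcmalloc heap profile from a running OSD and summarize it
# with pprof. Paths and the pprof executable name are assumptions.
import subprocess

OSD = "osd.0"
subprocess.run(["ceph", "tell", OSD, "heap", "start_profiler"], check=True)
# ... run the workload of interest here ...
subprocess.run(["ceph", "tell", OSD, "heap", "dump"], check=True)
subprocess.run(["ceph", "tell", OSD, "heap", "stop_profiler"], check=True)

# Summarize allocations by call site (heap file name/location assumed).
subprocess.run(["pprof", "--text", "/usr/bin/ceph-osd",
                "/var/log/ceph/osd.0.profile.0001.heap"], check=True)
```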
A
Since
we
don't
have
a
lot
of
time,
I'm
not
going
to
open
up
and
just
walk
through
them,
but
but
feel
free
to
take
a
look
if
you're
interested
there's
some
really
interesting
stuff
there.
I
will
say
that
the
in-use
numbers-
I
don't
trust
that
tc
malloc
is
claiming
or
p-prof.
One
of
the
two
is
claiming
that
we're
using
way
more
memory
or
have
way
more
in-use
memory
than
than
we
do.
So.
I
suspect
that
maybe
it's
missing
some
possibly.
A
I
trust
the
alloc
data
a
lot
more.
That
looks
a
lot
closer
to
what
I
would
expect
to
see
and
it's
very
high.
We
do
a
lot
of
memory,
allocation,
same
amount
of
memory
allocation,
but
that's
very
similar
to
what
I've
seen
in
instrumenting
the
osd
in
other
ways
previously-
and-
and
it's
really
interesting-
I
mean
buffer
list-
is
you
know
clearly
the
the
big
thing
here
right?
We
we
allocate
tons
and
tons
and
tons
of
memory
and
buffers.
So
that's
really
big,
but
but
there
are
some
other
interesting
things
there
too.
A
Finally,
radic
and
I
sat
down
a
while
ago
back
in
2020,
I
think
and
tried
to
create
a
pur
per
thread
ring
buffer
for
specifically
for
the
use
case
where
we're
pending
memory
like
little
appends
for
encoding,
and
I
went
back
in
and
started
testing
that
with
master
again,
and
it
turns
out
that
if
we
limit
the
size
of
memory
allocations
for
the
ring
buffer
to
be
within
like
64k
or
you
know,
with
an
8k,
we
can
actually
get
a
little
bit
of
a
memory
usage
gain.
A
By
doing
that,
even
though
we're
allocating
more
memory
per
thread
for
that
ring,
it
lowers
fragmentation
enough
that
we
see
a
small
win
out
of
it.
It
looks
like
on
the
order
of,
like
you,
know,
8
to
16
megabytes.
Maybe
I
pretty
consistently
saw
in
the
test
the
configurations
I
was
looking
at
that
memory
usage
was
down
a
little
bit
despite
allocating
more
memory
for
this
thing,
so
that
was
really
interesting
to
see,
but
the
pens
aren't
the
big
consumer.
A
We
we
improved
that
a
couple
of
years
ago
by
changing
the
way
that
that
that
works,
so
that
we
don't
just
do
like
lots
of
tiny
little
appends
for
the
pen
hole
use
case.
We
we
grow
it
if
there's
lots
of
little
ones,
kind
of
the
way
the
vector
works
in
sql
plus.
So
that
does
not
appear
to
be
the
the
real
win.
A
Maybe
the
real
win,
I
suspect,
would
be
to
change
the
way
that
the
messenger
works,
to
allocate
a
block
of
memory
and
then
not
move
like
allocate
memory
for
for
buffers
and
then
move
them,
but
rather
to
just
have
pre-allocated
memory
and
then
use
them
and,
like
you
know,
send
a
pointer
around
or
whatever
that's
a
lot
of
work.
A
I
don't
know
if
it's
worth
it,
but
if
you
look
at
those
prof
profiles,
the
the
messenger
allocations
seem
to
be
a
fairly
big
percentage
of
the
overall
allocation
behavior
along
with
other
things,
you'll
see
it
in
there.
So
that's
that's
what
I'm
looking
at
here.
I
don't
know
what
the
right
route
is
to
reduce
our
memory
usage
further
right
now
we're
hitting
kind
of
a
hard
wall
at
900
megabytes
rss
in
the
sub.
A
We
could
probably
make
it
a
little
smaller,
but
fragmentation
is
a
pretty
big
problem
to
lure
further.
I
think
and
that's
what
I
got.
F

F
What kind of behavior do we see, or where do the bottlenecks lie, and how much further do we need to go, kind of thing?

A
I suspect that what we've done is we've lowered the steady-state memory usage, right? Like, we've done all these things that are really clear consumers of memory, and now our steady state looks lower; you know, we got it down to like a gig, right? But if there are things that consume memory that we're not controlling here, we could still see big spikes.
A
Exactly
exactly
and
and
yeah
fragmentation
is
awful,
I
mean
this
is
tcml,
does
about
as
good
of
a
job
as
we've
ever
seen
in
terms
of
controlling
and
and
dealing
with
stuff's
behavior
memory
allocation
behavior,
our
our
penchant
for
for
allocating
little
things
all
over
the
place
and
and
it
even
tc
malik
struggles
with
what
we
do.
A
Certainly
lipsy
malik
deals
with
it
far
far
less
gracefully
than
tc
malek
does.
So.
If
we
want
to
make
this
better,
if
we
really
want
to
make
it
better,
it
probably
means
changing
the
way
that
we
allocate
memory.
A
Gabby,
I
am
going
to
call
you
just
briefly
here.
I
know
you
were
kind
of
interested
in
some
of
this
kind
of
stuff.
Does
this?
Does
this
sound.
P

L

L

L
The OSD would still do a single allocation, update RocksDB once, and then send the data to be written to BlueStore, but bypassing RocksDB and bypassing the allocation. So I know it's very tricky to do; it's really easy to say this, but I know the devil is going to be in the details. That's why I suggested a different approach. It could be done by them, but it's, of course, going to be less useful for us.
A
Yeah
I'll
take
a
look
gabby,
you
know,
even
just
even
if
it
doesn't
fix
any
problems
just
having
somebody
going
through
and
looking
at,
where
we're
allocating
memory
and
where
we're
just
having
a
really
clear
view
of
of
what
kind
of
behavior
we're
requiring
of
tc
malik
and
updating
that
to
make
sure
we
understand
what
that
really.
L

L
Yeah, so I'm still not sure that this thing would be possible to do, or if it's possible to make it into something really functioning, but it might just be able to prove where the money is. So hopefully they will be able to do part of this. I really tried to break it down into many steps, so that even after the first step they have something they could present, because it might be that even the first step would prove too difficult to do. And I also want your opinion on something else.
L

L

M

L

A

L

L
I can't put you on the mailing list there, but it's really something that looks unreasonable. I don't expect... I mean, I understand that if the map is very big, once there are all kinds of failures, disconnects, failover and fallback, the calculation is going to be polynomial.
A

L

B

L

B

G

G

G
It's looked up if the OSD map has a pg_temp or one of the various other overrides we have, but otherwise it's a calculation, and we don't cache it. And the whole point of CRUSH and the OSD map is to not encode the lookup in the map; it's a calculation.
L
Greg,
let
me
just
put
something
into
perspective,
and
actually
this
thing
might
be
a
a
good
way
to
approach
it.
Ibm
is
trying
to
put
a
safe
client,
safe,
rbd
client
on
a
smart
nic,
and
the
smart
nic
has
a
very
what's
the
word,
a
very
powerless,
not.
L
So
one
thing
we
tried
to
suggest
was
the
the
the
solution
that
we
are
trying
to
push.
Also,
that
is
currently
under
development,
assumes
that
every
ost
going
to
have
a
gateway
solution
where,
when
the
io
arrive
to
it,
it's
going
to
do
a
full
verification
and
if
it
belongs
to
it,
it's
going
to
push
it
down.
Otherwise,
it's
going
to
be
forwarded.
R

G
That's fine, but it is not how it works right now. To answer your question, Mark: you know, 10 or 12 years ago it was on the order of 10 microseconds with the maps that we tested it on. I don't know how dependent that was on the map complexity, whether it got worse as they got more complicated, and I don't know how much it scales with CPU frequency.
G

A

G
When someone comes along saying "we're spending some time in CRUSH", yeah, you can try adding caching... and then they don't.

A
Yeah, I couldn't remember if in Greg's, I'm sorry, in Sage's thesis he actually talked about the time complexity or not, but...
G

A

B

G

L
How big a configuration are you talking about?

L

G

L
A configuration... once the configuration grew to 1000 OSDs and 60 nodes, then it became, what was it, 40 milliseconds?

L

G
I mean, the last time I saw it get really long, I think it was as a result of... I don't remember what was triggering it, but we were getting retries on lots of...
L

A

G
The last time it got really long, or that we saw it get really long and did something about it, was as a consequence of "out" OSDs still being in it. So you hit lots of retries and backoffs, and, you know, it goes pretty badly, and I think that might have been partly resolved in one of the straw iterations. I don't remember.
G
It
was,
I
think
it
was
fixed
in
one
of
the
straw
configurations.
At
least
one
of
them
was
actually
about
waiting
being
wrong.
I
don't
know
if
another
one,
I
don't
remember,
yeah
yeah,
I
mean
the
straightforward
answer
to
this
is
yes,
it's
probably
worth
I
mean
the
problem
is
it
depends
on
the
number
of
pgs
that,
like
the
finest
tax
thing,
it's
definitely
worth
like
looking
at
cashing
the
crush
calculations
again,
but
no
one
has
done
that
yet
because
it
has
never
quite
been
worth
it.
L
Okay,
but
now
on
your
side
mark,
are
you
guys
testing
this?
On
the
client
side,
I
mean,
even
if
it's
going
to
be
25,
40
micro
seconds,
it's
a
lot
of
money
when
you're
talking
about
nvme.
A
Ssd
client-side,
rbd
and
cfs
have
probably
been
the
fastest
in
terms
of
single
client
performance
and
on
the
rbd
side,
which
is
probably
what
we've
got
the
most
experience
with.
Looking
at
at
long-term
testing,
we
are
far
more
limited
by
the
implementation
of
the
client-side
rbd
cache
than
we
were
by
crush
crush.
Very
much
was
not
the
majority
consumer
of
of
time
compared
to
other
things.
But
having
said
that,
these
are
on
configurations
that
are
like
between
you
know,
maybe
eight
and
up
to
a
hundred
osds
we've.
L

A

A

L
Maybe we should try to get the numbers, because it might trigger some changes to this. Maybe this gateway design that we are trying to push on the IBM cloud side could actually be general-purpose, or maybe it could be used if you have more than a certain number... or maybe you say, you know what, we could do a flat calculation if you have no more than, I don't know, 256 OSDs.
E

E
If you had, like, a binary tree of hierarchy versus having everything just on one level... because if you have a thousand OSDs on one level, then it will pick from all of them, and there will be a thousand contenders for a specific seed to get chosen for a PG set. If you have that tree, it will just reduce that set.
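A back-of-the-envelope way to see this point, assuming straw2's cost per bucket is roughly one hash draw per item in that bucket: with N OSDs in a single flat bucket every placement draw touches all N items, whereas a hierarchy with fan-out b only touches about b items per level over log_b(N) levels.

```latex
% Hash draws per replica choice (straw2, assuming cost ~ items per bucket)
\text{flat bucket: } C_{\mathrm{flat}} \approx N, \qquad
\text{hierarchy: } C_{\mathrm{tree}} \approx b \cdot \log_b N
% Illustrative: N = 1000 OSDs, fan-out b = 10
% C_flat ~ 1000 draws vs. C_tree ~ 10 * 3 = 30 draws per replica
```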
E

A
That would not surprise me, Adam. That seems like a very good observation.

A

L

A
But I will say that, especially for something like a smart NIC, right, if you're going for a really high... a single client, big...
L

L

A
We don't have any mechanism right now for restricting an object in a pool to only a subset of the OSDs that are there, that I know of anyway. Like, we can't just have, you know, an RBD pool where a given block is represented by a small subset of the OSDs.
L

L

G
It should depend on the number of buckets you have to descend through and how many times you have to back off and retry. I don't think the absolute number of OSDs matters. I guess, you know, given that they tend to be divided into racks and rows and stuff, probably there's something like a logarithmic scaling as you split those up, but in general it's just about the number of times you have to go through a bucket, like a crush bucket, and run a hash.
G

L

L

G
I mean, nobody's going to know offhand. Someone will need to go look at the crush code again, look at the script you ran, and identify whether it matches. If you didn't just run crush: I would just make up a couple of maps of varying complexity and run crush on them in the environment you care about. There are lots of ways to invoke that using the crushtool, and it has testing built in.
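The suggestion above can be done entirely offline. A sketch follows, assuming the standard crushtool --test options shown and that the map was extracted with `ceph osd getcrushmap`; the rule number and replica count are placeholders. Timing many mappings in one invocation averages out the per-call overhead.

```python
# Sketch: time CRUSH mappings for a compiled crush map, outside any cluster.
# Assumes the map was saved with `ceph osd getcrushmap -o crushmap.bin`; the
# rule number and replica count are placeholders.
import subprocess
import time

N_INPUTS = 100_000   # number of placement inputs (x values) to map

start = time.perf_counter()
subprocess.run(
    ["crushtool", "-i", "crushmap.bin", "--test",
     "--rule", "0", "--num-rep", "3",
     "--min-x", "0", "--max-x", str(N_INPUTS - 1),
     "--show-statistics"],
    check=True,
)
elapsed = time.perf_counter() - start
print(f"{elapsed / N_INPUTS * 1e6:.1f} microseconds per mapping (incl. tool overhead)")
```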
G

L

L

G

L

A

L
Should we add you guys to the mailing list there, or won't you be interested? I mean, Mark, you'll probably have no way to escape it.

L

G

A
Gabi, is it possible that they could have a live session, so we can actually do real work here on this, and we'll just talk about it? Yeah, sure, okay. Let's do that, and then let's just profile this thing and see where it's spending time. That will give a good indication of where in the code to start digging in.
L

A
They can do it too. I mean, they should have perf, they should have RPMs for perf, so they can run that directly, and then my thing should be really easy to compile as long as they've got, you know, GCC and the usual build tools on the system, which they probably do, so that should work.
L

A

L

L

A

L

L
So I don't know... actually, I don't know at all how the map would look for thousands of OSDs. How big would it be? Could you fit it inside an L1 cache? Because this testing script is just doing one request after another, so eventually everything will be in cache if it's not too big. L1, I think, is 64K, and L2 is something like 512K, but it's shared between all the cores.

L

A
I suppose the other thing is whether or not the benchmark that is being run is actually representative of what a real client would do in terms of the cache, right?
L

L
Our processors are very bad at doing that, so I suggested a middle way, in which we will compute a request on the NVMe side, on the NVMe queue, before they come to the FPGA. The FPGA would look at them and push just the I/O parameters, using some kind of ring buffer, to the ARM cores. The ARM cores would do the crush calculation and push it back, and once you have it, you can process the I/O from the queue.

L
Now, if we assume that the I/Os are coming at a steady rate and the queueing is good, then you should be able to get close to the number of crush calculations we can do, and the ARM processors would be spending 90 percent of the time just doing crush calculations, plus some maintenance jobs, like, I don't know, if the crush map has to be updated they have to do something. But on the normal flow, that's all they'd be doing, just crush calculations, so the cache is actually going to be...

L
It's going to be reasonable to assume that the cache is going to be all around the crush calculation. But how big is a crush map? And can we maybe make it smaller, by fitting it into the L1 cache or L2 cache, maybe even L3? That might be a huge difference in performance.

L
That would apply to everything.
L

A
The other thing I was thinking is that the perf team at Red Hat had some changes to perf to look at exactly these kinds of issues, and it might be worth reaching out to one of them to also provide guidance on some of this.

A
I don't remember who it was... someone presented some of their work a couple of years ago to the Ceph community.

A
I'll try to look it up and see if I can find out who it was. One of the guys over there is kind of an expert in this area, so it might be worth pulling him in if we can't figure it out ourselves.

A
All right, well, we're way over time. Let's wrap this one up, and... yeah, done.
L

L

L
Four megabytes is a very common size for an RGW client, so, as I said, that's going to be a real use case. But I was convinced by the other Mark that RGW is too complicated for them and they should start with RBD. If anything proves correct, we could then say the same thing should apply to RGW, so don't ask them to understand RGW; but aside from RGW, the rest is okay. And actually, one more thing: there is something, I'm talking about the network, where you don't have to wait...

L
If there's a replica, you don't have to wait on the ack from the replica... and then I realized you don't... it's okay, that's very easy to do.
A
Yeah,
like
yeah,
like
I
said
earlier,
the
the
big
win
for
me
would
be
if,
if
someone
can
really
take
a
look
and
figure
out
how
our
memory
applications
look
just
in
the
code,
that
would
be
that'd
be
huge
if
they
can
verify
where
we're
allocating,
where
we're
freeing
and
how
much
we're
doing
in
different
areas.
Some
of
that
that
prof
output
that
I
provided
in
the
ether
pad
as
kind
of
start
in
that
direction,
but
but
having
someone
really
run
with
it
would
be,
would
be
great.