From YouTube: Ceph Performance Meeting 2022-12-08
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A
So I finally, for the first time in a while, made it through all of the pull requests, so we've actually got a fairly up-to-date list now. Luckily, last week covered the majority of the work, since it had been a while since I had done it before, but there were still a couple of edits that I needed to get in, so hopefully this is accurate now. In the last week I saw nothing particularly new; if I screwed up, let me know, but I didn't see anything.
There were a number of pull requests that got closed; those were all by the bot from what I saw. And there was an updated PR here, this one on the RGW side for D4N; it looks like that's gotten some updates, so Casey's been on top of that. Otherwise, lots of stuff from last week that hasn't seen movement. We had a large number of PRs come in related to RocksDB.
A
Well, it's not large, maybe just a few, but we should probably decide if we want to try updating to the most recent version, which is required for this compression option.
A
He's not here today and neither is Igor. I did want to talk about that one to see what kind of effects it might have, but since they're not here, let's move on. There was also a PR from Igor about faster allocation in the AVL and hybrid allocators; I glanced at that, but I need to look at it more in depth. I think Igor was hoping for me to review it.
A
So that's on my list. Otherwise, I think we covered a lot of the new stuff last week. So, anything I missed that anyone has, either new or updated, or anything anyone would like to talk about regarding pull requests?
A
All right! Well then, let's move on. Luke, I see that you've put a discussion topic in. Do you want to take over?
C
Yeah. We have a Ceph cluster where we're trying to optimize for performance, of course, so it's a great place to be on this call. In our last cluster we used RAID 5, and now we're looking to see what RAID options people have made use of. We're in the kind of setup where we have, say, 200 boxes that each have 24 spinning drives and one SSD metadata drive, and we're trying different configurations.
C
We started playing around even with, like, three-spinning-disk RAID 0 clusters. What have people tried as far as different RAID 0 or RAID 5 clusters, or just all spinning disks separate, and what are the implications there when you just have one metadata drive?
A
Sure, sure. It's been a while since I've done anything really exotic. Back in the early days of Inktank we looked at a lot of different things like this, where we were trying different RAID configurations and different replication levels, and at least back in those days with FileStore, we definitely saw that the controller cache made a fairly positive difference in terms of performance, especially, I think, with the way that FileStore did journaling.
A
It seemed like that helped quite a bit in terms of getting faster, lower latencies. With BlueStore it's a little trickier, right, because you've essentially got the write-ahead log, typically on the NVMe or SSD drive, and we actually saw early on that sometimes the controller cache could get in the way. That happened on some earlier HP boxes, and actually I think we might have seen it with earlier LSI controllers as well, but I haven't looked at it since; that was like five years ago. So in terms of what's out there now, I'm afraid I'm not really up to date on what the state of the art is for this kind of stuff.
C
I mean, our workload: we typically don't have a problem with our ingest and write speeds; that's not what we're necessarily optimizing for. Most of our heavy workloads are through S3, and most of our data is, say, 30-megabyte parquet files, and we have different operations that are reading tens of thousands, hundreds of thousands, of these rather large parquet files from S3. We have other workflows, but I'd say that's our most demanding one.
A
The impact of having to do reads from different EC chunks on secondaries, rather than just the primaries, is kind of rough. If you're looking at 3x replication versus EC, it can actually be faster to do EC if you're doing big writes, but for reads, at least, I've never seen EC actually be faster than just always reading from the primary. So that's something to consider, I guess, along with whether or not you want to do RAID under an OSD.
C
So we'll test a bunch of these configurations, but we're thinking both RAID 0 testing and RAID 5 testing. RAID 6, I don't think we're going to need; the extra layer of parity, the extra disk, I don't think we need, depending. So yeah, RAID 0 and RAID 5 are mostly what we're thinking.
A
Sure. And it depends on what you implement, too. I don't know what vendor, if you're going through any vendors for support, but they may have their own opinions on what they're willing to support as well.
A
Oh okay, okay! So that's like community stuff. So from that standpoint, whatever you guys are comfortable with, you can try.
A
I'm going to guess that at some point you're going to be limited by the OSD. I don't know how fast these drives are, but if you're creating a big RAID under the OSD, you may start seeing that the OSD itself could be a bottleneck. But maybe not; I haven't tried it in a long time, so I guess I can't tell you. You can get pretty fast read and write throughput to an OSD backed by an NVMe drive: you know, two to three gigabytes per second per OSD, pretty reasonably. But having said that, that's a different device than a RAID full of hard drives, which may or may not have other bottlenecks that hurt you.
C
And so what do you expect the difference between EC versus just replication to be? Definitely for our S3 store, in our last cluster and in this one, we're only thinking of EC, but is there a big performance hit?
A
Well, you have to fetch chunks from the replicas, right, and you're breaking the object down. Instead of having, say, one big object... I don't remember how RGW breaks stuff down by default, probably into four-megabyte chunks. Casey, is that right?
A
So instead of having a big four-megabyte object that you can read just from the primary, now you're reading a local copy from the primary, but then you're going and fetching smaller chunks from other OSDs to recombine back into the object that you can give to RGW. So the impact there depends on what the highest-latency fetch is from all of the secondaries, and that depends on the network.
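[Note: the chunking described here is governed by RGW config options; one way to check the values on a live cluster (assuming the usual client.rgw config section) is:

    ceph config get client.rgw rgw_max_chunk_size
    ceph config get client.rgw rgw_obj_stripe_size

Both default to 4 MiB in upstream Ceph.]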
C
I know you said you haven't done RAID testing in a while, but if we're primarily doing those kinds of S3 reads as our biggest workflow, does that mean we want to make the stripe size, if we're doing RAID, as big as possible?
A
In terms of the hardware RAID underneath? Yeah, that's what it used to be like for Lustre and other things back in the day: if you're doing big streaming reads, you wanted to both make the chunk size big and set the kernel setting for, like, max_sectors_kb. I don't know if you've ever looked at that.
A
What's the setting? max_sectors_kb: basically, how much you're breaking stuff down at the block device layer. For tuning RAID, you probably want to be looking at that, and at what the max setting the device allows.
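[Note: as a concrete illustration (device name hypothetical), the current cap and the hardware limit live in sysfs, and the cap can be raised at runtime:

    # current request-size cap and hardware maximum, in KiB
    cat /sys/block/sda/queue/max_sectors_kb
    cat /sys/block/sda/queue/max_hw_sectors_kb
    # raise the cap toward the hardware limit (as root)
    echo 1024 > /sys/block/sda/queue/max_sectors_kb
]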
C
I think, or maybe what it is, is that at the OS layer it actually detects the chip. I think I've seen dmesg say something about LSI, so maybe it sets that somehow.
A
Yeah, it used to be the 2008s were the cards I remember testing on, but that was a long time ago. So I guess the answer is: I expect that if you do EC at the Ceph level, it's probably going to be a little bit slower for this than just doing replication, because of that extra work I mentioned. And if you do RAID below the OSD, there are probably going to be some things you'll need to tune that are specific to the hardware configuration you're running, in terms of the performance of that RAID setup. I would expect it to behave like a big, fast hard drive, right? It's not going to be good at small random I/O, probably not as good as the disks individually, I would think, but the controller cache is going to have an effect there.
A
So, you know, lots of variables.
C
I appreciate that. So we will be playing with a bunch of those different parameters; there are lots of things to test, and it takes a little while to build up a whole test cluster with the different configurations. So, when we're doing our testing and we're doing a bunch of reads, ultimately what we get is that we seem to overload the RGWs and start getting timeouts. Is that what you'd expect when you're, like, stressing S3? I'll load five terabytes' worth of data in these 30-megabyte chunks and then try to read it as fast as possible, and basically, the more workers I have reading it, eventually I just start getting RGW timeouts, or at least that's what's reported from the API at the application layer, from an S3 read.
D
I think so, yeah. I mean, a client shouldn't time out unless it receives no data at all for something like 60 seconds, or whatever its timeout is, and RGW should be able to stream data.
C
What I see from our RGW logs is, say I'm doing a five-terabyte read: I start getting what looks like pretty good read speeds for a little while, for one or two terabytes, and the latencies look good, and then all of a sudden I'm getting latencies in the 10-second range in the RGW logs, and then I start getting timeouts. And, say, that took five minutes to read five terabytes in this setup.
D
How are you measuring the latencies? Are you just looking at...
C
Well, for quite a while I'm getting sub-second latencies, like 0.1 to 0.2 seconds, and then it just hits something, and all of a sudden it shoots up to five seconds, ten seconds, and at the same time I start getting timeouts at the application layer, which is what's seen from the S3 perspective: timeouts.
A
There was a comment in the chat window; I don't know if you saw it.
A
In terms of diagnosing further, one thing that might be good is to try to track down what's going on in RGW versus what's going on in the OSDs. Even at a surface level, what looks busy, right? Is RGW really busy? Are the OSDs really busy? Can you start isolating what's doing the work?
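[Note: one way to take RGW out of the loop entirely is to benchmark a scratch pool on the same OSDs with rados bench and compare against the S3 numbers (pool name and PG count illustrative):

    ceph osd pool create benchpool 128
    rados bench -p benchpool 60 write --no-cleanup
    rados bench -p benchpool 60 seq
    rados -p benchpool cleanup

If RADOS-level reads stay healthy at the point where S3 reads fall over, the bottleneck is more likely in RGW than in the OSDs.]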
A
Okay, pretty big. And when you see this high latency start kicking in, what kind of throughput are you getting, aggregate?
C
We're reading at about 20 gigabytes per second; that's our aggregate read speed from our kind of test read cluster, our workers, and that goes on for, say, a minute or two at that kind of rate.
A
That might be worth doing. And we have seen situations sometimes where the network, like the switch, looks good when you're doing point-to-point tests, and then, if you saturate it with traffic from all NICs to all other NICs, the switch kind of falls apart.
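[Note: a rough sketch of that kind of all-to-all check with iperf3 (hostnames illustrative); the point is to run many pairs concurrently rather than a single point-to-point stream:

    # on every host, start a server in the background
    iperf3 -s -D
    # then, from every host at the same time, target a different peer
    iperf3 -c <peer-host> -P 4 -t 30
]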
A
In the past that has primarily been due to weird issues with the bonding configuration. I don't know if you're using bonding, but...
C
We're not using bonding, but we're actually thinking about it. They're dual 25-gigabit NICs, but we don't have the other port lit, so something we could play around with is bonding that and having the two 25s bonded. We haven't played with that yet, though.
A
Sure. But okay, let's just assume that you can do, say, two gigabytes per second, which is reasonable per NIC. So with 20 of those, up to around 40 gigabytes per second would be what you'd expect the limit to be, assuming it's all working right. So that shouldn't be limiting you; you should be able to do 20 pretty easily. All right, okay. So then, I think... and did you say a three-disk RAID 0 behind each OSD? And what was the Ceph configuration?
C
What was the question? I mean, we have eight OSDs per box... Replication or erasure coding? Oh, all right: erasure coding.
A
It was erasure coding, okay. So you're breaking up that four-megabyte object into multiple smaller objects, and then you're asking the RAID 0 to service those. I wonder a little bit if the combination of EC and RAID is maybe not the way to go; maybe try one or the other.
A
So maybe, if you've got a test cluster to just play on, you can create different pools with different replication or erasure coding factors. Maybe don't redeploy everything yet, because you've already got the three-disk RAID 0 ones, but just try a couple of different pools with different configurations, EC or replication, and see how that affects your results.
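[Note: for example (names and parameters purely illustrative, not a recommendation), side-by-side test pools might look like:

    # 3x replicated test pool
    ceph osd pool create test-rep 128 128 replicated
    ceph osd pool set test-rep size 3
    # 4+2 erasure-coded test pool
    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd pool create test-ec 128 128 erasure ec42
]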
C
...the write path, is that what you said?
A
Yeah, exactly right. Well, actually, that's not true: it is for the write-ahead log, but not for database onode reads, right? So sorry, that was a stupid thing to say. It primarily affects your write path, in terms of the write log; the database itself is going to be writing out SST files or reading in SST files.
A
So you're going to expect to see bigger reads there, which may actually be better on the hard drives. If you only have one device for 24 hard drives, you're just going to have to see; it's going to be, I don't know, a complicated thing.
A
Yeah, tough to say. With just one for 24 hard drives, there's a good chance that you're going to be overloading that thing, but it depends, then, on how much memory you have and how much cache you have per OSD.
A
Okay. So, the OSD will try to self-tune its caches based on the target that you give it for memory, and the big one that affects performance here is the onode cache in BlueStore; an onode is basically like an inode, but for Ceph objects.
A
And if you don't have enough onode cache in BlueStore to be able to fit all the onodes that you're working on, then it will fetch from RocksDB, and RocksDB maybe will have to read an SST file from disk to be able to fetch the onode that you need.
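[Note: the knob being described is the per-OSD memory target that the cache autotuning works against; a minimal sketch of checking and raising it (the 8 GiB value is just an example):

    # current target, in bytes, for one OSD
    ceph config get osd.0 osd_memory_target
    # raise the default for all OSDs to 8 GiB
    ceph config set osd osd_memory_target 8589934592
]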
A
Yeah, it's probably four gigabytes, then; I think the default is four gigabytes. But if you've got tons of memory...
A
Well, it's not going to be four megabytes per object if you're doing EC at the Ceph level, right? You're breaking that up into smaller chunks.
B
You have a limited amount of flash, but the amount of memory and the handling for the larger OSDs might be more complicated than for smaller individual OSDs. So you have a lot of data backed by one OSD, and then you try to just suck in everything from this large OSD.
C
Okay, we'll try that. One thing I did notice is that there's a change: our old cluster was running Nautilus, and even in that one we didn't have a separate metadata drive, and when you did an lsblk you could see the metadata partition on each drive. But when I don't specify a metadata drive here and I get 24 drives, I don't see a metadata partition on each drive. Is that a change with Quincy? We're doing this on the latest Quincy.
A
Sorry, Luke, I was reading Tyler's response in the chat window. Can you repeat the question?
C
If you don't specify a metadata drive, it just uses one on each of the regular drives, and I guess the reason I'm asking is that on Nautilus, when we did that, I think you saw in LVM a volume that corresponded to the metadata on each drive, but I don't see that in Quincy when I do it. Is that just a difference in organization, and it's still doing it?
A
Good question; I don't know the answer to it. It's possible that they're organizing things differently. I'm guessing Nautilus might not have been using cephadm, which Quincy I think is, but this is more a question of how ceph-volume is doing things, and I'm not an expert there.
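[Note: one way to see how the OSDs were actually laid out is ceph-volume's listing; on a cephadm-managed host the tool runs inside the container, hence the wrapper (a sketch):

    # cephadm-managed host
    cephadm ceph-volume lvm list
    # non-containerized install
    ceph-volume lvm list
]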
A
I mean, if you have 200 boxes with 24 OSDs per box or something per cluster, I guess the question for you would be what you're thinking about in terms of how you manage all of this. The things I've seen in the past have been that sometimes the amount of data being fed into the manager can be a little bit overloading; you might have to turn down some of those settings.
A
I think he does, though, and that would be a really good place, if Dan is there, to talk to him about it, because they have very large deployments at CERN, and he can probably give you a nice description of what they've had to do to manage at that scale.
B
I think that size, with four hundred and a bit OSDs, is just a medium-sized cluster, the usual deployment for some common environments, so it shouldn't be a big deal. If you go far beyond that, say two thousand OSDs, something like this might be more complicated, but even then it should still be okay.
B
I think this was also the size that Kyle tested with in his setup, separating RGW workloads into different zones, and that deployment, I think, was tested at around 480 OSDs per zone because of the subclusters. So I think that's reasonable.
B
Just for the folks: Kyle Bader did some testing around how far you could scale.
B
It was based on older bits, so I think that was Nautilus still, but there was some kind of performance boundary imposed by the RGW caching, and the size they used was, I think, 480 OSDs per subcluster, scaling based on that. So it should be a reasonable size, what you do.
B
All right, sorry; if I remember correctly, 20 gigabytes per second or so was the upper limit that you could get out of such a set of drives, and even if you scale the number of RGWs beyond that, you will not get more performance out of this basic setup.
C
We tried different amounts; that actually reminds me of a question: what is the optimal number? We could run an RGW on each box. I think for the 20-node test cluster we had eight or ten RGW nodes, colocated.
A
I don't know if there is exactly an optimal number. It can typically scale pretty well: if I'm seeing that there's more capacity at the OSD level, throwing more RGWs in and having clients talk to different ones generally lets me scale farther, when I've done testing like this.
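[Note: on a cephadm-managed cluster, scaling the RGW count for this kind of experiment is a single service change (service name and placement spec illustrative):

    # run ten RGW daemons across the hosts labeled 'rgw'
    ceph orch apply rgw myrgw --placement="count:10 label:rgw"
]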
A
Casey, do you have any opinions? I've usually been able to get pretty good scaling with multiple RGW daemons and clients talking to them individually.
D
Yeah, that eight-to-ten number sounds perfectly fine.
A
And when I've pushed it harder, I've been able to get more RGWs than that, but it depends on the particular version of the code you're using, too. When you're getting that 20 gigabytes per second, how many RGW instances do you have running?
A
Eight? Okay. You might be able to achieve that with fewer, possibly; I think I've gotten RGW to do more than two gigabytes per second in aggregate before, but it's been a while since I looked at it.
A
I'd be really interested, if you did try with fewer, whether it topped out lower, or whether you had more issues with timeouts. That would maybe be one way to start playing with it and seeing what happens if you scale that bigger or smaller. If you've got 20 nodes to work with, you can run a fairly large number of RGW instances without memory issues.
C
Thanks; I really appreciate all the comments and help. We'll try a lot of these different tuning parameters and different configurations, and I'd love to report back in a future meeting on what seems to work, and probably with more questions.
A
Yeah, I don't know, with four thousand eight hundred OSDs, how big a cluster that actually is in reality these days. I don't typically get to test at that scale, so you're doing things that people do do, but I don't. Oh, Laura, how big is our test cluster? Ours are just, like, tiny.
A
Okay, that's... basically we've partitioned stuff up. I think our biggest in-house one that we test on is about a thousand. So at 4,800, this is the problem in storage all the time: vendors never have setups as big as the ones that users are building. So I'd definitely be interested in hearing what you run into.
A
Yeah, no problem, no problem. And I would definitely encourage you to reach out if Dan's at the user meeting; CERN is a wealth of knowledge for anything like this, so he may have some really good advice.
A
All right. The only other thing I was going to mention is that there's more work ongoing on the RBD mirror issue that we've been trying to get past. Adam started working on a new version of his shared blob code: rather than having a single tracker, now he's trying to make it so that we can add new extents to an existing shared blob, and he's got code that works.
A
We tested it out, and it's maybe not quite as impressive as the initial version of the shared tracker was, but I think it's actually better than the fixed version of the shared tracker. So in the chat window I'll put a link to the latest test results for anyone interested. I also tested his latest version, this elastic shared blob branch, overlaid with my defrag-on-clone pull request, and so the original is the blue line there.
A
His work is the yellow line, the defrag-on-clone work that I did is the red line, and then the combined is the green line. I know this is kind of hard to read; it's really kind of nasty. Maybe I'll actually share my screen and I can just point it out.
A
Okay, there we go. Can anyone see that?
A
Yep. So yeah, the gist of it is really that with the combined work, where we're doing both the defrag and Adam's elastic shared blob, it looks like we can benefit from both of them, and they're additive. So that's kind of what we're hoping for: maybe by using both, we can make this better.
A
The thing that we're still worried about, though, is that just getting snapshots and cloning faster may not be enough to make RBD mirror faster. We're hopefully going to get this into Paul's hands soon, so we can try it with his RBD mirror setup, but that's kind of the concern: this might not be enough. We'll see.
A
So yeah, hopefully. And that's the only other thing I wanted to talk about today.
E
Just a quick note, basically, on the question I raised, if we have time, of course.
E
CRC32, faster for ARMv8: I've taken a look at the implementation in RocksDB, and actually it's freely available; the question is whether we really pick it up. Let's start from our good old friend from the similar x86 investigation.
E
Here is the point where we are determining which exact implementation should be taken, and which one it should be... I guess it should be this one. Let me paste a link to the fast path; I think it's the faster one for ARM64.
E
So, first of all, we need to have the autodetection at the build level fully operational. So let me pinpoint the exact area, to make sure of that.
E
And another thing: the availability of getauxval is also checked by the CRC detection, and finally, the kernel needs to inform us, by setting this bit, to select the proper CRC implementation.
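[Note: a quick way to confirm what the kernel advertises on an aarch64 box is the hwcap-derived CPU feature flags; the crc32 flag corresponds to the HWCAP bit that the runtime check reads via getauxval:

    # 'crc32' should be listed on ARMv8 CPUs with the CRC extension
    grep -m1 Features /proc/cpuinfo
]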
E
If any of those is false, or not working correctly, then we are falling back to the slow implementation. But just to make sure I understood: you saw those issues in RocksDB? Because we have at least two places that are interested in CRC calculation on the write path: the first one would be the messenger, then those parts of BlueStore that are being executed by the OSD tp threads, and finally RocksDB.
A
So, the background for folks on the call: I got an email from Rama over at Ampere Computing. They're trying to do some performance testing on their own CPUs and getting less performance than on x86, and we're trying to figure out why. They did some wall-clock profiling, and in the wall-clock profile it appeared that we were using the slow CRC32 path rather than the fast path.
A
That's the background here. Just give me one second, Radek, to find the right line.
E
Sure.
E
Mark, just a note on that: it might be misleading. I recall from x86 that on some platforms it was inlined in a way that, in profiling, somebody could get the suggestion that the fast CRC was being used, when, because of identical names, and maybe some inlining or other compiler optimizations, at first glance it looked like the fast algorithm was selected, but actually the slower one was in use. So it might be worth double-checking the exact implementation.
E
That's the place for the selection, see, still the same: the ARM64 CRC define, and also the same kind of runtime check. So basically, like the reasons mentioned before, the same conditions to get the fast CRC.
E
Anyway, how was it built? What was the source of the packages, images, whatever was used in this particular environment?
A
It's Ceph Quincy on Ubuntu, installed with cephadm, in containers. Okay! So that's what it is: it's whatever cephadm does when you use it on Ubuntu.
E
Well, actually, at least it's worth filing a tracker with this information; ARM is rarely represented in our trackers. Do you have any hardware for the investigation?
A
I want to say that I thought we did have some kind of Ampere hardware in the lab. Yeah, Dan Mick has worked on it before, but I don't know anything about it other than that we have it.
E
No idea; it looks like we will need to raise it in the main infrastructure task. Maybe somebody knows.
A
Yeah, I'll reply back to Rama, though, and just mention that we'd like to create a tracker for the CRC issue, so we can at least get it into the system.
E
Well, it might be worth providing them with the links. Maybe they are doing custom builds, though I doubt that's realistic for us.
A
Yeah, I mean, they said in this document, I see now, that it's Ceph Quincy and they're installing with cephadm and containers, so I don't think they're doing a custom build.
A
All right, well, anything else, guys?
A
If not, we used up the whole hour. So thanks for sticking around, and have a good week.