From YouTube: Ceph Performance Meeting 2019-01-03
C: I realized about 45 minutes ago that I had about three or four weeks' worth of PRs to get through, and there was absolutely no way I was going to get done by the time we started the meeting, so I didn't even bother trying. Next week I will have that updated. For this week: Sage, is there anything you wanted to mention off the top of your head?
B: No — well, yeah, there was an internal thread about NUMA pinning that I resurrected this morning; I should just move it over to the development list today. The basic idea is to first add the NUMA infrastructure: add helpers so that, for a given block device, we can determine which NUMA node it's attached to. Then, for the ObjectStore back end, look at any non-rotational devices that we're using, look at the NUMA node for all of them, and if they all match, report that as the back end's NUMA node.
B: If they don't match, then don't do anything, and if they are rotational, don't report anything. Report all of that up through the OSD and include it in the OSD metadata that's reported to the monitor. That's the first part.
The second part would be to do the same thing for the NIC: look at the public IP and the back-end cluster IP that we're bound to, look at the NUMA nodes for those, and if they match, report that as the network's NUMA node.
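A minimal sketch of the detection logic being described here, assuming Linux exposes the device's NUMA node via sysfs; the helper names are illustrative and this is not the actual Ceph implementation:

```cpp
// Sketch: read the NUMA node for a block device or NIC from sysfs, and only
// report a backend NUMA node when all (non-rotational) devices agree.
// Note: the exact sysfs path can vary by device type (e.g. NVMe controllers
// may require following an extra symlink).
#include <fstream>
#include <optional>
#include <string>
#include <vector>

// Read an integer like "0" or "-1" from a sysfs attribute; nullopt on failure.
static std::optional<int> read_sysfs_int(const std::string& path) {
  std::ifstream f(path);
  int v;
  if (f >> v) return v;
  return std::nullopt;
}

// NUMA node of a block device, e.g. "nvme0n1"; -1 or missing means unknown.
std::optional<int> block_device_numa_node(const std::string& dev) {
  return read_sysfs_int("/sys/block/" + dev + "/device/numa_node");
}

// NUMA node of a network interface, e.g. "eth0".
std::optional<int> nic_numa_node(const std::string& iface) {
  return read_sysfs_int("/sys/class/net/" + iface + "/device/numa_node");
}

// Report a single NUMA node only if every device agrees; otherwise stay silent.
std::optional<int> common_numa_node(const std::vector<std::string>& devices) {
  std::optional<int> node;
  for (const auto& dev : devices) {
    auto n = block_device_numa_node(dev);
    if (!n || *n < 0) return std::nullopt;        // unknown: report nothing
    if (node && *node != *n) return std::nullopt; // mismatch: report nothing
    node = n;
  }
  return node;
}
```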
B: Sorry — that was the third thing, okay. The OSD would just report the NUMA nodes for the NIC and for the storage, and, if we are pinning the OSD, how it's pinned; if the NUMA nodes don't match, then we wouldn't pin; and then, of course, I guess the fifth thing would be: if they do match, automatically pin.
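A hypothetical sketch of that "auto-pin" step using libnuma (link with -lnuma); this is not the actual Ceph code, just the policy as described — pin only when both NUMA nodes are known and agree:

```cpp
#include <numa.h>
#include <optional>

// Restrict the OSD process to the common NUMA node of its storage and NIC.
bool maybe_pin_to_numa_node(std::optional<int> storage_node,
                            std::optional<int> network_node) {
  if (numa_available() < 0) return false;           // no NUMA support in kernel
  if (!storage_node || !network_node) return false; // unknown: don't pin
  if (*storage_node != *network_node) return false; // mismatch: don't pin
  const int node = *storage_node;
  numa_run_on_node(node);    // bind CPU scheduling to that node
  numa_set_preferred(node);  // prefer memory allocations from that node
  return true;
}
```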
B: And there'd be some host NUMA setting or something to control whether you do that. Basically, the first part would just be to have some visibility, so that if it's clear that the storage is on one NUMA node and it's clear that the network is on one, then we report that so the admin can see it. And if it's unknown — maybe on this host we just don't know what the NIC is, or maybe the storage is ambiguous —
then we could, you know, report negative one or a dash or something, or just leave it blank, whatever. But if we know that it's on one node, then an admin can look at this thing and ask, well, how come my storage isn't reporting a NUMA node? And if they go look and debug the system, they'll notice that their devices span two — say they're using an SSD device and an NVMe journal, so two NVMe devices, but they're on different NUMA nodes.
B: Maybe that's why — whatever, who knows. Or the same thing on the NIC: maybe the NIC isn't reporting a NUMA node, and that's because the front-end and back-end networks are different. Or, more importantly, if they are reported but they don't match, then you know you're either not using the right NIC, or you don't have a NIC attached to the same socket as your NVMes. That would be a problem.
B: Yeah, I think just a chart like this is going to go a long way, because you can just look and see whether they match or don't match. Whether we want to go so far as to raise health alerts if they don't match, I think is a little less clear, because there are going to be a lot of deployed clusters that are totally fine — it's just that the hardware isn't ideal and there's nothing to be done about it, so raising a health warning doesn't really help.
B: So that's why I think a numa-status command — just a command that you can run that tells you what the NUMA status is. Maybe we can surface it on the dashboard at some point: highlight in red the ones that are problematic, in orange the ones that are not ideal, and in green the ones that are fabulous.
B: A system where somebody did split the front and back-end networks is also going to flag stuff, because that just turns out to be the case. I guess the sixth thing would be to rewrite this stupid, out-of-date document that talks about front/back networks, because I think that's just bad advice — maybe it made sense, you know, five years ago, but it's just bad advice.
C: One thing that we haven't really done a good job of over the years, I think, is really focusing on hardware topology, and part of this is because it's impossible — or really hard — to get the vendors to think critically about how the hardware inside the node is really laid out; at least that's what I always found myself. But as we move forward, we probably need to do a better job of getting people thinking about this.
B: There's a command — I was googling for this — called lstopo, which gives you a visualization of what the hardware looks like: where the PCI buses are and all of this stuff. I'm playing with it now and getting pictures out of a GUI — I thought it was a command-line application, but maybe it's actually an X application or something — but it tells you how the L1/L2/L3 caches are laid out, how the NUMA nodes are set up, and so on.
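lstopo comes from the hwloc project, and the same topology data can be pulled programmatically; a small sketch, assuming hwloc 2.x is installed (link with -lhwloc):

```cpp
#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(&topo);

  // Count NUMA nodes, L3 caches, and cores (L3CACHE constant is hwloc >= 2.0).
  int nodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
  int l3    = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
  int cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
  std::printf("NUMA nodes: %d, L3 caches: %d, cores: %d\n", nodes, l3, cores);

  hwloc_topology_destroy(topo);
  return 0;
}
```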
C: Probably not — well, maybe this time, I don't know — but at some point I still very much would like to talk about what an OSD really means, and what it means in terms of hardware. As we're moving into a world where you have different NUMA nodes and lots of cores and lots of hardware in a box, what is an OSD, and why?
C: Why should it be one device, or should it not be one device — should it be a conglomerate of close-by hardware? I think I am falling into the mindset of thinking that an OSD should be a grouping of close-by hardware — say, whatever is within a NUMA node — and that it then figures out how to divvy up that hardware. But that might very well be wrong.
C: That's just my current thinking, I guess, but maybe this can lead us into the update that you have, Radek, because you were just recently talking about the other path of having an OSD be very simple and potentially single-threaded. So, do you want to talk about what you guys are seeing?
A: I just wanted — maybe not even to talk, just to ask for some input to the discussion we have around sharding in the system. Two approaches were identified. One is to make a very simple OSD: explicitly pin one OSD to one core, consuming one core, having one thread, with everything asynchronous inside. The other approach would be to make the OSD more beefy: able to span multiple cores and do the sharding internally. Both approaches have their own advantages and disadvantages.
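For the first approach, a minimal Seastar-style sketch of "one shard per core, everything asynchronous" may help; the class name OsdShard is purely illustrative and not an actual Crimson class:

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <iostream>

class OsdShard {
public:
  // Each shard handles only its own devices and connections; no locks needed.
  seastar::future<> serve() {
    std::cout << "OSD shard " << seastar::this_shard_id() << " running\n";
    return seastar::make_ready_future<>();
  }
  seastar::future<> stop() { return seastar::make_ready_future<>(); }
};

int main(int argc, char** argv) {
  seastar::app_template app;
  seastar::sharded<OsdShard> osd;   // one instance per reactor thread / core
  return app.run(argc, argv, [&osd] {
    return osd.start()              // construct a shard on every core
      .then([&osd] {
        return osd.invoke_on_all([](OsdShard& s) { return s.serve(); });
      })
      .finally([&osd] { return osd.stop(); });
  });
}
```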
A: We had a discussion, and the conclusion is that we don't know how to judge appropriately between those approaches, and I would especially love to ask about the inflation of the OSD map. The biggest identified disadvantage of having a simple OSD spanning only one core is that we would need a lot of OSDs in a cluster to saturate powerful devices like NVMes.
B: I don't know, because the thing is, it's not so much that what we want to pin per NUMA node is a core — I guess, actually, we want literally one core, because at least for the network we want to make sure the TCP connection terminates at that core. If we don't, then we're sharing the network stack across cores and we're sort of losing the whole point of all this. So I'm wondering if, even if we had the
flexibility — we could have an OSD for the entire host, or the entire NUMA node, or a single core; if we could do whatever we wanted — I think we would still possibly want to squash those down really small, because the minute we're sharing network connections across cores, suddenly we have all of the cross-core overhead.
A: One quick question before going further, if I may: aren't we over-focused on kernel networking and kernel things? I guess that the main reason for bringing Seastar in is to just bypass the kernel. Seastar is basically a userspace scheduler; it offers the DPDK-based network stack, and I bet that on powerful deployments we will be going to use SPDK as well.
B: In that case, each instance of the OSD — or each instance of the network endpoint, however we carve it up — is going to be a virtual network function, right? Yeah — which means that the hardware is doing DMA into memory for us and we're using DPDK, so there's no kernel involved. It should scale very well.
B: I think you can have one for every... yes. I mean, they're designed for virtual machines — you can have, like, a thousand of them or something. The idea was to have the hardware passed through directly into a VM, so you get high-performance networking inside the VM; that's, I think, what it was built for, but here we would be making use of it.
B: My understanding is that this hardware capability was developed to support things like virtualization, where you wanted a high-performance VM that was pinned to a CPU with no scheduling overhead and had direct access to the hardware. Now it's just important for things like virtual network functions and whatever.
A: I can't say too much about SPDK or DPDK, but I have had some experience with a third technology that is, I guess, made to be combined with both of the previous ones — I mean QAT. It's for offloading compression and for offloading encryption, and sharing the single device between multiple cores, multiple programs, multiple applications was well designed from the very beginning.
C: One of the things that I'm a little worried about is that it feels like, every time we hope or rely on the kernel or some other piece of software to do something well, we end up regretting the decision in the future — something doesn't work right, and it's not easy to fix, and certainly not easy for us to fix.
C: Maybe you can do really fast lookups inside that node, and it gives you more flexibility about data placement locally. Maybe the problem that CRUSH solves is much more relevant when you're spanning lots of nodes in a big cluster than when you're placing data within some very local NUMA node. Is that worth thinking about? Is that something people are interested in?
B: I think there are two things that work in our favor. One, there's this new NVMe namespace stuff that lets you actually carve up the device — although about the SPDK case I'm not so certain; we should double-check whether you can carve out multiple devices consuming the same SPDK device. With the kernel NVMe driver, I'm not sure — I don't think so.
B: The thing to remember there is that we're talking about intersecting reality in, like, a year, so it's really a question of what the roadmap direction is for the hardware manufacturers: what is the plan — is this a thing that is going to be normal, or is it going to be a weird special case? I think that's the question.
E: They're happy with it as it is; they're not trying to push it forward — it's not that exciting to them. I mean, in the conversations I remember having, they're not trying to push those features forward at all, and it's not going anywhere, and I don't think it's something we want. I think the best case is that if we did this, then each OSD would basically have one of the flash channels, but once you do that it's all super slow — one of the internal NAND channels behind the flash controller, yeah.
D: It feels like — I'm not sure — it feels like, even for machines that are more complex than that, with more devices and more cores, you want to run in that configuration, even if you also want to be able to slim down to some more isolated, dispersed units if someone starts building that.
C: You know, CRUSH gives you pseudo-random distribution, but you might not always want that, right? Potentially in the future you might want some kind of ability to say: this device is in the middle of some local maintenance, or it's busy — I have NVDIMMs, I can make really quick choices about where to put something. Maybe you want to do that internally.
B: I guess what I'm getting at is: I wonder if this is actually going to work in more situations than we think — it might actually work pretty well. I mean, ultimately, if we have a design that lets us use as many cores, or as few cores, as we want — and, you know, either lump everything into one mega-OSD daemon for the whole host, or have one per core, per device, or per slice of a device —
B: I wonder if what we should focus on as sort of phase one is implementing the simple case: it's not going to be completely general, but it's probably going to cover, you know, 70% of these high-performance cases and degrade somewhat gracefully. Then get that working — that will sort of give us the upper bound, because it's the easiest case: it's all in one core, there's no whatever — and then start adding all the complexity.
A: If you have one Seastar thread in an OSD, it's still complex, but quite acceptable. If you have an even slightly beefier OSD — I mean multiple Seastar threads, but not sharing any resources, perfectly sharded, without changing shards in the middle of request processing — you can live with that. But in a situation where you have multiple Seastar threads exchanging data between them, well, that's where the pain starts.
A: Sharing data in Seastar should be done using message passing, but there is absolutely no language or toolset mechanism to prevent you from accidental sharing. And even sharing isn't the whole thing you need to take care of — even allocating memory on the proper core is another pain. Sure, Seastar has a foreign pointer, but it's like a unique pointer — it's not for sharing, yeah.
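A sketch of the message-passing pattern being discussed, assuming Seastar's smp::submit_to and foreign_ptr: instead of sharing a structure between cores, ship the work to the core that owns it, and wrap a pointer in foreign_ptr when it must travel to another shard so its destruction is routed back to the owner. The PgState type is illustrative, not a Crimson class:

```cpp
#include <seastar/core/smp.hh>
#include <seastar/core/sharded.hh>     // foreign_ptr / make_foreign
#include <seastar/core/shared_ptr.hh>  // lw_shared_ptr
#include <seastar/core/future.hh>

struct PgState { int ops_in_flight = 0; };   // owned by exactly one shard

// Ask the owning shard to do the mutation; no locks, no cross-core sharing.
seastar::future<> bump_ops(unsigned owner_shard) {
  return seastar::smp::submit_to(owner_shard, [] {
    // Runs on owner_shard; safe to touch that shard's local PgState here.
  });
}

// Carry a shard-local object across cores without transferring ownership.
seastar::future<> send_elsewhere(unsigned target_shard,
                                 seastar::lw_shared_ptr<PgState> st) {
  auto fp = seastar::make_foreign(std::move(st));  // wrap before crossing cores
  return seastar::smp::submit_to(target_shard,
      [fp = std::move(fp)]() mutable {
        // Read-only use here; when fp is destroyed, the refcount drop is
        // submitted back to the shard that created the object.
      });
}
```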
G: On pointers — Seastar also has two versions of shared pointer: a lightweight shared pointer, which is a reference count on a single core, and the non-lightweight shared pointer. So there is actually a way to share data across cores that's supported by the framework, and in the cases where we need it, it's generally read-only data. If you actually want to do a modification from multiple places, you almost certainly want to just send a message to the core that owns it — immutability.
E: Radek, if you were there — because I think you've been working on the assumption that you have multiple cores per OSD in a Seastar process — were there specific problems you already ran into that prompted this question, or is it just that it would be really cool if we didn't have to do any of that work, because it would be so much simpler? I'm just really, really scared of tying ourselves into that.
A: I guess, well, from the discussion I can find on the devel list, it appears to be connected somehow to the crossbar we would need to have. But I'm not sure that's entirely correct: in the case of a very simple OSD having only one thread, we would not have to implement the crossbar in the OSD, because the crossbar would basically be fulfilled by the current CRUSH infrastructure. Wait — where did this come up?
D: That has always made sense, but I think you're on the right path when you're talking about having a particular OSD — or at least a PG — be a kind of logical task, yeah, single-threaded; but then you have the possibility of having an appropriate number of those allocated on any particular NUMA node, or with affinity to it, and you have the fixed affinity of the PG in CRUSH, but Seastar provides...
B: What concerns me — on the network side I'm not concerned, because you have these VFs that already exist, and you can rearrange your networks and it's all dynamic, so who cares, right? The network side is easy. It's the storage side. And I think the question is, at the end of the day, once we're sort of done with this whole process:
how much is it going to be? Is it going to be one modern CPU core to one modern NVMe, or is it going to be like four cores to four NVMes, or a quarter of a core to one NVMe? What are the bounds going to be? Because currently we spend like eight cores keeping one NVMe busy, but we're paying all this complexity — the code is horribly inefficient. If we actually could do it efficiently, where's
the end point going to be? I can't really tell, because I can see it going both ways. I can see us wanting to push a million IOPS to an SSD, and you kind of want to be able to spend different amounts of CPU depending on whether we're doing erasure coding or not, whatever. On the other hand, I also see that the vendors now want to build these boxes that have, like, 64 twelve-terabyte NVMe sticks in one box.
C: For that case, the answer is easy: cores aren't getting faster, right? CPUs are getting faster in the sense that you're getting more cores, but the cores themselves are not getting faster, while an NVMe drive is a conglomeration of lots of channels with lots of parallelism. So unless you can do what Greg was talking about and start isolating the specific channels of the NVMe...
B: I think that's the question we need to talk to WD and Intel about and find out what the hardware roadmap is — what do they expect is going to happen here. Because I think there is something analogous to this for these NVMe drives — I don't know if it's the namespaces or what — where, if you have 64 channels, each namespace either gets 16 dedicated channels or they get multiplexed in the controller. I don't know if I care about that, or maybe I should care.
C: So Sage, there's one other thing I want to ask you about. Back when I was working on pet store, I had proposed kind of this very simplified OSD model, and you were really against it then. Is it just the ability to do this kind of virtual-function thing — or whatever they call it — where you have direct access to the hardware, that has swayed you in the other direction, or is there something else?
C: It was very, very abstract — get rid of everything and just make a single-threaded, single-core OSD that basically runs an event model. But that was kind of with the assumption that maybe you're just running on a typical file system or something — you don't carve up the device directly using all these new userspace technologies. So that's why I'm wondering what has pushed you — what's your thinking, yeah.
E: You need to email some people then, because we've asked that, and the answer was no, not really — there's an interface that looks like it might do that, but the internal implementation... it's a lie. At least that's what I remember from when we asked about the channels in the past.
E: The other thing, though, is — it's good to question your assumptions, but we have, or at least I've been in, conversations that did touch on this sort of stuff in the past, and I'm really, really nervous about specifically requiring the performance of all of the pieces to align.
E: I think we're going to be in use cases where that's just not true, and this is going to be our only path, so I would like to see very explicitly what we've done so far that makes us think this loss of generality is worth shedding the programming complexity. Because it is simpler, but it makes the primitives a little more complicated — and from what I've seen, and I mean I'm an outside observer...
E: Not because we can't saturate a hard drive really, really easily, but because we're trying to shove more and more hard drives behind a single CPU socket. And it's also not just about the amount of CPU used: if we do one OSD is one core, that means we have, you know, 16 or 32 or 64 OSDs per box, guaranteed — we can't reduce that number at all.
E: It means that we have that many copies of all the live OSD maps in the system, and they're larger, and that's more memory used in what might be a low-memory environment.
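As a purely illustrative back-of-envelope (all three numbers below are assumptions, not measurements), the map memory cost scales multiplicatively with the per-host OSD count:

```latex
\text{map memory per host} \;\approx\; N_{\mathrm{OSDs}} \times N_{\mathrm{cached\ epochs}} \times S_{\mathrm{map}}
\;=\; 64 \times 50 \times 1\,\mathrm{MB} \;\approx\; 3.2\,\mathrm{GB}.
```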
B: Regardless of what the top-line hero number is, and what the right balance is for that with the fastest hardware in every category, the reality is that the hardware our users are going to use is going to vary wildly, right? So the question in my mind is — I'm wondering if we should think about the most extreme case being these machines that are just packed with NVMes...
E: The real part of it I do agree with, yes: we use a lot of memory right now, and the increases we've had in memory use are causing trouble, not just for homebrew lab users but for some of our commercial users.
D: Back in the day when memory was really expensive, NFS got wins by using UDP, because it avoided head-of-line blocking and it avoided all the overhead of early TCP stacks. I don't know how long it will take us to get back there — head-of-line blocking is going to be significant — but hardwiring the decision that we're going to have one network stack per OSD seems odd, given the way CRUSH works.
B: I'm worried about, you know, five years from now, when you have a 256-core system and suddenly 256 OSDs on one box — how big do your maps get?
D: I don't know if Emerson is on the call, but back when we talked about this, we could conceive of this problem — we couldn't quite see the same hardware in this specific way — but we assumed that it could be solved by sharding or fragmenting the maps and isolating the relevant portions of the maps to subdomains.
G: Sure. The thing that I'm unclear about is that, in terms of pure complexity, the crossbar seems like less of an issue than stuff like PG split/merge, or even a shared OSD cache. The crossbar is just: we allocate memory on one shard at the beginning of a request, and then at the end, when we finish the request, we have to send a message to deallocate it. Deallocating memory by sending a message isn't the bad part — it's all the other stuff that I'd worry more about.
B: Yeah, no, you're totally right — split and merge are a huge headache and more of a concern. But I think it's not so much about the allocation and deallocation, at least with my limited understanding; it's that the messenger reads the message off the network, probably does a CRC check, and then hands it off to another core, so it's no longer in the L1 cache when that other core does any additional processing or sends it back out over the network.
B: Does it tie our hands in other cases, and whatever. Also, those folks probably know something about the hardware virtualization capabilities — or whatever the right term is — of the SSDs. I can reach out to Shara separately and see what Intel's take on that is. It feels like that's sort of the next conversation to have.
F: Yes — I'll leave it up to you guys. I'd like to invite you guys to the next Crimson core meeting, and in the mail I'll summarize some of the problems we discussed this week: we got into the complexity introduced by the crossbar and the messenger, and some of the questions and assumptions regarding how to shard the underlying storage and how we make use of it with multiple OSDs, I think.
C: I'd also like to have Alfredo talk about what it means in terms of the complexity of having to divvy up hardware for different OSDs — or whatever you want to call them — with static allocation, saying this portion of an LVM device gets given to this small OSD and that other one to another, versus the model where you hand the resources associated with a NUMA node to this OSD conglomerate thing and it figures it out.
B: I think the good news is that all of this is focused on pure NVMe devices, and 95% of the complexity in the stuff that Alfredo has been working on is all the other crap, where you're carving a device up into 12 pieces and sharing it among a bunch of hard drives and whatever. I think a lot of that becomes less of an issue.
B: As for the Optane stuff, I'm hoping that either you're just going to have Optane devices or not have Optane devices, and you're not going to have, like, an NVMe plus a piece of an Optane; and in the case where we're doing persistent memory, it's going to be an actual DIMM form factor, so it'll be a little bit easier to manage. I'm hoping — but I don't know; who knows, there are a lot of questions.
B: But we might not necessarily want to do a hybrid OSD, right? For hard disks, the hybrid OSD really made sense because the flash was so much faster than the HDD, and it was so much easier to get good performance out of the hard disk with a little bit of flash. I'm not sure the same holds true with quote-unquote "slow" NVMe versus Optane — but maybe, I don't know.
A: There's also the possibility of employing the Seastar I/O scheduler — a userspace I/O scheduler. You can do a lot more with scheduling I/O, prioritizing particular kinds of I/O operations. That's a very good advantage for the beefy design.
F: Specifically, it allows us to share the device by assigning different tasks a share. For example, if one task should have a bigger share, we can assign it a number like 1217; if another part should have a smaller share, we can assign a number like two. It's just a nice way.
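A toy sketch of what those proportional shares mean — this is not Seastar's actual I/O priority API, just an illustration of the weighting: a class with shares=1217 gets dispatched roughly 600x as often as one with shares=2.

```cpp
#include <cstdio>
#include <vector>

struct IoClass { const char* name; unsigned shares; double vtime = 0; };

// Pick the class with the smallest virtual time, then advance it by 1/shares,
// so dispatch frequency ends up proportional to the configured share.
IoClass& next_to_dispatch(std::vector<IoClass>& classes) {
  IoClass* best = &classes.front();
  for (auto& c : classes)
    if (c.vtime < best->vtime) best = &c;
  best->vtime += 1.0 / best->shares;
  return *best;
}

int main() {
  std::vector<IoClass> classes = {{"client-io", 1217}, {"scrub", 2}};
  unsigned counts[2] = {0, 0};
  for (int i = 0; i < 100000; ++i) {
    IoClass& c = next_to_dispatch(classes);
    ++counts[&c == &classes[0] ? 0 : 1];
  }
  std::printf("client-io: %u dispatches, scrub: %u dispatches\n",
              counts[0], counts[1]);
  return 0;
}
```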