From YouTube: 2016-OCT-16 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
A: It's been about two weeks since we last had a meeting, and there's been a whole lot of movement in the world of pull requests. One of the new things that has recently shown up that is kind of interesting: Allen has his PR for implementing slab containers in mempool. The current mempool support is in master, and this is going to be really, really important going forward, I think, both in terms of reducing memory fragmentation and in identifying where we are kind of doing dumb things.
A: We already kind of know that we've got a lot of places in the code where we're creating and deleting lots and lots of objects, so part of this will be to let us track that down better, and then also generally just to get an idea of where our memory is going: how much we are spending on different data structures, allowing us to better manage how much memory we are using in different places, and to restrict it in a way that is easy for the user.
A: So rather than having all these different tweaks, where you can change this buffer and that buffer, and the third buffer does something with some element size count, the goal with a lot of this is to say: here's an amount of space that the user specifies, maybe one gigabyte of RAM or something, and then how do we divvy that up between all these different buffers in a sane way? So that's that.
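A minimal sketch of that budget-splitting idea, using a hypothetical PoolBudget helper rather than Ceph's actual mempool API:

    // Hypothetical sketch: divide one user-specified memory budget across
    // pools by weight, instead of exposing a separate tunable per buffer.
    #include <cstddef>
    #include <map>
    #include <string>

    struct PoolBudget {
      size_t total_bytes;                      // e.g. 1 GiB chosen by the user
      std::map<std::string, double> weights;   // relative share per pool

      size_t limit_for(const std::string& pool) const {
        double sum = 0;
        for (const auto& [name, w] : weights) sum += w;
        return static_cast<size_t>(total_bytes * weights.at(pool) / sum);
      }
    };

    // Usage: PoolBudget b{1ull << 30, {{"cache", 3.0}, {"buffers", 1.0}}};
    // b.limit_for("buffers") -> 256 MiB out of the 1 GiB budget.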
A: Let's see what else... We have a couple of different BlueStore things; this one is about refactoring BlueStore's submit-transaction path. I think that would be good; I think that will lead to some performance improvements going forward, which would be good. There's a really interesting one here where Haomai has started implementing RDMA for the async messenger. Previously a lot of this kind of work had been done in XioMessenger, and development on that has been a little bit slower lately.
A: It remains to be seen what the benefits and downsides of each approach are, but personally I think it's great to have people working on things that they're motivated to work on, so it will be really exciting to see where this ends up, and also to see where XioMessenger ends up as work continues on that. So if you're interested in RDMA, it might be worth checking out what he's doing and how that compares to Xio.
A: Let's see, what else... There's also one for parallel transaction submission in BlueStore; that's kind of dependent on these other things that Sage is working on, refactoring the sync/submit transaction path. A lot of different stuff got closed.
A: In that case, the good news, though, is that that's really the only case where we're seeing a regression from the async messenger. In all the other cases that we've tested so far, it's on par or maybe marginally faster, so that's good news there.
The other nice thing, too, is that the async messenger should be a lot easier on the memory allocator, so I suspect that with the async messenger, memory allocator tuning is probably going to be less of a concern. We may actually be able to relax the defaults and reduce memory usage a little bit, so there's a silver lining there.
What else do we have going on... There's the PR that adds a PG fast-info attribute to reduce per-IO metadata updates; that was really fantastic. Basically, for every single IO we are gathering all these statistics, and we are actually recording them to leveldb or RocksDB, depending on the backend. It turns out that was a lot of data, like seven hundred bytes for every IO. So imagine you're doing 4k IOs: that's a fair amount of overhead.
A: So basically this just reduces the update frequency for the attributes that don't really need to be written on every IO, and that brings things down to around 200 bytes from 700 bytes; every once in a great while I see a larger one, but it's much better overall. With BlueStore, anyway, I think we were seeing something like a fifteen or twenty percent performance improvement from doing that. I haven't tested FileStore recently to see how much it improves, but I suspect that we're seeing some improvement there as well.
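As a rough sanity check on those figures (my arithmetic, not numbers from the call), the stats overhead relative to a 4 KiB write:

    // Stats bytes written per 4 KiB client IO, before and after the change.
    #include <cstdio>

    int main() {
      const double io_bytes = 4096.0;
      std::printf("before: %.0f%% extra\n", 100.0 * 700.0 / io_bytes); // ~17%
      std::printf("after:  %.0f%% extra\n", 100.0 * 200.0 / io_bytes); // ~5%
      return 0;
    }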
A: There's also the encode/decode work, and that hopefully should improve things fairly dramatically. I'd forgotten exactly what it did when I looked earlier today, and I was surprised that I didn't see any difference in the RocksDB compaction behavior, but Somnath reminded me that we're not actually changing the actual encoding scheme, so we're still using things like varint there.
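For context, varint here means LEB128-style variable-length integer encoding; a minimal standalone encoder looks like this (an illustrative sketch, not Ceph's denc implementation):

    #include <cstdint>
    #include <string>

    // Emit 7 bits per byte, low bits first; the high bit of each byte says
    // "more bytes follow", so small values stay small on disk.
    std::string encode_varint(uint64_t v) {
      std::string out;
      while (v >= 0x80) {
        out.push_back(static_cast<char>((v & 0x7f) | 0x80));
        v >>= 7;
      }
      out.push_back(static_cast<char>(v));
      return out;
    }
    // encode_varint(300) -> bytes 0xac 0x02, two bytes instead of eight.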
A: The RCU locking work is progressing well. I was kind of reading through the discussion on that, and it looks like people are pretty happy with how it's shaping up. I'm hopeful that it will yield some really good improvements, but we'll just have to see how quickly it stabilizes.
A: Then the ZetaScale integration: the SanDisk guys have been working furiously on trying to make ZetaScale more performant in some of the obscure ways that BlueStore wants to use it, and I think they're getting pretty close. I heard recently that they now have 4k random I/O behavior in ZetaScale beating BlueStore... sorry, BlueStore with ZetaScale doing 4k random IOs is now beating RocksDB with BlueStore, so we'll want to see where they go with that.
A: But it sounds hopeful; it sounds like they've got something internally that's doing really well. And then, yeah, I haven't seen too much on this other stuff, any updates like on the PMEM device for BlueStore or any of these other ones, I guess, so we'll kind of have to see how that goes.
Okay, that's basically it for pull requests. Are there any questions or comments on any of those?
B: I have a question. (A: Oh, sure.) Can you hear me? So it's about RDMA; I just want to make sure I understand what's being done. It sounds like RDMA support is being added to the async messenger, where before it was in XioMessenger, and the problem was that XioMessenger couldn't really interoperate with TCP.
A: Theoretically, libxio can use TCP behind the scenes as well. I think the interesting questions that will come up with this are things like how zero-copy is implemented and whether there are any advantages to one or the other there. We've hit some scaling issues related to the way that, basically... I'm forgetting now exactly what it is, but I think if you have multiple OSDs in one box, you can run into scalability issues with the way that libxio does RDMA. I'm probably mangling it, but anyway, there are a lot of interesting technical details that might indicate which way is the better way to go in the end. So personally I think it's really good that both are being worked on; this stuff is complicated enough that I think there's room for multiple ideas here. So yes, that's my take.
A: Somnath, I don't know whether you guys have kept updated on the stand-ups we've been doing with ZetaScale recently; would you be able to give a short overview of what you guys have been working on?
C: Yep, can you hear me? (A: Yes.) Basically, yeah, ZetaScale has gone through a lot of changes since the older version, back in the NewStore days, when we actually started integrating, and with that we can do short writes that beat RocksDB pretty comfortably.
So now, basically, we have introduced all the sharding and everything in BlueStore, and that is actually adding extra writes on the underlying data store.
C: That is not hurting RocksDB, because RocksDB is eventually coalescing all those things, but it is hurting ZetaScale, because it's an extra write and it's touching the btree in between.
We are actively working right now to optimize that part, and it's taking time; it seems it is not trivial. But we are pretty close and we are working on it, so hopefully something will come up soon.
C: The advantage of ZetaScale is that we don't have to deal with these compactions and all those things: we have predictable behavior on the disk that we don't have in the case of RocksDB, because we don't know the exact compaction behavior and how much it will write for a given workload. Obviously compaction will go higher the more data I am writing, and in the steady state, what is the compaction ratio? It's very difficult to actually analyze.
C: The reason is that it also depends on the amount of data in RocksDB: if you write, say, 100 GB versus 8 terabytes, or, with the 16-terabyte drives, if I fill up most of the disk, the behavior will be completely different. So we are trying to characterize that as well, in parallel, and comparing with ZetaScale. So yeah, a lot of work; hopefully we will get some results.
A: Do you guys think that you'll have test results that you can share publicly anytime soon? I remember someone had mentioned, I think, that you guys were beating RocksDB in your latest version now, yes?
C: So that's what we're gathering data for; at least our aim is to have that. Now we're doing a bigger data set and writing a lot of metadata, so it's actually in a steady state: making BlueStore with RocksDB reach a steady state, and also making BlueStore with ZetaScale reach a steady state. That means the 4 MB objects.
C: Hopefully sometime soon we can present it in the performance meeting. One thing for sure is that if you run RocksDB and BlueStore continuously... even, I have some results from a 10-hour write run, but even then you will see that it's not stable. It's continuously, basically, yeah, slowly but surely going down towards the steady state. So what is the steady-state number? We don't even know today.
A: Okay, so for the people that aren't super familiar with BlueStore here: basically, the min_alloc_size is controlling...
A: Essentially, the end result of changing it is that you change how much data ends up going into the metadata store: the number of blobs that get recorded increases dramatically as you decrease the min_alloc_size. But you also then don't have to do the write-ahead (WAL) writes.
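For intuition, a quick sketch of the blob-count side of that tradeoff, assuming a 4 MiB object (illustrative numbers, not measurements):

    #include <cstdio>

    int main() {
      const unsigned object_bytes = 4u << 20;       // one 4 MiB object
      for (unsigned min_alloc : {4096u, 16384u})    // 4k vs 16k min_alloc_size
        std::printf("min_alloc_size=%u -> up to %u blobs\n",
                    min_alloc, object_bytes / min_alloc);
      // Smaller units mean more metadata per object, but small overwrites can
      // land directly in place instead of being deferred through the WAL.
      return 0;
    }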
C: One thing for sure, Mark, is that for ZetaScale the performance will be hurting with a min_alloc_size of 16k, because in the case of RocksDB, the way we are doing it is that we are just doing the WAL write, and right after that we are deleting that key, so it really doesn't go down to the SST files and end up impacting compactions, right? For ZetaScale, we checked, and we know that...
C
Okay,
that
extra
for
that
4k
right,
the
double
right
basically
is
going
so
and
then
the
leaf
node
we
have
so
that
will
actually
impact
the
performance
badly.
So
we
have
to
forget
a
skill
we
have
to
work
with,
at
least
in
the
first
cut.
We
have
to
go
with
the
middle
of
size
of
4k
and
we
will
try
to
improve
that.
Okay,
us.
A: We kind of expected that in the RBD case, large buffers would help a lot by allowing IOs to the same object to be coalesced, and in the other case we expected that large buffers might actually hurt, because now you've got a bunch of data that all has to be compacted at the same time, and compactions would take longer. Those were the assumptions we made. What it turns out is that that's only sort of the case.
A: You can see that, like in this RBD case where you've got small buffers: basically there are 32 MB buffers and we've got up to 32 of them, but the big thing here is that min_write_buffer_number_to_merge is one. That basically means that every time you fill one of those 32-megabyte buffers, it will go and write it out, and so you have these small files in level 0 that then get compacted, and the compactions shouldn't take as long; and in that case, yeah, you see that actually it is pretty fast.
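For reference, the knobs being discussed map roughly onto RocksDB options like this (an illustrative sketch, not the exact bluestore_rocksdb_options string used in these tests):

    #include <rocksdb/options.h>

    rocksdb::Options small_buffer_options() {
      rocksdb::Options opts;
      // The "small buffers" case: up to 32 memtables of 32 MiB, each one
      // flushed on its own as soon as it fills.
      opts.write_buffer_size = 32 << 20;           // 32 MiB per memtable
      opts.max_write_buffer_number = 32;           // up to 32 of them
      opts.min_write_buffer_number_to_merge = 1;   // flush each individually
      // The "large buffers" alternative with the same aggregate budget:
      //   opts.write_buffer_size = 256 << 20;     // 4 x 256 MiB
      //   opts.max_write_buffer_number = 4;
      return opts;
    }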
A: Well, it's like twice as fast as in the other cases, oddly enough. With the 4k min_alloc_size... sorry, that was with the 16k min_alloc_size. In the 4k min_alloc_size case, where in theory we have more metadata to deal with but we don't have write-ahead-log writes leaking into level 0, it's actually taking a long time. That was unexpected, but there it is.
A: The big thing that you'll see here, though, is that in this case, where we have small buffers that get written out really quickly, where one fills up and we start writing it out, the amount of data that gets compacted from level zero to level one is huge compared to the other cases.
A: The unfortunate part is that right now there doesn't seem to be any good way around it. It kind of looks like we will need to allow large buffers: actually, using four 256-megabyte buffers as opposed to 32 32-megabyte buffers improved performance by like 2x, and it was the same in every test here.
A: All we really see is a reduction in the number of compactions, and in fact the average compaction time is lower when we use larger buffers, probably because there's so much less data to deal with; the write amp is just generally lower, so everything goes better. In this case, again, we see that performance is better with larger buffers, so essentially reducing the write amp is trumping everything; that's the big effect that we see. We also tested universal compaction, and universal compaction is kind of interesting; it helps in some cases.
C: Can we go back a second on that, on your data? So you are saying that, okay, with 16k min_alloc_size, 32 MB buffers, and eight of them, right, in the first one you are getting 242 MB per second, and with 16k min_alloc_size... 16k min_alloc_size with 256 MB buffers, you are getting 235 MB per second. So it seems that, okay, a small buffer with more of them is basically more suitable, no?
A: Well, it's a good question. I mean, if you look down at the RADOS bench numbers below it, arguably the other case is actually a little bit better in that picture, so it may also just be that we hit a boundary there that's different, I guess. I guess there are actually a couple fewer compactions in the 16k case with four large buffers; there were like two fewer compactions than in the other one, but...
A: I have that one; I could run it quickly, I just hadn't, because Sage was kind of uninterested in the 32-small-buffer case with eight buffers, or was it... but...
A: It may be that that actually is a little bit better; I'm not opposed to considering that as an option as well, if it makes logical sense. The gist of it, though, is that the numbers are pretty similar. There's enough variability here that the 32 small buffers with min_write_buffer_number_to_merge set to eight might be marginally better, but it's close, and it's the same with RADOS bench.
A: You know, maybe the alternative, the four large buffers with min_write_buffer_number_to_merge set to one, is marginally better, but again it's really close. The takeaway, though, is that in this case it does not look good to have small buffers and try to flush them really quickly after one fills up; that seems to be the pathologically bad case. That's the one where, yeah...
C: And one more thing. One more thing is that recently we've been doing some profiling on that, and a lot of RocksDB things are popping up. So probably with that number of buffers, if it is eight or thirty-two, the things it has to do, the merges and all those things, may end up costing a lot of extra CPU cycles. Do you have any data on that? Like, is there any CPU-cycle difference between one or four and eight buffers?
A: I have... well, this literally just finished like an hour ago, so I couldn't copy and paste all the data yet so that we could show it for the meeting. So yeah, I'll take a look at it and hopefully present some interesting things there.
A: We still see that the buffer size, basically how much data you are allowed to have outstanding in aggregate in buffers before you flush, is the predominant factor that determines performance. But having said that, it looks like universal compaction handles smaller buffers better in the 4k object-creation case than level compaction does; it's a pretty big improvement in this kind of worst-case scenario compared to level compaction, and we do see a little bit of a performance increase for these big-buffer cases as well.
A: Interestingly, here, though, in the 4k min_alloc_size case, one of the things is that there's a lot less compaction traffic: it's about two-thirds the amount of traffic of the other case, so write amp is going down, basically, which is what universal compaction is supposed to do. So it is decreasing write amp; there are fewer compactions; the total amount of output data is much smaller, you know, two-thirds the size. I guess that's good; the performance is no different.
A: So that's interesting; essentially, I guess that means that these NVMe devices have enough throughput still available to them that the amount of write traffic into the device doesn't really matter; we're being bounded by something else. So maybe, as we improve BlueStore, we might actually see this universal compaction configuration start to pull ahead, because we're writing out less data. But the downside of universal compaction is that it increases space amplification.
A: Potentially you can see up to a 2x increase, temporarily, in the amount of storage required during compaction, and read amp may be bigger as well. So we'll just have to see how that plays out. I think for right now I'm still going to recommend that we stick with either the four 256-megabyte buffers or this kind of 32 32-megabyte-buffer case, and not do universal compaction.
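The universal-compaction variant being compared would be selected roughly like this (again a sketch, with the same assumed buffer sizes as above):

    #include <rocksdb/options.h>

    rocksdb::Options universal_options() {
      rocksdb::Options opts;
      opts.write_buffer_size = 256 << 20;
      opts.max_write_buffer_number = 4;
      // Universal compaction trades lower write amplification for higher
      // space amplification (transiently up to ~2x during a full compaction).
      opts.compaction_style = rocksdb::kCompactionStyleUniversal;
      return opts;
    }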
A
There
are
some
other
limitations
of
the
universe,
compaction
that
I
think
you
can
only
have
100
gigabytes
of
data
total,
so
we
may
need
to
shard
rocks
TV
be
fully
deep.
Did
that.
But this kind of 4k min_alloc_size case is really worth watching, because with less data being written, as I said, it might actually start pulling ahead at some point. So basically, this is what we've been looking at with RocksDB.

I know it's a lot of data, but the hope here is that once we understand this well, we'll have better insight into how all of this should be tuned and, as we change RocksDB and change BlueStore, which things are going to be important going forward. So, any questions on this, I guess, before I give up the screen?
B: Yeah, so I'm not really going to present anything today; I just wanted to see if it was interesting to people and mention a particular problem I ran into. So: we're testing with a thousand hard-drive OSDs across 29 servers, and we're basically trying to integrate that with OpenStack on 20 compute nodes and, you know, get some performance data for that.
B
The
integrated
stack
and
one
thing
we
one
thing
I,
you
know:
we've
got
a
starting
to
get
performance
data
for
Steph
that
you
may
be
able
to
share
if
people
are
interested
at
some
point.
But
the
the
thing
that
concerns
me
a
little
bit
is
that
I
ran
into
a
problem,
tracker
and
I'm,
going
to
try
to
dig
that
up
for
you
right
this
second
and
post
it
in
the
chat
window
and
see
what
people
think
so,
here's
the
tracker.
So
it
started
out
as
I
noticed.
B
There
was
a
problem
running
radis
bench
with
CBT,
and
it
was
just
basically
that
the
number
of
file
descriptors
defaults
of
2024
and
that
had
to
be
increased
for
a
raid
oz
bench
to
work
reliably
and
I
thought.
That
was
that
was
it
and
then
later
on,
I
started
running
cbt
fio
tests
and
lo
and
behold,
I
read
in
the
same
problem,
and
so
what
happens?
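For reference, a process can inspect and raise its own descriptor limit with the standard POSIX calls; a generic sketch (not what CBT or librbd actually does):

    #include <sys/resource.h>
    #include <cstdio>

    int main() {
      rlimit rl{};
      getrlimit(RLIMIT_NOFILE, &rl);     // soft limit is typically 1024
      std::printf("soft=%llu hard=%llu\n",
                  (unsigned long long)rl.rlim_cur,
                  (unsigned long long)rl.rlim_max);
      rl.rlim_cur = rl.rlim_max;         // raise soft limit to the hard cap
      if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        std::perror("setrlimit");
      return 0;
    }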
B
Is
it
gets
an
error
back
saying
can't
create,
you
know,
can't
create
the
file
descriptor
for
the
socket
and
but
it
doesn't
error
out
it
just
kind
of
hangs,
and
what
concerns
me
about
that
is
that
you
know
if
it
just
hangs.
People
aren't
going
to
mediately
understand,
you
know
what's
going
wrong,
and
so
my
question
to
you
is,
you
know:
does
this
you
know?
Is
this
a
bug
in.
B: Well, that already happens, so there's no problem there; the problem is the applications that use librbd. For example, later on we ran into the fact that with OpenStack, the libvirt/KVM processes that are implementing guests don't increase it, and that was causing problems, which we fixed.
But the point is, it doesn't show up clearly, you know, when this is going wrong; librbd doesn't seem to really clearly log the problem. And I was wondering whether anybody else has hit this, and whether there's something that needs to be done differently, or whether it's just a bug in these applications, or what.
B: I guess that's small, but I mean, the point is we tried to choose a value that was somewhat like what you might expect from a real application, so we're basically trying to scale out the number of volumes rather than cram as many IOs as possible through a single volume. Yeah.
B: No, it's not, because, first of all, when you increase the file descriptor limit the problem goes away, and the second thing is the TCP thing you're talking about has to do with recycling ports, and that's if you have an application that's constantly connecting and closing and connecting and closing.
B: Anyway, but I mean, overall things are going pretty well. There was one test where we got up to like 42 gigabytes a second, you know; it's just kind of a nice little eye-popping thing for me.
B
Getting
up
to
like
there
was
one
test
where
retest
random
read
where
we
got
to
like
140,000
die
offs.
You
know
which
is
not
for
you
folks,
it's
probably
pretty
boring,
because
you're
working
with
all
SSD
configurations
and
that
sort
of
thing.
But
you
know
it's
not
bad-
for
heart,.
B: I would love to do that. What we could do is try to schedule you in, because unfortunately we're competing with a lot of folks for this hardware, so...
A: Ben, a question about your earlier question about the ulimit. So again, I'm sorry, I'm not super familiar with all this, but does each TCP socket require a file handle? Is that true?
B: You are correct. Basically, when you close a TCP socket, the connection hangs around for a while; it's one of those TIME_WAIT states that you're talking about, and then after some number of seconds, depending on the kernel parameter for it, it recycles the connection. That can be adjusted, but I don't think that's what's happening here, because I don't think librbd is constantly recycling TCP connections. I could be wrong, but my understanding is it's opening them and keeping them open.
A: I would assume... I would assume that it's doing... okay, so you've got like a thousand OSDs or whatever, right? Presumably, if you're doing random IO, you're blasting stuff off in all these different crazy directions, and I would assume, I guess, that it's closing the connections, so it's not keeping connections to a thousand different servers open from every single client all the time. Maybe that's incorrect, but I would suspect that it is closing them at some point.
A: What I was wondering is that even if it closes them, presumably all these file handles are being left open for a while; that's what I'm wondering, whether that's why you have to kick it up so high. Even though it's not keeping, you know, a bunch of them open concurrently, maybe that's why you're running out of file handles.
B: Right, I agree. I just didn't notice that kind of TIME_WAIT buildup, but I wasn't really paying attention to it at that point, so I'll take a look. Yeah.
A
That
just
increasing
them
our
file
handles
fixes
it
for
you,
but
you
may
also,
if
this
is
what,
if
my
suspicions,
I
guess
is
correct,
if
you
reduce
or
even
eliminate
the
recycling
are
the
keeping
it
around
I
guess
not
the
recycling
to
keeping
it
around
I
wonder
if
that
might
also
improve
the
situation,
but
anyway,
that's.
That
was
my
thought.
All
right.
Thanks
a
lot
all
right,
but
we
are
out
of
time
guys
any
last
minute
comments.