From YouTube: 2018-JAN-18 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
D: Let's see, there's a pull request in flight that looks pretty good. It's just switching around the reads so that you don't send messages to yourself to do the reads, but just do them inline. It didn't do that before, for simplicity, but the fix to make it do it efficiently is actually pretty straightforward.
D: So I think he has one more small change to make, and then if we can go and test it, that'll work. There's still an open pull request for doing tiering inside BlueStore. I'm still not sure if we want to do it or not; that's just sort of sitting there.
D: That's mostly it, as far as new stuff goes.
D: So, Haomai: did you want to talk about your stuff with VPP this week, or would you rather do it next week?
E: Can you see my screen? So, I tried to do a POC to use VPP, the user-space TCP/IP stack. VPP provides some interfaces for the application to use its user-space stack. It has several interfaces, I think; the popular one is called VCL, the VPP communications library. An application using it has two modes.
E: One is the preload mode, where the application doesn't need to be modified: as a native application, you just use the LD_PRELOAD mechanism to preload the VCL library. The library hooks the socket layer, so when you call the POSIX socket APIs, the calls go to this preload library, and the library uses messages over shared memory to talk with VPP's binary API.
E: In the software stack, the binary API calls the session API, and then TCP/IP goes down to DPDK. Another mode of using VCL is that you can build the VCL API into your application. So it's not the preload mode: you just call the VCL API directly, built into your application, and then the VCL library uses the same method, sending messages to VPP to set up the session connection.
E: Another one I'm not very familiar with is that they have a plugin in VPP called memif, a shared-memory virtual network interface. There is a memory-interface plugin, and a library that can be built into the end application, so the application uses that memif interface to communicate with the plugin, just like talking to a virtual network interface. What I tried first while using VCL is the preload mode, so I didn't need to modify anything.
E: The Ceph code just uses the POSIX stack, so I think it should work, because in preload mode VCL hooks the socket layer. When the native application calls the POSIX socket APIs, things like create socket and bind socket, the calls go through the VCL library hook; that is the interface VPP provides. So another question is that I don't think...
E: Okay, so for an application using VCL: on the server side you need attach and bind, and on the client side you need attach and connect, and there is shared memory between the application and the VPP process. When data is transferred there is a one-time copy: the client application needs to copy its buffer into the shared-memory FIFO, and when VPP transfers a packet it received, it also needs to copy it into the shared-memory FIFO.
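The one-copy-in, one-copy-out behavior described here can be sketched with a toy byte FIFO. This is an illustrative model only, not the actual VPP session-layer FIFO:

```python
# Toy model of a shared-memory FIFO between an application and a
# network-stack process: each side pays exactly one copy.
class SharedFifo:
    def __init__(self, size):
        self.buf = bytearray(size)  # stands in for the shared-memory region
        self.head = 0               # read position (stack side)
        self.tail = 0               # write position (application side)

    def enqueue(self, data: bytes) -> int:
        """App side: copy the app buffer into shared memory (copy #1)."""
        n = min(len(data), len(self.buf) - self.tail)
        self.buf[self.tail:self.tail + n] = data[:n]
        self.tail += n
        return n

    def dequeue(self, n: int) -> bytes:
        """Stack side: copy out of shared memory toward the wire (copy #2)."""
        n = min(n, self.tail - self.head)
        out = bytes(self.buf[self.head:self.head + n])
        self.head += n
        return out

fifo = SharedFifo(64)
fifo.enqueue(b"payload")
print(fifo.dequeue(7))  # b'payload'
```

Each direction of data transfer crosses the shared region once, which is the "one-time copy" cost being discussed.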
E: The latest VPP, and Seastar as well, include the DPDK initialization. So no matter whether we use Seastar or VPP, the DPDK code can be removed from the Ceph messenger, I think, because the DPDK device initialization and memory pool initialization are done in the VPP stack or the Seastar stack.
E: If we use VPP, we can remove the DPDK stack and connect to the VPP stack. If we want to use the built-in VCL API, we just need to write a similar VPP stack, like the PosixStack now: where it currently calls the system calls to create the socket and connect, we need to call the VCL library directly; otherwise we can't build in the VCL library.
D: But the thing that confuses me is that if you're using Seastar, or even if you're using VPP, then it seems like that whole thread pool model that AsyncMessenger has shouldn't be necessary, because whenever you're initiating something you can just queue it with VCL, and whenever you get a completion, that's already going to be executing in some other VCL thread context, whatever that is, and that should be able to just trigger the async event again. Is that right?
G: I'm not proposing or promoting that, but the inversion-of-control version of that might be more typical, I mean. In other words, it might be something that eventually happens. I'm not sure what the real reason would be that they wouldn't want to support that.
D: So I was trying to ask this question when we were talking to the VPP folks last week, and my understanding was that that's not a model that they plan to support. Although, you know, in principle maybe they could, but they're not planning on doing that, and the model is very much that there's this dedicated VPP process that owns the world, and then they have sharing mechanisms so that you can utilize it over shared-memory channels.
D: A more likely model, Matt, if we're thinking about Ceph stuff with VPP, would be something like RGW being a VPP plugin. Though I'm sure we can look at it differently, because the types of things that are plugins today are, you know, network translation layers, routing layers; all the SDN stuff runs as those plugins, and things like load balancers. And I think something like RGW, since it is a gateway function, is a good fit for that, where you're redirecting and sending traffic around or whatever.
D: So the takeaway that I ended up with from that discussion with the VPP folks was that VPP is doing a really good job of building reusable, accessible infrastructure. It's being used for load balancers; it wires into virtualization; it's orchestrated via Kubernetes. But it is really a replacement for the kernel network stack, right? It's just a pluggable one that happens to be running in user space, but faster. So having Ceph be able to plug into it, when you're on hyper-converged nodes, like an OpenStack or Kubernetes node where Ceph is one of many services, would make a lot of sense: wire into VPP directly instead of using the normal sockets layer.
D: But that's different from a situation where Ceph wants to own the world, or own the box. So if you imagine a 1U server packed with NVMes, doing nothing but storage, and you want your Ceph OSDs in there, it doesn't seem like it's a good choice there, where you want to eliminate that shared-memory overhead. So I think my big question is how this relates to Seastar.
D: So the thing that I don't understand is that there's all this stuff in AsyncMessenger with the EventCenter and the event driver, and my recollection is that this is just a thread pool, a set of workers. If we move to Seastar, then that's all going to go away. So it seems like, if I'm understanding correctly, this is sort of an interim piece, with AsyncMessenger being the bridge between a threaded model and all that, and the asynchronous one.
E: The messenger will create the NetworkStack, and the NetworkStack will create a worker pool, where each worker thread has an EventCenter. Each worker thread will handle, how to say it, the socket that has been bound and is listening. So when a connection comes in, the worker threads, there is a poll mode, they poll the events from the EventCenter and handle them.
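The pattern being described, a pool of worker threads where each owns its own event center and polls it for events, can be sketched like this. It is an illustration of the pattern only, not Ceph's actual AsyncMessenger code:

```python
import queue
import threading
import time

# Each worker thread owns one "event center" and polls it in a loop.
class EventCenter:
    def __init__(self):
        self.events = queue.Queue()

    def dispatch(self, event):
        self.events.put(event)

    def poll(self, timeout=0.1):
        try:
            return self.events.get(timeout=timeout)
        except queue.Empty:
            return None

def worker_loop(center, handled, stop):
    while not stop.is_set():
        ev = center.poll()
        if ev is not None:
            handled.append(ev)  # e.g. accept a connection, read a socket

centers = [EventCenter() for _ in range(2)]
handled = [[] for _ in range(2)]
stop = threading.Event()
threads = [threading.Thread(target=worker_loop, args=(c, h, stop))
           for c, h in zip(centers, handled)]
for t in threads:
    t.start()

# A new connection event is dispatched to one worker's event center;
# only that worker picks it up.
centers[0].dispatch("new-connection")
time.sleep(0.3)
stop.set()
for t in threads:
    t.join()
print(handled[0])  # ['new-connection']
```

The key property is the affinity: an event dispatched to a given center is always handled by the worker that owns it, which is what avoids cross-thread handoffs in the common path.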
E: So that means, with Seastar, will it be coupled with DPDK? But I don't know how they transfer the packets from the user-space stack to the AsyncMessenger, because if Seastar initializes DPDK itself, it uses huge pages and zero copy, so how do you manage the buffers?
D: Okay, well, thanks for sharing that; I think it gives us something more to think about. Josh and Greg, do you guys have any questions before we continue?
H: We did that testing solely using the FIO objectstore plugin, which is quite far from the OSD, because it doesn't even try to mimic the PG-lock-related traffic. We were also testing only using random writes, while, for instance, the deduplication and reordering branch we have might still be useful for sequential writes. It still needs a lot of testing, definitely inside the OSD, yeah.
H: I don't think it will be a breakthrough, to be honest. To get more from RocksDB we would need to make the batching inside it much, much deeper; at the moment it ends very quickly, it's shallow. It's around the iterate method of the write batch, so yeah, okay.
D: It makes me nervous when we don't see an effect with fio, because with fio most of the time is being spent in BlueStore, and so usually any benefits we have will be even less than before once you add in all the OSD overhead. So then, what's the...
A: What else? There's a bunch of other stuff floating around. It's all kind of messy right now, because there are multiple different patches for multiple different architectures: there are some AMD-specific patches for Spectre, multiple different versions of the Spectre patches for Intel, and then also Meltdown patches for Intel. So all of this gets kind of confusing too when people are looking at benchmarks, but the gist of it is, it looks to me like, when the smoke clears...
B: This is the tool which actually mimics the behavior of Mark's wall-clock profiler, which was based on the gdb API. The only functional difference is that it uses a sleep and unwinds directly, and it forms an almost standalone binary. It's much faster: I was able, without problems, to sample a running OSD a hundred times per second, all the threads, without noticing any performance drop, so that was nice. The bad side is that I wasn't able to make it a fully standalone tool; it's required to preload some shared library into the target before it runs. So that's an inconvenience, a huge inconvenience, and I intend to fix it. But what is currently difficult for me is that with the default linker tools I am NOT able to get them to produce a binary, a really okay binary, which is properly linked.
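The sleep-and-unwind sampling idea can be sketched in Python, using interpreter introspection in place of native stack unwinding. This is only an analogy to the tool being discussed, which samples native stacks:

```python
import collections
import sys
import threading
import time

# Minimal sketch of a wall-clock sampling profiler: periodically wake up
# and record the call stack of every thread.
def sample_stacks(interval=0.01, samples=20):
    counts = collections.Counter()
    for _ in range(samples):
        time.sleep(interval)                          # "sleep"
        for frame in sys._current_frames().values():  # "unwind" each thread
            f = frame
            while f is not None:
                counts[f.f_code.co_name] += 1
                f = f.f_back
    return counts

def busy_worker(stop):
    while not stop.is_set():
        sum(range(100))

stop = threading.Event()
t = threading.Thread(target=busy_worker, args=(stop,))
t.start()
profile = sample_stacks()
stop.set()
t.join()
print(profile["busy_worker"] > 0)  # True: the worker shows up in the samples
```

Because the sampler only wakes briefly per sample, the overhead on the profiled threads stays small, which matches the "no noticeable performance drop" observation above.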
B: Maybe it's just confined to a badly constructed _start function, because it's constructed by the linker, and that is of course broken; I'm sure of that. The rest of the code looks more or less fine, but I wasn't able to test that, so it will require time. I've currently suspended it; it's for some free time.
B: I will resume that and try to understand why it's not linking properly. Actually, the gold linker is doing a much better job, because it is able to produce proper calls, but the relocations of data in the _start function are bad, and the original LD even gets jump addresses wrong. I mean, they are relocated okay, but to some weird places, and that's broken. So that's it; but I think that tool can be used. If there are some problems, please inform me and I will fix anything that's broken.
A: I'm very excited, though, to add support for your profiler into CBT, because if it's not affecting the performance results, that means we can run it during tests. Maybe we'll go through lots of things to check and see, but there's a good chance we can add it and start getting wall-clock profiles on all the different runs, which for me would be amazing.
A: Also, one of the things that came out when I was doing that is just how much the read-ahead, when using fragments on the file system, helps when you have no client-side read-ahead. That is one case where NewStore looks a lot more like FileStore, and both are better than BlueStore. But on the write path, all of the work we did to make BlueStore's writes faster really helped; I mean, it's like twice as fast for small 4K random writes.
A: This is RBD with, you know, some reasonably decent amount of I/O depth. So nothing real mind-blowing or anything here, but it's nice to see that, in reality, BlueStore on the write path is actually doing really well, both for large writes and for small writes, and then just the confirmation that, yeah, we're kind of hurting from not doing read-ahead. Anyway, that's all.
A: Maybe I'll throw in here, not related to that, but there's an interesting post by Christoph Hellwig: he's mainlining the old AIO fsync code, Sage, that you saw from like three years ago, and he's working on some other stuff for the ScyllaDB guys, yeah.
D: I think what's going to be more exciting for us is that it eliminates a syscall when you're polling for AIOs. I'm not sure if we would actually use the async fsync or not, at least not at first, but yes, it's exciting that we're moving forward there. This wasn't on the pull request list, but I want to mention it; it came in, I just noticed it this morning, and it's one of the top ones: the UnitedStack folks have made a new cache tiering mode that works way better for them.
D: They get like a 90% reduction in flushes with a Zipf distribution of random writes, and with a similar hit rate to the normal writeback mode, so that's pretty big. What they basically did was devote a bunch of memory to it: instead of using the hit sets and sequential hit sets to estimate temperature, they just went ahead and did a map from a hash ID to a counter that is the temperature, and they add like an exponential decay, and they're doing that instead.
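The temperature scheme described, a map from an object's hash ID to a hit counter with exponential decay, might look roughly like this. Names and the decay parameter are illustrative, not taken from the pull request:

```python
# Rough sketch of the temperature estimate described above: a map from
# an object's hash ID to a counter, with exponential decay applied so
# that old hits fade over time.
class TemperatureMap:
    def __init__(self, decay=0.5):
        self.decay = decay  # multiplier applied once per decay period
        self.temp = {}      # hash ID -> temperature counter

    def hit(self, hash_id):
        self.temp[hash_id] = self.temp.get(hash_id, 0.0) + 1.0

    def decay_all(self):
        # Exponential decay: every object's temperature shrinks each period.
        for k in self.temp:
            self.temp[k] *= self.decay

    def hotter(self, a, b):
        return self.temp.get(a, 0.0) > self.temp.get(b, 0.0)

tm = TemperatureMap(decay=0.5)
for _ in range(4):
    tm.hit(0xabc)          # hot object: hit repeatedly
tm.hit(0xdef)              # cold object: hit once, then decayed
tm.decay_all()
tm.hit(0xabc)
print(tm.hotter(0xabc, 0xdef))  # True: recent, frequent hits dominate
```

The memory cost is one counter per tracked hash, which is why this uses a lot more memory than hit sets, but it yields a per-object temperature instead of a coarse membership test.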
D: So it uses a lot more memory, but it can actually create an accurate temperature estimate. The other thing it does is, when you read, it does that temperature estimation, I think on the, I'm not sure, handing it to the replica; I'm not sure exactly why they did that, but I guess that's the main difference. I have some questions on there for them. Let's see.
D: So they took the opposite approach: they fixed it so the hit rate is the same, but they have one-tenth of the flushes. I see, okay; that's what they did. Anyway, it's a remarkably complete pull request that adds all the bits and all the rest of it; it's in pretty good shape. I had some questions, and I'm going to ask for more performance testing, but that's kind of exciting. And then the last thing I'll mention: an EC bug regression started me thinking about the onreadable callbacks again in the ObjectStore layer.
D: These are really an artifact of FileStore, because you couldn't read what you wrote until after FileStore had journaled it and then written it to disk, and so it has these dual callbacks. None of the other backends need that, because they maintain an in-memory cache of what they write. Maybe KStore does, but we can just delete it if we don't really care about that. MemStore and BlueStore don't need it, because they manage their own cache.
D: The thinking is basically to eliminate that from the ObjectStore interface, and instead add a kludge inside FileStore that keeps track of the writes that are in flight. So if you try to read something that is in the process of being written and applied, it'll block that read until it's applied and you can read it back. That sort of pushes the burden onto FileStore, and then we eliminate all the code in the OSD for all those callbacks and locks and whatever; there's a ton of it.
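A toy single-process model of the kludge described, tracking in-flight writes per object and blocking reads until the apply completes, could look like this. The real change would live in FileStore's C++, with its own locking; names here are illustrative:

```python
import threading

# Track writes that are in flight per object, and block a read on that
# object until the write has been applied.
class InFlightTracker:
    def __init__(self):
        self.lock = threading.Lock()
        self.in_flight = {}  # object name -> Event set once applied

    def start_write(self, obj):
        with self.lock:
            self.in_flight[obj] = threading.Event()

    def apply_write(self, obj):
        with self.lock:
            ev = self.in_flight.pop(obj, None)
        if ev:
            ev.set()

    def wait_readable(self, obj, timeout=5.0):
        with self.lock:
            ev = self.in_flight.get(obj)
        if ev:
            return ev.wait(timeout)  # block the read until applied
        return True                  # nothing in flight: readable now

store = InFlightTracker()
store.start_write("rbd_data.1")

results = []
def reader():
    results.append(store.wait_readable("rbd_data.1"))

t = threading.Thread(target=reader)
t.start()
store.apply_write("rbd_data.1")  # the read unblocks once the write applies
t.join()
print(results)  # [True]
```

The trade-off being debated is visible here: every read pays a lookup (and possibly a wait), while the callers no longer carry any callback machinery of their own.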
D: It's unclear what the performance penalty of doing that extra tracking in FileStore will be, or whether we'll get a benefit from removing the complexity everywhere else, because if it speeds up BlueStore by 20% and slows down FileStore by 20%, that might still be worth it. But we might want to wait one more release before we do it, as the balance shifts.
D: Good, nothing? Okay. I'm kind of tempted to do the simple thing and just rip out all of the callback code, without actually fixing FileStore first, and just see what the effect is on BlueStore workloads, to know whether this is worthwhile.
D: But we'll see. The main thing that needs to happen is to take the Sequencer concept and the collections and sort of combine them, because on reads we need to wait for in-flight writes, and in reality there's already a one-to-one relationship between those two, but there are still two different structures. So I need to combine those in the code and the interface, so that the reads will know what to wait for. But once that happens...
A: I guess one thing to keep in mind is that FileStore performance for anything fast might be going kind of downhill with all the Spectre/Meltdown stuff anyway; it's certainly not helping. So if that's the case, we might need to move people who care about performance over to BlueStore sooner rather than later.
A: The other thing is that BlueStore is already faster, right? So even if we take a hit: right now BlueStore is probably at least fifty to a hundred percent faster for small random writes than FileStore. So, you know, FileStore taking a, I don't know, 30% hit versus BlueStore taking a twenty-five percent hit, you end up in a much better place.
D: So what I'm trying to figure out is how to answer the question of what that performance penalty would be on FileStore without actually doing all the work, but I think there might not be a way to do that. I guess the upshot is that the initial refactor is actually the hard part, and that we can do; there are no performance implications there.