From YouTube: Ceph Crimson/SeaStore 2021-07-21
A
So, let's start. Last week I've been reviewing Ronen's change to extract the scrub-related logic into its own individual queue, so that it can be reused by Crimson, and I also addressed a couple of regressions introduced by the refactoring, which tried to adapt to the new submit API. And I think that's it from me.
B
I was working on the SeaStore patch, version 3 of the SeaStore patch, adding the ioctl and control interfaces to SeaStore. So version 3 has been submitted; maybe no comments on it yet. So yeah, that was the update for me.
B
Last week I reviewed some PRs, and when I was doing the testing I got an assert, but only once; I can't reproduce it. It reported that the parent had been invalidated, or rather not invalidated, but that something dirty happened, but I can't reproduce it; I tried several times. Also, that tree still has not used the interruptible-future feature, and a conflict was reported for the extent, so I will work on converting the other tree to use interruptible futures.
C
Hello. Okay, let's start from the backend. Last week a bunch of fixes have been merged, targeting the problems with handling peering events in the Stray and Reset states.
C
Here is the gist that describes the investigations. Well, I won't be looking at them here; there were a few. Basically, there were multiple problems around OSD activation: in many situations we started crunching peering events in the booting state.
C
Misunderstanding
basically
between
the
peering
and
messenger
components.
It's
basically
it's
in
lock.
I
think
it
could
be
a
misdirection
of
of
the
pg
notify.
Second
message
looks
like
the
peering
state
was
wanted
to
send
the
message
to
one
osd,
but
messenger
somehow
send
it
to
to
another,
just
paste
it
and
just
pasted
a
link
to
the
gist.
C
The first one is about some optimizations for bufferlist::c_str(). There was an inefficiency there: unnecessary rebuilds have been spotted during the encoding of the message's protocol header, basically because of appending an empty buffer pointer at the end. I tweaked bufferlist to account for that. Another thing I'm working on right now is the profiling of Crimson; here is a link to the gist.
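The rebuild cost described above can be illustrated with a toy model. This is not Ceph's actual `bufferlist`; the class, the `rebuilds` counter, and the `skip_empty` flag are all illustrative assumptions sketching why a trailing empty segment can force a needless flatten in a naive `c_str()`:

```cpp
#include <cassert>
#include <list>
#include <string>

// Toy chained buffer list: c_str() must return contiguous memory, so with
// more than one segment it has to flatten ("rebuild") into one copy first.
class toy_bufferlist {
  std::list<std::string> segments;
public:
  int rebuilds = 0;  // how often we had to flatten

  void append(std::string s) { segments.push_back(std::move(s)); }

  // skip_empty models the tweak discussed above: an appended empty ptr
  // should not count as a second segment and trigger a rebuild.
  const char* c_str(bool skip_empty) {
    std::size_t n = 0;
    const std::string* only = nullptr;
    for (const auto& s : segments)
      if (!skip_empty || !s.empty()) { ++n; only = &s; }
    if (n == 0) return "";
    if (n == 1) return only->c_str();   // already contiguous: no copy
    std::string flat;                   // otherwise rebuild by copying
    for (const auto& s : segments) flat += s;
    segments.assign(1, flat);
    ++rebuilds;
    return segments.front().c_str();
  }
};
```

With the naive count, appending an empty segment after "header" forces a copy; with the tweak, no rebuild happens even though the list has two segments.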
C
Well, with CyanStore we look pretty good in 4k random reads: when using two instances of rados bench sending requests to a single OSD with an almost empty CyanStore, we are seeing around 35 thousand cycles per operation. I'm continuing, and this is because of my last big talk with Mark Nelson, who is testing with BlueStore. I was very curious how it would look if we used AlienStore with BlueStore, but a BlueStore with very small, tiny, almost empty content inside. The idea is to not focus on the object data; it's about exposing the internals, the layers of the OSD, the components we rewrote with seastar. And I'm seeing an extremely interesting pattern. In the classic OSD, if I switch from an almost empty MemStore to an almost empty BlueStore...
C
Basically, there is no harm: similar efficiency is preserved. However, that's absolutely not the case for the Crimson OSD.
C
The hit is around two times: instead of thirty-five thousand cycles per op, I started seeing 77 thousand cycles per operation, so this called for profiling.
C
Well, there is a huge impact from the semaphore plus eventfd. I'm not sure it's everything; perhaps, well, I'm speculating. In the profiling I'm seeing the direct costs of the syscalls, but maybe there are also indirect costs, like trashing CPU caches, because the IPC, the instructions-per-cycle metric, has been severely hit as well. In my CyanStore configuration Crimson does around one and a half instructions per cycle, which is pretty good, because classic does something like half of that; so we are utilizing the CPU about two times better. But when using AlienStore, the IPC drops to somewhere around one. In other words, still sniffing; hopefully we will know more today. That's it from me.
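As a back-of-the-envelope check on the numbers above (35k vs. 77k cycles per op, IPC of roughly 1.5 vs. 1.0; taken from the discussion, not fresh measurements), the instructions actually retired per op can be compared:

```cpp
#include <cassert>

// instructions retired per op = (cycles per op) * (instructions per cycle)
constexpr double insns_per_op(double cycles_per_op, double ipc) {
  return cycles_per_op * ipc;
}

// CyanStore path:  ~35k cycles/op at ~1.5 IPC -> ~52.5k instructions/op.
// AlienStore path: ~77k cycles/op at ~1.0 IPC -> ~77k  instructions/op.
// Cycles grow ~2.2x while retired instructions grow only ~1.5x, which is
// consistent with stalls (syscalls, cache trashing) on top of the extra
// work, rather than extra work alone.
```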
C
Me too. My idea is to compare MemStore to MemStore: I want to tweak AlienStore to host not only BlueStore but also MemStore, so I could compare CyanStore, I mean the vanilla tiny object-store implementation, with almost the same store exposed through the alien component. That would hopefully allow us to judge the real overhead that is being imposed by AlienStore.
A
Just a side note regarding the overhead introduced by the semaphore: I did some investigation last night after reading Mark Nelson's comments, and I realized that we could use a spinlock-based implementation.
C
But in common we do have an implementation of a spinlock, and we were optimizing it, but I'm not sure we merged the patches; there were a bunch of them. I will take a look at that, because, you know, if you're doing a busy wait in userland, you want to take care of things like pausing the CPU, etc.
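A minimal sketch of the idea, assuming nothing about the actual implementation in Ceph's common/: a generic test-and-test-and-set spinlock that pauses the CPU while busy-waiting, which is exactly the care mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Test-and-test-and-set spinlock with a CPU pause in the busy-wait loop.
class spinlock {
  std::atomic<bool> locked{false};
public:
  void lock() {
    for (;;) {
      // Attempt the atomic exchange only when the lock looks free.
      if (!locked.exchange(true, std::memory_order_acquire))
        return;
      // Busy-wait on a plain load; hint to the core that we are spinning.
      while (locked.load(std::memory_order_relaxed)) {
#if defined(__x86_64__) || defined(__i386__)
        __builtin_ia32_pause();
#else
        std::this_thread::yield();
#endif
      }
    }
  }
  void unlock() { locked.store(false, std::memory_order_release); }
};
```

The pause hint reduces pipeline flushes and power draw while spinning and gives the sibling hyperthread a chance to run; spinning on the relaxed load instead of the exchange keeps the cache line in shared state until the lock is actually released.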
C
I'm looking at the... sorry, I need to restart the SSH. I will post, as a comment to the gist, a paste from the perf report, but from memory I can say...
C
Sure, that's still under profiling, the overall difference between them.
C
Yeah, but there is one problem, you know: Mark is going to prepare some slides for management about the efficiency of Crimson, and he does that based on AlienStore with BlueStore, and at the moment on the slides we are maybe 25 or 30 percent better than classic, and Mark complains that it's not enough. So I'm trying to point out that this kind of...
C
Yes,
because,
even
because
of
slow,
because
the
bitcoin
itself
is
not
fast,
yes
exactly,
I
think
it
I
I
speculate.
I
bet
that
in
real
workloads,
the
blues,
the
objects
are,
the
bluester
will
take
most
of
the
cycles
being
being
burned
by
entire
osd.
C
Okay,
but
this
will
this
requires
a
convincing
mark.
C
Anyway, I'm still curious whether the semaphore can entirely explain the results.
E
Hi. First, regarding the conversation we just had, some reading material: you might want to take a look at a performance book that a guy named Paul McKenney wrote. He has been doing performance and SMP work for many years; he was working for IBM, I think now he's at Intel, but he's the guy who worked for 20 years to get RCU into the Linux kernel.
E
Yeah, I know, we know that, but the numbers he has: for example, take a look at this page later on. He has lectures with numbers, measured times of various operations, like taking a semaphore or spin-locking, with some discussion of them, which is pretty nice.
E
Basically we all know everything here, but it's good to see the numbers and be reminded of some of the considerations, and he's pretty much one of the best experts in the field. Now, apart from that, I didn't do much on code in the last week, as I was actually doing other things, but I'm still working on scrubbing.
E
I have a bug in the backend refactoring, with one of the counters, which I'm trying to solve, and I have introduced a bug into the scheduling somewhere during the latest fixes, and I'm trying to solve that too. Two things, basically crashes. That's it for me.
D
Yep. I am doing some refactoring to the LBA btree to use more of an iterator-based approach, which will hopefully eliminate the find_hole insertion bug that Chunmei may be running into, and will make some later improvements for supporting clone easier. What's up?
D
There's a bug in the LBA manager where, when you insert an extent, the place it tries to insert it in the tree is incorrect after a split, because of a bug in the way find_hole works; it's just not correct, so I'm reworking it to fix that. It's probably the cause of most of Chunmei's crashes.
F
This week I modified the extent placement manager PR and pushed it. I also added the extent scan mechanism for the extent placement manager. It is all in the PR right now and I think it's ready for review. I'm also working on adding support for the new device type to SeaStore right now. That's all for me.
B
Yeah, last week I added metrics at the cache layer, and I'm trying to review the extent placement manager design, which seems to have impacts on many SeaStore components, so I think it might be worth it for Sam to also look at it.
A
I just recalled the offline discussion: you mentioned that you're not quite sure about the target device of the new device work. Is that true, that you're not sure what the exact device is?
D
I still wanted to go through the extent placement manager and the way the transactions should support it; it is very much like the out-of-line extents. That's why. But other than that you're right: it doesn't need to be cleaned; it doesn't have segments as such.
A
So it's more like a generic SSD device, right? Or RAM, pmem?
D
SSDs, maybe; it depends. That will depend on whether it's more efficient to treat those as being segmented or not, because their internal garbage collection may be so slow that it's more efficient to write to them sequentially. But that's just a call we'll make after benchmarking and measurement.
A
Anything else? As well, I will update the pad right after the meeting.
C
I forgot, sorry, I unmuted myself late. Okay, I posted a comment with the pastes from the perf report, and it's on the AlienStore side, sorry, on the alien threads' side, as well as on the reactor side. However, I'm not entirely sure it's all, that it's solely, because of the semaphore.
C
But from the gist, from the comments, it's clearly visible that the overhead of conveying the requests is pretty significant, pretty large in comparison to the real work performed, actually doing BlueStore things. Just compare the BlueStore read and BlueStore getattrs, which are 6.6 and 2.2 respectively, with the fact that the thread-pool loop takes around 25 to 30 percent of all cycles burned by the entire process.
C
It's
not
time
it's
about
cycles,
cycles,
yeah,
so
so
be
aware,
if
you,
if,
especially
if
you
compare
with
with
the
results
from
mark's
gdp
pmp,
which
is
about
profiling
in
work
in
what
time
world
club
yeah
exactly.
C
Somewhere close to the CyanStore results, yes.
D
You know, I'm just asking about the test construction itself, not the results. It looks to me like it's two random reads, or two rados bench instances which are both sending queue size one, right?
C
At
the
exact
command
comment,
I
used
to
collect
the
data
to
read
to
record
this
stuff.
Okay,
updating
the
comment.
C
They were two rados bench instances doing a 30-second random read over a single OSD instance.
C
That's a good question, actually. I think that...
C
The queue size for each? No.
C
I
doubt
I
think
it
sends
up
to
16
requests
in
parallel.
C
Well, we could just limit the...
C
We can verify this hypothesis just by lowering the number and seeing whether it's the queue size.
A
No, we don't do the wait before that. Way before that we were using a single queue, but then the sharded queue was added to improve the parallelism, and after that, I believe...
A
The lock, the lock was too time-consuming. And I don't really want a lock to exist in a seastar thread; I mean a mutex, a POSIX mutex. So I traded it for the semaphore, in hopes that it would be faster by removing the mutex.
C
AlienStore with... we are comparing the following setups.
D
We've never allowed AlienStore to busy-wait, so whether we were using a pthread mutex to wake up or whatever doesn't really matter. I will point out, however, that the sharded work queue concept will make this problem worse, not better. In the classic OSD, you may recall back in the day when the sharded work queue was originally added: for OSDs that were frequently idle, the sharded work queue increased latency, because the individual worker queues had to go to sleep.
C
Okay, the point would be that we won't be able to saturate it, because the reactor part would saturate first.
C
One would be too small. Okay, now I recall; I've introduced that variable.
C
And
I
did
that
because
one
was
not
enough
there's
if
there
is
against
any,
I
need
to
dig,
but
there
is
against
comparing
multiple,
multiple
values.
However,
it
was
before
introduction
of
sharding.
F
Oh, I didn't do random read; what I did was random write. It's about... oh, okay... it's about fifteen thousand IOPS, if I recall correctly.
A
Were
you
using
the
redis
bench
or
some
other
test,
some
other
tool.
D
So
that's
that's
the
other
thing.
Radius
bench
is
kind
of
terrible
at
this,
which
is
why
I
was
asking
about
total
number
of
attributes.
Rbd
or
fio
is
going
to
be
able
to
generate
more
concurrent
reads
than
greatest
natural.
Well,
given
the
same
well,.
C
Raido's
bench
also,
but
we
will
need
to
bump
up
a
proper,
a
proper
parameter.
There
is
it's
comfortable.
C
Exactly. Okay, because my main focus is not even on whatever regression we have or don't have in AlienStore; I'm curious why introducing AlienStore lowers the cycles per op so much, two times. So it's not about eight percent or twenty percent; it's two times slower.
F
Well, I think the default setting is to pin the BlueStore threads to the last five or ten CPU cores, so it'll never...
F
Yes, but I think that's not the seastar thread; that's not the CPU core that the seastar threads are running on. Got it.
D
Yeah, I know, I'm looking. Well, we sort of expect it to, right? Every time you go to sleep, the thread itself needs to do some cleanup: it calls back into the kernel, which puts it to sleep; upon wake-up it has some restoration work to do before it finally gets, you know, back to...
F
So
it
costs
more
cpu
cycles
because
it
is
relatively.
D
Free,
yes,
exactly
and
it's
not
that
it
costs
more
cpu
cycles.
It's
that
a
larger
percentage
of
the
cpu
cycles
that
it
spent
not
sleeping
were
spent
going
to
sleep,
but
that
doesn't
mean
that
most
of
the
time
was
spent.
That.
D
Way,
you
see
what
I
mean,
so
in
other
words,
if
the,
if
the
threat
actually
spends
50
of
its
time,
sleeping
just
literally,
not
scheduled
at
all
and
then
of
the
remaining
time.
Five
percent
is
spent
going
to
sleep
and
coming
back
or
eight
percent
here,
and
the
remaining
forty
percent
is
spent
doing
real
work.
Then,
in
this
performance
graph
we'd
see
five
percent
over
fifty
percent
as
its
cpu
cycle
spent
in
stem
weight,
but
it
would
still
be
fully
almost
completely
underutilized.
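The worked example above can be written down directly. The 50/5/40 split is the hypothetical from the discussion, not a measurement: perf samples only while the thread is on-CPU, so a path's share of reported samples is its share of on-CPU time, not of wall time.

```cpp
#include <cassert>

// Share of perf samples attributed to a path, given its percentage of wall
// time and the thread's on-CPU percentage of wall time. perf never samples
// a sleeping thread, so the denominator is on-CPU time only.
constexpr double perf_share(double path_pct_of_wall, double on_cpu_pct_of_wall) {
  return path_pct_of_wall / on_cpu_pct_of_wall * 100.0;
}

// Thread asleep 50% of wall time, ~5% entering/leaving sleep, rest real
// work: perf attributes the sleep path 5/50 = 10% of samples, even though
// the thread is idle half the time.
```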
D
Like, what perf tells you is percentages: it samples cores at different times, attributes those samples to threads, and then gives you the ratio of samples that happened in each bit of the call stack to the total set of samples. But it doesn't take any samples when the thread is asleep.
C
It's because I filter it at the level of the perf report. However...
C
The whole program; I'm pretty sure I just filter it.
C
The
view
knows
the
data
being
used
to
do
the
math,
in
other
words,
27
percent
spent
in
the
loop
is
about,
is
in
comparison
to
the
all
cycles
recorded,
which
means
all
threads
from
the
entire
process.
I
was
just
take
a
look
on
the
perfect.
D
I think that... yeah, I think we're rate-limited here. Either it's the case that the Crimson reactor, the seastar reactor, is, as you pointed out, saturated and it can't keep BlueStore full, or neither is saturated, because 32 concurrent reads wasn't enough. Either way, I don't think this thread-pool thread is saturated; I think it's sleeping about half the time.
C
To saturate the sharded, the sharded alien threads, still...
A
The bottleneck is on the seastar side.
F
The IOPS of the classic OSD is about 30 percent higher, and at the time the seastar thread is fully using the CPU core that it is running on, so at the time I think it was the seastar thread that was the bottleneck for the whole test.
C
Okay,
so
you
had
saturation,
have
you
checked
the
psych
spell.
F
Oh, I didn't increase the AlienStore thread number for better performance; I just wanted to compare it with the classic OSD.
F
So
I
I
I
just
set
the
the
alien
store,
thresh
number
to
the
same
as
the
blue
store
in
store,
and
then
I
run
the
test
and
once
I
found
out
that
the
sister
thread
was
the
bottleneck,
I
I
didn't
do
any
more
tests.
So
I
don't
know
if
I'm
making
myself
clear.
F
Oh,
no,
I
I
actually
I
I
just
thought
it,
because
sooner
or
later
we
we
will
run
crimson
osd
on
multiple
cores,
so
I
think
it
would
be
better
to
have
the
edit
store
work.
Queue
started.
F
That's
possible,
but
I
think
that
possibility
is
not
supported
by
our
bio-radix
test,
because
the
test
shows
that
it
the
the
same
weight
on
small
cpu
cycles,
because
it
is
relatively
free.
So
so.
C
The test was constructed with CyanStore and MemStore in mind; using it on BlueStore is, well, just incidental.
F
I still think, if...
B
Yeah, by submitting the request from the seastar core to the BlueStore core, right? So that process... Yes, exactly.
F
So
the
performance
degradation
we're
talking
about
is
in
terms
of
ipc
right,
not
exactly.
C
I
think
the
reason
the
main
reason
is
because
not
only
because
of
doing
more
work
like
conveying
between
the
stuff
between
multiple
threads.
It
also
because
of
lowering
the
efficiency
of.
C
Well,
I
haven't
put
the
number
into
the
gist
yet,
but
just
to
recall
just
to
recall
when
using
c
and
stir
crimson
is
able
to
hit
around
one
and
half
instruction
per
second
with
with
bluester
plus
alienster.
C
That is also a possibility, but it could be judged pretty easily: perf stat has an option to profile a specific thread only.
A
Yeah, I think we can also try to revert the sharded queue change, to see if that helps with the performance. Actually, because there are like six threads in this picture, they are trying to swap in and out when there's a drop in the queue and when there's not, so a single thread might help, if the load is not enough to saturate the sharded queue.
D
You are worried about the impact of the client-side library on the total testing environment? Yeah. Not really: if you make sure you put the client in a resource-unconstrained place and give it enough total parallelism, I don't see why it would matter. Also keep in mind rados bench does almost everything librbd does.
C
The extra stuff, like the classes, is only on cold paths.
B
Yep. Oh, I mean testing in one environment.
C
And everything is under the assumption that you have enough cores to dedicate to your rados bench instances. Okay, if you are testing on a laptop, that's a constraint, but on any reasonable server it should be fine.