Ceph Ceph Month 2021, 11 Jun 2021

From YouTube: Ceph Month 2021: Performance Optimization for All Flash based on aarch64

Description

Presented by: Chunsong Feng
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021

A

Let's start.

A

My name is Chunsong Feng. I come from Huawei, and today we will introduce the performance optimization for all-flash Ceph based on aarch64.

A

First, let me introduce the Ceph solution based on the Kunpeng 920.

A

This is the most powerful ARM-based CPU. The core count is 32, 48, or 64 cores, the frequency is 2.6 or 3 GHz, the memory controller is a DDR4 controller, and several technologies and architectures are based on Kunpeng chips.

A

We have the test hardware: the Kunpeng CPU, the NIC, and the SSD, and we have tested on openEuler.

A

We improved CPU usage: we turned CPU prefetch on or off, optimized the number of concurrent threads, and used NUMA optimization. We tested different kernel page sizes, 4K and 64K. We optimized NIC performance using interrupt core binding, MTU adjustment, and TCP parameter adjustment.

A

We also did some I/O performance optimization, which we will introduce today: data access across NUMA nodes, multi-port NIC deployment, DDR multi-channel deployment, a messenger throttle that is too low, long queue waiting time in the OSD, use of the 64K page size, and optimized CRC32C in RocksDB.

A

Normally the OSD receives data from the NIC and writes it to the NVMe SSD. The data may be received on one CPU and written to an SSD on another CPU, so it crosses NUMA nodes.

A

We hope to deploy multiple NICs, so that each NUMA node has a NIC and SSDs and the data flow can stay local.

A

Then we can receive data on one NIC and write it to an SSD in the same NUMA node, so the data flow is completed entirely within one NUMA node.
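As a rough illustration of the NUMA-local layout described above (a sketch, not the speakers' tooling), the Python below groups NICs and SSDs by NUMA node. The device names and node numbers are made-up examples; on a Linux host the node of a PCI NIC can usually be read from `/sys/class/net/<iface>/device/numa_node`, and the helper `read_numa_node` simply reads whatever file path it is given.

```python
from collections import defaultdict
from pathlib import Path


def read_numa_node(sysfs_file):
    """Return the NUMA node stored in a sysfs numa_node file (-1 if unknown)."""
    try:
        return int(Path(sysfs_file).read_text().strip())
    except (OSError, ValueError):
        return -1


def group_by_node(nic_nodes, ssd_nodes):
    """Group NICs and SSDs that share a NUMA node.

    nic_nodes and ssd_nodes map a device name to its NUMA node number.
    """
    groups = defaultdict(lambda: {"nics": [], "ssds": []})
    for name, node in sorted(nic_nodes.items()):
        groups[node]["nics"].append(name)
    for name, node in sorted(ssd_nodes.items()):
        groups[node]["ssds"].append(name)
    return dict(groups)


# Hypothetical inventory; real values could come from read_numa_node().
layout = group_by_node({"eth0": 0, "eth1": 1}, {"nvme0n1": 0, "nvme1n1": 1})
```

Each entry of `layout` then lists the NIC and the SSDs that an OSD pinned to that node could use without crossing nodes.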

A

After the NUMA-aware deployment, we tested 4K random write. The IOPS are nine percent higher and the latency is eighty percent lower than before.

A

For the multi-port NIC: for example, the NIC is installed on NUMA node 0 and it has multiple ports, for example four ports. Now we can assign the interrupts of each port

A

to a different NUMA node by setting the IRQ affinity, so that not only the cores in NUMA node 0 are used.

A

We assign some OSDs to use each NIC port, so that the port's interrupts and the OSD processes run in the same NUMA node. The NIC soft-interrupt handling and the OSD will use the same NUMA node.
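Interrupt core binding on Linux is commonly done by writing a hexadecimal CPU bitmask to `/proc/irq/<n>/smp_affinity`. The sketch below only computes that mask string from a list of CPU numbers; which CPUs belong to which NUMA node, and which IRQ numbers belong to which NIC port, are host-specific and are not taken from the talk.

```python
def cpus_to_smp_affinity(cpus):
    """Build the hex bitmask string used by /proc/irq/<n>/smp_affinity.

    The kernel accepts comma-separated 32-bit groups, most significant first.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    hex_digits = format(mask, "x")
    groups = []
    while hex_digits:
        groups.append(hex_digits[-8:])     # take 8 hex digits (32 bits) at a time
        hex_digits = hex_digits[:-8]
    # Every group except the most significant one is padded to 8 digits.
    groups = [g.zfill(8) for g in groups[:-1]] + groups[-1:]
    return ",".join(reversed(groups))


# Example: CPUs 0-3 on one node, CPU 32 on another (illustrative numbers).
low_mask = cpus_to_smp_affinity([0, 1, 2, 3])
high_mask = cpus_to_smp_affinity([32])
```

Writing the resulting string to the IRQ's `smp_affinity` file requires root and is left out of the sketch.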

A

This is much more balanced across the system. We tested this deployment: compared with before this optimization, 4K random write IOPS are 7 percent higher and the latency is also

A

lower. Next, DDR channels: we support eight channels in one CPU, so two CPUs

A

support 16 channels. We tested 16-channel DDR against 12-channel DDR. With 16 channels, 4 KB random write is about seven percent higher,

A

and random read is about 11 percent higher.

A

Now we gradually increase osd_client_message_cap. The default is 100. We tested 200 and higher.

A

So 100 may be okay for HDD, but for SSD we should use more.

A

Increases okay.

A

There are some problems in this picture.

A

Okay. The messenger worker receives the message and enqueues it to the op queue, or to multiple op queues.

A

The OSD threads dequeue the op from it and then write the data to disk. When the AIO is finished, the BlueStore AIO thread gets the event and adds it to the kv_queue.

A

Then the BlueStore kv sync thread swaps the kv_queue into the kv_committing queue and flushes it to the WAL and DB partition or disk. After the flush, it enqueues to the kv_committing_to_finalize queue. Then, in BlueStore, the kv finalize thread enqueues to the context queue, and the OSD thread swaps the context queue into oncommits.
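The "swap the queue" step mentioned above is a general batching pattern: producers append under a lock, and a single consumer exchanges the filled queue for an empty one and then works on the batch without holding the lock. The Python below is a toy model of that pattern only; the class name `SwapQueue` and the `txc` strings are invented for the example and this is not BlueStore code.

```python
import threading
from collections import deque


class SwapQueue:
    """Producers append items; one consumer takes the whole batch at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = deque()

    def push(self, item):
        with self._lock:
            self._items.append(item)

    def swap_out(self):
        # Hand back everything queued so far and leave an empty queue behind,
        # so the lock is held only for the swap, not for the slow flush.
        with self._lock:
            batch, self._items = self._items, deque()
        return batch


kv_queue = SwapQueue()
for txc in ("txc1", "txc2", "txc3"):
    kv_queue.push(txc)

kv_committing = kv_queue.swap_out()   # the single sync thread's batch
committed = list(kv_committing)       # stand-in for flushing the batch
```

Because only one thread drains each of these queues, the time an item waits grows with the batch ahead of it, which is the queueing effect the latency figures in the talk point at.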

A

We collected the latency of each stage. The op queue latency occupies 16 percent of all the latency. The kv queued latency, from here to before kv sync, occupies about 36 percent. The kv committing latency,

A

which is after kv sync and before the flush to disk, from this point to here, occupies 27 percent.

A

Since these two latencies are much higher than this one, I also collected the queue lengths. The average length of the op queue is less than ten, so the queue time is short. The lengths of the kv_queue and the kv_committing queue are greater.

A

The KV queue latency is long because the op queue is processed by multiple threads, while the kv_queue and the kv_committing_to_finalize queue are each processed by one thread.

A

Multiple threads against one thread.

A

So the scheduling is not fair, and we use CPU partitioning to separate them: the messenger worker and OSD threads go to one area, and the BlueStore AIO, kv sync, and kv finalize threads go to another area, to achieve fair scheduling. We tested it:

A

17 percent higher IOPS, and the latency is about 12 percent lower than before.
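One way to picture the CPU partitioning described above is to classify threads by name and pin each class to its own set of cores. The sketch below is an assumption-laden illustration, not the patch from the talk: the thread-name prefixes are assumed and should be checked against a running `ceph-osd`, and the core numbers in the usage are arbitrary. It relies only on Linux `/proc` and Python's `os.sched_setaffinity`.

```python
import os

# Assumed thread-name prefixes; check them against the running ceph-osd
# (for example in /proc/<pid>/task/<tid>/comm) before relying on them.
FRONT_PREFIXES = ("msgr-worker", "tp_osd_tp")
BACK_PREFIXES = ("bstore_aio", "bstore_kv_sync", "bstore_kv_final")


def pick_area(thread_name, front_cpus, back_cpus):
    """Return the CPU set for a thread, or None to leave it unpinned."""
    if thread_name.startswith(FRONT_PREFIXES):
        return set(front_cpus)
    if thread_name.startswith(BACK_PREFIXES):
        return set(back_cpus)
    return None


def pin_process_threads(pid, front_cpus, back_cpus):
    """Pin every matching thread of one process (Linux only, needs privileges)."""
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        with open(f"{task_dir}/{tid}/comm") as f:
            name = f.read().strip()
        cpus = pick_area(name, front_cpus, back_cpus)
        if cpus:
            os.sched_setaffinity(int(tid), cpus)
```

For example, `pin_process_threads(osd_pid, range(0, 4), range(4, 8))` would give the messenger and op threads cores 0 to 3 and the BlueStore threads cores 4 to 7, where `osd_pid` is a placeholder for a real process ID.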

A

We also compared the 4K page size and the 64K page size. The 64K page size reduces TLB misses, and the performance is about 12 percent higher than with the 4K page size for random write.

A

But there are some issues we should keep in mind. First, small allocations with a large page size waste memory. To reduce the memory waste in bufferlist reserve, we modified this.

A

It used the default page size for alignment, and now we use a smaller size for alignment. This is 4K, this is 64K, so it can reduce memory waste. Second, we found that some memory code uses a hard-coded page shift, which is for the 4K page size. We modified it to use the system page shift, to be compatible with various page sizes.
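The memory waste from page-size alignment is simple arithmetic: a buffer rounded up to the alignment leaves the remainder unused. The short Python sketch below works one example with an illustrative 5000-byte buffer; the size is made up and the functions are not Ceph's bufferlist code.

```python
def round_up(size, align):
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align


def wasted(size, align):
    """Bytes left unused when a buffer of `size` is padded to `align`."""
    return round_up(size, align) - size


waste_4k = wasted(5000, 4096)     # padded to 8192 bytes
waste_64k = wasted(5000, 65536)   # padded to 65536 bytes
```

With these numbers the same buffer wastes 3192 bytes at 4K alignment and 60536 bytes at 64K alignment, which is why aligning to a smaller size helps on a 64K-page kernel.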

A

There is still a write amplification issue. When bluefs_buffered_io is set to true, metadata is written using buffered I/O, and sync_file_range is called to write the data to disk by page in the kernel. When the page size is 64K, it writes with a minimum size of 64K.
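To see why page-granular flushing amplifies small writes, the sketch below counts how many bytes reach the disk if every page touched by a write is flushed whole. It is a simplified per-write model under that single assumption, so it gives an upper bound for one write and will not reproduce the measured factors quoted in the talk.

```python
def flushed_bytes(offset, length, page_size):
    """Bytes written if every page touched by [offset, offset+length) is flushed."""
    first_page = offset // page_size
    last_page = (offset + length - 1) // page_size
    return (last_page - first_page + 1) * page_size


small_4k = flushed_bytes(0, 4096, 4096)      # one 4K page
small_64k = flushed_bytes(0, 4096, 65536)    # one whole 64K page
straddle = flushed_bytes(4000, 200, 4096)    # crosses a page boundary
```

In this model a 4 KB metadata write costs 4 KB on a 4K-page kernel but 64 KB on a 64K-page kernel.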

A

Now, the write amplification factor is 2.46 with 4K pages and 5.46 with 64K pages. We tested setting bluefs_buffered_io to false, so that metadata is written using direct I/O and sync_file_range is not called.

A

The amplification factor is then 2.29. Too many writes affect the disk life cycle, so we suggest setting bluefs_buffered_io to false when the 64K page size is used. There is also a tcmalloc and kernel page size issue. When the tcmalloc page size is smaller than the kernel page size,

A

the memory keeps increasing until it approaches the osd_memory_target, and the performance degrades significantly. So ensure that the page size of tcmalloc is greater than the kernel page size.

A

The RocksDB CRC32C optimization is already in Pacific. If you use an earlier version, you should backport this feature to that version. The performance improvement is about 3 percent. By default it uses the slow CRC32C, which is a software algorithm.

A

So it's better to use the ARM64 instructions.
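For readers unfamiliar with what a "software" CRC32C looks like, here is a minimal bit-by-bit Python version of the Castagnoli CRC. It is a reference sketch, not RocksDB's implementation: it processes one bit per inner loop step, which is the kind of work a hardware CRC instruction replaces with a single operation on a whole word.

```python
def crc32c(data):
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial when that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


check = crc32c(b"123456789")   # standard check input for CRC algorithms
```

The value of `check` is the well-known CRC32C check value 0xE3069283.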

A

Okay, any questions?

B

There was a question in the chat: can you explain the peaks and troughs in the osd_client_message_cap graph?

A

You mean this one?

C

That looks about right, yeah.

C

It wasn't clear to me why that's such a noisy line: whether there's some architectural reason why, say, 200 is better than 400, or whether it's just that you've got a lot of noise in your benchmarking. The shape of that graph was quite confusing to me, and I was wondering if you had an explanation for it.

A

Hi, Kevin.

A

Okay, so, hi.

D

Yeah, so actually, when we are setting the throttle, when we are opening the throttle, the variation in the data is quite small, but when it is closed more, the data varies, and you can see the peaks and troughs. Actually, all the data

D

we have taken from the experimental environment.

E

I have a related question about this graph. Was there a big queue depth when this test was done? Did you have to have a very deep queue before you would see this improvement, or did you also see an IOPS improvement with a shallow queue?

D

So, hi. You mean the testing environment, I mean the testing cluster?

E

Yeah, for this test in particular. My intuition says that making this client message cap high will

E

only help if you have a really deep queue, which would mean really high latencies, and I'm not sure if that's going to impact real cluster workloads, or if you necessarily want those really deep queues. But I'm not sure. I'm wondering.

D

So, Chunsong, could you share, actually, what is the scale of the cluster that we set up to get this data?

D

Okay, so, hi. We actually have three OSD nodes, and on each OSD node we've got 20 OSDs, so 60 in total. Okay, sorry, 60 in total. Sorry.

E

Okay, that makes sense. First of all, thank you for this talk. This is full of really good information and I really enjoyed it. I have some other questions too, if you have a second. On the first section, where you talked about the NUMA affinity, there was some code we added in Octopus, maybe, that tries to automatically pin OSDs to a node when the NIC and the NVMe are on the same NUMA node.

E

I'm just curious whether that worked on your system, or whether you had to manually pin the OSDs to NUMA nodes in order to get the nice pinning.

D

So we can go to the NUMA page.

E

Yeah, number one, back.

D

Number one, yeah.

E

Yes. Did the automatic pinning work, or did you have to do this manually? That's the question.

D

We are using a version beyond Ceph 14, which supports setting the NUMA affinity automatically. So we don't actually need to set it manually.

E

That's excellent. Okay, that's good news, glad to hear that. The CPU partitioning that you mentioned is a new concept to me; I didn't realize you could do that. I would be curious to see details of how you set that up. I don't know if you could send us a follow-up email, or maybe this is all common knowledge.

E

I guess the question in my mind is whether this is also something that we can make the OSD do automatically.

E

For example, if we told the OSD that it gets eight cores, it could automatically divvy up its threads, so there are four each or something across the BlueStore side and the OSD op queue side.

E

Do you have any sense of whether that's something that we can do automatically, or is this something that you think has to be manually set up in order to get this type of

D

Yeah, I'm not getting the message clearly here.

E

Your third point. Maybe you go forward.

E

Yeah, one more.

E

Or maybe four. One more. Five.

E

Next one. This one, yes, this one here. I'm wondering if this is something that you think we can make the OSD do automatically on its own, so it separates those two thread pools, or whether it has to be manually set up.

D

So you mean that we have set these OSDs automatically?

E

Do you think it could be done automatically? I guess that is the question.

E

Ideally, we would want the OSD to just do this on its own, so it can maximize the performance. Do you think that's feasible, and if so, what do you think would be necessary to make that happen?

E

Does that make sense to you? Maybe you can

F

I think the question is just: how can we partition these threads and assign them to different cores? Is this feasible? Can we do this programmatically, by writing code and letting it do this on our behalf, so it can improve the performance without human intervention?

A

We modified the code and added two parts. One is

A

in ceph.conf: this one is one set of cores and this one is the misc cores, and so we can configure them in ceph.conf.

A

The configuration file is like this.

E

I see, okay.

A

I added two configuration options for the cores. One modifies the thread affinity of the messenger worker and tp_osd_tp threads, because these threads control the data flow. The BlueStore kv sync, kv finalize, and BlueStore AIO threads go to the misc cores. There are also a few other threads, maybe signal handling. We use this to separate the two kinds of threads into two areas.

E

I see. Okay, yeah, I wonder if this is something that we could teach cephadm to do, since it has the whole-node view. If we know that a certain number of cores are dedicated to OSDs, then it could

E

divvy up all the ranges on all the OSDs for you.

E

Anyway, that's awesome. A couple of other quick questions. There were two code changes you had for the page sizes. Have you submitted patches or pull requests for those upstream?

E

Let me see if they're even in master. Oh, they're already in master. Excellent, okay, awesome, great news. And then there was the point you mentioned about write amplification. If I remember correctly, I think the most recent change there was that our BlueStore metadata is written using direct I/O, which means it shouldn't have any write amplification in this case, but the reads are still using buffered I/O. I don't know;

E

maybe you were paying attention there, if you remember, but I think this issue should already be dealt with in master. Is that right?

F

I cannot exactly recall when direct I/O was added for the write operation.

E

Yeah, okay. But the specific problem was that, if we're doing buffered I/O with large pages, then we have this issue.

E

But if it's direct I/O, then we wouldn't have that issue.

E

Yeah, I think so. Okay, and then the last thing was the tcmalloc and kernel page size. What's the tcmalloc option that you were tuning to control the tcmalloc page size?

E

I wonder if that's also something that we can make sure that either the ceph-osd process itself, or the orchestration, or whatever, is setting automatically.

D

Actually, he is also referring to the tcmalloc optimizations we have done.

E

Yeah, so here, what's the option?

D

What is the option?

F

It is hard-coded in tcmalloc. Unless we recompile it, we cannot change it. So if we use the package shipped by the distro, we're stuck with whatever we have.

F

I mean, on these ARM systems, you can

E

You can choose the page size when you boot up, right? Like the page size is a

F

Yes, that's my impression.

E

I wonder if what we would need to do is have alternate compiled versions that are both inside the container, and when we start up, we somehow decide which one to LD_PRELOAD, based on the page size. Would that

F

be based on the distro version, right?

E

Yeah, or, I mean, it's not the distro version, right? Because if you reboot, it's like a kernel option, isn't it? So in the startup script, it would have to look at the current running page size and then LD_PRELOAD the right compiled variant of tcmalloc based on that when starting the OSD.
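The startup idea floated here, which is to look at the running page size and preload a matching allocator build, could be sketched roughly as below. Everything specific is hypothetical: the library paths in `TCMALLOC_BUILDS` are invented, and no such per-page-size builds are known to ship anywhere. The sketch only builds an environment dictionary and does not start an OSD.

```python
import os

# Hypothetical library paths: one tcmalloc build per supported page size.
TCMALLOC_BUILDS = {
    4096: "/usr/lib64/tcmalloc-4k/libtcmalloc.so.4",
    65536: "/usr/lib64/tcmalloc-64k/libtcmalloc.so.4",
}


def choose_tcmalloc(page_size, builds=TCMALLOC_BUILDS):
    """Return the library built for this page size, or None if there is none."""
    return builds.get(page_size)


def osd_environment(base_env, page_size):
    """Copy base_env and add LD_PRELOAD when a matching build exists."""
    env = dict(base_env)
    lib = choose_tcmalloc(page_size)
    if lib:
        env["LD_PRELOAD"] = lib
    return env


# The kernel page size of the running system, as the startup script would see it.
running_page_size = os.sysconf("SC_PAGE_SIZE")
env = osd_environment(os.environ, running_page_size)
```

A launcher could then pass `env` to the process that starts the OSD, for example through the `env` argument of `subprocess.run`.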

F

Once we figure out the page size, we can restart that container and use the right setting.

E

Yeah, or just do it at startup, because with cephadm at least, there's a bash script that actually runs the OSD process, and so it could pass in the right environment variable or

F

is unchangeable. Only CentOS and RHEL have the 64K large page, and on other distros we have the smaller 4K page. Is that right?

E

Does that mean that's what the kernel supports, you mean?

F

Yes, it's preconfigured, so it cannot be changed, even

E

at boot time. Okay, okay.

E

Okay, and I guess the last thing I had was just a quick question about the RocksDB CRC backport that you did. Maybe we should include that patch in the version of RocksDB that we're building with, or possibly, for Quincy, we should just fast-forward to whatever the latest RocksDB release is, which hopefully includes that patch.

E

But for that

F

I think we already included the CRC fix in Pacific. That's also mentioned in Chunsong's slides.

E

Oh, there it is. Yeah, got it. Okay, cool.

E

Awesome, that's all I had. This is great. Thank you for this talk.

B

Do we have any other questions for Chunsong?

B

All right. Well, we've got a little bit of a gap in our schedule, so we have about 19 minutes before the next presentation, with Anthony on Intel flash. But I wanted to thank Chunsong for taking the time to present to us, as well as Kevin for helping with the translations and with the questions. So thank you, everyone, for your time. Appreciate it.