From YouTube: Ceph Month 2021: Optimizing Ceph on Arm64
Description
Presented by: Richael Zhuang
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
A
Hello, I'm Richael, and my topic is Ceph on Arm64. It's about our practice on the Ceph storage ecosystem on Arm servers. This is the agenda of my slides. Firstly, I will give an overview of the Ceph-related work we have done on Arm servers.

Yeah, this is the framework of Ceph. Our practice to enable and improve the Ceph storage ecosystem on Arm64 includes using Arm-specific instructions or features to do some common library optimizations, like UTF-8, CRC, and ISA-L. Some Ceph features require the support of surrounding projects, so we also enabled those projects on Arm64, including SPDK, Seastar, Ceph-CSI and so on. Besides, we also tried to do some optimization in these projects, and support for using Ceph as the storage backend for OpenStack and Kubernetes on Arm64 has also already been verified.
For UTF-8, in the original implementation encoding and checking are done byte by byte, which is not so efficient; our optimization can give up to an 8x boost for string validation and a 50% gain for string encoding by operating on several bytes at a time.
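As an illustration of the several-bytes-at-a-time idea, here is a minimal sketch (not the actual Ceph patch): a chunk of pure ASCII can be validated with one 64-bit test instead of eight byte tests, and only non-ASCII chunks need the full UTF-8 state machine.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Returns true if the buffer is pure ASCII (a valid-UTF-8 fast path).
// A byte is ASCII iff its top bit is clear, so one 64-bit AND checks
// eight bytes at once; the tail is handled byte by byte.
bool is_ascii_fast(const unsigned char* s, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t chunk;
        std::memcpy(&chunk, s + i, 8);      // safe unaligned load
        if (chunk & 0x8080808080808080ULL)  // any high bit set?
            return false;                   // hand off to the full UTF-8 check
    }
    for (; i < len; ++i)
        if (s[i] & 0x80) return false;
    return true;
}
```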
Another optimization is about the widely used CRC32 implementation; this one is based on the Armv8 PMULL instruction.
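For context, Armv8 also exposes dedicated CRC32 instructions as compiler intrinsics; the sketch below uses those (the PMULL carry-less-multiply folding used for large buffers is more involved and omitted here):

```cpp
// Compile on Arm64 with: g++ -O2 -march=armv8-a+crc crc.cpp
#include <arm_acle.h>   // __crc32cd / __crc32cb intrinsics
#include <cstdint>
#include <cstddef>
#include <cstring>

// CRC32C over a buffer using the hardware instructions, 8 bytes at a time.
uint32_t crc32c_hw(uint32_t crc, const uint8_t* data, size_t len) {
    while (len >= 8) {
        uint64_t v;
        std::memcpy(&v, data, 8);
        crc = __crc32cd(crc, v);
        data += 8;
        len -= 8;
    }
    while (len--) crc = __crc32cb(crc, *data++);
    return crc;
}
```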
Besides, because ISA-L provides multibinary versions of some functions, developers can deploy a single binary with multiple function versions and then choose among them at runtime. For Arm64, we added utility functions to get the CPU feature set and provided a framework to pick the function version based on the feature set, and now we are working on some other algorithms like AES-XTS and multi-hash SHA1 plus murmur3.
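ISA-L implements that dispatch in assembly at first call; the sketch below shows the same idea in C++ on Arm64 Linux, reading the kernel-reported feature bits with getauxval (the sum_* functions are illustrative placeholders, not ISA-L APIs):

```cpp
#include <sys/auxv.h>   // getauxval(AT_HWCAP)
#include <cstdint>
#include <cstddef>
#include <cstdio>

#ifndef HWCAP_PMULL
#define HWCAP_PMULL (1UL << 4)   // bit position from <asm/hwcap.h> on arm64
#endif

// Two stand-in implementations: a real library pairs a portable routine
// with a PMULL/NEON-accelerated one.
static uint64_t sum_portable(const uint8_t* p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
static uint64_t sum_fast(const uint8_t* p, size_t n) {
    return sum_portable(p, n);   // placeholder for the accelerated body
}

using sum_fn = uint64_t (*)(const uint8_t*, size_t);

// Resolve once, based on the CPU feature set the kernel reports.
static sum_fn resolve() {
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) ? sum_fast : sum_portable;
}

int main() {
    uint8_t buf[16] = {1, 2, 3};
    std::printf("%llu\n", (unsigned long long)resolve()(buf, sizeof buf));
}
```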
Yeah, we also enabled and bench... sorry, benchmarked Ceph on a 64 KB kernel page, because Arm64 has 64 KB kernel page support, and a large page has some benefits on Arm platforms compared with the small 4 KB page.
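As a side note, a quick way to confirm which kernel page size a node is actually running with (a minimal sketch, nothing Ceph-specific):

```cpp
#include <unistd.h>
#include <cstdio>

// Prints 4096 on a 4 KB-page kernel and 65536 on a 64 KB-page kernel.
int main() {
    std::printf("%ld\n", sysconf(_SC_PAGESIZE));
}
```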
Here we used a Ceph test cluster with one monitor, one manager, and three OSDs to do the benchmark. Each OSD backend is one P4610 NVMe SSD. We tested RBD sequential and random read and write with different block sizes.
The yellow bars are the Ceph RBD bandwidth with a 4 KB kernel page size, and the orange ones are with a 64 KB kernel page size. From the graphs we can see we get about a 3 to 11 percent boost for sequential read, depending on the block size from 4 KB to 1 MB; an 8 to 22 percent boost for sequential write; 6 to 10 percent for random read; and about 6 to 15 percent for random write. So Ceph benefits in each of our test cases when using the 64 KB kernel page size.
In addition to the work we have done, we are investigating new optimization points, like leveraging a new feature, the Scalable Vector Extension (SVE for short), to do some optimization. We can also try to leverage the non-temporal instructions to prevent cache pollution in some cases, and we are also investigating RocksDB's compression libs for optimization potential.
And then let's go to Ceph storage with SPDK. Yes, SPDK, the Storage Performance Development Kit, can achieve high performance by moving all of the necessary drivers into user space and operating in polled mode instead of relying on interrupts, which avoids kernel context switches and eliminates interrupt-handling overheads.
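A schematic of the polled-mode idea (nothing below is SPDK API; the atomic counter is a stand-in for a device completion queue):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> completions{0};   // stand-in for an NVMe completion queue

void device() {                    // pretend hardware completes 5 IOs
    for (int i = 0; i < 5; ++i) completions.fetch_add(1);
}

// Polled mode: a dedicated user-space thread spins on the queue, so there
// is no interrupt and no kernel context switch per IO; that is where the
// latency win comes from.
void poll_loop() {
    int handled = 0;
    while (handled < 5) {
        int n = completions.exchange(0);   // drain whatever is ready
        for (int i = 0; i < n; ++i) std::printf("completion %d\n", ++handled);
    }
}

int main() {
    std::thread d(device), p(poll_loop);
    d.join();
    p.join();
}
```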
So SPDK can be used to accelerate the block service built on Ceph. We can use SPDK's user-space NVMe driver instead of the kernel NVMe driver in BlueStore, and besides, as mentioned earlier, the SPDK iSCSI target or NVMe over Fabrics target can be leveraged to accelerate client IO performance on a Ceph cluster.
Unlike x86's strong memory model, Arm has a weak memory model, and this essentially means that few guarantees are given as to the observed order of CPU memory accesses. Thus any load or store operation can effectively be reordered with any other load or store operation, as long as it would never modify the behavior of a single, isolated thread.
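A minimal C++ illustration of what the weak model means in practice: publishing data across threads needs a release store paired with an acquire load, because plain stores may be reordered on Arm even though each thread looks correct in isolation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // 1: write the data
    ready.store(true, std::memory_order_release);  // 2: then publish the flag
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // wait for the flag
    assert(payload == 42);  // guaranteed: acquire synchronizes with release
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```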
Apart from the above, we also did some optimization in SPDK NVMe over TCP. For example, we can leverage TCP's incoming-CPU feature to get the CPU affinity of a socket, and then we can distribute the processing of this socket to specific CPUs, which provides optimal NUMA behavior and keeps the CPU cache hot. This can bring about a 10 percent performance boost in our tests, and all the related patches are already in the SPDK project. There are some other optimizations as well, but I'll not go further on this.
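The socket feature referred to here is, as far as I can tell, Linux's SO_INCOMING_CPU option; a minimal sketch of querying it for a connected TCP socket (the actual scheduling of the socket's work onto nearby CPUs is left out):

```cpp
#include <sys/socket.h>
#include <cstdio>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49   // value from <asm-generic/socket.h>
#endif

// Returns the CPU the kernel last processed this socket on, or -1.
int incoming_cpu(int sockfd) {
    int cpu = -1;
    socklen_t len = sizeof(cpu);
    if (getsockopt(sockfd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) != 0) {
        std::perror("getsockopt(SO_INCOMING_CPU)");
        return -1;
    }
    return cpu;   // pin this socket's processing near that CPU
}
```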
The yellow bars are based on BlueStore without SPDK, and the orange ones are based on SPDK. From this graph, actually, we don't see any obvious performance improvement in our test case, yeah. There is even some drop in write performance here, since it's hard to get benefit from SPDK with the traditional Ceph OSD framework.
Yeah, note that the Ceph OSD itself is being refactored based on Seastar, a high-performance framework, for the age of persistent memory and fast NVMe storage systems.
Yeah, about the Seastar work: what we have done includes upgrading Seastar's DPDK to leverage new hardware. Seastar uses a DPDK path to achieve zero copy between the Seastar heap and the NIC. It used physical addresses for DMA, obtained by referencing the file /proc/self/pagemap, which is a legacy method; this update leverages the IOMMU to map Seastar heap virtual addresses directly to IO virtual addresses for DMA, which makes full use of modern hardware and significantly simplifies the code. This upgrade also moves to the new DPDK APIs.
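For reference, a minimal sketch of the legacy /proc/self/pagemap translation (entry layout per the kernel's pagemap documentation: bit 63 = present, bits 0-54 = physical frame number; modern kernels require CAP_SYS_ADMIN to see real frame numbers):

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Translate a virtual address to a physical one via /proc/self/pagemap.
// Each virtual page has one 64-bit entry in the file.
uint64_t virt_to_phys(const void* vaddr) {
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 0;
    uint64_t entry = 0;
    off_t off = ((uintptr_t)vaddr / page) * sizeof(entry);
    ssize_t n = pread(fd, &entry, sizeof(entry), off);
    close(fd);
    if (n != sizeof(entry) || !(entry & (1ULL << 63)))  // page not present
        return 0;
    uint64_t pfn = entry & ((1ULL << 55) - 1);          // bits 0-54
    return pfn * page + (uintptr_t)vaddr % page;
}

int main() {
    int x = 0;
    std::printf("virt %p -> phys 0x%llx\n", (void*)&x,
                (unsigned long long)virt_to_phys(&x));
}
```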
Besides, to make Seastar work on Arm64, we also fixed some bugs, like a network stack crash issue on Arm, and fixed Seastar's 64 GB per-shard memory limit. As for the network stack crash issue, it's actually caused by a code snippet like this, yeah.
It's a function call with two parameters: the first parameter is an expression that frees a pointer, and the second is an expression that dereferences that pointer. This was okay on x86, as parameters were evaluated from right to left, in stack-pushing order, so the second parameter, which uses the pointer, was evaluated before the first one. But it fails on Arm, as parameters are evaluated from left to right to leverage the abundant general-purpose registers, so the pointer is freed before it is used, which can lead to a crash.
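A distilled, hypothetical reproduction of that pattern (not the actual Seastar code): the C++ standard leaves the evaluation order of function arguments unspecified, so the same source is a use-after-free on one platform and fine on another.

```cpp
#include <cstdlib>

struct Buf { int value; };

int take_and_free(Buf* b) { int v = b->value; std::free(b); return v; }
int read_only(const Buf* b) { return b->value; }

void f(int freed, int used) { (void)freed; (void)used; }

int main() {
    Buf* b = static_cast<Buf*>(std::malloc(sizeof(Buf)));
    b->value = 1;
    // BUG: if take_and_free(b) is evaluated first (as observed on Arm64),
    // read_only(b) is a use-after-free. The fix is to force an order with
    // named locals:  int u = read_only(b); int fr = take_and_free(b); f(fr, u);
    f(take_and_free(b), read_only(b));
}
```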
Yeah, this is the Seastar httpd benchmark on Arm servers. We can see that the performance scales up almost linearly with the addition of CPU cores. But actually we haven't verified and benchmarked Ceph SeaStore on our servers; I will do this later, yeah, and see how the performance is.
The current Ceph-as-OpenStack-storage backend is mature, and the basic functions, like using Ceph as the OpenStack Swift, Cinder, and Glance backend, all work well on Arm64. In addition to OpenStack, we have supported Ceph as the Kubernetes container cloud storage backend on Arm64. We added official support for some critical container images, such as the Kubernetes CSI sidecar images, and added Arm64 image support for Ceph-CSI; the Ceph-CSI community also added Arm64 jobs in its community CI.
Yeah, we also support Rook on Arm64, which can simplify the deployment of a Ceph cluster in Kubernetes, and the follow-up related work we're trying to do includes supporting the Container Object Storage Interface on Arm64 and some work on Kubernetes storage e2e test improvements, yeah.
I think that's all I want to share here. Any questions?
C
Hey, thanks. Well, it was interesting. I guess I'm a naive user, but I didn't realize that other architectures can implement the ISA-L libraries. So my question is basically: until now we never used ISA-L for erasure coding, because we were afraid of being locked into Intel, from an operations point of view. Can we safely use ISA-L erasure coding and then be free to move to other, non-Intel CPUs in the future, with the same OSD pools, same OSDs, or same pools?
D
Hello, yes, yeah. I work with Richael, so I'm more familiar with ISA-L; I did some work in the ISA-L library. You mean ISA-L, the Intelligent Storage Acceleration Library? Yes.
D
So for ISA-L, we actually contributed some Arm-related acceleration code to the ISA-L library, and for EC, I remember we upstreamed an Arm-based implementation of the erasure coding you see in the ISA-L library. So the ISA-L library supports both Intel and Arm.
C
I mean, yeah, thanks, Sage. It was Andreas at CERN who wrote the first ISA-L support for EC, but yeah, like I said, we've never even used it, because we had the impression that our vendors could switch us to AMD at any time. Actually, our next block is EPYC, and we were just afraid to ever use it.
E
I guess the other thing you might take a look at: I believe there's an erasure-code non-regression data corpus that's part of the repository. I'm not sure if it's been refreshed with new data as new EC libraries have been added, but it probably should be. The idea there is that there's a whole bunch of erasure-coded data stored in git, and then the unit tests just read it and make sure that it's readable.
E
Thank you. Actually, I guess, since we have a couple of minutes, I'm just curious here: John in the chat was mentioning a Raspberry Pi 2 / Pi 4. I'm curious what people who are running stuff on Arm on Raspberry Pis are using for the actual storage, because maybe the Pi 4 has better data ports, but the three, I think, didn't, right? You had to use like a USB adapter or something like that.
B
I actually have a colleague who has a Raspberry Pi cluster at home. He talks about it all the time, way too much.
B
I don't actually know what he's using for the data. I always assumed it was just USB drives and that he just maxed out the data ports you're talking about, and then if you have like five or six, then you know that's the limit. But five times three is fifteen OSDs, right? But I don't actually know; I can ask him. I think he's on holiday at the moment.
E
I guess that question could go to Mike, actually, because you put together a Raspberry Pi cluster, right? Yeah, it's along the same lines of using a USB adapter, though, as well. Yeah, okay.