Description
Presented by: Arthur Outhenin-Chalandre
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
A: So, what is RBD at CERN? It's four RBD clusters for OpenStack Cinder, about seven thousand volumes, which represents eight petabytes raw. With this project we want to allow users to replicate between different clusters, meaning between different data centers or different rooms in the same data center.
A: The objective is to provide a disaster recovery solution for Cinder volumes. We want to enable that on a subset of our images, and we want the users to be able to choose which images have this feature enabled, so this means a deep integration with OpenStack. We also don't want users who enable this feature to see a big performance impact on their workload, and we want to be able to replicate a massive amount of data.
A: So this means we need suitable replication performance in Ceph. How is RBD replication done in Ceph? It's handled by a daemon called rbd-mirror, which basically reads the state of RBD images from a source cluster and replays it on a target cluster. This is a really high-level overview that I will explain a bit more later. There are currently two operation modes supported, RBD journaling and RBD snapshots, and this talk is partly about comparing those two. But first, our test setup.
A: This is running Ceph Octopus. We have six test machines with 60 OSDs in a hybrid scenario, with HDD for the data and SSD for the DB. We also have 18 SSD OSDs for the RBD journal, so these are dedicated to the RBD journal. And we have a number of clients running fio with random write workloads. So first, RBD journaling.
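For illustration, a random-write workload of this kind can be generated with fio's rbd engine; the pool, image, and client names below are placeholders, not the actual test configuration from the talk:

```shell
# Hypothetical fio invocation approximating the test workload:
# random writes against an RBD image via librbd.
fio --name=rbd-randwrite \
    --ioengine=rbd --clientname=admin \
    --pool=volumes --rbdname=test-image \
    --rw=randwrite --bs=4k --iodepth=32 \
    --direct=1 --time_based --runtime=300
```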
A: In this mode the RBD client writes data to the RBD image as usual and also to the journal, so this is kind of a double-write scenario. We chose this mode first because it has full support in OpenStack: you can basically set up the replication, manage it with Cinder, and in case of disaster keep business continuity and fail over to your secondary site, provided you have a certain OpenStack cluster design and meet some requirements in OpenStack.
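As a rough sketch, journal-based mirroring is enabled per image along these lines (pool and image names are examples; a full setup also needs the peer clusters bootstrapped and an rbd-mirror daemon running on the target side):

```shell
# Example only: per-image journal-based mirroring (Octopus-era syntax).
rbd mirror pool enable volumes image          # put the pool in per-image mirroring mode
rbd feature enable volumes/vol1 journaling    # journaling feature is required for this mode
rbd mirror image enable volumes/vol1 journal  # start mirroring this image
```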
A: The first problem is the mirroring performance: the replays are quite slow with the default settings. We observed in our tests about 30 megabytes per second per image. This is per image, and one rbd-mirror daemon can manage multiple images, so it can scale a bit with the number of images. From a per-image point of view, though, there is a really high risk that the replicated image will lag behind and that the journal will not be trimmed on the source cluster.
A: Fortunately, there are some options for tuning this. In our tests we managed to increase the replay speed to about 40 or 50 megabytes per second per image. That's still not really sufficient, so this is a big problem for us in this mode.
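The tuning referred to here is in the area of the journal and rbd-mirror replay options; the option values below are purely illustrative, not the settings used in these tests:

```shell
# Illustrative tuning knobs for journal replay speed
# (values are examples, not the ones used in the tests above).
ceph config set global rbd_mirror_journal_max_fetch_bytes 33554432
ceph config set global rbd_journal_max_payload_bytes 8388608
```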
A: This is about a fifty percent performance impact, and that is really problematic for us, because we do not want users who have high-bandwidth workloads with big block sizes to suffer that much in terms of performance when they enable replication.
A: Fortunately for us there is another mode, RBD snapshot mirroring. In this mode you take snapshots at various times on your images, so this is not continuous replication like before; it's point-in-time replication based on snapshots. The rbd-mirror behavior is a bit like what you get if you create snapshots and run rbd export-diff and rbd import-diff from the rbd CLI.
A
So
that's
not
a
real
problem
and
the
replay
are
really
fast.
We
read
the
default
settings,
so
you
can
expect
at
least
in
our
test
cluster.
We
we
saw
200
megabytes
per
image,
and
this
is
really
great
because
our
cluster
is
not
able
to
do
more
than
like
400
megabytes
or
500
megabits
total.
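As a sketch, snapshot-based mirroring is enabled and scheduled roughly like this (image names and the one-minute interval are examples, not a recommendation):

```shell
# Example only: per-image snapshot-based mirroring with a schedule (Octopus).
rbd mirror image enable volumes/vol1 snapshot    # no journaling feature needed
rbd mirror snapshot schedule add --pool volumes --image vol1 1m
rbd mirror snapshot schedule ls --pool volumes --recursive
# A mirror snapshot can also be triggered manually:
rbd mirror image snapshot volumes/vol1
```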
A: So now it's time for the conclusion. We've seen in this presentation that journaling has some performance issues, so our next goal is to go with snapshot mirroring, and we will continue our tests further to report or fix, if possible, any bugs that we may find in RBD replication.
B: I have another quick one: what was the frequency of the snapshots you were taking? What SLA were you targeting?
A: In my initial testing I tested with the minimal rate possible, which is one minute, but we haven't settled on any definitive setting for that, because there is also the ongoing work on fsfreeze with the snapshots. So we will probably end up with a lower snapshot frequency than that.
C: Have you encountered any issues with the lack of proper support for fs freezing, and possibly application-level freezing, yet? Because in my experience this is definitely something that you want, but many people don't even realize that they don't have it. So I'm curious if you actually ran into any issues there.
A
No,
no,
I
I
we
like
we,
we
don't
have
some
beta
testing
or
things
like
that
for
users.
Yet
so
we
do
not.
We
do
not
encounter
this
sort
of
problem,
but
yeah.
A: No, we haven't had any issues with the missing fsfreeze support so far.
B: I have, I guess, one other question about the journal mode. I mean, it's expected that if your write rate is limited by the cluster, then because you're doing this double write you'd see half the performance. I'm just wondering if that would be the case in general: for example, with a much larger RBD cluster, the performance that a client sees is not necessarily going to be limited by the available cluster bandwidth.
A: I have some backup slides, maybe I can... okay, yeah. This is our Ceph cluster performance, and here we compare the HDD pool to the SSD pool. To my understanding, the really large difference between them at 4k helps with the performance impact, which is really small there, like 30 or 40 percent; the roughly 30 percent difference between HDD and SSD at bigger writes does not help that much with performance in this mode.
D: One parameter that might affect this is the journal object set. When you are writing to an image, you are writing to different objects located in different places, so as you have more OSDs, your writes are spread among more OSDs. But when you are writing to a journal, you are writing to just a small number of objects, called the journal object set, and only when they are filled do you move on to writing the next set.
A: Actually, we tried that, but the results were kind of similar, yeah.