From YouTube: Ceph Month 2021: RBD latency with QD=1 bs=4k
Description
Presented by: Wido den Hollander
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
A
Yeah, so we're on our second presentation for the day: RBD performance with a queue depth of one and a block size of 4K, and Wido is nice enough to go ahead and give us a nice lightning talk presentation here. So Wido, will you please take it away.
B
Yeah, sure, I'll keep it brief, because this is a lightning talk, talking about RBD performance with a queue depth of one and a block size of 4K. So you might be asking yourself: why QD=1 and a 4K block size? Well, what I would say, from my experience, is that single-threaded I/O latency is very important for many applications.
B
But in the end, if you look at the latency of a single I/O, it can be pretty high, and, for example, a PHP web server, or a MariaDB database, or a Redis cache when it's flushing to its disk, they're all doing single-threaded I/O, and then the latency of that single-threaded I/O starts to matter. That's what you notice; it's how snappy applications feel, and you measure it by using a queue depth of one. So, with all the benchmarking I do with Ceph:
B
Almost always, I start with a queue depth of one and a block size of 4K. That's my starting point, and from there I start increasing the queue depth and increasing the block size, and then we get more information about the performance of the cluster. But it all starts with a queue depth of one and a block size of 4K. So, low-latency Ceph. Well, you should understand that Ceph itself will never provide you the lowest latency possible. That's because Ceph was designed for other things than latency: it was designed for redundancy, scalability, and data safety.
B
If you take a local NVMe and put it in your laptop or your server, it will get way better latency than Ceph will ever provide you, because we need to go over the network, over TCP/IP. Then it goes to the CPU, and the CPU does its thing; then the Ceph code, which runs on that CPU, does its thing, and then it writes to three nodes. So keep in mind we're usually replicating two or three times, and that simply takes time.
B
So writing a block in Ceph will be slower than on other types of storage. However: redundancy, scalability, and data safety. I always say I have never seen Ceph itself lose data; it was always something else that happened, you know, lots of hardware failing, but Ceph itself just cares about your data. Performance is the second or third priority for Ceph; safety is the number one priority. So, but what can we achieve? What can we get out of Ceph in terms of IOPS?
B
I do benchmarking with fio, with a super simple configuration. We take the I/O engine rbd, we use the pool rbd, and I have an image there called fio1. Make sure you run these tests multiple times, because you need to pre-populate the RBD image by running the test a couple of times. Then you simply say: I have an iodepth of one and a block size of 4k, I run the test, and after 60 seconds it tells me how fast, or how low, the latency of the Ceph system is.
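
For reference, a minimal fio job file matching what is described above might look like the sketch below; the job name, write pattern, and time_based setting are assumptions not spelled out in the talk:

    [global]
    # fio's librbd engine, with the pool and image named in the talk
    ioengine=rbd
    pool=rbd
    rbdname=fio1
    # direct I/O, so no flushes are sent
    direct=1
    bs=4k
    iodepth=1
    # report results after 60 seconds
    runtime=60
    time_based=1

    [qd1-randwrite]
    # assumption: a random-write workload
    rw=randwrite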
B
So, some hardware setup: I took some Supermicro systems with an AMD EPYC 16-core CPU, 256 gigabytes of memory, four Samsung PM SSDs, and then 100-gigabit networking with Mellanox. Now, a few things to mention here. The main performance gain you're going to get is from pinning your CPU C-state to one; that's a kernel parameter, and you can look it up on Google.
B
You can find how to tune it. Also set the performance profile of the CPUs to "performance", which means the CPU will run at the maximum clock speed it can; I think in this case that's 2.4 gigahertz. Then you get the lowest latency from the code possible.
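
A common way to apply these two tunings on a Linux host is sketched below; the exact parameter names vary by distribution and CPU, and are an assumption on my part rather than something spelled out in the talk:

    # Limit the deepest allowed C-state via a kernel boot parameter,
    # e.g. added to the kernel command line in the GRUB config:
    #   processor.max_cstate=1

    # Set the CPU frequency governor to "performance":
    cpupower frequency-set -g performance

    # Alternatively, disable deep idle states at runtime
    # (all states with an exit latency above 1 microsecond):
    cpupower idle-set -D 1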
B
The 100-gig networking doesn't matter; I had 100 gig here, so that's what we used, but the amount of bandwidth we're using for this test is a few megabytes per second, not gigabits per second. So 25-gig networking works, and 10 works; 10 is slightly slower, but keep in mind, it's very, very slightly slower.
B
So what can we achieve? 1,364 IOPS is what I was able to get with this hardware. That's a write latency of 0.73 milliseconds for a 4K block being written to three nodes at the same time. So this includes all the replication: the block we just wrote has been written to three different NVMes in three different nodes within one millisecond. That's what we were able to achieve, and, you know, that's fairly good performance.
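
As a quick sanity check on those numbers: at a queue depth of one, IOPS is simply the inverse of the per-I/O latency, and 1 / 0.73 ms ≈ 1370 IOPS, which is consistent with the measured 1,364.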
B
The Ceph Crimson project, which is redesigning the OSD, should provide us better latency. Although Crimson itself is not focused on providing lower latency at the moment, they're just revisiting the code, in the future it should provide lower latency. And then we have the RBD persistent write-back cache, which uses a local NVMe inside the hypervisor to cache I/Os. I tested this with 16.2.4, but it was not stable enough to provide real results which I could present here during this lightning talk.
B
So if you want to get to this: faster CPUs. Higher-clocked CPUs gain you more benefit in terms of latency than more cores, so if you need to invest, go for higher-clocked CPUs with fewer cores. If I go back to the hardware, the reason I chose a 16-core CPU is that this specific Supermicro system can hold 10 NVMes, so we have 16 cores for 10 NVMes. You could also say: if only there were a CPU with 10 cores, but there's none, nor with eight cores. More cores would give you more total I/O for the whole cluster, because that still relies on the number of cores. So it's a balance: if you're looking for lower latency, you need faster CPUs; if you need a higher total amount of I/O for the system, then you need more CPU cores in the whole cluster. And that was my lightning talk about Ceph with low-latency RBD.
B
And if you have any questions, this is where you can find me, or ask them on the users or dev mailing lists, because that's where I hang out as well.
B
In this test, no, the RBD cache was not enabled, because if you look at the RBD cache code, it's write-through until flush. So only if a client sends a flush towards librbd does it enable the cache, and if we go back to the fio configuration, it says direct=1. That means fio is not sending any flushes, so all the I/O being sent by fio is synchronous. So no, the RBD cache is not enabled. I did turn this setting, though.
B
It's called rbd_cache_writethrough_until_flush, and if you set that to false, then it always goes into the RBD cache with caching turned on. Then I think I saw about 10 to 20 thousand IOPS, but yeah, then we're just writing stuff into the memory of the RBD cache.
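
For reference, this option can be set in ceph.conf on the client side; the section placement here is an assumption:

    [client]
    # Default is true: librbd keeps the cache in write-through mode
    # until the client sends its first flush. Setting it to false
    # enables write-back caching immediately.
    rbd_cache_writethrough_until_flush = false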
C
I have a question, this is Dan. Well, it might be for Ilya, if Ilya is still online, but yesterday when Ilya presented, he mentioned some IOPS improvements in librbd, maybe coming in Quincy, I'm not sure exactly when they're coming. But yeah, are they going to also improve queue depth 1 performance? Or maybe, Wido, you know.
B
Well, I doubt it, because I also did benchmarking with the RADOS client, so rados bench, and if you set -t, the number of threads, to one, then you can write blocks of 4K directly to RADOS, and the latency I see with QD=1, so with a single thread on RADOS, is about the same as I see with RBD.
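
A sketch of the rados bench invocation being described; the pool name and duration are assumptions:

    # 60-second write benchmark with one concurrent op (-t 1)
    # and a 4 KiB object size (-b 4096)
    rados bench -p rbd 60 write -t 1 -b 4096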
D
Yeah, I agree, probably not for this test, because those improvements mostly, you know, cut the fat that we had between libraries.
D
Some of it was in librados, but mostly in librbd, because the fat that is in librados is kind of, sort of, still there. What those improvements were targeted at is really all-NVMe clusters and, you know, fairly high queue depths, so probably, probably not for this test.
D
One thing I want to note, as far as fio and flushing: recent versions of fio have actually been modified. The RBD engine within fio will now issue a flush at the beginning of any test, just to deal with that setting.
D
So in the future, you know, if you want to test without the RBD cache enabled, you would need to turn it off in ceph.conf or elsewhere, because fio will issue that single flush at the beginning of the test, just to move librbd into the state where it thinks the client sends flushes, so that it will do caching by default.
B
Okay, that's good feedback indeed, because that will, you know, give different results, and people might get the idea that they're able to do 20 to 30 thousand IOPS with their fio, but actually it's all the cache of librbd. So are you sure that with direct=1 it will still send the flushes?
D
I think so. We did this to address the discrepancy between fio and rbd bench, because rbd bench has behaved this way, where it would possibly send a flush at the beginning of any benchmark, for many years now. fio wasn't doing this, and we were getting complaints that, you know, "here are my rbd bench results, and here are my fio results, and they're vastly different." So at least the intent of the change was to do it even with direct, but I wasn't involved, so I'm not sure; it's just something I wanted to bring up.