Description
Presented by: Anthony D'Atri
Full Ceph Month schedule: https://pad.ceph.com/p/ceph-month-june-2021
A: Intel QLC and the cost picture of Ceph on NVMe; first, the legal disclaimers.
A: Start with cost. TLC, excuse me, I'm still waking up and clearing my throat this morning. The QLC TCO crossover is coming soon, or is already here today; they're competitive now, especially if you consider some subtle factors that some of the online TCO calculators don't include, like the impact on your service, you know, how well your service can run with HDDs. Some people short-stroke HDDs, or limit the size of the HDDs they'll use, because of the interface bottlenecks and the recovery times that they experience with HDDs.
A: Other costs include terabytes per chassis, terabytes per RU, terabytes per watt, and the operational expense of swapping drives, of having to RMA and crush them.
A: A lot of operations, for example, won't bother RMAing something worth under five hundred dollars because of all of the hassle and effort; at some places it costs that much just to get somebody through the door if it's off, you know, in a remote site.
A: These are available in capacities up to 30 terabytes, and you can fit up to 1.5 petabytes of raw space per rack unit with the E1.L EDSFF ruler drives. You know, having come from a time (cough) when my first hard drive was a five megabyte, 14-inch pack, the density we can get today just blows me away. And the IOPS advantage that SSDs give you allows flexible capacity provisioning, where you don't have to provision for IOPS anymore; you can provision for your capacity.
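As a rough sanity check on that density figure, here is a quick back-of-the-envelope calculation in Python, using only numbers quoted in this talk (the 108-drive 2U chassis comes up later in the Q&A) and ignoring decimal-versus-binary rounding:

```python
# Back-of-the-envelope density check using figures quoted in this talk:
# 30 TB-class E1.L QLC drives and the 108-slot 2U chassis mentioned in the Q&A.
drive_tb = 30.0      # decimal terabytes per drive
drives = 108         # E1.L slots in the 2U system discussed later
rack_units = 2

raw_pb_per_ru = drive_tb * drives / rack_units / 1000
print(f"~{raw_pb_per_ru:.1f} PB raw per rack unit")  # ~1.6 PB/RU, in line with
                                                     # the "up to 1.5 PB per RU" figure
```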
A: Performance. SSDs, including QLC, are fast and wide. The Intel D5-P5316 NVMe QLC delivers up to 800K 4K random read IOPS, a 38% increase over the previous generation, and sequential read throughput more than double that of Intel's previous-generation QLC.
A
This
sata
drives
saturated
about
550
megabytes
per
second
and
the
pci
gen4
nvme
interface
crushes
the
the
side
of
bottleneck.
Two
more
osbs
per
device
improved
throughput,
iops,
intel
latency.
This
is
common
practice
with
nvme
drives.
The
nvme
interface
allows
you
to
do
that
and
you
know
work
around
some
of
the
serializations
and
the
code
and
so
forth.
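The talk doesn't name a specific tool for splitting a device, but one common approach is ceph-volume's batch mode; this is a minimal sketch of that idea, with the device path and OSD count as placeholders:

```python
# Sketch: create two OSDs on one NVMe device with ceph-volume's batch mode.
# The device path and OSD count below are illustrative placeholders.
import subprocess

def make_osds(device: str, osds_per_device: int = 2) -> None:
    """Run ceph-volume to split one NVMe device into several OSDs."""
    subprocess.run(
        ["ceph-volume", "lvm", "batch",
         "--osds-per-device", str(osds_per_device),
         device],
        check=True,
    )

# make_osds("/dev/nvme0n1", 2)   # uncomment on a host that should own this device
```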
A: Operational advantages. The RGW service for Ceph object storage, for example, is prone to hot spots and QoS events. One strategy to mitigate latency and interface bottlenecks is to limit the size of the HDDs used, for example to eight terabytes. Other things one can do are to limit scrub intervals, put a CDN on the front end, or throttle or cache on the load balancers in front of multiple RGWs; but sometimes OSDs end up waiting, especially when using EC on HDDs.
A: Replacing HDDs with Intel QLC SSDs for bucket data can markedly improve the QoS and serviceability of your clusters. Reliability and endurance: QLC reliability is better than you think, and it's actually more than you need, it turns out. Most SSD failures are firmware, which you can often fix in place on the drive.
A
You
know
which,
in
my
experience,
is
not
as
true
with
hdds
and
studies,
show
that
99
of
ssds
never
actually
exceed
15
of
their
rated
endurance,
which
was
a
bit
of
an
eye
opener.
Having
come
from
an
assumption
that
you
needed
one
theoretical
drive
rate
per
day
to
do
rbd,
for
example,
the
rgw
service,
you
know
in
one
case
has
been
calculated
for
seven
years
of
endurance
using
the
previous
generation
qlc.
A: Endurance: get with the program/erase cycle. That's a NAND joke, everybody, Google it. The 30 terabyte Intel SSD D5-P5316 QLC SSD, much like the Illudium Q-38 space modulator, is rated at over 22 petabytes of IU-aligned random writes; a one-drive-write-per-day 7.68 terabyte TLC SSD is rated at less than 15 petabytes of 4K random writes.
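To put those two ratings side by side, here is a rough Python check of the petabytes written implied by a drive-writes-per-day (DWPD) rating; the five-year rating period is an assumption the talk doesn't state, and IU-aligned versus 4K-random ratings aren't strictly comparable:

```python
# Rough petabytes-written implied by a DWPD (drive writes per day) rating.
# Assumes a five-year rating period, which the talk does not state explicitly.
def rated_pbw(capacity_tb: float, dwpd: float, years: float = 5.0) -> float:
    return capacity_tb * dwpd * 365 * years / 1000  # PB written over the period

print(rated_pbw(7.68, 1.0))    # ~14.0 PB for a 1-DWPD 7.68 TB TLC drive,
                               # consistent with the "< 15 PB" figure above
```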
A: If you need to, you can adjust it. Reliability and opex: drive failures cost money and quality of service. I've run Ceph clusters where, you know, a drive failure sets off escalations, and, you know, the fewer times any of us have to be awakened at four in the morning...
A: The better eight terabyte HDDs have a 0.44 percent annual failure rate spec, but turn out to have, on average, a one to two percent actual failure rate in production. Intel's DC QLC NAND SSDs, in practice, have an average failure rate of less than 0.44 percent.
A: They have a greater operating temperature range and a better UBER (uncorrectable bit error rate), and, you know, consider the cost to have hands replace a failed drive; I've been in situations where it was four hundred dollars just to get somebody through the door.
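To see why that AFR gap shows up in opex, a small Python sketch; the fleet size is a made-up placeholder, while the AFR and per-visit figures are the ones quoted above:

```python
# Expected annual drive replacements and remote-hands cost for a fleet.
# The fleet size is an illustrative placeholder, not a figure from the talk.
def annual_replacement_cost(drives: int, afr: float, cost_per_swap: float) -> float:
    expected_failures = drives * afr
    return expected_failures * cost_per_swap

fleet = 1000
print(annual_replacement_cost(fleet, 0.015, 400))   # HDDs at ~1.5% AFR  -> $6,000/yr
print(annual_replacement_cost(fleet, 0.0044, 400))  # SSDs at <0.44% AFR -> $1,760/yr
```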
A: Here are some figures comparing publicly available numbers: a couple of hard drive models; in the middle, a TLC SSD model; and then QLC SSDs, with random and sequential writes, and with a figure showing additional over-provisioning. This shows just how much improvement in endurance you can get by adjusting your over-provisioning.
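The slide itself isn't reproduced here, but the arithmetic behind the idea is simple; this sketch assumes the host-side approach of deliberately using only part of the device, which is an assumption for illustration since the talk doesn't prescribe a mechanism:

```python
# Effective over-provisioning when only part of a drive's capacity is used.
def overprovision_pct(physical_tb: float, used_tb: float) -> float:
    """Spare capacity as a percentage of the capacity actually exposed to writes."""
    return (physical_tb - used_tb) / used_tb * 100

print(overprovision_pct(30.72, 30.72))  # 0% extra beyond the drive's built-in spare
print(overprovision_pct(30.72, 28.0))   # ~9.7% additional over-provisioning
```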
A: Optimizing endurance and performance with coarse-IU QLC SSDs: it is beneficial to align the BlueStore min_alloc_size to the IU size, the indirection unit size of the drive, which is 64K for the new drives and 16K for the drives that are already out in the field. Writes that align to IU boundaries, or multiples of them, enhance performance and endurance, and you get that by simply aligning the BlueStore min_alloc_size.
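A minimal sketch of that alignment, assuming the 64 KiB IU quoted above; bluestore_min_alloc_size only applies to OSDs created after it is set, so verify the value for your drive generation and set it before building the OSDs:

```python
# Sketch: align BlueStore's allocation unit to the drive's indirection unit (IU).
# 64 KiB matches the newer coarse-IU QLC drives discussed above; 16 KiB the older ones.
# The setting takes effect at OSD creation (mkfs) time, so set it before building OSDs.
import subprocess

IU_BYTES = 64 * 1024  # assumed IU; confirm against your drive's datasheet

subprocess.run(
    ["ceph", "config", "set", "osd", "bluestore_min_alloc_size_ssd", str(IU_BYTES)],
    check=True,
)
```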
A: You can't, you know, very well align your metadata exactly, but that turns out to be a small percentage of the overall workload, and it doesn't have a large impact on the drives' endurance, because, again, the drives are so big. Some example use cases: RGW large objects; RBD when used for backup, archive, and media (I've experienced RBD used both ways); and CephFS, which has a four megabyte block size, so it seems as though QLC could be a good fit for CephFS, but that's testing we still need to do. Next slide: additional optimizations.
A
There
are
more
things
that
we're
exploring
aligning
the
the
roxdb
block
size
to
the
iu
size
and
exploring
rocks
db
universal
compaction.
A
I
believe
neither
of
those
is
is
by
default
and
they
may-
and
you
know
they
should
give
us
better
performance
and
better
right
amplification
at
the
expense
of
you
know,
using
a
bit
more
space.
But
again
when
you
have
30
terabytes,
maybe
that's
a
maybe
that's
a
good
trade-off.
Their
other
roxdb
tuning
may
be
beneficial.
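On the Ceph side, the knob where such RocksDB experiments would be expressed is the bluestore_rocksdb_options string; the value below is only a placeholder for the universal-compaction idea, the accepted keys depend on the Ceph and RocksDB releases, and setting the option typically replaces the default string rather than appending to it, so validate before use:

```python
# Sketch: where RocksDB tuning for BlueStore would be expressed.
# The option string below is a placeholder, not a validated recommendation;
# confirm the accepted keys for your Ceph/RocksDB versions, and note that
# setting this option generally replaces Ceph's default option string.
import subprocess

rocksdb_opts = "compaction_style=kCompactionStyleUniversal"  # universal compaction

subprocess.run(
    ["ceph", "config", "set", "osd", "bluestore_rocksdb_options", rocksdb_opts],
    check=True,
)
```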
A: Another route is to use Optane to accelerate the WAL and DB on a separate device, or to engage in other write-shaping activities. Crimson, you know, may bring us some optimizations for coarse-IU drives as well. And for RGW we're looking at separate pools for large and small objects with EC, especially with raising the BlueStore min_alloc_size, because RGW's fit for small objects isn't great and you can end up with a bunch of stranded space.
A: There's talk of there being scaffolding within RGW to have multiple pools behind the scenes, more transparently, and so those are some additional things that we can look for in the future. And here's a bunch of references, although PowerPoint really, really likes to make these gray; I'm still learning PowerPoint. And that's the end of the story.
B: Questions. I have some questions. Towards the beginning of the talk you mentioned the new ruler form factor SSDs that let you fit one and a half petabytes in a rack unit, which blows my mind too. Are those ZNS devices, or...?
A: Yuyang, maybe you can chime in here. These are not ZNS; I think... I'm not sure if we're shipping ZNS yet. Those are, you know, the long-format ruler drives.
A: The 30 terabyte, and Supermicro offers, in this fine print that you can't read, Supermicro offers a system that gets you around 1.4 petabytes, about one petabyte raw, and somebody else that I found offers a system that has additional ruler drive slots on the back, so, you know, it's a fairly deep chassis, because you have these long drive bays on the front and the back.
A: But as I calculate it, that's a 2U system that can take, you know, it can take 108 drives, I think, and so, you know, if you do the math, modulo base-2 versus base-10 rounding, in that case you get somewhere in the neighborhood of 3 petabytes of raw space in 2U. In situations where I had to fight tooth and nail to get a single rack unit, you know, literally a PoP that was in a closet in Sofia, Bulgaria, for example, I'm very focused on minimizing rack units.
C: Okay, yeah, sorry, I had some technical issues at the beginning; first it was the audio, now it's the microphone. So, this is Yuyang. Hi, Sage. So yes, I mean, no, this is not a ZNS device. The QLC drive that Anthony introduced today is block-device only, but it's using the coarse indirection unit, meaning that, you know, it's optimized for larger block sizes than the traditional 4K.
C: ZNS, you know, is under development at most of the vendors at this moment; I don't think there is a product that's already, like, available to the broad market yet.
C: Yeah, definitely, we do have a plan to offer the ZNS feature in the future with the ruler form factor.
B: Cool, okay, thanks. A couple of slides after that you mentioned over-provisioning to balance durability versus capacity. Is that, like, an explicit management operation, where you have to instruct the device to do that? Or is it just a matter of writing to a constrained set of LBAs in order to avoid that trade-off?
B: Cool. A couple of slides after this you're talking about reliability, and you mentioned the 0.44 percent AFR for SSDs; but earlier you'd mentioned that usually when SSDs fail, it's, like, a firmware fix and not, like, an RMA hardware failure. Is that AFR...?
C: I can take this one, yeah. So the AFR target for SSDs includes all the failures, not only firmware, although firmware may account for the majority of it. There's also, you know, hardware failure, and also, you know, different kinds of failures for SSDs. Usually, for Intel's NAND SSDs, we have an annual failure rate target of less than 0.44 percent.
C: Well, most of the time we are able to beat this target, and in our QLC NAND example we are way below this 0.44 target. And the reason is, you know, so for QLC NAND right now, Intel is the only one that's using floating gate technology, and, you know, as far as I know, nobody else is using floating gate SSD technology nowadays.
C: So this floating gate technology is really good for having large-capacity NAND components, as well as really fast NAND components, and also reliable NAND components that, you know, have a really good retention rate for their data. So that's why, you know, Intel's QLC SSD has a really low annual failure rate, which is quite different from what the market thinks: usually when people think about QLC, they think the quality is lower than TLC, which is not true at all.
C: The other technology, different from floating gate, is charge trap. So, you know, floating gate means you have two gates in each NAND cell, and you use that floating gate to store the zeros and ones of the data.
B: Okay. And then I guess the last question was: you had a slide where we were talking about the IU and the block size.
B: Yes, yep. I guess maybe it's just... yeah, one more back; one more... there we go. The question was: is the IU something that is exposed in, was it the kernel block device properties in sysfs, or wherever it is, so that you can tell what the preferred write unit size is? Like, is this something that we can make BlueStore automatically detect on the device, so that it automatically sets bluestore_min_alloc_size accordingly? Or is it something that you have to know?
C: Yeah, the answer is yes. I can't remember the, you know, the thing that you have to read, but definitely it is reflected. Okay.
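Neither speaker recalls the exact attribute on the call; as a hedged pointer, the Linux block layer does export per-device I/O size hints under /sys/block/<dev>/queue/, though whether a given coarse-IU drive reports its IU there needs to be verified per device. A small Python sketch:

```python
# Read the kernel's I/O-size hints for a block device. Whether a coarse-IU NVMe
# drive actually reports its indirection unit here must be verified per device.
from pathlib import Path

def io_hints(dev: str = "nvme0n1") -> dict:
    q = Path("/sys/block") / dev / "queue"
    return {name: int((q / name).read_text())
            for name in ("logical_block_size", "physical_block_size",
                         "minimum_io_size", "optimal_io_size")}

print(io_hints())  # e.g. {'logical_block_size': 512, ..., 'optimal_io_size': 0}
```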
B: Okay. And in that case, a pull request that makes BlueStore do that during mkfs would be extremely welcome, because at the end of the day we don't want users to have to know any of this, right? We want them to...
A: You know, I've seen people, you know, using object storage to mirror their git repositories, but then, you know, also to store, you know, TV shows dubbed into Portuguese, right? So a lot of it depends on your workload.
A: I will certainly work on that; it's a very good idea, automatically setting that, and I'll talk to our internal folks who are working on it.
B: The option, or even, like, a health warning or something if this allocation size is low. I don't know exactly how we'd want to do it, but giving some, providing visibility here, and ideally some sane default behavior, would be great.
B
And
the
one
other
thing
I
mentioned
here
is
that
I
mean
you
call
out
that
the
rock's
db
workload
is
a
pretty
small
fraction
of
this,
which
is
is
true,
but
there's
also
a
bluefs
allocation
size
tunable
that
could
be
set.
So
we
could
also
tell
bluefest
to
respect
a
larger
box
size
as
well.
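For completeness, a sketch of checking the BlueFS knob being referred to; the option name and its default vary by Ceph release (newer releases split it into shared and dedicated variants), so treat the name and the 64 KiB IU as assumptions to confirm against your version:

```python
# Sketch: inspect the BlueFS allocation size and keep it a multiple of the drive IU.
# Option names vary across Ceph releases (e.g. bluefs_shared_alloc_size in newer
# ones); confirm with `ceph config help bluefs_alloc_size` before acting on this.
import subprocess

IU_BYTES = 64 * 1024  # assumed indirection unit for the newer drives

out = subprocess.run(
    ["ceph", "config", "get", "osd", "bluefs_alloc_size"],
    capture_output=True, text=True, check=True,
).stdout.strip()

print(f"bluefs_alloc_size = {out}; keep it a multiple of the {IU_BYTES}-byte IU")
```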
A: Yeah, I actually asked one of our internal folks, who was described to me as a BlueStore expert and who works on that, I think he's in the PRC, and he thought that it wouldn't make a difference, or that the bluefs_alloc_size...
D: All right, well, thank you for the presentation, and thank you, everybody, for joining us for our final day of week two of Ceph Month. Be sure to join us for week three, and recordings of these sessions will be posted on the Ceph YouTube channel today as well.