From YouTube: How we operate Ceph at scale
Description
Event: https://ceph.io/en/community/events/2022/ceph-virtual/
Presented by: Matt Vandermeulen
How we operate Ceph at scale
As clusters grow in both size and quantity, operator effort should not grow at the same pace. In this talk, Matt Vandermeulen will discuss strategies and challenges for operating clusters of varying sizes in a rapidly growing environment for both RBD and object storage workloads based on DigitalOcean's experiences
Hi everyone, I'm Matt from the storage systems team at DigitalOcean. This is "How we operate Ceph at scale." There's a lot of content in these slides, so I'm basically going to be speedrunning it.
A quick run through of the agenda: I'll talk a little bit about what DO is and who we are on the storage systems team, then move on to our use of Ceph at DO, how we approach our automation and what we use it for, which leads into operating clusters. I'll finish off with a little bit of reflection, not just on Ceph but on our approaches as well, and we'll wrap up with, hopefully, some time for a hiring plug and some Q&A.
So what is DigitalOcean? We are a cloud provider founded in 2012, based on the core concept of simplicity. We started with the Droplet, a five-dollar SSD-backed virtual machine, which was very attractive in 2012. In 2016 we introduced our second product, Volumes: Ceph-backed, detachable Droplet storage. Since then, our product portfolio has grown significantly, including Spaces in 2017, which is our Ceph-backed, S3-compatible object storage offering, along with DBaaS, DOKS, App Platform, LBaaS and more, most recently serverless Functions and managed hosting with Cloudways. We have data centers in eight different regions, some with multiple metro choices such as SFO2 and 3, which give our customers more than a dozen choices for placing their resources. We had an IPO in 2021, and now we get to join in on the stocks.
Storage Systems is a small team of six engineers with a number of goals. The scale of the team and the scale of our deployment should not be tied together: there should never be a ratio of engineers to clusters when considering team size. Just because we add another N clusters over a year doesn't mean we can hire new engineers to dedicate to those clusters. A huge help with that is that we try to automate everything we possibly can, as idempotently as possible. The end goal, or the dream as it were, is that we'd never have to SSH into a node for day-to-day operations. At the same time, we're not going to let perfect be the enemy of progress; there's a lot of room for hacky, one-off bash scripts.
So let's talk about Ceph at DO. Ceph use at DO is growing rapidly. It's used for block and object storage, which powers both the Volumes and Spaces products at DO across many clusters, and other teams make heavy use of Volumes and Spaces for many of DO's other product offerings.
Some quick stats about DO; these are the numbers I'm allowed to share, and I can't go any further. We have 46 clusters in total: 38 are production, running Nautilus, and eight are staging, some of which are on Pacific. There's more than 140 petabytes of raw storage in Ceph, and our biggest clusters are over nine petabytes. This does not include the Droplet backup, snapshot, or bring-your-own-image storage, which is pretty staggering in itself. There are more than 23,000 OSDs in our fleet, across roughly 1,200 servers.
I want to give a quick mention to as many parts of the automation as I can, including outside of our team, so here's a quick whirlwind of what's outlined here. There's a long process of qualification and procurement that happens before we decide to deploy a set of equipment for a new generation of clusters. A lot of the stress testing is automated, though it's mostly through scripts and tooling that is run directly on the nodes, and we do the same sort of thing when qualifying new drive models as well. Our data center ops do the rack and stack.
They make the cabling beautiful, power on all the equipment, and hand it over to our network and hardware engineering teams. The networking team will then configure switches as appropriate; there's additional complexity here for the public-facing load balancers on Spaces clusters. The hardware engineering team takes the server nodes through a provisioning workflow, and they're left with a base OS with up-to-date firmware on all components.
Storage Systems then runs our portion of the provisioning workflow and populates a Ceph cluster from scratch. I mentioned a bunch of automation that we use, and some of the tools are listed here. Chef is used for all the core OS stuff and general config management. We use Ansible for things that are Ceph-specific, such as deploying a cluster from scratch or augmenting a cluster with more nodes or drives. AWX is an open-source, self-hosted solution for running Ansible playbooks; with it we can share failure modes with the team, and we have a detailed history of runs.
We still write one-off bash scripts from time to time where automation doesn't make sense. This is often due to something that's a one-off as we understand it at the time, but we still fully document those with context and tickets, because sometimes that stuff gets moved into a playbook. Again: don't let perfect be the enemy of immediate progress. Something to note here: we don't use cephadm at all.
It didn't exist when we started, and we haven't been convinced that we'd gain much benefit from it today, because of some of our requirements for secrets management and other DO ecosystem ties; we couldn't use it as a pure off-the-shelf solution anyway. We've had to support Luminous, Nautilus and Pacific, and upstream automation has changed between these. We require fine-grained control over the cluster layout and behavior for specific needs. We are still keeping tabs on the options available, for example Rook, and we may evaluate them in the future. So the bulk of our automation lies in our Ansible deployments.
Ultimately, this is what allows us to operate Ceph at scale. It's not particularly new or innovative, but it is cool to see a bunch of YAML turn a bunch of metal into user-consumable storage. As mentioned previously, new cluster deployments and augments are done through these playbooks. Augments come in two forms, either adding disks or adding hosts, and we handle both of them in a similar way, by expecting an increased disk count on each node.
Node reboots will safely reboot any of the nodes that we need to, either on demand when we want to, or when we just need to for kernel updates and the like. We set up our nodes and centralized config using the Ansible playbooks and reconfigure them at will. Ceph upgrades are also done through these playbooks, and this is pretty easy because we use containerized deployments. We've been running Ceph for a very long time, and because of that we've had some Filestore OSDs that we wanted to move to BlueStore.
This can be done safely with our playbooks, either by draining an OSD and recreating it, or dangerously, by destroying and recreating it in place. OSD restarts can be done one at a time or a host at a time, and the playbook will just wait for recovery to complete before moving on. Since we've been running these clusters for quite a while, eventually they get old and we need to shut them down.
This kind of teardown is largely handled by the automation, but it does not include getting the data off the cluster. Some of the utilities and goodies that we use to make all the other stuff work include roles like ceph_wait_healthy, which is possibly the most widely used role in our repo at this point. It ensures that the cluster is in an expected state before continuing. Determining health is super simple for a script and super boring for a person, so we let the script determine safety as appropriate.
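As a rough illustration of what such a wait-for-health check can look like (a minimal sketch, not our actual Ansible role; the timeout and interval are made up, and jq is assumed to be available):

```bash
#!/usr/bin/env bash
# Minimal sketch of a "wait until the cluster is healthy" helper.
# Polls `ceph health` and only returns once the cluster reports HEALTH_OK,
# or gives up after a timeout.
set -euo pipefail

timeout_secs=3600   # illustrative timeout
interval=30
elapsed=0

while true; do
    status=$(ceph health --format json | jq -r '.status')
    if [[ "$status" == "HEALTH_OK" ]]; then
        echo "cluster is healthy"
        exit 0
    fi
    if (( elapsed >= timeout_secs )); then
        echo "gave up waiting for HEALTH_OK (last status: $status)" >&2
        exit 1
    fi
    sleep "$interval"
    (( elapsed += interval ))
done
```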
There are node-maintenance-up and node-maintenance-down roles that safely pull any type of node out of service and bring it back in. They grab global maintenance locks, using RADOS locks, before progressing, which is just a primitive concurrency-control mechanism that leverages the target cluster. This allows us to make sure that no two operators are going to try to run different playbooks on the same cluster at the same time.
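For illustration, here is one way a primitive lock like that could be taken on the target cluster with the rados CLI (a hedged sketch, not our actual implementation; the pool, object and lock names are made up, and the exact lock subcommands should be checked against your rados version):

```bash
# Sketch: take an advisory lock on an object in the cluster being operated on,
# so two operators running playbooks at the same time would collide.
POOL=ops                       # hypothetical coordination pool
OBJ=maintenance-lock           # hypothetical object name
LOCK=playbook-run              # hypothetical lock name

# Acquire an exclusive, auto-expiring lock; this fails if someone else holds it.
rados -p "$POOL" lock get "$OBJ" "$LOCK" --lock-type exclusive --lock-duration 3600

# If acquisition fails, see who holds it.
rados -p "$POOL" lock info "$OBJ" "$LOCK"
```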
We also have Slack utilities that let interested parties know when things are happening as they happen, and there are of course ties into our secrets storage in order to push and create new keyrings for deployments as necessary.
A quick note about the automation: it of course thrives on consistency. Snowflakes are inherently inconsistent, and unfortunately some winters bring more snow than others. When doing operations, try to think about how a change in one cluster today might affect your assumptions tomorrow across the fleet.
There are some ways your cluster might end up being different from others, or even nodes within a cluster might differ. Hardware configs are the easiest deviation, especially as drives go end-of-life and you mix in the next generation. Centralized config can change between clusters, and you might also forget about that one single OSD that was given a specific config option, which might just cause confusion down the road. There might be a long-running script in the background that you completely forgot about, and now both you and the balancer are totally confused.
We've definitely had that happen; I highly recommend melting your snowflakes so they're all kind of part of the same puddle. Glossing over the fact that we build our own Ceph packages, I want to move on to deploying the cluster. Ideally this is as simple as firing off a playbook and just waiting until it succeeds or fails. Realistically it is generally that easy, but there's a ton of work that goes into these playbooks, and some of it is worth calling out explicitly.
First off, we make sure that Chef converges on a host; that's just kind of our base entry point, and then we start getting into the meat, which I'm just going to skim through here. There are some safety checks along the way, such as ensuring that all the drives on a host have the same size (a small sketch of that kind of check follows below). We then create systemd units for the daemons, configure manager modules, and more throughout this process.
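As an illustration of that kind of safety check (a minimal sketch; identifying data drives by a simple sd* pattern is an assumption, not how our inventory actually models drives):

```bash
#!/usr/bin/env bash
# Sketch: fail if the data drives on this host do not all report the same size.
set -euo pipefail

# Assumption: data drives show up as whole sd* disks; adjust the filter for real hardware.
sizes=$(lsblk -dn -o NAME,SIZE | awk '$1 ~ /^sd/ {print $2}' | sort -u)

if [[ $(wc -l <<< "$sizes") -ne 1 ]]; then
    echo "drive sizes differ on this host:" >&2
    echo "$sizes" >&2
    exit 1
fi
echo "all data drives report size: $sizes"
```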
The next pieces assume that's been done, and we do the standard dance of creating a monmap, deploying the mons and so forth, that sort of thing. Then we want to create our OSDs and pre-populate the CRUSH tree ahead of time (a sketch of what that boils down to follows below). This is super simple for us, because we have an Ansible inventory.
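To make "pre-populating the CRUSH tree" concrete, here is roughly what it boils down to in plain Ceph commands (a hedged sketch; the rack and host names are made up, and our playbooks drive this from the inventory rather than from a shell script):

```bash
# Sketch: build out the CRUSH hierarchy before any OSDs exist, so OSD creation
# can later run in parallel and each OSD lands in the right place.
ceph osd crush add-bucket rack-a1 rack
ceph osd crush move rack-a1 root=default

ceph osd crush add-bucket storage-node-01 host
ceph osd crush move storage-node-01 rack=rack-a1
```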
These Ansible inventories are generated for us by some other DO ecosystem tooling, and they carry specific attributes about placement, such as the rack, the number of disks, whether it's an index or a data node, that sort of thing. Next up, we deploy the OSD containers and start them across the entire cluster. This is also very simple, because we just enumerate the disks on the host and we have a tool that wraps ceph-volume lvm (a rough sketch of that enumeration follows below); and because the CRUSH tree was pre-populated, this can be done in parallel across the cluster, which makes it very quick for us.
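A rough idea of what "enumerate the disks and wrap ceph-volume" means in practice (a minimal sketch assuming bare sd* data disks and non-containerized OSD creation; our real tool handles containers, index versus data layouts, and error handling):

```bash
#!/usr/bin/env bash
# Sketch: create one OSD per unused data disk on this host.
set -eu

for disk in /dev/sd{b..z}; do
    [[ -b "$disk" ]] || continue            # skip device names that don't exist
    # Assumption: a disk with no partitions is fair game for an OSD.
    if ! lsblk -n "$disk" | grep part >/dev/null; then
        ceph-volume lvm create --data "$disk"
    fi
done
```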
Finally, we do a quick verification that all the OSDs we expected to create were created and started. We also check the cluster health at this point and verify that the cluster is healthy and as bored as it ever will be. In the future, we'd like to fire off this playbook automatically after generating the inventory from completed tickets during our handoff.
That would be an example of promoting an automation to a service, though it's effectively just orchestrating playbook launches. One of the biggest post-deployment operations we have is a capacity augment, and this is where block and object vary slightly: deploying the OSDs and the containers is the same, but giving them PGs is different. On the block side, we use a tool of ours, which is open source, to slowly upweight OSDs over time. This is mostly done to mitigate peering latency, which I'll talk about a little bit more in a moment.
The object side uses pgremapper, which is also open source, and it cancels backfill via upmaps. We then slowly undo those upmaps in a loop, which brings the PGs back to the new OSDs. This is done because traditionally object ran on hardware where flapping OSDs were not uncommon; the recovery wait from those flaps would get put off by ongoing backfill and eventually turn into backfill wait, and this just kind of snowballed into never-ending tears.
We mostly use pgremapper on object now to maximize backfill concurrency and minimize degradation. It's possible now that we could make use of the upmap balancer for both products: after we cancel backfill, the balancer would then start opportunistically removing the upmaps that aren't needed, and it can be turned off if necessary.
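To give a feel for the mechanism, this is the kind of loop that "slowly undoing the upmaps" implies, expressed directly against the Ceph CLI (a hedged sketch; pgremapper does this for us with proper pacing and safety checks, and the batch size here is made up):

```bash
#!/usr/bin/env bash
# Sketch: remove pg-upmap exceptions a few at a time, waiting for the resulting
# backfill to drain before removing the next batch.
set -eu

BATCH=10   # illustrative batch size

while true; do
    # PGs that currently have an upmap exception pinning them to their old OSDs.
    pgs=$(ceph osd dump --format json | jq -r '.pg_upmap_items[]?.pgid' | head -n "$BATCH")
    if [[ -z "$pgs" ]]; then
        break
    fi

    for pg in $pgs; do
        ceph osd rm-pg-upmap-items "$pg"
    done

    # Wait until no PGs are backfilling before the next batch.
    while ceph pg stat | grep backfill >/dev/null; do
        sleep 30
    done
done
```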
So now that we've got a cluster released to the world and it's no longer bored, we want to keep this thing up to date, handle failure modes, and do all sorts of maintenance. Some planned operations, such as cluster augments and capacity management, require this, as discussed. OSD restarts happen often, either due to Ceph updates, disk failures, or simple flapping.
Then there are the cases of a slow OSD, node reboots to keep a kernel up to date, nodes failing due to bad RAM, anything in the network stack, solar flares, you name it: all of these things will cause PGs to start peering, and during peering, no I/O can happen on those PGs. While peering is very, very quick on our block clusters, it's never going to be faster than our P99 read, and this can cause some cascading issues for our most latency-sensitive customers.
This is less important on the object clusters, because there's the HTTP overhead, and that latency is usually longer than PG peering. To give a bit of an idea, this is a P99 read latency graph, and those spikes there, you might be able to tell, are where OSD restarts happened. It's important to note that we measure this from inside the cluster against a real RBD image, which means we don't have all the overhead of the network between a droplet and the cluster, and we specifically measure I/O latency with this tool.
So what can we do about this peering latency? We can reduce the paxos propose interval from its default of two seconds to a quarter second, as Sage suggested on the mailing list.
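For context, that tunable is the monitors' paxos_propose_interval; with centralized config the change looks something like this (a sketch, with the quarter-second value from that suggestion; as noted next, this alone did not help us):

```bash
# Reduce how long the monitors batch map updates before committing them, so OSD
# map changes (and therefore peering decisions) propagate sooner.
ceph config set mon paxos_propose_interval 0.25

# Check the current value.
ceph config get mon paxos_propose_interval
```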
However, we actually observed that this made things a little bit worse for us, so we tried another approach: starting all the OSDs without letting their PGs actually peer, by setting noup. We can then check the admin socket on the OSDs for their current status and wait for them to just hang out in the preboot state. Once all the OSDs on a host are at preboot, we unset noup, allowing PGs to begin peering, which reduces the OSD map updates. Now we get to deal with the recovery overhead on the cluster for a bit, but for us that's still better than the peering. We know that we'll never be able to eliminate peering latency in an immediately consistent storage system.
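Put together as commands, that restart trick looks roughly like this (a minimal sketch; the systemd unit naming is an assumption, jq is assumed to be installed, and our playbooks do the equivalent with proper batching and checks):

```bash
#!/usr/bin/env bash
# Sketch: restart all OSDs on a host while holding them in "preboot" with the
# noup flag, then release them all at once so peering happens in one burst.
set -eu

ceph osd set noup

# Restart every OSD unit on this host (unit naming is an assumption).
systemctl restart 'ceph-osd@*'

# Wait until every local OSD reports the "preboot" state on its admin socket.
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" | sed 's/ceph-osd\.\(.*\)\.asok/\1/')
    until [[ $(ceph daemon "osd.$id" status | jq -r '.state') == "preboot" ]]; do
        sleep 2
    done
done

ceph osd unset noup   # let all the PGs peer at once
```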
However, if we can make some progress in minimizing it, it really helps those applications that are the most sensitive to latency. So look at this graph; I zoomed in way further on this one to reduce the noise and show the difference when combining both the noup trick and the paxos propose interval that Sage recommended: we see a great improvement to latency during peering. So that's block; what about tricks on object? We have clusters with billions and billions of objects.
I can't share the numbers, but it's nuts, and the RGW index layer doesn't handle it very well when we have buckets way beyond the hundred-thousand-objects-per-shard rule of thumb, or when there are so many shards that they heavily impact list performance. This was a huge problem before we could do any kind of dynamic resharding, back in the Luminous days, but even after resharding, what about cleaning up the old shards?
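On Nautilus and later, leftover bucket index shards from old reshards can be inspected and removed with radosgw-admin (a hedged sketch; review the list carefully before removing anything on a real cluster):

```bash
# List bucket index instances left behind by earlier reshard operations.
radosgw-admin reshard stale-instances list

# Remove those stale instances, and with them their old index shards.
radosgw-admin reshard stale-instances rm
```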
The index lives in RocksDB, which is a log-structured merge tree and is append-only. That means every new entry in the database is of course a new write, but it also means that deletes are new writes, which we call tombstones, and a process to remove the deleted data is needed: that's RocksDB compaction. It must either read the full database, or ranges of it, to compact, and that's a lot of time spent in RocksDB code that isn't spent serving customer traffic.
It's important to note here that we aren't pointing the finger at either RGW's interaction with RocksDB or at RocksDB itself. However, we have observed a rough BlueFS interaction with RocksDB when iterating over large amounts of tombstones. This is largely improved today, because the Ceph developers and the community are awesome. The scale we're at is not what the RGW defaults are tailored for, though it has proven quite capable of handling our scale when tuned.
So we needed to dig into the index compactions to figure out what was up. In RocksDB there were files which won't compact until the level is full; this meant there was no upper bound on the tombstone lifetime, leading to slow iteration. We explored a lot, a lot of RocksDB options looking for silver bullets, and this effort started back when we were on Luminous. I'll stress that I've condensed many months of effort that much of our team put in into a single bullet.
On this slide there was a ton of discovery and testing throughout. With Nautilus, RocksDB was upgraded and we had more options to explore. So, our silver bullet: we discovered that newer RocksDB gave us access to TTL compaction. When stale data reaches a certain age, compaction is triggered on the file within RocksDB. This means that the first TTL compaction run on an OSD that hadn't had a full compaction in a long time took quite some time, but for us it was uneventful.
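To make that concrete: RocksDB exposes this as a ttl value in its options string, and BlueStore's RocksDB options can be adjusted through Ceph config. The snippet below is only a sketch of the idea with an illustrative TTL; overriding bluestore_rocksdb_options replaces the built-in defaults, so it needs to be merged with your existing options and tested, not copy-pasted. Manual full compaction, by contrast, is a standard command.

```bash
# Trigger a manual, full RocksDB compaction on one OSD (standard Ceph command).
ceph tell osd.123 compact

# Sketch of the TTL idea: append a RocksDB ttl (in seconds) to the BlueStore
# RocksDB options so files holding data older than the TTL get compacted
# automatically. WARNING: this setting replaces the default option string;
# keep your existing options and test on staging first. Value is illustrative.
ceph config set osd bluestore_rocksdb_options "<existing default options>,ttl=43200"
```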
During this time we disabled GC and LC, which is helpful here because they're index-heavy workloads. The load on our index nodes is consistently higher than it was previously, and for us this trade-off is absolutely worth it; the higher load isn't worrying for us at all, there's plenty of headroom, and the nodes are performing fine. The index utilization, in a capacity sense, has dropped a staggering amount, with some OSDs freeing up double-digit percentages of their used capacity.
We have since disabled periodic compaction in favor of TTL compaction, and this has been our biggest silver bullet for index stability. We have no reason to believe that this will backfire on us; hopefully next year we won't be here telling you that we were wrong, but these slides are a couple of months old at this point, so I think we're good. Hopefully.
Finally, we want to make sure that all the clusters are doing what we expect. First, similar to cephadm, the manager's Prometheus module was not available when we started, so we wrote the open-source ceph_exporter. That exporter is written in Go and doesn't rely on the manager, which has had some scaling issues in the past, but otherwise it accomplishes the same thing: keeping an eye on the fill rates and projections for capacity. This is especially important today, as supply chain issues make lead times absolutely terrifying.
This is different from having a finger on the pulse of what capacity is at today; we want to understand weekly, monthly and even yearly trends. We also wrote a storage exporter tool, which runs on each host and talks to all the admin sockets for a ton of extra insights. It can also check hardware on a host, such as reading SMART info: reallocated sectors, power-on hours, and so on. These are useful metrics to help identify whether a drive is headed towards failure.
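Those SMART attributes are the sort of thing you can pull per drive with smartctl, which is roughly what that host-level collection boils down to (a sketch; a real exporter reads these programmatically rather than scraping text, and the device path is hypothetical):

```bash
# Dump SMART data for one drive; reallocated sectors, power-on hours and wear
# indicators are the kinds of counters worth exporting and trending over time.
smartctl -a /dev/sda | grep -Ei 'reallocated_sector|power_on_hours|wearout|percentage used'
```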
You'll start to get an idea of how much write traffic a drive takes over its lifetime and how long that lifetime might be. We also monitor network reachability to every other host using fping, with an expected MTU and the don't-fragment flag. If this is ever flaky, we might have failures on a single link in a cluster, which in a distributed system can cause an entire world of confusion. Network monitoring for these sorts of things can be tricky, but identifying a single bad network link quickly is worth it.
We also have a tool that measures cluster latency, which I mentioned earlier; it's a very useful observability tool for getting a client-side perspective of what's going on. And as with all things monitoring, you should only alert on things that you can take action on.
An informative alert is not actionable; that's what our graphs are for. Something we're still working on is using Prometheus and Alertmanager's inhibit rules so that, in the example of network problems, only the network alert would fire instead of a slew of other alerts that fire because of that core network problem.
So, closing up, let's take a quick look at what we'd do differently in hindsight. We spent a lot of time with a division between the block and object teams; they used to be separate pillars under storage. This meant that we treated our clusters very differently: automation and configuration largely diverged, creating a lot of confusion and duplicate work throughout the life cycle of the clusters. There were multiple sources of truth for the different things Chef and Ansible can get their information from; if that were minimized, it would reduce a lot of confusion.
The use of the centralized config, instead of scattered ceph.conf files, for example, has been great. We have a lot of automation, and some of it gets pretty complicated; some of that might be better off as services. Just like determining whether something is worth automating, finding that line of when to promote automation to a service, and finding the right time (or finding the time at all) to do it, is challenging.
Melt all your snowflakes: a unique cluster is going to become a problem somehow, someday, and if all the snowflakes melt together, they become part of the same digital ocean. So thank you, that about wraps it up. A quick hiring plug: check out our careers page. And if we have some time here, I can try to take some questions.
We've got some questions coming in from chat. So, is the storage exporter open source? Unfortunately, it is not. It is something that we've kind of talked about back and forth, but there hasn't been an effort to actually look at open-sourcing it yet; I think we'd have to look at what it includes before we can look at that. Do we balance primary PGs? Balanced in what way?
This was, I think, originally because we ran on several generations of hardware. Way back in the day, we envisioned object as being used for a very different use case than what turned out to be web assets; we were expecting large objects at the time, so we geared those deployments for large objects, whereas all of the block deployments were deployed expecting all-block workloads.
So our newer-generation clusters are much, much more tuned to the widely varied workloads on object. As for balancing primary PGs evenly over all the OSDs: yes, we do use the upmap balancer today on the clusters. pgremapper has been useful for kind of circumventing that for other maintenance operations, like if we need to drain an OSD or just cancel ongoing remapped backfill for any reason.
Okay, well, thank you, Matt, for taking the time to present.