Description
Presented by: Dan van der Ster | Clyso
In 2013, the data storage team at CERN began investigating Ceph to solve an emerging problem: how to provide reliable, flexible, future-proof storage for our growing on-premises OpenStack cloud. Beginning with a humble 3PB cluster, the infrastructure has grown to support the entire lab, with 50PB of storage across multiple data centres serving a variety of use cases, from basic IT apps and databases to HPC and cloud storage.
So this is a 10-year retrospective on what it was like operating Ceph at CERN. Quickly, for those that don't know me: that's my email address, and I'm a Canadian guy from the University of Victoria. I've been on the engineering staff at CERN since 2008. I was managing Ceph there until last year, and then I handed off the reins because I became Chief IT Architect at CERN.
That's one hat. The other hat is that I've been proudly, happily supporting the Ceph Foundation, being a founding board member, and now I'm honored to be on the Ceph Executive Council, since 2021, when our Benevolent Dictator for Life moved on to greener pastures. And the latest news is that in April I'm relocating to Vancouver. I'll be taking a sabbatical from CERN, and I'm going to be joining Joachim and company to help build up Clyso in North America.
What's that cool picture behind me? That's the LHC at CERN, the Large Hadron Collider, a 27-kilometre ring, 100 metres underground. This is where I work these days. My house is right there somewhere; there's a ski hill going up, and when it gets cooking a little bit it melts the snow a little bit. Not too much! But this is what it's like 100 metres down: those are the superconducting accelerator magnets.
They accelerate protons up to the speed of light minus 11 kilometres per hour, just hitting the brakes a little bit. They bend those opposing beams together at four corners of that circle, collide them, and then take pictures with giant cameras like this. You can see someone who may or may not be me standing there, to give you an idea of how big that camera is. And then it takes pictures like this: this is a picture of the Higgs boson.
This is the whole reason we built the LHC at CERN, and this is the particle. Well, this is not the particle itself; this is the splash after that particle was created and then decayed very quickly. And that's the thing that gives everything mass; that's why we weigh something. It's really crazy: I'm a computer guy, but somehow I got my name on the ATLAS paper that discovered the Higgs boson, which then got a Nobel Prize. Okay, I don't have the prize.
So one would ask a certain intelligent system: what should we... oh, this. Unfortunately the text is right there. "Write an outline for a 10-year retrospective." "Sure, here's an outline." This is my CephGPT bot; okay, CephGPT wrote the outline for my talk. "I'll give you an outline": introduction, background of Ceph at CERN, a 10-year review of how it was to operate, and some lessons learned and future directions.
Of course, the outline it gives was more detailed than that, so I'll give a brief overview of our use and the importance of it in our infrastructure. This is real output from ChatGPT, guys. Unbelievable. And the purpose of this outline. So here's the introduction, the overview of Ceph at CERN. We started in 2013 with a 300-terabyte proof of concept that we tried to break. We couldn't break it, okay, and that allowed us to justify a three-petabyte cluster back in 2013 for OpenStack.
In 2014 and 2015 we did some R&D on erasure coding, writing the first ISA-L acceleration for erasure coding. We developed the first object striping for a particular use case that we had for physics data. In 2016 we upgraded our Ceph in place from three petabytes to six petabytes; this was proving that Ceph's organic storage growth really was working. Then we had eight production clusters. Then we got into CephFS and S3 in production, scaling out. Okay, I ran out of things; there's nothing so notable in those years, but we were scaling out, with business continuity and disaster recovery solutions.
By now we have around 17 clusters and around 100 petabytes on the floor. These two numbers are interesting, because 10 years is about five Moore's Law doubling periods, and that's almost exactly matching, right? That's like 96 petabytes, if you do the math. That's kind of interesting, but it also means that my Ceph budget did not increase in one decade.
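That doubling arithmetic is easy to check: five doublings of the original 3 PB land almost exactly on today's roughly 100 PB.

```python
# Five Moore's-Law doubling periods (~2 years each) applied to the
# original 3 PB cluster from 2013.
start_pb = 3
doublings = 10 // 2  # ten years, one doubling every ~two years
projected = start_pb * 2 ** doublings
print(projected)  # 96 PB, close to the ~100 PB actually on the floor
```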
That hundred petabytes is within a context: it's within an exabyte context of disks and tapes for the physics data, but we use Ceph for sort of everything else. Okay, so it's really proven itself; it has become the key piece. The purpose of this talk: if you don't remember your history, you're condemned to repeat it. A decade is a round number, and I've got 10 fingers, so 10 sounds like an interesting number, and it's kind of like a memoir.
Why Ceph? So going back, you know, we started with OpenStack. The management forgot to get a storage solution for the cloud, so, okay, let's see what's out there in open-source land. Ceph was the best: we tried Gluster, we tried our NetApps, we tried Ceph, and Ceph was far and away the best. We all know that story.
The key thing is we needed an organic system that would grow with our usage and our evolving use cases. Flexible storage, no data migrations. Data migrations, forget it: you can't just shift the data to another platform as you're growing. So, behind the scenes, transparent migrations; that was key, and it has proven itself for 10 years. This is it, it works. Also the best bang for the buck.
So these are more like, let's say, the initial challenges when you start, and I think this is probably echoed by many people. You want to install the software, and you need to make it work on this hardware, because the hardware comes from someone else, who decides which hardware we buy, in bulk, for scientific use cases. Okay, and it's always two gigabytes of RAM per CPU core, always the largest, cheapest hard drives (data centre hard drives, not desktop stuff), always in JBODs of 24 disks, so you can then do 24, 48, up to 96. And, you know: why do you need three replicas?
People are still asking that question. Why do you need SSDs? This is still something to justify. There's also some not-invented-here syndrome at CERN: there's a storage department of 50 developers at CERN, so using something from outside took quite some work. There's the impression that the only thing good about Ceph is CRUSH, for some reason. I mean, CRUSH is even broken.
Right, CRUSH actually doesn't work; I'll get into that a little bit later, a tiny bit. But the real value of Ceph is the engineering effort to make failures transparent, and the OSDs just heal. That's the real, true gem of Ceph. And also, the old-timers said: forget about any kind of software-defined storage, you'll never solve the latency problem, there will always be one or two extra hops, just forget it.
That's kind of still true, isn't it? But anyway, our architecture. I'm just following ChatGPT's outline, right? This is not creativity at all, I'm so lazy. It's the same host recipe since the beginning. We have server quads (I've got a picture on the next slide), dual Xeons, now AMDs, always two gigabytes of RAM per hyperthreaded core; now these are 256-gigabyte machines. We went from one gig to 10 gig to 25 gig networking on our standard stuff.
We have 100 gig in testing as well. Always dumb HBAs, never RAID controllers, and one or two of those JBODs. And I managed from the beginning to convince people that we need a little bit of flash. It was a reliable FileStore journal at the beginning, and now it's a BlueStore block.DB. On the flash, originally we were spending a bit extra on 5-drive-writes-per-day SATA SSDs, with five OSDs per SSD; now we've scaled down a bit.
Here are the photos. This is probably the same aisle, just to prove that nothing has changed. These are those server quads; you've probably seen these, there are four servers in there, and then those 24-disk JBODs. It's the same, and that picture is even from 2017. I just got bored of taking pictures, because it's always the same. All right, voilà. Network-wise, very simple stuff.
Originally we had this idea... actually, what it is: at CERN it's complicated to plan where you're going to get space in the data centre, so the servers just end up anywhere. So then you try to make a cluster across different switches, across different routers, you know, multi-path routing. And we learned that this is not the way to go with Ceph, because any kind of fault anywhere and you've got OSDs flapping up and down; it's just a disaster.
So now we usually put it all behind one switch, which is also slightly crazy. But if you have many clusters, and you teach users about trying to use many clusters, this probably ends up a more reliable system, with better failure modes. Most of our issues are related to router line card failures or packet corruption. Now we just use one cluster behind one switch; if the switch fails, the whole cluster goes down, inaccessible all at once. It's better, I think, than what we used to do.
Well, there we go, something to discuss after. We used to have one cluster for all; this was also the dream of Ceph, right? One cluster, and you create a pool for each of these different things. But we now have several clusters across several zones. Teaching your users about downtime budgets is like the most important thing; this is part of the Google SRE book, right? Downtime budgets. And spend your downtime budgets too.
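As a rough illustration of the SRE-style arithmetic (the numbers here are mine, not from the talk): an availability target directly fixes how many minutes per month you're allowed, and even expected, to spend on disruptive maintenance.

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

# e.g. a 99.9% monthly target leaves about 43 minutes for upgrades,
# switch reboots, and other planned interventions
for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget_minutes(slo):.1f} min/month")
```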
Okay, where are we on to now, CephGPT? I hope you noticed the Ceph logo on CephGPT. Yeah: performance and scalability, upgrades, challenges and success stories. So, performance. I guess I was a little bit tired when I wrote this slide: Ceph performance can be summed up as "it is what it is", right? It's not performance-first.
In that triangle of things that you choose between: consistency, availability, and performance... I mean, well, partition tolerance, but whatever. You know, we prioritize consistency and availability in Ceph, or partition tolerance; it's actually a CP system. But anyway, we don't prioritize performance first, right? That's Mark's job; always keep Mark busy. But the users always expect raw NVMe fio performance. Management will buy these devices that say a million IOPS, and I gave you a thousand of them, so: where are my billion IOPS, right?
That's the expectation, so we always have to remind people of the complexities of distributed, clustered storage across a network. But anyway, I think this isn't even a problem: in 10 years of experience, almost nobody understands their I/O workloads and requirements. Everybody wants that performance, but in practice we still throttle our attached storage to 200 IOPS or 500 IOPS, we allow bursting up to a thousand, and everybody's happy. Like 99% of people are happy.
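The talk doesn't say where this throttle is applied (limits like these often live in the hypervisor or Cinder QoS layer rather than in Ceph), but for reference, librbd in recent Ceph releases can express similar limits itself. A hypothetical sketch, with made-up pool and image names:

```shell
# Cap all images in a pool at 500 IOPS with bursts toward 1000
# (rbd_qos_* options exist in Nautilus and later)
rbd config pool set volumes rbd_qos_iops_limit 500
rbd config pool set volumes rbd_qos_iops_burst 1000

# Or override the limit for a single image
rbd config image set volumes/vm-disk-1 rbd_qos_iops_limit 1000
```

This is a config fragment, not a runnable script; it assumes a cluster with a `volumes` pool.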
So that's the scale of storage performance. However, from the operations side, you still really want this: it must be possible to take 100 petabytes of disks, put in a small amount of flash, and, you know, the goal is a cheap performance fix: make the cluster perform like 100 petabytes of flash, transparently, and it should be just invisible. FileStore had a simple way to speed up writes; this was nice.
I think a lot of Ceph operators appreciated this, and then lots of clever folks used bcache. I don't know if anybody in the room remembers doing this kind of thing. We used bcache. You still do? Yeah. But it's scary to put other things in between; I think it's scary. Okay, I know it's probably working in practice, but it's scary to put other things in between, because Ceph wants to scrub the disks and detect the durability issues immediately itself.
BlueStore is deferring small writes to the block.DB, but I'm pretty sure it's still bottlenecking on fsyncs to the spinning disks. So we don't quite get the same performance out of BlueStore as we did with FileStore, for writes at least. Maybe it's a config issue; we're talking to Mark about this. I just know from other file systems that I use, like ZFS...
...they have a persistent L2ARC cache and log device, and it works very well to sort of hide that spinning-disk slowness underneath. I think we should put something more focused on this, something like that, built directly into Ceph, instead of having to rely on bcache and other things. But there was one cheap fix. This is actually a plot.
It's unfortunate: I didn't look far enough forward when we set up our original monitoring dashboards 10 years ago to set the storage schema to keep 10 years of retention.
So we only keep five years of retention here, and this is a plot of... every minute, on every Ceph cluster, I run rados bench, four-kilobyte writes, just to see how long it takes to write the objects. This is on our biggest, oldest cluster. (The laser pointer doesn't work on this.) You see, originally it was under 10 milliseconds for this slow bulk storage.
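The probe itself is just the standard benchmarking tool shipped with Ceph; a minimal sketch of that kind of once-a-minute measurement (the pool name and duration are my guesses, not the actual CERN probe):

```shell
# Write 4 KiB objects for 10 seconds with a single in-flight op and
# report average/max write latency; test objects are cleaned up
# afterwards by default.
rados -p rbd bench 10 write -b 4096 -t 1
```

Requires a running cluster and a pool to write into, so it's shown here as an ops fragment rather than a runnable script.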
It got pretty bad there for quite a few years, 2018, 2019: up above 50 milliseconds for a 4K write. I mean, that's terrible, right? But what's interesting is that nobody complained, because it's protected: it's inside a VM, and who does synchronous I/O consciously? Almost nobody. It's all buffered, and Linux is flushing it behind the scenes, and the users didn't even really notice. But then, okay, we got some AMD EPYC CPUs and then nothing worked; it got completely broken. Is anyone from AMD in the room?
Okay, it wasn't AMD's fault. It turned out to be... you see, around the end of 2021, suddenly the latency went down, and it's because we learned that spinning disks, around the, let's say, 14-terabyte generation, across manufacturers, added what one of the vendors calls a media cache. And this thing specifically... I mean, I still haven't been able to dig out why they did it, like for exactly which use case.
It must have been in collaboration with one of the enterprise NAS vendors. But specifically, it accelerates synchronous direct I/O. So if you're using O_DIRECT and O_DSYNC at the same time, like Ceph does, it accelerates those writes in a fast part of the disk. Okay, so if you have the write cache off on these latest-generation drives: yeah, cool, magic fix.
Yes, the switch to turn this on and off is just turning the write-back cache on and off on the spinning disks: WCE on, WCE off. Out of the box they usually come with that write cache on, and recent-generation disks are then just completely unusable. But you switch those disks to write-through mode and it's like magic-fix mode: suddenly your whole spinning-disk cluster performs almost like SSDs again. This is a spinning-disk cluster, and that's really great.
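The toggle being described is the drive's volatile write cache, the WCE bit; a sketch of how you might flip it on a SAS/SATA drive (the device name is a placeholder; check your distro's tooling before copying):

```shell
# Show the current Write Cache Enable bit
sdparm --get WCE /dev/sdX

# Switch the drive to write-through mode (WCE=0) and persist it
sdparm --set WCE=0 --save /dev/sdX

# ATA drives can also be switched with hdparm
hdparm -W 0 /dev/sdX
```

These are ops fragments against real hardware, so they're not meant to be run as-is.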
We're still scared that something is not persisted, but that's a couple of years now. I mean, this looks too good to be true, right? Yeah, that's what it looks like, but so far so good. Okay, scalability. We had lots of fun with our big-bang scale testing; we did a couple of blog posts with Sage on this. In 2015 we did a 30-petabyte test.
Back when that was a big number. Then in 2018 we did a 72-petabyte Ceph test with over 10,000 OSDs, because Ceph actually has a, not hard-coded, but a maximum of around 10,000 OSDs per cluster. We just wanted to have a test where we could bump it up above that. We did that, and we found lots of problems: all of the maps were too large, and the cluster was flapping up and down.
All that work was going to the mon. In the past we moved all the non-critical workload off to the Ceph manager daemon, and as a bonus it also gives people who are Python devs an easy way to contribute to Ceph, through the modular interfaces of the manager. But still, I think we should avoid putting in more than a few thousand OSDs; two to three thousand is about the limit, and beyond that it gets sluggish, as I see some of you have noticed.
Yeah, exactly, you see some things. Okay, some improvements. BlueStore was a huge improvement. You know, FileStore was convenient, simple to understand, but there were a lot of painful things, like splitting the directories and then merging them back. This was quite painful.
People these days can just ignore that. You have a nice warm fuzzy feeling from BlueStore that it's always serving good data; you didn't have that with FileStore, and we have checksums now in BlueStore. Someone's shaking their head, no? Yes, okay, well, the jury's still out on that. I've got that warm fuzzy feeling, but it is pretty scary. One of my managers was like: oh, I...
...remember Oracle did the same thing, invented their own file system underneath, and it took, I don't know, that's before my time, but he said it took like decades before that stabilized. Maybe someone else knows. There are still some, you know... BlueStore is still scary, but major kudos to those low-level gurus who know how this works and know how to fsck a corrupted BlueStore. Wow. That's a place we can improve.
Most cluster operations are more civilized now; I've got some stuff on the next slide. CephFS, this is an improvement over time. Remember, in the olden days we had: Ceph is awesome, the block store is awesome, the object store is awesome, and CephFS was "almost awesome" for the longest time, and then it was declared awesome, right? And it continues to improve today.
Some of the challenges operating Ceph: we learned early on, and we still know, to make changes gently to a Ceph cluster. Don't do anything massive. I think in Berlin I had a talk about a "leap of faith": sometimes you type a Ceph command and then, oops, see you in three months.
In a large cluster, rebalancing is the normal state of the cluster, so tuning that, understanding that, feeling the heartbeat of the cluster, is important. In our operations we aim for HEALTH_OK with no backfilling at least once a week. If you keep doing operations that take weeks on weeks and weeks, things will accumulate and you'll be in trouble. We have some scripts to do this: gentle reweight, gentle split.
These used to be really used; I don't know how many people use them anymore, but Ceph is getting really good at this now. It's solving most of these kinds of operational problems: it tracks the number of misplaced objects and keeps that all within a limit, it has schedules for the balancer and scrubbing and so on. All of this got much better, but I think it's still way too easy to do that "oops". A couple of months ago someone was on the mailing list...
I don't know if they're in this room, but: "I did this, and then..." and it was just like, really: "we estimate it's going to take three months to complete". Jeepers, I can't imagine; that's a long time with no sleep. I think we should do something like Nomad and Terraform, where you can type a change plan, it'll tell you what it's going to do and how long it should take, and then you can apply it. I don't know, something like that; that kind of semantic would be really helpful for operators.
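To make the idea concrete, here is a hypothetical sketch of that plan-before-apply semantic; everything in it (the class, the numbers, the ETA formula) is invented for illustration, not an existing Ceph interface.

```python
# Hypothetical "plan before apply" semantic for cluster changes,
# in the spirit of Terraform; numbers are invented for illustration.
from dataclasses import dataclass

@dataclass
class Change:
    description: str
    misplaced_objects: int
    backfill_rate: int  # objects per second the cluster can move

    def plan(self) -> str:
        """Describe the change and estimate the backfill time."""
        eta_h = self.misplaced_objects / self.backfill_rate / 3600
        return f"{self.description}: ~{eta_h:.1f} h of backfill"

change = Change("increase pg_num 2048 -> 4096", 50_000_000, 2_000)
print(change.plan())  # the operator reviews this before applying
```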
Other challenges. Remember when non-uniform data placement was costing us all a lot of money? Unbalanced OSDs: if you had a 10% variance, this would cost millions, right, in lost space, in petabytes. The upmap balancer solves this problem. Problem solved. But, and this is what I alluded to earlier, CRUSH is broken.
There are some mailing list threads on this. If everything is the same size, it's fine, but as soon as you have some different sizes, the algorithm is actually wrong. There are threads about this; it's called the multi-pick anomaly. Maybe somebody smart wants to fix this one day, but it would require some deep work.
The built-in balancer is really great, but it's not perfect in all systems. This is where the power of the community comes in. TheJJ, are you here? That's you, yeah: people use your balancer.
This is great, you know, so it works really well. And this upmap tool is really a Swiss army knife for more experienced Ceph operators; it's a mechanism we can use to direct data exactly where we want. And you know, there's pgremapper from, I think, DigitalOcean, right, and upmap-remapped that we made. These kinds of things are quite useful.
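The primitive all of those tools build on is the pg-upmap-items table in the OSDMap; a minimal illustration (the PG and OSD ids are placeholders):

```shell
# Map PG 2.7's replica off osd.4 and onto osd.12, as an exception,
# without touching the CRUSH weights
ceph osd pg-upmap-items 2.7 4 12

# Remove the exception again and let CRUSH place the PG normally
ceph osd rm-pg-upmap-items 2.7
```

These commands need a live cluster, so they're shown as ops fragments.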
These are features we want to bring into Ceph somehow. Scaling CephFS remains probably the most challenging of the Ceph daemons, because the MDS is single-threaded, so you need many of them. It's very stable these days, very stable, but adding and removing them on the fly is not so practical if you've got, say, some AI workloads running there.
I need to upgrade, and, you know, it's hard to convince a thousand physicists that we're going to do an upgrade and they have to stop all their long-running, month-long jobs. So there's a bit more there, and it's still scary if there's a problem in CephFS; this remains a very scary thing. And snapshots are a bit too scary to enable at scale; I think there's someone in the room here who might have a tear in their eye thinking about it.
S3 scaling: one region is not very hard. I'll come out with our stack: it's Traefik front ends, it used to be HAProxy, and we switched to Traefik because we manage it all with Nomad, and we route to the RADOS Gateways. Performance comes from spreading across many buckets and spreading across many RADOS Gateways, but we group our RADOS Gateways using pattern matching on the bucket names, according to our users; these are different user communities. This is a good way to get nice performance.
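The routing idea is simple; a hypothetical sketch of the bucket-name pattern matching (the community names, patterns, and backend addresses are all invented, and the real setup lives in the load balancer's config, not in application code):

```python
import re

# Map bucket-name patterns to groups of RADOS Gateway backends,
# one group per user community (all names here are made up).
ROUTES = [
    (re.compile(r"^atlas-"),  ["rgw-physics-1:8080", "rgw-physics-2:8080"]),
    (re.compile(r"^backup-"), ["rgw-backup-1:8080"]),
]
DEFAULT_GROUP = ["rgw-general-1:8080", "rgw-general-2:8080"]

def backends_for(bucket: str) -> list[str]:
    """Return the RGW backend group serving a given bucket."""
    for pattern, group in ROUTES:
        if pattern.match(bucket):
            return group
    return DEFAULT_GROUP

print(backends_for("atlas-run3-data"))   # physics community group
print(backends_for("www-site"))          # falls through to the default
```

Grouping by bucket name keeps one community's heavy traffic from starving another's gateways.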
We suffered some issues, but overall it's working very well. Multi-region seems difficult, though. I heard maybe it's going to be fixed in Quincy or Reef; it's getting there, so we'll see. I'd like to hear more about that. I alluded to this earlier: Ceph makes it easy, too easy, to put all your eggs in one basket. But someone told me: don't put all your eggs in one basket. We learned that the hard way.
I don't know if the slides will be available afterwards, but if you're curious: in February 2020 we had an incident, and we made a talk, this one is from June that year, while we were all in lockdown, that was called "the bug of the year". Actually, little did we know that there was a bug of the century emerging at the same time that we were fixing ours.
What happened was that there was a bug in the low-level lz4 compression library, and it corrupted the OSDMaps on all the OSDs at the same time. I was having a coffee, and then, boom, I had to put it down, and that was an eight-hour, very painful, traumatic day. But everything recovered perfectly; all the data was intact, not one single bit lost, and now Ceph is stronger because of it.
Oh, I've got to go. Successes. I didn't want to list all the successes, because there are just too many, but I want to talk about why Ceph succeeds, right? It succeeds because it integrates with the platforms. We can't forget this: you've got to stay trendy, like Sage used to say. It was OpenStack, now it's Kubernetes and OpenShift and the CSI stuff, and then CephFS and S3 make it integrate with everything. That's why anyone can download something from GitHub...
...and it's got an S3 back end, right? That's probably why we're even here. It succeeds because it protects our data despite faults everywhere. Someone's going like this... I mean, no, come on, it does. We didn't have any corruptions in 10 years, no corruptions. It's a core building block that thousands are trusting, at least in our environment. It continues to grow because it has been flexible as we expanded; I think we've been through four hardware generations with the same bits. And the user community...
...has always been the strength of Ceph. Let's not forget it was built up because of all of these Ceph Days. The community manager is over there on his phone... oh, taking a picture, that's why. Yeah, COVID hurt us, really; the community is kind of... we're not doing the best, but this is the best thing to happen in a few years now. So I hope to see everyone at Amsterdam as well, because a good community is good for all of us.
Okay, whoa, this is a pretty long outline; I'm probably out of time. Lessons learned and future directions, key takeaways from 10 years of operating Ceph at CERN. All right, so what were my key takeaways from running Ceph at CERN over the past 10 years?
"Proper planning and design"... this is real, guys, it's terrifying: "proper planning is critical". You can read this later, it's hilarious. "Monitoring and maintenance are essential, resilience is key, testing is crucial, and collaboration is important. Running Ceph at scale is a complex task. It's important to collaborate with other organizations and individuals who have experience. This includes participating in the open-source community, attending conferences and meetups, and sharing best practices with other Ceph users."
Oh my God. So, yeah, I just wanted to thank my brilliant CERN colleagues who helped build our IT and storage infrastructure, the kind and knowledgeable community, the hosts today and those who helped organize this, and, of course, the giants who build Ceph: we're standing on your shoulders, really. That's the end of my talk. Thank you.
B: Thank you. Two quick questions. Five years of retention for you is like 25 for the rest of us, and I'm pretty sure I'm speaking for everybody here, so congrats on keeping all that data there. But how many operators do you actually have running this back end, like managing this on a day-to-day basis? Two warm bodies, 24/7, just on call?

A: Oh, no.
No, I mean, no, no, no. So CERN has... we have our data centre, it's old school (we're building a new data centre), and there's one person who sits there all the time, just looking for things to turn red, who then calls down a list of phone numbers. Okay, that's the on-call type stuff.
The Ceph team is two people, basically, two warm bodies, maybe with a couple of students. But that's part of a storage team which is like 40 or 50 people doing all kinds of things, which is part of an IT department with 250 people, right? You know, you don't need many people if you're focusing on one technology.