From YouTube: Ceph Days NYC 2023: 100 Years of Sports on Ceph
Description
Presented by: Frank Yang
Working together with a major American sports league, we built a multi-site 40 PB active archive housing over 100 years of game video and audio assets by using Ceph as the foundational storage technology. Along the way, we learned many lessons about architecting, deploying, and operationalizing Ceph from the vantage point of a large, modern, and rapidly growing media company. We would like to share our experience and learnings with the community to help others traveling a similar road.
https://ceph.io/en/community/events/2023/ceph-days-nyc/
A
So my name is Frank Yang, and that's the title of the talk today. By the way, we didn't get permission to use the league's logo and name, but they are here in the audience, so I'd like to take the time to thank our partner. They had the vision to start this project, and we are very fortunate and appreciative to have been invited on the journey.
A
Right, with that: the starting point of this project, 100 years of sports, is a very large set of data, and it's mostly video data. The motivation comes from where this data sits today.
A
The data is stuck on tape today, and it's a large and growing set of media data. Now, this isn't just regular data. It's irreplaceable data, not test data that we can regenerate; it's of historical and cultural importance. So a lot of the motivation is around preservation, making sure the data is available not just now but for future generations. These are the crown jewels of the league, and so, in addition to preserving the data, we want to be able to actually do something with it.
A
So it's not just about preserving the data, but being able to compute on it, to run analytics on it, and to monetize it. Okay, so what kind of infrastructure, what kind of storage do we need? This is where the motivation is to open up this infrastructure, not just for today but for the future. The size of the data is growing, and the types of data being put in are changing as well.
A
These days, not only do we have more cameras and more angles, we have higher frame rates, and we have data now that isn't video: the metrics, not just from the game itself but from the audience and from other use cases in the stadium. So the requirements for putting this infrastructure together are that not only do we have to have robust data, we want fast access to that data.
A
The data needs to be not only durable and available, but accessible from multiple sites, anywhere, anytime, at the fastest speed we can make possible. We want to be free from any vendor lock-in, hence Ceph and the other open source projects you'll see later that are used in this project. And, fundamentally, the cost structure of this needs to be better than simply dumping the data into the public cloud.
A
Okay, so those are the criteria we set forth as we started this project. So why Ceph? Open source is definitely one of the major motivations, as I mentioned earlier, but among the other open source options out there, Ceph is the only one, to us anyway, that's viable, because it's proven. It's got a large community, which is why we're all here, and it's actively being developed; there's stuff coming down the pipe.
A
It's not a static open source project, and it's free from any particular choice of hardware vendor; the Ceph implementations out there are using not just different vendors but different types of hardware, so that's very important to us as well. And at the end of the day, it's all about control of the data. This data is staying within the league; they want full control over it, and they want it available to do what they need with it.
A
And so that's why we went down the path with Ceph.
A
So what did we set out to build? This is a multi-year, multi-phase project. In the first phase we are putting together 40 petabytes of what we call an active archive. The bars we set for ourselves in this first phase are, one, as I mentioned earlier, the cost structure of it. This is all on-prem.
A
You know, in colo. So when you add together all the hardware, the folks involved, all the software requirements and so forth, it needs to have a lower cost structure than being in the public cloud. We need high durability; this is using Ceph as an object store, with S3-type access.
A
So we want to be comparable to cloud-durability type numbers. Highly available, like I said: access anywhere, survive a site failure, survive hardware failures, survive software failures, so there's a lot of high availability involved in this. And we want to be able to scale to hundreds of petabytes. We may start with 40, but like I said, the data keeps growing, so we want this thing to be able to scale to hundreds of petabytes and perhaps beyond. The next one is very important.
A
It's about operational efficiency. The folks in the audience here, or a lot of them, are all experts in Ceph, but we don't want the entire IT team to have to be Ceph experts. We want to make this operationally efficient. We need your usual IT folks to be able to manage a cluster of this size, and it's not just the storage itself but all the things around it.
A
So it's a turnkey, easy-button for software-defined storage, essentially, including all the other pieces of what it takes to be able to access the data, and we want the ability to compute. Like I said, it's not just the storage; it's the compute associated with it, the networking, the access, the user controls, the certificate controls, and all the other aspects of making this possible.
A
So this is what we ended up with. It took a lot of planning, but this is what's built today. There are two sites, not surprisingly, one on the west coast and one on the east coast of the United States, and we have a media manager whose job in this first phase is to take the data from the tape archive and put it into these two sites as copies. So the two sites basically hold identical copies of the data.
A
There's an extra copy that goes into the public cloud, but most of the active computing is done directly on-prem, within these sites. You'll notice that, in addition to the production cluster, which is the eight racks per site holding the production data, we have, very importantly, these sandboxes, and I can't stress enough the importance of having them. The sandboxes are essentially mini replicas of the production cluster: they have the identical hardware.
A
They have the identical setup; they just have less of it. Same versions running, same setup, and that's where the staging happens before we push anything into production: upgrades, loading new software, configuration changes all happen on the sandbox first. But it's also where we can run experiments, so before we decide what we want to put out there, the performance tuning is all done on the sandboxes first. It's also the canary in the coal mine.
A
In case something does happen, and it has happened; again, I can't stress the importance enough. Any bugs, any usability issues, any user errors are all detected first on the sandbox before they, hopefully, ever show up in the production environment. The networking within each site is all 100G networking; we want to make sure that networking is not the bottleneck. There are also 200-gig links between the two sites over a private network.
A
This entire lower half of the diagram is all private network, where we make sure the bandwidth is not the bottleneck.
A
Okay, so what's in these racks? The OSDs are in JBODs. The JBODs are zoned into two halves, and each half basically has an OSD server managing 53 OSDs. Then we also have, within the clusters, the compute nodes, which today mostly serve the purpose of being the RADOS gateways and the load balancers.
A
How the racks get filled is all planned out: in the first phase the racks are only partially populated, but we've already mapped out how many racks and where things will be placed as we upgrade. In fact, this year we're in the process of doubling the size, from 40 petabytes to 2x that. And just some interesting facts: when these racks are fully populated, each rack draws about 20 kilowatts, so that's about 40 refrigerators' worth of power being consumed
A
on each rack; that's the density. And then there's the weight of the racks: add up not just the servers but all the physical drives, and these are rotational drives, for cost reasons.
A
And, you know, it's an archive, so we use rotational drives. If you add up the entire weight of one rack, it's over a ton; it's about the size of a small car, and we have many of these. So, just some interesting facts there.
A
The current state: it's live, it's working right now, and we are actually putting data into it. The total storage between the two sites is about 36 petabytes, so about 18 per site. Right now we have 44 OSD servers. The number of JBODs is 22, with 100-plus drives in each one, so in total we have over 2,000 drives spread across the two sites.
A
We've got 16 compute nodes, like I said, doing mostly RADOS gateways and load balancers, and about 20 terabits per second of networking capacity combined between these sites. Every node has 400 gigabits per second of links out of the server, for redundancy reasons and also for bandwidth reasons. And, as I mentioned earlier, there's 200 gigabits per second between the two sites.
A
So remember the criteria we set for ourselves. On the overall economics, we looked at a five-year TCO, and I think we hit the target. The hardware is definitely an up-front cost, but if you look at the five-year amortization of the hardware compared to a typical $250k-per-petabyte type of cloud cost,
A
we come out ahead. A lot of that is that we don't have the overhead of a lot of the software licenses that would otherwise be required, whether you build this on-prem or use the cloud. But, of course, disclaimer: your mileage may vary. Our partner has pretty good buying power, so depending on your discounts and your buying power, your mileage may differ.
B
C
C
A
So this is comparing against the public cloud costs, if you were to just search for the costs on the internet. Included in here are the hardware costs, some software cost overheads, and the number of people required to operate it, amortized over five years.
A
All right, so what were the pain points? How did we build it, what did we go through, where were the challenges? I mean, Ceph is not easy, but that wasn't the main problem. A lot of this is the careful planning up front. Like I said, this is not a test environment; this is not a cluster we're going to put together and tear down.
A
This has to last for years and longer, so a lot of it is careful planning up front: planning out the rack elevations, selecting the hardware, both for the initial phase and for the future expansions, and you have to take cost into consideration. There's plenty of redundancy and performance planning built into the hardware, but we don't have an infinite budget to do this either, and the same goes for the networking aspect of it,
A
which I talked about. At the end of it, it's just good old systems engineering; it's not just the Ceph aspect of it. Earlier we heard about how to make things simple, to make them more consumable for the users. Bringing up Ceph is one thing.
A
Ceph has tons of knobs that you can tune depending on how you're using it, and we have to distill that down to something that's easier for an operator who's not a Ceph expert to consume and use. And then there are other things that are needed to make this into a service, an active-archive service that can be consumed. So it takes a lot of things other than Ceph to make this possible.
A
So here's just a sample of all the other open source projects that are used to make this possible. We've got things that interact at the hardware level, talking directly either to the Linux environment or to the hardware, orchestrating or pulling data out. We've got things on the platform side for orchestration across the different servers, so that they're managed like clusters. And we've got things that are user-facing, APIs and GUIs written to abstract things away and make them simpler.
A
To give an example: we have data that's collected from the OSD nodes, say. We have agents running on there collecting the data, and it's sent back to the microservices running on the back end over Kafka. So we use Kafka as a reliable bus to talk, at least in one direction, getting the metrics back to the back end. At the back end, we're storing state.
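As a rough illustration of that agent-to-backend path, here is a minimal sketch of an agent publishing one metric sample onto a Kafka topic. It assumes the segmentio/kafka-go client; the broker address, topic name, and message schema are invented for the example and are not from the talk.

```go
// Hypothetical sketch: an agent on an OSD node publishing a metric sample
// onto a Kafka topic, assuming the segmentio/kafka-go client.
package main

import (
	"context"
	"encoding/json"
	"log"
	"os"
	"time"

	"github.com/segmentio/kafka-go"
)

// MetricSample is an assumed message shape; the real schema is not described in the talk.
type MetricSample struct {
	Host      string    `json:"host"`
	Name      string    `json:"name"`
	Value     float64   `json:"value"`
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	// Broker address and topic name are placeholders.
	w := &kafka.Writer{
		Addr:     kafka.TCP("kafka.backend.internal:9092"),
		Topic:    "osd-node-metrics",
		Balancer: &kafka.Hash{},
	}
	defer w.Close()

	host, _ := os.Hostname()
	sample := MetricSample{Host: host, Name: "osd_bytes_used", Value: 123456789, Timestamp: time.Now()}

	payload, err := json.Marshal(sample)
	if err != nil {
		log.Fatal(err)
	}

	// Key by host so samples from one node stay ordered on one partition.
	if err := w.WriteMessages(context.Background(),
		kafka.Message{Key: []byte(host), Value: payload},
	); err != nil {
		log.Fatalf("write to kafka: %v", err)
	}
}
```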
A
These are the states: some of the state for Ceph, some of the desired state for all the applications, their versions, and the configurations for the networking, etc. The states are stored in Postgres; that's what Postgres is there for.
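A minimal sketch of what storing a desired-state record might look like, assuming Go's standard database/sql with the lib/pq driver; the table, columns, and values are illustrative only, not their actual schema.

```go
// Hypothetical sketch: recording a desired-state record in Postgres.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Connection string is a placeholder.
	db, err := sql.Open("postgres", "postgres://ops:secret@state-db.internal/archive?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Upsert the desired version of one application on one node (illustrative values).
	_, err = db.Exec(`
		INSERT INTO desired_state (node, application, version)
		VALUES ($1, $2, $3)
		ON CONFLICT (node, application)
		DO UPDATE SET version = EXCLUDED.version`,
		"osd-host-01", "radosgw", "17.2.6")
	if err != nil {
		log.Fatal(err)
	}
}
```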
A
We have metrics that also come off this bus and get built into a time series, and the time series are stored in Prometheus. We're also using Prometheus to generate alarms. I guess we could have people staring at dashboards all day long, which we do provide, but we also want any failures or threshold crossings, any events of interest, to generate automatic emails and Slack messages to the operators.
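A minimal sketch of the metrics side, assuming the prometheus/client_golang library: a backend service exposes a gauge on /metrics for Prometheus to scrape, and the alerting (email, Slack) would then be defined on top of that time series. The metric name, labels, and port are made up for the example.

```go
// Hypothetical sketch: exposing a gauge for Prometheus to scrape and alert on.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var osdUp = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "archive_osd_up",
		Help: "1 if the OSD is reported up by the agent, 0 otherwise.",
	},
	[]string{"site", "host", "osd"},
)

func main() {
	prometheus.MustRegister(osdUp)

	// Values would normally come from the Kafka consumer; hard-coded here.
	osdUp.WithLabelValues("east", "osd-host-01", "osd.42").Set(1)

	// Prometheus scrapes this endpoint; threshold-crossing alerts are
	// configured in Prometheus itself.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9402", nil))
}
```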
A
For some of the states and metrics, we need fast access, because a lot of the reactions and remediations are built in. If something happens and it's something the software can remediate, we're not waiting for an operator to come in and click buttons. So for things that require fast action, the data needs to be available quickly.
A
Some of that data is cached in Redis, because the microservices are running across different containers, and storing it in one process's memory isn't sufficient when multiple containers need to access the data. That's what the Redis caching is for. So that just gives a sample of all the things involved to make this possible, and there's logic coordinating all of it.
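A minimal sketch of that shared cache, assuming the go-redis client: one container writes a piece of recently discovered state, and any other container can read it without re-querying the cluster. The address, key, and value are illustrative.

```go
// Hypothetical sketch: caching frequently needed state in Redis so any
// microservice container can read it quickly.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "redis.backend.internal:6379"})

	// One container writes the latest discovered state with a short TTL...
	if err := rdb.Set(ctx, "cluster:east:health", "HEALTH_OK", 30*time.Second).Err(); err != nil {
		log.Fatal(err)
	}

	// ...and another container reads it without touching Ceph again.
	health, err := rdb.Get(ctx, "cluster:east:health").Result()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("cached health:", health)
}
```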
A
This is not just scripting, not just some guy running scripts by hand. This is logic that's all built in; it's programmatic, and we can do a lot more when we're using programmatic code, like Go, to tie all these open source pieces together. This is where all the determinations happen: comparing the discovered state versus the desired state. If they don't match, is this something to remediate immediately, or is it something I need to alarm on and have a person come in and get involved?
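A minimal sketch of that desired-versus-discovered comparison in Go; the types, field names, and remediation hooks are invented for illustration and are not the actual implementation described in the talk.

```go
// Hypothetical sketch: compare desired vs. discovered state, remediate what
// the software can fix automatically, alarm for anything needing a person.
package main

import "log"

type State struct {
	Version string
	Running bool
}

func reconcile(name string, desired, discovered State) {
	switch {
	case desired == discovered:
		return // nothing to do
	case desired.Running && !discovered.Running:
		log.Printf("%s: service down, restarting automatically", name)
		// restartService(name) would go here (hypothetical helper)
	case desired.Version != discovered.Version:
		log.Printf("%s: version drift (%s -> %s), alarming operator",
			name, discovered.Version, desired.Version)
		// sendSlackAlert(...) / sendEmail(...) would go here (hypothetical helpers)
	}
}

func main() {
	reconcile("radosgw@osd-host-01",
		State{Version: "17.2.6", Running: true},
		State{Version: "17.2.5", Running: true})
}
```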
A
It's the logic for things like checking the RADOS gateways; putting in the certificates, managing the certificates, verifying the certificates; user access control, dishing out the credentials for who has access and who doesn't; and then also a different layer of access, basically RBAC for the infrastructure, for the storage, for the S3, etc.
A
And we've got agent collectors running on all the nodes that, like I said earlier, take all the data and feed it back into the back end.
A
To give another, real-life example of how this came into play: not that long ago, in the sandbox — well, because it's a sandbox, we tend to do a lot of experiments in there. We deploy, we tear down, we experiment, we create failure scenarios just to test recovery, so there's usually stuff left over in there.
A
We try to clean it up as much as possible, but we didn't realize it at the time. Part of the orchestration is done with ceph-ansible, and we were trying to remove a node, using ceph-ansible to purge the node from Ceph. ceph-ansible does what it thinks is the right thing: it wants to purge the OSDs belonging to that Ceph node. But it doesn't query Ceph.
A
It goes to /usr/lib — oh sorry, /var/lib/ceph/osd — and just looks for the files in there and assumes: hey, there are all your OSDs, let's go purge them. Well, it turns out
A
some of those files were stale leftovers for OSDs that actually lived on other nodes. Perfect example, great use case for the sandbox. Once we realized that, it was very easy for us to go change the logic; this is where the power of using a language like Go comes in. You can compile quickly, you can change quickly, we can push updates out quickly. So in a matter of a day or so, we could go out there and discover, through our agents, all the files in there, and we can query Ceph.
A
We can compare and make decisions about which files in there are real and which are no longer valid, stale files that need to be removed. And we added checks: okay, if you're going to purge a node, here are the OSDs you think you're going to remove; oh, and by the way, before you remove anything, let's run software to compare against what's actually in Ceph before you go and execute.
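A minimal sketch of that kind of pre-purge check, assuming the standard /var/lib/ceph/osd directory layout and the `ceph osd find` CLI command; the surrounding program structure is invented for illustration rather than taken from their actual tooling.

```go
// Hypothetical sketch: for every OSD directory found locally, ask the cluster
// which host that OSD really belongs to, and flag anything that does not
// belong to this node as a stale leftover rather than something to purge.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"os/exec"
	"strings"
)

type osdFind struct {
	Host string `json:"host"`
}

func main() {
	hostname, _ := os.Hostname()

	// Directories such as /var/lib/ceph/osd/ceph-12 are what ceph-ansible
	// uses to decide which OSDs to purge.
	entries, err := os.ReadDir("/var/lib/ceph/osd")
	if err != nil {
		log.Fatal(err)
	}

	for _, e := range entries {
		id := strings.TrimPrefix(e.Name(), "ceph-")

		// Ask the cluster itself about this OSD before trusting the directory.
		out, err := exec.Command("ceph", "osd", "find", id, "-f", "json").Output()
		if err != nil {
			fmt.Printf("%s: not known to the cluster, stale directory\n", e.Name())
			continue
		}
		var info osdFind
		if err := json.Unmarshal(out, &info); err != nil {
			log.Fatal(err)
		}
		if info.Host != hostname {
			fmt.Printf("%s: belongs to %s, do NOT purge from this node\n", e.Name(), info.Host)
			continue
		}
		fmt.Printf("%s: confirmed on this node, safe to include in the purge\n", e.Name())
	}
}
```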
A
So these are the kinds of things where the logic ties together not just the Ceph aspect of it but all the other peripherals and all the other applications. That's the power of what this logic can do, and it's actually necessary if you want to operate reliably, long-term, and easily in this type of large, multi-cluster, multi-site environment.
A
Okay. Making it robust, making it reliable, gets to be more difficult, but you've got to persevere through it, and once you get past that, Ceph is a great platform. It's great not just for storing a media archive but for general-purpose storage. The archive I talked about is not only media; they use it for VM backups, they use it for any storage, and so it's a very worthwhile investment.
A
Systems engineering, like I said: all the work in terms of hardware, networking, and the other software and applications; how are you going to use it, how are you going to secure it. Those are all important things, and again they need to be planned out not just for day zero but for the long term, for where you think the clusters are heading. And then the automation, the smarts. I think at any given point in time
A
we all become experts in some area of Ceph or something else, because of a bug we're debugging or some feature we're writing, and then we move on to something else and we forget how smart we were two years ago. We need to embed those smarts not just in documentation but into the software itself. Hopefully a lot of that is embedded in Ceph itself,
A
and Ceph versions do progress, but for the stuff that's not inside, the smarts that are required for the interaction between Ceph and either the infrastructure itself or the applications,
A
those smarts need to be embedded in software. That's how, over generations, as people come and go and we move on to newer and better things, the automation remains smart and is able to do a lot of things on its own, without having to go back to documentation or call the original guy that wrote the feature, and things like that. And then, live and die
A
by the QA and the sandboxes; I can't stress the importance of that enough. Performance tuning, staging, debugging, troubleshooting: it's been a lifesaver for us multiple times, many times over. And, you know, the economics: having this environment is all-you-can-eat. Once we have this infrastructure running, especially with the sandboxes, any experiments we want to try, anything we want to do, it's there for us. It's an all-you-can-eat type of scenario, so that's great.
A
So I'll close there. Again, thank you to our partners for making this possible, and I'll take questions.
D
The tape-to-Ceph thing is interesting: what was the business justification, or the reason, to move things that you said are the crown jewels off of a medium like tape and onto Ceph?
A
Well, tape — those of you who have dealt with tape, I mean, it's an aging industry. The hardware is harder and harder to come by, the upgrade cycles are tough, and the whole upgrade process is crusty at best. So the motivation there is: if you leave it on tape, how much life does it have?
A
So that's one of the major justifications: if you don't do it, there's actually a non-zero chance that the data may not be there anymore years from now. Then it's just the economics of comparing tape to rotating drives, and that's how it was justified.
E
About how you did the performance tuning for the cluster: did you end up with something like an 8+3 erasure coding?
A
Yes, thank you, I forgot to mention that. In the production environment the erasure coding is 8+4, so 50% overhead. It was chosen based on the number of racks that were available and the constraints between the racks available at one site versus the other; we could add more racks at one of the sites, but given the limitations, and wanting to keep the two sites identical, that's how we chose it.
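For reference, the arithmetic behind that 50% figure for an 8+4 profile (k = 8 data chunks, m = 4 coding chunks), shown as a generic worked example rather than their exact capacity numbers:

$$
\text{space overhead} = \frac{m}{k} = \frac{4}{8} = 50\%,
\qquad
\text{raw capacity required} = \text{usable} \times \frac{k+m}{k} = \text{usable} \times 1.5,
$$

and a pool with this profile stays readable as long as no more than m = 4 of the 12 chunks of an object are lost.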
A
And then, when we do the performance tuning, a lot of that, besides tapping into the community for what's possible, is empirical. We have nice automations for doing that: being able to try the number of RADOS Gateway daemons, the number of daemons per node, the number of load balancers we put in front of them, and how those load balancers are distributed. We ended up with HAProxy as the load balancer, load-balancing three to four RADOS gateways per server, on the same server, and that is a building block that gets repeated
A
N times, as many as we need for the type of access. In the beginning we're basically just taking data from tape and throwing it into Ceph, so we try to have many of these; later on they may get reduced.
B
How will you maintain and check consistency between the different sites and also the public cloud as you grow, in terms of the number of objects stored and the amount of data stored, especially after you've had some kind of outage anywhere in any of the three? The consistency of the data between the three copies, basically.
A
Yeah, well, that's a good question. At the infrastructure level right now it's difficult, and we're not quite doing it there, in terms of querying Ceph itself. The media manager that you're seeing up on top there handles how the data is replicated, which site it sits in, and whether the copies are consistent; that's actually being done at that higher level right now.
B
A
That's right, so we're not. We actually did play with the Ceph replication at one point, doing it across the two sites, and maybe in something coming later we will have it, but right now Ceph doesn't have the ability to go replicate itself into a public cloud, so that higher-level mechanism is needed anyway, and it just became the de facto approach.
F
So this is not so much a question; I think this is an excellent project. I'd like to pose a question to the rest of the team here. I know you say that tape is an aging technology, but I want to let you know that tape — because we have a tape business, I can tell you that right now that business is growing like crazy, because of the hyperscalers; there's just too much data.
F
In order to prevent climate change, you need to find a way to sustain it; again, your project is great. So what we're thinking about is whether we can come up with something, because the problem with tape is that the interface is hard to use. Think about it: you get some kids out of college and ask them about tape; like myself, they have no idea what rewind means. There's a forward and a rewind; what is the rewind?
F
Remember that? So we're thinking about whether we can put an object storage interface in front of tape, to make tape more consumable. The problem is not so much that tape technology is aging; it's that tape is hard to use and hard to manage. If we were to put an object interface in front of tape, could we solve that problem?
A
I suppose, to the question: that's a valid approach, yeah. And like I said, it's a question that's not just for me; it's for the team here. No, I think the economics of it, and maybe the environmental aspect of it, definitely make sense.
A
But there are other considerations to keep in mind as well. It is also about the access: the fast access, and also random access to any part of the data. Sometimes we're accessing even parts of a file.
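As an illustration of that kind of partial-object access, here is a minimal sketch of an S3 ranged GET against a RADOS Gateway endpoint, assuming the AWS SDK for Go v1 (which works against Ceph RGW's S3 API); the endpoint, bucket, and key are placeholders, not from the talk.

```go
// Hypothetical sketch: fetch only a byte range of a large media object
// through the S3 API exposed by RGW.
package main

import (
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Endpoint points at the RGW / HAProxy front end; credentials come from
	// the usual AWS environment or shared config.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:           aws.String("us-east-1"),
		Endpoint:         aws.String("https://rgw.east.internal"),
		S3ForcePathStyle: aws.Bool(true),
	}))
	client := s3.New(sess)

	// Fetch only the first megabyte of the object.
	out, err := client.GetObject(&s3.GetObjectInput{
		Bucket: aws.String("game-video"),
		Key:    aws.String("1956/world-series/game5.mxf"),
		Range:  aws.String("bytes=0-1048575"),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer out.Body.Close()

	n, _ := io.Copy(io.Discard, out.Body)
	fmt.Printf("read %d bytes\n", n)
}
```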
F
E
A
Well, yeah, definitely. Right now, just putting the entire petabytes of data into Ceph, compared to robot arms going in there and fetching a tape,
A
that's already an improvement, a huge improvement. With tape there are caches — there are servers and rotational drives there caching it — but not at this size; now we have the entire content.
A
Well, some of that is also for the future. The purpose of those SSDs, part of it, is so that the metadata can be stored there, but we do have expansion slots available on those OSD servers, so there is talk of it. That's again the beauty of Ceph: we can have massive petabytes of data on rotational drives, but we can also have small pools alongside.
A
B
A
Well, some of the tests we do for reliability are to protect against failures, so we do try to mimic things like a hardware failure or a drive failure. And the project has been around for a couple of years now, so we have actually had real hardware failures and real drive failures in the production environment that we had to go resolve. So I don't have the numbers in terms of the actual
A
number of nines; we would have to tally up all the run times and the number of times we encountered issues. But, you know, we haven't lost data in the production environment.