From YouTube: Erasure Code at Scale - Thomas William Byrne
Description
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Thomas William Byrne, Science and Technology Facilities Council Linux System Admin
So this is where I'm based: the Rutherford Appleton Laboratory in Oxfordshire, in the United Kingdom. There's a bunch of relatively large experiments on site. They're mainly particle physics experiments, but they have a whole host of users; biologists and chemists come and use them to study other things.
The LHC experiments at CERN produce a fairly ridiculous amount of data; I think this year they're meant to produce around 50 petabytes, and all of that data needs to be stored and analysed somewhere. The WLCG, the Worldwide LHC Computing Grid, was set up with the aim of doing exactly that. It's a collection of sites around the world involved in the storage and analysis of that data, and we're one of the biggest sites involved in the project.
We have over 30 petabytes of disk storage, plus tape storage and analysis machines. Currently around 10 petabytes of our storage is Ceph; that's what I'm working on, and as we go on more of that storage will become Ceph, so we'll be ramping that up. Also on the map there I have the Institute of High Energy Physics, which is just over the other side of Beijing, and they're also involved in the collaboration.
So what do you actually need to provide if you want to provide grid storage? It's pretty simple for the most part. Traditionally the storage was all hierarchical, filesystem-like storage, but there's no real need for that anymore. The experiments have moved away from using that sort of thing; for the most part they're just using the storage as an object store.
So one of the things we were looking at when we were replacing our storage was whether we could use an object store, and that was one of the reasons we started looking at Ceph. This is very much high-throughput computing; we're not talking about high-performance computing here. The problems are for the most part embarrassingly parallel and broken down into very small parts, so we're not particularly worried about single-stream performance, and we're not particularly worried about latency. There's no user interaction going on here.
All of this is jobs that have been scheduled. And then lastly, there are a few non-mainstream protocols that you need to support to actually work on the grid. Specifically those are GridFTP, which is used for external transfers, shuffling data between sites, and XRootD, which is used internally; the jobs fetch data off the storage with it.
So back around the release of Firefly we started looking into using Ceph for this role, replacing our disk-only storage. Right from the get-go we had an incredibly tight limit: the cost per terabyte had to be low and we weren't going to be able to push that. So there were going to be a couple of caveats to using Ceph.
Mainly, we were not going to be using replication, so we were looking at erasure coding from the beginning. We were always going to have large storage nodes; we weren't going to have a nicely specced Ceph cluster built for lots of IOPS. We were going to be looking at 30-plus drives in each storage node, and we weren't going to have nice things like SSDs for journals; everything was going to be co-located on the data disks.
So the way we ended up supporting the grid protocols we needed was by writing our own plugins directly on top of librados, or rather on top of libradosstriper. We had some help from CERN doing this; they were interested in using it for a different reason, so we were able to collaborate and make sure we got what we needed. We did try to get the experiments to use S3, but there was definitely limited success there.
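To give a feel for the layer those plugins sit on, here is a minimal sketch of writing and reading an object through the Python rados bindings; the pool and object names are hypothetical, and the real gateways are GridFTP/XRootD plugins against librados/libradosstriper rather than this script.

```python
# Minimal sketch of talking to a Ceph pool via the Python rados bindings.
# 'grid-data' and 'example-object' are placeholder names; the production
# gateways are GridFTP/XRootD plugins built on librados/libradosstriper.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # cluster config and keyring come from ceph.conf
cluster.connect()
try:
    ioctx = cluster.open_ioctx('grid-data')            # open an I/O context on the data pool
    ioctx.write_full('example-object', b'payload')     # write a whole object in one call
    print(ioctx.read('example-object'))                # read it back
    ioctx.close()
finally:
    cluster.shutdown()
```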
That's always the goal with this: when we have new people coming on and using it, we try to push them towards industry-supported stuff and get them using RADOS Gateway. It makes our lives easier and it means we have fewer custom solutions in the end. Oh, and at some point during the whole process we came up with an acronym, so the cluster is called Echo.
So this is a quick diagram of how we actually get data in and out of our Ceph cluster. Starting on the right-hand side, that was basically all we had: our cluster, and then a couple of external gateway machines, which are just big machines with lots of networking that run all of the protocol servers we need. Data coming in from and going out to external sites goes through them, as did data going to the worker nodes.
That wasn't ideal, and it was quickly identified as a generally bad idea, so we wrapped up the XRootD server with the plugin and the various things it needed (the ceph.conf and the correct keyring) into a container, and now that runs on all the worker nodes. So when a job requests files on one of the worker nodes, it thinks it's speaking to the external gateways, but there's a little bit of host redirection going on.
Cool, so this was the first lump of storage we bought for it. There's not a great deal to say here; it's fairly standard. It started its life on Jewel, and we quickly upgraded to async messenger, sorry, to Kraken, because we had a lot of problems; I'll talk about them in more detail over the next few slides. It was upgraded to Luminous recently. There's nothing particularly interesting about this hardware.
So we had some decisions to make when we were working out how we were actually going to do the erasure coding, and this was back in 2014-2015; there wasn't a lot of information out there about what people were doing with large-scale erasure coding. So we spent a lot of time messing around with this, getting things to break, and then finally getting things working.
We ended up with eight data stripes and three parity stripes (8+3), which gave us an overhead we could afford, in the sense that we could buy enough storage to meet the pledges we needed to meet, and it gave us decent security. We were starting to see issues with RAID 6 on our existing system; we were losing data because things were breaking during rebuilds, so being protected against three failures was a lot better.
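For the arithmetic behind that choice, here is a quick sketch; the only inputs are the k and m values of the 8+3 profile described above.

```python
# Space overhead and fault tolerance for an erasure-coded pool with
# k data chunks and m parity chunks, using the 8+3 profile described above.
k, m = 8, 3

raw_per_usable = (k + m) / k                 # raw bytes stored per byte of usable data
overhead_pct = (raw_per_usable - 1) * 100

print(f"raw:usable ratio = {raw_per_usable:.3f}")  # 1.375, versus 3.0 for 3x replication
print(f"space overhead   = {overhead_pct:.1f}%")   # 37.5%
print(f"survives loss of up to {m} shards per placement group")
```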
Initially we were interested in going for a higher number of data stripes. Obviously, the more data stripes you have, the smaller the overhead is for the same amount of data security, so that was something we were interested in. But what we found was that it was very hard to keep a cluster stable when you've got placement groups that contain 19 OSDs each; things didn't work very well. This was again pre async messenger, so this was on Jewel.
Things have changed since then. The rest of these settings are very much the defaults. We were already taking a fairly large gamble by using erasure coding at the time; there weren't a lot of people using it, and it seemed sensible to stick with the things most people would be using, so that we hopefully weren't the first people to run into problems when things went wrong.
Cool, so a little bit about what our pools are actually made up of. Our largest data pool is just over 4 petabytes now, with 2048 placement groups. So we've got a little bit over 2 terabytes of data per placement group, which is fairly large compared to what a lot of people are doing.
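As a rough sketch of that per-placement-group arithmetic, using the pool size and PG count quoted above and the shard count implied by the 8+3 profile:

```python
# Rough data-per-PG arithmetic for the pool described above.
pool_bytes = 4 * 10**15   # just over 4 PB
pg_num = 2048             # placement groups in the pool
k, m = 8, 3               # 8+3 erasure-code profile

data_per_pg_tb = pool_bytes / pg_num / 10**12
shards_per_pg = k + m     # each EC placement group spans k+m OSDs

print(f"~{data_per_pg_tb:.0f} TB of data per placement group")  # ~2 TB
print(f"each placement group spans {shards_per_pg} OSDs")
```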
It does make things like deep scrubs take well over an hour, which can be a bit annoying sometimes, and this is something we are looking at; we will be increasing this number, and we're going to be aiming for around one terabyte per placement group, which I think is a reasonable target. One of the things that bit us early on is that if you've got very large EC placement groups with lots of OSDs in them, the recommendations for how many placement groups you should have in your cluster
don't necessarily work particularly well: you end up with far too many placement groups, or you end up with each OSD being part of far too many placement groups, which can lead to serious performance issues. In terms of having so much data in each placement group, we're not seeing any issues; when everything is working well we're getting the throughput we need, and in that respect it's working fine.
What we were seeing issues with in the early days was actually just the amount of communication involved in a large placement group like this, when you've got 11 OSDs needing to communicate with each other in order to peer that placement group. What we ended up with was a situation where peering would stop OSD heartbeats from getting through, and so OSDs would end up knocking other OSDs offline; they would be marked down.
So we spent quite a lot of time tuning, trying to figure out how we could stop that happening. This is what worked for us: essentially making the OSDs slightly more resistant to being marked down and out so early, and that made it a lot more stable. I'll say again, this was all in the early days, on Jewel, so this was pre async messenger; things have got a lot better with async messenger.
So I'm not going to dwell on this too long; there's not a lot of interesting information here. Our CRUSH setup we've kept very simple because of the nature of what we're doing: we're not worried about trying to have as much availability as possible, and so it was deemed acceptable that we would be in danger of losing availability if we lost power to a rack. That sort of thing doesn't happen often; we've got dual power supplies in all of our racks and our networking is redundant for the most part.
So I mentioned earlier that we use libradosstriper throughout for the plugins behind our external protocol servers. When objects come in from the experiments they come in at a range of sizes, so it makes sense to stripe them down to a manageable size of object on the disks, and libradosstriper does that. As I think was mentioned earlier, the default stripe size for libradosstriper is somewhere around 4 megabytes.
That number makes sense when you're talking about replicated pools: you end up with 4-megabyte objects on the disks, which is a completely reasonable size. When you start talking about large numbers of data stripes in erasure-coded pools, you suddenly end up with very small objects on the disks; in this case we would end up with 512-kilobyte chunks.
If we were looking at higher k numbers, which we were, it would be even smaller objects. It suddenly means that your actual performance is incredibly dependent on how many IOPS you can do, whereas for the use case I've described we're mainly worried about throughput. So we did a lot of investigation into what you should actually do: what's a good stripe size, and what size of object on disk should you actually aim for?
In our case that meant going with libradosstriper striping things into 64-megabyte stripes, which means at the end of the day you've got 8-megabyte chunks on each disk, and that works for us. It was the least bad option and it works fairly well.
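A quick sketch of that chunk-size arithmetic, using the stripe sizes discussed above and k from the 8+3 profile:

```python
# On-disk chunk size for a striped object written into a k+m erasure-coded pool:
# each rados object produced by the striper gets split into k data chunks.
def chunk_size_mb(stripe_size_mb: float, k: int) -> float:
    return stripe_size_mb / k

k = 8
print(chunk_size_mb(4, k))    # 0.5 -> the default ~4 MB stripes give 512 KB chunks on disk
print(chunk_size_mb(64, k))   # 8.0 -> 64 MB stripes give the 8 MB chunks we aim for
```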
And the bottom point there is basically what I've just said: if you then lose a placement group, you've suddenly lost a little bit of every single object in your cluster. So I've described a lot of the decisions that went into actually building it and a lot of the considerations we had to think about, and now I'm going to talk a little bit about what it's actually like living with it and how it works for us. I guess the first easy thing to talk about is: does it actually work? And yes, it does work.
It will be interesting to see how far it can go. All the performance problems we have had have been entirely, or mostly, due to the external gateways: things like badly thought out buffer sizes leading to excessive memory usage, and port exhaustion due to misconfiguration, that sort of stuff. So that's been interesting, and the gateways on the worker nodes have been really, really good.
Okay, so the other sort of performance you care about when you're running a cluster is what the backfilling is like. Again, in the Jewel days we were seeing issues with backfill causing high load and impacting client IO, and we spent a lot of time trying to tune that down. This was some of the tuning we did, and we're in a much happier place now.
Backfilling goes on as part of the normal operation of running it; most weeks we'll be doing something, and it has no effect on client IO from what we can see. We'll happily be doing 20 or 30 gigabytes of backfill traffic alongside 10 or 15 gigabytes of client traffic with no issues. The bottleneck we do see when we're actually adding nodes is the networking: the cluster network of the node we're adding will be saturated while there's enough backfill to do.
This is a taste of one of the operational issues we had in the summer. When we built Echo we started out with 30 of the nodes, and then added the remaining 30 as sort of an experiment: can we actually do this, does this actually work for us?
What we ran into was this bug, that tracker there. A quick rundown: when an erasure-coded placement group is backfilling, the primary will be sending out requests for all of the shards from all of the other OSDs. If one of those OSDs can't read its shard, because there's a pending sector or something's gone wrong on the disk, it will reply and say: I can't read that shard. And the primary, instead of recognising that that's okay and that
it can still reconstruct the object, will crash, which is really unfortunate. Obviously it will then be restarted and crash again, and then when systemd stops restarting it, the second OSD in the set will become the primary, carry on the backfilling, and then suffer the same fate. So this was interesting; it hit us fairly hard. We ended up with multiple placement groups down over the course of an afternoon and no clear idea of what was going on.
While we were trying to figure out what was going on, we ended up doing some fairly drastic things to try and recover the placement groups, and we ended up actually losing one, which was pretty unfortunate. Once we got to the bottom of it, we managed to start removing any OSD that had even a single pending sector, and that turned into a bit of an operational nightmare.
It's been a fairly tough six months trying to deal with running a cluster and doing cluster operations, most of which involve backfilling, when you're in a state where any backfilling has the potential of bringing down a placement group and reducing your data availability. As of a few weeks ago this is actually fixed, which is nice, so when I get back I will be upgrading as soon as I possibly can. So this was an interesting problem we had. Continuing on the theme of problems:
this is the other main operational issue that we get when running the cluster. I think the thing that takes up most of our day-to-day time in terms of mundane things is inconsistent placement groups. An inconsistent placement group is when deep scrubbing happens and the shards of an object don't agree with each other. We get a fair amount of these, and most of them, in fact all of them without fail, have been due to there being a pending sector on a disk.
So a disk has a sector that's become unreadable, and when the OSD tries to read the shard, it can't. I'd say probably nine times out of ten, maybe even more than that, this doesn't result in us taking that disk out. Healthy disks do develop pending sectors; they are designed to work around them, and the firmware will work around it.
It will remap the sector, so most of the time we will not be taking the disk out; we'll just be doing a PG repair, which will rewrite the broken shard and then rescrub the placement group, and then it will come back healthy. In this respect I have a slight complaint: the HEALTH_ERR for what is a harmless problem on these types of pools seems like a lot of overkill.
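For reference, a minimal sketch of that repair loop driven from Python; it assumes the standard ceph and rados CLI tools are installed, and 'grid-data' is a placeholder pool name.

```python
# Sketch of the inconsistent-PG repair workflow described above, assuming the
# standard ceph/rados CLI tools are on the path; 'grid-data' is a placeholder pool.
import json
import subprocess

def sh(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Placement groups flagged inconsistent by deep scrub (JSON list of PG IDs).
inconsistent = json.loads(sh("rados", "list-inconsistent-pg", "grid-data"))

for pgid in inconsistent:
    print(f"repairing {pgid}")
    sh("ceph", "pg", "repair", pgid)   # rewrite the bad shard, then the PG gets rescrubbed
```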
It makes monitoring things like ceph health awkward. If you're doing call-outs, the health of the cluster is something it makes a lot of sense to call out on, but if you're suddenly in this state for over ten hours a week it's a bit of a nightmare. That's something that's interesting. We've actually kind of got around that: we use scrub scheduling in the opposite way to the last speaker, and we schedule it so we only scrub during daytime hours.
It's been a large part of our operational and development work, figuring out how we deal with these. We've been coming up with short-term solutions of pulling disks out as soon as they get pending sectors and then essentially remapping them: writing to the whole disk, reading from the whole disk, deeming it healthy, and putting it back in. This is a lot of work for what are single sectors on a disk going bad, which can be expected of any disk over its lifetime.
So having to move terabytes of data around for single pending sectors does seem a little bit silly, and this is one of the areas where we didn't expect to be putting work in with Ceph. It's interesting that we are, and it will be interesting to see where we go with that in the future, whether we get better at dealing with them or whether Ceph gets better at dealing with them.
This works okay for us. We do a lot of this as part of Echo's yearly operations: there will be nodes going in and nodes coming out, new generations every year and old generations being retired every year. So what we've ended up doing is using ceph-deploy for the normal work of getting OSDs onto disks, and then when we're making changes, when we're doing reweights, all of that is just manual CRUSH map edits.
With the normal commands it's trickier to roll things back, and it's not as quick; with manual edits you can just push the old CRUSH map back in. Being able to do manual CRUSH map edits and then apply them as a single step change, with just one recalculation, so you can say "ah, six percent of objects misplaced, that's what we expected", seems to be a much cleaner way to do it. It would be cool if there was a tool to do this.
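As a sketch of the manual workflow being described, using the standard ceph and crushtool commands; the file names are placeholders.

```python
# Sketch of the offline CRUSH map edit workflow described above, using the
# standard ceph and crushtool CLIs; file names are placeholders.
import subprocess

def sh(*cmd):
    subprocess.run(cmd, check=True)

sh("ceph", "osd", "getcrushmap", "-o", "crushmap.bin")       # dump the live CRUSH map
sh("crushtool", "-d", "crushmap.bin", "-o", "crushmap.txt")  # decompile to editable text

# ... batch up weight/bucket edits in crushmap.txt offline ...

sh("crushtool", "-c", "crushmap.txt", "-o", "crushmap.new")  # recompile the edited map
sh("ceph", "osd", "setcrushmap", "-i", "crushmap.new")       # push it back as one step change
# Keeping crushmap.bin around means rolling back is just another setcrushmap.
```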
I think this is something that would improve usability, certainly for us, and I'm sure other people would enjoy it too: being able to batch up a bunch of changes on an offline copy of the CRUSH map and then push them out in one go. You could make it as user-friendly as you wanted, and you could do all sorts of fun stuff with actually analysing what's happening and what data you're moving around.
Reweight-by-utilization is perfectly effective at keeping the OSDs that are nearly full moving back into the middle of the pack, but you never really lose the long tail: you've always got OSDs that only have one or two placement groups with data in them. It's certainly something that I think could be improved, and it is one of the things that has been improved; we've been looking at the balancer module, which is new with Luminous, and it does seem to do some things a lot better.
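For reference, turning the balancer module on looks roughly like this on a Luminous or later cluster; the commands are from the standard ceph CLI, and the mode choice here is only illustrative.

```python
# Rough sketch of enabling the Luminous balancer module via the standard ceph CLI.
# The mode is illustrative: upmap needs Luminous-or-newer clients, crush-compat does not.
import subprocess

def sh(*cmd):
    subprocess.run(cmd, check=True)

sh("ceph", "mgr", "module", "enable", "balancer")  # make sure the mgr module is loaded
sh("ceph", "balancer", "mode", "crush-compat")     # or "upmap" if every client is new enough
sh("ceph", "balancer", "on")                       # start balancing PG placement automatically
```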
It's going to be really exciting to see how Ceph handles us putting a lot more hardware into it. Echo is currently around 10 petabytes; it's going to be around 30 petabytes in a year and a bit's time, and so it's going to be really interesting to see how the performance goes
and improves with that. There are also some things that I'm not so excited about. I don't really see the issues we're having with disks doing anything but scaling up with the number of disks, especially as we start having aging generations of hardware in there; you've got five-year-old hardware and brand-new hardware, and how do you deal with that? So that will be fun. And then finally, to sum it all up: erasure-coded Ceph on large storage nodes.
It's definitely working for us. It was a bit of a gamble when we started; there were some questions where we weren't quite sure which way they were going to go, but we've got a lot more confidence now, and it's clear that Ceph as a community is getting behind erasure coding, with RBD on erasure coding and CephFS on erasure coding.