From YouTube: Ceph Month 2021: Evaluating CephFS Performance vs. Cost on High-Density Commodity Disk Servers
Description
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
Presented by: Dan van der Ster
CERN operates several PB-scale CephFS clusters to offer shared storage for IT apps and services. Our lab also developed EOS to deliver high-density, low-cost physics storage and support the APIs needed by the LHC community. This work evaluates CephFS+EOS: CephFS is a reliable low-cost EC backend and EOS is layered on top to add the missing functionalities. We present performance tradeoffs, networking overheads at scale, and optimized Ceph tunings, then conclude with ideas for the future.
A: So this is work that Andreas and I did over the last few months. Just some background first. On the physics side, at CERN we do our computing on this thing called the Worldwide LHC Computing Grid, and CERN forms the Tier 0: we have around 135 petabytes of disk, replicated twice, and almost 400 petabytes of tape, and for our scientific data taking we still add 50 petabytes per year.
In total this makes around a million CPU cores processing something like 2 million jobs per day. We have around one exabyte of storage globally, with around one terabit of total internal connectivity, and, for example, last year we transferred 300 petabytes around on our WLCG, our Worldwide LHC Computing Grid.
Now, this whole thing on the storage side is powered by open source storage software, all developed within our high energy physics (HEP) community. These are things like dCache, DPM, EOS, StoRM and XRootD; these are all site storage solutions. The protocols used are HTTP, XRootD and GSIFTP to transfer between sites. We have a thing called the File Transfer Service, or FTS, which does third-party transfers between sites and schedules those transfers according to network constraints, and then we have something called Rucio, which is what we call a data orchestrator: it places the data and interacts with the File Transfer Service according to different policies.
There are lots of sites in Europe, lots in North America, lots in Asia, lots in Australia, and also in Africa. Next slide.
So this brings us to what this work is about. In the next few years, High-Luminosity LHC data taking is going to increase our demands on storage, so we'll need to be taking something like 500 petabytes per year by 2028, and open source storage software like Ceph has compelling features and maturity. So it begs the question: what role will it play in future physics storage systems?
A
However,
it's
known
that
like
off-the-shelf
software,
misses
some
high-level
features
that
we
have
so
one
solution
would
be
to
layer
our
high
energy
physics,
specific
gateways
on
top
of
open
source
storage.
So
here
in
this
presentation,
I'm
presenting
a
novel
combination,
cfs
plus
eos
software,
written
at
cern.
A
So
I
don't
have
to
go
into
detail
on
what
ceph
is
it's
popular
part
of
the
open
infrastructure?
Stack
lots
of
sites
have
it
anyway,
so
maybe
it's
useful
to
put
some
small
thin
layer
on
top
to
then
be
able
to
expose
the
university's
infrastructure
to
the
to
the
lhc
computing
grid.
A
Cfs,
I
don't
have
to
go
into
detail
what
it
is.
It's
nfs
like
clustered
file
system
used
for
home
directories,
hpc
scratch
areas
or
shared
storage,
I'd
scale
out
it
uses
radios
underneath
it
can
do
replication
or
eraser
coding,
and
it's
also
read
after
write
consistent,
which
is
important
and
it
had
on
the
clients,
the
mds
delegate,
capabilities
to
the
clients
so
that
they
can
either
do
buffered
I
o
or
asynchronous
asynchronous
buffer.
I
o
or
synchronous
as
needed,
going
very
quickly
through
this
overview
stuff.
At
cern,
we've
had
cefs
in
production
since
2017.
We have OpenStack Manila used massively at CERN; this is with replication, one petabyte of usable capacity at the moment. And then we also use it for some on-prem groupware solutions, where we have Ceph OSDs co-located on OpenStack hypervisors and run some erasure-coded CephFS there.
A
These
and
more
than
30
petabytes
of
other
ceph
clusters
have
been
robust
and
performing
in
any
kind
of
disaster
scenario.
Everything
seems
to
work
after
infrastructure
outages.
The
data
is
still
fine,
our
users
are
are
not.
The
failures
are
basically
transparent
to
our
users
and
we've
also
been
through
three
procurement
cycles
now,
and
we
just
replace
and
rebalance
that
data
and
everything
works.
A
Great,
however,
cfs
like
I
mentioned,
misses
some
features
that
are
essential
to
high
energy
physics,
like
some
authentication
mechanisms
which,
like
sci
tokens,
x-509
cabaros,
also
some
very
feature-rich
quota
and
access
control,
things
that
are
required
in
high
energy
physics,
the
storage
protocols
like
x,
http,
x,
root
d
and
third
party
copy
I
mentioned
before,
and
also
in
this
use
case.
A
At
the
moment,
it's
implemented
in
a
storage
framework
called
x,
root
d.
Basically,
the
there's
a
name
space
which
is
present.
There's
a
namespace,
implemented
by
a
thing
called
quarkdb,
which
is
a
kind
of
consensus
distributed
raft
cluster
with
rocks
dbe
behind
fsts
are
like
the
osds.
They
store
data
either
locally
or
they
can
also
gateway,
remote
storage
and
then
mgm
is
like
the
mds.
A
It
caches
metadata
and
maps
file
names
to
inodes,
so
actually
it's
very
straightforward
for
us
to
use
cefs
behind
the
scenes
of
a
eos
cluster,
simply
by
tricking
eos
into
using
it
as
a
local
file
system.
So
all
the
redundancy
and
high
availability
is
delegated
to
this
ffs
layer
and
we
configure
our
eos
storage
to
store
with
just
a
single
copy.
So we did a proof of concept of this. We took eight very large new machines: dual Xeon, not very much RAM at 192 gigs, 100 gigabit Ethernet, 60 terabyte drives each and two one-terabyte SSDs. The RAM, as I said, is not very much: it's roughly three gigabytes per spinning disk. This is different from what we run in production right now.
A
We
run
actually
96
12
terabyte
drives
192
gigs
of
ram,
but
at
that
ratio
it's
really
getting
to
be
too
little
ram
for
for
this
fosts
everything
that
we
buy,
because
we
have
hundreds
of
petabytes,
it
has
to
be
optimized
by
price
per
per
price
per
terabyte.
A
The
cef
back
end
was
octopus
version
1528
and
we
configured
so
we
have
osd's
installed
on
this
hardware
for
the
mon
mgr
and
mds.
We
didn't
we
weren't,
particularly
benchmarking
metadata
performance,
so
we
just
put
them
onto
some
vm
somewhere
else
in
the
data
center,
and
we
have
the
metadata
pool
on
the
ssds.
Though,
and
then
we
have
a
few
different
cfs
data
pools,
testing,
three
different
erasure
coding,
layouts,
four
plus
two,
eight
plus
two
and
sixteen
plus
two,
and
to
try
to
make
things
fair.
A
We
had
a
number
of
placement
groups
so
that
the
number
of
placements
per
osd
was
roughly
equal,
so
like
50,
40
40
for
these
different
layouts,
and
then
we
used
the
up
map
balancer
to
make
sure
everything
was
was
balanced
in
the
test.
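As a rough illustration of that balancing step, the sketch below picks a power-of-two PG count per erasure-coded pool so that the number of PG shards per OSD comes out roughly equal across the 4+2, 8+2 and 16+2 layouts. The OSD count (8 hosts x 60 drives) and the ~50 shards-per-OSD target are assumptions for illustration, not the exact values used in the test.

```python
# Sketch: choose a power-of-two PG count per erasure-coded pool so that the
# number of PG shards per OSD is roughly equal across layouts. The OSD count
# (8 hosts x 60 drives) and the ~50 shards/OSD target are assumptions.

def pg_count(num_osds: int, target_shards_per_osd: int, k: int, m: int) -> int:
    """Return a power-of-two PG count for a k+m erasure-coded pool."""
    raw = num_osds * target_shards_per_osd / (k + m)
    power = 1
    while power * 2 <= raw:
        power *= 2
    # Round to whichever power of two is closer to the raw value.
    return power * 2 if (raw - power) > (power * 2 - raw) else power

if __name__ == "__main__":
    for k, m in [(4, 2), (8, 2), (16, 2)]:
        pgs = pg_count(num_osds=480, target_shards_per_osd=50, k=k, m=m)
        print(f"{k}+{m}: pg_num={pgs}, shards/OSD={pgs * (k + m) / 480:.0f}")
```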
Okay, this is just to explain how it works. Suppose an object is 16 megabytes: it is sent to the primary OSD, HDD 1 here, which then does the work to split it into the different data pieces and the erasure coding pieces.
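To make that fan-out concrete, here is a tiny arithmetic sketch of what the split looks like for the layouts tested: a k+m pool turns one object into k data chunks of object_size/k plus m parity chunks of the same size, so a 16 MB object in a 4+2 pool becomes six 4 MB chunks and 24 MB on disk. This is illustration only, not Ceph code.

```python
# Arithmetic sketch of the erasure-coding split described above: a k+m pool
# turns one object into k data chunks of object_size/k plus m parity chunks
# of the same size. Illustrative only; ignores padding of partial stripes.

def ec_chunks(object_size_mb: float, k: int, m: int) -> dict:
    chunk_mb = object_size_mb / k
    return {
        "layout": f"{k}+{m}",
        "chunks": k + m,
        "chunk_size_mb": chunk_mb,
        "bytes_on_disk_mb": chunk_mb * (k + m),   # 1.5x overhead for 4+2
    }

for k, m in [(4, 2), (8, 2), (16, 2)]:
    print(ec_chunks(16, k, m))
```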
In our tests we varied the object size to see what impact it has on performance, and then we did different kinds of tests. What we call the back-end test was just native CephFS benchmarking: on a separate set of nodes connected to the same switch, also with 100 gigabit networking, we just ran dd to see how quickly we could really pump files into this CephFS. Then, after layering our EOS software on top, we also did the same kind of benchmarks of this layered, indirect writing. In all cases each client node is running 10 transfers in parallel, we're always writing two-gigabyte files, and they just loop like this.
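A minimal sketch of that client loop, assuming a kernel-mounted CephFS at a placeholder path: ten parallel writers per node, each streaming 2 GB files in large blocks, roughly what dd in a loop does in the back-end test described here.

```python
# Minimal sketch of the client write loop used in the back-end test: ten
# parallel writers per node, each streaming 2 GB files in 16 MB blocks into a
# CephFS mount. The mount point and loop count are placeholder assumptions.

import os
from multiprocessing import Process

MOUNT = "/cephfs/bench"            # assumed kernel-client mount point
FILE_SIZE = 2 * 1024**3            # 2 GB files, as in the talk
BLOCK = 16 * 1024**2               # 16 MB writes, comparable to dd bs=16M
FILES_PER_WORKER = 5               # how many files each writer loops over

def writer(worker_id: int) -> None:
    buf = os.urandom(BLOCK)
    for n in range(FILES_PER_WORKER):
        path = os.path.join(MOUNT, f"writer{worker_id}-{n}.dat")
        with open(path, "wb") as f:
            written = 0
            while written < FILE_SIZE:
                f.write(buf)
                written += BLOCK

if __name__ == "__main__":
    procs = [Process(target=writer, args=(i,)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```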
So, some back-end streaming numbers on this cluster. Here on the left is the streaming read performance. We varied the number of client nodes, and up to three nodes we were getting a linear increase in read throughput: around four and a half gigabytes per second, then nine, then fourteen, but then it started to saturate, at around 20 gigabytes per second of read performance. Writing, we actually did better: around six gigabytes per second per client node added, until it saturated around 33 to 34 gigabytes per second.
A
We
noticed
something
interesting,
which
was
that
as
the
osd's
got
full,
the
performance
dropped
the
right
performance.
Here
we
showed
that
up
to
50
full
everything
was
was
working
very
well,
but
then,
as
as
we
got
to
75
and
then
90
full,
we
lo,
we
saw
up
to
a
30
percent
performance
cut
in
the
streaming
right
performance
it
correlated
with
increased
io
8
on
the
discs.
So
we
just
assumed
this
is
just
the
blue
store
allocators
spending
more
time
having
to
fit
these
blocks
onto
the
disk
and
lots
more
random
seeks
to
write.
A
Here.
We
varied
the
erasure
coding
layout.
We
went
from
four
plus
two
to
a
plus
two
to
sixteen
plus
two
to
see
how
that
impacts.
The
streaming
rate
performance
instead,
ffs
the
default,
is
four
megabytes,
but
on
this
particular
use
case
that
we
were
running,
this
didn't
give
us
the
optimal
performance
by
by
increasing
to
say,
64,
megabytes
and
then
doing
16
plus
2,
erasure
coding
or
128
megabytes
object,
sizes
and
doing
16,
plus
2
erasure
coding.
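The object size and data pool of a CephFS file layout can be changed per directory through the layout virtual xattrs, which is presumably how a test like this would be set up; the sketch below assumes a kernel mount at a placeholder path and a hypothetical 16+2 pool name, and the new layout only applies to files created afterwards.

```python
# Sketch: set a larger object size and a different EC data pool on a CephFS
# directory via the layout virtual xattrs, so files created there afterwards
# use that layout. The mount path and pool name are placeholders; the pool
# must already be added as a data pool of the file system.

import os

BENCH_DIR = "/cephfs/bench/ec16_2_64M"
os.makedirs(BENCH_DIR, exist_ok=True)

os.setxattr(BENCH_DIR, "ceph.dir.layout.object_size", b"67108864")   # 64 MiB objects
os.setxattr(BENCH_DIR, "ceph.dir.layout.stripe_unit", b"67108864")
os.setxattr(BENCH_DIR, "ceph.dir.layout.stripe_count", b"1")
os.setxattr(BENCH_DIR, "ceph.dir.layout.pool", b"cephfs_ec_16_2")    # hypothetical pool name

print(os.getxattr(BENCH_DIR, "ceph.dir.layout").decode())
```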
On the read side it was similar: the object size influenced the read performance we could get. For the small object sizes we could only get maybe 200 megabytes per second, and then with 128-megabyte objects, the large yellow plot here, we could get up to 380 or 400 megabytes per second. We were also varying the block size.
That's in the top left corner here. The mean, or average, time will be at the peak of a plot like this, but then you can measure the 99th percentile or the maximum time for the slowest transfer. For the small objects the mean, shown as the gray in this plot, was always quite reasonable, but the 99th or 100th percentile was maybe double the mean.
A
However,
when
we
got
to
the
64
megabyte
objects
and
128
8
megabyte
objects,
we
had
very
huge
long
tail
distributions,
so
we
were
waiting.
The
slowest
transfer
was
really
like,
like
maybe
10
times
the
10
times
the
average,
which
is
quite
poor
for
our
data,
taking
type
scenario
on
the
reading
side.
This
didn't
this
tail.
These
long
tails
were
not
so
apparent
and
even
with
the
long
with
the
largest
objects
and
the
largest
block
sizes,
the
the
tails
were
were
minimized.
A
So
for
reading
we
can
do
we
can
we
can
still
have
these
huge
objects,
huge
ios.
A
Now,
that's
all!
That's
all
like
back
end
performance,
cfs
alone.
Now
we
go
to
set
ffs
and
we
layer
our
gateways
on
top
our
high
ninja
physics
gateways
in
this
plot.
Okay,
we
start
at
the
left,
with
four
plus
two
erasure
coding,
four
megabyte
objects
and
we
have
a
certain
performance
which
is
the
gray.
Okay.
We
add
our
eos
front
end.
The
average
speed
takes
is
roughly
the
same.
Okay,
the
average
throughput
is
basically
the
same,
so
we
don't
have
a
performance
penalty
to
add
our
eos
front
end.
However,
the
tails
increase
substantially.
A
We
got
huge,
99th
and
and
max
transfer
times.
So
what
we
did
to
work
around
this
was
on
the
client
side.
We
started
throttling
the
bandwidth
so
by
throttling
down
to
26
gigabytes
per
second
total
or
which
was
325
megabytes
per
second
per
transfer.
We
could
bring
those
tails
back
down
to
almost
like
native
sffs
and
then,
if
we
increase
that
slightly,
we
got
started
increasing
the
tails
again.
A
So
we
see
that
was
like
a
sweet
spot
for
this,
for
this
particular
use
case
and
cluster,
but
we
really
need
this
client
side
throttling
to
to
protect
from
long
tails
in
read
performance.
Okay,
we
have
we
started
with
here.
We
start
with
native
ceph
cephalon,
with
four
plus
two
erasure
coding,
four
megabyte
objects,
one
megabyte
reads:
okay:
we
can
optimize
ffs
alone
by
increasing,
so
we
still
do
four
plus
two
erase
recording,
but
we
increase
to
16
megabyte
objects.
A
We
do
eight
megabyte
reads:
okay,
this
decreases
the
transfer
time
per
per
transfer,
so
this
showing
you
that,
by
playing
with
cfs
file
layouts,
you
can
gain
a
lot
of
performance
and
then,
by
we
add
our
eos
frontend.
On
top,
we
got
even
slightly
better
performance.
Okay.
This
is
due
to
the
ios
being
better
scheduled
somehow
by
being
by
being
shielded
by
by
the
eos
front.
End,
it's
not
a
huge
effect,
but
it
was
noticeable
and
there
were
no
long
tails
for
reading.
A
So
that
come
that's
the
end
of
the
of
the
the
raw
benchmarks
and
I'll
talk
a
little
bit
about
what
we
did
with
what
we
had
what
we
observed
on
the
cef
side,
so
on
this
large
cluster
or
this
relatively
large
cluster,
with
huge
boxes
and
very
fast
network,
even
just
while
rados
benching
this
cluster,
we
found
that
the
rados
clients
themselves
were
throttling
themselves
because
there's
something
in
there's
a
there's.
A
client
parameter
called
objector
in
flight
op
bytes
and
it's
limiting
to
100
megabytes
by
default.
A
But
on
this
cluster
with,
like
so
many
spinning
disks
and
so
much
network
throughput,
we
needed
to
increase
the
in-flight
bytes.
So
we
could
get
the
best
windows
bench
performance
by
increasing
this
to
one
gigabyte.
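For a user-space librados client, that throttle can be raised per client at connect time; a minimal sketch with the Python rados binding is below, assuming the standard config file location.

```python
# Sketch: raise the librados client throttle discussed above
# (objecter_inflight_op_bytes, 100 MB by default) to 1 GiB when connecting a
# user-space client. Standard config/keyring locations are assumed.

import rados

cluster = rados.Rados(
    conffile="/etc/ceph/ceph.conf",
    conf={"objecter_inflight_op_bytes": str(1024**3)},
)
cluster.connect()
print("connected to cluster", cluster.get_fsid())
cluster.shutdown()
```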
A
This
was,
of
course
only
for
like
user
mode
clients,
we
were
doing
some
fuse
tests
as
well,
and
some
rails
benches
on
the
side.
It
doesn't
apply
to
the
kernel
staff
of
s
that
that
that
does
this-
I
don't
know
what
it
actually
limits
to,
but
it's
something
larger
than
the
default
rates.
A
Client.
Now, something interesting that came up during this is that the EOS software has an internal fsck function where it's always scanning the files. It's always hammering the MDS, so the MDS is constantly having to load and then trim thousands of inodes from its cache to stay underneath the MDS cache memory limit.
A
We
found
just
by
observation
that
each
inode
is
consuming
around
3
kilobytes.
So
if
we
had
a
64
gigabyte
mds,
this
would
hold
around
21
million.
I
notes,
but
that's
we
need
we
need.
We
have
file
systems
with
something
like
a
billion
files,
so
it
doesn't
all
fit
in
memory,
and
actually
this
fsck
was
very.
A
So
this
was
this
contributes
to
something
like
one
gigabyte
per
second
of
inode
cash
growth
and
you
very
quickly
within
a
few
seconds,
your
your
mds
goes
out
of
memory,
so
this
was
all
fixable
by
changing
the
tuning
parameters
of
ceph,
some
caps
recall
tunings
and
the
we
worked
with
patrick
upstream
to
to
get
some
increased.
Some
increased,
lift
rate
of
caps
recall
and
there's
a
pr
there
linked,
and
this
actually
works
really
well.
A
The
numbers
that
are
now
the
default
in
ceph
actually
work
really
well
for
all
of
our
use
cases
and
there's
also
a
new
capabilities
acquisition.
A
caps
throttle
to
prevent
this,
maybe
even
without
tuning
these,
without
paying
so
much
attention
to
tuning
them
now.
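The kind of MDS tuning being described can be applied cluster-wide with ceph config set; the sketch below lists the relevant cache and caps-recall options with illustrative values only, since, as noted, recent Ceph defaults already include the upstream changes.

```python
# Sketch of the MDS cache and caps-recall tuning discussed above, applied with
# "ceph config set". The option names are real MDS settings; the values are
# only illustrative (recent Ceph defaults already include the upstream
# caps-recall improvements mentioned in the talk).

import subprocess

MDS_TUNING = {
    "mds_cache_memory_limit": str(64 * 1024**3),        # ~21M inodes at ~3 KB each
    "mds_recall_max_caps": "30000",                      # recall more caps per client per tick
    "mds_recall_max_decay_rate": "1.5",
    "mds_session_cap_acquisition_throttle": "500000",    # throttle scan-heavy clients (e.g. fsck)
}

for option, value in MDS_TUNING.items():
    subprocess.run(["ceph", "config", "set", "mds", option, value], check=True)
```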
Something that's unsolved: during our testing, one day, out of the blue, the write performance of the cluster dropped from something like 25 gigabytes per second, which was the norm, to under five gigabytes per second, and there were no changes to the cluster, nothing obvious.
A
We
confirmed
this
like
in
our
front-end
testing
and
also
with
raido's
bench,
and
then
the
root
cause
was
found
to
be
just
one
sick.
Spinning
disc
in
the
cluster,
maybe
it
had
a
poor
sata
connection,
but
we
could
observe
by
measuring
that
disc
itself
directly.
We
saw
something
like
two
seconds
of
latency
doing:
small
ios.
A
There
were
no.
I
o
errors,
no
smart
errors.
The
drive
was
just
slow,
so
a
very
quick
fix
was
simply
to
stop
the
osd.
Stop
the
system.
Ctl
stop
the
osd
process
immediately.
The
right
performance
went
back
up
to
25
gigabytes
per
second
and
then,
of
course,
the
data
was
backfilled
somewhere
else.
So
we
want
to
find
a
way
to
better
identify
these
kind
of
six
sick
drives.
I
guess
we
can
call
them.
We
have
lots
of
internal
metrics.
A
We
could
actually
find
this
drive
right
away
just
by
running
ceph,
osd
perf
and
sorting
by
the
op
commit
latency,
but
we
it
would
be.
We've
we're
working
ourselves
just
now
on
trying
to
find
how
what
what
is
worth,
which
kind
of
threshold
of
of
off
latency
is
worth
alarming
or
worth
warning
the
unit,
the
user.
You
know
in
seth
we
already
warn
about
high
network
latencies,
and
we
can.
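A small sketch of that "sort ceph osd perf by commit latency" check, parsing the JSON output and printing the slowest OSDs; the exact JSON nesting varies between releases, so the field handling here is an assumption to adapt.

```python
# Sketch: spot "slow but not failed" drives by sorting "ceph osd perf" output
# by commit latency. The JSON nesting differs between releases, so the field
# handling below is an assumption to adapt to your version.

import json
import subprocess

raw = subprocess.run(
    ["ceph", "osd", "perf", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout

doc = json.loads(raw)
infos = doc.get("osdstats", doc).get("osd_perf_infos", [])

for osd in sorted(infos, key=lambda o: o["perf_stats"]["commit_latency_ms"],
                  reverse=True)[:10]:
    stats = osd["perf_stats"]
    print(f"osd.{osd['id']}: commit {stats['commit_latency_ms']} ms, "
          f"apply {stats['apply_latency_ms']} ms")
```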
You know, in Ceph we already warn about high network latencies, and we monitor the SMART status and even do failure prediction, but I think we can also look at anomalous op commit latencies. So, coming to the end: this proof of concept demonstrated that we can get, per client node, up to four gigabytes per second reading and writing, and it works very well for our use case.
We have a performance cut-off that we observed at the RADOS level, probably caused by disk fragmentation; maybe using block.db on flash would help, and actually we didn't even use block.db on flash in this case. And then, of course, you have to reserve adequate spare capacity to handle any kind of failure, like one rack free or at least one host free.
A
We
want
to
make
sure
that
we're
using
the
network
we
found
that
right
performance
is
limited
by
the
network
connectivity,
so
we
didn't
see
any
cpu
or
disk.
I
o
bottlenecks.
Read
performance,
however,
was
probably
limited.
By
seeking
we
measured
that
with
this
basic
eraser
coding,
it
basically
doubles
the
network
throughput
based
on
what
the
user
is.
Actually
writing.
A
So
nine
gigabytes
per
second
inbound
translates
to
like
five
gigabytes
to
get
a
local
disk
output
and
five
gigabytes
sent
outbound
to
other
nodes
in
the
cluster.
We
could
afford
to
double
the
satur,
the
double
the
network
connectivity
on
these
nodes
to
thereby
saturate
all
of
the
available
disk.
I
o
in
these
particular
nodes,
so
we
could
use
public
and
cluster
network
isolation
which
we
didn't.
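A rough back-of-the-envelope model of that doubling (a sketch, not from the talk): for a k+m pool, each user byte becomes (k+m)/k bytes on disk, and the primary ships about (k+m-1)/k bytes to other hosts on top of the client's own traffic.

```python
# Rough model of the write amplification described above (a sketch, not from
# the talk): for a k+m erasure-coded pool, each user byte becomes (k+m)/k
# bytes on disk, and the primary OSD forwards about (k+m-1)/k bytes to other
# hosts in addition to the client's own traffic, so ingest roughly doubles on
# the cluster network.

def write_amplification(user_gb_per_s: float, k: int, m: int) -> dict:
    disk = user_gb_per_s * (k + m) / k
    inter_node = user_gb_per_s * (k + m - 1) / k
    return {
        "layout": f"{k}+{m}",
        "disk_write_gb_s": round(disk, 1),
        "inter_node_gb_s": round(inter_node, 1),
        "total_network_gb_s": round(user_gb_per_s + inter_node, 1),
    }

for k, m in [(4, 2), (8, 2), (16, 2)]:
    print(write_amplification(9.0, k, m))
```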
We also found that when we were doing concurrent writing and reading, the writes were taking priority, and this is actually what we want.
A
So
it's
okay!
If
we
have,
if
we're
doing
data
taking,
we
want
the
right
reads
to
be
de-prioritized,
but
if
you
leave
the
I
o
prioritization
just
up
to
ceph,
then
then
the
the
red
is
just
showing
that
in
these
various
in
these
various
tests,
okay,
the
rights
we're
taking
the
the
most
of
the
bandwidth.
A
I
asked
at
a
previous
meeting
like
it
would
be
interesting
if
we
could
actually
tune
this
directly
so
that
we
could,
we
could
specify
by
policy
how
we
want
the
ios
to
be
prioritized.
Of
course,
we
can
do
this
in
our
front
end
as
well.
So
maybe
that's
a
better.
A
That's
another
path,
our
front
end.
So
this
is
like
a
case.
If
anybody
needs
to
put
a
front
end
in
front
of
cfs
you
can
you
can
see
that
it
has
marginal
impact
on
the
overall
performance
compared
to
native
the
native
back
end
rados,
you
might
get
tails,
for
example,
like
we've
seen
and
going
forward,
we
we
might
want
to
like
co-locate
everything
on
the
same
boxes
rather
than
putting
the
our
gateways
on
separate
boxes
and
then
connecting
her
to
a
remote
cluster.
We
might
want
to
put
everything
locally,
however.
A
If
is
a
high
memory
pressure,
it
would
be
safe
to
use
a
fuse
mount
or
access
with
lips
ffs,
but
we
found
in
experiments
that
live
server,
vest
performs
quite
poorly,
and
I
think
that
this
was
mentioned
this
this
week
that
there's
a
a
global,
lock
and
probably
this
is
why
we
see
poor
libs,
fs
performance
so
coming
to
conclusions,
these
pieces
of
software,
seven
eos
are
easily
stackable
and
give
excellent
performance
on
the
high
density,
commodity,
disk
server
and
hundred
gig
network
softwas
is
extremely
reliable,
high
performance
and
flexible
with
tunable
qos,
and
it
has,
as
we
know,
a
large
and
active
user
community
beyond
our
physics
communities
and
then
in
this
stack.
A
Also,
like
fine-grained
resource
control,
find
green
quotas
according
to
our
user
communities,
and
then
you
can
also
we've
built
other
services
on
top
of
eos,
like
we
have
cern
box,
which
is
a
sink
and
share
thing,
and
also
we
have
a
new
open
source
tape.
Software
called
cern
tape
archive
which
is
linked
to
this,
so
we
can
think
of
putting
this
all
all
these
pieces
together.
A
What
are
we
doing
now,
so
I
won't
go
into
too
much
detail,
but
we're
doing
we're
now
testing
this
sort
of
thing
we
want
to
start.
We
will
start
testing
this
in
production
to
get
to
see
if
we
can
really
have
real-life
gains
in
usability
performer
performance
and
operations.
A
It
also
removes
some
limitations
that
we
have
on
the
eos
side,
and
then
we
have
on
the
on
the
like
thinking
about
how
this
can
be
implemented.
Even
optimized
implementation,
we're
considering
how
to
unify
the
name
spaces
and
localize
the
I
o,
so
that
when
we
use
one
name
space
between
cfs
and
eos,
but
also
do
the
I
o
like
so
that
the
clients
they
don't
have
to
go
through
a
a
special
eos
client.
They
could
just
use
the
native
s
client
on
the
on
our
large
batch
systems,
and
that's
it
thanks.
B: I had a quick question about the read versus write performance. I was a little bit surprised to see that your read throughput seemed to taper off before the writes did, and you mentioned the seek latency on the disks being the likely culprit. Yeah, so I think that's generally right, but that's only part of the story. Did you try playing with the read-ahead setting on the kernel client?

A: We didn't, we just... yeah.

B: Usually what happens is there are only a certain number of reads in flight.
B: It only reads so far ahead, so you are waiting for the arms to move around for whatever 100 megs or so is in front of your read position, and so there's some built-in latency there. But if you just extend the read-ahead, then it can fetch that data ahead of time and you can get much, much more.
A: I mean, that will help if things are laid out linearly according to how we're reading them. But we're thinking that read-ahead can be okay: when we write, things are going in sequentially, but when we read back, maybe we're not reading back in the exact same order. That's right. Okay, okay.
B: Yeah, so increasing the read-ahead just means that you can have more OSDs busy moving around and reading data at a time, so in theory your reads should be able to saturate your overall network capacity or whatever, so you should get more than your writes, if you have enough read-ahead.

A: Anyway, yeah, we'll talk about it, we'll try it.
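For reference, the kernel CephFS client's read-ahead window is the rasize mount option (in bytes); below is a hedged sketch of mounting with a larger window, with placeholder monitor addresses, mount point, client name and size.

```python
# Sketch of the read-ahead suggestion above: mount the kernel CephFS client
# with a larger rasize (read-ahead window, in bytes) so more OSDs can be kept
# busy prefetching. Monitor addresses, mount point, client name and the
# 128 MiB value are placeholders, not settings from the talk.

import subprocess

subprocess.run(
    ["mount", "-t", "ceph", "mon1,mon2,mon3:/", "/cephfs",
     "-o", "name=bench,rasize=134217728"],
    check=True,
)
```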
C
It's
do
you
see
the
use
case
for
cfs
snapshots
in
your
specific
environment
in
the
future.
A: We use snapshots to keep older versions of the files, you know, the analyses and work in progress, and we're midway through implementing this right now. Snapshots are a bit different from what you get with something like Dropbox; snapshots are more like... what's the macOS thing called?
A
I
forget
where
you
can
like
slide
the
snapshot
right
to
it,
where
everything
is
all
at
a
point
in
time,
but
for
the
machine
yeah
time
machine
so
but
for
synchro
you
want
per
file
versions
and
that's
where
that's
where
we
could
see
like
more
effective
use
of
cfs
snapshots
instead
of
having
we
kind
of
have
to
hack
file
versions
into
snapshots,
which
is
a
bit
weird.
We end up using a lot of indirect soft links to the files anyway. So yes, we will be using snapshots a lot in the upcoming use cases.