From YouTube: Ceph for Big Science - Dan van der Ster
Description
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Dan van der Ster, CERN Storage Engineer and Ceph Advisory Board member
This picture on my first title slide is the Large Hadron Collider, which is just beside Geneva, Switzerland, right next to Lake Geneva here on the side. You can see we have a giant tunnel going under the earth, accelerating particles and colliding them at four locations around the ring. The ring is actually 27 kilometers in circumference, and when the scientists or the technicians go down to do some work, they even ride bicycles around the tunnel, as you can see here.
When we collide those particles, like I said, they're going in either direction, and you can see in the animation the particles coming in and colliding. They collide inside these giant particle detectors, which are essentially giant digital cameras. This is an example of one. Can you see the scientist in this picture? He's down here.
And this is what the pictures look like. We collide these particles together to generate the fundamental building blocks of our matter. This is a Higgs boson, which was discovered just a few years ago; it's the fundamental particle which gives mass to everything in the universe. So this is actually what they look like: you can tell because there are a couple of elementary particles coming out in these different directions, this red one, and some other jets in this other direction.
To study this kind of physics we need a lot of computing power at CERN. This is a picture of the CERN data center. We have around 300 petabytes of storage and 230 thousand CPU cores, but in fact our computing is just one part of the Worldwide LHC Computing Grid, where we have around 150 data centers around the world at universities and research labs. One of those sites is our friends at the high energy physics institute here in Beijing, which we had the pleasure to visit a couple of days ago.
So that's a little bit of background about CERN, and now I want to explain how we have used Ceph, yesterday and today. Our adventure with Ceph started in February 2013, when we were just starting to get involved in building a cloud at our laboratory with OpenStack, and we had the question: how do we provide storage for OpenStack? At the time there was already one clear candidate, and that was Ceph, so we started a proof of concept.
We had just over one thousand three-terabyte disks and two hundred really nice SSDs for the OSD journaling, and this was the time of Ceph Dumpling, so we were using this quite a while ago. When we put it in production, it was just before the Christmas break. CERN closes for two weeks around Christmastime, and we didn't quite understand how many replicas we should have, so we even used four replicas over the Christmas break so that we would be very sure not to have any interruption to our nice holidays.
When we came back we rethought things, and now everything is three replicas and we're very happy with that. Over the following years things grew for us. After this three hundred terabytes, and then the three-petabyte production cluster, we started using Ceph for other use cases. We contributed part of the erasure coding libraries to Ceph. We wrote something called the RADOS Striper, which lets us stripe physics data in parallel across the OSDs. We also tried making a very simplified file system which we called RadosFS.
That one was not ideal, and we moved on from it to something else. We upgraded in 2016: we grew our three-petabyte block storage cluster to six petabytes, completely replacing all the hardware with no downtime, and by late last year we had eight Ceph clusters in production. Here are those clusters today.
The main use case is still OpenStack Cinder and Glance, where we have very close to six petabytes in production, and we also have a second, smaller cluster in a satellite data center with half a petabyte. We're now also fully in production with CephFS, where we have around 1.5 petabytes in total across three different clusters; I'll explain those in a bit more detail.
Then there's the cluster for CASTOR and XRootD: a physics use case where we have over five petabytes of physics data stored in Ceph. And then we also have what I consider to be a small S3/Swift object store for other use cases. Normally I would talk about block storage first, but I'll do things in reverse, because CephFS is now stable.
So let's talk about CephFS and how we have been using it. Like many users, we have a kind of filer use case for several different IT applications, things like Puppet, GitLab repositories, or OpenShift: you often just need a place to store files. We have been doing this for the last few years with virtual NFS filers.
These are basically virtual machines with Cinder block devices, using ZFS and a nice backup tool to make NFS, not highly available, but at least reliable and backed up. We have around thirty of these kinds of virtual filers and they work very well. Here's an example: this plot shows that we can do close to 50,000 stats per second on one of these virtual machines, which is quite high performance. But this kind of architecture is not very scalable: to manage the quotas you have to detach and reattach the block devices if the users are growing their space too much; it's also labor-intensive to create new filers; you have to manage them as precious individual machines; and you can't scale the performance horizontally.
So this is why CephFS is interesting. We started last year to evaluate OpenStack Manila with CephFS, because it was starting to match most of our requirements. It supports multi-tenancy, with each of the users isolated so they can't delete or affect the other users' data, and it also supports quotas. More importantly, it has easy self-service provisioning: the users can, either through an API or by clicking through their web browser, create a new virtual filer on demand, which is a very nice procedure. And it has scalable performance: we just add more OSDs or more MDSs as needed when we need more performance.
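As a rough illustration of that self-service provisioning, here is a minimal sketch using python-manilaclient; the endpoint, credentials, share type name, and size are placeholder assumptions rather than our actual configuration, and it assumes the client accepts a keystoneauth1 session.

```python
# Sketch only: create a CephFS share through OpenStack Manila, the same
# operation users trigger from the API or the web dashboard.
from keystoneauth1 import loading, session
from manilaclient import client as manila_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='https://keystone.example.org:5000/v3',   # placeholder endpoint
    username='someuser', password='secret',
    project_name='someproject',
    user_domain_name='Default', project_domain_name='Default')
manila = manila_client.Client('2', session=session.Session(auth=auth))

# Ask Manila for a 100 GB CephFS share; the share type name is an assumption.
share = manila.shares.create(share_proto='CEPHFS', size=100,
                             name='my-filer', share_type='cephfs')
print(share.id, share.status)
```

The user then adds an access rule for their cephx identity and mounts the path Manila returns, without ever opening a ticket with the storage team.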
We've been testing this since the middle of 2017, and indeed, at that time a single metadata server was a bottleneck. But now, with the Ceph Luminous release, multi-MDS is a stable feature, so we have this in production.
Here are a couple of pictures showing multi-MDS in production, just to confirm that this is stable. We had a pre-production testing environment with 20 different tenants, that's 20 different OpenStack projects, using this for several months with two active MDSs, and it was stable. On our production cluster we enabled this in January of this year, and in the picture on the bottom you can see, for example, that when you go from two active MDSs to three, Ceph just balances the metadata workload across the different MDSs, including the new one. So this code is working quite well and it's stable.
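For reference, raising the number of active MDSs is a one-line change. Below is a sketch of doing it programmatically through the Python rados binding's mon_command; the filesystem name is a placeholder and the command field names are an assumption based on how the `ceph fs set <fs> max_mds <n>` CLI maps to monitor commands.

```python
# Sketch: the equivalent of `ceph fs set cephfs max_mds 3`, sent as a
# structured monitor command. The 'var'/'val' field names mirror the CLI
# mapping and are an assumption; 'cephfs' is a placeholder filesystem name.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    cmd = json.dumps({'prefix': 'fs set', 'fs_name': 'cephfs',
                      'var': 'max_mds', 'val': '3'})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    print(ret, outs)   # Ceph then activates and balances the extra MDS
finally:
    cluster.shutdown()
```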
Now let's think about software-defined HPC. All of our computing is built with commodity parts, both the high throughput and the high performance computing. On the compute side we solve this with software such as HTCondor and Slurm, but on the other hand, typical HPC storage is not that attractive for us, mainly because we lack the expertise to operate those typical HPC storage systems, but also because we lack the budget. We want to build something with the lowest-price hardware and just add open-source software on top to enable the HPC workloads.
So we've done this with CephFS and Manila: 300 HPC compute nodes have been accessing a one-petabyte CephFS since the middle of 2016, and by today we treat HPC as just another OpenStack Manila user. Indeed it's quite stable, but maybe it's not the highest-performance option. So what about performance?
We were very interested when one user, Mr. John Bent, posted to the mailing list his contribution to Supercomputing 2017, a new benchmark called IO500. In the HPC world the top supercomputers are measured with the Top500; for storage there is now this new IO500 benchmark, and its goal is to share the good and the bad, the heroes and the anti-heroes, to try to drive progress in improving the performance of storage systems.
We've just started testing this on our CephFS clusters. The benchmark includes two of the traditional HPC storage benchmarks: IOR, which measures throughput, and mdtest plus find, which measure metadata. Each of these tests has an easy mode and a hard mode. In the easy mode the different parallel jobs write to individual files, which is relatively easy for a file system, but in the hard mode all these hundreds of jobs try to write to the same file, a parallel I/O case that's very difficult.
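To make the easy/hard distinction concrete, here is a minimal sketch, not the actual IOR code, of the hard pattern: every MPI rank writes small, unaligned records into one shared file at interleaved offsets. In easy mode each rank would simply open its own file instead.

```python
# Run with e.g.: mpirun -np 64 python shared_file_write.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nranks = comm.Get_size()

record = bytes(47008)      # small, unaligned record size in the spirit of "hard" mode
fh = MPI.File.Open(comm, 'shared.dat',
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
for i in range(100):       # 100 records per rank, all into the same file
    offset = (i * nranks + rank) * len(record)
    fh.Write_at(offset, record)
fh.Close()
```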
So here's a first look at this, and the disclaimer is that this is straight out of the box with no tuning, on Luminous version 12.2.4, tested just this March. The cluster where we ran this has just over 400 SSDs; they're located on the same machines as the CPUs where the tests and the clients are running, and likewise for the metadata servers.
Well, I can say it's in the number 10 position: our score of 2.46 is just below number 9's 4.25, but you can see that we're already ahead of number 9 in gigabytes per second, while our metadata performance is quite a bit lower. So this kind of activity is something we're going to spend some time on, to try to boost CephFS in these rankings and be able to use it as a real HPC storage system. Now let's move on to the RADOS Gateway, for S3 at CERN.
We use this because S3 is a standard object storage interface, where different developers inside CERN can just have access to storage that their applications already know how to use. We do this with a Luminous cluster with erasure coding, and we just run virtual machines as the RADOS gateways, with one single region. The types of use cases we have include physics data with very small objects; for example, each collision we would store in a different object, so we can have millions and millions of those.
We also use this for volunteer computing. Maybe some of you remember SETI@home, searching for extraterrestrial intelligence; this was generalized into a piece of software called BOINC, which lets users run a screensaver and donate their CPU cycles to physics research. The backend storage for that is on Ceph S3, and we use features such as pre-signed URLs and object expiration to make this all usable for the users, and it's all very stable.
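As an illustration of those two features, here is a minimal boto3 sketch against a RADOS Gateway S3 endpoint; the endpoint, bucket name, object keys, and credentials are placeholders, not our real setup.

```python
import boto3
from botocore.client import Config

s3 = boto3.client('s3',
                  endpoint_url='https://s3.example.cern.ch',   # placeholder RGW endpoint
                  aws_access_key_id='ACCESS_KEY',
                  aws_secret_access_key='SECRET_KEY',
                  config=Config(signature_version='s3v4'))

# Pre-signed URL: a volunteer's machine can upload one result for an hour
# without ever holding real credentials.
url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'boinc-results', 'Key': 'job-000042/output.tar.gz'},
    ExpiresIn=3600)
print(url)

# Object expiration: a lifecycle rule cleans up old results automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket='boinc-results',
    LifecycleConfiguration={'Rules': [{
        'ID': 'expire-old-results',
        'Filter': {'Prefix': ''},
        'Status': 'Enabled',
        'Expiration': {'Days': 30},
    }]})
```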
One thing we do in addition to the standard out-of-the-box RADOS gateways is to run HAProxy in front. This gives us easy high availability, so the RADOS gateways can be restarted at any time without any downtime, and we also use HAProxy to map some special S3 buckets to dedicated RADOS Gateway instances, so that some of our larger use cases don't affect the general service.
This is a picture from a couple of days ago of our Cinder and Glance use cases. On the left you can see that we're currently storing more than 4000 Glance images; these are not just system images but also some virtual machine snapshots. On the right we see that we have more than 5000 Cinder volumes attached, so you can see this is a very popular service in our cloud.
Our service is made available through several different volume types. The vast majority of users are using our standard volume type, which has QoS limitations similar to those of a single spinning disk. For our higher-IOPS use cases we let users create what we call the io1 volume type, which is also quite popular, and then we have different volume types with different qualities of service, related to power or location.
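A rough sketch of how such a volume type with QoS limits can be defined through python-cinderclient follows; the names and numbers are illustrative assumptions, not our production values, and the keystone endpoint and credentials are placeholders.

```python
from keystoneauth1 import loading, session
from cinderclient import client as cinder_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='https://keystone.example.org:5000/v3',   # placeholder endpoint
    username='admin', password='secret', project_name='admin',
    user_domain_name='Default', project_domain_name='Default')
cinder = cinder_client.Client('3', session=session.Session(auth=auth))

# QoS spec roughly in the ballpark of a single spinning disk (illustrative).
qos = cinder.qos_specs.create('standard-disk-qos', {
    'consumer': 'front-end',                 # enforced by the hypervisor
    'total_iops_sec': '100',
    'total_bytes_sec': str(80 * 1024 * 1024),
})
vtype = cinder.volume_types.create('standard')
cinder.qos_specs.associate(qos, vtype.id)
```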
This service continues to be highly reliable. Sometimes we brag to our colleagues that we've never lost one single bit that we're aware of. One reason we find it so reliable is that the quality-of-service limitations really let you make sure no single user can affect the others, so everything is very, very stable.
One of the interesting parts of running a block storage service is that your clients have such long uptimes that it's difficult to upgrade them. But thanks to some recent security exploits, Spectre and Meltdown, we had the opportunity to reboot all of the hypervisors, and this brings in the newest Ceph client, which is really nice. We still have some ongoing work here.
We're testing this first with the HPC workloads: we have these 300 to 400 SSDs in the HPC compute nodes, and I can say that technically this is working quite well. There are some cases where you need to use cgroups to isolate the different processes, but in fact the bigger issues that we see are related to our operations culture, because we have separate teams for storage and for processing, and so we on the Ceph team's side don't own those servers. We need to cooperate with the cloud and HPC teams and develop common procedures: when to intervene on nodes, how to drain nodes, how to upgrade, how to reboot. The cloud guys need to know how to operate Ceph and the Ceph guys need to know how to operate the cloud.
This is an area that needs to be solved for hyperconvergence to work in our environment. Now let me move to a kind of section where I give some user feedback, from me, Dan, as an operator. Upgrading from Jewel to Luminous in general went well, with no big problems. We're replacing OSDs, and all the new OSDs in our cluster are built with BlueStore using the newest tooling, the ceph-volume LVM format, and we're also converting existing FileStore OSDs to BlueStore with a script that's tuned for our infrastructure.
We're also very excited about the Ceph Manager balancer feature. Finally we can make the OSD utilization very flat across the whole cluster. We're actively testing this, and it's really convenient that it's all written in Python, so we can patch and adjust things just how we like for our environment.
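One handy side effect of the manager being scriptable is that checking how flat the utilization actually is takes only a few lines. This sketch pulls the `osd df` data through the Python rados binding and reports the spread; it assumes that command is reachable via mon_command on the release in use.

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    # Same data as `ceph osd df -f json`; routing via mon_command is assumed.
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd df', 'format': 'json'}), b'')
finally:
    cluster.shutdown()

util = [n['utilization'] for n in json.loads(out)['nodes']]
print('mean %.1f%%  min %.1f%%  max %.1f%%  spread %.1f%%' %
      (sum(util) / len(util), min(util), max(util), max(util) - min(util)))
```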
We can see the capacity of the cluster increasing as we add racks, then decreasing as we drain the older racks, then increasing again as we add a new rack of servers, and decreasing again; and on the right we can see the new OSDs filling up and the existing OSDs draining data as the data rebalances across the cluster. This is thanks to Ceph backfilling and recovery: it's the kind of intervention you can carry out transparently, without your users noticing any downtime whatsoever.
We really need something like rbd top, so we can just run that, see immediately who the heaviest user is, and then maybe go speak to them. On the performance side, there are some use cases where you want microsecond latency and kilohertz IOPS, such as databases, and for that we probably need persistent SSD caches on the local hypervisor.
Also, some use cases require encryption at rest or client-side volume encryption. We don't have that yet, but it would be really useful. And in the case of hyper-converged clusters, where we have different cells with local OSDs, we just need some tooling in OpenStack so that users always get a volume that's close to where their virtual machine is running.
On the CephFS side, for HPC, it's clear that we need to work a little bit on parallel I/O, to try to get the best performance possible in benchmarks like this one. Also, from the operations point of view, we need a simple way to copy huge amounts of data across different places in the cluster. For example, rsync could be made smart about how CephFS stores recursive change-time statistics, so we could have a CephFS mode added to rsync; maybe someone in the audience is interested in implementing such a thing for the general use case.
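A minimal sketch of the idea: CephFS already exposes a recursive change time as a virtual extended attribute on every directory, so a sync tool could skip whole subtrees that have not changed since the last pass. The paths and timestamp below are placeholders.

```python
import os

def rctime(path):
    # CephFS virtual xattr: "<seconds>.<nanoseconds>" of the newest change
    # anywhere underneath this directory.
    raw = os.getxattr(path, 'ceph.dir.rctime').decode()
    return float(raw.split('.')[0])

def needs_sync(path, last_sync_epoch):
    return rctime(path) > last_sync_epoch

# Example: descend only into subtrees touched since the last backup run.
last_run = 1520000000.0                          # placeholder epoch of previous pass
for entry in os.scandir('/cephfs/projects'):     # placeholder mount point
    if entry.is_dir() and needs_sync(entry.path, last_run):
        print('changed, would rsync:', entry.path)
```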
A
If
we
were
to
use
set
FS
in
general
for
all
of
our
POSIX
like
storage
at
CERN,
we
would
need
to
do
a
lot
more
testing.
We
would
need
to
be
able
to
scale
to
10000
or
even
a
hundred
thousand
clients,
but
for
this
to
work
really
well,
we
would
need,
for
example,
throttles
and
tools
to
discuss,
to
identify
and
block
or
disconnect
the
the
disruptive
users
or
clients.
So
a
set
client
top
is
a
similar
thing
that
we
need.
We would probably need native Kerberos without an NFS gateway, because gateways will be bottlenecks, and we need group accounting and group quotas for the non-Linux clients. Then, of course, we would use highly available CIFS or NFS gateways, but that still leaves the question of backup open: if we were to open this up to our 10,000 users, we would very quickly have the problem of how you back up a ten-billion-file CephFS.
So maybe we can eventually think about doing binary diffs between snapshots in the file system to enable this kind of use case. At the RADOS level we also have some challenges, such as some very large clusters with old configurations: old tunables, three petabytes of RBD data still with the Hammer tunables. How do we handle that cluster moving forward over the next five to ten years?
Now let's look to the future. This is high energy physics computing for the 2020s. We're currently in our second run of the LHC, generating 50 to 80 petabytes of data per year. In the early 2020s we'll shift to a new run with increased strength of the accelerator, and we'll be generating 150 petabytes of data per year. But then, in Run 4, it's estimated that we might need to store up to six hundred petabytes per year.
So we need a storage system for that, and additionally, with our global grid, we're thinking of something we call a data lake. It's different from the conventional industry data lake: what we mean is that we want a globally distributed file system with flexible data placement, so data lands where we want at the different universities, and ubiquitous access to all of the data.
With these kinds of use cases in mind, this is what motivates us to do these large-scale tests, which we call Big Bang tests, in cooperation with the Ceph core team. A couple of years ago we started this with our first 30-petabyte test, 7,200 OSDs, and we indeed found some limitations, which were fixed at that time.
We did a second run with the Jewel release of Ceph, and in this case we found some limitations in the messaging between the Ceph mon and the OSDs; this was part of the motivation to develop the Ceph Manager that we now have in production for Luminous. We repeated the test in the middle of last year with a 65-petabyte cluster, 10,800 OSDs, and this revealed just a few minor remaining issues, which were fixed, luckily, before Luminous was released. So now we can say that Ceph is scalable to more than 10,000 OSDs.
That's the end of my talk. I want to say thanks to some of my CERN colleagues; these are the team of people that do all of this work at CERN, the operations and development side of things for Ceph, the cloud, and HPC. And of course I want to say thanks to the whole community. This is a picture of the Ceph logo made up of all of the contributors to the Luminous release, and maybe you can even see your company in there. So thank you for listening.