Description
Lightning Talk: Exascale Shoal Architecture in Ceph with Kyle Bader of Red Hat.
Filmed on October 28th, 2019 in San Francisco.
So, Kyle Bader. I'm with the storage BU at Red Hat. I've been working on this stuff for a while, and one of the things that has been coming up recently is: how do we take Ceph, and take object storage, to the next level of scale?
I have this old Ceph logo from like 2007 on here; we used to say it was petabyte-scale storage. Well, that's not cutting the mustard anymore. So how do we adapt?
Some of the largest tests we've done were with maybe 10,000 OSDs. Every now and then CERN, the same place that collides the particles, will say: "Hey, we're getting a new shipment of hardware in. Do you guys want to run some tests on it for three weeks?" So we've done this a series of times. The most recent one was with 10,000 drives: we built a 10,000-drive Ceph cluster, ran it for a few weeks, and each of these times we've pushed it to make a bigger cluster.
If you look at the highest-capacity drives you can get these days, they're around 16 terabytes, and if you were able to have 10,000 of them, you're going to get on the order of 160 petabytes, which is a really big data store. But we have customers that are starting to say: "Hey, I want 100, I want 200 petabytes of storage."
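The capacity arithmetic above is easy to sketch. This assumes 10,000 drives at 16 TB each and decimal units (1 PB = 1000 TB), and counts raw capacity only, before replication or erasure-coding overhead:

```python
# Rough raw-capacity estimate for a hypothetical 10,000-drive cluster.
drives = 10_000
tb_per_drive = 16                       # ~16 TB, the largest drives at the time
raw_pb = drives * tb_per_drive / 1000   # 1 PB = 1000 TB (decimal units)
print(f"{raw_pb:.0f} PB raw")           # 160 PB raw, before redundancy overhead
```

Usable capacity would be lower once you subtract replication or erasure-coding overhead.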
What does this look like? They also want really demanding throughput, particularly if they're going to be using it for machine learning applications, where they're saying: "Hey, I'm going to have all these cars pumping data into this thing, on the order of petabytes per day, and then every so many months I need to go through all of the data and retrain with the fresh data."
So you're looking at a very serious amount of throughput demand, and we were thinking: okay, how can we solve this? How can we cater to these really demanding use cases where people need hundreds of gigabytes per second of throughput and 200-petabyte clusters? What can we do?
One of the things that was done a number of years ago: we were working with the Yahoo folks and helped co-develop an architecture where they had multiple Ceph clusters backing the storage for Flickr. I think this same sort of architectural approach is still relevant today: you can create an architecture that's kind of like a shoal, like a group of squids.
So what can you get out of this? In terms of sub-clusters, we were seeing what sort of throughput you can get in and out of a relatively modest-sized cluster, and then you can extrapolate. Even with a relatively modest cluster of about 700 spindles, we were able to do a little bit over a petabyte in 24 hours, which validates the use cases people are describing.
We tested, I think, up to 250 million objects in a single bucket, which is a lot. You can see that after an initial step in latency, where we internally shard the bucket metadata, we had very consistent latency, even with 250 million objects in a single bucket, which we considered to be a lot.
But we also wanted to test at global scale (this is a lightning talk, so quickly): let's put billions of objects into the cluster. And so we did. We put over a billion objects into the cluster and observed how performance changed over time. The red bar is the latency, and the blue bar is the object population.
Our latency stayed relatively flat until the point where we were taking up all of the SSD with our metadata and slowly starting to spill some of it over to disk. You can think of it like a cache miss: as less of the metadata was on SSD, more of it had to come from hard disk, which is why the latency was starting to creep up over time, and similarly for reads.
With those individual sub-clusters out of the way, what would a shoal look like? Well, in a Ceph multi-site topology you have these ideas of zones and zone groups. Zones and zone groups were originally put into place in order to do replication between them, but that doesn't necessarily have to be how you use them.
By creating a realm, which is like an S3 global namespace for buckets, a bucket can live in exactly one zone group. You can potentially configure multiple zones in each zone group and do replication between them. But if you only have one zone in each zone group, you're basically just partitioning the namespace of buckets, so each bucket lives in exactly one zone group, and when you're interacting with the object store you don't necessarily need to know about any of this.
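As a rough sketch of how such a partitioned namespace could be set up with `radosgw-admin`: one realm, one zone group per cluster, a single zone in each zone group, and no replication configured between them. The realm, zone-group, zone, and endpoint names here are hypothetical.

```shell
# Create one realm: the global S3 bucket namespace shared by all clusters.
radosgw-admin realm create --rgw-realm=shoal --default

# On cluster 1: one zone group containing a single zone.
radosgw-admin zonegroup create --rgw-zonegroup=zg1 --rgw-realm=shoal \
    --endpoints=http://s3-zg1.example.com --master --default
radosgw-admin zone create --rgw-zonegroup=zg1 --rgw-zone=zg1-a \
    --endpoints=http://s3-zg1.example.com --master --default

# On cluster 2: a second zone group, again with a single zone.
radosgw-admin zonegroup create --rgw-zonegroup=zg2 --rgw-realm=shoal \
    --endpoints=http://s3-zg2.example.com
radosgw-admin zone create --rgw-zonegroup=zg2 --rgw-zone=zg2-a \
    --endpoints=http://s3-zg2.example.com

# Commit the period so the new topology takes effect.
radosgw-admin period update --commit
```

Because each zone group has only one zone, no data replication happens between clusters; the realm just partitions which bucket lives where.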
It's DNS, right? If you go to your s3.example.com, it's going to map to the main IP address, or if you use path-style access you're going to go through s3.example.com, and in the most simplistic sense you could just round-robin this across the clusters. But you could also potentially do something more sophisticated.
For example, in HAProxy, where you have some sort of Lua-based routing that actually looks into the cluster to find out where a bucket lives. I don't know if anyone is familiar with an SDN controller, but it's kind of like a first-packet approach: the first packet gets resolved by the control plane, which then embeds something in a lookup table so that subsequent packets don't have to go through it.
So say Kyle is the name of my bucket: I go to kyle, you go to the DNS, and you'd have a DNS plugin. Right now someone has written one for PowerDNS that talks to Ceph and says, okay, this bucket is in this one, and then it responds with the record that routes you to that cluster.
Well, you could do the same thing, and I think where you could take this is to put it into CoreDNS. Then, if you had an operator deploying multiple clusters inside of OpenShift, it could automatically configure all of this wiring and route the traffic appropriately. So that was my quick little talk. Thanks.
Thank you very much.