Ceph Ceph on OpenStack, 26 Apr 2016

Previous Meeting

Next Meeting

⏯

youtube image

►

From YouTube: Ceph at Scale - Bloomberg Cloud Storage Platform

Description

Chris Jones, Technical Lead for Bloomberg Cloud Storage and maintainer for Ceph's Chef cookbooks on github, will provide a technical overview of how Ceph Object and Block storage are used at Bloomberg with and without OpenStack. Learn how-to build your own multi-petabyte hyper-converged cloud storage system using the same automation tools used at Bloomberg on commodity hardware. Topics will include: -Multi-Cluster architecture with index sharding for Object Store (RADOS Gateway) -Architecture to

A

It's on okay, well, I, guess we'll go ahead and get started.

A

My name is Chris Jones I'm with Bloomberg and we're gonna talk about some of the things with scaling SEF in this case, if I can get it to work all right, real, quick 30 years and under 30 seconds, we are primary, primarily a financial services provider and you can see our glue Bloomberg terminal there. It has over 60,000 functions and does a tremendous amount of things for the financial market, including data streams, etc.

A

So what do we use of SEF? We use the the object, store, the block volume and we also with the OpenStack size. We just started using the ephemeral storage. One of the guys here that worked on that is here and I believe that the ephemeral storage has now become one of the more popular things that we offer, which is kind of cool. We actually I, think I, guess it was at Vancouver last year that we saw that, and so he went back and kind of implemented it.

A

Now we it talks about hyper-converged, whatever kind of hyper-converged before it was cool. You know everything now is 100% buzzword compliant with hyper-converged. This super hyper converged, you name it, but we started doing that so probably three years ago and where we had everything on, for example, head nodes and head nodes or controllers in the OpenStack world, but then we had not only that we were running mines OSDs, you name it in those same things and that started becoming a problem.

A

So this past year we kind of reacted it a little bit and went to a pod architecture that gave us the abilities to scale out do different things of that nature. So we can actually have three of these per rack.

A

Now you see there in the middle, it talks about tor. So not only we actually put those with each bundle with each pod, and so you can actually make tweaks. We can do things we can scale, however, we want to within the data center, etc, etc. So it gives us a lot of flexibility to do that, but what we were seeing is that our object store was actually becoming very popular, so we wanted to scale that, but the problem was we didn't want to have.

A

You know, have a massive amount of compute space because of the fact that it was so expensive because all of our nodes are running SSDs. All our whole set clusters, SSDs I.

A

Mention while ago the ephemeral piece you can get an idea, it's a give you a little bit of a visual on it because I'm kind of a visual guy, but the if you're, using the Ceph side and then you're looking at the ephemeral, then you can see that there's a lot more network traffic, etc, etc. But there has you know: there's a lot of trade-offs like with safe, it's safe from you know. Well in this particular standpoint. So if one of our customers stands up, however many VMs, etc, etc.

A

They're, okay, but with the ephemeral, but it's a little slower, but with the ephemeral side, it's kind of it's a lot faster and we kind of look at that and say: okay, are you? Gonna live life dangerously because of the fact that you know it could go away at any time.

A

Now the numbers here don't represent actual, don't don't look at it and say: oh, this is what they this, what they find on their production cluster, because that's not true. These are actually done with some of our lab equipment and some of its old so looks you look at some of the comparisons between it and so, when you look at this, compare like do I run stuff with everything or do I mix in some ephemeral, etc.

A

So we get it to the thing that we're doing right now, which is the object, store the object store. We started kind of breaking that out, and so what we're doing now is we have basically, we start out with three racks and our initial piece where we had it is we had a hardware load balancers, but now we're actually that don't, we've actually created our own custom load balancers.

A

Now, in that case, and also the other thing too, I don't have a pointer, but one of the things to remember is each one of these racks are routed there on their own subnet. So you get into a situation with like, for example, with keepalive D that doesn't want to. You know, transfer over the IPS etcetera, the the VIP. But the fact is it's there, so we fix that with other configurations and I'll show you that in a minute, so the our OpenStack cluster runs on a bun, but our object store actually runs on rail.

A

Now, there's no particular reason why we did that it was more thinking. There was more of less an olive branch because we have a lot of storage groups. We have a lot of other groups within the company and a lot of those guys. They use rail and so they're comfortable with it, and so I wanted to get more of these guys involved so that it would help us, especially with the OP side of it, because I didn't want to sit and do with ops stuff all the time and so I got those guys involved.

A

So we basically put it on rail. Now in this you see that the top part is our tour and then, of course, then we have three 1u nodes. Now the 1u nodes are going to be there. Basically, our Mon nodes, they are our rails gateway and our load balancers and then the other 17 or 2u nodes, and those are all of our OSD notes. Now that's important because I just came from a talk which was a great talk from Comcast and they actually are running large density servers and I.

A

Think roughly about 72 drives per one. In our case, we actually have 12 12 drives in this 12 Penner's and there are six terabyte drives, and then we have two SSD journals now. The interesting thing about that is that they're, not those SSDs, are not just the journal, so they're, actually they're, co-located or co-hosting, so to speak with the OS itself. So basically a small portion of the first top of it is actually the OS rated to the second, and then we have six journals on one s, SSD and six journals on the other.

A

Now and then the journal sizes are larger. We went a little larger on our journal sizes because we had the space and so they're running at 20, 20 gig also re interfaces. We have to nix our two ports ones for a cluster size. It's tin, gig and one for a public is Tinggi. Now write us gateways in Mons. Don't use the the cluster side, so the only thing that they they actually do is the the public side. So in that case we're actually bonding those and then we're binding them in the basically LACP mode.

A

So it's mode 4, so we can get some aggregate pieces out of them.

A

This is pretty much some of the things we talked about, but one of the things I skipped over there I was gonna talk about a little over here is to actually get scale we are are currently our OS I mean our OpenStack ones are running all replicas three replicas on straight SSDs, but with the object store, what we've done is it's all erase your coding and kind of find out whe what we're doing with an eraser Cuttino. You know, there's not a whole lot of Doc's out there.

A

There's not a good explanation of why you do this, why you don't do this etcetera, etcetera? So a lot of this stuff was trial and error and then, of course, bringing in some folks from Red Hat to take a look at it and see. Then we came up with some combinations from there, but the interesting thing too, on the rightest gateway side is, you know if you've ever created a rightist gateway?

A

It creates a pool, set, we're roughly about 14 pools approximately in each of those, you know, has some sort of function within ratos gateway, the the most important one is the the bucket pull the right of the dot rgw bucket. That is the only one that you actually do. Eraser coating on the rest of them are actually replicas and you'll see some of that here in a minute and what I'm talking about? Also to the another important thing that we done- and it was also mentioned in the other talk to- is the OSD nodes.

A

We run a hardware controller zone because there was a significant performance increase. If you try to do software pieces or own boards controllers, etc. We actually one of the guys here who actually started doing some testing with some new equipment. They got it in and apparently we somewhere along the lines, we didn't have a controller and they were like hey. Why are these drives and everything supposedly better than all this other stuff, but it's slower than our current architecture, and then we got in start digging now.

A

He did primarily and realize that oh, we don't have our controller, so we wasn't doing apples and apples comparison and it was a significant performance increase.

A

So this gives you a more of a logical view of how we have our object store. We have our spines, but then we're also. Then we have our to load balancers. Now. The interesting thing like I said before with keepalive D is the fact that you can't really span multiple subnets in this particular case.

A

So you will see if you do like a nike space, a on the first load balancer and you do the one on the second one you'll see the the VIPs over there, whatever hired many ups that you may have, the difference is we're we're now going to use BGP, that's going to advertise these routes and we use birds for that, because we're everything we're doing we're trying to stay in a an open source or basically a non vendor solution.

A

That's where we're heading we're trying to get everything toward a non vendor solution, so we set up the bird on the BGP so that it advertises to its peers and it's peers are the spines and then the rest of the network can recognize where they are now in doing that, we don't want to advertise the secondary because it will get confused. They want to happen in because they did. What happens is that when you start doing doing rate of Skaguay calls, you start doing different things, you'll see, just connections will drop and what it was.

A

It was the routes and everything else was getting confused. So there was there's a configuration setting that you can do in bird, which basically makes this a secondary advertiser sort of a primary, and so that worked out pretty good. Now the radio's gateway. We run multiple instances, irate us gateway, / raid, us gateway, node and remember, institute this it's a 1u box. It has 210 gig ports and it has 256 gigs of ram.

A

Now the original piece call for 128 gigs, but when they came in they had 256 and I wasn't going to turn it away so I kept it and I just didn't, say anything life's good, so we were kind of looking at those hey. Why don't we start doing something a little different? We can actually approach this a little different, and so in that case we started doing some investigating on how can we run multiple instances of Redis gateway on a single node?

A

We get split it out from a standpoint of, and the reason for this is we have what's. We have private networks in our group, so in that we can't let this private network see what's going over in this private network, etc, and so typically all of our clusters are inside a private network within itself. So if the private network a wants a cluster, then we have to build out a whole cluster because there's some security standpoints in the past. The way it was all worked out. You couldn't do that.

A

You couldn't say here's a converged piece of hardware. Why don't you all come into to that so, but with the object store, that's the very first product within Bloomberg. That's been allowed to do that, so we have now a centralized server cluster that now takes access from private network, a private network BCD how whatever and so each one of those ratos gateway boxes are also weighted, and so and the reason for that is. We want to be able to scale them out if we need to or if we have failures we can set up.

A

For example, we could set up OSD nodes as a lower weighted rate of Skateway if we have to- and the same goes for other things too even months- and you can see that in here in just a bit, so each one of the load balancers basically weights. This puts it on a different port for if, if the VIP is coming in off of private network a goes over here, it goes to a different port.

A

If it's private network B, now here's some important configuration pieces, the you know it was pointing out in the last section that says you know it has I, don't know how many but I mean it has so many knobs. You can't even count, and so you do this. You do that and you turn one over here. Something else happens, you're kind of like oh what's you know this doesn't make sense sometimes, and sometimes it does so. You have to actually tune it for your given environment.

A

So one of the things that we've looked in what I was telling you before is each rack is on its own subnet cool thing. You know, because all the examples you see show it in one aggregate like a slash, 24, whatever etcetera, etcetera, but we actually have it in slash 27 for that and they're all routed, all routable on our OS DS.

A

We, those are all x FS with onboard controllers like we talked about and then our rate of skate wave components, one of the things that we're testing right now we haven't implemented it yet. But what we're testing is the Federation with regions and zones, and we've got another cluster that we're about to stand up. So we can do that and then, of course, the eraser coding pieces and the thing to keep in mind about eraser Kody, it's different than replicas replicas, like did did simple I mean you got one object, another object and another object.

A

Eraser coating is different. The crush maps are different. Everything about it from that standpoint is different, and so you actually have to have a reasonably good custom crush map, and so we have two rules that we created. One is for the replicated and one's for the eraser, and what we've done is we've done it by racks and in by hosts than by OSDs, so that we can kind of distribute the low, because the whole thing about Seth is data distribution.

A

That's the whole point, because you don't want a rack being full over here or almost full, and then you have two other racks, three other racks or whatever, and they are not completely full but you're kind of like. Why are all this? This funky stuff going on well- and this comes back to the crush map and also to one thing, to keep in mind and I would recommend this with almost any configuration is to do sharding on your bucket indexes.

A

Now that's an interesting thing and the way it works is that there's a setting I think it's yeah right below there that you just basically tell it like a mac shard of five. Now that five is just a sample and you're gonna see something too, because everything we do is all open source. All thing is open source, except for the data of our given machines itself, such as the MAC addresses in IPS and all that other good stuff. But everything else is open sourcing, but you can take it.

A

So you may do some of your buckets. You may have it set at five, and then you say you know: I need to go up a little higher and then you set it to ten, but your previous, all your other component, buckets that you had or the buckets or actually pools, or in this particular case or buckets and they'll all still be at five, but all your new ones will take on the new configuration. So it's not going to go back and change any of the old stuff. Also Civet web.

A

We kind of booted Apache out. It was a memory hog and a lot of other things and then of course, the front end piece is now. This is an interesting little piece, because a lot of people don't know about or know that you can do it. It's really nothing there's just left out of docks, and that is you could there's. Civic wave itself has a lot of options. It has a lot of different config settings that you can do. You just have to go.

A

Look at the Civic web project and then you start seeing all those different components. So then you come back. You look at some of the code within the rate of Skateway, then you say: oh I can use that. I can use that the one there that says number of threads equal 100, that's actually a default setting. We were playing or I was playing with increasing decreasing, looking to see what happened, etc. But then I just left it in there, but that actually 100 is the default setting within civic web.

A

Network all right, most everything you'll find performance, wise or basically falls into these two pieces, the load balancers and your network. The like I, said before the rate of skate waves of the mods and the load. Balancers all are bonded. We had 210 gig ports and they're bonded with mode 4 for aggregate in this particular case. Also we're using jumbo frames now, so we set the MTU to 9000, definitely want to do that on your cluster piece that so that's a given and the reason. One of the reasons for that is.

A

Last year we were doing some testing on our clusters that were in our DMZ and I was comparing it to s3 and I had to do that, because one of our customers was doing stuff and say: hey we're gonna move over here. We're gonna do this whatever, and so we needed to get good benchmark is I. Think it was canonical.

A

Yesterday in here was talking about that when you're doing comparisons of looked with your storage cost-wise, you want to it, doesn't really matter. What's going on everywhere, you need to base your cost off of what's Amazon doing, that needs to be your baseline and the reality is for us.

A

Not only was it the cost that we had to look at for baseline the price, but we also had to look at the performance because of our customers say: hey I, get faster doing this or I get faster over here now, that's only in like I said, there are dmz side. We have many private, secure, Network says so that that's not never a factor in that particular case, but on the other side it is now.

A

If you do, you know, oh and the other thing about the MTU 9000, so I had problems last year, so it basically takes a cloud to test the cloud. So I was testing a lot of different things with jmeter from Amazon back into our cluster and then from Amazon to s3 and by the tweaks we've made with rate of skate weight. I had parody with s3 I saw. The irony was I'm onna ec2, comparing myself coming back to our DMZ and then go into a closed region for EC.

A

For s3 and I was on parody performance, wise the, but we also saw some things dropping along the way, and it was because of the fact that the discovery mode so some of the devices, etc etc, wherever they may be, we're not doing what they're supposed to do, which is Auto adjusting an MTU from that standpoint, and so we just went ahead and set it at 1500 on our public side.

A

The the config setting that I talked about earlier with bird is like on that bottom line there it's talking about setting secondary nodes with your ASN, and this is real important. If your lever do it any BGP stuff, because you want to make sure your routes are advertised correctly now.

A

Obviously everything we do I mean we can't even approach this without automation. There's just absolutely no way the last talk they were talking about. You know it was tweaking this hardware to this hardware and this hardware and again it was a density node and those were purpose-built components and you had the time in our case we're using lower density because I don't care for machine guys, I, don't care, throw another one in you file a ticket, get it in I. Don't care for drive, fails file a ticket get it in.

A

So we have spares for those very reasons to do that, and so we're not tweaking all the hardware everywhere we can, unless it can be fully automated. If it can be fully automated, it makes sense, then we definitely do that. So the and all this stuff isn't shift. Now you see the first one there, that's our Bloomberg OpenStack. It was originally called B CPC.

A

It's still called that, but this made about 400 other changes in veins and stuff, so it's BCC now for Bloomberg cloud compute because and then we've matched with our Bloomberg object: storage to blue blue blue cloud storage, and so, if you see, if you go to the github up there, you can actually go ahead and clone it. You can do it right now.

A

I did it while ago and the other talk, because I was just testing the performance of the network and all that and I even built a self cluster on the lap this laptop while I was sitting in the other session, so you can clone it. You can build you and then basically run the the vagrant up or actually there's a couple other things that you would do and that would build out a full open stack along with Ceph in that particular stereo.

A

The the other interesting thing here, if you look at the second one, that's set chef now that and if you look at the the github on that, that's actually managed at the Ceph repo and we actually we created it, and we basically are the admins for that and so, and that is a complete cookbook. That will give you everything that you see here, plus more for everything, including CFS, etc. So and then the next piece is our object, store which I'm talking about and that actually implements the Ceph chef.

A

So in essence, what the the second the the chef BCS does, is it actually, when you basically say set up in this case for development environment, because we do everything on VirtualBox for our development and then we roll it into our hardware. Inside of it, it will actually go out grab all the cookbook there have all the dependencies grab all the packages, everything it needs, because in our scenario, we cannot out the outside world. So look for security reasons, etcetera, etc.

A

So we we operate behind lots of proxies and you name it and in like I said all of this stuff is on github, completely free go out, get it right now. Actually that would be awesome. Man, issues and pull requests. They'll be really good, because there's a lot of enhancements out there that need to be made to everything.

A

So here it talks about again the just so you know, I wasn't kidding your screenshot of the actual github page the same here for the cloud storage and then our BC PC. This is a much larger project, obviously because it has all of the OpenStack components, etc. But again it has the configurations and things necessary to build out full clusters because I don't want to like in the storage side.

A

I don't want to build one way on hard or over Harold vagrant and all this other stuff and then do something different completely different on the hardware. So we try to keep everything as close as possible, but there's some some challenges with that too, especially when you get into the networking side so that you can't do with the VirtualBox side of it and then you actually have to say: hey I've got if actually put this on hardware to see what happens now.

A

This is something I added. So if you, because this question gets asked a whole lot on eraser coding, how much capacity will I get if I do a ricer coding versus replicas, so I'm like a a lot I, don't know how much you gonna do I, don't know.

A

I can't answer that so well, I got a little tired of kind of answering that question or not answering that question, and so I thought you know what I'm just gonna create and then go through the formulas set it up so that you can actually plug in the number of OSDs you're going to have what size they're going to be etc. Show you what your off compute size is and then allow you to do.

A

What if scenarios, where you can look at your case, ID and your inside, and then it kind of trigger out where you kind of balance, where you want these eraser code settings to be that's also and the github, so you can actually go and download it right now and start playing with it now.

A

The interesting thing here, if you see- and this gives you a better vision because I'm visual, so it gives me a better picture of what my capacity is going to be so, for example, I know if I increase my K or decrease my M, then I'm going to better utilize. My storage, but there's trade-offs there remember I was talking about replicas were something you know: hey copy, a B and C simple, no problem.

A

The eraser coating takes, for example, if I have this uses, 10 for easy, math and I have a 10 gig file that I I have out there and then I, say: ok, I'm gonna do a K of say, 5 or any of those really. In that case, then that 10 gig gets split into 10 evenly into basically the number of case in this case 10 so now, I have 10 objects that are going to be floating around. The storage I have to do something with that.

A

I've got to balance it because it's distributed, and so then it allows me to look at this because I have to look at those two values to determine how I want my crush map to look like, because you can't just go and use the default crush map or you can't use a simple one that says: okay step, choose leaf, blah blah blah Rach, you know, tie track and then set this gonna make it where everything looks, good failure, domains, etcetera.

A

It won't work because you'll go to implement this and you're saying why aren't any pools being created? And the reason is because the crush map is looking for K number of whatever you told it to look for. Do you have K number of racks?

A

Do you have K number of hoses thanks, miss nature, depending on what your failure domain is and your eraser coated profile- and this gives you the abilities to say okay I'm, going to trade off a given pool for whatever reason so I can actually maximize my make my storage more efficient or I want to see something a little different, but I see if I go back to that. Okay. Well, one thing I didn't mention was the only em side.

A

This is important because those are your parody chunks, so it takes see they say in the scenario: I have my 10 gig. It takes that divides that, for example, by K and then it'll add any buffering. So it's all even and so you've got a one gig a 10-1 gigs, but then all of a sudden- but let's say I- have five set at my M on my inside. That's my parody side, you're also going to see 5, 1, gig pieces or other chunks.

A

So you could have a total of 15 that are floating around out there.

A

The this gives you kind of a what I was talking about before and this. Actually, these numbers is percentage-wise, they're actually taken from the PG cow calculator, and you know you go to set comm / PG calc and you can go in there and kind of set and play with okay. How do I look and set up my pools or my PJs and those fish nature and inside us inside the cookbook?

A

If it sells, it actually implements the that same PG calculator inside the cookbook, but it gives you another option because, instead of doing to the nearest power, you can also say: hey I want to go to a higher power, so you have the option to do whichever way you want and that will actually help you with your PG distribution, but the you can see the amount of data that each of these pools in a typical just plain, simple, out-of-the-box, raitis gateway pull set. The is 96.9% of that data is stored in your buckets.

A

Now, it's not. You could probably get away with some of the others being rachelkitty, but this is the recommended way of doing it. We're only the actual buckets themselves are racer coded. Everything else is replicated.

A

So the okay, so now you're kind of looking at a record in because I know everybody wants to maximize their dollars. They want to maximize, but they don't have a whole lot of information on it. It sounds really complicated, etc and it sort of kind of is when you start to look at it at the first time. But then we start working through and looking at how the objects are laid out with itself.

A

It begins making a lot of sense, and so it becomes more and more clear and then you can actually begin doing some really neat things with your crush map and distributing your load a lot better. You in the main thing to with your extra coding pieces, is think about your failure. Domains like, for example, a typical failure domain and the replica is Iraq.

A

That's a typical thing: people try to do say: I can lose a rack or whatever, and it doesn't bother me so, but with eraser coating and that's not going to be your failure domain most likely I definitely won't unless you have a whole Iraq's. So, in this particular case, our failure domain is, is a node, an actual storage node. You can make them as iOS D or at different pieces, and so what happens that?

A

The reason why that's important is because your crush map and oil that will try to basically keep going without so it doesn't repeat an object of one of those aggregate objects inside of the same house. Now that would be horrible horrible, because what happens when that house goes out? Well, you just you just lost, you lost your data, so instead it will actually try to disperse that, etc. So, there's a couple different ways to kind of check on this play with it see, see the settings etc. One is again you set your profiles.

A

You can set as many as these profiles as you want the default to own these. All these defaults are, for example, the plug-in is J eraser, because you could you can change that. I left it at J register in that particular case, and so you can set these values like 10 like I was in that scenario, I was just talking about and I said the failure domain to host you could create multiples of it.

A

Doesn't matter they're, just profiles, nothing happens at this stage, then you come down now when what I like to do because we manage our whole crush map itself, so we set the own start. Equals faults all that stuff. So because of that, if you start having no STDs go down and start flapping and all that, then you have to kind of roll them a little bit. So I don't like to do that. I don't want to do that.

A

It's less works better for me, so what I typically do before I do any of this is all set the know down in that particular case, because I want to keep everything up. I don't want to go back and fix it later from me playing with it and now, of course, that's gonna, say health warning, but that's because you set a flag. That's the only reason you want to say that the then you create a pool, and so you create a new pool.

A

Add your PJs, your PG s's, and then you give it the name of the profile you just had and then the name of the crush map rule set that you're going to use. In this case, I'm, gonna use and I happen to call it a CD underscore eraser, because the other one is for replicas.

A

Then you you come down and you start looking at your. You want to get your PG number after, if everything if it goes out and starts building your pools and then that you know that's good, but you want to see where they're distributed and because you don't you want to make sure that you don't have things like I said all in one host or even all in one rack and because if when it's the it's actually tempting when you say oh wow, I had a help.

A

Okay and on my first Direction code, the cool great.we staff support racer Cody. But then you start looking at oh no everything's in one rack. So if this rat goes out I'm hosed, so you got to start playing a little bit so yeah taking the stereo. In this case, you want to look at your pools, your OSD, so Sefo stls pools. You can do it. You can do SD dump on that and just look at the top part as well and the the the PGS and all that are going to be.

A

You know unique integers, it's a plus some other things behind the decimal, but the primary part is the actually I'm talking about the pulsar, and you want to get that the pull ID and then you want to do like a PG dump, and then you want to grab on the PG. Never has a couple like I said: there's several different ways: you do it. This is why I do it and then what that does is that shows me?

A

Okay, now, I'm, only looking at the my PG map are dumped on just the pool I'm interested in because it doesn't have any data in it now. But it's going to tell you where it's mapping to what OS, DS and so you'll see something like like in this case, 1005 212. So it's on basically five. Just in this scenario: 5 OS, DS and then you can find out.

A

You know, do a semi, oh s, D find and then you can find out where, like 10 being the idea of the first OS, did you find out well host its own, which is Iraq? And if you segmented your obviously your host names properly, which you should do always and then you'll know what rax etc are out there. And then you start making adjustments to your crush map. Based on that, because that's where you're gonna find where your placements are and that's like I said it's critical.

A

You can't skip that part because you want to kind of understand where the distributions are testing. So everything you know. Obviously, testing is critical with OpenStack when we first started doing things, it was we've used tempest and we actually have rally inside of our class. We don't reuse it as often now, as we did a couple like last year. We used it a little bit now, but on the safe side, the radius bench, obviously the cost bench, and then the eff, iOS and and the reason for some of that is just you want to test.

A

You want to look at your drives, etc, but the one that I use the most is jmeter and the reason I do that is I, set it up in a master slave configuration so that I have several many instances that are running and they're gonna be running tests which are going to query objects and things this nature. So I'm going to see, what's actually like a real use case of something that's going on with and then I'm going to most likely do random, which I do I actually do range.

A

Random, byte range requests so that nothing gets. You know you're not dealing with one, the first TN of each object or whatever it is you're segmenting, iid random. In that particular case, and then you can find out where some of your performance bottlenecks are and again most of time, you're gonna find that it comes back to your network.

A

You'll have balancers those are typically there, but then, at the same time, if your OSDs are just screaming and doing work things of this nature, then you're gonna see tremendous amounts of latency from there and that'll help you kind of pinpoint, but along with the other tools.

A

So what are we looking at going forward on some of this stuff? We're definitely obviously always looking for improvements, better monitoring, even some DevOps pipelining, and also to the one thing about that. The Ceph chef cook books and the BCS is built so that you can build it and plug it right into a pipelining system, something like go CD or something of that nature or even Jenkins, but Jenkins doesn't do pipelining. All that great!

A

That's that's a debate, don't throw stuff at me, but and then the where we're going to is NSSF with non vendor solution solutions, that's kind of where, in the the, where we're kind of heading with some of this, and so we're definitely looking for performance improvements, no matter what and we're looking at better multi-tenancy capabilities in jewel. That's where some of the some of that's being kind of laid down a little better. We're gonna be testing that, and just so you also know the so.

A

What we have our typical, our smallest cluster is roughly 3.6 petabytes as our smallest, and so the I got somebody to agree he's sitting in the back of the room and he got he got. It agreed with some others that hey I've got another cluster over here.

A

There's 3.6 petabytes that hasn't been stood up yet can I use that as a lab and they're like ok, you have 90 days so I'm like great, so that's what I'm gonna be doing with it and testing a lot of these others pieces with it, as well as enhanced Securities and then also the other thing that we're kind of looking at too is for specific use cases, some of the nvm e's that we may be looking at some hot-swappable components and in any other high-performance DS and could rollers etc, and they maybe then this is just a maybe with the RDMA pieces.

A

So that way, your OSDs can't communicate a little faster directly without going through all the tears, the the other networking components etc, and that's it and here's the again I just want to put this back up here. The guy yesterday said something about Twitter, said: hey I, don't tweet much, but here it is I'll put some stuff out there same here: I, don't tweet much, but the findings that I have the things that I'm allowed to share on our findings with this test clusters etc.

A

I will put it out there, so in that'll be one way that it'll do it and then, of course, will always continuously be updating matter of fact. I just made pull requests and emerges this morning on some of the cookbooks, etc, etc. So questions thank you.

A

I'll try to answer them, so I can't guarantee it.

A

Alright, so we're running beta with our object, store right now and that version is, is hammer nineteen. Ninety four point six so I didn't mention that, but that's what it is, the what we are testing the new cluster is actually going to be jewel. So that's what we're doing on that.

B

High okay, so your analysis- brilliant, thank you so much, but what you're using is a hardware is looks like or sound to me is an enterprise grade. Sst right, that's true! So if someone asks you, for example, telco service provider, hey I want to swap I'll change. My tape solution into self based low cost storage.

B

That cannot be a high end. Ss.

A

B

Regular drives what will be your input and what will be the variable that will be added into your analysis, increase your pain, tolerance.

A

That's the first thing: no, that's actually true and the reason I say that part of this this talk here is part of a much larger talk and it talks about you have to have buy-in along the way everybody has to realize. You're gonna give and take that's only that's just the way it is, and so you've got to be able to change your pain tolerance, so you can get there. So what you have to look at what you know with SST is you can talk to some of these vendors that are out here?

A

They know what way better than I do, but you have basically they can only write so many times, and you know your your things of this nature. The mean time between failures is really low on consumer grade, but can they be used yeah, but I just really depends on your use case. I mean just really look at your use case. I'm, not saying don't use do it because you can do anything. You want.

C

So the question about the probability kitchen part: you know: I don't have much experience in a circle thing, but for application part when you use that, did you use default to pocket type. When you in your in your creche map, you have to specify like a straw or you.

A

Go back to the first I've messed that first piece yeah.

C

I'm just talking about the question context that was first part, okay and asking about which cheaper key type did you use further? You know replica based in a pool I would even basically you know there are a bunch of like written list or straw tree or stuff syndicate. You have a rack or where crucian map definition there yeah.

A

So let me pop over here: I, don't know if this will work. Can you cut that back on for a second.

A

See the so here it gives you a an example of some of the Tings tunings we're actually using the second release of the straw calculation for tunable. So this is our base, and this this is actually because it's the saying is: it's actually our base, that's in production of what we start with and so you'll see the sets coming down for the different pieces, and then you'll also see different and actually the bottom one down there with the you know, with the minus three etc. That's that's actually not in our production face.

A

I was actually just testing some stuff on vagrant this morning on that build so but from there what happens is and the the Ceph chef piece in the OSD sections. If you have a racer coded enabled in the cookbook, then what it does is it moves the OS when it creates the OSD. It moves that into the appropriate slot inside your crush map tree and then it balances based on those weights on the rack and in the nodes, etc. Okay,.

C

A

D

So I was wondering in my company: we have a really bad we're starting to put stuff in production, and our network is very slow. Is there any point in your opinion to tune the performance when the network is just slow like? Do you think we need to talk to infrastructure to get well.

A

D

I would get everything I could get yeah. No, but like do you think it makes any sense to do performance tuning now and then wait for them to fix network or just fix network right now and then do performance tuning as well.

A

So that's a good question, so in essence what you can do right now without trying to do any tuning, get a baseline with what you have you get a baseline on your performance. You get a baseline on how you're delivering everything and then from there start tweaking it a little bit and see if what were the deltas are between your but from where you were.

A

Your tweaks are hopefully they're in a positive direction and not a negative direction, and then you can actually compare that, and so then you have a better idea that when this you get better throughput, etc, etc, and you have a little lower latency Network, then you're gonna be able to take advantage of it immediately. So.

D

You think is like predictable, because I was worried when you said you're playing with so many knobs that you know, maybe that was the reaction will change when the network changes no.

A

You're, actually it should be to the to the good side of that. So definitely on that I got it thanks.

D

A

Anything else: okay, great thanks.