From YouTube: 2015-SEP-24 -- Ceph Tech Talks: Reference Architectures
Description
A look at building reference architectures for Ceph.
http://ceph.com/ceph-tech-talks/
A
Alright, welcome back everyone to the monthly Ceph Tech Talk. We've had a lot of different talks on the core aspects of Ceph, but pretty much anything technical that has to do with Ceph is something that we're looking to host under these talks. That way, folks that are following along or come to the YouTube channel can see the whole presentation before the questions start. Without further ado, I'd like to introduce Brent and Kyle from Red Hat, who are going to give you a rundown of how we build reference architectures. Take it away.
B
Thanks Patrick. So Kyle and I are going to keep this interactive and conversational between us as we go along; informal is best. That's one of the things that we do: we spend time with a lot of our partners building reference architectures. One of the reasons we do this is that one of the most frequently asked questions we get is, hey, this software-defined storage stuff is great, it runs on anything, so what do you recommend we run it on?
B
Okay, so the building blocks for reference architectures, the common ones for distributed storage. Of course the network is the foundational building block, and sometimes something we don't talk about enough; you know, the impact of 10 gig. I know it's well known in the community to use dual networks, one client facing and one cluster facing, but as you go up in network bandwidth to 40 gig, and as you look at bonding or not, you know, to bond or not to bond, it makes a significant difference.
B
So those are some of the variations, and of course then up to the servers. In the last, oh, probably 24 to 36 months, as you know, there's been just an explosion of variations on x86-based storage servers. A couple of years ago the default was kind of a 12-bay, 2U, dual-socket server with 12 three-and-a-half inch drives.
B
Now there's just a litany of different storage servers out there. Just looking at capacities, you see everything from 12-bay up through 24, 36, 60, even 72-bay chassis. And then of course the question that folks ask is, well, what kind of performance can we expect with different server types? Let alone different media types within those servers, different...
B
Different
capacities
of
HDD,
obviously
lower
lowering
the
capacity
increasing
the
quantity
of
servers.
You
know
changes
the
spindle
count,
obviously
going
to
change
the
type
of
performance.
You
can
expect
the
classic
question
of
right,
SSD
right
journal
ratio
to
HDD,
but
what
about
instead
of
a
SAS
or
SATA
SSD
writer,
and
what
about
PCIe
and
bme
right
journal?
You
know
how
does
that
change
things,
and
then
what
about
putting
the
OSD
all
together
on
instead
of
hard
drives
on
flash
drives?
B
You
know
folks
say
ok,
so
what
about,
if
I'm
driving
this
load
through
a
kvm,
BM
or
or
just
driving
the
load
through
the
radio,
skateway,
no
verte,
layer
or
emerging,
of
course,
or
questions
we're
getting
on
from
a
micro
service
as
a
container
type
verte
level,
and
then
on
top
of
that
all
this
is
kind
of
this
is
kind
of
pseudo
Maslow's
hierarchy
of
needs
here
at
the
bottom,
being
kind
of
food
and
shelter
working
up
to
that
self-actualization
level
is.
Is
the
defined
workloads?
B
Some
ref
arcs
are
read
more
like
how
to
integration
guides,
and
these
tend
to
be
at
the
upper
layers
of
that
hierarchy
of
needs.
On
the
previous
slide,
we're
now
you
know
kind
of
SEF
layer
and
the
OS
or
Burt
level,
and
then,
on
top
of
that,
you
know,
set
+
OS,
Burt
+
packaged
workloads.
On
top
of
that,
an
example
that
is
here
in
the
link.
B
So,
and
that's
that's-
that's
that's
one
flavor
another
flavor
read
more
like
performance
and
sizing
guides
and
listed
our
three
different
links.
There
one
as
a
reference
architecture,
we've
recently
published
with
supermicro
it's
across
their
family.
The
second
one
is
a
performance
white
paper
would
in
which
we've
collaborated
with
Cisco's
UCS
team
and
the
third
one
is
one
from
scalable
informatics.
Another
performance
white
paper.
So
all
of
those
read
more
like
performance
and
sizing
guides
I'm.
So
right
now,
as
you
is
just
FYI,
our
team
is
currently
focused
on
on
the
bottom
one.
B
So
that's
that's
going
to
be
kind
of
how
we
focus
our
comments
today
is
that's
where
we've
been
focused
with
a
and
building
these
reference.
Architectures
is
more
of
the
latter
we're
working
towards
working
towards
the
former,
the
top
one
there.
In
fact,
we've
begun
to
work
with
the
the
MySQL
Maria
to
be
community
at
the
top
layer
of
the
stack,
but
but
again
weighs
a
lot
of
our
work
has
been
foundational.
That's
its
kind
of
that
kind
of
set,
your
expectation
for
the
type
of
reference
architectures.
B
One
of
our
goals
with
these
these
reference
architectures
in
producing
these
performance
and
sizing
guides
is,
is
to
is
to
create
a
community
asset
such
that.
If
you
know
one
of
your
saying,
yeah
I'm
deploying
a
new
set
cluster,
my
workload
IO
patterns
look
kind
of
like
this.
My
capacity,
oh
it's
going
to
be
about
you
know
a
petabyte
or
two
and
I'd
like
to
run
it
on
all
flash
or
I'd
like
to
run
it
on
is
cheap
and
deep,
as
I
can
go.
I'd
like
to
it.
B
You
know,
I'd
like
to
go
out
and
query
some.
Some
community
repository
to
find
out
who's
done,
who
has
performance,
revolt
performance
results
from
an
architecture
like
I'm
looking
to
build
and
save
me
some
time,
I
can
read
up
on
some
empirical
data
in
some
type
of
structured
form,
so
kind
of,
let's
that's
what
we're
working
towards.
If
we
have
time
at
the
end,
we'll
I'll
paste
up
there,
we
have
right
now
it's
it's!
It's
simple.
B
It's
just
a
massive
spreadsheet
that
we
have
currently
working
on
we're
close
to
30
different
results
from
30
different
configurations,
and
we
have
them
all
lined
up
side
by
side
as
we
can.
We
can
look
down
a
row
and
see
how
various
performance
metrics
varies
with
different
configuration.
So
it's
quite
it's
quite
insightful.
It's
it's
fascinating!
In
fact,
so
that's
we'd,
like
two
mature
that
so,
instead
of
having
to
be
a
spreadsheet
jockey,
it's
it's
a
little
bit
more
user
friendly.
But
that's
it's
a
anyway!
B
So
that's
a
plug
out
if
anyone's
got
great
ideas
funneled
through
Patrick
to
us,
we're
very
interested
in
in
than
that.
Okay,
so,
back
to
the
back
to
then
talking,
there
are
considerations
for
these
reference,
architectures,
okay,
so
performance
and
sizing
guys
where
most
of
our
work
is.
That
here
are
three
links
illustrating
some
recent
work.
B
Okay,
so
here
are
some
design
considerations
that,
as
we
structure
these
reference,
architectures
that
we
focus
on
so
I'm
going
to
I'm
going
to
focus
on
a
couple
of
these,
then
in
particular
again
it'll
be
conversation
between
Kyle
myself.
You
probably
heard
Kyle
on
this
menu
before
senior
senior
storage,
architect
from
not
only
the
ink
tank
days,
but
the
dreamhost
days.
B
Kyle
column
cover
five
and
six
year
along
the
way.
I
asked
some
other
things.
Okay,
so
here
here,
just
six.
By
no
means
is
this
an
exhaustive
list,
but
here's
some
here's
some
useful
design
considerations
that
we
we
speak
about
in
constructing
these
reference,
architectures
and
expectedly,
as
we
speak
about
in
in
taking
from
the
top
and
architecting
sep
solutions.
So
these
items
this
is
meant
to
re.
You
know
you
know
after
having
these
conversations
with,
you
know,
internally
or
with
whatever,
whoever
the
solution.
Architects
are
to
understand
these
six
things.
B
If we get time at the end, and it's certainly in the slides, you'll have them in the download, we'll talk through a couple of performance graphs. Okay, so number one, the first design consideration: qualify the need for scale-out storage. We're not going to spend a lot of time on that here; it's something that you do all the time, every day.
B
So anyway, that's just an interesting thing. The first design consideration is to qualify the need for scale-out storage, because, as you all know as architects, frankly, depending on the size of the workload and whether it's, say, an Oracle database and whatnot, scale-out may not in fact be the right solution. But moving on past that, okay, the second design consideration: designing for the workload IO. And this kind of sets up the way that we approach the reference architectures.
B
We
have
a
four
by
three
matrix,
it's
kind
of
a
small,
medium
and
large,
with
different
I
old
patterns
which
you'll
see
in
a
minute
here.
So
this
kind
of
sets
sets
that
up
as
we
look
at
the
different
considerations
of
the
various
workloads,
so
the
the
first,
you
know
the
overall
that
the
topmost
governing
factor
we
see
it
is
okay.
People
designing
these
sep
solutions
is,
is
it
are
they
performance,
oriented
or
cheap
and
deep
oriented
you
cheapen?
Deep?
B
Being
you
know,
their
their
overriding
interest
is
cost
capacity,
cost
per
capacity
cost
Barack
density,
cost
per
watt,
cost
per
thermal
unit,
cost
cost,
costing
them
cheap
and
deep
order.
They
have
some
type
of
performance,
overriding
performance
objective.
Of
course
our
cost
is
always
a
factor
but
its
if
it's
an
overriding
performance
objective
again.
That's
that
we
test
different
in
a
reference
architectures.
We
try
to
go
through
different
configurations
that
will
be
optimal
for
these
different
types
of
work
load
patterns
and
then
descending
within
performance.
B
Oh,
you
know
it
might
be
an
image,
a
lot
of
JPEG
images.
It
may
be
video
audio,
but
in
my
large
block
you
know
typically
with
the
Iowa
pattern,
typifies
by
large
block
I,
oh
and
in
fact
large,
but
going
down
the
list
sequential
large
block
by
0,
vs,
I,
ops,
intensive,
obviously
tip
if
I'd
by
I
mean
the
poster
child,
I,
ops,
intensive,
workflow,
being
4k,
random,
I/o,
so
small,
and
so
all
these
considerations
again.
We
try
to
in
the
reference
architecture
work.
We
try
to
say.
B
Okay
on
this
slide
here
in
this
4
by
3
matrix,
the
Rose
become
academies
are
very
coarse-grained,
generalized
work,
letta
workload,
I/o
categories,
if
I,
ops,
optimized
workloads,
throughput
optimized
workloads
and
cost
capacity.
Optimized
workloads,
obviously
cognizant,
that
within
within
a
single
cluster,
you
might
have
different
pools
carved
out
for
different
workloads.
But
this
this
helps
us
again
to
identify
reference
architectures,
which
are
optimal
for
these
different
types
of
workload,
io
categories
and
then
back
up
to
this
thing.
B
Here,
some
of
the
things
that
become
interesting
down
below
or
things
like
the
read/write
mix,
as
we
benchmark
things
like,
like
erasure,
coded
pools
versus
replicated,
pools
the
as
you'll,
see
the
in
the
results:
the
for
instance
right,
the
right
performance,
right,
performance
of
a
ratio
in
a
ratio,
coda
pool
versus
replicated,
pool
it's
a
different
ratio
than
read
performance
and
for
all
the
logical
reasons
that
you
know.
That
makes
sense
when
you
think
about
what's
going
on
on
the
covers
there.
B
And when we look at performance optimized, of course, as we discuss with our partners, and we work with network partners, server partners and media partners, we discuss this with them as we're looking at performance-optimized reference architectures. The benchmarking for that would say, okay, our goal here is, by definition, performance optimized; we're shooting for the highest performance...
B
...this pool can yield, again whether that's megabytes per second for throughput oriented, or IOPS for IOPS oriented. But clearly not in a vacuum; cost always matters. So we also get list pricing information for the configurations so that we can begin to do some relative comparisons, and most vendors aren't all that...
B
They
don't
like,
of
course,
to
to
have
a
lot
of
lists
of
their
absolute
pricing.
Information
bandied
about
for
obvious
reasons,
so
we
convert
that
into
two
relative
comparisons
and
again,
if
we
have
time
at
the
end,
we
have
some
relative
comparisons
that
reflect
lowest
cost
per
performance
unit.
So
yeah
sure
you
can,
you
know
a
configuration.
Myeeeeh
might
yield
the
highest
performance,
but
it
sealants
its
liquid
nitrogen
cooled
and
it
costs.
You
know:
100
million
dollars,
okay,
that
it's
nice
that
it's
highest
performance,
but
it's
not
it's
not
attainable
by
by
mere
mortals.
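A minimal sketch of that kind of relative comparison; the configuration names and all numbers below are made-up placeholders for illustration, not figures from any published study:

```python
# Illustrative only: compare configurations by cost per performance unit
# rather than by absolute performance. All figures below are hypothetical.
configs = {
    #                               (relative list price, MB/s per server)
    "12-bay + 1 SSD journal":       (1.0, 1100),
    "36-bay + 2 NVMe journals":     (2.2, 2600),
    "72-bay, co-located journals":  (3.0, 2900),
}

baseline = None
for name, (rel_price, mbps) in configs.items():
    price_per_mbps = rel_price / mbps
    baseline = baseline or price_per_mbps       # first config is the baseline
    print(f"{name:30s} relative $/MBps = {price_per_mbps / baseline:.2f}x")
```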
B
So
we
look
at
some
cop
X
some
capex
elements
as
well.
As
you
know,
one
of
the
things
that
Kyle
drills
and
over
and
over
again
is
from
his
experience
on
the
operations
side
of
the
fences.
Hey.
You
know,
power
and
cooling,
always
matter
so
yeah.
You
might
get
a
cheap,
cheap,
capex
solution,
but
if
you
know
you're
burning
through
the
the
Los
Angeles
power
grid,
because
if
you're
sucking
so
many
watts
and
an
air
conditioning,
then
it's
it's
still
too
expensive.
B
So
we
try
to
add
the
reference
architectures
paid
and
pay
consideration
that
well
and
then
finally,
is
the
meets.
The
minimum
server
fault,
the
main
recommendation
and
we're
going
to
get
into
that
and
that's
a
favorite
topic
of
Kyle.
So
he's
going
to
talk
to
that
one
we
get
to
get
down
farther
below,
ok
and
then
cost
capacity,
optimized
set
otherwise
cheap
and
deep.
Of
course,
it's
it's.
The
governing
attribute
is
lowest
cost
per
terabyte,
but
again
it's
it's.
It's
also
a
capex
and
opex
game.
B
You know, if the opex cost is more oriented towards floor space, in-building floor space, then that's an important element here. And we relax the minimum server fault domain recommendation a little bit for cost/capacity-optimized clusters, again with the belief that they're less performance sensitive. Okay. So with those considerations, the third of six design considerations we look at is the obvious one: storage access method.
B
It's
kind
of
hard
to
see
that,
but,
for
instance,
obviously
ratos
block
device
is
supported
with
a
replicated
data
protection
scheme,
only
so
choice
of
storage
access
method
if,
for
instance,
if
the
store
Jack's,
but
if
the
storage
access
method
required
by
the
workload
is,
for
instance,
block
that
that
immediately
constrains
the
data
protection
game
used
to
a
replicated
pool
which
then
obviously
drives
the
it
constrains.
The
permutations
of
a
server
network
in
particular
server
emedia
architectures
targeted.
B
So
we
know
that's
that's
kind
of
the
third
design
consideration,
then
the
fourth
one
identifying
capacity
and
identifying
capacity
at
face
level-
one
might
say:
well,
you
know
what
does
that
really
have
to
do
with
the
with
the
reference
architecture?
Well,
it's
it
comes
back
to
the
previous
slide
about
or
the
previous
conversation
about
fault.
The
main
considerations
you
know.
Clearly,
if
you
are,
if,
if
you
know,
if
you're
architecting
a
or
you're
designing
a
reference
architecture
for
a
half
a
petabyte
solution,
then
you
know
it's
probably
not
going
to
be.
B
It's
probably
not
going
to
use
big
old
n72
of
a
servers
from
a
fault
domain
perspective,
so
identifying
the
capacity
has
significant
ramifications
into
the
fault
domain,
and
so
that's
actually
a
good
transition.
So
Kyle
as
we
look
at
fault
domain,
risk
tolerance
and
clear
that
we've
turned
this
as
risk
tolerance.
Because
some
you
know
it's
it's
it's
a
choice.
It's
a
subjective
choice
by
the
architect
here,
based
on
the
environment.
So
Kyle
talk
to
us
a
little
bit
about
the
this
question
about
okay.
C
Sure, absolutely. So with scale-out storage systems, typically you're transitioning from the mindset where, instead of trying to make sure that a singular host, or a pair of hosts, are highly fault tolerant within themselves, you instead want the cluster software to provide the fault tolerance.
C
You know, the rule of thumb that we provide is that you don't want to lose more than a tenth of the cluster with a single node failure, because not only are you going to lose, you know, ten percent of the capacity in the case of a node failure, but you're also going to have the additional workload of having to recover from that failure. So when you start to get into clusters that are smaller than 10 nodes, you can see how this can be very problematic, right?
C
So if you have the absolute minimum three-node cluster and one of those hosts fails, you lose one third of your capacity, and for the remaining two hosts, not only is there less aggregate cluster bandwidth available, but the remaining bandwidth also has to deal with the recovery of that failed host. So, based on testing we've done, we really like to steer customers towards larger node counts.
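A small, purely arithmetic illustration of the ten-percent rule of thumb Kyle describes:

```python
# Rule of thumb from the talk: a single node failure should not take out
# more than ~10% of the cluster's capacity or aggregate bandwidth.
def failure_impact(num_nodes: int) -> float:
    """Fraction of cluster capacity/bandwidth lost when one node fails."""
    return 1.0 / num_nodes

for nodes in (3, 5, 10, 20):
    impact = failure_impact(nodes)
    verdict = "meets" if impact <= 0.10 else "violates"
    print(f"{nodes:2d} nodes: lose {impact:.0%} on a node failure ({verdict} the 10% guideline)")
```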
B
Another question for Kyle before I leave this slide. One of the things I know we've talked about a lot before is also the cluster's reserve capacity in terms of terabytes. There's a certain amount of reserve that you should always have just for good, normal operations, but then let's say that you have a three-node cluster: how does that reserve capacity need to grow when you have a smaller cluster? Talk a little bit about that.
C
Right, so if you want to be able to have the cluster fail in place and recover from a host failure, because you don't want to have to send someone to the data center to try to repair a host and bring it back up, operationally it's better if you can just let the software recover from the failure.
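A rough sketch of the reserve-capacity math implied here: to fail in place, the surviving nodes need enough free space to absorb a failed node's data on top of normal operating headroom. The headroom figure below is an illustrative assumption, not a Ceph default:

```python
# Sketch: how full the cluster can safely be while still able to re-replicate
# a failed node's data onto the survivors without filling them up.
def max_usable_fraction(num_nodes: int, headroom: float = 0.15) -> float:
    """Fraction of raw capacity you can fill and still recover in place."""
    survivors = num_nodes - 1
    # After recovery all data lives on the survivors; keep `headroom` free.
    return (1.0 - headroom) * survivors / num_nodes

for nodes in (3, 5, 10, 20):
    print(f"{nodes:2d} nodes: keep utilization under {max_usable_fraction(nodes):.0%}")
```

The smaller the cluster, the larger the reserve has to be, which is the point Brent is driving at.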
B
Excellent, thanks. So you can see why this matters as we identify the target architectures on which to base the reference architectures. I mean, the classic conversation when people are new to the concept of Ceph, not after they've been around for a while, but when they're new: they read about these big old 72-bay servers and they read about the cost efficiency, and, man, that's exactly what I'm going to do with this software-defined storage. I love this stuff, I can choose any platform.
B
I
want
I'm
just
going
to
go,
get
all
72
base
servers,
but
but
then,
as
kyle's
explained
here
yeah
you
you
better
have
a
cluster
of
a
of
a
you
know
a
petabyte
or
two,
because
you
know
when
you,
when
you
cram
it,
you
know.
Let's
say
you
got
a
72
base
server
with
the
sixth
terabyte
drives.
You
know
six
times
six
times,
seven
you're
looking
at
close
to
a
half
a
petabyte
of
raw
capacity
in
a
single
chassis.
B
So
yet
this
this
this
becomes
a
significant
consideration
as
we
as
we
identify
different
architectures
deco.
That's
number!
Five.
Now
number
six
and
I'm
going
to
t
this
up
for
Kyle's
as
well
here
so
data
protection
schemes
I've.
One
of
the
statements
we
made
at
the
bottom
here
is
one
of
the
biggest
choice
is
affecting
the
TCO
in
the
entire
solution.
B
So
talk
to
us
and
then
that's.
Obviously
it's
meant
to
be
a
blatant
attention.
Getter
here,
you've
all
evolved
in
part
of
conversations
with
with
the
with
management
or
with
sales.
People
may
sell
yeah.
That's
that's
just
a
detail.
You
know,
don't
don't
let's
not
trouble
with
that
detail
then!
Well,
it's
it's!
Actually.
You
know
what
he
wanted.
You
want
to
spend
half
as
much
or
twice
as
much,
obviously,
because
the
the
quantity
of
of
media
that
you
need
is
heavily
impacted
by
choice.
C
Sure. So the basics: with replication you're just making copies of each data object, and in Ceph, because it's based on RADOS and everything is stored as an object internally, for each object there are going to be n replicas. Typically people are using 3x replication, and so, as such, you have one third usable-to-raw capacity. Erasure coding, on the other hand, uses math to generate parity, such that...
C
...you take an object that would be written into the cluster and you divide it into a number of chunks; then you also generate an additional number of parity chunks. Both of these are configurable, and then all of these different chunks, the ones split from the original object and the parity chunks, are distributed across the cluster according to the CRUSH mapping.
C
And
so
in
this
way,
if
you
lose
one
of
those
chunks,
those
chunks
can
be
reconstituted
from
from
the
parity
or
in
the
case
of
loss
of
parity
bits,
they
can
be
recalculated
from
the
original
chunks.
So
this
is.
This
is
very
similar
to
like
what
a
traditional
raid
array
uses
internally,
except
instead
of
striping.
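A deliberately simplified illustration of the chunk-and-parity idea: Ceph's erasure-code plugins use Reed-Solomon-style codes that support multiple parity chunks, but the single-XOR-parity case below (k data chunks plus one parity chunk) shows the reconstruction mechanics:

```python
# Split an "object" into k chunks, add one XOR parity chunk, lose a chunk,
# and rebuild it from the survivors. Real Ceph EC profiles support m > 1
# parity chunks; this is just the simplest possible case for illustration.
from functools import reduce

def split(obj: bytes, k: int) -> list[bytes]:
    size = -(-len(obj) // k)                      # ceil division
    return [obj[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]

def xor_parity(chunks: list[bytes]) -> bytes:
    return bytes(reduce(lambda a, b: [x ^ y for x, y in zip(a, b)], chunks))

obj = b"an object written into the cluster"
data = split(obj, k=4)
parity = xor_parity(data)

lost = 2                                          # pretend chunk 2's OSD died
survivors = [c for i, c in enumerate(data) if i != lost] + [parity]
rebuilt = xor_parity(survivors)                   # XOR of the rest recovers it
assert rebuilt == data[lost]
print("recovered chunk:", rebuilt)
```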
C
In
some
cases,
though,
the
data
protection
scheme
is
going
to
be
predicated
on.
The
way
that
you
are
accessing
said
said,
data,
so
in
the
case
of
block
storage,
for
example,
replication
is,
is
the
only
mode
that
is
supported,
and
that's
because
you
know,
as
since
the
block
since
a
block
device
has
striped
across
many
objects.
B
Yeah, thanks Kyle, good stuff. So those six design considerations then yield a four-by-three matrix like this. This is how we go into these reference architectures: with one of these matrices empty, we say we want to work through a variety of different permutations of server, media and network to identify optimized configurations, and again, this is one of the reasons why we listed the criteria for how we define optimized.
B
Clearly,
every
environment
is
some
mix
of
these
different
types
of
workloads,
and
but
at
least
it
provides
a
way
to
frame
the
conversation
to
say:
okay,
here's
a
particular
configuration.
That's
that's
that's
for
luck,
for
instance,
for
larger
cluster
slices,
which
would
frequently
tends
towards
higher
higher
density
servers,
and
it's
particular
configuration
with
network
and
media
lends
itself
well
towards
either
throughput
or
I
ops
or
cost
capacity.
Optimization
one
of
the
things
just
one
of
the
things
to
note
here
for
cost
capacity,
optimized
configs
and
how
we
approach
the
bench
marking
for
these
reference.
Architectures.
B
For
that,
for
that
row
for
cost
capacity
optimized,
we
stay
true
to
the
criteria,
which
is,
as
we
discussed
above
cost
capacity
optimized.
So,
for
example,
we
use
eraser
erasure
coding
for
the
cost,
capacitive
the
cheap
and
deep,
so
classic
use
case
of
object
archive
so
for
those
architectures
that
we
benchmark
their
erasure
coded
and
they
don't
use
flash
write
journals,
because
that
adds
a
significant
element
and
so
the
cost.
Again.
That's
that's!
That's
the
objective
for
that
row.
The
cost
drops
dramatically
between
throughput
optimized
clusters
and
cost
capacity.
B
...optimized clusters. I mean, back to the last thing that Kyle covered, the data protection scheme: when you have, for instance, seventy-three percent usable-to-raw capacity versus thirty-three percent, then if you need a petabyte of usable storage, instead of buying three petabytes of raw in order to get a petabyte usable, you're buying more like 1.4 petabytes. So already that's a tremendous difference in the cost, and then, on top of that, you eliminate the dedicated SSD write journals and just co-locate your journals on your spinners. So anyway, we've stayed true: the configurations that we benchmark for cost/capacity in those reference architectures are true to the objectives in that fashion.
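The arithmetic behind those numbers, with an assumed 8+3 erasure-code profile standing in for the roughly seventy-three percent usable-to-raw figure (the actual k+m used in the study isn't stated here):

```python
# Worked version of the usable-to-raw comparison mentioned above.
# 3x replication stores every byte three times; an erasure-coded pool with
# k data chunks and m coding chunks stores (k+m)/k bytes per usable byte.
def raw_needed(usable_pb, *, replicas=None, k=None, m=None):
    if replicas is not None:
        return usable_pb * replicas
    return usable_pb * (k + m) / k

usable = 1.0  # one petabyte of usable storage
print(f"3x replication : {raw_needed(usable, replicas=3):.2f} PB raw "
      f"({1/3:.0%} usable-to-raw)")
print(f"EC 8+3 profile : {raw_needed(usable, k=8, m=3):.2f} PB raw "
      f"({8/11:.0%} usable-to-raw)")
```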
So this is the 4 by 3 matrix. The one on the left we've labeled OpenStack starter. A lot of folks...
B
...you know, as they do initial pilots and proofs of concept with OpenStack, might be using relatively small amounts of capacity: a Glance image store, a little bit of persistent Cinder devices and whatnot, so that's what that one is for. But for the other ones we've said, okay, small starts at half a petabyte, and medium, look at that typo there under medium, is that one terabyte? It's meant to be one petabyte; somebody dropped a letter there. And then large. Okay.
B
Actually, we work quite a bit with Intel on this as well. So part of it, of course, is the process: we sit down and say, okay, this is what we're trying to populate, and we want to identify, because of course there are a hundred different permutations that we could test, a handful of configurations that theoretically should be optimal. Then we test them and publish the results.
B
So
what
you're
going
to
see
on
the
the
next
couple
of
slides
are
extracts
from
I
recently
published
SEF
on
supermicro
reference
architecture,
as
as
noted
before,
of
course,
you
know
we're
red
hat.
We
work
on
lots
of
different
platforms.
This
one
just
happened
to
be
the
most
comprehensive
ones
who
choose
have
chosen
extracts
from
that
one,
and
it's
got.
These
results
are
based
on
lab,
benchmark
results
from
a
bunch
of
different
configurations.
B
Okay,
so
this
one
this
first
one
here
again
with
the
full
document.
The
links
are
in
this
slide
decks.
You
can
read
the
full
document
at
your
leisure,
it's
it's!
Oh
it's
a
little
over
40
pages,
long,
so
lots
of
graphs.
We
haven't.
We
haven't
extracted
all
the
graphs,
but
we've
extracted
a
few
okay,
so
we'll
just
we'll
just
kind
of
give
an
overview
of
a
few
of
them
just
to
kind
of
give
you
a
sense
for
for
how
to
quickly
read
them.
B
Okays
first
is
okay,
so
the
the
axes
of
this
one
here,
the
obviously
the
x-axis-
is
the
object
size
fed
into
the
load
test
utility
so
ranging
from
for
K
through
64,
too
close
to
a
magnitude,
24
meg
and
then
the
the
y
axis
is
megabytes
per
second
and
at
the
top
it's
either
megabytes.
You
know.
Obviously
we
have
graphs
that
megabytes
per
second
aggregate
for
the
entire
cluster,
but
then
we
in
order
to
have
a
little
bit
of
a
normalization,
so
we
can
have
a
better
comparative.
Then
we
break
that
down
okay.
B
So
if
you,
if
you
divide
the
overall
cluster
throughput
by
how
much
throughput
on
average
a
server
is
producing
that's
this
one
is
per
server
and
then
we
further
normalize
it
down
to
OSD.
So
we
can
get
a
comparative
measure
because
clusters
are
different.
We
benchmark
clusters
of
different
sizes,
we
benchmark
chassis,
Zand,
different
sizes
of
you
know:
different
quantities
of
os
DS
per
node,
and
but
the
least
common
denominator
for
making
a
direct
comparison
courses
is
a
amount
of
performance,
/
OSD.
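The normalization Brent describes, as a tiny sketch with placeholder numbers:

```python
# Take aggregate cluster throughput and express it per server and per OSD
# so that differently sized test clusters can be compared directly.
def normalize(aggregate_mbps: float, servers: int, osds_per_server: int):
    per_server = aggregate_mbps / servers
    per_osd = per_server / osds_per_server
    return per_server, per_osd

# e.g. a hypothetical 4-node test cluster with 12 OSDs per node
per_server, per_osd = normalize(aggregate_mbps=4400, servers=4, osds_per_server=12)
print(f"per server: {per_server:.0f} MB/s, per OSD: {per_osd:.1f} MB/s")
```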
B
So
in
the
title
that
will
indicate
what
we're
looking
at
here,
this
one
happens
to
be
a
3x
replicated
pool
it's
it's
using
Rados
bench,
so
it's
coming
through
a
deliberate
O's
level,
not
going
through.
In
this
case.
This
study
is
not
going
through
ratos
block
device
or
rgw
and
then
the
lines,
one
of
the
lines
how
to
interpret
that
the
shorthand
notation
airlines.
B
Generally
speaking,
the
first
number
is
quantity
of
hard
drives:
Oh
s,
DS
per
server,
chassis
or
server,
I
should
say
the,
and
so
the
top
line
is
12.
That's
12,
0
s,
DS
plus
1,
is
plus
one
dedicated
flash
right
journal.
So,
for
instance,
going
down
to
the
green
line.
36
plus
2
would
be
30,
60
s,
DS
per
server,
plus
two
dedicated
flash
write
journals
and
in
the
reference
architecture.
B
You
know
that
it
goes
into
more
detail
like
those
to
happen
to
be
PCI
mdme,
flash,
not
standard,
SAS,
SATA,
SSD
and
then
up
the
obvious.
The
networking
is
10,
gig,
plus
10
gig.
That's
you
know
front
and
back
facing
10
gig
networks.
So
we
followed
our
nomenclature
through
all
of
these
graphs,
except
for
one
exception.
B
0 plus 2. You'd say, well, that doesn't make any sense: how could I have zero OSDs and only write journals? Well, it kind of didn't stay true to the nomenclature, and we've since modified the nomenclature to make more sense, but what it was is two NVMe flash devices running OSDs with co-located journals. So it's two OSDs. That's the light blue one, the 0 plus 2; that's how you understand that one, and everything else follows the standard. Okay. So look at the lines.
B
So
what
observations?
Okay?
So
one
might
observe
that
okay,
the
good
old
tried-and-true
12
plus
you
know
12
Bay
server
is
the
touch.
Is
the
turquoise
line?
It's
it's
is
what's
yielding
per
per
server
under
the
the
10
gig
network
saturation
point.
Would
you
know
to
what
about
eleven
hundred
megabytes
per
second
somewhere
in
there?
B
So
just
under
the
network
saturation
point,
and
then
you
see
on
top
of
that,
a
couple
of
them
per
server
hitting
that
the
network
saturation
point,
then
you
begin
to
see
the
ones
above
that
that
are
going
beyond
the
10
gig
network
saturation
point
because
they're
on
40
gig
happen.
In
this
case
they
were
using
mellanox
connect,
X
3
40
gig
cards
and
that's
in
the
reference
architectures.
You
read.
Okay,
so
you
say
wow.
B
You
know
that
that's
pretty
sweet
that
the
pink
line,
60
plus
12
on
40
gig,
wow
I'm
getting
I'm
getting
like
what
is
that
you
know
two
and
a
half
three
times
more
work
done
per
server
and
than
I
am
on
the
12
bay.
Okay,
cool
now
check
this
out.
If
you
look
at
the
same
graph,
but
instead
of
per
server,
if
you
look
at
how
much
work
you're
getting
done,
how
much
value
are
you
getting
out
of
an
individual
OSD.
B
38
something
like
that,
okay,
so
the
two
points
to
take
away
from
the
slide.
The
first
is:
it's
really
interesting:
it's
really
useful
to
look
not
only
at
the
performance
per
server,
but
also
look
at
performance
per
individual
OSD,
because
that's
it's
the
greatest
unit
of
cost.
Actually,
let's
that's
point
one
useful
to
look
at
both
point
to
is
hey.
I
mean
this.
This
is
benchmarking
reference
architecture,
it's
it's
always
we're
always
learning
new
things
and
having
new
configuration.
B
So
as
we
we
spent
a
lot
of
time
actually
with
a
variety
of
different
performance
teams,
including
mark
who
I
think
gave
last
month's
talk
on
here.
You
know
Kyle
and
I
chatting
with
Mark
about
what
what
is
what
is
the
throttle?
Why
is
why
is
the
pink
line
not
able
to
get
as
much
work
done
per
LSD
as
more
scarce
server
so
consider
this
is
a
snapshot
in
time.
I
expect
we'll
probably
make
progress
in
figuring
this
out.
B
Okay, so then, of course, you can look at the writes. Okay, so here's a softball one for Kyle. If you look at this, and we stayed with per OSD here, again we're looking at a variety of different configurations, you can see why we do reference architectures, coming back to the original simple question: when people say, hey, I love this, Ceph can run on anything, what do you recommend? Well, of course, in this world, as architects, you can never make a blanket recommendation.
B
We
can
provide
a
benchmarking
data
to
help
make
a
decision
stone.
That's,
of
course
you
can
see.
It's
hey.
These
benchmarks
produce
a
wide
range,
particularly
in
this
case,
if
you
have
large
block
iOS,
there's
a
pretty
big
spread
here,
though
architecture
matters,
so
so
here's
a
softball
on
Kyle,
so
the
previous
slide,
so
just
looking
at
the
12
Bay.
B
So
if
I
go
up,
one
slide
for
sequential
read
throughput
at
seven,
the
12
babe
the
individual
drive
was
was
producing
75
megabytes
per
second
throughput
with
the
largest
block
side,
but
that
drops
all
the
way
from
75
down
to
around
25
with
sequential
writes.
Why
are
we
only
getting
about
a
third
of
the
write
throughput
per
cluster?
In
this
case?
It's
normalized
bro
SD.
Why
is
that
calm,
I.
B
And then talk us through the next slide as well, where we shift from, the previous graphs were 3x replicated, and we shift in this case to erasure coding. So talk to us about that: here you can see that the spread is from about 6 megabytes per second per drive to around 25 megabytes per second, and that spread then shifts up to from about 10 to around 40. Why has the spread shifted up when we use erasure coding instead of 3x replication?
C
I mean, so when you're using replication and you're writing to disk, in the case of 3x replication one copy is being written at the primary and then the primary is streaming the two replicas to the secondaries. So coming out the back end, on the cluster network, you have a 2x amplification of traffic, and you also have a 3x amplification of the actual data being written to platters. With erasure coding...
C
Not
only
are
you
sending
less
network
over
or
nest
less
data
over
the
network,
but
you're
writing
less
data
to
planners.
So
between
those
two
you
know
you
don't
you're
not
going
to
be.
It's
unlikely
that
you're
going
to
be
bombed
by
quite
a
back-end
network
and
also
the
the
total
amount
of
data
that's
being
written
to
disk
is
less
so
you
know
throughput
will
be
higher.
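A back-of-the-envelope version of that comparison; the 4+2 erasure-code profile is an assumption for illustration, and the network figure for erasure coding is only a rough estimate:

```python
# Bytes crossing the back-end (cluster) network and bytes landing on platters
# for one client write, under 3x replication versus an erasure-coded pool.
def write_amplification(client_bytes, *, replicas=None, k=None, m=None):
    if replicas is not None:
        disk = client_bytes * replicas            # every replica hits a platter
        network = client_bytes * (replicas - 1)   # primary streams to secondaries
    else:
        disk = client_bytes * (k + m) / k         # data chunks plus parity chunks
        network = disk - client_bytes / k         # all but the primary's own chunk (rough)
    return network, disk

for label, kwargs in [("3x replication", dict(replicas=3)), ("EC 4+2", dict(k=4, m=2))]:
    net, disk = write_amplification(1.0, **kwargs)
    print(f"{label:15s} back-end network {net:.2f}x, disk writes {disk:.2f}x")
```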
B
There
are
some
other
things
we
bring
in
relative
price
performance
and
then,
in
the
end,
this
this
was
the
for
that
particular
study.
This
was
this
was
that
four
by
three
matrix,
so
specific
model
numbers
for
that
one
super
micro.
Actually
they
produce
separate
thefts
queues
for
these
configurations
and,
and
you
can
kind
of
see
how
that
works.
But
and
then
the
reference
architecture
goes
into
additional
subsystem
guidelines,
like
you
know:
CPU
memory,
server,
chassis,
size
and
whatnot.
A
You know, if we run over a few minutes or whatever, I'm sure people can stick around. So if anybody has questions, now is the time; go ahead and type them in the chat or come off mute and ask a question. While we're waiting for people to type their questions in, Brent, one thing that I did want to share was the reference architecture slash evolved Ceph-brag stuff that we're working on.
B
Yeah, yeah, so to answer Patrick's question, when I put up this slide right here: wouldn't it be cool, I mean, let's say we're all architects on the phone of some type, wouldn't it be cool to have like a hundred different permutations, where you could say what happens when I change this parameter, or change that parameter, and then be able to see...
B
Just
I
mean
because
in
10
seconds
you
can
look
at
this
graph
and
you
can
kind
of
take
in
you
know
how
changing
different
parts
of
the
architecture
affects
performance.
So
if
the
community
were
to
contribute
to
this
benchmark
results
library
we
would
we
would
go,
we
would
accelerate
the
quantity
of
comparables
dramatically
and
that
that
would
be.
That
would
be
graphing.
We
think
that
would
be
really
helpful,
be
interested
in
feedback
on
that.
So
I
think
that
was
what
Patrick
was
was
referring
to
not
sure
that
it
exists
up
there.
B
In
fact,
when
people
were
speaking
with
says,
not
sure
exists
out
there
for
any
storage,
let
alone
stuff
it'll
be
cool.
Now
the
thing
is
going
to
say
on:
that
is
one
of
the
things
you
can
say.
Well,
gee,
you
know
it's
you
go
back
to
the
good
thing
about
statistics
is
you
can
make
it
say
anything.
You
want
kind
of
thing
well,
one
of
the
reasons,
one
of
the
ways
we're
trying
to
standardize
things
by
which
we
work
with
Mark
Nelson
for
to
have
him
open
source,
the
Ceph
benchmarking
tool.
A
And that's free for anyone to use. The little piece that I wanted to share on the community side of this is that we are working on building this community collector. Those of you that have been around for a while know that we used to have an idea called Ceph-brag, where you'd be able to submit your cluster details and statistics in an anonymized way, so that we could start seeing which clusters were out there and what things are available.
A
We're
doing
a
little
bit
of
a
bit
shift
on
that
and
allowing
people
to
submit
performance
results
and
cluster
makeup,
and
things
like
that,
and
eventually
we're
hoping
that
that
will
be
in
a
interactive
format
on
metrics,
f,
calm
in
such
a
way
that
people
can
start
playing
with.
You
know
if
I
want
a
throughput,
optimized
cluster,
with
options
x,
y&z,
they
can
see
what
other
people
have
done
and
how
they
can
best
get
to
their
end
point,
perhaps
with
specific
Hardware
keeping
so
it'll
be
fun
to
see.
A
In
the
meantime,
we've
had
a
question
that
came
in.
Are
you
only
using
randos
bench
for
performance
evaluation?
Perhaps
you
could
talk
a
little
bit
about
the
the
tooling
and
you
mentioned
march
and
CDT
and
some
standardization
stuff,
but
tell
people
a
little
bit
about
what
you're
using
and
how
you
got.
Two
numbers.
C
Right, so this was kind of our first foray into doing extended testing with a partner, and we wanted to test at the lowest level first, to kind of establish baselines, especially so that we understand throughput, because it also pertains to being able to calculate, you know, mean time to recovery and such. And so yes, this first analysis focused only on RADOS bench. Most recently, and it's not published yet, we've been doing extensive testing...
C
The
follow-up
testing
to
this
has
been
doing
a
lot
of
fio
testing
on
storage,
both
using
the
Lombardi
lib,
rbd
and
Chen
through
fio,
and
you
know
a
fio
with
the
aio
engine
inside
of
kayvyun
camus,
virtual
machines
that
have
rbd
block
storage
devices
attached
to.
B
Yeah, thanks Kyle. The only thing I'll add to Kyle's comments there is that, on top of that, we've worked with members of the MySQL and MariaDB community, actually a gentleman from Percona. CBT is a test harness; you plug different load test utilities into the harness, and, as Kyle mentioned, within CBT we've been using RADOS bench and different flavors of fio for measuring random IO, but this person from Percona has also added sysbench.
B
So
we
can
drive
this
bench,
mysql
work,
load,
testing
from
cbt
and-
and
so
that's
part
of
the
unpublished
study
that
Kyle's,
mentioning
they're,
so
low
test,
ratos
bench,
various
flavors
of
fi,
0,
dis
bench
and
intel
has
also
been
integrating
cause
bench
into
cbt
with
we
right
now.
It's
we
haven't
had
the
band
wit
to
do
any
study,
cbt
based
studies,
they're
driven
studies
with
cause,
but
that's
another
load
test
utility.
That's
ingrained
in
the
cbt.
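For anyone who wants to reproduce the lowest-level numbers themselves, here is a minimal sketch of scripting a RADOS bench run from Python. It assumes a pool named "testpool" already exists and the rados CLI is installed (check the flags against your installed version), and it does none of the orchestration or result collection that CBT provides:

```python
# Minimal sketch: run a RADOS bench write test and keep the raw output
# for later side-by-side comparison. Assumes the "testpool" pool exists.
import subprocess, datetime, pathlib

def rados_bench(pool: str, seconds: int, mode: str, block_size: int) -> str:
    cmd = ["rados", "bench", "-p", pool, str(seconds), mode,
           "-b", str(block_size), "--no-cleanup"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    out = rados_bench("testpool", seconds=60, mode="write", block_size=4 * 1024 * 1024)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    pathlib.Path(f"radosbench-write-{stamp}.log").write_text(out)
    print(out.splitlines()[-1])   # summary lines appear at the end of the output
```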
B
So I'll respond to that one, and Kyle can add anything there. We have some recommendations or guidelines per subsystem on that; let me scroll down a couple of slides. Things like, for instance, CPU: we did quite a bit of testing with dual socket versus single socket and, frankly, for throughput-optimized and cost/capacity-optimized clusters, we didn't see a huge difference from that second socket being filled.
B
So there are some considerations like that in there, individual subsystem guidelines. There are a few things in there that I think Kyle called by a very technical term, Kyle's bag of tricks, in terms of just some of his favorite kernel tunables and Ceph tunables and whatnot. That's also, like everything, something that changes with every release as we learn, but we put a few of those things in there as well. Kyle, anything to add there?
C
I mean, I think the biggest thing that we found is that using tuned and the performance profile tends to do very well for Ceph workloads, at least for the benchmarks that we were running here. And then probably the most important thing is that, when you have these machines with a lot of different devices generating interrupts, make sure that those are being spread evenly across your processors.
A
Not seeing any other questions, so thank you, Brent and Kyle, this was another great talk. I'll throw it up on YouTube here once it's done, but otherwise we'll see you guys next month. Follow along on the Ceph Tech Talks page on ceph.com. We don't have a talk slated for October yet, we're still looking for something, but definitely keep your eye on November; it will not be on the fourth Thursday, as it usually is.
A
It
will
be
on
a
tuesday,
I'm
17th,
there's
going
to
be
a
talk
about
the
postgres
sequel
on
set
under
mesa,
said
Aurora
with
dr.,
so
some
all
kinds
of
good
stuff
crammed
into
that
one's
the
container,
mojo,
some
stuff,
mojo
and
somehow
database
workloads.
So
it
ought
to
be
a
good
one.
So
if
nothing
else
we'll
see
you
guys
in
October
and
then
again
in
november
thanks
everybody
for
coming.