Description
High-performance networks that can now reach 100Gb/s, along with advanced protocols like RDMA, are making Ceph a mainstream enterprise storage contender. Ceph has gained major traction for low-end applications, but with a little extra focus on the network it can easily compete with the big enterprise storage players. Based on technologies originally developed for the High Performance Computing (HPC) industry, very fast networks and the Remote Direct Memory Access (RDMA) protocol are now moving to the enterprise.
All right, so I'm going to talk about how to improve the performance of Ceph using high-performance networks and protocols. I'm going to focus on four areas. First of all, just the wires: making the network faster. Secondly, the architecture: architecting the network to be able to utilize those faster wires. Thirdly, flash storage and how to get the most performance out of flash SSDs on a network. And finally, the RDMA protocol.
We're pretty good at this. Last year we had, according to the analysts, more than 85% of the market for Ethernet adapters over 10 gigabits, and that's the part of the market that's projected over the next three years to grow to be even bigger than 10 gig. That's where all the growth is. So we know a little bit about high-performance networking, and I'm going to start off with the wires.
This is some testing we've done, and there's actually a white paper on our website on exactly how we did it and all the configurations. I think they're handing out these white papers, or some flyers for them, at our booth over in the marketplace, if you want to stop by. What we did was test starting with one gig all the way up to forty gig, and you can see, of course, major differences between one and ten gig, but also even going from 10 to 40 gig.
25 gig is kind of just what the doctor ordered. I think there's a saying now: 25 is the new 10. In fact, Cisco's switches now do 10 and 25 gig on the same switch, and they don't charge a premium for it, and adapter prices at 25 gig are very, very close to, if not the same as, 10 gig. So it's not a break-the-bank thing, and it's much easier than putting in two 10 gig links.
So, with these faster wires, we also need to look at the architecture of the network in Ceph. There are two logical networks, which can actually be two physical networks: the public network, where the clients are connected, and the cluster network, which just connects the OSDs. A lot of performance is needed between the OSDs, because there's a lot of traffic there: replication, recovery, and rebalancing, as well as the heartbeat.
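As a concrete illustration, here is a minimal ceph.conf sketch separating the two networks; the subnets are placeholders, not values from the talk:

```ini
[global]
# Client-facing traffic: clients <-> MONs/OSDs
public network  = 192.168.1.0/24
# OSD <-> OSD traffic only: replication, recovery,
# rebalancing, and heartbeats
cluster network = 10.10.10.0/24
```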
There's a lot of traffic on that cluster network, as I mentioned, for replication and erasure coding, and I'm going to show you what I mean by that. A read operation is pretty simple: the client goes to the OSD and reads the data, so there's no extra cluster traffic in that scenario. But look at what happens on a write.
Assuming you're doing 3x replication, two more writes occur on that cluster network; or, if you have it all on one network, for every write you're actually tripling the amount of data that crosses the network. So by segmenting it, you're going to see an improvement on the client side in a loaded system.
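A quick back-of-the-envelope sketch of that amplification (the write size is arbitrary):

```python
# Bytes on the wire for one client write under 3x replication:
# the client sends S over the public network to the primary OSD,
# which forwards two more copies of S over the cluster network.
def wire_traffic(write_bytes, replicas=3):
    public = write_bytes                     # client -> primary OSD
    cluster = (replicas - 1) * write_bytes   # primary -> replica OSDs
    return public, cluster

pub, clu = wire_traffic(4 * 1024**2)         # a 4 MiB write
print(pub, clu)  # 4 MiB public + 8 MiB cluster: 3x total on one combined network
```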
And if you look at recovery under replication, you see a large impact. For example, here are some different network speeds and the time it takes to recover from an OSD loss, at 2 terabyte, 20 terabyte, and 200 terabyte OSD sizes. With 10 Gigabit Ethernet and 2 terabytes, it's going to take you half an hour to recover from a lost OSD, and that's at full bandwidth: the entire 10 gigabits is being used to reach that time.
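The arithmetic behind those times is just capacity over bandwidth; a sketch, assuming recovery can use the whole link as the speaker describes:

```python
def recovery_hours(osd_bytes, link_gbps):
    # Ideal case: the entire link is dedicated to recovery traffic.
    bytes_per_sec = link_gbps * 1e9 / 8
    return osd_bytes / bytes_per_sec / 3600

for tb in (2, 20, 200):
    times = [recovery_hours(tb * 1e12, g) for g in (10, 25, 100)]
    print(f"{tb:>3} TB OSD:", [f"{t:.1f} h" for t in times])
# 2 TB over 10 GbE comes out to ~0.4 h, i.e. the half hour on the slide.
```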
But if you go to 25 gig, just 25 gig, recovering even the largest OSD drops to under a day, and you can get it done in just a few hours with a hundred gig. So you really want to fence off that client network from the cluster network, and that's going to give you a lot of performance and a lot more reliability.
This is a little half-width 1U switch that we sell, but there are multiple vendors of switches at these speeds as well. At 40 gig, this product would cost you just over five thousand dollars for 16 ports. So, five thousand dollars to decrease your recovery time hugely and improve the overall reliability. And if you wanted to put in a hundred gig, you could do it for under ten thousand dollars.
Let's now look at erasure coding. In this case, instead of making copies, we're using a special algorithm that takes the data and breaks it up into small pieces across multiple OSDs. The advantage here is that instead of needing 3x the storage, you only need one and a half times the storage. But there's a price for it, and another reason to have that separate cluster network: there's a lot more traffic that goes on, in small messages.
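The capacity math, as a quick sketch: with k data chunks and m coding chunks, the raw-capacity multiplier is (k + m) / k, so a k=4, m=2 profile gives the 1.5x the talk mentions:

```python
def ec_overhead(k, m):
    # k data chunks + m coding chunks stored for every k chunks of data
    return (k + m) / k

print(ec_overhead(4, 2))  # 1.5x raw capacity vs. 3.0x for 3-way replication
print(ec_overhead(8, 3))  # ~1.38x, but each object now touches 11 OSDs,
                          # i.e. more small messages on the cluster network
```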
The read operation, though, goes the other way, because in a read with plain replication you're only asking for the read from the client and getting back the data, but here you've got to decode data that's spread across the different OSDs. So it's going to create more traffic on the cluster part of the network; a good reason to have it independent.
I don't have a slide for this, because it's rather complicated, but when you have an error and have to recover, think about the recovery mechanism and the traffic. You not only have to decode the data from the OSDs that remain, but then you have to re-encode it across to the replacement OSD, or if there are fewer OSDs, you have to redo the calculation. So that can be very heavy traffic on the cluster network as well.
The other downside to using erasure coding is that it's a very, very heavy load on the CPU, because the CPU has to do that calculation, and it's a very complicated calculation over all the data, in order to put it into the format needed for erasure coding. One way to get around that is NICs that have offload engines for erasure coding. The way this works, at least with our solution, is that the coding is done by the adapter, the Ethernet adapter, as the data is sent to it.
So here's how we're going to implement it. There's a module in Ceph that does the erasure coding, and what we're doing, and it's in development right now, is creating a replacement for that module that simply uses this offload. So it's just a matter of switching that module. By the way, if anybody has questions, don't hesitate to interrupt; I should have said that earlier.
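For context, the coder in Ceph is already selected per erasure-code profile, so swapping in an offload-backed module amounts to choosing a different plugin. A sketch using today's default software coder (an offload plugin would be named by the vendor; none is named in the talk):

```
# Create a k=4, m=2 profile using the default software coder
ceph osd erasure-code-profile set ecdemo k=4 m=2 plugin=jerasure
# Create an erasure-coded pool that uses it
ceph osd pool create ecpool 128 128 erasure ecdemo
```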
So the next area I want to talk about is flash. I think everybody realizes that flash storage and SSDs are becoming super popular across data centers. In fact, many data centers are going all flash because, believe it or not, it solves the problem of reliability: disk drives have all these moving parts, and although flash was initially supposed to wear out and have all these problems, it hasn't turned out that way, because of the software, the load balancing and all the special software on flash that distributes the load across all the NAND chips.
And so many data centers are just saying: hey, I'm going all flash, because the maintenance on my hard drives is very expensive; and secondly, they're doing it because you don't have to worry about matching up the storage to the performance of the applications anymore, because all your storage is fast. So anyway, flash is becoming super popular, but that comes with a price, because it puts a big load on the network, and here's why.
Flash is roughly a hundred times faster than a hard drive, and the persistent memory that's going to come out over the next five years is another hundred times faster, so there's a ten-thousand-times improvement happening in storage in a ten-year period. To understand the magnitude of that, think about how far it is from here to St. Louis: Google says it's about a thousand miles, an 18-hour drive. Now, hopefully a lot of you are from Boston; think about the distance from here to Boston College. It's roughly 10 miles, 15 minutes according to Google.
Fifteen minutes versus 18 hours: that's the magnitude of the difference in performance just between hard drives and SSDs. If you add persistent memory in here, that's like 1,500 feet; so, 15 minutes to Boston College, versus how long it takes you to walk 1,500 feet. This is putting a huge change, or load, on different components in the whole ecosystem, especially the storage side and the networking side, and here's why.
If you look at this chart, it shows how many hard drives it takes to fill a 10 gig link, which is the red line; a 40 gig link, which is the yellow line; and a hundred gig, which is the blue line. You can see it takes almost 25 drives to fill a 10 gig link, and hundreds to fill the other wires.
If I just switch that Serial ATA interface from a hard drive to an SSD, going from spinning media to NAND, now it's just 2 drives to fill a 10 gig link, and 9 will almost fill a 40 gig link. Now, there's a new technology for SSDs called NVMe, which was a redesign of the interface to SSDs, because they initially came out with the legacy hard drive interfaces, which were slowing them down.
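The chart's arithmetic, sketched with assumed per-drive sequential throughputs (~50 MB/s per HDD to match the 25-drives-per-10-gig figure; the SSD numbers are rough assumptions, not from the talk):

```python
import math

DRIVE_MBPS = {"HDD": 50, "SATA SSD": 550, "NVMe SSD": 2500}

def drives_to_fill(link_gbps, drive_mbps):
    link_mbps = link_gbps * 1000 / 8        # link capacity in MB/s
    return math.ceil(link_mbps / drive_mbps)

for name, mbps in DRIVE_MBPS.items():
    counts = [drives_to_fill(g, mbps) for g in (10, 40, 100)]
    print(f"{name:>9}: {counts}")           # drives needed for 10/40/100 GbE
# HDD: [25, 100, 250] -- SATA SSD: [3, 10, 23] -- NVMe SSD: [1, 2, 5]
```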
This is some testing that's been done by many companies. It's a little bit old now, from a couple of years ago. They tested Ceph systems while changing the disks, the SSDs, the networks, the CPUs, and the operating systems, and you can access it on the internet, or you can go to our booth and they can show you how to get to it. What you can see is the difference that just adding some SSDs provides to the performance, when you also improve the performance of the network.
You can see on the far one here, if I can get the pointer to work: in this one you had a mixture of SSDs and NVMe, and then here, taking the hard drives out, you increase the performance just using the NVMe SSDs. And you can see, if you add a lot of flash, the more flash you add, the faster the performance you get; so it's important to increase the performance of your network as well.
The last subject I'm going to talk about for improving performance is a protocol called RDMA. So, we've talked about how to improve performance by having faster wires, by re-architecting the infrastructure of the network, and by taking advantage of SSD performance, but now we're going to talk about a new protocol, or a change of the protocol running on those wires. And this isn't a new technology; it's been in the market for many years, but in the HPC world, the world of supercomputers.
It started out almost 20 years ago now, and it's embedded in InfiniBand technology, the protocol that runs over InfiniBand wires, and it's now a dominant part of the HPC market. In fact, if you look at the top 500 supercomputers in the world, you can see, it's the navy blue line, that it's the dominant interconnect for those supercomputers. Supercomputers these days aren't the big circular Crays they were when we were kids; they're clusters of servers connected by a network.
If we look at the regular storage market and at the three different areas, object, block, and file, as we talked about, you can improve the performance of those, with their common protocols, just by flash or high-bandwidth Ethernet; but you can also use RDMA technology. It's been used in the Ethernet world for many years, over Ethernet, in a technology called RoCE, or RDMA over Converged Ethernet, but mostly in the segments that needed high performance.
So, for example, if you think of block, there's a protocol called iSCSI for networking block storage, and there's a version of it called iSER, which is iSCSI over RDMA. If you think about file, Microsoft has a technology called CIFS, or SMB, which I think is the new name for it, and there's a version of it called SMB Direct, which works over RDMA for higher performance. NFS over RDMA does the same for the NFS file protocol. And then on object:
The technology we're going to talk about is Ceph over RDMA. Now, I said RDMA was a niche for high performance, but that niche is becoming mainstream, because that very high-performance SSD interface called NVMe that I talked about a few minutes ago is now being put into a protocol called NVMe over Fabrics, so that it can be transmitted across a network. That standard came out almost a year ago now, and it includes what's called a binding layer in the standard for transport over RDMA.
And the new persistent memory technologies, which are again a hundred times faster than SSDs, are coming: there's been work underway for the last year and a half on how to put those across a network, and all of it is focused around the RDMA protocol. So because of that, this RDMA technology is going to become much more mainstream in the data center over the next few years. So what is it, and how does it work?
RDMA stands for Remote Direct Memory Access, so it's a remote version of DMA. DMA is the local version: it's how you can move data inside a computer without sitting in a software loop moving a word at a time. Instead, there's a hardware engine: you give it a pointer to memory here, a pointer to memory there, and a count, kick it off, and it moves the data without the CPU being involved. RDMA is the remote version of that.
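A toy model of that descriptor interface, as a conceptual sketch only; real DMA and RDMA live in hardware and verbs libraries, not Python. This just shows what "two pointers and a count" means:

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    src: int     # pointer to the source buffer
    dst: int     # pointer to the destination buffer
    count: int   # number of bytes to move

def dma_engine(memory: bytearray, desc: Descriptor) -> None:
    # Stand-in for the hardware engine: it moves the bytes itself,
    # so the CPU never sits in a word-at-a-time copy loop.
    memory[desc.dst:desc.dst + desc.count] = memory[desc.src:desc.src + desc.count]

mem = bytearray(b"hello...........")
dma_engine(mem, Descriptor(src=0, dst=8, count=5))
print(mem)  # bytearray(b'hello...hello...')
```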
So what happens is, you tell the RNIC, the adapter that supports RDMA, the local memory location, the remote memory location, and the count, and it moves the data without any interaction from the CPU. That's different from how normal traffic goes through a TCP/IP stack, because there the CPU is involved and is controlling the transport layer, which is the layer that takes care of making sure the data gets there, recovering from errors, those sorts of things. With RDMA, that's all handled in the hardware of the RNIC.
Actually, one more thing on RDMA over Ethernet. Initially, when it was in those niches, the earlier implementations of RDMA over Ethernet required a technology that came from Fibre Channel over Ethernet, called PFC, or Priority Flow Control, to be implemented on the switches. That gave it the flow control it needed to keep it from overloading the network, because it delivers so much data. You can still use that with RDMA over Ethernet, and it does improve the performance, but it takes some configuration.
Okay, now that transport layer has been improved, with no change to the protocol, so it's not a change in the RoCE protocol; it's just a better implementation of the embedded transport layer on the adapters. The need for Priority Flow Control has gone away, and so now you no longer have to have special settings on your network to run RDMA over Ethernet.
So this is using a software-defined storage application like Ceph with RDMA, and that's in production today. If we look at RDMA for Ceph, this has been an ongoing project with multiple companies involved: our company, a company in China called XSky, Samsung, SanDisk, and Red Hat have been contributing to this effort. It was first released in beta in the Hammer release, back in June, almost two years ago.
One interesting thing about RDMA is that you have these memory locations on two different systems across the network. Well, you have to pin down that memory, and you set up pieces of memory for all the different connections you have. What was implemented recently was a way to do that dynamically, so that you don't have to pin all that memory: just the parts that are needed are pinned, and the rest is dynamic. It's basically an on-demand pinning mechanism. Question? Yeah.
Yeah, so that's another good question, a very good question. I didn't go into the history of RoCE, but there are two different versions: RoCE v1, which initially came out many years ago, and RoCE v2, which came out about four or five years ago. RoCE v1 is kind of obsolete now; I mean, it's not used very much, and most everybody uses RoCE v2 now. The difference is that RoCE v1 was a Layer 2 protocol, and RoCE v2 has an IP/UDP layer.
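A rough sketch of the framing difference (drawn from the RoCE specs, not the talk; RoCE v2 rides UDP destination port 4791):

```
RoCE v1: | Ethernet | IB GRH | IB BTH | payload |            Layer 2 only
RoCE v2: | Ethernet | IP | UDP :4791 | IB BTH | payload |    routable across subnets
```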
So it's routable. Thank you for the question; good question. So, you can go to the Mellanox community pages, and they walk you through how to configure Ceph to use RDMA.
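A hedged sketch of what that configuration looks like, based on the messenger options Ceph's RDMA support uses; the device name is a placeholder for your RNIC:

```ini
[global]
# Switch the async messenger to its RDMA transport
ms_type = async+rdma
# Which RDMA device to use, e.g. a Mellanox ConnectX port
ms_async_rdma_device_name = mlx5_0
```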
And here are some of the performance numbers for that. We measured it in two different ways: one, just raw performance on the same setup, and the other to see the CPU savings. We could save multiple cores on both the client and the OSD and still get 44% more performance.
We have customers using this everywhere from financial to cloud to education and more, and if you want to know more about using our products with Ceph, we have a booth over in the marketplace, and I'm also open to answering some questions. To kind of summarize the benefits: to improve Ceph, one thing to make sure you do is use faster networks.
Ten gigabit is not enough, especially if you've got more than 15 hard drives or you're using SSDs. SSDs will give you higher performance, but make sure you improve the network as well. Having a separate cluster network is a way to get better reliability in your system and to improve the performance. And then, if you want to go even further, into the turbo area of performance, you can look at RDMA for Ceph.
If you take our adapter product line, just to give a little background: ConnectX-3 was our 10/40 gig product that came out probably four-ish years ago; ConnectX-4 was the 10/25/40/50/100 gig product that came out more recently; and then ConnectX-5 is the product that just came out.
Each one of those products has a faster ability to recover from problems that it sees in the network. All of them will recover, but more and more of that transport layer has been improved in the hardware of the adapters going forward. So ConnectX-4 can definitely do it in a RoCE environment without using PFC; you need to implement a function called ECN, or Explicit Congestion Notification. And with ConnectX-5 you don't need anything. But there's one thing to keep in mind.
If you're trying to get super high performance like we're talking about here, you can't just run it on a 10 gig network in the middle; you have to either over-provision or implement some type of congestion management in order to get the highest performance levels. Nothing comes for free. But it does work; you just lose performance. Yes?
Over IB? Oh, I see, okay, so you're just running IP over it. You know, there may be some numbers for that, but I don't have them. Let me suggest you stop by the booth over in the marketplace, because there's a marketing guy there named John Kim who has organized a lot of the testing, and he would know if we've tested that. Okay.
It has to be the RC. Now, I'm not a super InfiniBand expert; my focus is on the storage side, but I have done it in the past. It's the dedicated type of connection over InfiniBand, or over RDMA in general, where you know that your message has been sent and is guaranteed to get there; it does that, and it also handles the plain messaging rate, the UDP part of it. So this is new.