From YouTube: Sudarshan Ramachandran -- When 10GbE is Not Enough
So the unique thing about Mellanox is that we are completely end to end: everything from the network adapters to the switches to the cables, and the silicon in all three, at speeds of 10G, 40G, and now the new 25G, 50G, and 100G, and we're shipping all of these products today. So completely end to end, from 10G to 100G, network adapters and everything in between; some of these speeds may not be IEEE-ratified yet, but we're shipping them all today.
So the markets that we have traditionally played in are high-performance computing; that's where most of our energy focuses, trying to maximize throughput, reduce latency, and do all the good things that then work well in different industries. In high-performance computing, low latency and high bandwidth are extremely important. And the nice thing was that things like cloud and Web 2.0 all needed similar characteristics: they needed to fit more VMs on their servers, so they needed fatter pipes.
They needed lower latency so that they could move VMs around very fast. They needed to present storage at low latency. The high-frequency traders wanted low-latency cards and switches to do algo trading. Databases, things like the Oracle appliances (Exadata, Exalogic), ERP systems, Teradata: all these sorts of enterprise-grade appliances use Mellanox at the back end, but you've probably never heard of us. And underpinning all of that has always been storage.
So the key parameters, I guess, that Ceph users are trying to achieve are high throughput and high IOPS, and those are things we achieve with high bandwidth and low latency. That's how we try to facilitate those two parameters, and these sorts of technologies have been proven, like I said, in the high-performance computing industry, and that's what we're trying to bring to Ceph.
We try to simplify your infrastructure and make it more resilient by providing fatter pipes, so that you don't see network-related issues, and we try to free up the CPUs that you purchased to do other work, rather than network work, by offloading it onto our NICs. And we've got some other offloads coming up as well that I'll talk about.
So this is a simple comparison. Don't ask me whether it's this CPU and that hard drive, but the principle is basically that 1G is out of the question; 10G has benefits to latency, IOPS, and throughput; but 40G clearly has additional benefits. One thing this is showing is that Ceph will eat up all the bandwidth you give it and will perform better. I guess that's one of the key things.
So you might say, "I don't need 40G," but you may not realize what's happening at the back end: Ceph is kind of eating whatever you can give it. So really what we're seeing here is about two and a half times the throughput with 40G over 10G, and fifteen percent higher IOPS, probably just using hard disks in this particular case, and that's a pretty good improvement. And any time you buy a Mellanox 40G switch or a 40G NIC, you actually get 56G for free, so 56G instead of 40G.
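As a back-of-envelope check (these line rates are standard figures, not from the talk): $10\,\mathrm{Gb/s} \approx 1.25\,\mathrm{GB/s}$, $40\,\mathrm{Gb/s} \approx 5\,\mathrm{GB/s}$, and $56\,\mathrm{Gb/s} \approx 7\,\mathrm{GB/s}$, so the raw link ratio is $4\times$. Seeing only about $2.5\times$ in Ceph throughput is consistent with the spinning disks, rather than the wire, becoming the next bottleneck, which matches the "probably just hard disks" caveat above.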
You can go look it up, but the main point is making it simpler: fewer cables, fewer network cards, less switching, to achieve greater performance. In this case it's basically showing that the three lines at the top are 40G, and they have perhaps approximately two times the read throughput of the 10G infrastructure and half the latency. And again, you don't have to go bonding multiple 10G links and doing all that sort of stuff.
I won't talk about this too much, but you've already heard from SanDisk. Basically, what we're doing there is that the 10G network card, sorry, the 40G network card that's in these boxes comes from us; you can choose your own switch, but hopefully you choose a Mellanox switch, and I'll give you some reasons why you might want to do that. And there are some performance figures there on 10G vs. 40G.
So until pretty much recently, we've basically had three basic switches in the offering. I won't go into them in too much detail, but I want to focus on the small one. Firstly, you'll notice very low power consumption, less than 100 watts for any of these switches, and these are capable of running 56G at full line rate with all the ports running at all times, without dropping a packet. That's because the silicon's switching capacity has the ability to handle all the switching required at all times.
So I said "until recently" because now we've just released the 100G switches. We were part of the consortium that brought 25G to the market as well, so now you're going to see more and more 25G NICs, which we see as a more cost-effective way to build out a network than 10G. You'll see 50G NICs; again, not yet IEEE-ratified, but that will be the new 40G, and these divide very nicely into a 100G top-of-rack switch.
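The "divides very nicely" remark is just lane arithmetic (standard lane counts, not spelled out in the talk): $100\mathrm{G} = 4 \times 25\mathrm{G}$, $50\mathrm{G} = 2 \times 25\mathrm{G}$, and $40\mathrm{G} = 4 \times 10\mathrm{G}$, so one 100G top-of-rack port can break out into four 25G server links (or two 50G), just as a 40G port breaks out into four 10G links.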
So the other important thing with storage and latency is having very consistent latency, and we have a very deterministic latency across all packet sizes, extremely low compared to the competitor, which is the Trident II silicon that goes into pretty much every other switch, whether it's Extreme or Arista or Brocade. So what you can see here is how the latency varies depending on packet size and depending on how much your switch is being loaded.
So what we have here is one of our best sellers, which is our 12-port 40G switch; I usually carry one in my laptop bag. This 12-port 40G switch is what's shown up here. It's not a blade enclosure or anything; it's a rack-mountable switch. You put two side by side in one U, and what you achieve is an HA solution like this in just one U, and it's 40G ready.
In other cases, if you have just 10G servers, you can use it as a 40G aggregation layer, or you can uplink from it at 40G. And what we're seeing more and more is that networking for storage is becoming the server guys' domain; it's not the IT guys' any more. So the server guy is really defining what he needs: low latency, high bandwidth.
It doesn't need a lot of features; it just needs to have cut-through performance, and so it's becoming tightly coupled with the storage. And again, low latency, low power: two switches doing 40G at 100 watts together, where you typically see about 600 watts with some competitors, which means if you're in a colo site, over a year you're going to save a lot of money as well as space.
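A rough worked example (the electricity price is an assumption, not from the talk): the difference is about $500\,\mathrm{W}$, so over a year that is $0.5\,\mathrm{kW} \times 8760\,\mathrm{h} \approx 4380\,\mathrm{kWh}$, which at, say, \$0.10 per kWh is on the order of \$440 per year for the pair, before the cooling overhead a colo facility typically bills on top.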
So let's assume each top of rack has two of those switches. Then you create another layer on top for the aggregation layer, where each of these racks could have a combination of Ceph and compute servers, all connected in a non-blocking network with HA at the root level. And this can scale up to a certain size, depending on your blocking ratio and things, in layer two, and going out even further.
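A quick sketch of what the blocking ratio means (the port split is an illustrative assumption, not from the talk): per leaf switch, $\text{oversubscription} = \text{downlink bandwidth to servers} / \text{uplink bandwidth to the aggregation layer}$. With a 12-port 40G leaf, 6 ports down and 6 up gives $240\mathrm{G}/240\mathrm{G} = 1{:}1$, i.e. non-blocking; 8 down and 4 up gives $320\mathrm{G}/160\mathrm{G} = 2{:}1$ oversubscribed.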
So the next thing is: how do we reduce latency even further? That's through RDMA, and our RDMA technology. It's an implementation that has been used in high-performance computing for a very long time. The whole idea is: how do you move data around without the CPU being involved, and without doing it over TCP, which is a very fat protocol that takes a lot of CPU and slows you down? So basically, what RDMA does is move data from server A to server B, from the memory of server A to the memory of server B directly, without talking to the CPU and the kernel. Traditionally, the application buffer in server A talks to the kernel buffer, which goes to the hardware, across TCP, and back the other way on the far side. So with RDMA,
you simply talk directly to the hardware from the software, from the application buffer, not over TCP, and then you go directly to the other side. Right now we have beta code already in Hammer, so you might see 2x to 3x performance. And by the way, RDMA exists in everything from our lowest-end 10G NIC to our highest-end 100G NIC; it's just always there.
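To make the kernel-bypass idea concrete, here is a minimal sketch in C using the RDMA verbs API (libibverbs). It is not the Ceph Hammer code; it only shows the registration step that lets the NIC read and write an application buffer directly, which is what removes the kernel and the CPU from the data path. It assumes libibverbs is installed and at least one RDMA-capable NIC is present.

/* Minimal libibverbs sketch: register an application buffer with the NIC.
 * Build with: gcc rdma_reg.c -libverbs
 * Assumption: libibverbs installed and one RDMA-capable device present. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num_devices = 0;
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices || num_devices == 0) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devices[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);   /* protection domain */

    size_t len = 4096;
    void *buf = malloc(len);                  /* application buffer */

    /* Register the buffer so the NIC can DMA into and out of it directly,
     * bypassing the kernel socket buffers entirely. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* A real transfer would now create a queue pair, exchange the rkey and
     * buffer address with the peer, and post RDMA_WRITE/RDMA_READ work
     * requests; that part is omitted here. */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devices);
    free(buf);
    return 0;
}

The remote side hands out its rkey and buffer address once, and after that reads and writes land in its memory without its CPU touching the packets; that is the mechanism behind the 2x to 3x figures mentioned above.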
So if you are interested in this, do speak up and voice your needs and your requests to the community, and hopefully that'll sort of speed up the process. I've got a short video now. In this first example, using 40G with normal TCP and then 40G with RDMA turned on, your IOPS increase by about forty-four percent, depending on the number of cores, because you're using less CPU. In this example, in the first set, you're using fewer cores and you're achieving more IOPS. In the next example.
No. So, on the next slide: RoCE is now a standard, and we've implemented it in our NICs. On the switch side, you just need to turn on data center bridging and, I think, priority flow control, which third-party switches can do; it's kind of like lossless traffic. That's all it is. Yes, correct, yeah.
So yeah, that's my summary. I guess the point was: 10G's not enough. Don't go putting in multiple 10Gs either to solve the problem; take the step up to forty, or fifty-six as in our case, which you get for free, and increase your throughput and IOPS. And, as I said, 100G has begun as well, so we're now shipping 100G NICs, cables, and switches. We can be found in appliances like the SanDisk appliances, or you can build your own.
We have reference architectures, and the other thing to mention is that the new ConnectX-4 range of cards will also have a range of erasure coding offloads. I think we're implementing two of the four methods, as far as I understand, and so when that gets switched on, all those calculations and algorithms will be offloaded to the NIC as well. So thank you.
Any questions? So, is RoCE the one that runs over Ethernet? Yes. So, RoCE: RDMA is a term used in InfiniBand in general, but what we do is bring RDMA to Ethernet, and that's RDMA over Converged Ethernet, which is known as RoCE. So as long as all that RDMA code works, it doesn't matter whether it's InfiniBand or Ethernet; it's the same thing.