Description
A detailed look at Medallia's approach and implementation of database workloads on top of Ceph. This presentation was run as a part of Ceph's monthly Tech Talk series (http://ceph.com/ceph-tech-talks/)
Associated slides: http://www.slideshare.net/Inktank_Ceph/2016jan28-high-performance-production-databases-on-ceph-57620014
A: So, well, I'm here today to talk about high-performance production databases running on Ceph. I'm an architect at Medallia. For those of you who don't know us: we collect, we analyze, and we display terabytes of structured and unstructured data for our multibillion-dollar clients, and we do all of this in real time. Now, I've been here since 2010, and we've grown from 70 to 700 employees. The reason I'm here today is that we actually run real, live, high-performance databases in production, with customer data, on Ceph. Our journey there has actually been a little bit of a long one. It started about
a year ago. We have an in-house analytics engine which is really high performance. We had a new version of it that was even higher performance and scaled to thousands of servers, and then we took a peek at our production environment, which was hundreds of servers,
not thousands. All these servers had individual names, and all of them were individual, tiny, precious little snowflakes: they were all entirely different. All the services running there had been manually placed somewhere, and so had the servers themselves. It literally was: "there seems to be space in this rack, therefore this is where the server goes; there's no connectivity on this switch, but there's connectivity on a switch over there, so let's have a long cable."
This quickly ended up with all the snowflakes in an environment where, as the mantra goes, "don't touch it." And the most precious things we had were the database servers. They were really "don't touch it": don't upgrade the BIOS, don't upgrade the firmware. If there's a critical bug fix for the storage controller, well, applying it will imply downtime, and there's a possibility that along with the critical bug fix comes a new bug.
So do you really, really need to apply a bug fix that may or may not cause data corruption? It was clear we needed to do something. In fact, we needed to do quite a lot of something. So we said: OK, instead of incrementally going step by step by step, we're going to jump directly into the future. We're going to skip ahead two or three generations and go directly to the next generation. That means we looked at microservices, we looked at containers, we looked at, well, pretty much every industry buzzword out there.
We did set up a proof of concept. Our proof of concept uses 40-gigabit networking end-to-end, all the way down to the server, and it's non-blocking. It's open networking. We use Ceph for the storage layer and Docker for the containerization, and this proof of concept really, really blew us away. We did have to modify Docker and tune the networking in Ceph a little bit to get to the level we wanted, but in the end it was so resilient that it was actually a major problem for us to test the resiliency.
We had to kill so many servers to get to the point where it does not survive that we actually ran out of capacity before the resiliency gave out. And it was extremely performant — in fact, performant enough that it was on par with our current dedicated database servers, and those are purpose-built beasts that run databases as fast as they can. Now, when the performance is at that level and the resilience is at that level, the question becomes: can we run everything we have on this new infrastructure?
Now, to do that, we went and said: OK, we're going to need some design principles here. First of all, we want commodity products: that means we want commodity components, and we want supported open standards. We really like things where we can go look at the source code. While that doesn't always allow us to actually fix issues, it gives us the illusion that we can, it gives us a sense of control, and it gives us a little bit of insight into what a company or provider actually thinks about code quality. We want fully automated provisioning and reinstalls.
We want things that are cheap and scalable. And "cheap" here doesn't mean I want to spend less money; it means I want more compute power, or more storage capacity, or more performance, or more networking for every single dollar that I spend. And we want to be scalable: we really want something where, if we need more capacity, if we need more performance, if we need more of anything, we just add more servers to it.
Lastly, we want immutable servers, and that really goes towards: once a server is up and it's serving traffic of some kind, it shouldn't change. If you've ever done production-level debugging and figured out that somebody had upgraded the JVM, but the JVM that was running when the application was started was a different version — these are not so fun to debug.
We also very much wanted a setup where there are no special machines: no magic appliance sitting in the corner, no little thing sitting over there that everybody knows is essential but nobody really knows what it does. All of those have to go. There shouldn't be a single service that is tied to specific hardware. That means every component must be able to run anywhere, as much as possible.
We run redundancy at the software layer instead of having individual servers with double power supplies, double networking, and more layers of RAID. If you really stop to think about it, these things will inevitably, sooner or later, fail, regardless of how many cables you've purchased for the box. So if you design for failure — accept that it is going to fail sooner or later, design around the failure, and design for self-healing — you end up with something that is ultimately much, much easier to maintain.
The most important principle, though, is: keep it simple. Because if the design for your entire infrastructure is simple, then it's also simple to fix, simple to diagnose, and simple to understand. If somebody new joins your organization, it's simple to explain to them how everything works, so that they will be productive in a very short time, and you don't end up in the case where, when something breaks,
everybody is looking for a consultant to come fix it, because nobody actually understands it, because it is complex. And mostly, simple means that in a short presentation like this one, I can actually explain the major design goals and the major components of how it works, hopefully to a level that someone else can, you know, replicate. Now, we ended up with a standard rack. Our standard rack is 22 compute nodes; the compute nodes are Intel CPUs running Linux, with memory, 40-gigabit networking, and a small SSD.
This SSD is used mostly for the OS and the containers. Our storage nodes are also Linux with an Intel CPU, some memory, the same 40-gigabit networking, a lot more storage, and PCIe NVRAM for journals. The NVRAM for journals is being rolled out right now, actually, as we speak, so by the end of the day, or tomorrow, we should have that up and running — one of the great advantages of Ceph: you can do on-the-fly updates. I love it. On the networking side, we have three switches per rack.
These are also Linux, with Cumulus, and they're also Intel. Granted, it's a lot less powerful a CPU, but it's the same concept — a binary is portable across all of them — it has memory, and it has a lot more networking. In fact, if you log into these switches, they look like any other server that just happens to have 32 network cards. So this is unified.
Now, the challenge, if you're running everything as containers, is really: where do you draw the line? You can run your application in a relocatable container, and in fact that's been done quite a lot. You can run your load balancer in a relocatable container; that has been done, but it's a little bit more challenging. You can run your DNS server in a relocatable container; that actually turns out to be quite challenging. And finally, you can run your database in relocatable containers — and running any database in relocatable
containers is a major problem, because a database sort of requires resilient storage, and it really wants the storage to be the same after a power loss. Also, unlike my application — my application talks to ZooKeeper and has no problem doing dynamic discovery — my database, Postgres, has no concept of what ZooKeeper is. It talks IP addresses.
So there are some challenges we needed to overcome. The first one is networking. Let's follow the life of a web request. A web request comes in and hits the datacenter firewall. The firewall forwards it to the host where it's been told a load balancer is running — in this case nginx. Now, our nginx is modified to go talk to ZooKeeper, and ZooKeeper holds the information "here is the application for this URL," so the request is forwarded there. Now, at some point
In
the
past
application
talk
to
the
zookeeper
with
this
basic
set
up,
which
you
can
just
get
off
the
shelf
almost
anywhere.
Your
application
can
be
relocated
anywhere
because
it
doesn't
really
matter
what
the
IP
address
for
it
is
it's
gonna
talk
to
the
zookeeper
drop
that
itself
when
it
boots.
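In outline, that self-registration is just an ephemeral znode. Here's a minimal sketch using the kazoo Python client — the connection string, znode path, and payload layout are illustrative assumptions, not Medallia's actual scheme:

```python
# Minimal self-registration sketch using the kazoo ZooKeeper client.
# The connection string, znode path, and payload layout are
# illustrative assumptions, not Medallia's actual schema.
import json
import socket

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# An ephemeral node vanishes automatically when this process (or its
# ZooKeeper session) dies, so the load balancer stops routing here.
payload = json.dumps({
    "host": socket.getfqdn(),
    "port": 8080,
}).encode()

zk.create(
    "/services/myapp/instance-",  # sequence=True appends a counter
    payload,
    ephemeral=True,
    sequence=True,
    makepath=True,
)
```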
But what happens when one of those servers dies?
Generally speaking, having automation that updates your datacenter firewall from the network itself is not really something your security group is going to enjoy. And for ZooKeeper: how do you find ZooKeeper if you don't use IP addresses, and what if the ZooKeepers change IP address? Well, ZooKeeper does have quorums — you have five of them — but sooner or later, if you just wait long enough, you will have lost every one of them. So you could say, "I'll find ZooKeeper through DNS."
What we use to do this propagation of IP addresses — which you need in order to figure out where an IP address is located right now — is OSPF. You could also use BGP; we just picked OSPF because it is fantastically easy to set up. OSPF is a link-state database. It's supported by every vendor, and this support by multiple vendors is important to us. Of all the components we looked at for OSPF — all the major storage providers, network routers, and quite a few of the whitebox providers — every single one has a working OSPF implementation, and we were not able to find a pair of vendors whose implementations were incompatible. That means it's very easy for us to switch to another provider if we sour on one. And this gives us fully relocatable IP addresses.
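Mechanically, relocation can be as simple as attaching the service's /32 to an interface on the new host and letting the routing daemon flood it. A minimal sketch, assuming the host runs Quagga (or FRR) configured to redistribute connected routes — the address is made up:

```python
# Sketch: make a service /32 follow its container to a new host.
# Assumes the host's routing daemon (Quagga/FRR with "redistribute
# connected") floods whatever appears on the loopback; the address
# below is made up.
import subprocess

SERVICE_IP = "10.20.30.40/32"  # hypothetical relocatable address

def announce(ip: str) -> None:
    # Attaching the address to lo creates a connected route, which the
    # OSPF daemon injects into the link-state database.
    subprocess.run(["ip", "addr", "add", ip, "dev", "lo"], check=True)

def withdraw(ip: str) -> None:
    # Removing it withdraws the route; fabric-wide convergence is what
    # makes the sub-50 ms relocation mentioned later possible.
    subprocess.run(["ip", "addr", "del", ip, "dev", "lo"], check=True)

if __name__ == "__main__":
    announce(SERVICE_IP)
```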
Now, let's say I'm not talking about a web application; I'm talking about a Postgres database. Let's say we're running along, the application is talking to Postgres, and the server dies. That's unfortunate, but hey, I can relocate the Postgres database instance, and it maintains the same IP address, so the application will be able to connect to it again real soon. There's only one small problem: the storage it was using is still on the host that is now dead.
This is where we went to Ceph, because the problem here is that Docker images are ephemeral. Docker has persistent volumes, which work great on your local machine, and there are a lot of solutions, both proprietary and open source, for Docker volumes to be relocated. But if you want something that is actually full-on high availability — in other words, it will survive not just a voluntary shutdown but a "whoops, the power went out" —
you have to go look at iSCSI, for which you have to go talk to a large storage vendor, which sells you an appliance. And they're very happy, because the large storage vendor makes a lot of money on that appliance. You can talk to the same large storage vendor and say, "I want NFS, because I want something that is a little bit more filesystem-like." Great:
the large storage vendor is now super happy, because this is even more expensive and even less performant. Or you can try pNFS, which seems to be the direction the storage vendor industry wants to go. We tried pNFS: it did work, for a very short amount of time, and then it really didn't work anymore. And most of these proprietary solutions are scale-up; if you want to do scale-out with them, it really ends up being you buying multiple appliances and saying "these filesystems are here, those filesystems are over there."
The really major one for us, though, was the SLA. All of these large vendors offer four-hour on-site hardware support. They will have a tech on site in four hours to tell you that you have a problem. You already knew you had a problem. And if you have a customer on the phone, and this customer is telling you, "hey, I'm paying you a lot of money — where is my data? I can't access my application right now," then four-hour support isn't good enough. You need something that is simple enough, easy enough to
diagnose and repair that your own people in-house can fix it in a matter of minutes, not hours. So, this being the Ceph Tech Talks, I'm not going to go too deep into how Ceph works. The important parts for us are that there's no need to communicate with a metadata service in the hot path, that it truly is a scale-out solution, and that it is a very, very clean design.
There's a white paper on how Ceph works; I recommend you go read it. Reading it was easy enough for us to understand that we could go fix some of the basic problems ourselves, and it gives us the confidence that in the future, if there is a problem, we can actually go fix it — and there's a large enough community that will make sure the problem doesn't appear again. And, you know, if you need more capacity, you just add more servers; if you need more aggregate performance, you just add more servers.
But with this, the storage problem is solved, because if my Postgres host now dies, I just start the server somewhere else, and it's connected to the same replicated cluster — yes, with the same IP address. To the application this just looks like a temporary network glitch, at which point it will have reconnected to the Postgres database and everything is fine again. Now, if you have relocatable infrastructure, you can actually have these things piggyback on each other — because what happens when the server for your Ceph monitor dies?
If the machine hosting the monitor dies, we just start the Ceph monitor somewhere else with the same IP. At that point the monitor is going to come up and conclude that it is very out of date — in fact, it has no data whatsoever — so it will happily sync the data from the other monitors, and then it's up and running. Now, this is not automated.
The foobar potential here is very high, because if you have a split-brain scenario — if the system decides that the server that was running the monitor is dead, but it turns out it actually isn't — you now have multiple monitors with the same identity, and hilarity ensues. So, so far, this is a task where a human being must go in and say:
"Yes, the server is really dead; yes, really start it somewhere else." But it does give us relocatable monitors. There's human intervention, but no human has to go to the data center. And with this setup we are really in the space where, for our servers, if a physical server dies — and it doesn't matter which server it is — or if a physical switch dies — it doesn't matter which switch it is — well, it's a problem to be solved
next week. We just mark the server as down, and once a month we have our hardware vendor come on site with spare parts and fix the servers that are broken. There's no rushing to the data center in the middle of the night to fix a physical server. Now, for provisioning and orchestration, this is a twofold thing. The first part is that we always network-boot our servers — both the storage servers and the compute servers.
We have a small PXELINUX and initramfs, which is actually the distribution installer with a few small modifications and extensions. We use this to handle self-encrypting drives: we have data-at-rest encryption for everything, including the boot drives of all the compute nodes, and the key is never known by the runtime OS. This literally means that when a server powers on, it cannot boot, because the boot drive is encrypted and the server doesn't have the key; it has to network-boot
to get the key. What this gives us, first of all, is that we meet certain data-at-rest encryption requirements. More importantly, it means that if I have a problem with a server, it is actually completely safe to unrack that server and toss it in the trash, because all the data on it is encrypted; there's no way to get it out.
The other things the network-boot environment does: it checks the machine's state. If told to do so, it will go update the firmware, check the BIOS version, check the BIOS config, and make sure the OS is at the right release level; when all of these are done, it will boot the OS. Now, this firmware and BIOS version-and-config checking is actually really, really important.
Doing this gives us completely uniform machines. We don't have any half-installed, half-forgotten state; all of them are always completely identical. For us that means that when we do performance improvements, we do them on one machine, and once that one machine looks good, we say: this is the new golden state, please clone it for everybody else, including firmware settings. For orchestration we use Aurora and Mesos.
Aurora lets you program against the data center like it's a pool of resources, and it's fairly straightforward. Mesos keeps track of your data center; in other words, it keeps track of where you have spare capacity and how much capacity there is. There are a lot of different schedulers, one of which is Aurora, and with Aurora you tell it: I want to run this particular job;
I need this many CPU cores, I need this much memory, and I want this many instances. You can also give it fairly complex rules — much like Ceph's CRUSH rules, you can say: hey, I want three instances, and I want them to be in different racks. For us this is really important, because we stripped our servers down as far as we could, so they run a single power supply and single networking, and we have resilience at the higher levels. So our failure domain is one rack.
So when we run three instances of something, it's important to us that they run in different racks; the sketch below shows roughly what such a job definition looks like.
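Aurora job files are themselves Python, so a rack-spread service looks roughly like this. This is a hedged sketch, not our production config: the names, resources, and the 'rack' attribute are illustrative, and the Mesos agents must advertise that attribute for the limit constraint to bite.

```python
# Sketch of an Aurora job definition (.aurora files are Python that the
# aurora client evaluates, so Process/Task/Resources/Service and GB are
# provided by the DSL, not imported). Names, resources, and the 'rack'
# attribute are illustrative.
pg_task = Task(
    name="postgres",
    processes=[Process(name="postgres", cmdline="./run-postgres.sh")],
    resources=Resources(cpu=8, ram=64 * GB, disk=10 * GB),
)

jobs = [
    Service(
        cluster="our-cluster",
        role="dba",
        environment="prod",
        name="postgres",
        task=pg_task,
        instances=3,
        # 'limit:1' means at most one instance per distinct value of
        # the 'rack' attribute: three instances, three racks.
        constraints={"rack": "limit:1"},
    )
]
```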
We did have to extend Docker a little bit, specifically for the ability to give a container this unique IP and have it use layer-3 routing — there's no layer-2 bridging — and we also modified Docker to directly mount Ceph volumes. Behind the scenes it will do the `rbd map`, and if the image you're asking for doesn't exist — like, in this case, if there is no RBD image for "demo" — it will create one.
Since the images are thinly provisioned, our default size for everything is 10 terabytes. That may come back and haunt me at some point. It will then run mkfs and mount it if needed; if the filesystem is already there, it's just mounted, with the discard option. And in this case we're telling it specifically that this is a Ceph volume and that we want it read-write.
And this is all it takes to actually get a fully relocatable thing: I can run this, do some modifications inside this demo volume, shut down the server, and run it somewhere else. The sketch below shows roughly the steps involved.
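In outline, the volume plumbing our Docker modification performs looks something like the following. This is a minimal sketch — pool, image, size, filesystem, and mountpoint are illustrative, and real code would need more error handling:

```python
# Sketch of what the modified Docker does behind the scenes for a Ceph
# volume: create the thin-provisioned image on first use, map it, make
# a filesystem if none exists, and mount with discard.
import subprocess

def run(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout.strip()

def ensure_mounted(pool: str, image: str, mountpoint: str) -> None:
    if image not in run("rbd", "ls", pool).splitlines():
        # Thin provisioning: 10 TB (here given in MB) costs nothing
        # until blocks are actually written.
        run("rbd", "create", "--size", "10485760", f"{pool}/{image}")
    dev = run("rbd", "map", f"{pool}/{image}")  # prints /dev/rbdN
    if subprocess.run(["blkid", dev]).returncode != 0:
        # blkid exits non-zero when the device has no filesystem yet.
        run("mkfs.ext4", dev)
    run("mount", "-o", "discard", dev, mountpoint)

ensure_mounted("rbd", "demo", "/var/lib/postgresql")
```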
How fast is it? That's where things get interesting. On the networking side, we get about five microseconds of latency with the super-high-tech performance tool called ping. In reality, latency can be lower if you're running full-on RDMA, but five microseconds is small enough that I don't really care, for our application.
A lot of that, unfortunately, is still dominated by single-stream TCP. The performance for single-stream TCP is either 22 or 38 gigabits per second — I'll get back to why in one slide. Multi-stream TCP maxes out at close to 40 gigabits. And we can relocate IP addresses across the entire thing in less than 50 milliseconds; in other words, if an IP address was at one end of the data center and it needs to be routed at the other end, that takes less than 50 milliseconds. For storage, single-stream
I/O is 550 megabytes per second — that's completely limited by the SSDs in these particular storage nodes. With multi-stream I/O — in other words, if you do a lot of I/O across multiple RBD images at the same time — we can get close to 4 gigabytes per second, and we can reattach an image, again, in less than 50 milliseconds. In reality I think the real reattach time is a little bit less, but that was the granularity of the timer in my test.
So why 22 or 38 gigabits? Consider the case where your network card hangs off the PCI Express bus of one CPU and the SAS or SATA controller hangs off the PCI Express bus of the other CPU. For a single request, imagine that all we want to do is copy data from the SATA or SAS controller to the network card, and let's assume the thread that wants to do this work runs on CPU zero. When it wants to talk to the SAS or SATA controller, it has to go over the inter-CPU link
to take control of the necessary resources and do that remote I/O, and every single command or completion crosses that link; once it talks to the NIC, it is really fast, because it's local. This turns out to be a little bit of a performance bottleneck, and you have the exact same problem if your task runs on CPU number one: now talking to the storage controller is really fast, but the NIC is slow.
If you are on lower-speed networking or lower-speed SATA controllers, this bus is not a problem at all. But the second you really, really care about latency, really care about single-stream performance, it is a problem. We have solved that on the storage side by running single-socket: all of the storage nodes run a single socket, which means the SATA controller, the NIC, and everything else is attached to that single CPU. There is no NUMA; there is no bus that is the bottleneck.
This does, of course, limit the number of cores you can have per storage node. Today we run E5-2667 CPUs: that's eight cores that deliver above three gigahertz, which, as it turns out, is complete overkill for an OSD node. Demo time! I do apologize — I had to record this demo in advance; being connected to the VPN and on BlueJeans at the same time turned out to be a little bit of a challenge.
So let me open this. This is my demo. The URL here, by the way, should be live, so you can actually go and look at it, though it's going to show the end state. This is a small application written in Node.js — I do apologize; this is my first Node.js application ever, so there are probably quite a lot of style things that are wrong here — but it connects to a database, does a little bit of a SELECT, and just shows the result on the screen.
Now, if I have a client and I connect to this database, I can insert values into it. Everything's looking good; I can load it up, and yes — it's actually a database. Great. Now, where is this database running? On my Mesos console I can figure out that, OK, it's running on this particular host right now, so I can open another shell into that host. And how do we actually demo that something is resilient?
We just power off the host that was running the database. Now, how does this look to the application? Well, the database is now gone. Where is it? To the client, it looks like somebody rebooted — or restarted — the Postgres server. In reality it has now been relocated to a different host, gets the same IP, connects to the same storage, and our INSERT statements work and the application is still responsive.
So let's talk performance — and specifically real-world performance, because real-world performance and what you can read on data sheets are two very different things, especially when it comes to databases. On the data sheet you will see that an SSD does 100K 4K random-write IOPS. Fantastic. Great. That is absolutely true — if you have a very deep I/O pipeline and you never need to acknowledge the writes. In the real world, databases work differently: they don't have an I/O depth of 64,
they have an I/O depth of 1, because they will usually read an index block, then figure out what's actually in that index block — and if it's a B-tree, it could be several layers of index, too. It's really: read an index block, process the index block, read the next index block, process it. It's this iterative process of "please give me one piece of data, process it, then give me the next piece of data," which means it's effectively a random access pattern with processing in between, so you need a full round trip for each request.
More important, though, are the writes, because in a database, when you write a transaction — especially when you issue this word known as COMMIT — the guarantee to the user is that the data is now persisted on durable storage. In other words, if somebody were to pull the power on absolutely every single thing in the data center right this nanosecond, that transaction will survive and be there when you come back up.
Now, in the real world, a dedicated database server has a lot of buffer cache — and these prices are just taken from Newegg: a one-terabyte entry-level enterprise SSD with supercaps, if you buy 24 of them, that's 15K worth of hardware; 500 gigabytes of DDR4 RAM is 4K. So yes, dedicated database servers have a lot of buffer cache. Now, for us, we have two types of tables: a few-gigabyte tables and a few-terabyte tables, and our application does very, very, very heavy caching.
So there are few read requests, and even for the few read requests that are there — well, a database container (or, for that matter, a dedicated database server, but we'll stick with the database container) has plenty of memory, so most of the indexes and most of the small tables sit completely in buffer cache. So your read performance is dominated by how fast you can read from the buffer cache, which loosely translates to
how fast your memory is — and memory is fast. But if a user actually modifies something, then there's a transaction, which means our bottleneck is around fdatasync — especially when there are multiple users with multiple transactions at the same time, because as long as a transaction is running, it still holds locks. So it's really important that this fdatasync returns quickly, so that the locks the transaction holds can be released and the other transactions can start grabbing them.
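To make that concrete, here is a small sketch of measuring the fdatasync round trip that gates every commit — an 8 KB write followed by fdatasync, loosely approximating a WAL flush. The path and iteration count are illustrative:

```python
# Sketch: measure the fdatasync round trip that gates every COMMIT.
import os
import time

path = "/mnt/volume-under-test/walsim"  # put this on the volume under test
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
block = b"\0" * 8192  # one Postgres-sized page

samples = []
for _ in range(1000):
    os.write(fd, block)
    t0 = time.perf_counter()
    os.fdatasync(fd)  # the durability guarantee lives here
    samples.append(time.perf_counter() - t0)
os.close(fd)

samples.sort()
print(f"median {samples[499] * 1e6:.0f} us, "
      f"p99 {samples[989] * 1e6:.0f} us")
```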
Now, if you look at Ceph, specifically RBD, there are three ways to mount it. You can mount it via FUSE, which is easy and gives you low performance: on a mixed read/write workload with fdatasync, it gives about 640 IOPS. You can use the iSCSI target, which is actually a lot harder to use, and it's slow — its mixed read/write I/O is on par with what FUSE does. And then there's KRBD, the in-kernel RBD, which is actually really easy to use — it's just `rbd map` — and it's a lot faster: in the same mixed read/write test I did for the other two, it ended up at roughly 5,500 IOPS per job. The downside of the kernel client is that it doesn't have the fancy image features: you lose exclusive-locking support and you lose striping.
Hopefully that will be coming in a subsequent kernel release, I should say. Now, a problem in doing realistic testing with fio is that you need something that resembles Postgres. We can, and do, use pgbench. The problem is that the pgbench workload and our real application workload differ quite a lot: our real application workload has a lot of very large transactions with very large objects, where each row is a humongous beast,
whereas pgbench deals in small, well-formed transactions. So what we've done is observe the production I/O pattern and try to tune fio to replicate the same pattern, and whenever something provides good results in fio, we apply it to the real database. We have seen that if we get an increase of 25% in fio, our real-world database transaction rate will also increase by about 25%.
The important things you need to do: first of all, allow the buffer cache. By default, fio and most testing tools bypass the buffer cache in order to test the underlying storage — but hey, in production, and in all real-world scenarios, you have a lot of buffer cache, so leave it on and make the I/O buffered. That's one. Run multiple jobs, and use 8-kilobyte blocks, because that is what the database does.
I would love it if the database used asynchronous I/O, or used the direct kernel interfaces, or had a very deep I/O depth — but it doesn't. So you need to actually make your fio parameters "suck" so that they replicate exactly what the database is doing. That also means issuing fdatasync every hundredth block or so, using very, very large files, and using semi-random access: the reality is that even for read requests, it is often a few subsequent blocks and then it's random again. A sketch of such a job file follows.
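Here is a hedged sketch of that kind of fio run — buffered 8 KB I/O at queue depth 1, fdatasync every hundred writes, large files, multiple jobs. The specific numbers and paths are illustrative, not our tuned values:

```python
# Sketch: drive fio with parameters that mimic the database pattern
# described above. The numbers and paths are illustrative.
import subprocess

JOB = """
[global]
; buffered I/O keeps the page cache in play, as in production
direct=0
ioengine=sync
; the database does 8 KB pages at queue depth 1
bs=8k
iodepth=1
; flush roughly every hundredth write, like commits
fdatasync=100
; files big enough that caches cannot hide the device
size=100g
runtime=300
time_based=1

[pglike]
rw=randrw
numjobs=8
directory=/mnt/volume-under-test
"""

with open("pglike.fio", "w") as f:
    f.write(JOB)

subprocess.run(["fio", "pglike.fio"], check=True)
```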
So, again: Postgres doesn't use asynchronous I/O, so neither does the benchmark. Now, since I know my read caches will cover up most of the reads, I focus mostly on write performance. I have two comparison targets.
One is local software RAID 0 with three Samsung 850 Pros. For the writes here, the latency is fairly low, and the IOPS per job are also fairly low; this is just a sanity check for "how fast can it be if I just do some local hacking, tossing together
some hardware." I also have a proper hardware RAID controller running RAID 6 with 24 drives. An interesting thing there is that this one does a better job on IOPS per job — it does have a battery-backed write cache and so forth — but if we compare the local RAID 0 with the controller-based RAID 6 and look at latency, the 99.99th-percentile mark on the RAID controller is measured in milliseconds — quite a lot of milliseconds. The target, of course, is for 4K RBD to be beating
these. And KRBD does beat the local RAID 0 in number of IOPS. It does not beat the local RAID 6 in number of IOPS, but on latency — which, at the end of the day, is what dominates transactions for us — it is much better than even the dedicated RAID controller. And unlike the dedicated RAID controller, KRBD survives controller failure. Battery-backed caches are fantastic, right up until the controller dies and you have data sitting in a battery-backed cache for a controller that doesn't work anymore. This is when you start finding out
how good your backups are, or you go look at your Postgres slave and figure out whether it was up to date and whether you can fail over. Whereas with KRBD, we just start the same job somewhere else.
So, current challenges for us. First of all, we have a little bit of a challenge around locking. If you run ext4 on RBD, let's assume we have the following scenario: the database is running, and somebody reboots the switch — maybe it lost power, or it's maintenance, or something like that. At this point, a few minutes later, Aurora is going to detect that, hey, this compute node seems to be dead; let me start the job somewhere else. Now your database is started somewhere else;
it's happily relocated, and in its new location it maps the RBD image and mounts the ext4 filesystem. Great — the database is now back up. And then the switch finishes rebooting, and the old job is still running. As soon as Aurora can talk to that node again, it will tell it: whoops, please stop that job, because I already started it elsewhere. But there are going to be a few seconds where it still runs, with the RBD filesystem still mounted, happily writing to it. And if that happens, you get to figure out how to repair a broken ext4 filesystem with a database on top of it. We, thankfully, tested all these failure scenarios before we rolled this out to production, and I strongly recommend you test your failure scenarios in a lab before you roll anything out to production.
The stock `rbd map` doesn't check the lock, but you can do so. So now we try to lock the image, and if we don't get the lock — in other words, somebody else holds it — then we check the status of the image. If there's a watcher on the image, that means somebody else is holding the lock and they're still alive; in other words, whoever just told us to map this image is wrong, because the original job is definitely still running.
We check for the watcher three times, 15 seconds apart, and if we find it, we just abort. If we get all the way through, then for the last 45 seconds we haven't seen a watcher on this image, so we know the original holder of the lock is gone, but its lock is still there. At this point we blacklist the original lock holder, steal the lock, and map. On unmap we remove the lock, and when the node is rebooted, it un-blacklists itself. A sketch of this check-then-steal sequence follows.
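A minimal sketch of that guarded map, shelling out to the rbd and ceph CLIs. The image spec and lock name are illustrative, the lock-list parsing is deliberately simplified (real code should use `--format json`), and "blacklist" reflects the pre-Octopus command naming in use at the time:

```python
# Sketch of the guarded map: never steal a lock while the image still
# has a live watcher.
import subprocess
import time

IMAGE = "rbd/demo"          # illustrative image spec
LOCK_ID = "docker-volume"   # illustrative lock name

def run(*cmd: str) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True)

def guarded_map() -> None:
    if run("rbd", "lock", "add", IMAGE, LOCK_ID).returncode != 0:
        # Someone holds the lock. A live holder keeps a watch on the
        # image header, so check three times, 15 seconds apart.
        for _ in range(3):
            if "watcher=" in run("rbd", "status", IMAGE).stdout:
                raise SystemExit("image in use elsewhere; aborting")
            time.sleep(15)
        # ~45 s with no watcher: the holder is dead. Blacklist it so a
        # zombie host can never write again, then take over the lock.
        locker, _lock, addr = (
            run("rbd", "lock", "ls", IMAGE).stdout.splitlines()[-1].split()[:3]
        )
        run("ceph", "osd", "blacklist", "add", addr)
        run("rbd", "lock", "remove", IMAGE, LOCK_ID, locker)
        run("rbd", "lock", "add", IMAGE, LOCK_ID)
    subprocess.run(["rbd", "map", IMAGE], check=True)

guarded_map()
```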
So this means that if one of these nodes goes away for more than 45 seconds or a minute and its jobs are relocated, then when that node comes back up, it's blacklisted. If you reboot the node, it will un-blacklist itself, because when a node comes back up fresh, it has no jobs and no mappings — it will wait to be told what to do by the masters.
We also need to make this faster. We beat legacy hardware for latency in the 99.9% range, but the problem is the 50% and 90% latency marks: there we're actually beaten, and we don't want any compromise on performance. We are currently rolling out NVRAM for the Ceph journals. It did take a little while to get that out there, because we have a requirement that all storage at rest must be encrypted — and if your RAM is non-volatile, then yes, it does count as storage.
If it survives a complete power loss and comes back in an intact state, then it is storage and it needs to be encrypted. PMC actually modified their firmware to support encryption for us, so we're really happy with that. We do have single-storage-server test results, which actually look awesome.
The hardware is being installed this week — some was installed yesterday, some today. One of the advantages of Ceph: we are doing this while the system is live, just taking down one rack at a time, installing the NVRAM, and bringing it back up. So we should have large-scale testing results in about two weeks.
If you want to try any of this out, you can go to github.com/medallia, where you will find our modifications to both Docker and Aurora. We'll also put our automated provisioning there — the always-network-boot, automatic BIOS-and-firmware handling. That will be there as soon as we can take some things like the SSH keys out of the repository; we just need to move those into a separate repository, and then we're going to put that project out there as open source as well.
If you want an exact replica of what we did — you know, I'll share the slides; you can go look at them — it's the compute nodes, the storage nodes, the networking, this idea that everything is Linux, everything is something you can just shell into, the same tools for everything, and open source as much as possible for all of it.
B: Excellent, thank you. That was a great presentation. For everybody that's interested: I will be unmuting all, so if you would like to ask a question, you can unmute yourself in BlueJeans and ask, or you can just go ahead and type it into the comment box. Any questions? Maybe I should open this up, then.
A: We run the OSPF daemons on the host, so we treat servers and switches as all part of the same OSPF domain. The container specifically does not run OSPF — if it did, that would be a little bit of a security problem if anyone broke into the container — but since it runs on the host, we run Quagga on the host.
B: Sounds like your presentation was feature-complete — everybody got everything they needed. So thank you very much. And remember, folks: next month we'll be back here, on the 25th of February at the same time, to hear about the latest and greatest on CephFS — so, yeah, I'll be sure to be there for that.
All right — and do keep an eye out on the mailing lists and social media and everything; I'll make sure to get out the links to this YouTube recording and to the slides, which I'll post up on SlideShare after he sends them to me. So thank you very much; we really appreciate it. This was great. Thank you.