From YouTube: HybridCluster
A
I was going to introduce now: Luke Marsden.
B
Cool. So I guess this is just a 30,000-foot view of some of the features in ZFS that we use, and a little bit about the architecture we've been working on for some time, which is quite different to the big shared-storage-box approach.
B
Cool, okay, and it works. So, before I dive into how specifically we use the features of ZFS, a little bit on what the shape of the architecture we've been building actually looks like. As I was saying, it's a little different to the approach of a big shared storage box that exports data over NFS or iSCSI; it's more an approach of unifying the compute and the data nodes into the same place.
B
So if these are two nodes in a cluster, for example, then each node has its own built-in storage pool. We've got a container-based virtualization layer that sits on top of the storage pool, and a proxy layer that sits on top of that, handling any incoming requests. This can all run on commodity hardware or on top of cloud infrastructure, and it opens up some interesting possibilities.
B
The approach is really to have a multi-container system that can be split across multiple nodes running in multiple data centers. One of the most important use cases we lean on ZFS features for is effectively backup, and what I mean by that is snapshotting, plus zfs destroy to prune old snapshots.
B
Just to give a little more context first: each one of these containers, which might be running some web application or database, for example, is backed onto its own independent ZFS file system, and that file system is then independently snapshotted and replicated out between the different nodes in the cluster.
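To make the one-filesystem-per-container layout concrete, here is a minimal sketch. The `zfs create` command is real, but the pool name, dataset layout and helper function are illustrative assumptions, not HybridCluster's actual code.

```python
import subprocess

def create_container_fs(pool: str, container: str) -> str:
    """Give a container its own ZFS filesystem so it can be
    snapshotted and replicated independently of its neighbours.

    The pool/containers/<name> layout is a hypothetical convention.
    """
    dataset = f"{pool}/containers/{container}"
    subprocess.run(["zfs", "create", "-p", dataset], check=True)
    return dataset

# e.g. create_container_fs("tank", "customer42-web")
```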
B
So
from
a
sort
of
implementing
backup
perspective,
we
are
using
continuous
snapshots.
So
whenever
a
piece
of
data
changes
on
a
user's
file
system,
we're
taking
a
new
snapshot
very
quickly
and
having
those
snapshots
available
is
a
useful
way
of
allowing
customers
to
roll
back
to
previous
versions,
and-
and
so
that's
a
typical
backup
use
case.
We
also
prune
those
snapshots
so
we're
automatically
taking
the
snapshots
and
we're
automatically
deleting
old
ones.
So
we
keep,
for
example,
the
last
hour's
worth
of
snapshots
down
to
30
second
resolution.
B
Before that, we keep the last day's worth of snapshots at hourly resolution, before that at daily resolution, and so on.
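A minimal sketch of that tiered retention rule, keeping the newest snapshot in each 30-second, hourly or daily bucket depending on age; the bucketing scheme and names are assumptions for illustration, not the actual pruning code.

```python
from datetime import datetime, timedelta

def snapshots_to_keep(stamps: list[datetime], now: datetime) -> set[datetime]:
    """Apply the retention tiers described above: 30-second
    resolution for the last hour, hourly for the last day,
    daily before that. Anything not returned would be pruned
    with `zfs destroy pool/fs@snapname`.
    """
    keep: set[datetime] = set()
    seen_buckets: set[tuple] = set()
    for ts in sorted(stamps, reverse=True):  # newest snapshot wins its bucket
        age = now - ts
        if age <= timedelta(hours=1):
            bucket = ("30s", int(ts.timestamp()) // 30)
        elif age <= timedelta(days=1):
            bucket = ("hour", ts.strftime("%Y-%m-%d %H"))
        else:
            bucket = ("day", ts.strftime("%Y-%m-%d"))
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            keep.add(ts)
    return keep
```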
We also do replication across multiple machines, and this is where zfs send and receive pipelines come in really handy. Whenever one of those snapshots gets taken, the system automatically ensures that there are slaves allocated for each container that's in master mode on a server.
B
Whenever
it
takes
a
snapshot,
it
will
replicate
that
snapshot
to
whatever
the
configured
number
of
other
slaves
is
that
could
be
in
the
same
data
center
or
in
a
different
data
center.
But
matt's
invention
of
zfs
send
and
receive,
is
absolutely
critical
to
us
being
able
to
do
this,
and
it
allows
us
to
do
near
real-time
replication
across
across
data
centers
and
that's
really
cool
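As a rough sketch of one such replication step, this pipes an incremental `zfs send` into `zfs receive` on a slave over SSH. The two commands and their flags are standard ZFS; the host and dataset names and the surrounding scheduling are assumptions.

```python
import subprocess

def replicate_incremental(dataset: str, prev_snap: str, new_snap: str,
                          slave_host: str) -> None:
    """Ship only the delta between two snapshots to a slave:
    zfs send -i dataset@prev dataset@new | ssh slave zfs receive -F dataset
    """
    send = subprocess.Popen(
        ["zfs", "send", "-i", f"{dataset}@{prev_snap}",
         f"{dataset}@{new_snap}"],
        stdout=subprocess.PIPE,
    )
    # -F lets the receiving side roll back to the last common snapshot.
    subprocess.run(["ssh", slave_host, "zfs", "receive", "-F", dataset],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError(f"zfs send failed for {dataset}@{new_snap}")
```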
Then the last feature is one I've sort of already mentioned.
B
This is tied into the backup piece, but actually giving users the ability to do a rollback themselves exposes this cool time machine feature (don't tell Apple I said that) that helps if, for example, your website gets hacked. We often deploy this in a web hosting scenario, so if your website gets hacked, you can roll back to before it got hacked and apply your patches; or, if you accidentally drop a table in your database, you can just roll back to 30 seconds before that.
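The recovery itself is a single ZFS operation. A minimal sketch, with illustrative dataset and snapshot names:

```python
import subprocess

def roll_back(dataset: str, snapshot: str) -> None:
    """Roll a filesystem back to a known-good snapshot, e.g. one
    taken just before the site was hacked or the table dropped.
    -r destroys snapshots newer than the target, which
    `zfs rollback` requires before it will go further back.
    """
    subprocess.run(["zfs", "rollback", "-r", f"{dataset}@{snapshot}"],
                   check=True)

# e.g. roll_back("tank/containers/customer42-web", "2013-09-17-103130")
```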
And then there are some other neat things we've built on top of that, seeing as this data is being continuously snapshotted. But actually, just one more point on this replication piece.
B
What I think is really quite neat about the approach of replicating sets of snapshots around between different machines is the contrast with a classical backup architecture, where you're sending your backups off to some backup server and then, if you want to recover your data from that backup server, you have to drag the data back.
B
The architecture that we've got here instead allows every replica to be both a user-facing backup and a hot spare at the same time, and I just think that having replicated snapshot trees is really the way forwards for all sorts of use cases.
B
So then, because we're doing this replication of data around between different nodes, we can also fail over to a very recent backup. You can specify a threshold to say: if I lose a node, I want to recover to the latest backup within a certain threshold. Or that failover can be initiated manually. And that's useful: if a server fails, you can recover all the applications that were on it to another server in the same DC.
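The threshold check reduces to comparing the age of the newest replicated snapshot on a slave against a configured bound; a sketch, where the 60-second default and the names are illustrative assumptions rather than the product's actual values:

```python
from datetime import datetime, timedelta

def should_auto_failover(latest_replica_snapshot: datetime,
                         now: datetime,
                         threshold: timedelta = timedelta(seconds=60)) -> bool:
    """Promote a slave automatically only if its newest replicated
    snapshot is fresh enough; otherwise leave the decision to a
    manual failover by the operator.
    """
    return now - latest_replica_snapshot <= threshold
```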
B
Similarly,
you
can
fail
over
all
the
applications
that
are
on
that
server
to
another
data
center.
Then
the
live
migration
is
sort
of
fun
and
I'll
talk
about
this
in
more
detail.
In
the
other
talk
I'm
doing,
but
it's
basically
the
ability
to
with
our
proxy
layer
seamlessly
migrate
applications
around
between
different
servers
and
different
storage
pools
and
different
clouds.
So
that's
it
really
in
terms
of
the
platforms
that
we
depend
on
from
from
an
open,
zfs
perspective,
we've
built
our
platform.
B
The
initial
version
of
our
product
is
built
on
freebsd
and
it's
built
as
a
of
classical
web
hosting
stack.
But
with
some
of
these
neat
distributed
systems,
features
built
in
freebsd
is
an
excellent
base.
B
It's
been
very
stable
and
we've
done
quite
a
bit
of
work
to
improve
sort
of
festivity
on
freebsd,
we're
now
looking
at
moving
to
linux
because
of
market
forces
and
because
of
an
interest
in
containerization
and
importing
some
of
the
sort
of
storage
stuff
we're
doing
to
to
that
context,
and
hence
the
interest
now
in
zfs
on
linux,
and
we
hope
to
sort
of
reproduce
some
of
the
success
we've
had
making
zfs
on
freebsd
more
stable
with
linux
as
well
and
yeah.
So
I
just
a
general
point.
B
I think that OpenZFS has huge potential, especially in the upcoming cloud infrastructure and containerization world, and I'm really excited to see such good forward progress on it. So, thanks. Any questions?
A
[question not transcribed]
B
Yeah, that's a great question. The short answer is that it's configurable, and it's a trade-off between disk I/O and network load on one hand and your replication latency on the other. In typical configurations we see a replication latency of around 30 seconds, and obviously it depends on your application whether that's acceptable for automatic failover, or whether you'd just want to do manual failover in that sort of scenario.
B
I have this ongoing curiosity about the concept of doing synchronous ZFS replication, but that's not something we currently have the resources to do.
C
In your practical experience, how do applications behave when you fail over and there are pending connections that will be routed by the proxy? Do they behave well? They might have stale data from 30 seconds ago, and the connections might belong to an already established communication where the other side doesn't have the data that's been passed.
B
Yeah, it's a good question. Generally, most applications will perform pretty well out of the box: when you take a consistent ZFS snapshot of an application's file system and then recover it somewhere else, it's just as if that server lost power and came back up.
B
So
if
you've
got
a
decent
database,
then
it
will
recover
automatically
pretty
quickly
if
you're
running
my
isom,
then
maybe
you're,
not
in
luck,
which
is
why
we
also
have
hooks
into
my
sequel
in
particular
to
do
a
flush
tables
with
redlock
operation
for
my
isom
tables,
but
yeah.
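A minimal sketch of such a hook: FLUSH TABLES WITH READ LOCK and UNLOCK TABLES are the real MySQL statements, while the driver choice (`pymysql` here) and the wiring around the snapshot are illustrative assumptions.

```python
import subprocess

import pymysql  # any MySQL client library would do

def consistent_mysql_snapshot(dataset: str, snap: str, **mysql_args) -> None:
    """Quiesce MyISAM tables around a ZFS snapshot so the
    on-disk state captured is consistent."""
    conn = pymysql.connect(**mysql_args)
    try:
        with conn.cursor() as cur:
            cur.execute("FLUSH TABLES WITH READ LOCK")
            # The snapshot is atomic and fast, so the lock is held briefly.
            subprocess.run(["zfs", "snapshot", f"{dataset}@{snap}"],
                           check=True)
            cur.execute("UNLOCK TABLES")
    finally:
        conn.close()
```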
Sorry, what was the other part of the question?
B
Yeah, you were also asking about connections and routing. The application behaviour is generally fine; like I say, it's as if it came back after a power failure. Then there's the question of routing connections around.
B
The approach that we're taking is not to try to get completely 100% seamless failover. As you can tell, it's more of a, well, people use the word 'cloudy' in a derogatory sense to describe things that aren't really, really hardcore, up to the millisecond, and that certainly limits some of the applicability of this approach. But I think there's still a large class of applications for which it's acceptable.
B
We will lose in-flight connections when a server fails, for example, and we'll have a brief period of downtime, because there's a timeout before we initiate a failover. So if someone's doing a long-running download they'll have to restart it, or a user might need to reload the page or something.
C
[question about the proxy layer not transcribed]
B
The proxy layer will also publish DNS records, and it'll publish multiple A records for each thing, so you can even have multiple A records in different data centers. Then, if one of the nodes fails and just completely goes offline, the browser will retry the other IP address; most well-behaved clients will retry another IP.
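Client-side, that retry behaviour looks roughly like this standard-library sketch: resolve every published A record and try each address until one accepts the connection, so a single dead node costs at most one timeout.

```python
import socket

def connect_any(hostname: str, port: int = 80,
                timeout: float = 3.0) -> socket.socket:
    """Try every address behind a hostname, in the spirit of a
    well-behaved client retrying another A record."""
    last_err: OSError | None = None
    for *_, addr in socket.getaddrinfo(hostname, port,
                                       type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as err:
            last_err = err  # node offline; try the next record
    raise last_err if last_err else OSError(f"no addresses for {hostname}")
```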
B
So, in response to what the purpose of the proxy is: the proxy is also used in live migration, and I'll go over that in the next deep dive.
A
Cool, so thanks very much, Luke.
B
No worries.
A
I was going to get back then to just talking about another...