From YouTube: MMP: safe zpool import for HA clusters, by Olaf Faaland
Description
From the 2017 OpenZFS Developer Summit:
http://www.open-zfs.org/wiki/OpenZFS_Developer_Summit_2017
A
So our next presenter is Olaf Faaland from the Lawrence Livermore National Laboratory, and he's going to talk to us about how to safely deal with ZFS in a clustered environment, and more specifically how you safely import a pool in a clustered environment without corrupting it by importing it on several nodes at the same time. All right, please welcome Olaf.
B
Hi, okay, so you heard the problem already. You have a clustered environment, you've got shared storage, and you want to make sure that you don't accidentally import the pool on two nodes at the same time. ZFS won't notice that this has happened; both nodes will start writing blocks and you'll lose your entire pool. So we've written MMP to reduce the risk of this happening, and it's merged in ZFS on Linux.
B
However, maybe the other node isn't really down. We have an external system, an HA system, that attempts to make sure that these two systems can't import the pool at the same time. In our case it typically works with power control, so it's supposed to detect that host A is down, power it off, and then start services on host B. But we've had cases where the HA system was misconfigured, and we've had cases where the power control system that the HA system depended on lied.
B
So we started working on this using a design that was written by Richard Kariya back in the day, and found, partially as an issue of my learning curve but also just as a practical matter, that it was too complicated for us to get done quickly, and that there were issues that made it more complicated than it needed to be. So we took a step back and said: okay, what do we really care about?
B
We thought about just about everywhere. The blocks in the pool aren't a good choice, because you have to import the pool to find them, and even if you import the pool read-only, that's not actually reliable; we'll talk about that more later. We looked at stealing space from the boot block or the boot header, or the blank space at the beginning of a label, but we were concerned about interfering with someone else who is also trying to steal new space.
B
We considered the nvlist, the name-value pair list that contains the configuration, but it's stored packed on disk for one thing, and it's more than a sector in size, so you can't guarantee that if you read it you're going to get one whole nvpair or nvlist. You might well get parts of more than one, because it's being overwritten at the same time you're reading it, as had been observed by, I think, probably many people before I even came to this.
B
One of the issues that comes up with that as well: okay, what if there's no activity in the pool? If it's quiet, uberblocks don't get written, and what's more, they typically get written on devices where there's dirty data. So you need something that you can count on even in those circumstances.
B
So
what
we
decided
to
do
was
to
elq
was
to
choose
one
of
those
slots
dedicated
to
MMP.
The
rest
of
the
roblox
slots
are
used
in
exactly
the
same
way
that
they
were
already
when
the
sink
goes
to
write
a
slot.
It
just
chooses
the
next
slot
wrapping
around
and
but
it
skips
that
one
at
the
end
and
the
one
at
the
end
we
used
to
write
to
indicate
activity
and
when
the
pool
is
quiet,.
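As a rough sketch of that idea (not the actual ZFS code; the slot count and function names here are illustrative assumptions), the sync path can rotate through all but the last slot, while MMP always targets the reserved one:

```c
#include <stdint.h>

#define UB_SLOTS      128            /* illustrative size of the uberblock ring */
#define UB_MMP_SLOT   (UB_SLOTS - 1) /* last slot reserved for MMP heartbeats   */

/* Normal sync writes rotate through the ring but skip the reserved MMP slot. */
static int ub_slot_for_sync(uint64_t txg)
{
	return (int)(txg % (UB_SLOTS - 1));
}

/* MMP heartbeat writes always land in the dedicated slot. */
static int ub_slot_for_mmp(void)
{
	return (UB_MMP_SLOT);
}
```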
B
One advantage of that is that we could then add MMP information to all of the uberblocks that are written, and so when the import occurs and the portion of the import that fetches the newest or best uberblock runs, that uberblock will have useful MMP information for us. We get some information without having to go look in some special place for it. Another advantage is that it doesn't introduce a compatibility problem we have to worry about.
B
This gives us one-second resolution, and we don't really care what time it is; we just look to see if it changed. Then we added three fields at the bottom: magic, delay, and sequence. Magic tells us whether or not there's valid MMP information here at the end of the struct; for example, if the pool is brought over from a system whose ZFS doesn't understand MMP, then we ignore those fields.
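As a rough sketch of the layout being described (field names follow the talk, not necessarily the exact identifiers in the ZFS on Linux headers):

```c
#include <stdint.h>

/* Tail of the uberblock as described in the talk: a timestamp with one-second
 * resolution, plus three MMP fields appended at the end. */
typedef struct ub_mmp_tail_sketch {
	/* ... the pre-existing uberblock fields come first ... */
	uint64_t ub_timestamp;  /* wall-clock seconds; we only care that it changes */
	uint64_t ub_mmp_magic;  /* marks the MMP fields below as valid              */
	uint64_t ub_mmp_delay;  /* observed time between MMP writes                 */
	uint64_t ub_mmp_seq;    /* unused for now; could give sub-second evidence
	                           of change on a quiet pool                        */
} ub_mmp_tail_sketch_t;
```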
B
Delay is the average time between MMP writes, which I'll go into in a little more detail, and the sequence at the moment is unused, but it would allow us to provide sub-second resolution on a quiet pool: sub-second evidence of changes. So I'll briefly go over how the import works. This is specific to Linux in one way at least, which is that Linux always gets a config from user space. That's, as I understand it, not true on illumos, and there may be other things that are not true there as well.
B
The tryimport then uses the block pointer from the best uberblock to get the config from the MOS, reconciles the MOS config with the partial one that it was given, checks other information like features, and then passes back a full config, if it was able to assemble one, plus information about the import and whatever may have failed. User space then takes that full config and passes it back in with an import ioctl, along with any flags, like the force flag to ignore the hostid.
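A hedged sketch of that two-step flow from the user-space side; the helper names here (scan_devices_for_config, tryimport_ioctl, import_ioctl) are hypothetical stand-ins for the real libzfs and ioctl plumbing, shown only to make the call order concrete:

```c
#include <stddef.h>

typedef struct nvlist nvlist_t;   /* opaque config, as in the real code */

/* Hypothetical stand-ins for the real plumbing. */
static nvlist_t *scan_devices_for_config(const char *pool) { (void)pool; return NULL; }
static nvlist_t *tryimport_ioctl(nvlist_t *partial)        { (void)partial; return NULL; }
static int       import_ioctl(nvlist_t *full, int force)   { (void)full; (void)force; return -1; }

static int import_pool(const char *pool, int force)
{
	/* 1. User space scans devices and builds a partial config. */
	nvlist_t *partial = scan_devices_for_config(pool);

	/* 2. The tryimport ioctl finds the best uberblock, walks to the MOS,
	 *    reconciles the configs, checks features, and returns a full config. */
	nvlist_t *full = (partial != NULL) ? tryimport_ioctl(partial) : NULL;
	if (full == NULL)
		return (-1);

	/* 3. The full config goes back down with the import ioctl, plus flags
	 *    such as force to ignore a hostid mismatch. */
	return (import_ioctl(full, force));
}
```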
B
So what I'm going to do now is go through how we arrived at the implementation that we ended up merging, starting with the initial idea, which was: well, we've got these uberblocks on disk that already provide an indicator. So what we'll do is just issue the tryimport repeatedly for some polling period and look for change, and if we see change within that polling period, then we know that we can't import safely.
B
If we see no change, we'll assume that we can safely continue on with the import process. So that was try number one. We added that code to the user-space utility, and we added an MMP thread in the kernel that would just write on a fixed schedule to that one dedicated MMP slot, choosing a device at random.
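A minimal sketch of such a kernel heartbeat thread, with hypothetical helpers (pick_random_leaf_vdev, write_mmp_uberblock, sleep_ms); it only shows the fixed-schedule write loop, not the real ZFS thread machinery:

```c
#include <stdint.h>

typedef struct vdev vdev_t;   /* opaque leaf device */

/* Hypothetical stand-ins for the real ZFS plumbing. */
static vdev_t *pick_random_leaf_vdev(void)     { return NULL; }
static void    write_mmp_uberblock(vdev_t *vd) { (void)vd; }
static void    sleep_ms(uint64_t ms)           { (void)ms; }

/* Heartbeat loop: keep writing the dedicated MMP uberblock slot on a fixed
 * schedule, so an importing host can see the pool is alive even when no
 * transaction groups are syncing. */
static void mmp_thread(uint64_t interval_ms, volatile int *stop)
{
	while (!*stop) {
		vdev_t *vd = pick_random_leaf_vdev();
		if (vd != NULL)
			write_mmp_uberblock(vd);
		sleep_ms(interval_ms);
	}
}
```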
B
But at that point we said: okay, well, there's a fundamental problem we can't fix even if we made the code less brittle; maybe we should just try to avoid this problem altogether. Instead of polling by repeatedly issuing the tryimport ioctl from user space, let's poll within the ioctl itself. We'll just repeatedly fetch the best uberblock for whatever the polling period is, and if that doesn't change, then we can proceed. If we see change, then we bail out; we don't do the rest of the import process.
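A sketch of that in-kernel activity test, with hypothetical helpers (read_best_uberblock, sleep_ms); it captures the idea of re-reading the best uberblock over the polling window and bailing out on any change:

```c
#include <stdint.h>

struct ub_state {
	uint64_t txg;        /* transaction group of the best uberblock */
	uint64_t timestamp;  /* its timestamp                           */
};

/* Hypothetical stand-ins. */
static struct ub_state read_best_uberblock(void) { struct ub_state s = {0, 0}; return s; }
static void sleep_ms(uint64_t ms)                { (void)ms; }

/* Return 1 if the pool looks inactive over the polling window, 0 otherwise. */
static int mmp_activity_test(uint64_t poll_ms, uint64_t step_ms)
{
	struct ub_state first = read_best_uberblock();

	if (step_ms == 0)
		step_ms = 1;
	for (uint64_t waited = 0; waited < poll_ms; waited += step_ms) {
		sleep_ms(step_ms);
		struct ub_state now = read_best_uberblock();
		if (now.txg != first.txg || now.timestamp != first.timestamp)
			return (0);   /* activity seen: another host owns the pool */
	}
	return (1);   /* no change observed: assume it is safe to proceed */
}
```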
B
So that helped a lot; you know, the node doesn't panic. But there's still the problem that perhaps there's a delay between the tryimport ioctl being issued and the import ioctl. The tryimport may have concluded that it's safe to import the pool, but now the node is busy or something happens, and by the time it actually issues the import ioctl, the pool has been imported on another host and it's not safe anymore.
B
If we concluded there's no activity, we say this transaction group and this timestamp are the last ones we saw and it should be safe. Then, when user space issues the import ioctl, it passes those values back in, and the import looks to see: have the txg and timestamp changed? If they haven't changed, then the activity test is still valid, so it can continue and perform the import.
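And a sketch of that revalidation when the import ioctl arrives: user space hands back the txg and timestamp it saw at tryimport time, and the kernel only proceeds if they still match (helper names hypothetical):

```c
#include <stdint.h>

struct ub_state {
	uint64_t txg;
	uint64_t timestamp;
};

static struct ub_state read_best_uberblock(void) { struct ub_state s = {0, 0}; return s; }

/* Return 1 if the earlier activity test is still valid, 0 if the pool moved on. */
static int mmp_import_still_safe(uint64_t saved_txg, uint64_t saved_timestamp)
{
	struct ub_state now = read_best_uberblock();

	return (now.txg == saved_txg && now.timestamp == saved_timestamp);
}
```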
B
Say host A writes every 10 seconds and host B is told to wait for only one second; that wait isn't long enough, because the settings have been chosen poorly. So we added the field (I forget exactly what it's called) that records the time between writes, the one I showed you at the end of the uberblock.
B
Now, not only does that record essentially what the user setting was, but if there's some delay in the IO pipeline that's causing these MMP writes not to get to disk on the fixed schedule, which is the ideal, host B knows that. So if, for some reason, the MMP writes are only landing once every 10 seconds because something has gone badly wrong, host B knows that it has to wait a lot more than 10 seconds, and it can calculate an appropriate polling period.
B
Another potential problem is when there are more than two hosts involved and two hosts both try to import the pool at the same time. So we add a small random additional time to the calculation: we've got the calculated time that's based on the average period between writes, and then we've got this random term. One of the nodes will win, hopefully, and the other nodes will see activity, because the node that finished first will write an MMP block and they'll see that change.
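A sketch of how the importer's wait might be computed from the observed write delay plus a small random term, so two would-be importers are unlikely to finish their activity tests at the same moment; the constants, margin factor, and helper names here are illustrative assumptions, not the real tunables:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative calculation: base the wait on the larger of the writer's
 * observed delay between MMP writes and the configured interval, scale it
 * for safety margin, and add a small random term so simultaneous importers
 * do not race in lockstep. */
static uint64_t mmp_import_wait_ms(uint64_t observed_delay_ms,
    uint64_t configured_interval_ms)
{
	uint64_t base = (observed_delay_ms > configured_interval_ms) ?
	    observed_delay_ms : configured_interval_ms;
	uint64_t jitter = (uint64_t)(rand() % (base / 4 + 1));  /* small random extra */

	return (base * 2 + jitter);   /* the factor of 2 is purely illustrative */
}
```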
B
But you can't check properties without importing the pool, because they're stored in the pool. However, it's okay, because we've got those MMP fields at the end of the uberblock structure, so we zero them if the property is off and we set them if the property is on, and the importing host, host B, can tell from the uberblock whether or not MMP is required.
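So the importing host's check can be as simple as looking at the magic field in the best uberblock it already fetched; a sketch, reusing the field naming from the earlier struct sketch, with a placeholder magic value:

```c
#include <stdint.h>

#define MMP_MAGIC_SKETCH 0xa11cea11ULL   /* placeholder value, not the real constant */

/* If the writing host had the multihost property on, it fills in the MMP
 * fields; if off, it zeroes them. The importer only needs to look at magic. */
static int mmp_required(uint64_t ub_mmp_magic)
{
	return (ub_mmp_magic == MMP_MAGIC_SKETCH);
}
```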
B
So that's ultimately what we arrived at. By the way, these frequencies that I have here I realized are wrong, so I'll give the corrected slide to Matt for posting. But we added a couple of kernel module parameters. One is the multihost interval: that's how many milliseconds should go by during which every vdev should get one MMP write. The idea was that that's easy to understand: every device should get a write, and so every device is providing protection to the pool.
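The way that parameter was explained, the per-write pacing falls out of spreading the interval across the leaf vdevs, so that every device gets one MMP write per interval; a sketch of that arithmetic, with illustrative names:

```c
#include <stdint.h>

/* If every leaf vdev should receive one MMP write within multihost_interval_ms,
 * then consecutive MMP writes (each to one leaf) must be spaced roughly
 * interval / number-of-leaves apart. */
static uint64_t mmp_write_spacing_ms(uint64_t multihost_interval_ms,
    uint64_t leaf_vdev_count)
{
	if (leaf_vdev_count == 0)
		leaf_vdev_count = 1;
	return (multihost_interval_ms / leaf_vdev_count);
}
```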
B
But what we did was use ztest. It's got its own namespace, it's running in user space, and it's easy to tell it what host ID to use, so now we have what looks like an active pool on files or loopback devices or whatever, and we can then try to import that through the kernel. If MMP is working, it'll detect the activity and the import will fail. We did have to modify ztest a little bit.
B
There are important limitations. The biggest one is just that if something has gone badly wrong in the IO pipeline that introduces long delays but doesn't prevent writes from landing eventually, then that defeats MMP, because it depends on seeing no activity for some period of time, and you can't wait forever. So we pick an amount of time, and in our case this is used in conjunction with an HA system that powers off nodes, which gives us a second line of defense.
B
Other, lesser issues are that there's no ongoing check, so once the pool is imported we don't check to see if something's changing out from underneath us, but that could certainly be done and it would be, I think, a big improvement. At the moment, if the pool is suspended, there's no protection: when it's resumed, we don't check to make sure that nothing changed. We actually took a stab at that.
B
Adding a device, at the moment, doesn't get any protection either: if you're changing the structure of the pool in such a way that you try to add the same device to two different pools by doing it on two different nodes, well, yeah. It's not a big window, in the sense that when you add a device a label gets written, and these operations all check for a label before they begin their work.
B
But you can use a force option that'll tell them not to look for that, or to ignore it. And in any case, if you had an empty device and you added it to two pools at the same time on two different nodes, that wouldn't be detected. So that could be improved. And then we also didn't do anything to prevent someone from hosing themselves with zpool labelclear, which we could do.
B
Oh right, sorry, thanks. The question is: why did we add a new uberblock MMP magic instead of just adding a feature flag or bumping the version, essentially? The main reason was that we're trying to make it easy for someone to go back and forth between an MMP-compatible implementation and a non-MMP-compatible implementation if they need to do that, and also because it made our life a little easier, because then we didn't have to try to write code to help people handle that transition.
B
The question was: is this SAS-based, and what would it look like if more than two hosts are involved? I don't think it needs to be SAS-based at all; or I should say, no, it's not. That happens to be our environment, so maybe that's the example I gave, but ZFS is just looking at the devices on disk without regard to how they're connected, just with regard to their content.
B
As far as multiple hosts, just as an example, we have servers where four hosts are sharing the same storage, and in our particular configuration all four of those hosts aren't normally intended to import the same pool, so we actually have pairs of two. But it's only a few keystrokes away, and one can imagine that there would be situations where you actually would have more than a pair. It should work equally well with N hosts, at least for some small number; I haven't thought about limitations.
B
I described it as an average, and the question is: why is it not the longest time? I agree with you, and in fact it's not really quite an average, so I should probably change that in the slide too. It's a rolling average only when it's going down; if there's a long delay, that value is set to the longer delay.
B
So, to rephrase that: when an MMP write succeeds, we go look at how long it took, because we recorded in the vdev the time that we issued the write, so we can do the subtraction and see how long it took. Then we go look and see what the current value of that average is for the pool, and if this new delay is greater than the existing average, then the existing average value is just set to this new delay.
B
So if we have some sudden long delay, then that MMP delay value is now this new long period. If the new delay is less than the existing MMP delay, then we use a decaying average: 127 times the existing value plus one times the new value, divided by 128, so that the delay decays slowly over time, and if you're getting bumpy values it will at least eventually go down.
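That update rule, as described, can be sketched as follows; the function and variable names are illustrative, and the real code may use different units, but the shape of the calculation is the same:

```c
#include <stdint.h>

/* Update the rolling MMP delay as described in the talk: jump up immediately
 * to any new longer delay, otherwise decay slowly toward smaller observations
 * using a 127/128 weighted average. */
static uint64_t mmp_delay_update(uint64_t current_avg, uint64_t new_delay)
{
	if (new_delay > current_avg)
		return (new_delay);                        /* long delays take effect at once */

	return ((current_avg * 127 + new_delay) / 128);    /* short delays decay it slowly */
}
```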
A
B
Right, that's a good point. Okay, yeah, that's a very good point. So I would say that if we got off our butts and started putting a value in that MMP sequence field, that would address that problem, because right now we don't look to see that these numbers have increased; we just look to see that anything changed at all. The intent of the MMP sequence is that we would be stuffing a monotonic value in there, you know, the ticks or whatever timer is available on the system.
B
So the txg provides us with an activity indicator when the pool is busy, and the timestamp provides us with an activity indicator when the pool is not busy, but you're correct, it's not monotonic, and so we should actually use the sequence number field so that that hole is closed. Yes?
B
So the question was: given that this doesn't handle all cases and is intended as sort of last-ditch protection, how do we handle it in our systems if MMP detects that another system actively has the pool imported when we don't expect that to be the case? Like, for example, do we power ourselves off or something?
B
Now, we've only seen that in testing, and in our case it actually protects us in two different scenarios. One is the failover scenario that we've been talking about, where the HA system is really the primary line of defense and MMP is a backup. But the other place where it protects us, where MMP is actually the main line of defense, is that we've got these clusters.