From YouTube: MMP: safe zpool import for HA clusters, by Olaf Faaland
Description
From the 2017 OpenZFS Developer Summit:
http://www.open-zfs.org/wiki/OpenZFS_Developer_Summit_2017
A
So our next presenter is Olaf Faaland from the Lawrence Livermore National Laboratory, and he's going to talk to us about how to safely deal with ZFS in a clustered environment, and more specifically how you safely import a pool in a clustered environment without corrupting it by importing it on several nodes at the same time. All right, please welcome Olaf.
B
Hi, okay, so you heard the problem already. You have a clustered environment, you've got shared storage, and you want to make sure that you don't accidentally import the pool on two nodes at the same time. ZFS won't notice that this has happened; both nodes will start writing blocks and you'll lose your entire pool. So we've written MMP to reduce the risk of this happening, and it's merged in ZFS on Linux.
B
However, maybe the other node isn't really down. We have an external system, an HA system, that attempts to make sure that these two systems can't import the pool at the same time. In our case it typically works with power control, so it's supposed to detect that host A is down, power it off, and then start services on host B. But we've had cases where the HA system was misconfigured, and we've had cases where the power control system that the HA system depended on lied.
B
So we started working on this using a design that was written by Richard Kariya back in the day, and found, partially as an issue of my learning curve but also just as a practical matter, that it was too complicated for us to get done quickly, and that there were issues that made it more complicated than it needed to be. So we took a step back and said: okay, what do we really care about?
B
We thought about just about everywhere. The blocks in the pool aren't a good choice, because you have to import the pool to find them, and even if you import the pool read-only, that's not actually reliable; we'll talk about that more later. We looked at stealing space from the boot block or the boot header, or the blank space at the beginning of a label, but we were concerned about interfering with someone else who is also trying to steal new space.
B
We considered the nvlist, the name-value pair list that contains the configuration, but it's stored packed on disk for one thing, and it's more than a sector in size, so you can't guarantee that if you read it you're going to get one whole nvpair or nvlist. You might well get parts of more than one, because it's being overwritten at the same time you're reading it, as had been observed by, I think, probably many people before I even came to this.
B
One of the issues that comes up with that as well: okay, what if there's no activity in the pool? If it's quiet, uberblocks don't get written, and what's more, they typically get written on devices where there's dirty data. So you need something that you can count on even in those circumstances.
B
So
what
we
decided
to
do
was
to
elq
was
to
choose
one
of
those
slots
dedicated
to
MMP.
The
rest
of
the
roblox
slots
are
used
in
exactly
the
same
way
that
they
were
already
when
the
sink
goes
to
write
a
slot.
It
just
chooses
the
next
slot
wrapping
around
and
but
it
skips
that
one
at
the
end
and
the
one
at
the
end
we
used
to
write
to
indicate
activity
and
when
the
pool
is
quiet,.
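As a rough sketch of that idea (not the actual ZFS code; the slot count and function names here are illustrative assumptions), the sync path can rotate through all but the last slot, while MMP always targets the reserved one:

```c
#include <stdint.h>

#define UB_SLOTS      128            /* illustrative size of the uberblock ring */
#define UB_MMP_SLOT   (UB_SLOTS - 1) /* last slot reserved for MMP heartbeats   */

/* Normal sync writes rotate through the ring but skip the reserved MMP slot. */
static int ub_slot_for_sync(uint64_t txg)
{
	return (int)(txg % (UB_SLOTS - 1));
}

/* MMP heartbeat writes always land in the dedicated slot. */
static int ub_slot_for_mmp(void)
{
	return (UB_MMP_SLOT);
}
```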
B
One advantage of that is that we could then add MMP information to all of the uberblocks that are written, and so when the import occurs and the portion of the import that fetches the newest or best uberblock runs, that uberblock will have useful MMP information for us. We get some information without having to go look in some special place for it. Another advantage is that it doesn't introduce a compatibility problem we have to worry about.
B
This gives us one-second resolution, and we don't really care what time it is; we just look to see if it changed. Then we added three fields at the bottom: magic, delay, and sequence. Magic tells us whether or not there's valid MMP information here at the end of the struct; for example, if the pool is brought over from a system whose ZFS doesn't understand MMP, then we ignore those fields.
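As a rough sketch of the layout being described (field names follow the talk, not necessarily the exact identifiers in the ZFS on Linux headers):

```c
#include <stdint.h>

/* Tail of the uberblock as described in the talk: a timestamp with one-second
 * resolution, plus three MMP fields appended at the end. */
typedef struct ub_mmp_tail_sketch {
	/* ... the pre-existing uberblock fields come first ... */
	uint64_t ub_timestamp;  /* wall-clock seconds; we only care that it changes */
	uint64_t ub_mmp_magic;  /* marks the MMP fields below as valid              */
	uint64_t ub_mmp_delay;  /* observed time between MMP writes                 */
	uint64_t ub_mmp_seq;    /* unused for now; could give sub-second evidence
	                           of change on a quiet pool                        */
} ub_mmp_tail_sketch_t;
```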
B
Delay is the average time between MMP writes, which I'll go into in a little more detail, and the sequence at the moment is unused, but it would allow us to provide sub-second resolution on a quiet pool: sub-second evidence of changes. So I'll briefly go over how the import works. This is specific to Linux in one way at least, which is that Linux always gets a config from user space. That's, as I understand it, not true on illumos, and there may be other things that are not true there as well.
B
The tryimport then uses the block pointer from the best uberblock to get the config from the MOS, reconciles the MOS config with the partial one that it was given, checks other information like features, and then passes back a full config, if it was able to assemble one, plus information about the import and whatever may have failed. User space then takes that full config and passes it back in with an import ioctl, along with any flags, like the force flag to ignore the hostid.
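A hedged sketch of that two-step flow from the user-space side; the helper names here (scan_devices_for_config, tryimport_ioctl, import_ioctl) are hypothetical stand-ins for the real libzfs and ioctl plumbing, shown only to make the call order concrete:

```c
#include <stddef.h>

typedef struct nvlist nvlist_t;   /* opaque config, as in the real code */

/* Hypothetical stand-ins for the real plumbing. */
static nvlist_t *scan_devices_for_config(const char *pool) { (void)pool; return NULL; }
static nvlist_t *tryimport_ioctl(nvlist_t *partial)        { (void)partial; return NULL; }
static int       import_ioctl(nvlist_t *full, int force)   { (void)full; (void)force; return -1; }

static int import_pool(const char *pool, int force)
{
	/* 1. User space scans devices and builds a partial config. */
	nvlist_t *partial = scan_devices_for_config(pool);

	/* 2. The tryimport ioctl finds the best uberblock, walks to the MOS,
	 *    reconciles the configs, checks features, and returns a full config. */
	nvlist_t *full = (partial != NULL) ? tryimport_ioctl(partial) : NULL;
	if (full == NULL)
		return (-1);

	/* 3. The full config goes back down with the import ioctl, plus flags
	 *    such as force to ignore a hostid mismatch. */
	return (import_ioctl(full, force));
}
```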
B
So what I'm going to do now is go through how we arrived at the implementation that we ended up merging, starting with the initial idea, which was: well, we've got these uberblocks on disk that already provide an indicator. So what we'll do is just issue the tryimport repeatedly for some polling period and look for change, and if we see change within that polling period, then we know that we can't import safely.
B
If we see no change, we'll assume that we can safely continue on with the import process. So that was try number one. We added that code to the user-space utility, and we added an MMP thread in the kernel that would just write on a fixed schedule to that one dedicated MMP slot, choosing a device at random.
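A minimal sketch of such a kernel heartbeat thread, with hypothetical helpers (pick_random_leaf_vdev, write_mmp_uberblock, sleep_ms); it only shows the fixed-schedule write loop, not the real ZFS thread machinery:

```c
#include <stdint.h>

typedef struct vdev vdev_t;   /* opaque leaf device */

/* Hypothetical stand-ins for the real ZFS plumbing. */
static vdev_t *pick_random_leaf_vdev(void)     { return NULL; }
static void    write_mmp_uberblock(vdev_t *vd) { (void)vd; }
static void    sleep_ms(uint64_t ms)           { (void)ms; }

/* Heartbeat loop: keep writing the dedicated MMP uberblock slot on a fixed
 * schedule, so an importing host can see the pool is alive even when no
 * transaction groups are syncing. */
static void mmp_thread(uint64_t interval_ms, volatile int *stop)
{
	while (!*stop) {
		vdev_t *vd = pick_random_leaf_vdev();
		if (vd != NULL)
			write_mmp_uberblock(vd);
		sleep_ms(interval_ms);
	}
}
```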
B
But at that point we said: okay, well, there's a fundamental problem we can't fix even if we made the code less brittle; maybe we should just try to avoid this problem altogether. Instead of polling by repeatedly issuing the tryimport ioctl from user space, let's poll within the ioctl itself. We'll just repeatedly fetch the best uberblock for whatever the polling period is, and if that doesn't change, then we can proceed. If we see change, then we bail out; we don't do the rest of the import process.
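A sketch of that in-kernel activity test, with hypothetical helpers (read_best_uberblock, sleep_ms); it captures the idea of re-reading the best uberblock over the polling window and bailing out on any change:

```c
#include <stdint.h>

struct ub_state {
	uint64_t txg;        /* transaction group of the best uberblock */
	uint64_t timestamp;  /* its timestamp                           */
};

/* Hypothetical stand-ins. */
static struct ub_state read_best_uberblock(void) { struct ub_state s = {0, 0}; return s; }
static void sleep_ms(uint64_t ms)                { (void)ms; }

/* Return 1 if the pool looks inactive over the polling window, 0 otherwise. */
static int mmp_activity_test(uint64_t poll_ms, uint64_t step_ms)
{
	struct ub_state first = read_best_uberblock();

	if (step_ms == 0)
		step_ms = 1;
	for (uint64_t waited = 0; waited < poll_ms; waited += step_ms) {
		sleep_ms(step_ms);
		struct ub_state now = read_best_uberblock();
		if (now.txg != first.txg || now.timestamp != first.timestamp)
			return (0);   /* activity seen: another host owns the pool */
	}
	return (1);   /* no change observed: assume it is safe to proceed */
}
```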
B
So that helped a lot; you know, the node doesn't panic. But there's still the problem that perhaps there's a delay between the tryimport ioctl being issued and the import ioctl. The tryimport may have concluded that it's safe to import the pool, but now the node is busy or something happens, and by the time it actually issues the import ioctl, the pool has been imported on another host and it's not safe anymore.
B
If we concluded there's no activity, we say this transaction group and this timestamp are the last ones we saw and it should be safe. Then, when user space issues the import ioctl, it passes those values back in, and the import looks to see: have the txg and timestamp changed? If they haven't changed, then the activity test is still valid, so it can continue and perform the import.
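And a sketch of that revalidation when the import ioctl arrives: user space hands back the txg and timestamp it saw at tryimport time, and the kernel only proceeds if they still match (helper names hypothetical):

```c
#include <stdint.h>

struct ub_state {
	uint64_t txg;
	uint64_t timestamp;
};

static struct ub_state read_best_uberblock(void) { struct ub_state s = {0, 0}; return s; }

/* Return 1 if the earlier activity test is still valid, 0 if the pool moved on. */
static int mmp_import_still_safe(uint64_t saved_txg, uint64_t saved_timestamp)
{
	struct ub_state now = read_best_uberblock();

	return (now.txg == saved_txg && now.timestamp == saved_timestamp);
}
```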
B
Say host A writes every 10 seconds and host B is told to wait for only one second; that wait isn't long enough, because the settings have been chosen poorly. So we added the field (I forget exactly what it's called) that records the time between writes, the one I showed you at the end of the uberblock.
B
Now, not only does that record essentially what the user setting was, but if there's some delay in the IO pipeline that's causing these MMP writes not to get to disk on the fixed schedule, which is the ideal, host B knows that. So if, for some reason, the MMP writes are only landing once every 10 seconds because something has gone badly wrong, host B knows that it has to wait a lot more than 10 seconds, and it can calculate an appropriate polling period.
B
Another potential problem is when there are more than two hosts involved and two hosts both try to import the pool at the same time. So we add a small random additional time to the calculation: we've got the calculated time that's based on the average period between writes, and then we've got this random term. One of the nodes will win, hopefully, and the other nodes will see activity, because the node that finished first will write an MMP block and they'll see that change.
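A sketch of how the importer's wait might be computed from the observed write delay plus a small random term, so two would-be importers are unlikely to finish their activity tests at the same moment; the constants, margin factor, and helper names here are illustrative assumptions, not the real tunables:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative calculation: base the wait on the larger of the writer's
 * observed delay between MMP writes and the configured interval, scale it
 * for safety margin, and add a small random term so simultaneous importers
 * do not race in lockstep. */
static uint64_t mmp_import_wait_ms(uint64_t observed_delay_ms,
    uint64_t configured_interval_ms)
{
	uint64_t base = (observed_delay_ms > configured_interval_ms) ?
	    observed_delay_ms : configured_interval_ms;
	uint64_t jitter = (uint64_t)(rand() % (base / 4 + 1));  /* small random extra */

	return (base * 2 + jitter);   /* the factor of 2 is purely illustrative */
}
```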
B
But you can't check properties without importing the pool, because they're stored in the pool. However, it's okay, because we've got those MMP fields at the end of the uberblock structure, so we zero them if the property is off and we set them if the property is on, and the importing host, host B, can tell from the uberblock whether or not MMP is required.
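So the importing host's check can be as simple as looking at the magic field in the best uberblock it already fetched; a sketch, reusing the field naming from the earlier struct sketch, with a placeholder magic value:

```c
#include <stdint.h>

#define MMP_MAGIC_SKETCH 0xa11cea11ULL   /* placeholder value, not the real constant */

/* If the writing host had the multihost property on, it fills in the MMP
 * fields; if off, it zeroes them. The importer only needs to look at magic. */
static int mmp_required(uint64_t ub_mmp_magic)
{
	return (ub_mmp_magic == MMP_MAGIC_SKETCH);
}
```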
B
So that's ultimately what we arrived at. By the way, these frequencies that I have here I realized are wrong, so I'll give the corrected slide to Matt for posting. But we added a couple of kernel module parameters. One is the multihost interval: that's how many milliseconds should go by during which every vdev should get one MMP write. The idea was that that's easy to understand: every device should get a write, and so every device is providing protection to the pool.
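The way that parameter was explained, the per-write pacing falls out of spreading the interval across the leaf vdevs, so that every device gets one MMP write per interval; a sketch of that arithmetic, with illustrative names:

```c
#include <stdint.h>

/* If every leaf vdev should receive one MMP write within multihost_interval_ms,
 * then consecutive MMP writes (each to one leaf) must be spaced roughly
 * interval / number-of-leaves apart. */
static uint64_t mmp_write_spacing_ms(uint64_t multihost_interval_ms,
    uint64_t leaf_vdev_count)
{
	if (leaf_vdev_count == 0)
		leaf_vdev_count = 1;
	return (multihost_interval_ms / leaf_vdev_count);
}
```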
B
But what we did was use ztest. It's got its own namespace, it's running in user space, and it's easy to tell it what host ID to use, so now we have what looks like an active pool on files or loopback devices or whatever, and we can then try to import that through the kernel. If MMP is working, it'll detect the activity and the import will fail. We did have to modify ztest a little bit.
B
There are important limitations. The biggest one is just that if something has gone badly wrong in the IO pipeline that introduces long delays but doesn't prevent writes from landing eventually, then that defeats MMP, because it depends on seeing no activity for some period of time, and you can't wait forever. So we pick an amount of time, and in our case this is used in conjunction with an HA system that powers off nodes, which gives us a second line of defense.
B
Other, lesser issues are that there's no ongoing check, so once the pool is imported we don't check to see if something's changing out from underneath us, but that could certainly be done and it would be, I think, a big improvement. At the moment, if the pool is suspended, there's no protection: when it's resumed, we don't check to make sure that nothing changed. We actually took a stab at that.
B
Adding a device, at the moment, doesn't get any protection either: if you're changing the structure of the pool in such a way that you try to add the same device to two different pools by doing it on two different nodes, well, yeah. It's not a big window, in the sense that when you add a device a label gets written, and these operations all check for a label before they begin their work.
B
But you can use a force option that'll tell them not to look for that, or to ignore it. And in any case, if you had an empty device and you added it to two pools at the same time on two different nodes, that wouldn't be detected. So that could be improved. And then we also didn't do anything to prevent someone from hosing themselves with zpool labelclear, which we could do.
B
Oh right, sorry, thanks. The question is: why did we add a new uberblock MMP magic instead of just adding a feature flag or bumping the version, essentially? The main reason was that we're trying to make it easy for someone to go back and forth between an MMP-compatible implementation and a non-MMP-compatible implementation if they need to do that, and also because it made our life a little easier, because then we didn't have to try to write code to help people handle that transition.
B
The question was: is this SAS-based, and what would it look like if more than two hosts are involved? I don't think it needs to be SAS-based at all; or I should say, no, it's not. That happens to be our environment, so maybe that's the example I gave, but ZFS is just looking at the devices on disk without regard to how they're connected, just with regard to their content.
B
As far as multiple hosts, just as an example, we have servers where four hosts are sharing the same storage, and in our particular configuration all four of those hosts aren't normally intended to import the same pool, so we actually have pairs of two. But it's only a few keystrokes away, and one can imagine that there would be situations where you actually would have more than a pair. It should work equally well with N hosts, at least for some small number; I haven't thought about limitations.
B
I described it as an average, and the question is: why is it not the longest time? I agree with you, and in fact it's not really quite an average, so I should probably change that in the slide too. It's a rolling average only when it's going down; if there's a long delay, that value is set to the longer delay.
B
So, to rephrase that: when an MMP write succeeds, we go look at how long it took, because we recorded in the vdev the time that we issued the write, so we can do the subtraction and see how long it took. Then we go look and see what the current value of that average is for the pool, and if this new delay is greater than the existing average, then the existing average value is just set to this new delay.
B
So if we have some sudden long delay, then that MMP delay value is now this new long period. If the new delay is less than the existing MMP delay, then we use a decaying average: 127 times the existing value plus one times the new value, divided by 128, so that the delay decays slowly over time, and if you're getting bumpy values it will at least eventually go down.
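That update rule, as described, can be sketched as follows; the function and variable names are illustrative, and the real code may use different units, but the shape of the calculation is the same:

```c
#include <stdint.h>

/* Update the rolling MMP delay as described in the talk: jump up immediately
 * to any new longer delay, otherwise decay slowly toward smaller observations
 * using a 127/128 weighted average. */
static uint64_t mmp_delay_update(uint64_t current_avg, uint64_t new_delay)
{
	if (new_delay > current_avg)
		return (new_delay);                        /* long delays take effect at once */

	return ((current_avg * 127 + new_delay) / 128);    /* short delays decay it slowly */
}
```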
A
B
Right, that's a good point. Okay, yeah, that's a very good point. So I would say that if we got off our butts and started putting a value in that MMP sequence field, that would address that problem, because right now we don't look to see that these numbers have increased; we just look to see that anything changed at all. The intent of the MMP sequence is that we would be stuffing a monotonic value in there, you know, the ticks or whatever timer is available on the system.
B
So the txg provides us with an activity indicator when the pool is busy, and the timestamp provides us with an activity indicator when the pool is not busy, but you're correct, it's not monotonic, and so we should actually use the sequence number field so that that hole is closed. Yes?
B
So the question was: given that this doesn't handle all cases and is intended as sort of last-ditch protection, how do we handle it in our systems if MMP detects that another system actively has the pool imported when we don't expect that to be the case? Like, for example, do we power ourselves off or something?
B
Now, we've only seen that in testing, and in our case it actually protects us in two different scenarios. One is the failover scenario that we've been talking about, where the HA system is really the primary line of defense and MMP is a backup. But the other place where it protects us, where MMP is actually the main line of defense, is that we've got these clusters.