From YouTube: Fault Management by Don Brady & Justin Gibbs
B
So at about that time, FMA was landing, and so there was an effort to make ZFS visible in that FMA space. It turned into sort of a roadmap, and they added diagnosis and basically spread it out over several years as they got more familiar with the territory. They, you know, started adding items to the roadmap to fill out the stack, and some of the features never made it back. I sort of alluded to that earlier: the generic I/O fault, the SMART data telemetry.
B
Some of that data is not actually being fed into the diagnosis engine, so that's one of the things I'm hoping we can, you know, renew with this effort. Most of this history I gleaned from Eric Schrock; I had talked with him, and he was sort of the instigator, I guess, of some of this stuff, and at Fishworks he was, you know, actively involved with bringing FMA — pool FMA-like sort of support — into ZFS.
B
So yeah, my overall goal for this talk is to sort of renew that staged roadmap approach that they were following before. If you look back into the history, there was a Phase 0, a Phase 1, a Phase 2 — and I have frantically googled for a Phase 3, but I have not been able to find it, so we'll have to invent that one. So yeah, like I said, at the end — and then maybe even tomorrow — a breakout, if people are interested.
B
So this is sort of a simplification of fault management in ZFS; it sort of shows all the key players. Overall, the idea is automated diagnosis and isolation of a fault of a vdev. A fault is something we can associate with an impact, like loss of redundancy, and then there's a corrective action: onlining a disk, replacing a disk.
B
So yeah, these errors come in, and then the engine, per vdev, will associate a case. The case is sort of like a detective's case file: it'll actually open a case, and in the case of ZFS we attach a SERD engine. SERD is just a fancy way of saying Soft Error Rate Discrimination: we're looking for N events in a T time period. These vary across platforms, but it's typically on the order of, like, 15 errors in 10 minutes, or something to that effect, and this was what was chosen.
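As a rough illustration of that idea — a minimal sketch in C, not the actual FMA SERD engine; the structure and names here are invented — a threshold of N events inside a sliding window of T seconds can be tracked with a small ring buffer of timestamps:

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical sketch of a SERD (Soft Error Rate Discrimination)
 * engine: fire once n events land inside a sliding window of t
 * seconds (e.g. 15 errors in 10 minutes).  Assumes n <= 64. */
typedef struct serd {
	size_t	n;		/* threshold: event count */
	time_t	t;		/* window length, seconds */
	size_t	count;		/* events recorded so far (capped at n) */
	size_t	head;		/* next slot in the ring buffer */
	time_t	times[64];	/* timestamps of the last n events */
} serd_t;

/* Record one error event; returns true when the last n events all fit
 * inside the window, i.e. the case should be solved and the vdev
 * faulted. */
static bool
serd_record(serd_t *s, time_t now)
{
	s->times[s->head] = now;
	s->head = (s->head + 1) % s->n;
	if (s->count < s->n)
		s->count++;
	if (s->count < s->n)
		return (false);
	/* head now points at the oldest of the last n events. */
	return (now - s->times[s->head] <= s->t);
}
```

A real engine would also expire stale events and reset when a case closes; this only shows the rate test itself.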
B
Yeah, so it'd only replace with a spare, and currently, I think, what's in the top of the OpenZFS tree is real simplistic: it just takes an array of spares, typically global, and it'll try them — I have a faulted disk, let me try the first spare and see if it works. And so I know for a fact that people want to have a better matching algorithm.
B
We now have the capability of having different tiers — like, for example, metadata — and you might want to have a separate spare for that. And in the case of dRAID, which Intel is working on — it's a declustered RAID solution — it actually has virtual spares. So that's another special case where you'd have to actually have criteria for matching the spares, instead of just blindly trying things until something works.
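A smarter matcher could be a predicate consulted before a spare is tried. The sketch below is hypothetical — none of these structures exist in OpenZFS — and only illustrates the kind of criteria being discussed: capacity, tier, and dRAID virtual spares.

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented types for illustration only. */
typedef enum { TIER_DATA, TIER_METADATA } tier_t;

typedef struct spare_info {
	uint64_t size;		/* capacity in bytes */
	tier_t	 tier;		/* tier this spare is reserved for */
	bool	 is_virtual;	/* dRAID distributed (virtual) spare */
	bool	 in_use;
} spare_info_t;

typedef struct faulted_vdev {
	uint64_t size;
	tier_t	 tier;
	bool	 in_draid;	/* member of a dRAID top-level vdev */
} faulted_vdev_t;

/* Accept a spare only when it is free, big enough, serves the same
 * tier, and matches the vdev's dRAID-ness, instead of blindly trying
 * spares in array order. */
static bool
spare_matches(const spare_info_t *sp, const faulted_vdev_t *fv)
{
	if (sp->in_use || sp->size < fv->size)
		return (false);
	if (sp->tier != fv->tier)
		return (false);
	if (sp->is_virtual != fv->in_draid)
		return (false);
	return (true);
}
```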
B
OK, so then there's this other agent, which I took the liberty of calling the disk-add agent. In Solaris it's called zfs_mod, and, you know, that just doesn't resonate very well. It came out of the syseventd loadable modules framework, and for various reasons it was implemented that way, and that's fine — but in essence it is an agent, so I took the liberty of calling it an agent. It basically will consume the disk monitor events: if you get a disk add, the disk agent will see a new disk and try to match it against any imported pool to see if it's missing. The most obvious case is onlining: if you had a late responder during your import, or a disk went offline because somebody unplugged it and pushed it back in, this agent will actually see that disk, notice that it belongs to a certain pool, and place it back online.
B
So it's, like, actually magic. You can take a random disk out of the pool, stick in a totally different disk, and ZFS will automatically make it good; as soon as the resilver's done, it's a brand-new disk in there, without having to actually sit at the command line and type, you know, offline and replace and all that.
C
Excellent. OK, so the support in FreeBSD I'm going to talk about was developed by Spectra Logic. I'm no longer working at Spectra Logic, and neither is Will Andrews, but Alan Somers is there, and the poor soul is trying to upstream all the work that we did while we were there — you'll see pull requests from him in illumos. In fact, I saw some activity from him, I think, today.
C
OK, so, as Don talked about, there's kind of a simple description of what you want this event daemon, or this fault management system, to be able to do. It's supposed to be able to detect drives that are in a bad state, which could be either degraded or faulted. In ZFS, degraded means that the kernel will continue to use the device — it will continue to read and write to it — but it's kind of like a notification to the fault management system that this is an ailing drive that should be retired.
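A minimal sketch of that distinction, with invented helper names (the real states in ZFS are VDEV_STATE_DEGRADED and VDEV_STATE_FAULTED, where a faulted vdev is taken out of service entirely):

```c
#include <stdbool.h>

/* Simplified vdev health model for illustration only:
 *   DEGRADED - kernel keeps reading and writing, but the fault
 *              management system is on notice to retire the drive;
 *   FAULTED  - device is out of service and needs a replacement. */
typedef enum { VD_HEALTHY, VD_DEGRADED, VD_FAULTED } vd_state_t;

/* The kernel still issues I/O to healthy and degraded vdevs. */
static bool
vdev_usable(vd_state_t st)
{
	return (st != VD_FAULTED);
}

/* Both ailing and dead drives should get a spare, if one matches. */
static bool
vdev_wants_spare(vd_state_t st)
{
	return (st == VD_DEGRADED || st == VD_FAULTED);
}
```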
C
A faulted drive, though, is basically taken out of service. Whenever we detect these drives, and assuming that we have spares available, we want to activate spares for them. If somebody decides to pull a drive out and you put a different drive in, or shove the same drive back in, we want to detect those events and make sure that the drive comes back into the pool. And then there are these situations where we have, like, replace-by-physical-path, which is this magical thing that Tom was talking about.
C
Where you have an array that has physical path information, you pull out a drive and you stick in another drive that is of similar capabilities and similar size, and the system basically just detects that, decides to integrate it into the array, and brings it online. This last part, deactivating a spare after a successful resilver: that's the case where you have activated a hot spare and then, for instance, you went and removed the original failing device and replaced it with a nice healthy device.
C
The system resilvers back onto the original device in its original physical location, and at that point we can return that spare to the spare pool — that's deactivating a spare. In most cases the deactivation is supposed to occur in the kernel but, as we found in the development of this work, there are cases where that doesn't happen, and I'll have more about that on a later slide.
C
So one of the challenges in trying to make this all work is that we have events that come from different places within the kernel. In fact, in the illumos/Solaris implementation you have system events, and then you have ereports — error reports — and they're kind of considered two separate namespaces, sometimes going through different systems. And yet, in order to be successful in managing one of these faults, you might need information from all of these different places, and you have to actually aggregate it correctly.
C
The block diagram here will give you a sense of the other systems in the FreeBSD kernel that we're using to get this information, and of some problem areas that occur because of the asynchronous and distributed system that we're trying to manage here. Things happen in the system, and in the ZFS case they could be coming from a user-initiated command — like adding a device, deleting a pool, manually onlining a device, or doing a replace.
C
So we have to actually aggregate this information, detect the fact that we can't act on it just yet, and then rely on another event that comes out of ZFS to let us know that we should try again. Then there's the case of hot device removal. Usually — because, you know, our users are special — that usually happens when the device is actively being used, right? So we're sending I/Os left and right to this device, and we pull it out.
C
Well, what happens is that it kind of looks — because we see maybe some I/O errors or whatever during that process — like this is a drive that's on its way out. Well, it is; it's just that it's completely leaving the building, not that it's dying. So we have to be careful about how we attribute those errors, to make sure that we don't decide this is a bad device that can't be brought back into the array when it returns.
C
Devices can degrade slowly over time, so we need to actually carry state, and this state needs to survive across reboots, pool import/export, things like that. And then events don't always happen, right? Say I've run the system for a while, I have some things that go bad, I do some imports and exports, whatever — events come flying at me — then I turn off the system and I turn it back on again. I'm at some basic state, and at that basic state I still need to know, without having the event stream, what things have failed.
C
What things do I need to take action on, even if they occurred while the system was powered off? For instance, something has returned the drive that went missing, or added spares, while the system was down. So in FreeBSD, the main systems that we're dealing with to be able to effect this fault management are, of course, ZFS, with its streams of sysevents and ereports. It also has a nice ioctl interface, which is what allows the user-space portion of the fault management system to effect change in the system.
C
We also have this thing called GEOM in FreeBSD; that's where we get physical path information. GEOM is kind of like a Lego brick system which allows you to take a raw device, or compose multiple raw devices, add partitioning, do volume management, all those types of things — the geometric nature of how it slices and dices and combines is why it's called GEOM — but we primarily use it for physical path information. And then devfs is the way that we detect that a device has come to or left the system.
All of that information, all of those events, goes through devctl(9). You can think of devctl as, like, the poor man's version of an event reporting system: essentially it predates JSON, it doesn't use XML, it basically is just strings of key/value pairs that happen to come out of this driver. It was primarily done back in the PC Card days, if you can remember those — so, like, you insert a modem or, you know, an ATA controller in your PC Card slot, and a script would run.
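For a sense of what consuming that stream looks like, here is a hedged sketch of parsing one such line of key=value string pairs; the event format shown is approximate, and the helper is invented:

```c
#include <string.h>

/* Pull one value out of a devctl-style event line, e.g. (approximate
 * format):  "!system=DEVFS subsystem=CDEV type=CREATE cdev=da4\n"
 * Naive sketch: assumes values contain no spaces.  Returns 0 on
 * success with the value copied into val. */
static int
event_get(const char *event, const char *key, char *val, size_t vlen)
{
	char buf[1024], *tok, *save;

	strlcpy(buf, event, sizeof(buf));
	for (tok = strtok_r(buf, " \n", &save); tok != NULL;
	    tok = strtok_r(NULL, " \n", &save)) {
		char *eq = strchr(tok, '=');
		if (eq == NULL)
			continue;
		*eq = '\0';
		/* The first token carries a leading '!'; skip it. */
		if (strcmp(tok[0] == '!' ? tok + 1 : tok, key) == 0) {
			strlcpy(val, eq + 1, vlen);
			return (0);
		}
	}
	return (-1);
}
```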
C
When we developed this at Spectra, we were thinking about doing a better event management system in the kernel but, as usually happens when you're trying to ship a product, you get to the point where things work well enough within your product and then you get busy working on something else. But anyway, that stream of string data is what you can really think of as an event — just key/value pairs of string data — coming through the user/kernel boundary through devctl. devd is the generic daemon that can, like, fork off scripts if certain things happen.
But we can actually subscribe to the event stream by also looking at its named pipe, and that's how zfsd enters the system. libdevdctl — the thing at the top of zfsd — is our attempt to abstract away the event stream, in its current format, from what zfsd has to do, again hoping that in the future some better event stream would become available. Unlike in other systems, this is not a pub/sub stream.
Basically, you connect to that named pipe and you get all the events, so zfsd has to filter, and it has to deal with the fact that if it's too slow, it could lose events. And then, at the very bottom of zfsd, we basically, you know, use the normal libzfs and libzfs_core libraries to be able to make our changes in the system. The case-file repository is where we store information about devices that are in a slowly degrading state.
C
So in FreeBSD, essentially, we sit in a loop until we basically get a clean scan of the system without new events that might change our view — our worldview — and then we enter a normal event loop, where we can process events from these different systems that I mentioned before. Because of the event flood that can occur, essentially we have to have this — the yellow diamond up there.
C
We have to notice if an event has been dropped in our event stream, and essentially we modified devd, the daemon that feeds us data, to make sure that it would always close our pipe if it couldn't buffer an event for us. So essentially, if our stream gets closed, we have to go back to the main loop again, resynchronize, and make sure that our worldview is correct.
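The connect-and-resynchronize shape of that loop might look roughly like this; a sketch only, assuming devd's conventional /var/run/devd.pipe local-domain socket and two invented helpers (rescan_system, process_event):

```c
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void	rescan_system(void);		/* hypothetical: rebuild worldview */
void	process_event(const char *);	/* hypothetical event dispatcher */

/* Subscribe to devd's event socket; EOF means devd had to drop an
 * event on us (the modified devd closes the pipe rather than drop
 * silently), so loop back, reconnect, and rescan. */
static void
event_loop(void)
{
	for (;;) {
		struct sockaddr_un sun = { .sun_family = AF_UNIX };
		int fd;

		if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
			return;
		strlcpy(sun.sun_path, "/var/run/devd.pipe",
		    sizeof(sun.sun_path));
		if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) != 0) {
			close(fd);
			sleep(1);
			continue;
		}
		rescan_system();	/* clean scan before trusting events */

		FILE *fp = fdopen(fd, "r");
		char line[8192];
		while (fgets(line, sizeof(line), fp) != NULL)
			process_event(line);
		fclose(fp);		/* stream closed: events were lost */
	}
}
```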
C
The main way that we deal with, or detect, errors is through this path here: ZFS emits a vdev ereport event, and then we go and, once we have all the information in the case file, we can try to evaluate the case and see if there's something we can do. So if we have already decided to degrade the device, we can online a spare. But in most cases what happens is, if we have, like, just a single I/O error, we'll open up a case file, it will sit there, and we'll go on our merry way.
C
They could do things like that, and essentially what we do is take the pool GUID, and we can use that to look for all case files that are about vdevs on that pool, and look to see if we can now, with the new state of the pool, do something to rectify that particular problem.
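That re-evaluation walk, in a hedged sketch — the case-file types and functions below are invented stand-ins for zfsd's internal repository, shown only to make the pool-GUID lookup concrete:

```c
#include <stddef.h>
#include <stdint.h>

/* Invented stand-ins for zfsd's case-file repository. */
typedef struct case_file {
	uint64_t	 pool_guid;
	uint64_t	 vdev_guid;
	struct case_file *next;
} case_file_t;

extern case_file_t *case_list;			/* all open cases */
extern void case_reevaluate(case_file_t *);	/* try to act on a case */

/* A pool-level event arrived: revisit every open case about a vdev in
 * that pool, since the pool's new state may let us act now. */
static void
cases_for_pool(uint64_t pool_guid)
{
	for (case_file_t *cf = case_list; cf != NULL; cf = cf->next) {
		if (cf->pool_guid == pool_guid)
			case_reevaluate(cf);
	}
}
```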
C
And then the last two, devfs and GEOM: that's basically where we detect that devices have departed or arrived. On an arrival, the devfs handlers and the GEOM handlers actually have to kind of cross-check each other: if a device arrives and it has physical path information — well, great, we can do all the checking that we do with physical path information; if not, hopefully there'll be a GEOM event later that will tell us exactly where it lives in our chassis.
C
So we ran into a couple of things while we were doing this work. Probably the biggest issues that we found were things around the way that spares are kind of bolted into the system. Both spare and aux devices behave a little bit differently than other vdevs that are part of a config: they're attributed to a pool through a MOS object; they're not in the main label information.
C
So a lot of the things that we end up doing that would affect something like a RAID-Z vdev or something, we just can't do for spares and aux devices. And, you know, Alan is probably the best person to ask about all the different, kind of torturous, use cases that we ran to be able to make these things break. But you can imagine things like: you have a pool, you export it, you move it into another chassis, maybe some devices get mixed around, including a spare.
C
You activate a spare on a new pool — because that's certainly allowed; there's nothing that stops a spare from being added to another pool — it gets activated, and then maybe it fails. You can wind up in situations where multiple pools have pointers to devices, some of which might not even be in the system anymore, because of the way that this accounting is done.
C
The solution for Spectra was to only activate spares — only call things a spare — when we decided we could actually use a spare. So we don't use a global spare pool; in a Spectra appliance that's basically maintained by the management software in the appliance. When we decide that something needs to be spared, we call it a spare just then, add it to ZFS, add it to a pool, and then the rest of zfsd takes effect.
C
The one in the middle there, the RAID-Z of spare or replacing mirrors, is also another interesting one. This one's pretty hard to hit, but essentially, in ZFS, the expectation is that if you have a pool that has parity information, you can always recover — by doing some kind of, you know, read from the other vdev, or reconstruct from parity, or whatever.
C
That ability lives at the top-level vdev layer, and for this reason you usually can't make, like, a RAID-Z of mirrors — but you can in this case, and when you do, if you try to read from a spare that reports bad data, essentially there's not enough information in the stack to be able to force the read of the other member, which is kind of surprising. And lastly, what's missing inside zfsd in FreeBSD is most of the stuff that Don talked about; this was basically enough to get a product out the door.
C
We don't have things like the ability to take SMART data and use it in a diagnosis engine. We don't have the ability to detect differences in performance between peers, to notice that a device is slowly degrading but perhaps not throwing errors yet. Physical path replacement works great as long as you have no partitions, and in an appliance you can do that; but a lot of times users like to have GPT partitions and things like that, and if you don't have a really good physical path provider, you can't even do this magical spare replacement.
B
Yeah, so I'll just go through this quick — we're running a little bit short on time — but on Linux we had this thing called ZED; I referred to it earlier. Basically it's an event monitor in user space: it'll collect, or watch for, events and send them out to any of these things we call ZEDLETs that listen for, or that have subscribed to, a certain event class, and then they can perform an action — typically send an email or do something like that.
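The subscription idea is simple prefix matching on the event class; here is a hedged sketch — an illustration, not ZED's actual matching code, with the "all" convention mirroring how stock ZEDLETs are named:

```c
#include <stdbool.h>
#include <string.h>

/* Does a ZEDLET subscribed to `subscribed` want an event whose class
 * is `event_class` (e.g. "sysevent.fs.zfs.resilver_finish")?
 * "all" subscribes to everything; otherwise match a class prefix on
 * a '.' boundary.  Illustration only. */
static bool
zedlet_wants(const char *subscribed, const char *event_class)
{
	size_t n = strlen(subscribed);

	if (strcmp(subscribed, "all") == 0)
		return (true);
	return (strncmp(subscribed, event_class, n) == 0 &&
	    (event_class[n] == '\0' || event_class[n] == '.'));
}
```

In ZED itself, the action side is a script that gets forked off with the event's name/value pairs exported in its environment.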
B
There's no diagnosis in the current state. However, we've had some recent developments, and we've sort of expanded the mission and moved the FMA logic into ZED itself. So rather than, like on Solaris, where they run as separate plugins, we just basically bake that logic into that process, so it can actually do the equivalent work that I referenced earlier with the retire agent, a diagnosis agent, and an add agent. And so that's what we've been working on lately.
B
I think I have a — well, so all the different platforms essentially have to do the same thing: they have to see a disk. This is just sort of the basic schema that we've implemented, and it matches exactly what you would expect on illumos; basically, you have all the keys you need to make a diagnosis. So some of this stuff is landing.
B
We landed that Phase 1 work there, which is basically doing the auto-replace and auto-expand, along with that disk-add agent and then, of course, the disk monitor itself, and we have some other stuff coming up in the future. We're doing a work-in-progress — if anybody's interested in helping out there, see me later today or tomorrow. And Justin touched on this:
B
We really want the diagnosis agent to be smarter, so we need input from people on what they've seen — what are good metrics to look at to make it smarter. And that's pretty much all I had. I have some resources here; you can refer to them later, but basically, the fault management — it might…
B
I'll just go back to that, yeah. So basically, if there are any questions — but really, you know, we're really looking for input from people on how to make a better diagnosis of when a resource is going bad. One of the other things we need to consider, too, is: if you had top-level vdevs that were each, like, a single vdev — maybe backed, you know, by some hardware RAID or something like that — you don't want to necessarily look at the errors and say, well, I've had 15 errors, you're gone, when it's like…
B
That's a good question; that's actually a work in progress. We've actually — oh, I'm sorry, the question is: how do you automate this when there are a lot of hardware dependencies in this stack? That's a good question. We've actually been able to simulate removing and adding devices, in Linux at least, and so we've automated the testing of auto-online, and that's a start.
C
In the case of what Spectra did: since all of our enclosures have at least an expander, we did most of our simulated device faults by disabling PHYs on expanders, to be able to take out drives basically without the drive knowing that it was going to lose connectivity. We also had, from a previous-generation product, the ability to actually de-power drives on demand programmatically, and so we could do things like that. So you can imagine tests where you take out a portion of the drives.
C
Maybe you export the pool, you bring them all back, you try to do an import — all of those types of permutations were possible from a programmatic standpoint. But because we were doing it with the real hardware, it requires, you know, a chassis with a significant number of slots, and software that will allow you to play with the PHY status. The ability to change the PHY status, though, is all upstreamed in FreeBSD; you can do it with the CAM utilities.
C
We had this pie-in-the-sky idea of being able to do it with target-mode simulation, because we also had another guy on staff who did a lot of target-mode stuff, but we just never found the time to do that. It'd be nice if you could basically just do the injection down there — you know, injecting SCSI ASC/ASCQ codes or whatever to simulate that a particular I/O failed.
C
Right, right. So the question was: is it possible that, through very special activation of spares, you could wind up with a pool where all of its vdevs are now bigger, and what prevents that from turning into an auto-expand? I think that is an optional feature, right? You can turn off auto-expand.
C
Yeah, so just for the recording: essentially, autoexpand requires a certain minimum amount of space before it will do the expand — it has to be at least enough to be able to do another metaslab — and so if it's just a small difference, you're not going to run into that problem. And it is still an opt-in feature to have this happen.
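As a sketch of that guard (illustrative names only; the OpenZFS code expresses this inside the metaslab/vdev layer rather than as one function):

```c
#include <stdbool.h>
#include <stdint.h>

/* Only grow into newly available capacity when (a) the pool has opted
 * in (the autoexpand property) and (b) the growth is at least one more
 * metaslab's worth of space -- small differences are ignored. */
static bool
should_autoexpand(uint64_t old_size, uint64_t new_size,
    uint64_t metaslab_size, bool autoexpand_enabled)
{
	return (autoexpand_enabled &&
	    new_size >= old_size + metaslab_size);
}
```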
C
OK, so there were two questions there. One was: what do you do in the face of SAS instability — SAS flapping around and injecting errors that aren't really related to the vdevs? In that case, I mean, I guess I'd need to see the failure; we chained up a lot of fairly large systems, but didn't actually see, I guess, the type of errors that you saw as we tuned.
C
The point that I think Don's been trying to make is that in these systems we have this framework for being able to do diagnosis and recovery, but it's just a basic framework — it's not complete, just as you found. And then what was the second one? Oh, right, enclosure affinity: we didn't have a really good answer for that at Spectra.
C
Essentially, what we did in our management platform is essentially what zfsd does. So the question was: if you were to add SMART information into the diagnosis engine — in this case in FreeBSD, or wherever — would you have to replicate the type of dictionaries and information that's recorded inside the SMART tools? As far as the failure system is concerned, it just needs the diagnosis, the output, right? It just needs to know: should I replace this drive or not? And so you just need a binary answer out of the SMART tools.