From YouTube: OpenShift Commons AIOps SIG Talk on DiskProphet: Disk Health Prediction, Brian Jeng (ProphetStor)
Description
DiskProphet: Disk Health Prediction, Brian Jeng (ProphetStor)
Recorded on March 25, 2019 at the OpenShift Commons AIOps SIG
So, I'm Brian, I'm an SI over at ProphetStor, and I'm going to be talking about DiskProphet. What DiskProphet does is use machine learning AI to predict future disk failures up to six weeks in advance. Additionally, we can also predict performance and capacity for up to 90 days into the future, as well as give correlation for the effects of a disk failure between the node, application, and cluster. And our biggest use case right now is for Ceph.
So in 2016 we partnered with another big company, one that actually presented at Red Hat Storage Day in Seattle, that wanted to do a petabyte Ceph cluster for OpenStack cloud, and they found there were three major stability issues with a Ceph cluster that were sort of blocking their project. The first one was that every time a disk failed, or, you know, an OSD failed, the CRUSH map would change, which would cause placement group peering and backfilling, or the cluster would rebalance to heal itself.
But it essentially did the same thing: we could predict disk failures six weeks in advance. And then they drew out all this architecture stuff, but the most important thing is this graph at the bottom right. You can see that there's a normal workload here of around 400 or so IOPS, and then, when they simulated a disk failure by just pulling a disk, they found that the cluster performance dropped below 200, so they dropped around 40 to 50 percent of IOPS, and it persisted that way.
Oh sorry, it persisted that way for the whole duration of the test, so 800 minutes, around 12 hours or so. Versus with our disk prediction: you can see that, by being able to know a disk is about to fail in advance, we can take pre-emptive measures. We can disable the cluster rebalancing, and then we can remove the disk and replace it within an hour, and the performance goes back up in a fraction of the time, right?
And then the same company tested our prediction engine against 20,000 drives over the course of 90 days, and they found that we had an accuracy rate of 96% and a recall rate of 97%. And the recall rate is actually the more important statistic here: it's the number of correctly predicted failed disks over the total number of failed disks. So out of every 100 disks that failed, we would correctly predict 97 of them.
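The recall figure quoted here can be reproduced with simple arithmetic; a minimal sketch, with the counts (97 correctly predicted failures, 3 missed) taken from the 97-out-of-100 example above:

```shell
# Recall = true positives / (true positives + false negatives).
# Counts follow the 97-out-of-100 example from the talk.
awk -v tp=97 -v fn=3 'BEGIN { printf "recall = %.0f%%\n", 100 * tp / (tp + fn) }'
```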
And then this just shows that we're already integrated in the Ceph community; we're called the diskprediction plugin.
You can just enable us through the manager daemon, and then you can just use Ceph-native commands to access our predictions.
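As a sketch of what that workflow looks like on a Nautilus-era cluster (module and command names as documented for that release; they may differ on other versions, and `<devid>` is a placeholder for a real device ID):

```shell
# Enable the built-in disk-prediction manager module
ceph mgr module enable diskprediction_local

# Ceph-native commands for device health and predictions
ceph device ls                               # list monitored devices
ceph device get-health-metrics <devid>       # raw SMART data for one device
ceph device predict-life-expectancy <devid>  # the module's failure prediction
```

These commands require a running Ceph cluster with an active manager daemon.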
So we released with Nautilus; for older versions of Ceph, you would use this one-line installation, and then you can use that with Ansible, Chef, Puppet, any kind of automation software, to make it simple for mass deployment.
And our biggest account right now is actually in Michigan. There are three universities, Wayne State, Michigan State, and the University of Michigan, and what their setup is, all three of these campuses share a single giant Ceph cluster, and they put all their research data on this Ceph cluster. So they have to make this Ceph cluster as resilient as possible, and what we provide is just the disk predictions, allowing them to monitor the health of their disks before they fail. All right.
How many are bad, are going to fail in less than two weeks, less than six weeks; and you can go to the disk health list here to get a list of every single disk that's being monitored. And then you have all the unique identifiers, and, you know, the size, the serial number, the vendor, all over here, so you can easily identify the disks.
This would be where you would go for the disk details. And then, as we alluded to earlier, we also have prediction for capacity and performance. So over here we have cluster capacity, but we also go down to the OSD level. I'll just use pools, because it's more interesting. And then we can predict future capacity for up to the next ninety days, but of course this depends on how much data you have, so the general rule of thumb is, for every cycle that we predict...
One interesting thing to note is that you can run the prediction in the cloud or locally, so you could also run this setup in a completely non-software-as-a-service environment as well. Yeah.
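On Nautilus, that cloud-versus-local choice surfaced as two flavors of the manager module (names per that release's documentation; later Ceph releases dropped the cloud variant):

```shell
# Local: predictions computed on the cluster itself, no SaaS dependency
ceph mgr module enable diskprediction_local

# Cloud: ship SMART metrics to the DiskProphet back end for prediction
ceph mgr module enable diskprediction_cloud
```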
Because, if they wanted like a lightweight version of our predictor, then we just gave them one with less baggage, that would be only 70% accurate, that they could enable locally. But it wouldn't use all the metrics that were provided for the prediction. It was requested by them to have a local, lightweight package. Okay, yeah.