From YouTube: Venkat Kolli -- Ceph on All-Flash Storage
The use cases that we designed this for are primarily very large capacity workloads, be it Big Data or large content repositories. Typically, when you see an all-flash storage system, you tend to see it in a small, very limited capacity but with really high performance. That's not what we are addressing with our flash. We are really going after very large capacity, scale-out systems like Ceph. Typically that's deployed on hard drives, but we think we have a cost point, and we're able to bring the cost down to be competitive enough to offer the flash advantages for these large capacity workloads. So these, the Big Data, large media repositories, etcetera, are what we really designed the system for. So again, a very brief spec and overview of the system before we actually go talk about Ceph and how Ceph works with flash. This InfiniFlash is the system.
It's able to pack 512 terabytes of flash in a 3U chassis. It's very high density and very large capacity, and you don't necessarily need to use all 512 terabytes, especially when you use it in a scale-out cluster. Typically, most customers don't pack it up to full capacity on a single node, because you're going to be scaling out, so you can start off with smaller capacities for each InfiniFlash node and scale out.
At the same time, there are customers who are looking at this for multi-petabyte deployments. In that case the density, 512 terabytes in 3U, becomes critical for them. So that's what it is designed for, and the way we achieve that is with these cards. These are not SSDs. These are specially designed cards, built specifically to achieve that scalability and density. It is SAS-based flash, and you can see that there is a huge capacitor bank at the bottom.
That gives you the power-fail protection capabilities of a full enterprise-grade SAS-based flash device.
The one thing that we purposefully designed is to not include the servers in it, not include the compute in it. The reason for that is we want to make it a more disaggregated solution, and that is primarily important for Ceph because Ceph could be used for multiple workloads and different types of data, be it object data, block data, or, in the future, file data.
This is based on the current 6 gigabit backplane that we have, and soon, in a couple of weeks, we're going to be upgrading that to 12 gig; then the performance almost doubles. Now, please remember these numbers, and you'll see, when we talk about Ceph, how they relate to the performance that we are able to get out of the raw box.
We optimized Ceph, and in fact Ceph is so optimized there that it's almost getting the full raw performance out of the box, and that is quite interesting. Given the capabilities of Ceph, all the features and everything, that it is still able to fully utilize the box is quite remarkable, and that's something we're very proud to be involved in, making Ceph quite efficient that way. And just very quickly, from the availability standpoint: this comes up all the time, because when you have these large systems, you want to know what the data availability is.
Flash obviously is a lot more reliable than hard drives, but also every single component in this box is completely hot swappable, and you can have it fully redundant. You see that with the fan trays and the PSU configuration in the back; you can see it here. Again, you have multiple redundancy against failure here, and the MTBF, the mean time between failures, is about 1.5 million hours.
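As a quick aside on how an MTBF figure translates into a yearly failure probability (my own back-of-envelope, not from the talk), a constant-failure-rate assumption gives:

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760

def annualized_failure_rate(mtbf_hours: float) -> float:
    """AFR under a constant (exponential) failure-rate assumption:
    probability that a unit fails within one year of operation."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# the 1.5 million hour MTBF quoted for the chassis
print(f"chassis AFR: {annualized_failure_rate(1.5e6):.2%}")  # ~0.58%/year
```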
That is quite high, and again it's because it's an all-flash solution. So that's basically what the InfiniFlash system is. Coming to what we have done with Ceph: again, we sell this as a solution with Ceph. We have fully optimized and tuned Ceph configurations that come with the system. And again, one thing that we really want to emphasize is that SanDisk, from very early on, has been committed to open source.
Anything that we do with Ceph is one hundred percent committed back to open source. We use the open source Ceph; we're not making any proprietary internal branches or any proprietary extensions of Ceph. Everything you'll run on InfiniFlash is the open source Ceph, and it's a good thing that we make our living selling the boxes, not the software, so we don't have any reason to do otherwise. We have been fully committed from very early on.
Imagine our surprise: we were proud of this big, huge all-flash system, and it was doing no better than a hard drive system. I think at the time we were able to get about 10 to 15K IOPS with the SSD journals and hard drives.
Even when you put that in an all-flash solution with a similar configuration, it's really not that much different. So one thing that we quickly realized is that there's a lot of optimization required within Ceph, and also a lot of tuning with the hardware is required, because most of the deployments we had seen at the time, and most of the experience in the community at the time, was Ceph optimized for hard drives, not for all-flash. Yes, SSDs have been used for journals from very early on, but nobody had it as an all-flash solution.
When we started looking at it, we quickly determined that it's really the OSDs that are becoming the bottleneck here. And by the way, I know some of you are new to Ceph, so if some of the terms that we'll be using here are unfamiliar to you, please feel free to stop us. But mostly, these talks are generally oriented to the community, to people who are users of Ceph today.
So if there's anything that we need to slow down and clarify, please do stop us. One thing we found when we started this is that the OSDs were turning out to be the bottleneck. We were able to get about a thousand IOPS around that time, and that's using about 4.5 cores per OSD. So we thought about another direction: should we be using more OSDs, as in multiple OSDs per SSD, or in our case per flash card? Remember, our flash card is terabytes in capacity; it's a huge capacity card that we have.
So we thought about using more OSDs per card, but we quickly moved away from that idea, not just because it is not optimal, but because having the failure domains and the CRUSH rules match that is going to be a nightmare. You will not like it, and it is going to be very hard to manage. So in the interest of the usability and the manageability of the solution, we quickly moved away from that idea, and we started working on enhancing the data path, primarily focused on the reads at that time. When we started this, we thought that there was a lot of room for improvement, especially, and more impactfully, on the reads, and I'll talk in the roadmap about what we are doing now with the writes.
The data that I'm going to show the numbers from today is primarily with these read changes and without the write changes yet; the write changes are going to go into Jewel, the upcoming release. On the read side, what we found out is that there's a lot of context switching happening in Ceph, and it did not matter much when you were using hard drives, because the latencies of hard drives are pretty high, so no matter how much context switching was happening, it didn't matter much.
But when you put it on a low-latency medium like flash, that context switching becomes very expensive. So we took on a lot of queuing optimization, and we removed a lot of the lock contention, making the locking more granular, so we could speed up a lot of this. Doing a lot of lock optimization and queuing optimization really made a lot of difference everywhere in Ceph. The other aspect is the socket handling.
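To illustrate what "making the locking more granular" means in general (a minimal sketch in Python, not the actual Ceph C++ changes): instead of one coarse lock serializing every operation, the structure is sharded and only the shard a key hashes to gets locked, so unrelated operations stop contending.

```python
import threading

class ShardedMap:
    """Sketch of lock sharding: unrelated keys rarely share a lock."""
    def __init__(self, num_shards: int = 16):
        self._shards = [dict() for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _index(self, key) -> int:
        return hash(key) % len(self._shards)

    def put(self, key, value) -> None:
        i = self._index(key)
        with self._locks[i]:        # lock one shard, not the whole map
            self._shards[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].get(key, default)
```

With one global lock, every reader and writer serializes; with sharding, contention drops roughly by the shard count for uniformly distributed keys.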
There are many other things, a whole bunch of things, that have been done to get to the performance I'm going to talk about in a few minutes: primarily enhancing the cache lookups, and everything regarding handling the copy mechanisms, all of them.
A
So
there
are
many
minor
things
that
went
in
that
really
made
it
quite
a
bit
of
a
difference,
and
the
net
result
is
that
so,
with
the
current
testing
that
we
are
done
right,
which
is
currently,
we
are
testing
it
on
hammer.
We
are
able
to
get
to
or
80
ki
ops,
/
OSD
remember.
This
is
when
we
started
this
about
thousand
die
offs
is
where
we
were
at.
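For scale, a bit of arithmetic on those two numbers (the cores-per-OSD figure at 80K IOPS is my assumption, carried over from the earlier measurement):

```python
baseline_iops_per_osd = 1_000   # where they started
tuned_iops_per_osd = 80_000     # on Hammer, after the read-path work
cores_per_osd = 4.5             # quoted for the baseline; assumed unchanged

print(f"improvement: {tuned_iops_per_osd / baseline_iops_per_osd:.0f}x")
print(f"baseline IOPS/core: {baseline_iops_per_osd / cores_per_osd:,.0f}")
print(f"tuned IOPS/core:    {tuned_iops_per_osd / cores_per_osd:,.0f}")
```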
The more interesting thing is really looking at the CPU usage: it has become very, very optimal in terms of CPU usage as well. This is one thing that you'll find when you see the performance numbers coming with flash: you'll quickly realize that now the CPU matters a lot, because with flash you completely remove the bottlenecks from the storage media, and all of that bubbles up; the bottleneck is now actually the CPU, especially with small blocks, especially if you're going to be using something like OpenStack workloads, which is 4K writes or 4K I/Os. This is where the CPU matters a lot, and I'll show you a couple of numbers that show how much difference the CPU makes; I'm pretty sure that makes Intel very happy.
So most of these changes, the read performance changes, have gone in through these releases. Most of them are in Giant, and Hammer obviously inherited everything from the Giant release, so if you're going to be using Hammer, you will see the same performance coming out of Ceph when you deploy it on our flash.
So, quickly getting to the numbers, talking about the system itself, the one that we used: the one thing that we wanted to compare is how Ceph would perform without any tuning versus with all the tuning and all the changes that we have done for flash. In this test configuration we used the InfiniFlash IF100; that is the model number for the InfiniFlash system.
This is a 512 terabyte, fully populated system, and we used two OSD nodes. These are dual socket with 12 cores each, so 24 physical cores for each OSD node, and four block drivers, or RBD clients, running against this, using a 40 gig switch. There is a little more detail on this slide, just if you want to know the configuration that was used for this particular test, so you have it for your reference. So, very quickly, this is a test that we have done with Giant.
A
This
is
a
very
first
release
that,
with
all
the
changes
that
I
mentioned
in
the
previous
flight
and
a
couple
of
slides
ago
right,
this
is
the
the
net
result
of
what
we
were
able
to
get
right
now
to
compare
this
again.
This
is
a
lot
more
lot
of
data
in
here,
but
let
me
just
quickly
walk
through
what
it
is.
So
this
is
done
for
an
8k,
a
random
read,
io
workload
right.
What we are really comparing between the red bars and the blue bars is essentially taking Ceph as it is, without any sort of tuning, without making any changes, with just the defaults, against the tuned setup on the same hardware. Both of those tests are run on InfiniFlash; one is without any tuning, and the other is essentially with the complete tuning and with all the changes that went in.
If you look at the read side, the blue bars are basically what you get without any sort of tuning on all-flash, and the red bar that you see there is about 250K IOPS, compared to about 10 to 15K IOPS. So the net lesson that I want you to take away is that tuning Ceph for the particular hardware matters a lot, and that's especially true with a high performance system like flash.
So you really need to figure out exactly how, and I'll talk about some of the tuning parameters that we used that made a lot of the difference here. Essentially, having Ceph tuned for the particular hardware that you're using makes a huge, tremendous amount of difference. I don't think this is going to be quite as true with the current release, the Hammer release, because with Hammer a lot of changes went into the defaults as well.
The difference is not going to be this dramatic, but it still is going to be highly impactful, so that's something that you would have to consider. The same thing holds for the latency as well: the multiple bars that you're seeing are for different queue depths, and obviously, using an all-flash system like InfiniFlash, a large high performance system, you get much better latency, even at a 16 queue depth.
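One way to reason about those queue-depth bars (my own sketch, not slide data) is Little's Law: sustained IOPS times mean latency equals the outstanding queue depth, so at a fixed IOPS level, deeper queues necessarily mean proportionally higher mean latency.

```python
def mean_latency_ms(iops: float, queue_depth: int) -> float:
    """Little's Law: concurrency = throughput * mean latency."""
    return queue_depth / iops * 1000.0

for qd in (1, 4, 16, 64):
    print(f"QD {qd:2d} at 250K IOPS -> ~{mean_latency_ms(250_000, qd):.2f} ms mean")
```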
Okay, very quickly: the other thing that I really want to show, as a showcase test, is how it behaves as a cluster. How does the performance scale when you put it on multiple InfiniFlash nodes? In this case we are taking the same capacity, the 512 terabytes that I showed you earlier, and now we split it into different nodes in a cluster, because, after all, at the end of the day, this is a scale-out cluster.
We use the same servers, so the key difference to note is that in the previous case you had the full 512 terabyte capacity running with the two OSD nodes and a few RBD client nodes against one InfiniFlash system. Now you actually have six OSD nodes powering this cluster. At the end it's basically 385 terabytes, the total capacity of this cluster, with the six OSD nodes, and running about five gateway nodes for clients.
That's the same configuration and the same type of servers running this. Now, if you look at the performance that you get out of it, running the same 4K blocks, we are now getting almost 900K IOPS on a 385 terabyte capacity cluster. If you recall that million IOPS figure at the raw box level, we're almost there running Ceph, getting that performance out of it.
The big difference, however, really is that we spread that same capacity out into different InfiniFlash nodes rather than measuring it on a single box. But still, if you look at per-terabyte performance, with Ceph you can actually get to almost a million IOPS with a 385 terabyte flash cluster. So that's basically the net of all the optimizations and all the changes that went into Ceph, and how it is able to perform.
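Dividing those cluster numbers out (using my reading of the node counts above):

```python
cluster_iops = 900_000   # 4K random, as quoted
osd_nodes = 6
capacity_tb = 385

print(f"IOPS per OSD node: {cluster_iops / osd_nodes:,.0f}")
print(f"IOPS per terabyte: {cluster_iops / capacity_tb:,.0f}")
```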
Okay. Just a couple of other points from the latency standpoint: we are averaging around two milliseconds of latency with Ceph at the 4K blocks, and if you look at the two nines consistency latency, it's around 10 milliseconds; three nines is about 20 milliseconds, two nines is around 10 milliseconds.
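For anyone newer to the "two nines / three nines" phrasing: those are the 99th and 99.9th latency percentiles. A minimal sketch of extracting them from per-IO samples (synthetic data here, purely illustrative):

```python
import random
import statistics

def percentile(samples, p):
    s = sorted(samples)
    k = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[k]

random.seed(0)
# synthetic stand-in for benchmark samples: ~2 ms mean with a long tail
latencies_ms = [random.lognormvariate(0.5, 0.7) for _ in range(100_000)]

print(f"mean : {statistics.mean(latencies_ms):5.2f} ms")
print(f"p99  : {percentile(latencies_ms, 99):5.2f} ms")   # "two nines"
print(f"p99.9: {percentile(latencies_ms, 99.9):5.2f} ms") # "three nines"
```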
One thing, obviously, as I mentioned before: if you're working with a small block workload, the CPU makes a huge amount of difference. We are still CPU bound at the 4K blocks, so if you increase the number of cores or the number of servers powering this, you'll actually get a much higher improvement in the IOPS of the solution. But obviously, as you get to the larger blocks, the CPU doesn't matter as much.
Now it's the bandwidth; that's where the network starts getting critical. We are testing this with the 40 gig, and as you get to a 64K block workload or anything higher than that, 40 gig almost becomes the requirement for this kind of solution, or else you really will be severely constrained at the network level.
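The block-size crossover is easy to sanity check: required network bandwidth is block size times IOPS (the IOPS figures below other than the 4K one are made up for illustration):

```python
def gbit_per_s(block_kib: int, iops: float) -> float:
    return block_kib * 1024 * 8 * iops / 1e9

for block_kib, iops in [(4, 900_000), (64, 200_000), (256, 60_000)]:
    print(f"{block_kib:3d} KiB x {iops:>7,} IOPS = "
          f"{gbit_per_s(block_kib, iops):6.1f} Gbit/s")
# even 4K at 900K IOPS is ~29 Gbit/s; 64K workloads blow far past 10 GbE
```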
We have a lot of customers that deploy 10 gig, and again, we'll talk about the network in the next session, but one thing that you should know is that for most typical workloads, when you're using a high performance medium like flash, the network really needs to be sized properly, and 40 gig is going to become a critical requirement once you reach a certain capacity. Okay, so moving on from the reads to the writes.
The write path still is a big constraint on write performance, and when you put it on all-flash, that becomes a major issue; we saw a lot of spikiness. The first thing we tried was to use NVRAM front-ending the flash, to make the journals even more efficient.
Typically the strategy is that you put the data on hard drives and you put the journals on flash, on SSD. We took the next step of putting the data on the flash and putting the journals on NVRAM. The one thing that we found is that there is obviously some improvement there, but there's a lot of spikiness that comes with it.
That happens because of the big batch processing at the backend, so you don't get any consistent performance, and for those of you who have been running storage systems for long: inconsistent performance is worse than consistently bad performance. Most customers and most workloads would prefer to have some kind of consistent performance, even as bad as it is, rather than something that's very spiky and unpredictable.
So that approach quickly went out the window, and we went to work primarily to eliminate this kind of heavy batching. We modified the buffered writes; again, this primarily benefits flash, but most of the work went into figuring out how to handle those buffered writes. All of these changes are in the Jewel release right now, and in our early testing, what we found is that we are about 2.5 times, purely on the writes, above what Hammer is getting.
So when you see the Jewel release (again, this is all on all-flash, by the way; most of these things would not matter much when you deploy it on hard drives), you will see that the writes are going to be about 2.5 times where we are with Hammer, and the latency is also cut roughly in half. Okay, so there are a few other things, quickly.
What we are doing: we are working with Mellanox very closely on RDMA, so there's a significant reduction in CPU usage, even further than what you are seeing out there. Remember, as I said before, the CPU becomes the critical factor when you put Ceph on all-flash, so it becomes key for us to have some of these techniques, like RDMA, to minimize the contention. We're also working very closely on the new backend store, NewStore, to make more optimizations there. And SanDisk has developed a key-value store that's aimed at any open source system running on all-flash.
That key-value store can be used as a backend. One of the key things, also, that makes a huge amount of difference, although it is not quite part of the Ceph code, is the memory allocation of the underlying OS: TCMalloc, jemalloc, and the async messenger. These are some of the key tunables I mentioned earlier that make a huge amount of difference. Right now we apply most of these changes manually, and together they make a difference of almost 3x in the improvements.
So one of the things that we're going to be doing is working with the other Ceph providers; we primarily partner with Red Hat as one of our premier Ceph providers. We will make those settings the defaults, and make them available as tuning scripts, so when you're deploying it, it becomes easy and you don't have to do these things manually. Okay, so very quickly, to recap what we do with Ceph:
Our key focus with Ceph is to get highly tuned and highly optimized performance for all-flash, but also to make it a lot more usable when you deploy it on a large scale system like InfiniFlash. It basically starts off with open source Ceph, with all the changes that we have done, again part of the open source community Ceph, and with the SanDisk enhancements there are a whole bunch of things that we built around it.
That is primarily to make the installation easier. Patrick talked about ceph-deploy and the other new provisioning tools, so we are going to be working with the new provisioning tools. Currently, our installation is based on ceph-deploy; it's a modified, enhanced ceph-deploy that is specifically tuned for InfiniFlash.
A
There
is
to
make
self
more
consumable
and
more
easier
right,
getting
safe
more
to
beyond
the
smarter
folks,
like
you,
you
know
who
can
handle
this
by
yourself,
but
most
of
the
system
administrators
out
there
are
nearly
not
as
capable
and
that's
one
thing
that
scares
them
heavily
right.
When
you
look
into
self
as
powerful
as
it
is
to
you
know,
make
it
you
know
easily
digestible
and
easily.
Consumable
is
a
key
part
of
our
strategy
and
that's
true
with
the
Red
Hat
are
pretty
much
all
the
other
safe
providers
that
are
out
there
right.
There are many things that we are doing; I'm not going to walk through all of them, but they are primarily on the usability side and on the planning side: how you actually get the right configuration. Remember, a tuned configuration matters a lot for performance, so how do you get this out of the box using InfiniFlash? And lastly, the supportability aspects: a lot more log collection and diagnostics are built into the Ceph on InfiniFlash.
Our team is about 25 engineers focused purely on Ceph. Half of that is primarily working on the performance enhancements that I mentioned earlier, and the other half really is on the heavy amount of testing that we do, obviously for InfiniFlash; again, we are one of the early vendors of an all-flash system. So one of the things that we had to do is to really enhance the Teuthology test suite to make it more relevant for an all-flash system.
We did quite a bit of work there, and a lot of contributions from SanDisk happened on this automated Teuthology test suite as well, and we still continue to do that. There's a very heavy amount of testing that we do, and you see some of the numbers there in terms of the hardening, the scale testing, and the failure testing. This is, again, just to make it more enterprise-ready.
For the customers who just want to have a more assured solution with the hardware and software combined, that's basically what we're going to be providing. One of the other key differences with InfiniFlash, compared to what you see out there in terms of Ceph deployments, is that InfiniFlash differs slightly because, typically, when Ceph is deployed, it's deployed on converged nodes.
That is where you have the hard drives attached within a CPU complex, within a node, and you scale with those nodes, so you have the CPU and the drives and the storage more or less in a fixed ratio. It's much easier to deploy, because you just need to replicate the nodes, and Ceph can take care of all the balancing and everything. Now, coming to that hardware:
The key thing is that the very large scale customers are all moving away from this hyper-converged model, because one thing they find is that these converged nodes tend to get quickly unbalanced based on their workload characteristics. When you need more processing power, you have to bring in, add in, more nodes, adding more capacity with them even if you're not going to be using that capacity, and vice versa.
Sometimes you change your application policy to add more protection, so you need a lot more capacity, but you don't need all the processing power that goes with it, and at very large scale it becomes expensive to carry those resources that you're not using. So for them it is very important to tune the configuration very specifically to the amount of compute that they need, the amount of storage that they need, and the amount of network that they need.
Within the same hyper-converged cluster, it is not possible for you to tune your system to the exact workload that you are planning on. And if you think about it, most of the clusters that are built out today are built in an OpenStack environment, where they're building for private clouds. A lot of the enterprises are building their private clouds, and when you're building for a private cloud, you're not optimizing for any single workload.
Your goal really is to make it a universal infrastructure where you can actually have multiple workloads coexisting. So it is very important for your cluster to be balanced and to be able to tune for and handle the different types of workloads, and that's where this disaggregation becomes really critical. That's one of the reasons why we specifically, by design, did not build the CPU and everything into the storage, so that it could be balanced the right way.
Also, with techniques like this, where you disaggregate, if you look at the whole system cost, including the opex and the capex, you will actually be very surprised; just as our largest customers are finding, it comes out much cheaper than hard drives for the performance that you're getting. The other way that some of our customers are deploying it is in combination with hard drives. This is not the typical way.
This is, again, one of our customers: they're keeping low activity data on hard drives that are backed behind InfiniFlash. So that's one way of using it with hard drives. And there's one customer that's actually trying this out: keeping the primary copy on flash and keeping the secondary copies on the hard drives, with the primary affinity set to favor the flash copy.
So most of your reads keep happening just on the primary copy, on flash, and because you have a lot of bandwidth coming out of it, they're completely fine with that. Obviously, your writes are going to be limited to what the hard drives can handle, but even there, if you're able to eliminate the reads from the hard drives and keep only the writes going to them, the hard drives are just doing sequential writes, and hard drives can handle pure sequential writes fairly well.
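A small simulation of the mechanism being described (my illustration, not the customer's actual configuration; in a real cluster you would set this with Ceph's `ceph osd primary-affinity` command): replicas with zero primary affinity are avoided when electing the read-serving primary.

```python
import random

# one flash replica plus two HDD replicas per object
affinity = {"osd.flash": 1.0, "osd.hdd1": 0.0, "osd.hdd2": 0.0}

def pick_primary(replicas):
    weights = [affinity[r] for r in replicas]
    if sum(weights) == 0:           # degenerate case: fall back to random
        return random.choice(replicas)
    return random.choices(replicas, weights=weights)[0]

random.seed(1)
reads = [pick_primary(list(affinity)) for _ in range(10_000)]
print(f"reads served from flash: {reads.count('osd.flash') / len(reads):.0%}")
```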
If a hard drive is just doing that one thing, it can do okay, so your performance is really not going to be that bad compared to using a hard drive system for both reads and writes simultaneously. When you relegate the hard drives to just the writes, for protection, as this customer is doing, that also works fairly well for them. So this is, again, a TCO story.
As a quick note about where this customer is and how they measured: this data is coming from the customer. They are planning a hundred petabyte cluster, so this is going to be one of the largest Ceph clusters. This is not Yahoo, by the way; it is a customer that doesn't want this to be announced.
In the next year and a half to two years, they plan to go to a 100 petabyte cluster, and they did the TCO analysis on the traditional hard drives, which they were planning to use before InfiniFlash. Based on commodity hard drives, around 45 million is what they had budgeted, and this is including the three-year opex.
Basically, the acquisition cost and three years of opex combined. They compared that with InfiniFlash using different techniques. One is with the full replicas; obviously, it's more expensive. The second bar that you see is three full replicas on the object storage running on InfiniFlash. But the other key thing that's very interesting (and we can reach out and talk offline) is erasure coding and how it works with flash. How many of you are familiar with erasure coding? Good.
This chart is basically showing the total data center footprint. They were planning to have around 95 racks for this hundred petabyte deployment, based on their earlier hard drive model, and with the InfiniFlash erasure coding model, which is what they're going to do, they are able to get to 18 racks: from around 95 racks down to 18 racks. They eliminated a complete data center build-out with this, and that's basically a huge amount of savings. This customer really doesn't care much about performance.
For them, performance is not the key criterion; they are quite happy with what they are able to get with hard drives today. Failure rates are a different story: their experience is that they have 35 hard drive failures per week at that cluster size, and when they compare that to our AFR, which is less than 0.15%, it's going to be one SSD, one card, failure per week. So it's 35 hard drive failures to one card failure, and the big difference, again, is not the cost.
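Some back-of-envelope math on those failure numbers (the drive and card populations are my assumptions purely for illustration; the talk gives only the failures-per-week and AFR figures):

```python
def failures_per_week(population: int, afr: float) -> float:
    return population * afr / 52  # expected failures per week

hdd_population = 25_000           # assumed: ~100 PB of 4 TB drives
implied_hdd_afr = 35 * 52 / hdd_population
print(f"implied HDD AFR: {implied_hdd_afr:.1%}")  # ~7.3%/year

card_population = 12_500          # assumed: ~100 PB of 8 TB flash cards
print(f"card failures/week at 0.15% AFR: "
      f"{failures_per_week(card_population, 0.0015):.2f}")
```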
The drives are obviously all covered under warranty for them. But once you have a hard drive failure, for Ceph to rebalance the cluster is going to take them more than a week with the workload that they're running, because they have not optimized the network of that cluster for rebuilding; it's primarily built for higher throughput. So that whole rebalancing of the cluster, and the spikiness that comes out of it, is a huge issue. That's a big difference as well.
So opex, primarily, is one thing that I talked about: the failure rate and the power savings. I don't know if I mentioned this: with InfiniFlash, it's about 470 watts of power for a fully populated 512 terabytes. If you can imagine, that's almost equal to what a single-socket server costs in power. So the primary opex savings are a few things: one is the power savings, power and cooling, then the whole data center space, and all the labor costs involved in handling the media failures. Those are the three key factors that factored into their opex savings with all-flash.
They figured out that if they're going to be using flash, they're also going to be using SMR drives for the secondary copies, because those are purely there for protection; they're not trying to do any reads or any I/O on them. They felt that even if they have failures in those media drives, it really is not going to be very impactful, so they figured they would use those SMR drives, with the last copy being just passive, write-only.
So currently it is standard, running on XFS. The one thing that we talked about that's coming up in the roadmap is that, as Ceph moves to NewStore, the new backend, we have this new key-value store that's optimized for flash, so we want to move to that, and right now we're in the process of making it open source; it's currently proprietary software from SanDisk.
Yeah, so if you're talking about the OpenStack Summit in Tokyo: SanDisk did that presentation; it was Allen Samuels, yes. We did a presentation on just erasure coding. I strongly suggest you go look it up; it's all on the web. The gist of it, again, is the same thing that I was mentioning before: when you use all-flash storage, your whole erasure coding dynamics change compared to a hard drive system.
One is that with just a twenty percent overhead you're able to get, in fact, much better protection than full two-copy or three-copy replication. And even with full copies, most of our customers, a hundred percent of our customers, are just using two copies with flash, because flash failure rates are much, much lower.
So even when you compare that to a two-copy solution, with two times the overhead in raw capacity, with just twenty percent overhead under erasure coding you're getting much better protection. And the key thing there is that, obviously, when you use erasure coding, the big downside with current hard drives is the rebuild mechanics: how long it takes to rebuild, and how CPU-heavy that is. That's what the flash takes care of and eliminates.
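The overhead comparison is simple arithmetic (the k+m profile below is my example; the talk only quotes "twenty percent overhead"):

```python
def ec_overhead(k: int, m: int) -> float:
    """Extra raw space as a fraction of user data for a k+m EC profile."""
    return m / k

def replica_overhead(copies: int) -> float:
    return copies - 1.0

print(f"EC 10+2 : {ec_overhead(10, 2):.0%} overhead, survives 2 losses")
print(f"2 copies: {replica_overhead(2):.0%} overhead, survives 1 loss")
print(f"3 copies: {replica_overhead(3):.0%} overhead, survives 2 losses")
```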
So erasure coding is actually much more suitable now for the active working data, rather than just the archival data that erasure coding is currently mostly used for. That's the key difference, and it makes a huge amount of difference. Obviously, today, with Ceph, only the object store, the RGW gateways, can natively support erasure coding, and we're working on a blueprint to make that work for blocks as well.
Again, the gist of it is that mostly we are focused on the writes, coming up in the near term in the Jewel release. That's targeted at Q1, but I think it is going to slip a little bit beyond Q1. The next focus area really is to drive down the total cost when you deploy on all-flash; I talked about the erasure coding part of that.
The other aspects that we are working on with the community are compression and dedupe. The one thing with the dedupe is that it's a little bit iffy, because we are not sure about the kinds of workloads that get deployed on Ceph. We know a lot of all-flash vendors out there make a big deal out of dedupe; in fact, the prices they quote come with huge dedupe assumptions of many times the raw capacity, 6x or 8x dedupe savings.
We don't go and do those shenanigans; everything that we quote is basically per raw device. But one thing that we want to make sure of is: is dedupe really effective for the type of workloads that get deployed on Ceph? Right now we hear from some customers that yes, some of their VMs have a high dedupe affinity.