From YouTube: 2015-SEP-24 -- Ceph Tech Talks: Reference Architectures
Description
A look at building reference architectures for Ceph.
http://ceph.com/ceph-tech-talks/
A
Alright, welcome back everyone to the monthly Ceph Tech Talk. We've had a lot of different talks on the core aspects of Ceph, but pretty much anything technical that has to do with Ceph is something that we're looking to host under these talks. That way, folks that are following along or come to the YouTube channel can see the whole presentation before the questions start. Without further ado, I'd like to introduce Brent and Kyle from Red Hat, who are going to give you a rundown of how we build reference architectures. Take it away.
B
Thanks Patrick. So Kyle and I are going to keep this interactive and conversational between us as we go along; informal is best. That's one of the things that we do: we spend time with a lot of our partners building reference architectures. One of the reasons we do this is that one of the most frequently asked questions we get is, hey, this software-defined storage stuff is great, it runs on anything, so what do you recommend we run it on?
B
Okay, so the building blocks for reference architectures, the common ones for distributed storage. Of course the network is the foundational building block, and sometimes something we don't talk about enough; you know, the impact of 10 gig. I know it's well known in the community to use dual networks, one client facing and one cluster facing, but as you go up in network bandwidth to 40 gig, and as you look at bonding or not, you know, to bond or not to bond, it makes a significant difference.
B
So those are some of the variations, and of course then up to the servers. In the last, oh, probably 24 to 36 months, as you know, there's been just an explosion of variations on x86-based storage servers. A couple of years ago the default was kind of a 12-bay, 2U, dual-socket server with 12 three-and-a-half inch drives.
B
Now there's just a litany of different storage servers out there. Just looking at capacities, you see everything from 12-bay up through 24, 36, 60, even 72-bay chassis. And then of course the question that folks ask is, well, what kind of performance can we expect with different server types? Let alone different media types within those servers, different...
B
Different
capacities
of
HDD,
obviously
lower
lowering
the
capacity
increasing
the
quantity
of
servers.
You
know
changes
the
spindle
count,
obviously
going
to
change
the
type
of
performance.
You
can
expect
the
classic
question
of
right,
SSD
right
journal
ratio
to
HDD,
but
what
about
instead
of
a
SAS
or
SATA
SSD
writer,
and
what
about
PCIe
and
bme
right
journal?
You
know
how
does
that
change
things,
and
then
what
about
putting
the
OSD
all
together
on
instead
of
hard
drives
on
flash
drives?
B
You
know
folks
say
ok,
so
what
about,
if
I'm
driving
this
load
through
a
kvm,
BM
or
or
just
driving
the
load
through
the
radio,
skateway,
no
verte,
layer
or
emerging,
of
course,
or
questions
we're
getting
on
from
a
micro
service
as
a
container
type
verte
level,
and
then
on
top
of
that
all
this
is
kind
of
this
is
kind
of
pseudo
Maslow's
hierarchy
of
needs
here
at
the
bottom,
being
kind
of
food
and
shelter
working
up
to
that
self-actualization
level
is.
Is
the
defined
workloads?
B
Some
ref
arcs
are
read
more
like
how
to
integration
guides,
and
these
tend
to
be
at
the
upper
layers
of
that
hierarchy
of
needs.
On
the
previous
slide,
we're
now
you
know
kind
of
SEF
layer
and
the
OS
or
Burt
level,
and
then,
on
top
of
that,
you
know,
set
+
OS,
Burt
+
packaged
workloads.
On
top
of
that,
an
example
that
is
here
in
the
link.
B
So,
and
that's
that's-
that's
that's
one
flavor
another
flavor
read
more
like
performance
and
sizing
guides
and
listed
our
three
different
links.
There
one
as
a
reference
architecture,
we've
recently
published
with
supermicro
it's
across
their
family.
The
second
one
is
a
performance
white
paper
would
in
which
we've
collaborated
with
Cisco's
UCS
team
and
the
third
one
is
one
from
scalable
informatics.
Another
performance
white
paper.
So
all
of
those
read
more
like
performance
and
sizing
guides
I'm.
So
right
now,
as
you
is
just
FYI,
our
team
is
currently
focused
on
on
the
bottom
one.
B
So
that's
that's
going
to
be
kind
of
how
we
focus
our
comments
today
is
that's
where
we've
been
focused
with
a
and
building
these
reference.
Architectures
is
more
of
the
latter
we're
working
towards
working
towards
the
former,
the
top
one
there.
In
fact,
we've
begun
to
work
with
the
the
MySQL
Maria
to
be
community
at
the
top
layer
of
the
stack,
but
but
again
weighs
a
lot
of
our
work
has
been
foundational.
That's
its
kind
of
that
kind
of
set,
your
expectation
for
the
type
of
reference
architectures.
B
One
of
our
goals
with
these
these
reference
architectures
in
producing
these
performance
and
sizing
guides
is,
is
to
is
to
create
a
community
asset
such
that.
If
you
know
one
of
your
saying,
yeah
I'm
deploying
a
new
set
cluster,
my
workload
IO
patterns
look
kind
of
like
this.
My
capacity,
oh
it's
going
to
be
about
you
know
a
petabyte
or
two
and
I'd
like
to
run
it
on
all
flash
or
I'd
like
to
run
it
on
is
cheap
and
deep,
as
I
can
go.
I'd
like
to
it.
B
You
know,
I'd
like
to
go
out
and
query
some.
Some
community
repository
to
find
out
who's
done,
who
has
performance,
revolt
performance
results
from
an
architecture
like
I'm
looking
to
build
and
save
me
some
time,
I
can
read
up
on
some
empirical
data
in
some
type
of
structured
form,
so
kind
of,
let's
that's
what
we're
working
towards.
If
we
have
time
at
the
end,
we'll
I'll
paste
up
there,
we
have
right
now
it's
it's!
It's
simple.
B
It's
just
a
massive
spreadsheet
that
we
have
currently
working
on
we're
close
to
30
different
results
from
30
different
configurations,
and
we
have
them
all
lined
up
side
by
side
as
we
can.
We
can
look
down
a
row
and
see
how
various
performance
metrics
varies
with
different
configuration.
So
it's
quite
it's
quite
insightful.
It's
it's
fascinating!
In
fact,
so
that's
we'd,
like
two
mature
that
so,
instead
of
having
to
be
a
spreadsheet
jockey,
it's
it's
a
little
bit
more
user
friendly.
But
that's
it's
a
anyway!
B
So
that's
a
plug
out
if
anyone's
got
great
ideas
funneled
through
Patrick
to
us,
we're
very
interested
in
in
than
that.
Okay,
so,
back
to
the
back
to
then
talking,
there
are
considerations
for
these
reference,
architectures,
okay,
so
performance
and
sizing
guys
where
most
of
our
work
is.
That
here
are
three
links
illustrating
some
recent
work.
B
Okay,
so
here
are
some
design
considerations
that,
as
we
structure
these
reference,
architectures
that
we
focus
on
so
I'm
going
to
I'm
going
to
focus
on
a
couple
of
these,
then
in
particular
again
it'll
be
conversation
between
Kyle
myself.
You
probably
heard
Kyle
on
this
menu
before
senior
senior
storage,
architect
from
not
only
the
ink
tank
days,
but
the
dreamhost
days.
B
Kyle
column
cover
five
and
six
year
along
the
way.
I
asked
some
other
things.
Okay,
so
here
here,
just
six.
By
no
means
is
this
an
exhaustive
list,
but
here's
some
here's
some
useful
design
considerations
that
we
we
speak
about
in
constructing
these
reference,
architectures
and
expectedly,
as
we
speak
about
in
in
taking
from
the
top
and
architecting
sep
solutions.
So
these
items
this
is
meant
to
re.
You
know
you
know
after
having
these
conversations
with,
you
know,
internally
or
with
whatever,
whoever
the
solution.
Architects
are
to
understand
these
six
things.
B
If we get time at the end, and it's certainly in the slides, you'll have them in the download, we'll talk through a couple of performance graphs. Okay, so number one, the first design consideration: qualify the need for scale-out storage. We're not going to spend a lot of time on that here; it's something that you do all the time, every day.
B
So anyway, that's just an interesting thing. The first design consideration is to qualify the need for scale-out storage, because, as you all know as architects, frankly, depending on the size of the workload and whether it's, say, an Oracle database and whatnot, scale-out may not in fact be the right solution. But moving on past that, okay, the second design consideration: designing for the workload IO. And this kind of sets up the way that we approach the reference architectures.
B
We
have
a
four
by
three
matrix,
it's
kind
of
a
small,
medium
and
large,
with
different
I
old
patterns
which
you'll
see
in
a
minute
here.
So
this
kind
of
sets
sets
that
up
as
we
look
at
the
different
considerations
of
the
various
workloads,
so
the
the
first,
you
know
the
overall
that
the
topmost
governing
factor
we
see
it
is
okay.
People
designing
these
sep
solutions
is,
is
it
are
they
performance,
oriented
or
cheap
and
deep
oriented
you
cheapen?
Deep?
B
Being
you
know,
their
their
overriding
interest
is
cost
capacity,
cost
per
capacity
cost
Barack
density,
cost
per
watt,
cost
per
thermal
unit,
cost
cost,
costing
them
cheap
and
deep
order.
They
have
some
type
of
performance,
overriding
performance
objective.
Of
course
our
cost
is
always
a
factor
but
its
if
it's
an
overriding
performance
objective
again.
That's
that
we
test
different
in
a
reference
architectures.
We
try
to
go
through
different
configurations
that
will
be
optimal
for
these
different
types
of
work
load
patterns
and
then
descending
within
performance.
B
Oh,
you
know
it
might
be
an
image,
a
lot
of
JPEG
images.
It
may
be
video
audio,
but
in
my
large
block
you
know
typically
with
the
Iowa
pattern,
typifies
by
large
block
I,
oh
and
in
fact
large,
but
going
down
the
list
sequential
large
block
by
0,
vs,
I,
ops,
intensive,
obviously
tip
if
I'd
by
I
mean
the
poster
child,
I,
ops,
intensive,
workflow,
being
4k,
random,
I/o,
so
small,
and
so
all
these
considerations
again.
We
try
to
in
the
reference
architecture
work.
We
try
to
say.
B
Okay
on
this
slide
here
in
this
4
by
3
matrix,
the
Rose
become
academies
are
very
coarse-grained,
generalized
work,
letta
workload,
I/o
categories,
if
I,
ops,
optimized
workloads,
throughput
optimized
workloads
and
cost
capacity.
Optimized
workloads,
obviously
cognizant,
that
within
within
a
single
cluster,
you
might
have
different
pools
carved
out
for
different
workloads.
But
this
this
helps
us
again
to
identify
reference
architectures,
which
are
optimal
for
these
different
types
of
workload,
io
categories
and
then
back
up
to
this
thing.
B
Here,
some
of
the
things
that
become
interesting
down
below
or
things
like
the
read/write
mix,
as
we
benchmark
things
like,
like
erasure,
coded
pools
versus
replicated,
pools
the
as
you'll,
see
the
in
the
results:
the
for
instance
right,
the
right
performance,
right,
performance
of
a
ratio
in
a
ratio,
coda
pool
versus
replicated,
pool
it's
a
different
ratio
than
read
performance
and
for
all
the
logical
reasons
that
you
know.
That
makes
sense
when
you
think
about
what's
going
on
on
the
covers
there.
B
And when we look at performance optimized, of course, as we discuss with our partners, and we work with network partners, server partners and media partners, we discuss this with them as we're looking at performance-optimized reference architectures. The benchmarking for that would say, okay, our goal here is, by definition, performance optimized; we're shooting for the highest performance...
B
...this pool can yield, again whether that's megabytes per second for throughput oriented, or IOPS for IOPS oriented. But clearly not in a vacuum; cost always matters. So we also get list pricing information for the configurations so that we can begin to do some relative comparisons, and most vendors aren't all that...
B
They
don't
like,
of
course,
to
to
have
a
lot
of
lists
of
their
absolute
pricing.
Information
bandied
about
for
obvious
reasons,
so
we
convert
that
into
two
relative
comparisons
and
again,
if
we
have
time
at
the
end,
we
have
some
relative
comparisons
that
reflect
lowest
cost
per
performance
unit.
So
yeah
sure
you
can,
you
know
a
configuration.
Myeeeeh
might
yield
the
highest
performance,
but
it
sealants
its
liquid
nitrogen
cooled
and
it
costs.
You
know:
100
million
dollars,
okay,
that
it's
nice
that
it's
highest
performance,
but
it's
not
it's
not
attainable
by
by
mere
mortals.
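A minimal sketch of that kind of relative comparison; the configuration names and all numbers below are made-up placeholders for illustration, not figures from any published study:

```python
# Illustrative only: compare configurations by cost per performance unit
# rather than by absolute performance. All figures below are hypothetical.
configs = {
    #                               (relative list price, MB/s per server)
    "12-bay + 1 SSD journal":       (1.0, 1100),
    "36-bay + 2 NVMe journals":     (2.2, 2600),
    "72-bay, co-located journals":  (3.0, 2900),
}

baseline = None
for name, (rel_price, mbps) in configs.items():
    price_per_mbps = rel_price / mbps
    baseline = baseline or price_per_mbps       # first config is the baseline
    print(f"{name:30s} relative $/MBps = {price_per_mbps / baseline:.2f}x")
```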
B
So
we
look
at
some
cop
X
some
capex
elements
as
well.
As
you
know,
one
of
the
things
that
Kyle
drills
and
over
and
over
again
is
from
his
experience
on
the
operations
side
of
the
fences.
Hey.
You
know,
power
and
cooling,
always
matter
so
yeah.
You
might
get
a
cheap,
cheap,
capex
solution,
but
if
you
know
you're
burning
through
the
the
Los
Angeles
power
grid,
because
if
you're
sucking
so
many
watts
and
an
air
conditioning,
then
it's
it's
still
too
expensive.
B
So
we
try
to
add
the
reference
architectures
paid
and
pay
consideration
that
well
and
then
finally,
is
the
meets.
The
minimum
server
fault,
the
main
recommendation
and
we're
going
to
get
into
that
and
that's
a
favorite
topic
of
Kyle.
So
he's
going
to
talk
to
that
one
we
get
to
get
down
farther
below,
ok
and
then
cost
capacity,
optimized
set
otherwise
cheap
and
deep.
Of
course,
it's
it's.
The
governing
attribute
is
lowest
cost
per
terabyte,
but
again
it's
it's.
It's
also
a
capex
and
opex
game.
B
You know, if the opex cost is more oriented towards floor space, in-building floor space, then that's an important element here. And we relax the minimum server fault domain recommendation a little bit for cost/capacity-optimized clusters, again with the belief that they're less performance sensitive. Okay. So with those considerations, the third of six design considerations we look at is the obvious one: storage access method.
B
It's
kind
of
hard
to
see
that,
but,
for
instance,
obviously
ratos
block
device
is
supported
with
a
replicated
data
protection
scheme,
only
so
choice
of
storage
access
method
if,
for
instance,
if
the
store
Jack's,
but
if
the
storage
access
method
required
by
the
workload
is,
for
instance,
block
that
that
immediately
constrains
the
data
protection
game
used
to
a
replicated
pool
which
then
obviously
drives
the
it
constrains.
The
permutations
of
a
server
network
in
particular
server
emedia
architectures
targeted.
B
So
we
know
that's
that's
kind
of
the
third
design
consideration,
then
the
fourth
one
identifying
capacity
and
identifying
capacity
at
face
level-
one
might
say:
well,
you
know
what
does
that
really
have
to
do
with
the
with
the
reference
architecture?
Well,
it's
it
comes
back
to
the
previous
slide
about
or
the
previous
conversation
about
fault.
The
main
considerations
you
know.
Clearly,
if
you
are,
if,
if
you
know,
if
you're
architecting
a
or
you're
designing
a
reference
architecture
for
a
half
a
petabyte
solution,
then
you
know
it's
probably
not
going
to
be.
B
It's
probably
not
going
to
use
big
old
n72
of
a
servers
from
a
fault
domain
perspective,
so
identifying
the
capacity
has
significant
ramifications
into
the
fault
domain,
and
so
that's
actually
a
good
transition.
So
Kyle
as
we
look
at
fault
domain,
risk
tolerance
and
clear
that
we've
turned
this
as
risk
tolerance.
Because
some
you
know
it's
it's
it's
a
choice.
It's
a
subjective
choice
by
the
architect
here,
based
on
the
environment.
So
Kyle
talk
to
us
a
little
bit
about
the
this
question
about
okay.
C
Sure, absolutely. So with scale-out storage systems, typically you're transitioning from the mindset where, instead of trying to make sure that a singular host, or a pair of hosts, are highly fault tolerant within themselves, you instead want the cluster software to provide the fault tolerance.
C
You know, the rule of thumb that we provide is that you don't want to lose more than a tenth of the cluster with a single node failure, because not only are you going to lose, you know, ten percent of the capacity in the case of a node failure, but you're also going to have the additional workload of having to recover from that failure. So when you start to get into clusters that are smaller than 10 nodes, you can see how this can be very problematic, right?
C
So if you have the absolute minimum three-node cluster and one of those hosts fails, you lose one third of your capacity, and for the remaining two hosts, not only is there less aggregate cluster bandwidth available, but the remaining bandwidth also has to deal with the recovery of that failed host. So, based on testing we've done, we really like to steer customers towards larger node counts.
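A small, purely arithmetic illustration of the ten-percent rule of thumb Kyle describes:

```python
# Rule of thumb from the talk: a single node failure should not take out
# more than ~10% of the cluster's capacity or aggregate bandwidth.
def failure_impact(num_nodes: int) -> float:
    """Fraction of cluster capacity/bandwidth lost when one node fails."""
    return 1.0 / num_nodes

for nodes in (3, 5, 10, 20):
    impact = failure_impact(nodes)
    verdict = "meets" if impact <= 0.10 else "violates"
    print(f"{nodes:2d} nodes: lose {impact:.0%} on a node failure ({verdict} the 10% guideline)")
```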
B
Another question for Kyle before I leave this slide. One of the things I know we've talked about a lot before is also the cluster's reserve capacity in terms of terabytes. There's a certain amount of reserve that you should always have just for good, normal operations, but then let's say that you have a three-node cluster: how does that reserve capacity need to grow when you have a smaller cluster? Talk a little bit about that.
C
Right, so if you want to be able to have the cluster fail in place and recover from a host failure, because you don't want to have to send someone to the data center to try to repair a host and bring it back up, operationally it's better if you can just let the software recover from the failure.
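A rough sketch of the reserve-capacity math implied here: to fail in place, the surviving nodes need enough free space to absorb a failed node's data on top of normal operating headroom. The headroom figure below is an illustrative assumption, not a Ceph default:

```python
# Sketch: how full the cluster can safely be while still able to re-replicate
# a failed node's data onto the survivors without filling them up.
def max_usable_fraction(num_nodes: int, headroom: float = 0.15) -> float:
    """Fraction of raw capacity you can fill and still recover in place."""
    survivors = num_nodes - 1
    # After recovery all data lives on the survivors; keep `headroom` free.
    return (1.0 - headroom) * survivors / num_nodes

for nodes in (3, 5, 10, 20):
    print(f"{nodes:2d} nodes: keep utilization under {max_usable_fraction(nodes):.0%}")
```

The smaller the cluster, the larger the reserve has to be, which is the point Brent is driving at.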
B
Excellent, thanks. So you can see why this matters as we identify the target architectures on which to base the reference architectures. I mean, the classic conversation when people are new to the concept of Ceph, not after they've been around for a while, but when they're new: they read about these big old 72-bay servers and they read about the cost efficiency, and, man, that's exactly what I'm going to do with this software-defined storage. I love this stuff, I can choose any platform.
B
I
want
I'm
just
going
to
go,
get
all
72
base
servers,
but
but
then,
as
kyle's
explained
here
yeah
you
you
better
have
a
cluster
of
a
of
a
you
know
a
petabyte
or
two,
because
you
know
when
you,
when
you
cram
it,
you
know.
Let's
say
you
got
a
72
base
server
with
the
sixth
terabyte
drives.
You
know
six
times
six
times,
seven
you're
looking
at
close
to
a
half
a
petabyte
of
raw
capacity
in
a
single
chassis.
B
So
yet
this
this
this
becomes
a
significant
consideration
as
we
as
we
identify
different
architectures
deco.
That's
number!
Five.
Now
number
six
and
I'm
going
to
t
this
up
for
Kyle's
as
well
here
so
data
protection
schemes
I've.
One
of
the
statements
we
made
at
the
bottom
here
is
one
of
the
biggest
choice
is
affecting
the
TCO
in
the
entire
solution.
B
So
talk
to
us
and
then
that's.
Obviously
it's
meant
to
be
a
blatant
attention.
Getter
here,
you've
all
evolved
in
part
of
conversations
with
with
the
with
management
or
with
sales.
People
may
sell
yeah.
That's
that's
just
a
detail.
You
know,
don't
don't
let's
not
trouble
with
that
detail
then!
Well,
it's
it's!
Actually.
You
know
what
he
wanted.
You
want
to
spend
half
as
much
or
twice
as
much,
obviously,
because
the
the
quantity
of
of
media
that
you
need
is
heavily
impacted
by
choice.
C
Sure. So the basics: with replication you're just making copies of each data object, and in Ceph, because it's based on RADOS and everything is stored as an object internally, for each object there are going to be n replicas. Typically people are using 3x replication, and so, as such, you have one third usable-to-raw capacity. Erasure coding, on the other hand, uses math to generate parity, such that...
C
...you take an object that would be written into the cluster and you divide it into a number of chunks; then you also generate an additional number of parity chunks. Both of these are configurable, and then all of these different chunks, the ones split from the original object and the parity chunks, are distributed across the cluster according to the CRUSH mapping.
C
And
so
in
this
way,
if
you
lose
one
of
those
chunks,
those
chunks
can
be
reconstituted
from
from
the
parity
or
in
the
case
of
loss
of
parity
bits,
they
can
be
recalculated
from
the
original
chunks.
So
this
is.
This
is
very
similar
to
like
what
a
traditional
raid
array
uses
internally,
except
instead
of
striping.
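A deliberately simplified illustration of the chunk-and-parity idea: Ceph's erasure-code plugins use Reed-Solomon-style codes that support multiple parity chunks, but the single-XOR-parity case below (k data chunks plus one parity chunk) shows the reconstruction mechanics:

```python
# Split an "object" into k chunks, add one XOR parity chunk, lose a chunk,
# and rebuild it from the survivors. Real Ceph EC profiles support m > 1
# parity chunks; this is just the simplest possible case for illustration.
from functools import reduce

def split(obj: bytes, k: int) -> list[bytes]:
    size = -(-len(obj) // k)                      # ceil division
    return [obj[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]

def xor_parity(chunks: list[bytes]) -> bytes:
    return bytes(reduce(lambda a, b: [x ^ y for x, y in zip(a, b)], chunks))

obj = b"an object written into the cluster"
data = split(obj, k=4)
parity = xor_parity(data)

lost = 2                                          # pretend chunk 2's OSD died
survivors = [c for i, c in enumerate(data) if i != lost] + [parity]
rebuilt = xor_parity(survivors)                   # XOR of the rest recovers it
assert rebuilt == data[lost]
print("recovered chunk:", rebuilt)
```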
C
In
some
cases,
though,
the
data
protection
scheme
is
going
to
be
predicated
on.
The
way
that
you
are
accessing
said
said,
data,
so
in
the
case
of
block
storage,
for
example,
replication
is,
is
the
only
mode
that
is
supported,
and
that's
because
you
know,
as
since
the
block
since
a
block
device
has
striped
across
many
objects.
B
Yeah, thanks Kyle, good stuff. So those six design considerations then yield a four-by-three matrix like this. This is how we go into these reference architectures: with one of these matrices empty, we say we want to work through a variety of different permutations of server, media and network to identify optimized configurations, and again, this is one of the reasons why we listed the criteria for how we define optimized.
B
Clearly,
every
environment
is
some
mix
of
these
different
types
of
workloads,
and
but
at
least
it
provides
a
way
to
frame
the
conversation
to
say:
okay,
here's
a
particular
configuration.
That's
that's
that's
for
luck,
for
instance,
for
larger
cluster
slices,
which
would
frequently
tends
towards
higher
higher
density
servers,
and
it's
particular
configuration
with
network
and
media
lends
itself
well
towards
either
throughput
or
I
ops
or
cost
capacity.
Optimization
one
of
the
things
just
one
of
the
things
to
note
here
for
cost
capacity,
optimized
configs
and
how
we
approach
the
bench
marking
for
these
reference.
Architectures.
B
For
that,
for
that
row
for
cost
capacity
optimized,
we
stay
true
to
the
criteria,
which
is,
as
we
discussed
above
cost
capacity
optimized.
So,
for
example,
we
use
eraser
erasure
coding
for
the
cost,
capacitive
the
cheap
and
deep,
so
classic
use
case
of
object
archive
so
for
those
architectures
that
we
benchmark
their
erasure
coded
and
they
don't
use
flash
write
journals,
because
that
adds
a
significant
element
and
so
the
cost.
Again.
That's
that's!
That's
the
objective
for
that
row.
The
cost
drops
dramatically
between
throughput
optimized
clusters
and
cost
capacity.
B
...optimized clusters. I mean, back to the last thing that Kyle covered, the data protection scheme: when you have, for instance, seventy-three percent usable-to-raw capacity versus thirty-three percent, then if you need a petabyte of usable storage, instead of buying three petabytes of raw in order to get a petabyte usable, you're buying more like 1.4 petabytes. So already that's a tremendous difference in the cost, and then, on top of that, you eliminate the dedicated SSD write journals and just co-locate your journals on your spinners. So anyway, we've stayed true: the configurations that we benchmark for cost/capacity in those reference architectures are true to the objectives in that fashion.
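The arithmetic behind those numbers, with an assumed 8+3 erasure-code profile standing in for the roughly seventy-three percent usable-to-raw figure (the actual k+m used in the study isn't stated here):

```python
# Worked version of the usable-to-raw comparison mentioned above.
# 3x replication stores every byte three times; an erasure-coded pool with
# k data chunks and m coding chunks stores (k+m)/k bytes per usable byte.
def raw_needed(usable_pb, *, replicas=None, k=None, m=None):
    if replicas is not None:
        return usable_pb * replicas
    return usable_pb * (k + m) / k

usable = 1.0  # one petabyte of usable storage
print(f"3x replication : {raw_needed(usable, replicas=3):.2f} PB raw "
      f"({1/3:.0%} usable-to-raw)")
print(f"EC 8+3 profile : {raw_needed(usable, k=8, m=3):.2f} PB raw "
      f"({8/11:.0%} usable-to-raw)")
```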
So this is the 4 by 3 matrix. The one on the left we've labeled OpenStack starter. A lot of folks...
B
...you know, as they do initial pilots and proofs of concept with OpenStack, might be using relatively small amounts of capacity: a Glance image store, a little bit of persistent Cinder devices and whatnot, so that's what that one is for. But for the other ones we've said, okay, small starts at half a petabyte, and medium, look at that typo there under medium, is that one terabyte? It's meant to be one petabyte; somebody dropped a letter there. And then large. Okay.
B
Actually, we work quite a bit with Intel on this as well. So part of it, of course, is the process: we sit down and say, okay, this is what we're trying to populate, and we want to identify, because of course there are a hundred different permutations that we could test, a handful of configurations that theoretically should be optimal. Then we test them and publish the results.
B
So
what
you're
going
to
see
on
the
the
next
couple
of
slides
are
extracts
from
I
recently
published
SEF
on
supermicro
reference
architecture,
as
as
noted
before,
of
course,
you
know
we're
red
hat.
We
work
on
lots
of
different
platforms.
This
one
just
happened
to
be
the
most
comprehensive
ones
who
choose
have
chosen
extracts
from
that
one,
and
it's
got.
These
results
are
based
on
lab,
benchmark
results
from
a
bunch
of
different
configurations.
B
Okay,
so
this
one
this
first
one
here
again
with
the
full
document.
The
links
are
in
this
slide
decks.
You
can
read
the
full
document
at
your
leisure,
it's
it's!
Oh
it's
a
little
over
40
pages,
long,
so
lots
of
graphs.
We
haven't.
We
haven't
extracted
all
the
graphs,
but
we've
extracted
a
few
okay,
so
we'll
just
we'll
just
kind
of
give
an
overview
of
a
few
of
them
just
to
kind
of
give
you
a
sense
for
for
how
to
quickly
read
them.
B
Okays
first
is
okay,
so
the
the
axes
of
this
one
here,
the
obviously
the
x-axis-
is
the
object
size
fed
into
the
load
test
utility
so
ranging
from
for
K
through
64,
too
close
to
a
magnitude,
24
meg
and
then
the
the
y
axis
is
megabytes
per
second
and
at
the
top
it's
either
megabytes.
You
know.
Obviously
we
have
graphs
that
megabytes
per
second
aggregate
for
the
entire
cluster,
but
then
we
in
order
to
have
a
little
bit
of
a
normalization,
so
we
can
have
a
better
comparative.
Then
we
break
that
down
okay.
B
So
if
you,
if
you
divide
the
overall
cluster
throughput
by
how
much
throughput
on
average
a
server
is
producing
that's
this
one
is
per
server
and
then
we
further
normalize
it
down
to
OSD.
So
we
can
get
a
comparative
measure
because
clusters
are
different.
We
benchmark
clusters
of
different
sizes,
we
benchmark
chassis,
Zand,
different
sizes
of
you
know:
different
quantities
of
os
DS
per
node,
and
but
the
least
common
denominator
for
making
a
direct
comparison
courses
is
a
amount
of
performance,
/
OSD.
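The normalization Brent describes, as a tiny sketch with placeholder numbers:

```python
# Take aggregate cluster throughput and express it per server and per OSD
# so that differently sized test clusters can be compared directly.
def normalize(aggregate_mbps: float, servers: int, osds_per_server: int):
    per_server = aggregate_mbps / servers
    per_osd = per_server / osds_per_server
    return per_server, per_osd

# e.g. a hypothetical 4-node test cluster with 12 OSDs per node
per_server, per_osd = normalize(aggregate_mbps=4400, servers=4, osds_per_server=12)
print(f"per server: {per_server:.0f} MB/s, per OSD: {per_osd:.1f} MB/s")
```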
B
So
in
the
title
that
will
indicate
what
we're
looking
at
here,
this
one
happens
to
be
a
3x
replicated
pool
it's
it's
using
Rados
bench,
so
it's
coming
through
a
deliberate
O's
level,
not
going
through.
In
this
case.
This
study
is
not
going
through
ratos
block
device
or
rgw
and
then
the
lines,
one
of
the
lines
how
to
interpret
that
the
shorthand
notation
airlines.
B
Generally
speaking,
the
first
number
is
quantity
of
hard
drives:
Oh
s,
DS
per
server,
chassis
or
server,
I
should
say
the,
and
so
the
top
line
is
12.
That's
12,
0
s,
DS
plus
1,
is
plus
one
dedicated
flash
right
journal.
So,
for
instance,
going
down
to
the
green
line.
36
plus
2
would
be
30,
60
s,
DS
per
server,
plus
two
dedicated
flash
write
journals
and
in
the
reference
architecture.
B
You
know
that
it
goes
into
more
detail
like
those
to
happen
to
be
PCI
mdme,
flash,
not
standard,
SAS,
SATA,
SSD
and
then
up
the
obvious.
The
networking
is
10,
gig,
plus
10
gig.
That's
you
know
front
and
back
facing
10
gig
networks.
So
we
followed
our
nomenclature
through
all
of
these
graphs,
except
for
one
exception.
B
0 plus 2. You'd say, well, that doesn't make any sense: how could I have zero OSDs and only write journals? Well, it kind of didn't stay true to the nomenclature, and we've since modified the nomenclature to make more sense, but what it was is two NVMe flash devices running OSDs with co-located journals. So it's two OSDs. That's the light blue one, the 0 plus 2; that's how you understand that one, and everything else follows the standard. Okay. So look at the lines.
B
So
what
observations?
Okay?
So
one
might
observe
that
okay,
the
good
old
tried-and-true
12
plus
you
know
12
Bay
server
is
the
touch.
Is
the
turquoise
line?
It's
it's
is
what's
yielding
per
per
server
under
the
the
10
gig
network
saturation
point.
Would
you
know
to
what
about
eleven
hundred
megabytes
per
second
somewhere
in
there?
B
So
just
under
the
network
saturation
point,
and
then
you
see
on
top
of
that,
a
couple
of
them
per
server
hitting
that
the
network
saturation
point,
then
you
begin
to
see
the
ones
above
that
that
are
going
beyond
the
10
gig
network
saturation
point
because
they're
on
40
gig
happen.
In
this
case
they
were
using
mellanox
connect,
X
3
40
gig
cards
and
that's
in
the
reference
architectures.
You
read.
Okay,
so
you
say
wow.
B
You
know
that
that's
pretty
sweet
that
the
pink
line,
60
plus
12
on
40
gig,
wow
I'm
getting
I'm
getting
like
what
is
that
you
know
two
and
a
half
three
times
more
work
done
per
server
and
than
I
am
on
the
12
bay.
Okay,
cool
now
check
this
out.
If
you
look
at
the
same
graph,
but
instead
of
per
server,
if
you
look
at
how
much
work
you're
getting
done,
how
much
value
are
you
getting
out
of
an
individual
OSD.
B
38
something
like
that,
okay,
so
the
two
points
to
take
away
from
the
slide.
The
first
is:
it's
really
interesting:
it's
really
useful
to
look
not
only
at
the
performance
per
server,
but
also
look
at
performance
per
individual
OSD,
because
that's
it's
the
greatest
unit
of
cost.
Actually,
let's
that's
point
one
useful
to
look
at
both
point
to
is
hey.
I
mean
this.
This
is
benchmarking
reference
architecture,
it's
it's
always
we're
always
learning
new
things
and
having
new
configuration.
B
So
as
we
we
spent
a
lot
of
time
actually
with
a
variety
of
different
performance
teams,
including
mark
who
I
think
gave
last
month's
talk
on
here.
You
know
Kyle
and
I
chatting
with
Mark
about
what
what
is
what
is
the
throttle?
Why
is
why
is
the
pink
line
not
able
to
get
as
much
work
done
per
LSD
as
more
scarce
server
so
consider
this
is
a
snapshot
in
time.
I
expect
we'll
probably
make
progress
in
figuring
this
out.
B
Okay, so then, of course, you can look at the writes. Okay, so here's a softball one for Kyle. If you look at this, and we stayed with per OSD here, again we're looking at a variety of different configurations, you can see why we do reference architectures, coming back to the original simple question: when people say, hey, I love this, Ceph can run on anything, what do you recommend? Well, of course, in this world, as architects, you can never make a blanket recommendation.
B
We
can
provide
a
benchmarking
data
to
help
make
a
decision
stone.
That's,
of
course
you
can
see.
It's
hey.
These
benchmarks
produce
a
wide
range,
particularly
in
this
case,
if
you
have
large
block
iOS,
there's
a
pretty
big
spread
here,
though
architecture
matters,
so
so
here's
a
softball
on
Kyle,
so
the
previous
slide,
so
just
looking
at
the
12
Bay.
B
So
if
I
go
up,
one
slide
for
sequential
read
throughput
at
seven,
the
12
babe
the
individual
drive
was
was
producing
75
megabytes
per
second
throughput
with
the
largest
block
side,
but
that
drops
all
the
way
from
75
down
to
around
25
with
sequential
writes.
Why
are
we
only
getting
about
a
third
of
the
write
throughput
per
cluster?
In
this
case?
It's
normalized
bro
SD.
Why
is
that
calm,
I.
B
And then talk us through the next slide as well, where we shift from, the previous graphs were 3x replicated, and we shift in this case to erasure coding. So talk to us about that: here you can see that the spread is from about 6 megabytes per second per drive to around 25 megabytes per second, and that spread then shifts up to from about 10 to around 40. Why has the spread shifted up when we use erasure coding instead of 3x replication?
C
I mean, so when you're using replication and you're writing to disk, in the case of 3x replication one copy is being written at the primary and then the primary is streaming the two replicas to the secondaries. So coming out the back end, on the cluster network, you have a 2x amplification of traffic, and you also have a 3x amplification of the actual data being written to platters. With erasure coding...
C
Not
only
are
you
sending
less
network
over
or
nest
less
data
over
the
network,
but
you're
writing
less
data
to
planners.
So
between
those
two
you
know
you
don't
you're
not
going
to
be.
It's
unlikely
that
you're
going
to
be
bombed
by
quite
a
back-end
network
and
also
the
the
total
amount
of
data
that's
being
written
to
disk
is
less
so
you
know
throughput
will
be
higher.
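A back-of-the-envelope version of that comparison; the 4+2 erasure-code profile is an assumption for illustration, and the network figure for erasure coding is only a rough estimate:

```python
# Bytes crossing the back-end (cluster) network and bytes landing on platters
# for one client write, under 3x replication versus an erasure-coded pool.
def write_amplification(client_bytes, *, replicas=None, k=None, m=None):
    if replicas is not None:
        disk = client_bytes * replicas            # every replica hits a platter
        network = client_bytes * (replicas - 1)   # primary streams to secondaries
    else:
        disk = client_bytes * (k + m) / k         # data chunks plus parity chunks
        network = disk - client_bytes / k         # all but the primary's own chunk (rough)
    return network, disk

for label, kwargs in [("3x replication", dict(replicas=3)), ("EC 4+2", dict(k=4, m=2))]:
    net, disk = write_amplification(1.0, **kwargs)
    print(f"{label:15s} back-end network {net:.2f}x, disk writes {disk:.2f}x")
```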
B
There
are
some
other
things
we
bring
in
relative
price
performance
and
then,
in
the
end,
this
this
was
the
for
that
particular
study.
This
was
this
was
that
four
by
three
matrix,
so
specific
model
numbers
for
that
one
super
micro.
Actually
they
produce
separate
thefts
queues
for
these
configurations
and,
and
you
can
kind
of
see
how
that
works.
But
and
then
the
reference
architecture
goes
into
additional
subsystem
guidelines,
like
you
know:
CPU
memory,
server,
chassis,
size
and
whatnot.
A
You know, if we run over a few minutes or whatever, I'm sure people can stick around. So if anybody has questions, now is the time; go ahead and type them in the chat or come off mute and ask a question. While we're waiting for people to type their questions in, Brent, one thing that I did want to share was the reference architecture slash evolved Ceph-brag stuff that we're working on.
B
Yeah, yeah, so to answer Patrick's question, when I put up this slide right here: wouldn't it be cool, I mean, let's say we're all architects on the phone of some type, wouldn't it be cool to have like a hundred different permutations, where you could say what happens when I change this parameter, or change that parameter, and then be able to see...
B
Just
I
mean
because
in
10
seconds
you
can
look
at
this
graph
and
you
can
kind
of
take
in
you
know
how
changing
different
parts
of
the
architecture
affects
performance.
So
if
the
community
were
to
contribute
to
this
benchmark
results
library
we
would
we
would
go,
we
would
accelerate
the
quantity
of
comparables
dramatically
and
that
that
would
be.
That
would
be
graphing.
We
think
that
would
be
really
helpful,
be
interested
in
feedback
on
that.
So
I
think
that
was
what
Patrick
was
was
referring
to
not
sure
that
it
exists
up
there.
B
In
fact,
when
people
were
speaking
with
says,
not
sure
exists
out
there
for
any
storage,
let
alone
stuff
it'll
be
cool.
Now
the
thing
is
going
to
say
on:
that
is
one
of
the
things
you
can
say.
Well,
gee,
you
know
it's
you
go
back
to
the
good
thing
about
statistics
is
you
can
make
it
say
anything.
You
want
kind
of
thing
well,
one
of
the
reasons,
one
of
the
ways
we're
trying
to
standardize
things
by
which
we
work
with
Mark
Nelson
for
to
have
him
open
source,
the
Ceph
benchmarking
tool.
A
And that's free for anyone to use. The little piece that I wanted to share on the community side of this is that we are working on building this community collector. Those of you that have been around for a while know that we used to have an idea called Ceph-brag, where you'd be able to submit your cluster details and statistics in an anonymized way, so that we could start seeing which clusters were out there and what things are available.
A
We're
doing
a
little
bit
of
a
bit
shift
on
that
and
allowing
people
to
submit
performance
results
and
cluster
makeup,
and
things
like
that,
and
eventually
we're
hoping
that
that
will
be
in
a
interactive
format
on
metrics,
f,
calm
in
such
a
way
that
people
can
start
playing
with.
You
know
if
I
want
a
throughput,
optimized
cluster,
with
options
x,
y&z,
they
can
see
what
other
people
have
done
and
how
they
can
best
get
to
their
end
point,
perhaps
with
specific
Hardware
keeping
so
it'll
be
fun
to
see.
A
In
the
meantime,
we've
had
a
question
that
came
in.
Are
you
only
using
randos
bench
for
performance
evaluation?
Perhaps
you
could
talk
a
little
bit
about
the
the
tooling
and
you
mentioned
march
and
CDT
and
some
standardization
stuff,
but
tell
people
a
little
bit
about
what
you're
using
and
how
you
got.
Two
numbers.
C
Right, so this was kind of our first foray into doing extended testing with a partner, and we wanted to test at the lowest level first, to kind of establish baselines, especially so that we understand throughput, because it also pertains to being able to calculate, you know, mean time to recovery and such. And so yes, this first analysis focused only on RADOS bench. Most recently, and it's not published yet, we've been doing extensive testing...
C
The
follow-up
testing
to
this
has
been
doing
a
lot
of
fio
testing
on
storage,
both
using
the
Lombardi
lib,
rbd
and
Chen
through
fio,
and
you
know
a
fio
with
the
aio
engine
inside
of
kayvyun
camus,
virtual
machines
that
have
rbd
block
storage
devices
attached
to.
B
Yeah, thanks Kyle. The only thing I'll add to Kyle's comments there is that, on top of that, we've worked with members of the MySQL and MariaDB community, actually a gentleman from Percona. CBT is a test harness; you plug different load test utilities into the harness, and, as Kyle mentioned, within CBT we've been using RADOS bench and different flavors of fio for measuring random IO, but this person from Percona has also added sysbench.
B
So
we
can
drive
this
bench,
mysql
work,
load,
testing
from
cbt
and-
and
so
that's
part
of
the
unpublished
study
that
Kyle's,
mentioning
they're,
so
low
test,
ratos
bench,
various
flavors
of
fi,
0,
dis
bench
and
intel
has
also
been
integrating
cause
bench
into
cbt
with
we
right
now.
It's
we
haven't
had
the
band
wit
to
do
any
study,
cbt
based
studies,
they're
driven
studies
with
cause,
but
that's
another
load
test
utility.
That's
ingrained
in
the
cbt.
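For anyone who wants to reproduce the lowest-level numbers themselves, here is a minimal sketch of scripting a RADOS bench run from Python. It assumes a pool named "testpool" already exists and the rados CLI is installed (check the flags against your installed version), and it does none of the orchestration or result collection that CBT provides:

```python
# Minimal sketch: run a RADOS bench write test and keep the raw output
# for later side-by-side comparison. Assumes the "testpool" pool exists.
import subprocess, datetime, pathlib

def rados_bench(pool: str, seconds: int, mode: str, block_size: int) -> str:
    cmd = ["rados", "bench", "-p", pool, str(seconds), mode,
           "-b", str(block_size), "--no-cleanup"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    out = rados_bench("testpool", seconds=60, mode="write", block_size=4 * 1024 * 1024)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    pathlib.Path(f"radosbench-write-{stamp}.log").write_text(out)
    print(out.splitlines()[-1])   # summary lines appear at the end of the output
```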
B
So I'll respond to that one, and Kyle can add anything there. We have some recommendations or guidelines per subsystem on that; let me scroll down a couple of slides. Things like, for instance, CPU: we did quite a bit of testing with dual socket versus single socket and, frankly, for throughput-optimized and cost/capacity-optimized clusters, we didn't see a huge difference from that second socket being filled.
B
So there are some considerations like that in there, individual subsystem guidelines. There are a few things in there that I think Kyle called by a very technical term, Kyle's bag of tricks, in terms of just some of his favorite kernel tunables and Ceph tunables and whatnot. That's also, like everything, something that changes with every release as we learn, but we put a few of those things in there as well. Kyle, anything to add there?
C
I mean, I think the biggest thing that we found is that using tuned and the performance profile tends to do very well for Ceph workloads, at least for the benchmarks that we were running here. And then probably the most important thing is that, when you have these machines with a lot of different devices generating interrupts, make sure that those are being spread evenly across your processors.
A
Not seeing any other questions, so thank you, Brent and Kyle, this was another great talk. I'll throw it up on YouTube here once it's done, but otherwise we'll see you guys next month. Follow along on the Ceph Tech Talks page on ceph.com. We don't have a talk slated for October yet, we're still looking for something, but definitely keep your eye on November; it will not be on the fourth Thursday, as it usually is.
A
It
will
be
on
a
tuesday,
I'm
17th,
there's
going
to
be
a
talk
about
the
postgres
sequel
on
set
under
mesa,
said
Aurora
with
dr.,
so
some
all
kinds
of
good
stuff
crammed
into
that
one's
the
container,
mojo,
some
stuff,
mojo
and
somehow
database
workloads.
So
it
ought
to
be
a
good
one.
So
if
nothing
else
we'll
see
you
guys
in
October
and
then
again
in
november
thanks
everybody
for
coming.