From YouTube: Ceph at Scale - Bloomberg Cloud Storage Platform
Description
Chris Jones, Technical Lead for Bloomberg Cloud Storage and maintainer of Ceph's Chef cookbooks on GitHub, will provide a technical overview of how Ceph Object and Block storage are used at Bloomberg with and without OpenStack. Learn how to build your own multi-petabyte hyper-converged cloud storage system using the same automation tools used at Bloomberg on commodity hardware. Topics will include: -Multi-Cluster architecture with index sharding for Object Store (RADOS Gateway) -Architecture to
A
My name is Chris Jones. I'm with Bloomberg, and we're going to talk about some of the things involved in scaling Ceph, in this case, if I can get this to work. All right, real quick: 30 years in under 30 seconds. We are primarily a financial services provider, and you can see our Bloomberg Terminal there. It has over 60,000 functions and does a tremendous amount of things for the financial markets, including data streams, etc.
A
So what do we use Ceph for? We use the object store, the block volumes, and, on the OpenStack side, we just started using the ephemeral storage. One of the guys who worked on that is here, and I believe the ephemeral storage has now become one of the more popular things that we offer, which is kind of cool. I think it was at Vancouver last year that we saw that, and so he went back and kind of implemented it.
A
Now, it talks about hyper-converged; we were kind of hyper-converged before it was cool. Everything now is 100% buzzword compliant: hyper-converged, super-hyper-converged, you name it. But we started doing that probably three years ago, where we had everything on, for example, head nodes (head nodes, or controllers in the OpenStack world), but we were also running Mons, OSDs, you name it, on those same machines, and that started becoming a problem.
A
Now, you see there in the middle it talks about ToR. We actually put those with each bundle, with each pod, so you can make tweaks; we can scale however we want to within the data center, etc. So it gives us a lot of flexibility. But what we were seeing is that our object store was becoming very popular, so we wanted to scale that; the problem was we didn't want to have...
A
I mentioned the ephemeral piece a while ago; to give you a little bit of a visual on it, because I'm kind of a visual guy: if you're using the Ceph side and then you're looking at the ephemeral, you can see that there's a lot more network traffic, etc. But there are a lot of trade-offs. With Ceph it's safe, from this particular standpoint, if one of our customers stands up however many VMs, etc.
A
Now, the numbers here don't represent actuals. Don't look at it and say, "oh, this is what they found on their production cluster," because that's not true. These were actually done with some of our lab equipment, and some of it is old. So you look at some of the comparisons between them, and when you look at this, you compare: do I run Ceph for everything, or do I mix in some ephemeral, etc.?
A
So we get to the thing that we're doing right now, which is the object store. We started kind of breaking that out, and so what we're doing now is: we started out with three racks, and in our initial setup we had hardware load balancers, but now we've actually created our own custom load balancers.
A
Now, in that case (and the other thing too, I don't have a pointer), one of the things to remember is that each one of these racks is routed; they're on their own subnet. So you get into a situation with, for example, keepalived, which doesn't want to transfer the IPs, the VIP, over. But the fact is it's there, so we fixed that with other configuration, and I'll show you that in a minute. Our OpenStack cluster runs on Ubuntu, but our object store actually runs on RHEL.
A
Now, there's no particular reason why we did that; it was more or less an olive branch, because we have a lot of storage groups and a lot of other groups within the company, and a lot of those guys use RHEL, so they're comfortable with it. I wanted to get more of those guys involved so that it would help us, especially with the ops side of it, because I didn't want to sit and do the ops stuff all the time, so I got those guys involved.
A
So we basically put it on RHEL. Now, in this picture you see that the top part is our ToR switch, and then we have three 1U nodes. The 1U nodes are basically our Mon nodes, our RADOS Gateways and our load balancers, and then the other 17 are 2U nodes, and those are all of our OSD nodes. Now, that's important, because I just came from a talk (which was a great talk) from Comcast, and they are actually running large-density servers.
A
I think roughly about 72 drives per node. In our case we actually have 12 drives in these 2U boxes, 12 spinners, and they're six-terabyte drives, and then we have two SSD journals. The interesting thing about that is that those SSDs are not just journals; they're co-located, or co-hosting so to speak, with the OS itself. Basically, a small portion of the front of each SSD is the OS, RAIDed to the second one, and then we have six journals on one SSD and six journals on the other.
A
The journal sizes are larger; we went a little larger on our journal sizes because we had the space, so they're running at 20 gig. Also, the interfaces: we have two NICs, so two ports. One is for the cluster side, and it's 10 Gb, and one is for the public side, and it's 10 Gb. Now, the RADOS Gateways and Mons don't use the cluster side; the only thing they actually use is the public side, so in that case we're bonding those ports, and we're bonding them in basically LACP mode.
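As a rough sketch of how those pieces land in ceph.conf (the subnets below are placeholders, not Bloomberg's actual ranges; osd journal size is expressed in MB, so 20 GB is 20480):

    # /etc/ceph/ceph.conf (sketch)
    [global]
    public network  = 192.0.2.0/24      # client / RGW / Mon traffic, 10 Gb
    cluster network = 198.51.100.0/24   # OSD replication traffic, 10 Gb

    [osd]
    osd journal size = 20480            # ~20 GB journal partitions on the SSDs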
A
This is pretty much some of the things we've talked about, but one of the things I skipped over there that I was going to talk about a little over here is how we actually get scale. Our OpenStack clusters are currently running all replicas, three replicas on straight SSDs, but with the object store, what we've done is all erasure coding. And trying to find out what you should do with erasure coding... you know, there are not a whole lot of docs out there.
A
There's
not
a
good
explanation
of
why
you
do
this,
why
you
don't
do
this
etcetera,
etcetera?
So
a
lot
of
this
stuff
was
trial
and
error
and
then,
of
course,
bringing
in
some
folks
from
Red
Hat
to
take
a
look
at
it
and
see.
Then
we
came
up
with
some
combinations
from
there,
but
the
interesting
thing
too,
on
the
rightest
gateway
side
is,
you
know
if
you've
ever
created
a
rightist
gateway?
A
It creates a pool set; we're roughly about 14 pools, approximately, and each of those has some sort of function within RADOS Gateway. The most important one is the bucket pool, the .rgw.buckets pool. That is the only one that you actually do erasure coding on; the rest of them are replicated, and you'll see some of that here in a minute.
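For reference, the default pool set from that era looks roughly like the sketch below (hammer-era names; the exact list varies by release), and only the bucket data pool gets the erasure-coded treatment. The profile name in the pool create is a placeholder, defined later in the talk:

    # the pools a RADOS Gateway instance creates (sketch of hammer-era defaults)
    rados lspools
    # .rgw.root  .rgw.control  .rgw  .rgw.gc  .rgw.buckets  .rgw.buckets.index
    # .log  .usage  .users  .users.email  .users.swift  .users.uid  ...

    # only the bucket *data* pool is erasure coded; index/metadata pools stay replicated
    ceph osd pool create .rgw.buckets 2048 2048 erasure <ec-profile>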
Also, another important thing that we've done, and it was also mentioned in the other talk, is on the OSD nodes.
A
We run hardware controllers on them, because there was a significant performance increase versus trying to do software pieces or onboard controllers, etc. Actually, one of the guys here started doing some testing with some new equipment they got in, and apparently somewhere along the line we didn't have a controller, and they were like, "hey, why are these drives and everything, supposedly better than all this other stuff, slower than our current architecture?" And then we got in and started digging.
A
That's where we're heading; we're trying to get everything toward a non-vendor solution. So we set up BIRD for BGP so that it advertises to its peers, and its peers are the spines, and then the rest of the network can recognize where they are. Now, in doing that, we don't want to advertise the secondary, because things will get confused. What happens is that when you start doing RADOS Gateway calls, you start doing different things, and you'll see connections just drop. And what it was...
A
It was the routes; everything else was getting confused. There's a configuration setting that you can do in BIRD which basically makes this secondary advertise sort of as a primary, and that worked out pretty well.
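I don't know the exact BIRD option he means, but the overall shape of a setup like this is roughly the following sketch (placeholder ASNs, addresses and filter names, simplified to a single spine peer), where only the intended VIP is exported to the spines:

    # /etc/bird/bird.conf (sketch)
    protocol direct {
        interface "lo";                          # the VIP lives on a loopback alias
    }

    filter export_vip {
        if net = 203.0.113.10/32 then accept;    # advertise only the service VIP
        reject;                                  # keep secondary addresses off the fabric
    }

    protocol bgp spine1 {
        local as 64601;
        neighbor 10.1.0.1 as 64512;              # ToR / spine peer
        export filter export_vip;
        import none;
    }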
Now
the
radio's
gateway.
We
run
multiple
instances,
irate
us
gateway,
/
raid,
us
gateway,
node
and
remember,
institute
this
it's
a
1u
box.
It
has
210
gig
ports
and
it
has
256
gigs
of
ram.
A
Now, the original spec called for 128 GB, but when they came in they had 256, and I wasn't going to turn that away, so I kept it and just didn't say anything; life's good. So we were kind of looking at those: hey, why don't we start doing something a little different? We can actually approach this a little differently. And so in that case we started doing some investigating on how we can run multiple instances of RADOS Gateway on a single node.
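One common way to do that (a sketch; the instance names, ports and DNS name below are made up, and I don't know the exact layout they use) is to give each instance its own client section and Civetweb port in ceph.conf and run one radosgw process per section:

    # /etc/ceph/ceph.conf (sketch)
    [client.rgw.gw01-a]
    host = gw01
    rgw frontends = civetweb port=7480
    rgw dns name = objects.example.com

    [client.rgw.gw01-b]
    host = gw01
    rgw frontends = civetweb port=7481
    rgw dns name = objects.example.com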
A
We split it out from the standpoint of... and the reason for this is: we have private networks in our group, and we can't let this private network see what's going over in that private network, etc. Typically all of our clusters are inside a private network within themselves. So if private network A wants a cluster, then we have to build out a whole cluster, because of some security standpoints. In the past, the way it was all worked out, you couldn't do that.
A
You
couldn't
say
here's
a
converged
piece
of
hardware.
Why
don't
you
all
come
into
to
that
so,
but
with
the
object
store,
that's
the
very
first
product
within
Bloomberg.
That's
been
allowed
to
do
that,
so
we
have
now
a
centralized
server
cluster
that
now
takes
access
from
private
network,
a
private
network
BCD
how
whatever
and
so
each
one
of
those
ratos
gateway
boxes
are
also
weighted,
and
so
and
the
reason
for
that
is.
We
want
to
be
able
to
scale
them
out
if
we
need
to
or
if
we
have
failures
we
can
set
up.
A
For example, we could set up OSD nodes as a lower-weighted RADOS Gateway if we have to, and the same goes for other things too, even Mons; you can see that in here in just a bit. So each one of the load balancers basically weights this and puts it on a different port: if the VIP is coming in off of private network A, it goes over here, to one port.
A
If it's private network B, it goes to a different port. Now, here are some important configuration pieces. It was pointed out in the last session that Ceph has, I don't know how many, but so many knobs you can't even count. And so you do this, you do that, you turn one over here and something else happens, and you're kind of like, "this doesn't make sense." Sometimes it doesn't, and sometimes it does, so you have to actually tune it for your given environment.
A
So one of the things that we've looked at, what I was telling you before, is that each rack is on its own subnet. Cool thing, you know, because all the examples you see show it in one aggregate, like a /24 or whatever, but we actually have it in /27s for that, and they're all routed, all routable. Now, our OSDs:
A
We,
those
are
all
x
FS
with
onboard
controllers
like
we
talked
about
and
then
our
rate
of
skate
wave
components,
one
of
the
things
that
we're
testing
right
now
we
haven't
implemented
it
yet.
But
what
we're
testing
is
the
Federation
with
regions
and
zones,
and
we've
got
another
cluster
that
we're
about
to
stand
up.
So
we
can
do
that
and
then,
of
course,
the
eraser
coding
pieces
and
the
thing
to
keep
in
mind
about
eraser
Kody,
it's
different
than
replicas
replicas,
like
did
did
simple
I
mean
you
got
one
object,
another
object
and
another
object.
A
Erasure coding is different. The CRUSH maps are different; everything about it from that standpoint is different, and so you actually have to have a reasonably good custom CRUSH map. We have two rules that we created: one is for the replicated pools and one is for the erasure-coded pools, and what we've done is we've done it by racks, then by hosts, then by OSDs, so that we can distribute the load, because the whole thing about Ceph is data distribution.
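As a rough illustration of what a pair of rules like that can look like in a decompiled CRUSH map (the bucket names, ruleset numbers and step counts below are illustrative, not the production map):

    rule replicated_racks {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack      # one replica per rack
        step emit
    }

    rule ec_hosts {
        ruleset 2
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step take default
        step choose indep 0 type rack           # spread chunks across racks first...
        step chooseleaf indep 2 type host       # ...then across hosts within each rack
        step emit
    }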
A
That's the whole point. You don't want one rack being full, or almost full, over here, while you have two or three other racks that are not nearly full, and you're kind of like, "why is all this funky stuff going on?" Well, this comes back to the CRUSH map. Also, one thing to keep in mind, and I would recommend this with almost any configuration, is to do sharding on your bucket indexes.
A
Now, that's an interesting thing, and the way it works is that there's a setting, I think it's right below there, where you basically just tell it something like a max shards of five. Now, that five is just a sample. And you're going to see something too, because everything we do is all open source. Everything is open source, except for the data of our given machines itself, such as the MAC addresses and IPs and all that other good stuff. But everything else is open source, and you can take it.
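The knob being described is the bucket index shard count on the RADOS Gateway side; a sketch of it (the value 5 is just the sample number from the talk):

    # /etc/ceph/ceph.conf on the RADOS Gateway nodes (sketch)
    [client.rgw.gw01-a]
    rgw override bucket index max shards = 5    # newly created buckets get 5 index shards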
A
So you may have some of your buckets set at five, and then you say, "you know, I need to go up a little higher," and you set it to ten. But all your previous buckets (or pools; in this particular case, buckets) will still be at five; all your new ones will take on the new configuration. It's not going to go back and change any of the old stuff.
Also, Civetweb.
A
We kind of booted Apache out; it was a memory hog, among a lot of other things, and the front-end piece is now Civetweb. This is an interesting little piece, because a lot of people don't know about it, or don't know that you can do it. It's really nothing, it's just left out of the docs: Civetweb itself has a lot of options, a lot of different config settings that you can use. You just have to go...
A
Look at the Civetweb project, and you start seeing all those different components. Then you come back, you look at some of the code within the RADOS Gateway, and you say, "oh, I can use that, and I can use that." The one there that says number of threads equals 100 is actually a default setting. I was playing with increasing and decreasing it, looking to see what happened, etc., but then I just left it in there; that 100 is actually the default setting within Civetweb.
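So the frontends line ends up looking something like this (a sketch; the port is a placeholder, and 100 threads is, as noted, Civetweb's default anyway):

    [client.rgw.gw01-a]
    rgw frontends = civetweb port=7480 num_threads=100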
A
Network. All right, most everything you'll find performance-wise basically falls into these two pieces: the load balancers and your network. Like I said before, the RADOS Gateways, the Mons and the load balancers are all bonded; we have two 10 Gb ports, and they're bonded with mode 4, for aggregation, in this particular case. Also, we're using jumbo frames, so we set the MTU to 9000. You definitely want to do that on your cluster network; that's a given. And one of the reasons for that is...
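On an Ubuntu-style ifupdown setup that bond looks roughly like this sketch (interface names and the address are placeholders; mode 4 is 802.3ad/LACP):

    # /etc/network/interfaces (sketch)
    auto bond0
    iface bond0 inet static
        address 192.0.2.21
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-mode 802.3ad       # "mode 4", LACP aggregation
        bond-miimon 100
        mtu 9000                # jumbo frames, especially on the cluster network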
A
Last year we were doing some testing on our clusters that were in our DMZ, and I was comparing it to S3. I had to do that because one of our customers was doing stuff and saying, "hey, we're going to move over here, we're going to do this," whatever, and so we needed to get good benchmarks. I think it was Canonical...
A
Not only was it the cost that we had to look at for a baseline, the price, but we also had to look at the performance, because our customers say, "hey, I get faster doing this," or "I get faster over here." Now, that's only on, like I said, our DMZ side. We have many private, secure networks, so that's never a factor in that particular case, but on the other side it is.
A
If
you
do,
you
know,
oh
and
the
other
thing
about
the
MTU
9000,
so
I
had
problems
last
year,
so
it
basically
takes
a
cloud
to
test
the
cloud.
So
I
was
testing
a
lot
of
different
things
with
jmeter
from
Amazon
back
into
our
cluster
and
then
from
Amazon
to
s3
and
by
the
tweaks
we've
made
with
rate
of
skate
weight.
I
had
parody
with
s3
I
saw.
The
irony
was
I'm
onna
ec2,
comparing
myself
coming
back
to
our
DMZ
and
then
go
into
a
closed
region
for
EC.
A
Obviously, everything we do... I mean, we can't even approach this without automation. There's just absolutely no way. In the last talk they were talking about tweaking this hardware to this hardware and this hardware, and again, that was a density node, those were purpose-built components, and you had the time. In our case we're using lower density, because I don't care: a machine fails, throw another one in, file a ticket, get it in. A drive fails, file a ticket, get it in.
A
So
we
have
spares
for
those
very
reasons
to
do
that,
and
so
we're
not
tweaking
all
the
hardware
everywhere
we
can,
unless
it
can
be
fully
automated.
If
it
can
be
fully
automated,
it
makes
sense,
then
we
definitely
do
that.
So
the
and
all
this
stuff
isn't
shift.
Now
you
see
the
first
one
there,
that's
our
Bloomberg
OpenStack.
It
was
originally
called
B
CPC.
A
It's
still
called
that,
but
this
made
about
400
other
changes
in
veins
and
stuff,
so
it's
BCC
now
for
Bloomberg
cloud
compute
because
and
then
we've
matched
with
our
Bloomberg
object:
storage
to
blue
blue
blue
cloud
storage,
and
so,
if
you
see,
if
you
go
to
the
github
up
there,
you
can
actually
go
ahead
and
clone
it.
You
can
do
it
right
now.
A
I
did
it
while
ago
and
the
other
talk,
because
I
was
just
testing
the
performance
of
the
network
and
all
that
and
I
even
built
a
self
cluster
on
the
lap
this
laptop
while
I
was
sitting
in
the
other
session,
so
you
can
clone
it.
You
can
build
you
and
then
basically
run
the
the
vagrant
up
or
actually
there's
a
couple
other
things
that
you
would
do
and
that
would
build
out
a
full
open
stack
along
with
Ceph
in
that
particular
stereo.
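The rough flow is the sketch below; the repository path is the one shown on the slide as best I can tell, and, as he says, the real bootstrap has a couple of extra prep steps covered in the repo's README:

    # sketch: bring up a local dev cluster in VirtualBox
    git clone https://github.com/bloomberg/chef-bcs.git
    cd chef-bcs
    # ...a couple of bootstrap/prep steps go here (see the README)...
    vagrant up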
A
The other interesting thing here, if you look at the second one, is ceph-chef. If you look at the GitHub on that, it's actually managed under the Ceph repo; we created it, and we're basically the admins for it. That is a complete cookbook that will give you everything that you see here, plus more, for everything, including CephFS, etc. And then the next piece is our object store, which I'm talking about, and that actually implements ceph-chef.
A
So
in
essence,
what
the
the
second
the
the
chef
BCS
does,
is
it
actually,
when
you
basically
say
set
up
in
this
case
for
development
environment,
because
we
do
everything
on
VirtualBox
for
our
development
and
then
we
roll
it
into
our
hardware.
Inside
of
it,
it
will
actually
go
out
grab
all
the
cookbook
there
have
all
the
dependencies
grab
all
the
packages,
everything
it
needs,
because
in
our
scenario,
we
cannot
out
the
outside
world.
So
look
for
security
reasons,
etcetera,
etc.
A
So
we
we
operate
behind
lots
of
proxies
and
you
name
it
and
in
like
I
said
all
of
this
stuff
is
on
github,
completely
free
go
out,
get
it
right
now.
Actually
that
would
be
awesome.
Man,
issues
and
pull
requests.
They'll
be
really
good,
because
there's
a
lot
of
enhancements
out
there
that
need
to
be
made
to
everything.
A
So
here
it
talks
about
again
the
just
so
you
know,
I
wasn't
kidding
your
screenshot
of
the
actual
github
page
the
same
here
for
the
cloud
storage
and
then
our
BC
PC.
This
is
a
much
larger
project,
obviously
because
it
has
all
of
the
OpenStack
components,
etc.
But
again
it
has
the
configurations
and
things
necessary
to
build
out
full
clusters
because
I
don't
want
to
like
in
the
storage
side.
A
I
don't
want
to
build
one
way
on
hard
or
over
Harold
vagrant
and
all
this
other
stuff
and
then
do
something
different
completely
different
on
the
hardware.
So
we
try
to
keep
everything
as
close
as
possible,
but
there's
some
some
challenges
with
that
too,
especially
when
you
get
into
the
networking
side
so
that
you
can't
do
with
the
VirtualBox
side
of
it
and
then
you
actually
have
to
say:
hey
I've
got
if
actually
put
this
on
hardware
to
see
what
happens
now.
A
I couldn't answer that well, and I got a little tired of kind of answering that question, or not answering that question, and so I thought, you know what, I'm just going to create it: go through the formulas, set it up so that you can actually plug in the number of OSDs you're going to have, what size they're going to be, etc. It shows you what your capacity is, and then allows you to...
A
The interesting thing here, if you see, and this gives you a better vision because I'm visual, is that it gives me a better picture of what my capacity is going to be. So, for example, I know that if I increase my K or decrease my M, then I'm going to better utilize my storage, but there are trade-offs there. Remember I was talking about replicas: "hey, copy A, B, and C," simple, no problem.
A
Erasure coding, for example: let's use 10 for easy math. I have a 10 gig file out there, and I say, okay, I'm going to do a K of, say, 10 (or 5, or any of those, really). In that case, that 10 gig gets split evenly into K pieces, in this case 10, so now I have 10 objects that are going to be floating around the storage, and I have to do something with them.
A
Do you have K number of hosts? Things of this nature, depending on what your failure domain is and your erasure-coded profile. And this gives you the ability to say, okay, I'm going to trade off a given pool for whatever reason, so I can make my storage more efficient, or I want to see something a little different. But if I go back to that: okay, well, one thing I didn't mention was the M side.
A
This
is
important
because
those
are
your
parody
chunks,
so
it
takes
see
they
say
in
the
scenario:
I
have
my
10
gig.
It
takes
that
divides
that,
for
example,
by
K
and
then
it'll
add
any
buffering.
So
it's
all
even
and
so
you've
got
a
one
gig
a
10-1
gigs,
but
then
all
of
a
sudden-
but
let's
say
I-
have
five
set
at
my
M
on
my
inside.
That's
my
parody
side,
you're
also
going
to
see
5,
1,
gig
pieces
or
other
chunks.
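Putting those example numbers together (k = 10, m = 5, a 10 gig object):

    chunk size   = S / k           = 10 GB / 10      = 1 GB   (10 data chunks)
    parity       = m chunks        = 5 x 1 GB        = 5 GB   (5 coding chunks)
    raw on disk  = S x (k + m) / k = 10 GB x 15 / 10 = 15 GB  (1.5x overhead)
    compare 3x replication:         10 GB x 3        = 30 GB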
A
The
this
gives
you
kind
of
a
what
I
was
talking
about
before
and
this.
Actually,
these
numbers
is
percentage-wise,
they're
actually
taken
from
the
PG
cow
calculator,
and
you
know
you
go
to
set
comm
/
PG
calc
and
you
can
go
in
there
and
kind
of
set
and
play
with
okay.
How
do
I
look
and
set
up
my
pools
or
my
PJs
and
those
fish
nature
and
inside
us
inside
the
cookbook?
A
If
it
sells,
it
actually
implements
the
that
same
PG
calculator
inside
the
cookbook,
but
it
gives
you
another
option
because,
instead
of
doing
to
the
nearest
power,
you
can
also
say:
hey
I
want
to
go
to
a
higher
power,
so
you
have
the
option
to
do
whichever
way
you
want
and
that
will
actually
help
you
with
your
PG
distribution,
but
the
you
can
see
the
amount
of
data
that
each
of
these
pools
in
a
typical
just
plain,
simple,
out-of-the-box,
raitis
gateway
pull
set.
The
is
96.9%
of
that
data
is
stored
in
your
buckets.
A
Okay, so now you're kind of looking at erasure coding, because I know everybody wants to maximize their dollars. They want to maximize, but they don't have a whole lot of information on it. It sounds really complicated, etc., and it sort of kind of is when you start to look at it for the first time. But then you start working through it and looking at how the objects are laid out within Ceph.
A
It
begins
making
a
lot
of
sense,
and
so
it
becomes
more
and
more
clear
and
then
you
can
actually
begin
doing
some
really
neat
things
with
your
crush
map
and
distributing
your
load
a
lot
better.
You
in
the
main
thing
to
with
your
extra
coding
pieces,
is
think
about
your
failure.
Domains
like,
for
example,
a
typical
failure
domain
and
the
replica
is
Iraq.
A
That's
a
typical
thing:
people
try
to
do
say:
I
can
lose
a
rack
or
whatever,
and
it
doesn't
bother
me
so,
but
with
eraser
coating
and
that's
not
going
to
be
your
failure
domain
most
likely
I
definitely
won't
unless
you
have
a
whole
Iraq's.
So,
in
this
particular
case,
our
failure
domain
is,
is
a
node,
an
actual
storage
node.
You
can
make
them
as
iOS
D
or
at
different
pieces,
and
so
what
happens
that?
A
The
reason
why
that's
important
is
because
your
crush
map
and
oil
that
will
try
to
basically
keep
going
without
so
it
doesn't
repeat
an
object
of
one
of
those
aggregate
objects
inside
of
the
same
house.
Now
that
would
be
horrible
horrible,
because
what
happens
when
that
house
goes
out?
Well,
you
just
you
just
lost,
you
lost
your
data,
so
instead
it
will
actually
try
to
disperse
that,
etc.
So,
there's
a
couple
different
ways
to
kind
of
check
on
this
play
with
it
see,
see
the
settings
etc.
One
is
again
you
set
your
profiles.
A
You can set as many of these profiles as you want. The defaults, all these defaults are, for example: the plugin is jerasure. You can change that; I left it at jerasure in that particular case. And so you can set these values, like the 10 I was using in that scenario I was just talking about, and I set the failure domain to host. You could create multiples of them.
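That step looks roughly like this (the profile name is a placeholder, and the failure-domain key varies by release: older releases spell it ruleset-failure-domain, newer ones crush-failure-domain):

    # sketch: define an EC profile; nothing is applied until a pool uses it
    ceph osd erasure-code-profile set bb-k10-m5 \
        k=10 m=5 plugin=jerasure ruleset-failure-domain=host

    ceph osd erasure-code-profile ls
    ceph osd erasure-code-profile get bb-k10-m5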
A
It doesn't matter; they're just profiles, nothing happens at this stage. Then you come down. Now, what I like to do, because we manage our whole CRUSH map ourselves: we set "osd crush update on start" to false, and all that stuff. Because of that, if you start having OSDs go down and start flapping and all that, then you have to kind of roll them back a little bit, and I don't like to do that; I don't want to do that.
A
It's less work; it works better for me. So what I typically do before I do any of this is set the noout flag, in that particular case, because I want to keep everything up; I don't want to have to go back and fix it later because of my playing with it. Now, of course, that's going to say health warning, but that's only because you set a flag. Then you create a pool, and so you create a new pool.
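In command form, that sequence is roughly (the pool name, PG counts and profile name are placeholders):

    # keep CRUSH from rebalancing while you experiment
    ceph osd set noout            # the cluster shows HEALTH_WARN until the flag is unset

    # create the new erasure-coded pool against the profile defined earlier
    ceph osd pool create ec-test 1024 1024 erasure bb-k10-m5

    # when you are done playing
    ceph osd unset noout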
A
Then you come down and you start looking at your... you want to get your PG numbers. After everything goes out and starts building your pools, that's good, but you want to see where they're distributed, because you want to make sure that you don't have things, like I said, all in one host, or even all in one rack. Because it's actually tempting, when you say, "oh wow..."
A
"Okay! My first erasure-coded pool, cool, great, we have Ceph support for erasure coding." But then you start looking at it: oh no, everything's in one rack, so if this rack goes out I'm hosed. So you've got to start playing a little bit. Taking this scenario, you want to look at your pools, your OSDs, so: ceph osd lspools. You can do an osd dump on that and just look at the top part as well, and the PGs and all that are going to be...
A
You
know
unique
integers,
it's
a
plus
some
other
things
behind
the
decimal,
but
the
primary
part
is
the
actually
I'm
talking
about
the
pulsar,
and
you
want
to
get
that
the
pull
ID
and
then
you
want
to
do
like
a
PG
dump,
and
then
you
want
to
grab
on
the
PG.
Never
has
a
couple
like
I
said:
there's
several
different
ways:
you
do
it.
This
is
why
I
do
it
and
then
what
that
does
is
that
shows
me?
A
Okay,
now,
I'm,
only
looking
at
the
my
PG
map
are
dumped
on
just
the
pool
I'm
interested
in
because
it
doesn't
have
any
data
in
it
now.
But
it's
going
to
tell
you
where
it's
mapping
to
what
OS,
DS
and
so
you'll
see
something
like
like
in
this
case,
1005
212.
So
it's
on
basically
five.
Just
in
this
scenario:
5
OS,
DS
and
then
you
can
find
out.
A
You
know,
do
a
semi,
oh
s,
D
find
and
then
you
can
find
out
where,
like
10
being
the
idea
of
the
first
OS,
did
you
find
out
well
host
its
own,
which
is
Iraq?
And
if
you
segmented
your
obviously
your
host
names
properly,
which
you
should
do
always
and
then
you'll
know
what
rax
etc
are
out
there.
And
then
you
start
making
adjustments
to
your
crush
map.
Based
on
that,
because
that's
where
you're
gonna
find
where
your
placements
are
and
that's
like
I
said
it's
critical.
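The inspection loop described there, roughly (the pool name, IDs and the grep pattern below are illustrative):

    # find the pool and its numeric ID
    ceph osd lspools
    ceph osd dump | head -40          # pool ids are listed near the top

    # dump the PGs for just that pool; PG ids look like "<pool-id>.<hex>"
    ceph pg dump | grep '^12\.'       # e.g. pool id 12

    # the up/acting sets show which OSDs each PG maps to; then locate an OSD
    ceph osd find 10                  # prints the host (and, with good naming, the rack)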
A
You can't skip that part, because you want to understand where the distributions are. Testing: obviously, testing is critical. With OpenStack, when we first started doing things, we used Tempest, and we actually have Rally inside of our cluster; we don't use it as often now as we did, like last year we used it a little bit. On the Ceph side: rados bench, obviously, COSBench, and then fio. And the reason for some of that is just that you want to test...
A
You
want
to
look
at
your
drives,
etc,
but
the
one
that
I
use
the
most
is
jmeter
and
the
reason
I
do
that
is
I,
set
it
up
in
a
master
slave
configuration
so
that
I
have
several
many
instances
that
are
running
and
they're
gonna
be
running
tests
which
are
going
to
query
objects
and
things
this
nature.
So
I'm
going
to
see,
what's
actually
like
a
real
use
case
of
something
that's
going
on
with
and
then
I'm
going
to
most
likely
do
random,
which
I
do
I
actually
do
range.
A
Random,
byte
range
requests
so
that
nothing
gets.
You
know
you're
not
dealing
with
one,
the
first
TN
of
each
object
or
whatever
it
is
you're
segmenting,
iid
random.
In
that
particular
case,
and
then
you
can
find
out
where
some
of
your
performance
bottlenecks
are
and
again
most
of
time,
you're
gonna
find
that
it
comes
back
to
your
network.
A
So
what
are
we
looking
at
going
forward
on
some
of
this
stuff?
We're
definitely
obviously
always
looking
for
improvements,
better
monitoring,
even
some
DevOps
pipelining,
and
also
to
the
one
thing
about
that.
The
Ceph
chef
cook
books
and
the
BCS
is
built
so
that
you
can
build
it
and
plug
it
right
into
a
pipelining
system,
something
like
go
CD
or
something
of
that
nature
or
even
Jenkins,
but
Jenkins
doesn't
do
pipelining.
All
that
great!
A
That's a debate, don't throw stuff at me. And then where we're going is toward non-vendor solutions; that's kind of where we're heading with some of this. We're definitely looking for performance improvements no matter what, and we're looking at the better multi-tenancy capabilities in Jewel; that's where some of that's being laid down a little better. We're going to be testing that. And just so you also know...
A
So that way your OSDs can communicate a little faster, directly, without going through all the tiers, the other networking components, etc. And that's it. And here, again, I just want to put this back up here. The guy yesterday said something about Twitter: "hey, I don't tweet much, but here it is, I'll put some stuff out there." Same here: I don't tweet much, but the findings that I have, the things that I'm allowed to share of our findings with these test clusters, etc.
B
Hi, okay, so your analysis: brilliant, thank you so much. But what you're using as hardware looks like, or sounds to me like, enterprise-grade SSD, right? (That's true.) So if someone asks you, for example a telco service provider, "hey, I want to swap, change my tape solution into a Ceph-based low-cost storage..."
A
That's the first thing: no, that's actually true, and the reason I say that: this talk here is part of a much larger talk, and it talks about how you have to have buy-in along the way. Everybody has to realize you're going to give and take; that's just the way it is, and so you've got to be able to change your pain tolerance so you can get there. So what you have to look at, you know, with SSDs: you can talk to some of these vendors that are out here.
A
They
know
what
way
better
than
I
do,
but
you
have
basically
they
can
only
write
so
many
times,
and
you
know
your
your
things
of
this
nature.
The
mean
time
between
failures
is
really
low
on
consumer
grade,
but
can
they
be
used
yeah,
but
I
just
really
depends
on
your
use
case.
I
mean
just
really
look
at
your
use
case.
I'm,
not
saying
don't
use
do
it
because
you
can
do
anything.
You
want.
C
I'm just talking about the question context, that was the first part, okay. And asking about which bucket type did you use for the replica-based pool; I mean, basically, you know, there are a bunch, like list or straw or tree or stuff like that. Do you have a rack, or what is your CRUSH map definition there?
A
See
the
so
here
it
gives
you
a
an
example
of
some
of
the
Tings
tunings
we're
actually
using
the
second
release
of
the
straw
calculation
for
tunable.
So
this
is
our
base,
and
this
this
is
actually
because
it's
the
saying
is:
it's
actually
our
base,
that's
in
production
of
what
we
start
with
and
so
you'll
see
the
sets
coming
down
for
the
different
pieces,
and
then
you'll
also
see
different
and
actually
the
bottom
one
down
there
with
the
you
know,
with
the
minus
three
etc.
That's
that's
actually
not
in
our
production
face.
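For reference, a sketch of how you'd look at (or switch on) those CRUSH tunables and the newer straw buckets; the commands are standard Ceph tooling, but the values shown are not Bloomberg's actual map:

    # dump and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # in the decompiled map, buckets using the newer straw algorithm look like:
    #   host node01 {
    #       alg straw2
    #       ...
    #   }

    # or pick a tunables profile that enables straw2 support cluster-wide
    ceph osd crush tunables hammer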
A
I was actually just testing some stuff on Vagrant this morning on that build. But from there, what happens is, in the ceph-chef piece, in the OSD sections: if you have erasure coding enabled in the cookbook, then when it creates the OSD, it moves that OSD into the appropriate slot inside your CRUSH map tree, and then it balances based on those weights, on the rack and on the nodes, etc. Okay.
D
[inaudible question]

A
So that's a good question. In essence, what you can do right now, without trying to do any tuning, is get a baseline with what you have. You get a baseline on your performance, you get a baseline on how you're delivering everything, and then from there start tweaking it a little bit and see what the deltas are from where you were.
A
Your tweaks, hopefully, are in a positive direction and not a negative direction, and then you can actually compare that, and you have a better idea: when you get better throughput, etc., and you have a little lower-latency network, then you're going to be able to take advantage of it immediately. So.