From YouTube: 2015-AUG-27 -- Ceph Tech Talks: Ceph Performance

Description:
A look at performance profiling and tuning in Ceph with some recent findings and examples. http://ceph.com/ceph-tech-talks/
A: Alright, welcome everybody back to the monthly Ceph Tech Talks event here on our BlueJeans video conferencing system. For those of you that aren't familiar with the Ceph Tech Talks, these are basically a one-hour deep dive, on a technical level, into something at least tangentially related to Ceph. Thankfully, we've had some pretty core discussions thus far: we've had RADOS, the block device and the gateway, we looked at Calamari and Romana during that split, we've talked about placement groups, and we had a nice examination of CephFS in July.

A: This month it's going to be Mark Nelson, our lead performance engineer, who will be looking at the performance tuning world in Ceph and using some recent findings to help share how we arrived at some of our recent decisions and where we're going in the future. If other people would like to give a Ceph Tech Talk, anyone from the community is welcome to do so, as long as it is relatively technical and Ceph related. We're looking at potentially, in September (so, next month), doing a discussion with someone who's been working on Ceph integrations.

A: So, looking at kind of the start-to-finish on how to do a Ceph integration with some piece of custom software. But if anyone has any other ideas, feel free to contact me. Otherwise, afterwards, you can find the videos here for replays, or for linking elsewhere, via the ceph-tech-talks page on ceph.com.
B: Sure. So I will start out by saying that Patrick reminded me yesterday that I was scheduled to give this talk, and I scrambled to put slides together. So I don't know that we'll take up the full hour, but we'll go for as long as we can, and maybe we'll have time at the end for folks to ask questions if they would like to.
B
At
Red
Hat-
and
this
is
mostly
kind
of
carryover
from
from
when
we
were
ink
tank.
We
we
used
to
kind
of
major
pieces
of
software
for
doing
a
lot
of
the
performance
testing
work.
Those
are
teeth,
ology
and
cbt.
There's
actually
also
various
software
pieces
that
are
consultants
have
written
that
they
use
as
well.
Although
I
don't
actually
know
too
much
about
them.
B
We
actually
recently
went
in
and
and
kind
of
analyzed
what
technology
was
spending
most
of
his
time,
every
night
doing
and
a
lot
of
it
as
things
like
running
rados
tests,
with
thrashing
happening
in
the
background,
so
marking,
osts
down
and
increasing
PG
counts
for
different
pools
and
more
or
less
just
really
trying
to
stress
the
cluster
while
doing
various
ratos
commands
or
other
things.
At
the
same
time.
B
So,
let's
cap
technology,
DBT
was
written
a
couple
of
years
ago,
as
kind
of
a
lighter-weight
performance
benchmarking
oriented
tool
than
tooth
all
g
technology
is
really
kind
of
big
and
does
a
lot
of
things
like
I,
more
or
less
deploying
nodes
and
and
setting
up
software
and
doing
a
lot
of
that.
That
kind
of
infrastructure
related
thing
cbt
doesn't
do
any
of
that.
B
It
doesn't
do
any
kind
of
like
software
installation,
but
it
will
at
least
automate
the
the
ceph
portion
of
the
set
up
and
running
the
tests
we
use
CBT
quite
a
bit
in
the
engineering
group.
It's
also
used
by
the
reference
architecture
team
inside
Red
Hat
for
like
testing
on
partner
lab
equipment
and
then
also
by
our
QE
group
for
doing
like
nightly
performance
regression,
tests
against
red
hats,
f,
storage,
there's
a
lot
of
different
tests
that
we
use
CBT
for
it's
everything
from
looking
at
cash
turn
performance
under
different
scenarios.
B
To
actually
benchmarking
some
of
the
functional
tests
that
are
run
with
ecology
under
different
scenarios,
so
we
use
it
for
quite
a
few
different
things
is
an
open
source
tool,
its
uses,
yeah
mph
or
the
configuration
files,
and
it
can
also
run
a
lot
of
monitoring
in
the
background,
so
collect
L
is
a
tool.
That's
used
quite
a
bit
for
looking
at,
like
no
statistics
and
cbt
will
run
that
for
every
single
test.
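As a rough illustration, here is a minimal sketch of what such a CBT YAML file can look like; this is my own example rather than a file from the talk, so the host names and values are made up, and the key spellings follow my reading of CBT's examples:

    cluster:
      user: 'ceph'
      head: 'node1'                  # node CBT drives the run from
      clients: ['node1']             # benchmark client hosts
      osds: ['node1', 'node2']       # OSD hosts
      mons:
        node1:
          a: '192.168.1.1:6789'      # example monitor address
      iterations: 1
    benchmarks:
      radosbench:
        op_size: [4194304, 4096]     # object sizes to sweep
        time: 300
        concurrent_ops: [128]

The collectl monitoring mentioned above then runs in the background for each test without extra configuration.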
B: ...the Inktank performance lab back in the day, but that was more or less just a single 4U Supermicro node that I swapped configurations around on, left and right, trying to determine what different things made Ceph perform well and what kinds of hardware configurations did not perform well. Just recently, in the last couple of months, Intel very generously donated eight really high-performance nodes to the Ceph community, which Red Hat is hosting in the Ceph community lab. Those nodes have ten spinning disks each, although those aren't necessarily very interesting at this point from a performance perspective, because there are a lot of other nodes out there that have similar configurations. But each one of these nodes has four 800-gigabyte NVMe SSDs in it, which is fantastic: there's a lot of back-end throughput and IOPS that we can now, finally, test Ceph on, on a regular basis, trying to determine how to improve performance on this kind of a setup.
B: These nodes have reasonably fast processors, and each one also has a dual 40-gigabit QSFP+ Ethernet adapter in it. Luckily for us, Mellanox recently provided us with a 12-port 40-gigabit switch, so we were able to hook all these nodes up together and actually get really pretty good performance with them. They also have 64 gigs of DDR4 memory, which is very nice; pretty reasonable, actually, for how fast these things are and how much disk they have.
B: So we just got those nodes set up right before the first Ceph hackathon took place earlier this month in Portland, and we were looking for an initial test case. We had done some very early initial performance tests on it using fio and rados bench through CBT, but we didn't have any really good example workload for this thing. So when we went to the hackathon, we were kind of looking to see: well, what could we do with these at the hackathon?
B: ...also looking at both of these different memory allocators, and actually they had found a bug in TCMalloc where you couldn't change the thread cache settings in the version of TCMalloc that was distributed with almost all distributions out there right now: Ubuntu, RHEL, CentOS, I think Debian as well. And (and this was a big deal) they found that the default thread cache values in TCMalloc were actually causing Ceph to not perform well for small I/O, and Intel's results that they showed at the hackathon reiterated those findings; in fact, they went into it in a little bit more detail.
B: So when I saw that Intel had posted these numbers, I became really interested and thought: well, hey, we're here at the hackathon, we've got these nodes, we're looking for a test case for them. Maybe we can use CBT to try to replicate the findings that both SanDisk and Intel had seen. So, basically, within a couple of hours we sat down at the hackathon and created a CBT configuration that let us try to do that. So in CBT, you define everything in the YAML file...
B: ...but one thing I did want to point out here is that when you're setting up the cluster, you can specify the ceph-osd command to run, and in this case I'm actually using a precompiled version of Ceph that was from Git. As part of the ceph-osd command, I'm setting an environment variable here, in this case to change the TCMalloc thread cache settings. You could also potentially use LD_PRELOAD here to set the memory allocator to use for the ceph-osd process.
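For example, a cluster section along these lines is the kind of thing being described; the osd_cmd key name and the library path below are assumptions on my part, and the 128 MB value is just illustrative:

    cluster:
      # Launch the OSD with a larger TCMalloc thread cache (128 MB):
      osd_cmd: 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 ceph-osd'
      # Or swap the allocator at runtime instead of recompiling:
      # osd_cmd: 'LD_PRELOAD=/usr/lib/libjemalloc.so.1 ceph-osd'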
B: I didn't actually do that here; I recompiled Ceph to just use the different memory allocator for each run, but potentially you could use LD_PRELOAD. Some people might consider this kind of a lack of input validation, which is very true on CBT's part, but it does let you do useful things like this. So it's a little bit of a mixed bag; hopefully no one would be doing anything really terrible with CBT...
B: ...in terms of passing malicious things in, but it is kind of nice to be able to change the environment variables that you're passing into the processes.
B: So in each case, for every single test that we ran, or every single different memory allocator configuration that we tested, we rebuilt the cluster using CBT the exact same way. Every single cluster was configured using the exact same steps, and that's one of the things that CBT buys you: if you're going to be doing a lot of repeated testing, you can make sure that clusters are set up the same way. You can actually tell CBT to rebuild the cluster before every single test.
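That option is a single flag in the cluster section, sketched here under the name rebuild_every_test, which is my understanding of how CBT spells it:

    cluster:
      rebuild_every_test: True   # tear down and redeploy Ceph before each test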
B: So when we ran through these tests for the hackathon on the performance cluster, we ran through a whole ton of different tests: random writes, random reads, sequential writes, sequential reads, and mixed I/O tests, at various different sizes (four megabytes, 128 kilobytes, four kilobytes); just a whole battery of tests.
B: I have not included all of those, because it'd take hours to go through all of them, but the most important, or most interesting, ones were the small I/O tests, at least regarding the different memory allocators that we looked at. A couple of things that we found out from this testing: one is that Ceph is really hard on memory allocators. There are probably a lot of different opportunities for tuning here, and there are a couple of different folks that are looking into that.
B: So that's a big deal. There are a lot of folks that are running large OSD configurations, nodes that have, you know, 30 or even 60 OSDs on them, and a two-hundred or three-hundred-megabyte increase in RSS usage per OSD would push them over the expectations that they had when they built those clusters. So we need to think about how we can try to gain some of the performance that we're seeing with jemalloc without necessarily increasing memory usage that much. Those are kind of the goals going forward.
B: So let's take a look here in terms of IOPS for 4K random writes. If we look at TCMalloc 2.1, which is the default that's basically distributed with most distributions out there right now, with the default 32-megabyte thread cache size (which is actually the only size that you can use with TCMalloc 2.1), we're seeing pretty anemic IOPS: 20,000 write IOPS, considering how fast these SSDs are.
B: That's a fraction, a small fraction, of what they can do. When we switch to the newer version of TCMalloc with 128 megabytes of thread cache, it starts out really good (we're about four times faster), but it degrades over time. We see it kept trailing off and ending up somewhere around maybe 64,000 IOPS, which is still much better than it is with the current default TCMalloc. But it's kind of concerning: we don't know, with very, very long running tests, how much it may degrade.
B: It looks like it's maybe leveling off, but that is still pretty concerning. jemalloc, on the other hand, is pretty consistently about 4 to 4.1 times faster than TCMalloc 2.1. It's a big increase, and it's looking really consistent. In this case we're still CPU limited; when we look later on at the CPU results you'll see that. But potentially, if we can reduce CPU usage, or if these nodes had even faster SSDs in them, we may be able to get even more out of the SSDs than we're seeing here.
B: In this case we actually had 16 fio clients and, I want to say, maybe 512 concurrent I/Os going at once, and we were able to see a decrease in latency, as reported by fio, from around a typical 50-millisecond latency for these ops down to around 10. So a really, really good improvement, though still not as low as we'd like to see. Again, this is CPU limited, so that's where we think we're just backing up: we're processing as many I/Os as possible, and things are waiting.
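For reference, a client setup like that would live in the benchmarks section of the CBT YAML, roughly as below; librbdfio is a real CBT benchmark, but the particular key names and values here are my reconstruction of the test described, not the actual configuration used:

    benchmarks:
      librbdfio:                # fio with the librbd engine
        time: 300
        mode: 'randwrite'
        op_size: 4096           # 4K random writes
        concurrent_procs: 16    # 16 fio client processes
        iodepth: 32             # 16 clients x 32 = 512 I/Os in flight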
B: So when we look at read IOPS, which are much less CPU limited, we actually saw that around ninety-eight percent of the 4K random read I/Os were two milliseconds or less. So, at least for reads, we're getting really close to that one-millisecond mark. We're not quite there yet, but really close, so that's really good. It means that, I think, we'll be able to get there for writes too, as we continue to find ways to improve CPU utilization and just generally optimize the code.
B: ...at the same time as the OSDs: we unfortunately didn't have time to get clients set up on the 40-gig network, so we were actually running fio on the same nodes as the OSDs, and fio itself was using about five cores of CPU just for the client I/O, so we're pretty close to maxing everything out. You know, it's kind of weird that in the 64-megabyte thread cache case with TCMalloc we actually saw CPU usage be a little bit lower, more like 30 cores being used.
B: I don't know exactly why; maybe there was some other reason for that, but we're still really quite up there, using almost everything, and in the other cases we're definitely pretty much pegged.
B: So, the downside here to jemalloc is that it uses more memory. It's really fast and it looked really consistent, but this is what you pay to get that kind of performance: you see probably around 300 megabytes more RSS than with TCMalloc. When you increase the thread cache, TCMalloc also uses more memory, but it actually wasn't as high as I expected.
B: ...CBT can also mark various OSDs down, wait until the cluster is healthy, mark them back up, and wait until the cluster is healthy again. It's a really simple configuration change: basically, in the cluster section of the CBT configuration file, you specify that you want a recovery test, you specify the OSDs that you want marked down and marked back up, and then there are a couple of other parameters you can adjust.
B: We didn't adjust those in this case. Basically, there's a wait period at the beginning of the test, to wait before the OSDs get marked down, and then there's a wait period at the end of the test, for how long you want the benchmark to continue to run until the overall test completes. You can also specify that you want the recovery test to continuously mark OSDs down and back up throughout the duration of the benchmark.
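Putting that together, the recovery-test stanza looks roughly like this; the key names below are my guess at CBT's spelling rather than something quoted in the talk:

    cluster:
      recovery_test:
        osds: [0, 1]        # OSDs to mark down and later back up
        pre_time: 60        # seconds to wait before marking OSDs down
        post_time: 60       # seconds the benchmark keeps running afterward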
B: So we basically went through and ran this for TCMalloc and jemalloc, with 4K random writes again, to see what happened. And what we saw, actually, is that in all configurations, with all memory allocators, there is a big spike in RSS memory usage when recovery happens: when these OSDs are marked back up and the cluster is dealing with that before it becomes healthy again. With TCMalloc in a 32-megabyte thread cache configuration (and these numbers are actually very similar for TCMalloc 2.1)...
B: Now, having said that, if you look at how deep these graphs go in terms of time, with jemalloc recovery happened over twice as fast as with the current kind of default configuration. It was still faster than TCMalloc with 128 megabytes of thread cache, too, though not significantly faster. But that's a good gain; that's an impressive gain, right? I mean, if recovery happens that much faster, that means that your cluster is healthy more often and a larger percentage of the time. So that's really nice.
B: That's a really big benefit. So, unfortunately, there's a trade-off here: more memory, or better recovery and better performance. Our work going forward now is to figure out: can we make jemalloc or TCMalloc better? Can we reduce jemalloc's memory usage, or can we increase TCMalloc's performance without increasing memory usage? So far we've started looking at jemalloc. We've upgraded to 4.0, which just came out about a week ago, and it seems like that maybe is helping slightly, but it didn't do much in our tests.
B: We tried changing a lot of different jemalloc parameters. It lets you pass in an environment variable to set the malloc configuration, and you can change different settings that way. Everything we tried had almost no effect. It's quite possible that we were doing something wrong, so we need to figure out whether we're not doing this correctly or whether something else is happening. You can tell jemalloc to print out statistics on exit, and that will tell you what those settings are, so we tried to go that route.
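The mechanism in question is jemalloc's MALLOC_CONF environment variable, which could be fed through the same osd_cmd override sketched earlier; the specific option values below are illustrative only (whether they help is exactly what was being tested):

    cluster:
      # stats_print:true dumps jemalloc's statistics when the process exits;
      # lg_dirty_mult and narenas are examples of tunables one might try.
      osd_cmd: 'MALLOC_CONF="stats_print:true,lg_dirty_mult:8,narenas:4" ceph-osd'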
B: Unfortunately, when we issued a SIGTERM to the OSD, it caused the OSD to segfault. Since we've done that, we've had some feedback on that: specifically, we probably need to run the OSD in the foreground rather than in daemon mode, and there's also some documentation in the jemalloc man page that printing statistics can cause deadlocks if there are threads trying to allocate memory at the same time, so that could be related.
B: We tried both printing statistics through the malloc configuration setting that jemalloc provides at exit, and instrumenting statistics printing directly into the ceph-osd shutdown process, into its shutdown function. Neither worked; they're both still causing segfaults, and it's quite possible that maybe just using jemalloc in general is causing segfaults when SIGTERM is sent. I don't know yet, so there's still a lot of work there that needs to be done to try to figure out exactly what's going on here and get good statistics out of jemalloc.
B: It does provide a lot of really useful-looking statistics and also profiling data that's compatible with the gperftools utilities, so this is really nice. Potentially there's a lot of data that we can get out of this to find out more about what's going on in terms of memory allocations, and then also whether or not we're modifying settings correctly.
B: So, beyond just trying to tweak memory allocators and play with which allocators are being used for Ceph, in parallel there's an effort going on right now to try to improve Ceph's own behavior. Ceph is really, really hard on memory allocators; I think these results pretty much show that, and we've kind of known that for a while, but the improvement in performance that we're seeing here makes it a little bit more crystal clear.
B: ...it uses a thread pool with polling, and there are other things that are also being implemented, like the XIO messenger and various other things, that may affect all of this. So there's a ton of work that's happening and a ton of testing that needs to happen, and a lot of different people are looking at this. Actually, every week we have a weekly performance meeting where folks get together and discuss all of these different things that we're looking at, and people present their results.
B
So
it
you
know
if
you're
interested
certainly
feel
free
to
stop
by
I
post
every
week.
The
meeting
invitation
it's
a
wednesday
mornings
at
8am
time
so
feel
free
to
stop
by
if
you're,
interested
and
and
if
you'd
like
to
to
participate.
You
know
otherwise.
They'll
certainly
be
more
of
these
kinds
of
things
more
presentations
and
newer
versions
of
stuff.
Hopefully
we
will
be
able
to
integrate
all
of
this
and
and
really
see
dramatic
performance
increases,
especially
for
small
I.
Oh
so
that's
it!
That's
all
I've
got
well
I.
B: Yes. So, probably because no one's had time, is my guess at the right answer. I think that talking to Sage about that would be the way to go, to see what his thoughts are; he knows the code far better than I do. Actually, if he's on the call: he mentioned on the mailing list that he had done some initial investigation and was seeing stuff all over the place that needs work. Are you around, or do you have microphone access?
C: Yeah, can you hear me? Great. So, regarding the memory usage and memory behavior of Ceph: what I followed in the past was, at first, the memory usage of the rados benchmark itself, and then I realized that there is a lot of memory churn going back and forth. I quickly identified similar behavior in the entire Ceph code base, including in bufferlist. So there is a lot of work that needs to actually be done on this matter.
C
If
we
actually
want
to
have
a
lot
better
and
that
perform
at
at
all,
we
need
to
have
this
performance
and
behavior
simply
fixed.
We
made
lightly
replace
the
technology
gmail,
but
this
will
be
a
short-term
solution
that
will
work
and
will
increase
the
memory
usage
for
most
users,
but
it
won't
fix
the
root
issue
so
from.
B
Yeah
I
agree
with
each
other,
I
think
I
think
you
know
what
we're
seeing
here
right
is
a.j
milk
is
sort
of
a
band-aid
right.
It's
doing
much
much
better,
but
it's
pretty
clear
that
that
stuff
is
is
really
really
stressing
the
the
alligator
so
there's
a
lot
of
work
that
we
need
to
do
to
to
stop
that
from
happening
and
who
knows,
maybe
once
we
do
that
Jay
milk
and
TC
Malik
will
both
start
looking
more
similar
in
terms
of
their
behavior,
but
at
least
for
right,
now
kind
of
as
the
short-term
band-aid
fix.
A: In the meantime, for people that want to ask questions: some are coming in via the BlueJeans chat here, which is fine; if you want to unmute your microphone you can ask a voice question, or you could use the #ceph IRC channel, so any of those are acceptable. Looks like Brian has the next question, asking: have you done any performance testing of NewStore with HDDs compared to FileStore with SSD-based journals?
B
Yes,
we
have
so
I,
don't
remember
all
the
results
off
top
my
head,
but
what
we
have
been
seeing
with
new
store
is
that
it's
faster
in
almost
all
cases,
except
for
our
BD
style,
object
over
rights,
though
you
know,
this
is
actually
object
over
rights
in
the
general
case,
but
our
bodies,
where
you
see
it
happen,
a
lot
if
you're
doing
like
a
a
day,
a
512k
right
into
a
4
megabyte
objective
or
has
been
slower
than
file
store
in
those
cases.
B
Part
of
it
is
due
to
how
rocks
TB
does
is
write
ahead.
Logging
excuse
me
basically,
there's
there's
a
lot
of
overhead
do
to
it
recreating
a
log
or
creating
a
new
log.
Every
so
often
I,
don't
remember
what
the
default
values
are,
but
it
periodically
creates
new
log
files
and
then
gets
rid
of
the
old
ones.
B
A
sage
actually
in
the
last
couple
weeks
has
implemented
kind
of
a
hacky
change
into
rocks
DB
that
allowed
it
to
kind
of
recreate
the
log
file
in
place
and
that
I
think
he
said,
yielded
about
a
fifteen
or
twenty
percent
performance
improvement
when
he
did
that.
But
he
needs
to
kind
of
rework
that
and
present
a
formal
pull
request
over
there.
Xd
be
guys
before
that
will
make
it
in
and
we'll
need
to
do.
You
know
a
lot
more
testing
on
new
store.
B: After that happens, another alternative might be to take objects in NewStore and break them into chunks, maybe 512K or one-megabyte chunks, so that portions of the object don't actually need to be rewritten if you're doing a partial overwrite. That might help too; we'll just have to see where we end up after the RocksDB changes, and whether or not we want to go through all the work of implementing object chunking.
B
So
that's
that's
kind
of
where
we've
been
at
at
new
with
new
store,
I.
Think
a
lot
of
work
recently
has
just
been
going
into
getting
all
of
the
underlying
code
necessary
for
new
store
into
SEF,
so
that
new
store
can
be
can
be
merged
in
actually,
there's
open
pull
requests
for
that
right
now,
so
it's
all
happening
kind
of
as
we
speak.
A: [inaudible]

B: So we do do testing of cache tiering, and CBT can actually go through and take a set of OSDs that you have defined in your ceph.conf file and designate those as a cache tier. What it will do is basically modify all the CRUSH rules, create a parallel hierarchy for the cache tier, and then go through and do all of the annoying commands that you have to run to create that cache tier automatically, and let you specify that it should be a cache for some other pool.
B: So actually in CBT you specify profiles: you say that you have, say, a base pool profile and a cache tier profile, and then it uses those profiles to actually make pools during the benchmarks. If you've specified it in the expected way, then it will go through and automatically create the cache tier in CBT for the base pool that's being used for the benchmark.
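In sketch form, assuming CBT's pool_profiles mechanism works the way just described (the profile names, sizes, and the cache_profile key are my illustration, not a verified config):

    cluster:
      pool_profiles:
        basepool:                    # backing pool
          pg_size: 2048
          replication: 3
        cachepool:                   # cache tier pool
          pg_size: 512
          replication: 3
          cache_profile: 'basepool'  # attach as a cache for the base pool (assumed key)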
B: Some of the testing that we've done on that front has been focused specifically on promotion behavior into the cache tier.
B: What we've seen is that it's really, really easy to get to a point where there are excessive promotions, and that really drags performance down. The big reason for that: say that you're using RBD with a cache tier and you have a 4K read miss. Well, that 4K read miss means that you've got a 4-megabyte (at least by default) RBD object that gets promoted into the cache tier. Now, assuming that you have default 3x replication, and that you're doing SSDs and you've got your journals on the SSDs as well as the data, that means that that 4-megabyte object actually turns into a 24-megabyte write. The 4K read cache miss is actually promoting 24 megabytes of writes onto the SSD cache tier. That's really intense. It's really, really easy to overload the cache tier with excessive writes.
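Spelling out the arithmetic behind that figure, using the default 4 MB object size, 3x replication, and the journal-plus-data double write:

    4 MB object x 3 replicas x 2 writes (journal + data) = 24 MB written,
    per a single 4 KB read miss: roughly a 6000x write amplification.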
B: ...you want it to be very, very low, and really, really hot objects will eventually make it in because, even though only one percent of the promotions are making it through, they're so hot that sooner or later they're going to make it into the cache tier anyway; anything that's cold basically gets rejected. We saw something like, I want to say, around a 40 or 50x performance improvement when we did that. It was huge; it was actually enough that the cache tier started looking pretty good.
B
It
was,
it
was
not,
it
was
I.
Remember
it
went
from
being
significantly
slower
than
just
using
the
base
pool
by
itself
to
being
maybe
like
a
head
I
want
said,
maybe
like
forty
or
fifty
percent
faster
than
using
the
base
pool,
though
that
was
really
really
good.
That
was
really
important
in
addition
to
that
right.
Proxy
support
just
got
merged
this
week.
If
I
remember
right
that
will
hopefully
help
things
quite
a
bit
as
well.
Read
proxy
was
merged,
maybe
whole
month
like
six
months
ago.
A: [inaudible]

B: So Kyle Bader on the reference architecture team at Red Hat has done some work on this, and also Ben England from the Red Hat performance team is planning on doing a lot more investigation into cgroups specifically, looking at both CPU affinity and probably memory affinity under different scenarios.
B: I think there's been kind of a general interest in hyper-converged solutions, and so a couple of different people are interested in trying to figure out whether or not this can be accomplished without really significantly impacting the OSDs. I think it's going to be really important going forward to figure out, if you want high IOPS, how to deal with the fact that we're basically maxing out the CPUs already for these kinds of workloads.
B: So there's definitely going to be some contention in terms of resources when you want to do this. You'll have to be careful about how you design your nodes; maybe you can only have a couple of SSDs if you also want VMs on those same nodes. There's probably a lot of hardware reference-architecture design work that will need to go into figuring out where the balancing points are, and certainly things like jemalloc and TCMalloc factor into that.
A: [inaudible]

B: What I've seen when changing Ceph's options for the priority of recovery operations is that it basically changes as you have more client I/O, even past the saturation point on the cluster. So say you have 700 megabytes of client I/O under some recovery priority settings that you've set, and now you increase the number of clients, so you've got more clients that are waiting to do I/O.
B: When you do that, what seems to happen is that your client I/O performance might not change; you might see the same level of performance, but now all you've done is made recovery take longer. You've kind of just made the situation worse, in a way. Optimally, in my mind, what you would hope would happen is that you'd say: okay, under a recovery scenario I want thirty percent of the traffic to be recovery traffic and seventy percent to be client I/O traffic. Ideally, maybe that's what you want, or maybe you want other values, I don't know; but you would hope, then, that you would maintain those ratios regardless of how much client traffic, or how much client I/O, you have, even if you have more clients trying to do I/O. It doesn't seem like that's what's happening right now, though. I think we need to investigate that whole area of the code, and how it works, to make this simpler and nicer for users. But that's just my take on it, kind of what I've seen.
A: Anders is saying that, using the Swift API, they've been seeing some HEAD requests on large multipart objects taking longer than three-plus minutes. Are there any optimization tips you might recommend in order to bring this number down? RGW appears to be doing a HEAD request on each part, which can take an extremely long time depending on the number of parts.
B: So I think that if it's taking three-plus minutes, there's definitely something wrong, unless there's so much I/O that it really is taking that long for the underlying hardware to do the work; but I suspect not. What I would suggest doing is looking at the different statistics, the admin socket statistics that you can get (the performance counters; for example, via "ceph daemon osd.N perf dump"), both for the OSDs and also for the other daemons, and trying to figure out where in the pipeline you're stuck with outstanding operations.
B: There's also an option where you can target existing clusters. So if there's already an existing cluster running, you can basically just say "use existing" and then have it go off and target that cluster.
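That's a one-line switch in the cluster section, sketched here with the use_existing flag as I understand CBT spells it:

    cluster:
      use_existing: True   # skip deployment; benchmark the already-running cluster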
B: There's some interesting work going on by Mirantis right now to try to integrate OpenStack provisioning tools into CBT. Honestly, I actually do not know that much about OpenStack at this point; I deployed a cluster like three years ago, and I'm horribly, horribly out of date, so I can't even begin to explain...
B
You
know
what
it
is
that
it
actually
does
it
provisioning
in
those
cases.
But
but
there
is
some
work
going
on
kind
of
to
try
to
do
that
in
parallel,
there's
kind
of
a
really
simple
enhancement
to
cbt.
That
would
bet
that
a
couple
people
of
express
interest
in
that
would
be
really
nice.
That
tooth
ology
already
does,
which
is
to
let
you
define
multiple
yamo
files,
so
you
you
wouldn't
actually
have
to
recreate
the
entire
file.
B: You'd just recreate, maybe, a separate node-targets section and then include that YAML, or a different targets YAML for the cluster part of it. That would be a really easy change: it would basically just be letting the settings take in multiple YAML files and then adding those to the settings object that gets created in Python. So that's kind of where that is right now. We don't do any kind of auto-provisioning, though; maybe someday, with this OpenStack thing. So that's kind of where we're at.
A: [inaudible]

B: I think probably the best answer is that you're just going to have to try it and see if it helps with the workload that you need. I suspect that for small I/Os, like 4K reads and writes, you would see a benefit by reducing the RBD object size, but for large I/Os you may actually see a decrease in performance. That would be my kind-of guess.
A: Sounds like a lot of nothing. Alright! Well, thank you very much, Mark, for taking the time to give us the lowdown on Ceph performance. This video should be up before the end of the week here, if people want to review it or share it with folks that missed out. Other than that, thanks everybody for coming; we'll see you again next month.