From YouTube: CDS Jewel -- RADOS Tail Latency Improvements
A
B
One of the problems is that when an OSD notices a problem itself and asserts, it goes through the same sort of failure detection as a host physically dying, and we can do a little bit better than that. We don't actually need to wait for the cluster to detect us as down through the heartbeat process: everyone reports to the monitors, the monitors decide enough reporters have reported, publish a new OSD map marking us down, and finally we can start doing I/O somewhere else. That process takes, you know, 30 seconds or something.
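As a rough illustration of that detection path, here is a minimal sketch of the monitor-side decision, with hypothetical names and illustrative values standing in for the usual knobs (a heartbeat grace period and a minimum number of distinct reporters); it is not the actual Ceph implementation.

    import time
    from collections import defaultdict

    # Illustrative values standing in for the usual knobs
    # (heartbeat grace, minimum distinct down-reporters).
    HEARTBEAT_GRACE = 20.0   # seconds without heartbeats before peers report
    MIN_DOWN_REPORTERS = 2   # distinct reporters needed before marking down

    class FailureDetector:
        """Sketch of the monitor-side 'mark this OSD down' decision."""
        def __init__(self):
            self.reports = defaultdict(dict)   # target osd -> {reporter: time}

        def report_failure(self, reporter, target, now=None):
            now = now if now is not None else time.time()
            self.reports[target][reporter] = now
            return self.should_mark_down(target, now)

        def should_mark_down(self, target, now):
            # Keep only fresh reports, then require enough distinct reporters.
            fresh = {r: t for r, t in self.reports[target].items()
                     if now - t < HEARTBEAT_GRACE}
            self.reports[target] = fresh
            return len(fresh) >= MIN_DOWN_REPORTERS

Once that returns true the monitors publish a new OSD map marking the OSD down; everything before that point is what costs the 30 seconds being described.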
B
So if we're going to fail an assert because of an EIO, or because something bad happened in the software, we could instead do the same thing we do in the graceful shutdown case, which is send a message to the monitor on the way out saying "I certify myself as being dead and I will not respond to further messages." That allows the monitor to mark it down without waiting for reporters.
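The graceful-shutdown path already tells the monitor to mark the OSD down on the way out; a sketch of wiring the same self-certification into the fatal-error path might look like the following (the message and API names here are hypothetical, not the actual OSD code).

    import os
    import sys

    def send_mark_me_down(mon_client, osd_id):
        """Ask the monitor to mark this OSD down immediately (hypothetical API)."""
        mon_client.send({"op": "mark_me_down", "osd": osd_id, "final": True})
        mon_client.flush()   # make sure the message leaves before we exit

    def fatal_error(mon_client, osd_id, reason):
        # Same idea as graceful shutdown: self-certify as dead instead of
        # waiting for peers to report us and for the failure grace to expire.
        try:
            send_mark_me_down(mon_client, osd_id)
        finally:
            sys.stderr.write("osd.%d aborting: %s\n" % (osd_id, reason))
            os._exit(1)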
B
So that would be a small but very tractable change that someone could do. There's going to be a later session on some peering speed improvements, so I'll just summarize that as: there are some peering situations we can get through with fewer messages than we currently spend. Well, actually, that's the next session.
B
So the big one, though, is automatically detecting slow OSDs. If you're used to running a decently sized Ceph cluster, you'll notice that if an OSD gets slow, the whole cluster will tend to slow down to that speed, because CRUSH partitions data based on whatever weight you gave it, not based on any kind of real-time evaluation of OSD performance.
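To make that concrete, here is a toy weighted placement sketch in the spirit of CRUSH (it is not the real algorithm): the choice depends only on the object name and the static weights, so a device that is temporarily slow keeps receiving its full share of the data.

    import hashlib

    def weighted_pick(obj_name, osds):
        """Toy deterministic weighted choice over (osd_id, weight) pairs.
        Higher weight wins more often, but observed latency never enters
        the decision, which is the point being made here."""
        best, best_score = None, -1.0
        for osd_id, weight in osds:
            h = hashlib.sha1(("%s:%s" % (obj_name, osd_id)).encode()).hexdigest()
            draw = int(h[:8], 16) / float(0xFFFFFFFF)   # pseudo-random in [0, 1)
            score = weight * draw
            if score > best_score:
                best, best_score = osd_id, score
        return best

    print(weighted_pick("rbd_data.1234", [("osd.0", 1.0), ("osd.1", 1.0), ("osd.2", 0.5)]))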
B
That part probably can't really change, because you don't want to be dynamically moving data around just because an OSD happens to be a little slow right now, but we might want to shift primariness away from such OSDs, and we might want to preemptively mark them down if we think that the slowness is possibly a symptom of a failing disk.
B
There's also a patch out to read... I don't know if anyone read the paper, from Yahoo I think, but for an EC pool it reads all of the chunks and uses the fastest k returned chunks to reconstruct the read, and they found there was a pretty substantial improvement. One of the problems is that it still goes through the primary, so if the primary is slow, it'll still be slow. It doesn't really add a benefit for writes, and there's nothing analogous we can... well, I guess, what about... yeah, there really isn't anything analogous.
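The idea in that approach is roughly the following sketch (hypothetical names, not the actual patch): issue reads for all of the chunks at once and reconstruct as soon as any k of them have returned.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def ec_read_fastest_k(chunk_readers, k, obj):
        """Read all erasure-coded chunks in parallel and reconstruct from the
        first k that arrive. chunk_readers is one callable per shard, each
        returning (shard_index, bytes)."""
        pool = ThreadPoolExecutor(max_workers=len(chunk_readers))
        futures = [pool.submit(read, obj) for read in chunk_readers]
        shards = {}
        for fut in as_completed(futures):
            idx, data = fut.result()
            shards[idx] = data
            if len(shards) >= k:
                break                     # enough shards; ignore the stragglers
        pool.shutdown(wait=False)         # don't wait for the slow shards
        return ec_decode(shards, k)       # stand-in for the EC plugin's decode

    def ec_decode(shards, k):
        # Placeholder: with k data shards present this is just reassembly;
        # a real implementation would call the erasure-code plugin.
        return b"".join(shards[i] for i in sorted(shards)[:k])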
B
B
So, a fair point about that: it may be faster to go through the FileStore, even if it is under load, than it would be to go to the replica. One thing we're working on that might help would be to concurrently read, from the client, from multiple replicas. I'm working on some patches to make replica reads work properly, so it wouldn't be a large step from that to being able, at the client, to read from all the replicas and use the fastest response.
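A minimal sketch of that client-side idea, assuming replica reads already work and using hypothetical API names: send the same read to every replica and take whichever response comes back first.

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def read_fastest_replica(replica_reads, obj, off, length):
        """Issue the same read to the primary and every replica and return
        the first response. replica_reads are callables wrapping a
        hypothetical per-OSD read call."""
        pool = ThreadPoolExecutor(max_workers=len(replica_reads))
        futures = [pool.submit(r, obj, off, length) for r in replica_reads]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)          # leave the slower replicas behind
        return next(iter(done)).result()

As noted later in the session, this does not make the slow OSD any faster; it only hides the tail from the client at the cost of extra total read load.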
B
B
C
Except for the slow disk: later we found that most of the slow requests are caused by the FileStore [inaudible], because something like reading all of the info has to look it up from disk, and we found it can spend nearly 500 milliseconds to read it, to complete a read of the info for a request. So for this problem, I think maybe we could be more focused on something like this.
B
Well, that is... Sage is actively working on the replacement for the FileStore, so it's true the FileStore could be faster, and we're going to fix that. Also, do you feel that this is causing generally slower reads, or a big spike in 99th percentile reads, that is, something like the slowest reads being much, much slower than the average case? Do you have a sense of which of those is true?
C
B
C
B
So if it actually is the case that the object store itself has an extremely large read latency distribution, then we could get some benefit by sending the read in parallel to the replicas from the primary. So that would be an interesting approach. As far as actually improving the FileStore performance, we may not choose to do that; I think what we're going to choose to do instead is rewrite it.
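A sketch of that primary-side variant (again with hypothetical names): the primary starts its local object-store read and a read to one replica in parallel, and replies with whichever finishes first.

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def primary_speculative_read(local_read, replica_read, obj, off, length):
        """Run the primary's local read and one replica read concurrently and
        return whichever completes first; the slower result is ignored."""
        pool = ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(local_read, obj, off, length),
                   pool.submit(replica_read, obj, off, length)]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)
        return next(iter(done)).result()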
B
Much of the complexity in the FileStore comes from trying to use the file system's directories to facilitate collection listing, which wasn't a good idea, as it turns out. That's why we have the hash index; that's why we have to do all those directory traversals. So it's not so much that we couldn't make the existing implementation better; it would be better to not have to do that in the first place, which is why Sage is working on NewStore, I know.
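For context on those traversals, the FileStore's hash index nests a PG's objects in directories derived from the object hash so that collection listing stays bounded; a toy sketch of that mapping (not the actual HashIndex code) looks like this.

    import hashlib

    def hash_index_path(pg_root, object_name, depth=3):
        """Toy version of a hash-index layout: nest the object under
        directories derived from the leading nibbles of its hash. The real
        HashIndex splits directories dynamically; a fixed depth is used here
        only to keep the sketch short."""
        h = hashlib.md5(object_name.encode()).hexdigest().upper()
        dirs = "/".join("DIR_%s" % h[i] for i in range(depth))
        return "%s/%s/%s__head_%s" % (pg_root, dirs, object_name, h[:8])

    # Listing a collection then means walking this tree in hash order,
    # which is exactly the directory-traversal cost being described.
    print(hash_index_path("/var/lib/ceph/osd/ceph-0/current/1.0_head", "rbd_data.1234"))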
C
No more, no more, yeah. I just came up with some common questions, and I don't know how much sense they make to you... so, before NewStore.
B
B
Oh yeah, oh, I was looking at the wrong one; sorry, I was looking at the wrong Ceph IRC channel. Oh, I see it here: "How will an OSD know to mark itself as slow?" So that is one way it could do it. That's a little bit tricky, because it needs to know what the other OSDs consider to be a normal throughput. That question came from... Guang? Just a quick thing, Guang: I was talking about replication.
B
In that case, doing it for EC is harder, because the client needs to have access to the erasure coding library used, which isn't impossible, just harder, so that would be different. The client is already capable of performing replicated reads from the primary or from replicas; it's just that there are holes in the implementation that make it not a good idea to use in the general case, but once that's fixed, which should be relatively soon...
B
So, let's see, as far as marking itself slow: it's odd, because you could also have multiple classes of OSDs, so we'll probably have to create some kind of a process where the OSD benchmarks itself and stores what it believes its speed to be, and then over time it would be able to detect the degradation, perhaps. I'm not sure; we'll certainly be looking for input on that. Do you have any thoughts on that?
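One plausible shape for that, sketched with hypothetical names: benchmark at startup to establish a baseline, then track an exponentially weighted moving average of observed op latency and flag the OSD as degraded when it drifts well past the baseline.

    class SlownessDetector:
        """Sketch of an OSD self-check: compare an EWMA of observed op
        latency against a baseline measured when the OSD was healthy."""
        def __init__(self, baseline_ms, alpha=0.05, degraded_factor=4.0):
            self.baseline_ms = baseline_ms      # e.g. from a startup self-benchmark
            self.alpha = alpha                  # EWMA smoothing factor
            self.degraded_factor = degraded_factor
            self.ewma_ms = baseline_ms

        def record_op(self, latency_ms):
            self.ewma_ms = (1 - self.alpha) * self.ewma_ms + self.alpha * latency_ms

        def is_degraded(self):
            # Persistently running far above the healthy baseline is the
            # symptom worth acting on (lower primary affinity, or pre-mark down).
            return self.ewma_ms > self.degraded_factor * self.baseline_ms

    det = SlownessDetector(baseline_ms=8.0)
    for lat in [9, 10, 12, 60, 80, 75, 90, 85] * 20:
        det.record_op(lat)
    print(det.is_degraded())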
D
Oh yeah, no, no, that's fine! I just want to make sure you get... so, on this topic specifically, couldn't we take a look at statistics regarding things like the I/O size and the throughput, and kind of get a general sense of what we think the back end is? You know, there are certain classes of drives, right, like 7200 RPM drives, that have fairly consistent performance characteristics. You might be able...
B
D
D
B
B
D
B
Might... that's a little bit... well, actually, let me get to "prioritize primary requests to an OSD". What do you mean by prioritize?
C
B
With that... anyway, that's possibly a feature, not a bug. For one thing, RADOS already limits the size of objects you can have, so you can't have a one-gigabyte write and a one-kilobyte write. You could have a one-kilobyte write and a four-megabyte write, yeah.
C
B
I would argue that the sizes you naturally want your RADOS objects to be are such that you wouldn't want to break a write up, because it'll be dominated by the seek time and not by the throughput. Apart from that, if the large write got to the journaling code first, then it's unlikely we'd want to break it up; it's likely that we'd want to simply finish it, and likewise for the part where it actually applies it to the file...
C
B
D
D
B
B
There are two kinds of requests: there's the kind that comes from the client, where you haven't committed locks or serialization resources yet, and then there are requests you need to complete as soon as possible. That would be requests from a primary, or from the replicas back to the primary, that you've already committed resources for. So no, you wouldn't... you would always prioritize requests from the primary.
B
D
B
The replicas then send back sub-op replies, which you then process through the same queue, and then you send back to the client an op reply. The first one, the original client request, has a priority of 63, which is the highest non-strict priority; everything after that has a priority of highest, or something. They always prioritize ahead of client requests that haven't been seen yet, because a client request is actually blocked on them and, furthermore, we're holding a lock. So once we've...
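A minimal sketch of that queueing rule (illustrative only, not the actual OSD op queue): new client ops enter at priority 63, and sub-op replies and other work that an in-flight client op is already blocked on are queued above them, so committed work always drains first.

    import heapq
    import itertools

    CLIENT_OP_PRIO = 63      # highest non-strict priority for new client ops
    BLOCKED_WORK_PRIO = 127  # illustrative value for work in-flight ops wait on

    class OpQueue:
        """Tiny priority queue: higher priority dequeues first, FIFO within
        a priority level (heapq is a min-heap, so priorities are negated)."""
        def __init__(self):
            self._heap = []
            self._seq = itertools.count()

        def enqueue(self, op, priority):
            heapq.heappush(self._heap, (-priority, next(self._seq), op))

        def dequeue(self):
            return heapq.heappop(self._heap)[2]

    q = OpQueue()
    q.enqueue("client write A", CLIENT_OP_PRIO)
    q.enqueue("sub-op reply for an in-flight write", BLOCKED_WORK_PRIO)
    q.enqueue("client write B", CLIENT_OP_PRIO)
    print(q.dequeue())   # the sub-op reply drains before the new client ops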
B
D
B
D
D
B
B
D
B
The only way to really mitigate that is either to reduce the work the primary is doing, period, which means not sending the request to your own store but sending it to a replica instead, or sending it to both and using the fastest return (that, by the way, doesn't make the primary less slow; it just means the client doesn't see it, and it actually increases overall load), or the client could send it from the client to multiple replicas and use the fastest return, which, again, is the same thing.
D
One thing that I've toyed with in the past, which I don't know if it's even worth getting into here, is the thought of having some kind of local decision-making process: if a particular disk is slow, having an OSD be able to potentially kind of spill over to another one. But it's kind of a... it could probably all happen right below Ceph, so that Ceph doesn't even have to be involved, but...
B
We can do it above Ceph, actually, and that's... there are two kinds of things you can do when you have a slow disk: things that require a lot of effort and things that require a very small amount of effort. Things that require a very small amount of effort include letting someone else be the primary for your PGs, and we...
A
B
We have machinery in the monitor for that: the primary affinity. If you set such an OSD's primary affinity to 0, it would still hold the same data, so you wouldn't shift any data, but CRUSH would tend to output a different result: for all the PGs you're in the acting set for, a different OSD would wind up as the primary, so you wouldn't see requests. That's the kind of thing we could do. What are we looking for... quickly, as in response to transient shifts in I/O? Well, somewhat quickly, in response to relatively short transient shifts in slowness.
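Primary affinity is already settable per OSD (for example, ceph osd primary-affinity osd.3 0). A toy sketch of the selection rule it feeds into (not the real CRUSH/OSDMap logic) shows the point being made: the acting set, and therefore the data, is unchanged, and only the primary role moves away from affinity-0 OSDs.

    import hashlib

    def choose_primary(pg_id, acting_set, primary_affinity):
        """Toy choice of which acting-set member serves as primary, honoring
        a per-OSD primary affinity in [0, 1]."""
        for osd in acting_set:
            h = hashlib.sha1(("%s:%s" % (pg_id, osd)).encode()).hexdigest()
            draw = int(h[:8], 16) / float(0xFFFFFFFF)
            if draw < primary_affinity.get(osd, 1.0):
                return osd
        return acting_set[0]   # fallback if every member opted out

    affinity = {"osd.1": 0.0}  # a slow OSD we no longer want serving as primary
    print(choose_primary("1.2f", ["osd.1", "osd.4", "osd.7"], affinity))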
D
B
As long as all of the pools have a uniform distribution, or as long as your I/O is partitioned over your pools such that it's uniformly distributed over the PGs, then it still works. The problem you get into is when the pool that receives all of the I/O doesn't have enough PGs, and then that's not a skew problem; that's simply that you don't have enough PGs. It's not that you have too many pools; it's not that you have...
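As a concrete illustration of "enough PGs", the commonly cited rule of thumb is on the order of 100 PGs per OSD, divided by the pool's replica count and rounded up to a power of two; a quick sketch of that arithmetic:

    def suggested_pg_count(num_osds, pool_size, target_pgs_per_osd=100):
        """Rule-of-thumb PG count: about target_pgs_per_osd PGs per OSD,
        divided by the replication factor, rounded up to a power of two."""
        raw = num_osds * target_pgs_per_osd / float(pool_size)
        power = 1
        while power < raw:
            power *= 2
        return power

    # e.g. 40 OSDs with 3x replication: 40 * 100 / 3 = 1333 -> 2048 PGs
    print(suggested_pg_count(40, 3))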
D
B
This is kind of a big area; there are lots of ways we could consider doing the OSD speed thing. It's going to be kind of a lot of work, and the nice thing about it is that it's not that internal to the OSD or the monitor: it'll be in the code, but it'll be a new chunk of heuristic that will only interact with the existing code minimally. So this would be a pretty good project for someone who had a good design they thought might be worth implementing; it'd be pretty easy to prototype, also.
B
D
B
B
So that should help a lot, because before, we were relying badly on the pthreads, on the OS's own scheduling, to decide when we were going to do recovery versus when we were going to work on client I/O, and that was never all that good an idea. So I think we'll get better performance under contention once people start testing out those changes. After that, it'll be, I mean, as Haomai is pointing out...
B
D
Yeah, we are seeing that, actually. Interestingly enough, with large I/Os, with like large sequential reads, NewStore is significantly better than the FileStore. The place where NewStore really falls apart is on RBD overwrites primarily, but really on object overwrites in general, like large object overwrites. So that's kind of what we need to focus on there, but in a lot of other cases NewStore is actually looking really good. Cool.
D
D
You know, I've got data; I haven't looked at it. I should go back and do that, but I don't have an answer for you right now on what the CPU usage looks like, although I suspect that, from a performance perspective, it's no worse. This machine that I've been doing testing on is not super fast in terms of CPU, and NewStore does better.
D
Maybe, maybe one question here would be: both with the FileStore and with NewStore, you know, the LevelDB or RocksDB or whatever we have for key-value stores is becoming more important, and latency there is a really big deal, especially during compaction. So that's, in terms of reducing our long tail latency...
C
I have an idea: could we reduce the number of PG log entries written in the normal state? By default it is one thousand PG log entries, if I remember correctly, so we could make the PG log buffered [inaudible], and we could keep the PG log state in memory and not flush it to disk. So maybe we can reduce some of the PG log writing.
B
Yeah, that probably wouldn't hurt. So, the number of PG log entries that are configured is not a magic number; you need enough so that you don't spuriously go into backfill when you reboot a host. Beyond that, I'm not sure it matters that much. You could also tweak the LevelDB settings so that it's caching a much larger amount, which would be a good idea; there's nothing magic about the LevelDB settings either, yeah.
C
B
Another option, if it's something you want to try, is to try running RocksDB under the OSD instead of LevelDB. I believe there's a way to configure it; I don't remember what it is. Note that this is: don't run NewStore, run the FileStore with, yeah, RocksDB.
B
D
I don't remember if LevelDB gives you statistics like this, but with RocksDB, if we run the OSD using RocksDB, it will give us semi-regular statistics on kind of what data is hitting what level, and kind of the size of the key-value pairs and all this other stuff. Do we have a good sense right now of what the workload looks like from the OSD, or are we kind of in the dark still, do you think?
C
D
Do you know, have we seen any evidence in the recent past that LevelDB is causing any kind of latency or other issues on the OSD? I remember kind of looking at this a couple of years ago, and I didn't think that it was, but I'm now questioning myself a little bit; I don't...