From YouTube: Ceph Performance Meeting 2021-11-11
A
All right, pull requests this week: I did not see a whole lot at all. There was one new pull request that looks really interesting from Radek. This introduces huge-page-based read buffers, so if you're interested in that, it looks interesting. I'm not sure under what circumstances this is going to end up helping, but it definitely could, so yeah, there's this one.
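(As an illustration of the general technique, and a hedged sketch rather than the code from that PR: huge-page-backed buffers on Linux are typically obtained with mmap and MAP_HUGETLB, falling back to normal pages when none are reserved.)

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Allocate a read buffer backed by huge pages, falling back to normal
// 4 KiB pages if the system has no huge pages reserved.
void* alloc_read_buffer(std::size_t len) {
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return p == MAP_FAILED ? nullptr : p;
}

int main() {
  constexpr std::size_t kLen = 4u << 20;  // 4 MiB, a multiple of 2 MiB
  void* buf = alloc_read_buffer(kLen);
  std::printf("buffer at %p\n", buf);
  if (buf) munmap(buf, kLen);
}
```

(Fewer TLB misses on large sequential reads is the usual motivation, which would fit the read-buffer framing here.)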
Otherwise I didn't see anything new or closed. Besides that, there are a couple that got updated.
A
This week, though, Adam's retry of BlueFS fine-grained locking went through some QA, and it was passing cases that previously, apparently, failed.
A
I don't know if that means it's ready to merge yet, or if we still need to run it through more tests. Certainly that code is complicated, so this is a touchy one.
A
Unfortunately, this is also Adam's attempt to fix what was wrong in a previous PR from majianpeng, which we merged and which also broke afterwards. So right now we're zero for two; hopefully on the third try we'll get it, but this is tricky code to get right. All right, let's see: Beast optimizations for request timeout. I love this PR.
A
There's tons of benchmarking data in it, lots of discussion about what makes sense to include or what doesn't. Oh, Casey, you've been reviewing this; anything else you want to add?
B
Since last time, let's see, we got a comparison with and without the custom allocator piece, which was kind of complicated, and we found that it didn't really help. So I was happy to rip that piece out.
B
Mark is satisfied with the performance, so I ran it through teuthology and saw some valgrind issues there. We talked about it in the bug scrub this morning; we have a plan for that.
B
Yeah, the piece of the Beast library that we were relying on for the timeouts is kind of complicated and does extra stuff, so getting rid of that, I think, is the main thing.
A
Okay, excellent, excellent. Do you know, there was another PR that was kind of hurting RGW performance, it looked like? I need to go back and look at which one it was, but...
B
And there's a stack allocator for the coroutine stacks. Originally it was just using memory from the heap, and it was sized pretty small, so we were seeing a lot of valgrind issues from just overrunning the stack. That PR switched the allocator to use mmap and mprotect so that we would actually crash if we overran the stack. That adds a couple of system calls for every request, so we kind of expected a performance hit there.
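(For illustration, a guard-page stack allocation in the spirit of what's described above might look like the following; this is a hedged sketch of the general mmap/mprotect technique, not the actual RGW code.)

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Allocate a coroutine stack with an inaccessible guard page below it,
// so that overrunning the stack faults immediately instead of silently
// corrupting adjacent heap memory.
void* alloc_guarded_stack(std::size_t stack_bytes) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  const std::size_t total = stack_bytes + page;  // stack plus one guard page
  void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) return nullptr;
  // Revoke all access to the lowest page; stacks grow downward, so a
  // stack overflow lands here and raises SIGSEGV.
  if (mprotect(base, page, PROT_NONE) != 0) {
    munmap(base, total);
    return nullptr;
  }
  // The usable stack region starts just above the guard page.
  return static_cast<char*>(base) + page;
}
```

(The mmap plus mprotect pair is the "couple of system calls for every request" mentioned above.)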
B
So I don't think we're planning to revert that, but maybe there's something else we could do. But I think the timeout thing was the much bigger issue.
A
Well, cool, excellent, sounds great. All right, let's see, next up there's this PR to set the min_alloc_size to the optimal I/O size of the underlying device. There was some discussion; eventually it was approved by both Sage and Igor. I believe now it's just ready for testing.
A
There's also this PR from Igor to make BlueStore fsck much less RAM-greedy; that needed a rebase. It's gotten an update since then from Igor, but I'm not sure if he did anything besides just rebase on master, so I think that one may be ready for review and testing. Another one from Igor: this is the old PR for optimizing PG removal.
A
It's been under lots of review; there were some failures in testing, then more review and discussion. Without Igor here I'm not sure what the status of that is yet, but anyway, it's still being actively worked on, so that's good. Lots of stuff in the no-movement category this week. I know I've got a couple that are in there, but I just haven't been touching them. My big one, I guess, is this priority cache. It was more or less working in performance tests.
A
Well, it was working in performance tests; it's just not showing a whole lot of improvement, really. But then it faulted during testing and I haven't gotten back to it yet. There are some benefits to it, just not what I was hoping there would be, so it's still probably worth doing; it's just not a big improvement, much more a nuanced, minor kind of thing. But it does give us a lot of insight into what the cache is doing, so maybe that's the big win. Otherwise...
A
We still have the MDS optimizations here, which I think could really be good to get in. But Patrick, do you know, is anyone still looking at those, the ones from ukernel?
D
No, not yet; we'll look at that in the next few weeks.
A
Kind of interestingly, I've gotten a lot of people recently asking about HPC and IO500 with CephFS, so there just seems to be a lot of interest.
D
Yeah, well, we'll see. I'm not sure if the approach is right, or something we want to support long term. So, hard to say whether or not those will actually get in.
A
Okay, all right! Well, let's see. Likewise on this "optimize object memory allocations using pools" one: Ronen, oh, you were in here, you had raised some issues about that PR, and I think they're very legitimate. That's another kind of question where there could be a big performance win in doing something like this, but we have to figure out the right way to do it.
A
I don't think we saw a response back from the author of that PR, so we're still kind of waiting on that.
A
All right, well, otherwise I think that's about it for PRs. Anything I missed, or anything anyone would like to discuss?
A
All right then, moving on. Okay, so the first thing this week I wanted to bring up is that there's been a lot of discussion, mostly inside Red Hat, regarding how long it takes to build packages. Building stuff itself is actually not horribly slow if you have enough cores on our really fast development machines.
A
We can build stuff in probably 10 to 12 minutes, which is not great, but it's okay. But our debug builds, or our actual package builds, take a lot longer, and there's interest in trying to figure out why. The two big things that came up: there was interest in trying to parallelize builds across multiple nodes, and David Galloway has been doing a lot of testing on that. Right now he's trying some kind of closed-source tool.
A
I forget the name of it, but the gist of it is that it's faster, but maybe not as fast as you'd hope: it uses a lot of nodes, or a lot of cores, to get an improvement, and distcc may be kind of similar in terms of the improvement.
A
It's basically a C program. It is being actively maintained; there appears to be someone at Red Hat that's working on it, but perhaps this is something that could be parallelized. I wanted to open it up for anyone that is interested in this. Has anyone used dwz, or knows much about it, or looked at it at all?
A
All
right
that
was,
that
was
somewhat
the
response
I
expected
so
okay,
I've
looked
at
this
just
a
little
bit
now.
I
don't
know
how
difficult
it
would
be
to
actually
do
anything
with
this.
I
haven't
looked
at
it
closely
enough
to
really
get
a
sense
of
it.
There
is
some
comments
right
at
the
top
of
the
c
source
talking
about
trying
to
optimize
multi-file
cases.
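(For reference, dwz deduplicates DWARF debug info and is run on the .debug files during distro debuginfo packaging; its single-file and multi-file modes look roughly like this. A sketch from memory; check the man page for exact flags.)

```
# Compress DWARF data in one debug file, in place:
dwz ceph-osd.debug

# Multi-file mode: factor DWARF data shared by several debug files
# into a common file that the others then reference:
dwz -m common.debug ceph-osd.debug ceph-mon.debug ceph-mds.debug
```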
A
So the author definitely has been thinking about performance at least a little bit. I may try to reach out to him and see if there are any opportunities for us to help. But in any event, I think that's probably the better way to go in terms of trying to make our package builds faster, at least at first, rather than trying to parallelize and use multiple nodes; we already can make the parallel parts fairly fast.
A
Just by using a big node with lots of cores. It seems like the single-threaded parts are the ones that are really hurting us right now. So anyway, that's it for that. All right! Next here: Josh and Gabi, do you want to give a quick summary of the discussion from core this morning about fast shutdown?
E
If there's anything happening, if there's compaction ongoing, or if there are still ongoing writes, then this thing would be bogus, and the next time we start it's going to be bad. And so there are two solutions. The first solution is just: if you do a fast shutdown, then don't store the allocation file, and the next time you start you will do a full recovery.
E
That's one solution. A better solution is, on the fast shutdown, try to do a minimal set of operations to get the system into a quiescent state, then store the allocation file and shut down. The steps we could skip are all the cleanup we do for memory and stuff that we don't really need: you don't need to drain the memory pools, you don't need to...
E
I
don't
know,
free
up
all
the
memory
and
let
malok
a
new
free
and
sorry
new
and
free
deal
before
the
fragmentation.
You
can
just
skip
that
state,
and
so
I've
been
working
on
this
for
some
time
now
and
today
I
I
start
profiling
the
step
to
see
how
much
I'm
saving
and
I
was
very
disappointed
to
see
on
my
system
that,
with
the
minimal
set
of
operation,
I
can
go
down
to
like
one
second,
and
if
everything
is
done,
it's
going
to
be
10
seconds.
E
My intuition is that we don't do it because we're trying to save shutdown time, and I don't think it's actually very long. I suspect people need fast shutdown in case some bug puts us in a deadlock, or some kind of condition preventing the system from ever shutting down, so it's just "okay, kill the system". But that's something I suspect; I have no actual knowledge of that being the case.
C
I don't know about the second part, but I can tell you the history of why it's implemented; it's actually a fairly recent addition.
C
I think one of the aspects that you may not be seeing in your testing, because your machines are so fast and have these very fast NVMe devices, is the flushing of in-flight data: that's going to be pretty fast on those nodes, which wouldn't be the case if you had slower hardware like a hard disk or a much slower CPU. I think that may be one aspect of why we'd see longer shutdown times in some cases.
C
But that's not to say that we have to do fast shutdown, or that we have to... I think it is worth trying to figure out where the time is actually being spent, and what is worth optimizing or not.
G
That could completely hold up their whole plan; they could be governmental, could be financial, losing tons of money waiting for this node to shut down. So there could be a big difference in what a fast shutdown could do. Being newer, I don't know all the details of it and what we're looking to achieve, but there could be a case where something just never shuts down, because hard drives like to still be operational but be very finicky, especially the more commodity-type hard drives. I just wanted to add that.
E
So, sorry, yeah: there is another solution that we discussed, which was do the shutdown but cap it at five minutes, ten minutes.
E
If you cannot do it in five minutes, then kill the machine, and you will do a full recovery afterwards. But the full recovery should probably not take you more than 10 minutes, so there's no reason to wait 10 minutes to save another 10 minutes; you should never wait one hour or two hours. But maybe we could make that cap enough to do this thing, plus there are a few other simple changes which don't change the semantics.
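(For what it's worth, "clean shutdown with a hard cap, then kill" is exactly what a systemd stop timeout provides; a hedged sketch of a five-minute cap on an OSD unit, assuming the stock ceph-osd@.service is in use:)

```
# /etc/systemd/system/ceph-osd@.service.d/override.conf
[Service]
# Allow up to 5 minutes for a clean stop; after that systemd
# escalates to SIGKILL and the next start does a full recovery.
TimeoutStopSec=300
```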
E
There
is
anything
which
is
not
starting
the
execution,
but
it's
in
the
queues.
So
when
the
system
is
slowing
down,
I
expect
a
lot
of
stuff
is
being
queued
on
the
external
queues,
but
we
didn't
start
executing
them
executing
them.
So
something
that
I'm
already
doing
in
this
solution
is
I'm
I'm
stopping
all
the
cues
from
accepting
new
tasks,
the
external
and
the
internals.
So
nobody
is
going
to
start
anything.
Only
the
stuff
which
already
start
execution
is
going
to
conclude.
The
question
is:
how
many
tasks
can
we
have
in
flight?
C
But
that's
another
area
where
I
think
we've
made
a
recent
change
there,
where
for
a
long
time
we
actually
had
no
limit,
and
so
we
could
have
very,
very
large
buffers
of
data
that
was
incoming
to
the
osd
and
we
finally
enabled
that
again
it's
the
osd
message
cap,
which
is
what
do
we
set
that
to
like
100
or
something
256.
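(The cap being discussed is presumably the client message throttle; as a hedged sketch, the relevant options look like this. Names are as I understand them and defaults vary by release:)

```
[osd]
# Max number of in-flight client messages an OSD will accept
osd_client_message_cap = 256
# Max total bytes of in-flight client messages (here ~500 MiB)
osd_client_message_size_cap = 524288000
```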
E
Not
what
I
mean,
even
if
we
assume
that
every
iotec
10
20
millisecond,
you
could
still
finish
just
the
I
o
in
I
would
say
five
seconds
like
the
slowest
sata
drive,
would
finish
the
I
o
itself
in
five
seconds.
Then
there
is
the
processing
that
we
do
make
it
to
be
10
seconds.
So
that's
not
where
we
spend
time.
I
don't
think
this
is
where
we
spend
time.
E
I
suspect
that
there's
some
cases
that
so
so
one
one
one
thing
would
be:
maybe
the
queues
got
thousands
of
of
pending
io
that
we
didn't
start
executing
executing,
which
we
still
keep
bringing
in.
That
thing
is
easy
to
stop
and
I'm
already
doing
that,
I'm
not
going
to
add
anything
new
to
the
execution,
but
256
ios.
That's
not
something
that
would
cause
us
to
wait.
Minutes
like
one
minute
should
be
enough
to
complete
everything
and
with
huge
margins.
E
So
else
that
was
the
reason
that
people
so
so
that
faster
than
was
needed
if
there
was
no
limit
and
by
the
way,
even
today,
there
is
still
a
similar
thing,
because
you
could
have
think
skews
be
queued
before
they
start
executing.
I
don't
know
how
much
we
can
have
in
that
queue.
The
message
queue
on
the
osd.
Everything
arriving
from
the
clients
is
the
limit
to
how
much
we
could
keep.
E
Okay, okay, so even without any of my changes, there are only 256 possible I/Os; we should be able to complete those in a few seconds, plus the cleanup that we do. Assuming that there is no bug and there is nothing stopping us, I would expect that one minute should be more than enough for any system.
F
Yeah, so upgrades, as Josh pointed out, were the reason: like 25 to 30 seconds per OSD was not acceptable, is what I'm understanding just by reading through the initial comments.
F
Yeah, but I guess with Gabi's stuff, some of the details in this PR don't hold, right? I mean, now we'll have more motivation to do a clean shutdown, right? We have other stuff to do during shutdown, so it might be worth revisiting whether we can afford to have a clean shutdown.
E
No, I want to see a non-fast shutdown with the system doing I/O at full capacity, and I want to see how long it takes. Because if we can see that that takes, I don't know, five minutes in the worst-case scenario, and I don't even think that'd be the case, I think one minute should suffice for any reasonable system, then I don't think there's much to gain from the fast shutdown, and we only risk creating inconsistencies.
F
I think that should be doable. I mean, we have the long-running cluster, we also have scale clusters, and it's just a matter of disabling fast shutdown.
C
I think we do have to be a little bit concerned, even if it's a minor increase per OSD, because when you're trying to restart the entire cluster, say during upgrades, you're keeping things online, you're serializing things a bit so that you don't disrupt activity. So you're going node by node, maybe starting a few OSDs at a time so you don't make the entire cluster unavailable, and in that kind of scenario it does add up, when you have that sequential delay.
C
Even
if
it's
like
relatively
fast,
I
think
it
might
still
be
worth
optimizing,
but
I
think
I
agree.
We
should
check
out
how
much
time
it
actually
does
take
on
something
like
the
lrc
and
if
it's
like
only
one
second
or
something
there,
then
it's
probably
not
worth
looking
at.
But
if
it
is
like
a
difference
of
10
seconds
20
seconds,
then
there's
more
significant
and
adds
up
over
a
large
cluster.
C
Yeah, that would be the worst-case scenario, on giant writes. And there's a good point about other kinds of scenarios where you care about the whole cluster too: shutting it down entirely. At least in that case it can be done in parallel, so the extra time isn't so bad; it's really the long tail that you're worried about there. But yeah, that's an idea.
C
Yeah, but again, in practice it's going to be much lower. I mean, like four megabytes is going to be the usual worst case, maybe 32 for more aggressive setups, but that's pretty rare.
A
Yeah, but four of them are being used by Jenkins, and the other ones I think are mostly checked out; those might be tough. We have the ancient mira machines, Josh; they're very slow.
A
Yeah, so Gabi, that might actually be the worst-case scenario if you want one; that would probably be the worst-case scenario for everyone. Those are teuthology machines in the standard lab, so you'll need to check one out using that; I don't know if you can.
E
Send me the names. And for running multiple: can fio just keep opening more and more fios? And is there any synchronization happening between them? Can they all write to the same OSD?
A
Yeah, absolutely. You can run multiple jobs from one fio, or you can just run multiple independent fios and hold up the queue depth for each one. For each one, if you want a high I/O depth per fio process, you'll use libaio and direct I/O, pump the queue depth up, and just do large writes, or whatever you want to set it to.
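(A hedged sketch of such a job file; the target device and sizes are placeholders:)

```
# seqwrite.fio: large sequential writes at high queue depth
[global]
ioengine=libaio
direct=1
rw=write
bs=1m
iodepth=16

[job1]
# placeholder target; point at a test device or file
filename=/dev/nvme0n1
size=10g
```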
A
Yeah, I mean, that'd probably be the way to go, I would think. And you'll probably melt the mira machine, because these things are like 10 years old, so it's going to be super slow, probably, but you'll learn something, maybe.
E
Yeah, because also, for now, one thing I notice is that the staging part is not where the time is spent. Most of the time in all the stuff I'm skipping is just inter-cluster synchronization: it's not I/O, it's getting everybody to agree on some steps and stuff like that, talking to the manager, talking to all the other components. I don't see the I/O itself as something taking too much of the time.
E
Sorry,
I'm
saying
before
before
I
try
to
optimize
when
I
optimize,
that's
the
only
thing
left,
but
the
nine
second
out
of
ten
there
all
been
some
kind
of
synchronization
happening,
like
a
very
big
chunk,
was
even
spent
inside
service
prepare
to
stop,
which
is
essentially
communication
with
managers
and
others.
C
Yeah, it's in the teuthology docs; there's a teuthology-lock command you can use to do that. You can see here, that's the general page, but I'm trying to link to the particular page with the mira machines; there's a bunch of them that are free. So in theory, those ones that are free are still working.
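(Locking a couple of free mira nodes looks roughly like this; flags are from memory, and the teuthology docs referenced above have the authoritative syntax:)

```
# Show machines and their lock status
teuthology-lock --list --machine-type mira

# Lock two free mira nodes under your ownership
teuthology-lock --lock-many 2 --machine-type mira --owner you@yourhost
```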
A
So
gabby,
you
probably
need
to
grab
some
and
then,
however,
you
want
to
run
your
tests
like
set,
accept
the
cluster
and
test
I
mean
toothology
could
do
it
right?
You
could
set
the
cluster
that
way,
but
otherwise
you
know
you
could
you
could
use
restart
or
you
could
use
cbt
once
you
grab
the
nodes,
I
used
to
do
that.
C
...generated by activity: that's just the message limit coming in on the client side; there could be more I/O operations internally generated from those client ops.
C
It's a separate piece. I guess there would be... you're talking about the log entries, right?
C
Yeah, that's separate; that's not counted in the same way. But for deferred writes, you don't necessarily have to do them during the shutdown process.
A
You'll probably want to do like... so, okay, one of the reasons I brought it up here is because we should talk about this a little bit. With numjobs, if you increase that, what will end up happening is that the jobs will all start out at the same offset on the image at the same time, so I don't typically like increasing the number of jobs with fio for sequential writes.
A
If
you
did
random
rights,
it
wouldn't
matter,
so
you
could
do
that
too,
but
with
the
control
rights,
you
probably
want
to
hit
different
rbd
images.
A
Otherwise, you can run multiple fio processes in parallel against different RBD images, or you could just make one RBD image and then do numjobs equals whatever you want to set it to, and then the iodepth per fio process, whatever you want that to be. With the rbd engine you don't have to worry about using direct=1 and all the other garbage that goes along with that, libaio and everything. So yeah.
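(A hedged sketch of an fio job using the librbd engine; the pool, user, and image names are placeholders, and the image must exist beforehand:)

```
# rbd-write.fio: drive an RBD image via librbd directly,
# no kernel mount, no O_DIRECT/libaio plumbing required
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
rw=write
bs=4m
iodepth=32

[rbd-job]
```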
A
I can send an email too. I mean, I assume that you want to just use RBD with fio, rather than going through the whole process of setting up kernel RBD and writing a file system to the image and all that stuff.
E
The reason data servers don't allow you to give them unsolicited data is because it allows them to control the way they behave: they know how much resources they have, so usually they're going to allocate buffers for the requests they accept. They can accept a request, but then they will move the data in their own good time, when they have enough buffer free for the data, because usually the request itself is just 64 bytes; the I/O request block is usually 64 bytes.
E
I
never
heard
about
this
being
a
performance
limitation,
because
you
could
see
people
walking
in
almost
widespread
with
this
behavior
and
again,
even
if
there
is
I,
I
can
see
that
there
is
possible
optimization
for
a
very
small
right
if
it's
four
kilobyte
around
it.
But
when
you
do
megabyte
of
right,
there
is
never
a
reason
for
you
to
push
data
unsolicited
and
hammer
on
the
server
which
it
might
be
doing
other
stuff
and
you're
just
consuming
space.
A
So
I
don't
know
all
the
details,
but
I
I
remember
hearing
stories
that
early
early
on
there
were
many
many
iterations
of
the
messenger,
and
there
was
a
lot
of
like
changes
going
into
a
lot
of
this
stuff
early
on.
I
I
think
at
one
point
like
we
had
a
messenger
that
was
entirely
based
on
mpi
in
the
early
days
when
it
was
to
look
like
you
know,
kind
of
focus
on
being
like
a
luster
replacement
for
supercomputers
josh.
C
Yeah,
I'm
not
familiar
with
that
really
early
interview
myself
on
semester
layer,
it
seems
like
I
mean
one
of
the
things
I
can.
I
imagine
I
might
be
that
it's
a
little
bit
more
difficult
to
like
re-pre-register
your
intents
and
then
execute
on
them
in
a
system
like
stuff
where
you
can
move
around
across
nodes,
and
you
don't
necessarily
know
exactly
where
your
request
is
going
to
be
going
and
until
you
get
around
to
sending
it.
E
But you could maximize throughput if it's a small I/O, because then the overhead of the round trip is going to be big if it's just four kilobytes of data. If you're going to send four megabytes of data, then doing an extra round trip is no big deal, because even TCP is going to keep sending messages back and forth; you're not going to move four megabytes in one buffer internally.
E
It's
going
to
open
windows,
close
them
and
keep
asking
you
to
give
them
the
more
data,
so
the
overall
saving,
also
the
relative
improvement
in
four
megabytes
going
to
be.
I,
I
don't
think
it's
even
going
to
give
you
one
one
percent
extra.
If
it's
doing
4k,
maybe
you're
going
to
get
10
extra
response
time,
you
could
cut
a
response
time
by
something
but
four
megabytes.
E
I'm really unfamiliar with anybody doing unsolicited data. Since the beginning of time it was always the privilege of the server to control the flow, because servers are doing more and don't want a few strong clients to kill them, and they don't even know if these clients are high priority or whatever, so they always refuse. It was never allowed to just push data.
C
Yeah, I get your point; it could be like a few milliseconds.
E
And
that
means
that
you
could
create
an
adversary
which
could
kill
your
system
by
sending
that
much
data.
If
we
allow
32
megabyte,
then
I'm
going
to
shut
256
requests
each
of
32
megabytes
and
your
sister
perform
is
going
to
suck.
I
I
think,
from
from
the
traditional
system,
they
were
optimized
to
use
the
cues
as
they
are
fixed
and
that's
why
you
need
to
just
ask
for
whether
you
could
send
something
down
or
not.
But
the
idea,
I
think,
is
not
that
bad
to
have
some
boundary
where
we
just
don't
allow
to
send
the
data
straight
away
and
we
could
gain
perhaps
some
decoupling
from
those
really
big
data
streams
and
have
a
better
handling
just
get
library
or
just
at
least
improve.
Also
the
things
around
qos.
C
I'm not sure exactly how big of an issue it is once you have mClock enabled as well; mClock does try to handle that. But yeah, that doesn't mean you don't end up having potentially more system resources, in terms of the queueing space, used by these large requests.
E
If you want to do any kind of QoS, you could say: you know what, after I've done 100 of the four-kilobyte requests, I'm going to be willing to do one of your one-megabyte requests.
C
But
I
guess
what
it
doesn't
do.
Is
it
doesn't
stop
you
from
taking
up
that
cube
space
in
this
in
the
server
in
the
first
place?
So
I
guess
we
make
the
difference
with.
C
Yeah,
I'm
saying
that's
the
advantage
of
that
kind
of
system
compared
to
just
that.
Just
the
m
clock
implementation
that
we
have
in
chef
today
is
that
it
it
does.
Allow
you
to
push
back
on
that
and
control
the
buffer
queueing
space,
as
well
as
the
just
order
of
requests.
C
Not what we do today, but it could be done, I guess. But that's not how...
C
I think this might be a good idea to investigate. And it's not just the client requests; it's also the inter-OSD requests in some cases, like recovery, those kinds of things can be quite large as well, or replication traffic, too.
E
You could just set up the request, update RocksDB, the PG log and everything, but then the data itself you'd accept in smaller chunks.
C
Yeah, I guess it gets into the messenger protocol, which is where I think this gets complicated to implement. Because today we do need to read in the header of the message before determining the payload size, and then we allocate a buffer for the whole payload and read that in so that we can decode the payload.
E
Yeah, so I'm just wondering if it's possible to just decode the header, do all the work that needs to be done, and then start. Then you need to have an active thread pulling the data and staging it in chunks. But you don't need all the buffers; you could do double buffering of 64 kilobytes and just get the whole data that way, instead of getting one big one-megabyte buffer. I don't know what happens if somebody gives you 128 kilobytes, or a megabyte.
E
And then this thing is going to be sitting idle for a very long time, because until it is complete it cannot be touched, and that's not a very good usage of resources: it's going to be sitting idle most of the time, just waiting for anything to happen, while you could have started staging things. So in effect you're actually slowing things down. I would even suspect that doing multiple 64k loops would be faster than giving you an unsolicited one megabyte or whatever.
C
How would the multiple 64k loops work when you need to write down the entire write as a single update to the disk? Because I think you're sending that off to the object store as a single transaction.
A
I've had these kinds of similar thoughts, not exactly in this case, but with other things, and the weight of changing something like this is just so big. Maybe once you get into it it's not so scary, I don't know, but that's always been the thing that's kind of held me back from trying to do stuff like this: it's just such a big change.
C
Yeah, that's a more niche case, I think. That QoS aspect, though, when you have multiple writers who are using very different kinds of workloads and getting very different results because of that, may be more relevant, yeah.
C
Users with, say, one client writing very small objects and one writing very large direct I/Os run into this kind of situation.
A
Yeah, it's not like there's one best performance; there's best performance in specific situations, right? Like, say, for RBD: there are cases where you're better off with a larger object size, and there are cases where you're better off with a smaller object size, right?
E
...it allows you to write a complete object at once, if you write full objects rather than sub-object writes...
A
It's been a long time since I've looked at any of this, and it's quite possible that things have changed now. But at least back when we did this, and really, four megabytes has been the default for a long time, it was kind of the middle-of-the-road option: not always best, not always worst, but reasonably good in all scenarios from what we saw.
A
You know, the funny thing is that a hard disk, even at like 512k, is reasonably good; it should be able to do 512k writes pretty well if you're doing full object writes. But we still saw benefit going up to four megabytes, from what I remember; it went beyond that.
A
Gabi, I remember back when I was doing a lot with Lustre.
A
There
definitely
were
advantages
when
talking
directly
to
the
block
layer.
You
have
the
ability
to
do
like
rights
that
were
like
two
megabytes
to
the
underlying
disk
array.
A
lot
of
times,
optimizations
that
we
did
you
know,
are
same
with
mac
sectors,
kb
to
do
one
megabyte
or
two
megabytes
or
whatever
the
driver.
Let
us
do
so
at
least
back
then,
on
hard
drives,
there
seemed
to
be
advantage
to
being
able
to
like
be
you
know,
doing
big
rights
to
the
array.
E
I
would
probably
just
my
I
suspect
that
that
was
because
you
compared
single
non-q,
not
overlapping.
Sorry,
you
didn't
use
multi
io.
E
I think if, instead of using one megabyte, you just send two requests, and as soon as one of them comes back you send the next one, so you're keeping a pipeline of 64-kilobyte requests, it will give you better performance than using one big buffer. Because of the thing I discussed before: you have to wait until the buffer is full, and you don't take any action until then, so there is a big window between two operations when you're not doing overlapped I/O.
E
Multi,
if
you
have
a
queue-
and
you
send
few
asynchronous
io,
and
that
there
is
an
active
queue,
you
don't
benefit
from
a
very
big
I
o,
I
I
I'm
sure
I
can
I,
if
you'd
use
fiona
and
ask
it
to
use
q,
64
kilobyte
and
use
a
q
dep
of
four
versus
one
megabyte
in
q
type
of
one.
You
would
see
that
the
64
or
even
16
kilobytes
would
be
faster.
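(That comparison is straightforward to run as two fio invocations; a hedged sketch with a placeholder target:)

```
# Pipelined small blocks: 64k at queue depth 4
fio --name=pipelined --ioengine=libaio --direct=1 --rw=write \
    --bs=64k --iodepth=4 --size=1g --filename=/dev/nvme0n1

# One big buffer at a time: 1m at queue depth 1
fio --name=onebuffer --ioengine=libaio --direct=1 --rw=write \
    --bs=1m --iodepth=1 --size=1g --filename=/dev/nvme0n1
```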
E
It could be more efficient if you're given 64k, because you would be taking less time to fill the data; you always have something working, you're going to create a pipeline. But there is never a case in which you push that much data. The only case I'm aware of is iSCSI, because from the beginning of time with iSCSI, people thought about it as something where you write from the east coast to the west coast, your I/Os over TCP, and with that kind of distance...
E
The big pushers of data, especially, used to be RAID systems, and the thing there was that you were always looking for a full-stripe write; that's why you ended up with such big I/Os being the optimum. In the RAID, a large block starts at four meg or 16 meg, depending on what the firmware says, and because of that you go for streamers, for just sequential I/O, and then deal with the data, writing it straight through.
E
I think that might be something for the direct HDD media: to understand the optimal amount of data that could be written, perhaps in a single track, just with one revolution or something like this, to get the best performance for large I/Os. But yeah, anyway, for flash it might depend on the media.
E
If you look at Oracle: Oracle likes to do one-megabyte writes, where they pack a lot of information and then send you one megabyte. That thing exists, but of course they cannot send you the data unsolicited; they tell you that they want to do one megabyte, and then you're going to pull the data. So it's different.
I
Yeah, the pure SCSI protocol doesn't allow that much data for a single write request. I don't remember what the limit is, but you can't have such a big amount undivided, so you have to split it up at the driver level already.
E
I think we don't have an easy way to deal with this. I would suggest, if possible, limiting the max size internally from 128 megabytes to 4 megabytes, because 4 megabytes is what the client is using; don't allow anybody to pass more, since anyway you don't expect this thing to happen and there is no benefit. And then check on the client to see whether one megabyte, two megabytes, or four megabytes is the sweet spot.
C
But I guess a question that keeps coming back to me is: how much of a practical difference would this make for our purposes if we did, like, the rearranged buffers, or pre-processing the metadata without actually storing the data on the server side?
E
It
would
apply
to
a
bigger
change
to
the
system
in
which
you
stop
doing
dynamic
allocation,
because
everything
is
done
in
a
preset
values.
So
you
don't
need
to
allocate
one
megabyte
four
megabyte
you're
just
going
to
allocate
some
size
of
buffers.
I
don't
know
you're
going
to
have
a
pool
of
64k,
maybe
128k
and
you're
just
going
to
recycle
buffers
on
there.
A
Yeah, you allocate space for however many requests you're asynchronously handling, right, and then you've got another little buffer that lets you pull in from that whenever data is actually being sent.
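(A minimal sketch of the fixed-size recycling buffer pool being described, assuming 64 KiB slots; an illustration only, not Ceph code:)

```cpp
#include <cstddef>
#include <vector>

// Fixed-size buffer pool: allocate all 64 KiB slots up front, then
// recycle them instead of calling new/delete on the I/O path.
class BufferPool {
 public:
  static constexpr std::size_t kSlot = 64 * 1024;

  explicit BufferPool(std::size_t slots) : storage_(slots * kSlot) {
    for (std::size_t i = 0; i < slots; ++i)
      free_.push_back(storage_.data() + i * kSlot);
  }

  // Returns nullptr when the pool is exhausted; the caller applies
  // backpressure instead of growing the pool dynamically.
  char* get() {
    if (free_.empty()) return nullptr;
    char* p = free_.back();
    free_.pop_back();
    return p;
  }

  void put(char* p) { free_.push_back(p); }

 private:
  std::vector<char> storage_;   // one contiguous preset allocation
  std::vector<char*> free_;     // recycled slots
};
```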
E
Because when I first saw the OSD, I was surprised by how much memory it can use. At first I was really trying to pack everything, and then I realized there's so much memory; I'm not used to having so much memory. Back on the Symmetrix there was one gigabyte per director, which would be roughly the equivalent of an OSD, though not exactly, because one director can manage 128 drives.
A
Josh, how horrible would it be to add, like, a fast, or not a fast path, but an async option to RADOS: like, I'm going to tell you I want to do this, and then let the OSD tell you "I'm ready to have the data come in". It's not the primary path, just a secondary path.
C
I think it really depends on how in-depth you want to go, like what you were saying about making it go all the way down to the ObjectStore layer, where you're able to kind of take this in small chunks. That's really, like, the...
C
Yeah, I guess it's, like, whenever you have a mixed small-object workload, or high and low priority in general, where one low-priority client is taking up lots of buffer space today, right.
A
But it's not even any of this that really makes our memory situation bad; it's memory fragmentation, due to a ton of onodes being in memory at the same time that themselves have all these dynamic structures inside them, and, you know, hobject_t and everything. It's just the whole mess of it all that makes things horrible.
C
Yeah, I think that's a good point. It's kind of similar to an idea we talked about when we were starting out with Crimson as well: trying to pre-allocate all the structures needed for a request up front. So you could use that kind of pooling and not need to do dynamic allocation in the I/O path. But taking it a step further and doing incremental reads off the wire increases that memory saving further, yeah.
C
Yeah, it's very challenging to switch the system to work like that when it's designed the opposite way.
A
And actually, on that note, I talked to Sam about this exact topic last week, and he did mention that dynamic memory allocations are one of the things he's not really thinking about much right now; he's just trying to get correctness right. But they are using a lot of dynamic memory, so this is an area where it maybe wouldn't be bad for someone else to come in and think about it, as Sam is trying to simultaneously make sure correctness is right, you know, are we making...?
A
You know, the root of all of it is ghobject_t and hobject_t, right? That's kind of where, in my mind at least, it all starts. You've got strings being dynamically allocated for things like the object name and everything else, and then it just kind of grows from there, in my mind.
C
Oh yeah, and as we were talking about: potentially, when creating snapshots, you have metadata from the head object as well as the snapshot, so you could say double that number.
E
Ideally, in a system, you'd like to set a limit on how many requests you're allowed to accept, and then everything should derive from there, because you say every request could use that many resources, and then you pre-allocate everything and you know how much... sorry, you cap not the queue but the in-flight objects. That's why I was surprised to see that the cap is on the queue; but that's because of the unsolicited data.
C
Yep, yeah, at the lower layer, like in BlueStore, we do have more throttling going on to control how much data is in flight; there's a byte total, and maybe an operations throttle too. But that's kind of what's controlling, capping, the I/O resources from being overwhelmed.