From YouTube: Ceph Performance Meeting 2021-09-02
A: All right. Well, I'll quickly just mention here that I linked in the etherpad a spreadsheet that the core team has already seen, but for folks that are interested in Crimson: this is our first attempt, which Brett had kind of requested, at looking at trying to simulate what we might see with a multi-reactor setup by just using multiple OSDs.

A: We can do this now, so these tests are basically just looking at what happens when you stick multiple OSDs on a given device, both with our traditional classic ObjectStore implementations and also in Crimson with SeaStore and with AlienStore using MemStore or BlueStore. There are a couple of interesting things, probably the biggest of which is that we do see good scaling with random reads and still see lower CPU usage in terms of cycles per op. That's really good.

A: I think that means that when we do multi-reactor we're going to see that too. I don't see any reason why we wouldn't, if we can do it with multiple OSDs. So, you know, we'll see, but that's good news. What I'm a little more concerned about is why we're not seeing better scaling with random writes; classic BlueStore is actually really good compared to everything else.

A: We've got, when you have high IO depth with random writes... so we probably have some work to do there, even just baseline with one OSD. We don't do great with either MemStore or SeaStore, but yeah.

A: I was expecting to see almost a linear increase, so I have to figure that out. Then there's this weird behavior with sequential writes in, sorry, in Crimson. I don't know if we're just doing some write combining through the classic OSD; we see, with both MemStore and BlueStore, that we're quite a bit faster than anything through Crimson. So, another thing to look at. That's it, though. Really a lot of data, but those were kind of the trends.
A: All right, let's move on then. Today, Josh, do you want to talk about billions of tiny objects and stuff?
B: Sorry, okay, so this came up from a discussion with Software Heritage. Loïc, actually, I think, brought this up at a CDM call maybe four months ago, six months ago, something like that. They basically came back and they designed this whole architecture that sits on top of RADOS to store bajillions of tiny objects. Basically they're write-once objects; they're source files.

B: So it's kind of a different use case than what RADOS is normally used to, and their approach in a nutshell is basically to have RADOS on the back end, and all RADOS is going to be doing is storing RBD images that are like 100 megs or something like that. I can't remember how big, no, not 100 gigs, I don't know, they're big RBD images, and those RBD images are not actually file systems or anything. It's just essentially a big file.

B: That's a concatenation of a bajillion of these objects, and so they have these write servers that they put in front, that sort of serialize the writes and just append to these big objects, and they dump it into RADOS, and then they have a whole bunch of databases that track how to look them up later.
B: So the main challenge that they have is that they want to be able to read an object by the hash of the object content, and so they have some other external database where you look up the hash of the object. It tells you which chunk it's in, which RBD image (they're called shards), which shard it's in and what offset in that shard, and so they can go find that object. And they have to sustain certain read rates and write rates, but that's sort of the architecture.

B: In a nutshell, the main sticking point is that they want to be able to look up by object ID, but they also want this efficient packing, and so the architecture that they defined on the wiki, I think it'll work.

B: Fine, it's just a whole lot of moving parts, and it occurs to me that we can do almost the same thing, almost exactly the same thing, but with the tiering v2 stuff that we've started to build up, using a subset of the RADOS manifest stuff, object manifests, where basically you have a RADOS object where in the onode there's a thing that says...
B: So basically, the idea has two main parts. You could make a RADOS class, so it'd be a little bit different in that, in their proposal, they have these big RBD images that are striped across RADOS objects; in this proposal we'd simplify that so that they're just RADOS objects. Each shard would be a RADOS object, and we'd make them as big as we think is sane. I don't know if that's 64 megs; I think we have a hard-coded limit of 128 megs, but I wouldn't want to get bigger than that.

B: I think 64 megs is probably big enough, and so you pack up, you know, thousands, many thousands, of these inside the RADOS object, and so you'd have a RADOS class where you say "append this thing to the shard", and it either would say the shard is full or it would say sure, and it would... the return...

B: The write return would include the offset that it wrote to, and then you would take that offset and create a second object in a different pool that's named after the object hash, and you would set the manifest for that object to point to the shard and the offset where the data is. And so, as a result, if you want to read the object, you just read the hash from the index.
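A minimal sketch of that write path, with loud assumptions: the pool names are placeholders, the packing is done client-side with a stat-then-append (the actual proposal would do this atomically in a RADOS object class that returns the offset or reports the shard as full), and a plain xattr stands in for the manifest / set_chunk pointer being discussed. It uses the Python librados bindings.

```python
# Hypothetical sketch of the proposed write path, not an existing Ceph API.
# A blob is appended to a big packed "shard" object, and a tiny per-hash
# index object in another pool records where it landed.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
archive = cluster.open_ioctx("archive-pool")   # big packed shard objects
index = cluster.open_ioctx("index-pool")       # one tiny object per content hash

def store(content_hash, data, shard_name):
    # 1. Find the current end of the shard and append the blob there.  In the
    #    real proposal a RADOS object class would do this in one op on the OSD
    #    and return the offset (or report that the shard is full).
    try:
        offset, _mtime = archive.stat(shard_name)
    except rados.ObjectNotFound:
        offset = 0
    archive.append(shard_name, data)

    # 2. Create the index object named after the content hash.  The proposal
    #    would set a RADOS manifest (set_chunk-style) pointing at the shard
    #    and offset so that reads get proxied; an xattr stands in for it here.
    index.write_full(content_hash, b"")
    index.set_xattr(content_hash, "location", json.dumps(
        {"shard": shard_name, "offset": offset, "length": len(data)}).encode())
    return offset
```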
B: The tiering in RADOS would then do its little read thing where it proxies the read over to the chunk and reads it out, and so the reading is sort of trivial, existence checks are trivial; RADOS just handles that. And then, if you want to do the bulk stuff, you would go to the archive tier and just enumerate the chunks.

B: The shards, whatever you want to call them, and you'd read them in their entirety. So you'd read these big 64-meg objects and you could stream them out or whatever, and I'm hoping that size is big enough to meet their bulk requirements. I think that would work too.
A
B
A
B: A million. They want to enumerate a million objects at a time, and so they probably just said, oh, we can fit a million objects into whatever it is, 100-gig RBD images, but I don't know that it really matters how big that is for them. My guess is that 64-meg shards are a big enough bulk container that you could still efficiently stream these out in bulk, en masse. You could... you could?

B: Write once, delete never, read frequently probably, whatever it is, yeah, yeah. So I think that, I mean, there are a few implementation quirks, because writes wouldn't be atomic: if you have two people trying to write an identical object with the same hash at the same time, they might append it to the same shard or to two different shards before they realize that the first one already completed and created the index object. So, if you had simultaneous writers, they might waste a tiny bit of space. I don't think that matters.
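A small sketch of how a writer could cope with that race, reusing the hypothetical pools and store() helper from the earlier sketch. Losing the race only wastes the bytes already appended to a shard, since the index object for a given hash is only created once.

```python
# Hypothetical race handling for concurrent writers of the same content hash.
# A real implementation would use an exclusive create on the index object;
# a stat-then-create check is enough to illustrate the idea.
import rados

def store_once(content_hash, data, shard_name):
    try:
        index.stat(content_hash)   # does an index object for this hash exist?
        return                     # yes: another writer already won, nothing to do
    except rados.ObjectNotFound:
        pass
    # Append to a shard and create the index object (see the earlier sketch).
    # If two writers pass the check at the same time, both append, one creates
    # the index object, and the loser's appended bytes simply go unused.
    store(content_hash, data, shard_name)
```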
B: It probably never actually happens in practice, so I wouldn't worry about it. Otherwise, I think it would work fine. The real question, I think, for us is whether we want to support that RADOS manifest op, because so far it's a small piece of all that work around the deduplication and tiering v2 stuff, and there are definitely bugs in there, I think mostly around the reference counting, but I think they don't need the parts that are buggy; we could narrow it down.

B: So it's just this idea of having a manifest that points at something, and ignoring all the other stuff. We could separate it out into a RADOS op, a separate RADOS op, so that it's distinct. But I don't remember exactly what tests are failing, so I'm hoping... yeah.
E: Yeah, I think, from my perspective, there are two kinds of bugs: there are tiering bugs, and then there is development going on in this manifest area which does expose bugs, and those have also been fixed. So I'm not 100% sure, because I haven't looked at all the changes that have gone into this area, how stable it is. If we can get confidence, you know... maybe I'll talk to Sam as well, because he's been reviewing some of those PRs.
B: The second question, for them, is whether this alternate set of size trade-offs, or whatever, is acceptable to them, because if so, I think it would be way less work. Because, I don't know, if you pull up that page on their wiki, they've got Postgres databases and a couple of different servers and dispatchers.
B
B
E: Yeah, I'll take a look at any existing bugs that are popping up currently and also check with Sam as to how stable he thinks it is. Some of them have been, you know, mostly test issues that have gotten fixed over time.

E: Some are race conditions that have also come up, but as far as I know, recently I haven't seen any of those, so I'd double-check on those. And in general I like the idea, given that we're already actively developing in that area; if they can just reuse the same logic, that makes sense to me, if that works for them as well.
B: I think the one thing that I might suggest we do: I think a lot of the complexity with the current manifest stuff is that it's sort of coupled with the reference counting, so that if you have an object with a manifest that points somewhere else, it increments the reference count, and if you delete it, it decrements the reference count, and for this use case we don't need that. So I think we should make sure that it's possible to describe a manifest...

B: ...that, I don't know if there's a flag or if it's a different type of reference, isn't a reference-counted reference but is just a pointer, and possibly separate that out into a separate RADOS op to set these. Because right now the RADOS op that you use to create this is called set_chunk, and you can set basically an arbitrary manifest with arbitrary content, so it's kind of a wide surface.
B
C
B
B
C: Yeah, and that would actually work for other users too that have the right kind of archival, write-once-read-many case, or that literally never want to delete, for similar kinds of reasons. Not necessarily to archive it forever, but to meet legal requirements or something, to keep it.
B: Yeah, and my recollection here is vague, because it's been a while since I looked at this, but I thought that we actually had, in these manifests, a flag to indicate whether or not it was a...

B: Whether... yeah, there is a flag for whether we have a reference. So in the manifest structure, when it points at something, there's a flag on that reference that says whether we're responsible for decrementing a reference on the remote object or not, so we just wouldn't set that flag. I think all the plumbing is there; I think we might just want to make a sort of more narrowly scoped RADOS op, perhaps, that does just this, so that we can...
B: The other simplifying thing in this particular case is that a reference is only ever one extent, whereas we actually support, you know, a whole list; a single logical object might have lots of references to lots of other bits for the dedup case, where you chunk it up, and we don't need that here.
A: Do you know how hard that requirement is, that the first byte of any object never takes longer than 100 milliseconds?
B
A
B: Yeah, I mean, none of these require... I mean, there's always going to be some 99.9th percentile where that doesn't happen, but I think in general, if your index tier is on SSDs and the other tier is on whatever, the reads will be pretty quick, because you'll hit the index tier and it'll proxy the read off to the thing, and usually RADOS will serve that in not more than 100 milliseconds. But it all depends on how you provision your storage, right? So it totally depends, yeah.

B: Basically, the performance balance is a little bit different, so I think those would certainly want to be pure SSD; you'd want to have just NVMe backing in that case so that you get reasonable performance. But I'm not sure we've really done much performance testing of how BlueStore behaves in that kind of scenario.

B: Okay, it kind of depends on... yeah, it depends on what our expectations are.
C: Would it be bad if it was all spread out across the, yeah, cluster? Like, if you had OSDs that have a mix of NVMe in there, that might make it better.
C
B: Yeah, yeah, and if all the OSDs are hybrid with disk and flash, right, the metadata would land mostly on the disk, yeah. That might be another... I mean, I think these are probably performance questions that they could experiment with today, right? Even if they're not using the manifest API, they could just create an object with an attribute that has... well, they don't even have to create an attribute.

B: You could just create empty objects and, modulo like 50 bytes, it's going to be basically the same as what the end result would be, and just see what happens when they create that number of objects, either on dedicated OSDs or on OSDs that have, like, the expected ratio of...
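A minimal sketch of that experiment, just to make it concrete; the pool name and the object count are placeholders, and the loop simply creates empty objects so the per-object metadata cost can be observed on the target OSDs.

```python
# Hypothetical experiment: create a large number of (near-)empty objects to see
# how the cluster and BlueStore metadata behave at the expected object count.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("index-pool-test")     # placeholder pool
for i in range(1_000_000):                        # placeholder count
    # An empty object is within ~50 bytes of what the real index object
    # (name plus a small pointer) would cost, per the discussion above.
    ioctx.write_full("hash-%08d" % i, b"")
ioctx.close()
```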
C
B: Yeah, let's see, that's the idea. Okay, the tiny-object case, yeah, and that's, I think, the whole point of going down this path: to get that right, so that you can just say "I have the object hash", go straight to RADOS, and it handles the indexing, figuring out where that content is. Right, right, just thinking.
B
C: Yeah, because I think it was pretty large, like the onode was around 10K, but that's probably because of all the checksumming and blob information for a large object; for these it would be much smaller.
B
B
B
C: Yeah, yeah, and the fact that they never delete is actually a benefit, because deletes are really expensive with RocksDB, so that probably improves their performance quite a bit.
B: So this is sort of the extreme version. If you take one step back, instead of using the manifest thing, you could just create the object with an attribute holding that same data, and then the readers could stat that object, get the offset and the chunk, and go read it directly, without having RADOS do it transparently and magically for you.
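A minimal sketch of that reader side, assuming the same placeholder pool names and "location" xattr layout as the earlier write-path sketch; here the client resolves the pointer itself instead of RADOS proxying the read.

```python
# Hypothetical reader for the "attribute instead of manifest" variant:
# read the small index object's xattr, then read the packed extent straight
# out of the shard object.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
archive = cluster.open_ioctx("archive-pool")   # packed shard objects
index = cluster.open_ioctx("index-pool")       # per-hash pointer objects

def fetch(content_hash):
    loc = json.loads(index.get_xattr(content_hash, "location"))
    # Read only the extent we need; no server-side redirect is involved.
    return archive.read(loc["shard"], loc["length"], loc["offset"])
```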
B
B
C
B
B
B
C: Anyway, okay, looking at the tracker, there are like 27 open bugs in the tiering category. I'm not sure how many of those are duplicates, but there are a number of them that are even from this year.
C
E: Yeah, yeah, so talking of those bugs, some of them could definitely be related to old versions; I see some from 2017 and so on, which we are not seeing today. I don't remember having removed any tiering tests yet, so I'd be curious to see the newer ones, the 2020 and 2021 ones, which are still relevant. So, yeah.

E: If we really want to go down this path, then we want to see how relevant these are, and, as I said, I also want to understand Sam's confidence, because some of these bugs have been ignored in the past because nobody was using this, so we didn't care too much about their priority. But if somebody wants to use it, then we will take it seriously.
C: Yeah, there are definitely things around ordering and recovery and PG logs between tiers that have not been fixed.
E
C: Yeah, I mean, Gabi and I were talking a while back about an idea like this, where you could have a specialized pool just for storing metadata like this, or you could have a dedicated kind of onode structure that wouldn't need to take much space; it would be more like, maybe, a larger xattr that would contain these kinds of references.
C
E: Yeah, I was just saying, Sam has just joined. Sam, we're just curious to know your confidence in the tiering code and in the manifest handling in the OSD, how far along it is.
F
C: Yeah, I guess Sage was kind of hoping that we're trimming it down to a very small subset of what the manifest is capable of for this use case, namely just tracking...
C
F
F: Because the cache pool it's coming from still needs to store the metadata pointing at those packed objects, right?
F
F
F
G: So the main question here, and I'm just jumping in as a newcomer to this problem, is whether to use existing infrastructure at the cost of maybe some efficiency, or to invent something entirely new for efficiency, with the cost then of creating something entirely new. That's the dilemma, yeah.
C: Or not do anything with that structure, and instead do this outside of Ceph, using an existing database to handle this kind of redirection metadata.
A
E
F: I mean, we have the ability to set a redirect at the cache pool to an object in the base pool, if I'm remembering right, but it doesn't actually have the ability to point to a sub-object, so the packing feature would still need to be implemented. That's a whole new thing that doesn't exist now.
C
F
A
F
F
A: Sam, I somewhat disagree with your premise that RocksDB is as well suited to this. They have a requirement that no first-byte read of any object takes longer than a hundred milliseconds. I don't think RocksDB can do that, well, not in one instance, but with maybe, you know, hundreds of millions to a billion key-value pairs or more in one instance.
F
C: The Software Heritage Foundation use case, where they're kind of archiving billions and billions of tiny objects, mainly the objects from git repos, and trying to pack them together and index them. Their indexing scheme is actually in a separate storage system, but they're going to use Ceph purely for storing the contents.

C: The contents of the repos, and potentially the mapping for them.
F: ...suited to this, namely using CRUSH to distribute the objects across OSDs; that was a good fit.
C: Their alternative is basically using RBD images: they have their own indexing scheme where they write a bunch of small objects into the RBD, keep their own index within that RBD of which objects are where, and also store the information about how to look up the different repos in a separate database.
E: Yeah, it does sound like... Josh, you asked this question earlier, but if they want a solution as early as, you know, now, or soon, maybe going the RBD route is a better way for them, given the state of the tiering code.
C: Yeah, certainly right now. I think their time frame is more like six months to a year, but even if we started working on this stuff now, I think it wouldn't be ready before Quincy. So...
C
F: Yep, this would be a large development effort. Like I said, the large-scale properties of Ceph seem to fit this well, that is, a way of distributing these keys across the whole cluster and dealing with recovery and so on, but there are a bunch of details that really don't fit. Like, we like to do recovery on a per-object basis; that's flatly inappropriate for this. Even within the OSD we'd want to pack them and only consider them in bulk during recovery or whatever. Yep.

F: So I'm not sure this is even just a matter of making efficient onodes. This might be a matter of creating a pool where, within each host, we have a much more complicated indexing structure, yeah. So...