From YouTube: Ceph Month 2021: Crimson Update
Presented by Samuel Just
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
All right, I'm going to give a bit of a rundown of some of the recent Crimson work and a more detailed dive into the ongoing SeaStore work. First, a bit of a refresher.
So what is Crimson? Crimson is an effort to replace ceph-osd with a new implementation, crimson-osd, which is better suited to the demands of next-generation storage hardware by requiring less CPU overhead per IO.
The aim is to improve throughput per core. The classic OSD is capable of driving a great deal of throughput; it just requires a lot of cores to do so, and we'd like to improve that. There are a few ways we're looking at doing this: we're using the Seastar framework to try to avoid context switching.
Let's talk a bit about what's been happening lately. Recent Crimson work has focused on a few broad areas: implementing RADOS features, data durability and reliability, visibility and debugging, and stability.

Recent work has focused on getting RBD workloads up and running, for a couple of reasons. One, it's in a way the most straightforward workload. Another is that RBD workloads tend to be the ones most sensitive to CPU overhead in the core write path, so it seemed like a good place to start so that we can start getting good performance information.
So, as Crimson is a drop-in replacement for ceph-osd, it also needs to implement all of the relevant data reliability and durability features that we rely on. Radek managed to merge the backfill implementation last year, with some initial testing, and Ronen did some work to refactor scrub to enable code sharing between the classic OSD implementation and Crimson, and the initial version has been ported to Crimson. Backfill should be the last piece needed to enable Crimson in Quincy to do appropriate failure recovery and rebalancing, and should allow Crimson to survive teuthology failure-injection testing. Scrub gives us a way to ensure that backfill is doing its job.
As Crimson starts to approach a state where a wider audience might be interested in testing and benchmarking, it's important to improve debugging and performance visibility. To that end, the team has been working on wiring up something like perf counters using Seastar's metrics framework, and on exposing these counters through a Prometheus endpoint for cluster-level statistics aggregation. Work has also been done to improve backtraces on crash, although more work will need to be done there.
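As a rough illustration of what wiring counters through Seastar's metrics framework looks like (a minimal sketch, not the actual Crimson code; the group and counter names here are invented), a counter registered this way is picked up by Seastar's Prometheus endpoint:

    #include <seastar/core/metrics.hh>

    namespace sm = seastar::metrics;

    class client_request_stats {
      uint64_t ops = 0;            // bumped on every client operation
      sm::metric_groups metrics;   // keeps the registration alive
    public:
      client_request_stats() {
        // Hypothetical group/counter names; anything registered like this
        // is exported through Seastar's Prometheus HTTP endpoint.
        metrics.add_group("osd_client", {
          sm::make_counter("ops_total", [this] { return ops; },
                           sm::description("client operations processed"))
        });
      }
      void on_op() { ++ops; }
    };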
The main focus lately, though, and going forward, has been stability. Crimson can now be tested with teuthology and has an initial set of tests in the crimson-rados QA suite. Tons of work has gone into expanding that set and fixing the crashes it exposes.
SeaStore, the new object store being developed for Crimson, is designed to avoid CPU-heavy metadata designs like RocksDB, and it's intended to exploit emerging storage technologies: ZNS, persistent memory and, in general, fast NVMe devices.
A
So
I
mentioned
zns
zns
is
a
new
nvme
specification
intended
to
address
challenges
with
conventional
ftl-based
flash
designs,
traditional
ssds
implement
what
is
essentially
a
log
structured
file
system
internally,
where
writes
are
performed
to
free
regions
of
the
disk,
with
a
dynamic
logical
to
physical
mapping
being
updated.
In
the
background,
doing
random
writes
to
one
of
those
regions
of
the
disks.
First
requires
relocating
the
data
and
erasing
it
because
you
can't
do
a
random
write
on
an
ssd.
You
need
to
do
writes
in
erasure
block
size
chunks.
The resulting write amplification is particularly a problem for new quad-level cell (QLC) flash technologies, which have relatively low write endurance. Where ZNS differs is that it changes the interface to the drive by dividing the drive into zones, which can only be opened, written sequentially, closed, and released.
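As a purely conceptual sketch of that zone lifecycle (not a real kernel or library API; the type and its fields are invented for illustration):

    #include <cassert>
    #include <cstdint>

    // Illustrative model of a single ZNS zone of `capacity` blocks.
    // Writes may only land at the write pointer, and the zone must be
    // reset (released) as a whole before it can be reused.
    struct zone {
      enum class state { empty, open, full };
      uint64_t capacity;
      uint64_t write_pointer = 0;   // next writable block within the zone
      state st = state::empty;

      explicit zone(uint64_t cap) : capacity(cap) {}

      // Sequential, append-only writes at the write pointer.
      void append(uint64_t nblocks) {
        assert(st != state::full && write_pointer + nblocks <= capacity);
        st = state::open;
        write_pointer += nblocks;
        if (write_pointer == capacity) st = state::full;
      }

      // Releasing the zone erases it wholesale for reuse.
      void reset() { write_pointer = 0; st = state::empty; }
    };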
A
The
other
major
technology
that
we
want
to
address
here
is
persistent
memory.
A
It
has
almost
dram
like
read,
latencies
right,
latency,
right,
latency,
dramatically
lower
than
flash
and
very
high
right
endurance,
so
it
seems
like
a
good
fit
for
persistently
caching
data
and
metadata,
particularly
data
metadata,
with
high
update
rates
that
would
otherwise
have
an
impact
on
the
right
endurance
of
an
underlying
qlc
device.
So the approach that's being discussed (this is a little ways out, hopefully this year) is to keep the caching layer in persistent memory and to maintain a copy-on-write extent mapping, maintained via the write-ahead journal, so that we can rebuild the cache mapping on restart.
So the main characteristics are that the store needs to be transactional; it's composed of a flat object namespace; object names may be large, and they may be under the control of the user, in the case of RGW; and each object contains a key-value mapping as well as the data payload. The key-value mapping is used for RGW bucket indices and CephFS directories, among other things. We also need to be able to support copy-on-write object clones for a lot of features, including snapshots and RBD snapshots, and we need to support ordered listing of both the omap and the object namespace.
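To make those requirements concrete, here is a hedged sketch of the kind of interface they imply (hypothetical types and method names, not SeaStore's actual interface):

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Illustration of the surface described above: flat object namespace,
    // per-object omap plus data payload, CoW clone, ordered listing,
    // with everything applied atomically via transactions.
    using object_name = std::string;       // flat namespace; names may be large
    using bufferlist  = std::vector<char>; // stand-in for the data payload

    struct transaction {
      // All of these take effect atomically when the transaction commits.
      virtual void write(const object_name&, uint64_t off, bufferlist data) = 0;
      virtual void omap_set(const object_name&, std::string key, std::string val) = 0;
      virtual void clone(const object_name& src, const object_name& dst) = 0; // CoW
      virtual ~transaction() = default;
    };

    struct object_store {
      virtual std::unique_ptr<transaction> create_transaction() = 0;
      virtual void submit(std::unique_ptr<transaction>) = 0;   // atomic commit
      // Ordered listings of both namespaces, as scrub and backfill require.
      virtual std::vector<object_name> list_objects(const object_name& after,
                                                    unsigned max) = 0;
      virtual std::map<std::string, std::string>
      list_omap(const object_name&, const std::string& after, unsigned max) = 0;
      virtual ~object_store() = default;
    };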
So the really high-level logical structure of SeaStore is that we have a root block pointing at an onode index, mapping ghobjects to onodes, each of which contains a pointer to an omap tree for that key-value mapping and a set of contiguous logical LBAs for the actual object data.
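A rough sketch of that logical layout (field names invented for illustration; the real SeaStore structures are more involved):

    #include <cstdint>
    #include <vector>

    using laddr_t = uint64_t;                            // logical block address

    struct extent_ref { laddr_t laddr; uint32_t len; };  // one contiguous logical extent

    struct onode {
      laddr_t omap_root;                  // root of this object's omap tree
      std::vector<extent_ref> data;       // logical extents holding the object data
    };

    struct root_block {
      laddr_t onode_index_root;           // root of the ghobject -> onode index
      laddr_t lba_tree_root;              // root of the logical -> physical mapping
    };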
A
So
why
do
we
need
an
lba
interaction,
it's
about
garbage
collection
and
relocating
data.
So
if
we
have
some
internal
structure
on
disk,
with
three
extents,
a
b
and
c
with
a
referencing
b,
referencing
c,
if
these
are
physical
mappings,
then
if
we
relocate
c
to
c
prime
here,
we
need
to
update
b
and
transitively.
Therefore,
eventually
update
a
once
b
gets
relocated.
A
By
contrast,
if
these
references
are
logical,
then
all
we
need
to
do
is
update
the
logical
mapping
for
c
the
logical
mapping
only
needs
to
maintain
these
are
the
sort
of
basically
64-bit
64-bit
mapping,
so
it
has
considerably
higher
fat
or
yeah
higher
fan
out
than,
for
instance,
the
o-mapper,
oh
no
trees,
so
there
should
be.
We
should
be
able
to
trade
extra
reads
in
the
lookup
path
for
lower
right
amplification
during
garbage
collection.
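Sketching that difference (hypothetical code, just to show the bookkeeping): with a logical mapping, relocating C during cleaning touches one LBA-tree entry instead of rewriting B and, transitively, A:

    #include <cstdint>
    #include <map>

    using laddr_t = uint64_t;   // logical address stored in referencing extents
    using paddr_t = uint64_t;   // physical location on the device

    // The LBA tree: roughly a 64-bit -> 64-bit map, hence the high fan-out.
    std::map<laddr_t, paddr_t> lba_map;

    // Garbage collection moves extent C from its old location to new_paddr.
    // B's reference to C is the logical address c_laddr, so B (and A) are
    // untouched; only the mapping entry changes.
    void relocate(laddr_t c_laddr, paddr_t new_paddr) {
      lba_map[c_laddr] = new_paddr;
    }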
So the on-disk layout of the journal looks something like this. Each journal unit is called a record, and a record contains a header with all of the usual checksum and length information; a set of deltas, which are mutation records for existing on-disk extents; and a set of new extents, either logical or physical blocks, where logical blocks are data, or things on the left side of the tree diagram really, and physical blocks are really just the blocks comprising the LBA tree itself.
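Schematically (invented field names, simplified from the real on-disk format), a record looks like:

    #include <cstdint>
    #include <vector>

    // Simplified sketch of a journal record as described above.
    struct record_header {
      uint32_t crc;            // usual checksum...
      uint32_t length;         // ...and length information
      uint64_t sequence;       // position in the journal
    };

    struct delta {             // mutation of an extent already on disk
      uint64_t target_laddr;   // which existing extent it applies to
      std::vector<char> bytes; // type-specific encoding of the change
    };

    struct fresh_extent {      // newly written block carried in the record
      bool is_logical;         // data / onode / omap blocks vs. LBA-tree blocks
      std::vector<char> bytes;
    };

    struct record {
      record_header header;
      std::vector<delta> deltas;
      std::vector<fresh_extent> extents;
    };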
A
So
here,
new
blocks
b
and
a
get
written
and
they're
sort
of
in
magenta
here
with
a
being
part
of
the
lpa
tree
and
d.
Prime
and
e
prime,
are
the
new
representations
for
the
d
d
extents
after
these
two
deltas
are
applied.
A
This
is
important
for
dns,
because
we
are
able
to
logically
mutate
extents
without
actually
mutating
them.
We
don't
need
to
actually
rewrite
dnd
or
go
back
to
where
they
are
on
disk
and
rewrite
them.
We
can
just
write
down
a
record
explaining
how
the
the
extent
should
be
modified,
there's
another
wrinkle,
which
is
that
with
zns
devices
we
don't
necessarily
it
might
be
possible
to
predict
where
this
record
will
show
up
on
disk,
but
it's
more
efficient
if
we
don't,
as
it
would
increase
concurrency.
A
So
these
internal
arrows
here
from
e
prime
to
a
and
from
a
to
b,
these
are
expressed
in
terms
of
relative
addresses,
so
that
when
we
read
this
record
back
off
of
disk
there's
a
step
where
we
need
to
adjust
any
of
the
in
extent
addresses
relative
to
their
own
base.
Basic
addresses.
This
way.
These
records
work,
regardless
of
where
they
actually
end
up
on
on
disk.
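A toy version of that replay-time fixup (hypothetical; the real code distinguishes several flavors of relative address): pointers written inside a not-yet-placed record are stored relative to the record start, and get rebased once we know where the record actually landed:

    #include <cstdint>

    using paddr_t = uint64_t;

    // A pointer inside a fresh extent, stored record-relative on disk.
    struct rel_ptr { uint64_t offset_in_record; };

    // On replay (or once the device tells us where the write landed) we
    // rebase every record-relative pointer to an absolute device address.
    paddr_t resolve(rel_ptr p, paddr_t record_base) {
      return record_base + p.offset_in_record;
    }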
A
There
are
two
major
divisions,
the
things
above
and
below
this
transaction
manager
concept,
transaction
manager
supplies
a
transactional
interface
in
terms
of
a
logically
addressed
set
of
blocks
used
by
data
extents
and
metadata
structures
like
the
gh
object
to
o
node
index
and
no
map
trees.
A
So
now
I'm
going
to
sort
of
talk
through
some
of
the
pieces
that
have
been
merged
so
far,
so
the
first
is
the
fl
tree
or
the
odon
manager
fl
tree
implementation.
A
If
you
look
in
the
source
code,
a
gh
object
is
really,
I
believe,
a
three
four
five,
a
seven
tuple
with
a
fixed
size,
prefix,
shard
pool
and
key
a
variable
size,
middle,
the
object
name
and
name
space
and
a
fixed
suffix
snapshot
in
generation.
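Roughly, the key the tree orders on looks like the sketch below (simplified and illustrative; the exact field names and types are worth checking against the source):

    #include <cstdint>
    #include <string>

    // Illustrative sketch of the onode-tree key described above:
    // fixed-size prefix, variable-size middle, fixed-size suffix.
    struct onode_key {
      // fixed-size prefix
      int8_t      shard;
      int64_t     pool;
      uint32_t    key_hash;   // the "key" component of the prefix
      // variable-size middle (may be large; user-controlled for RGW)
      std::string name;
      std::string nspace;
      // fixed-size suffix
      uint64_t    snap;
      uint64_t    gen;
    };
    // Lexicographic comparison over these fields, in this order, gives the
    // well-defined listing order that scrub and backfill rely on.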
A
The
ordering
here
matters
because
the
traversal
order
for
this
index
is
is,
is
used
for
making
scrub
and
backfill
behave
uniformly
across
nodes.
So
we
actually
need
the
listing
behavior
to
be
well
defined
here.
To save space, internal nodes drop components that are uniform over the whole node. So if you have a long extent of internal nodes that all map to the same pool, they can omit the shard and pool portions of the key, and so on. Internal nodes can also drop components that aren't needed to distinguish adjacent keys, so that if you have a sequence of nodes high in the tree, well above the point at which the object name comes into play, they don't need to include a name for their child blocks, which also helps to save space. This work was done mainly by Yingxin at Intel.
A
The
next
bit
is
the
v3
omap
manager.
This
is
also
a
bee
tree,
essentially
a
b3
structure,
and
it's
used
for
storing
the
omap
data
for
each
object.
It's
a
fairly
straightforward
string
to
string
b
tree,
contributed
mainly
by
chad
may
at
intel.
A
The
object
data
handler
is
the
component
responsible
for
actually
mapping
object,
data
to
logical,
logical
extents.
It
uses
the
lba
manager
to
avoid
needing
to
maintain
a
secondary
extent.
Map
clone
work
is
still
or
clone.
Support
is
still
a
to
do
and
requires
the
ability
to
relocate
logical
extent
ranges
which
is
future
work,
so
this
transaction
manager
is
the
sort
of
intermediate
layer
between
everything
I've
mentioned
so
far
and
the
lower
level
implementation
details.
Users of this interface can create dependent delta data types which can be included transparently in the committed journal record. This enables, for instance, the B-tree omap manager to represent the insertion of a key into a block by encoding just that key-value pair, rather than needing to encode a full new copy of the block, or everything after the point in the block where that key shows up. So it allows more compact encodings. Components using the transaction manager can also ignore extent relocation completely.
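As an illustration of the idea (hypothetical types, not the actual interface), an omap insertion delta only needs to carry the key-value pair plus the logic to replay it against the block:

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical sketch of a type-specific delta: inserting one key into
    // an omap leaf is journaled as just the pair, not the whole block.
    struct omap_insert_delta {
      std::string key;
      std::string value;

      // Replay: re-apply the logical mutation to the in-memory block image.
      void apply(std::vector<std::pair<std::string, std::string>>& leaf) const {
        auto it = std::lower_bound(
            leaf.begin(), leaf.end(), key,
            [](const auto& kv, const std::string& k) { return kv.first < k; });
        leaf.insert(it, {key, value});
      }
    };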
The cache component manages a set of in-memory extents, mainly including any dirty extents whose current representation differs from the on-disk representation due to deltas that have been written between when the extent was last written out and now. It is also the component responsible for detecting conflicts. Finally, the journal is responsible for actually handling the atomic write and replay of journal records.
A
The
segment
cleaner
is
the
component
responsible
for
garbage
collection.
It
tracks,
it
also
tracks
the
sort
of
usage
status
of
segments.
It
runs
a
background
process,
though,
within
the
same
reactor
thread
that
sort
of
in
a
loop
chooses
a
segment
relocates
any
life
extents
within
it
and
then
releases
it.
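In sketch form (a toy standalone version of the loop just described; in SeaStore this runs asynchronously inside the Seastar reactor, and the selection policy and I/O are stand-ins here):

    #include <vector>

    // Toy sketch of the segment cleaner's background loop.
    struct extent { /* a live extent within a segment */ };

    struct segment {
      std::vector<extent> live;          // extents still referenced
      bool empty() const { return live.empty(); }
    };

    struct segment_cleaner {
      std::vector<segment> segments;

      segment* choose_segment() {        // e.g. pick a segment worth cleaning
        for (auto& s : segments) if (!s.empty()) return &s;
        return nullptr;
      }
      void relocate(extent&) { /* rewrite the extent into a fresh segment */ }
      void release(segment& s) { s.live.clear(); /* mark reusable */ }

      // One pass: choose a segment, relocate its live data, release it.
      void clean_one() {
        if (segment* s = choose_segment()) {
          for (auto& e : s->live) relocate(e);
          release(*s);
        }
      }
    };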
So the status is that we have very initial OSD support just merged, in that there's a vstart --seastore command-line option that will allow you to start, and almost immediately crash, an OSD once you do any actual IO. In addition to crashing, it also doesn't support snapshots, among other things.
A
The
ongoing
work
will
be
to
stabilize
this,
to
the
point
where
you
know
all
of
the
features
that
exist
actually
work
and
then
to
start
working
on
optimizing
performance,
particularly
the
garbage
collection,
I
think,
will
be
an
important
focus
in
the
near
future
and
then
there
we
need
the
ability
to
remap
logical,
extents,
to
support
clone
and
then
slightly
further
out
we're
working
on
in-place
mutation
support
for
fast
nvme
devices.
A
What
I
described
here
is
so
far
is
focused
on
zns
and
slower
ssds,
where
functional
rights
are
helpful,
but
for
faster
devices
like
optane.
That's
not
really
helpful,
so
we're
working
on
adapting
the
internal
interfaces
to
enable
direct
mutation
for
devices
where
that's
efficient
and
then
we're
also
working
on
adding
support
for
persistent
memory
somewhat
further
out.
We
want
to
add
support
for
tiering
so
that
we
can
combine
fast
small
devices
with
lower
higher
capacity
devices,
for
instance,
persistent
memory
with
qlc
or
obtained
with
well
also
qlc.
Questions. There's a question in the pad: can you explain the role of SPDK? Is it layered above or below SeaStore?
A
Spdk,
if
it
comes
into
play
at
all,
will
be
below
it
that
so
the
the
low
level
I
o
details
need
to
be
plugged
through
c
store,
z,
star
rather
sorry.
So
that's
how
that
would
work.
It
would
be
an
alternative
to
the
posix-based
accesses
we're
using
so
far.
A
Oh
question
crimson
osd
going
to
support
bluestora.
Initially
it
actually
does
right
now,
so
c-store
is
a
whole
different
thing.
This
is
a
different
way
of
doing
storage.
Right
now,
crimson
osd
already
supports
blue
store
via
sort
of
a
queuing,
separate
threads
system,
so
yeah
crimson
sd
will
support
blue
store
and
does
support
blue
store.
A
Iops
performance
increase
of
crystal
osd
ssd,
with
blue
store,
not
directly.
No
sorry,
particularly
c
store
c
store
is
really
immature.
We
don't
really
have
data
on
that.
A
Though
it
didn't
so
the
design
of
c
store
didn't
really
come
from
a
university
research
project.
It's
pretty
similar
in
a
lot
of
ways
to
butter,
fs
and
other
log
structured
file
system
implementations,
and
it's
not
wildly
dissimilar
from
ffs
either.
I think the ZNS thing is something manufacturers are promoting because of demands from the market; it would enable higher-capacity, cheaper devices with less on-disk, or rather on-device...
There's one other question in the pad: is there integration with the NVMe-over-Fabrics gateway? No.
A
Do
you
mean
for
crimson
in
general
or
for
c
store
to
both
questions
at
the
moment?
The
answer
is
no
in
the
future,
possibly
depending
on.
[...] deals with the Seastar reactor and threading model correctly, and that is likely to be of use to the NVMe-over-Fabrics gateway as well. And there may be future work, or future things, where NVMe-over-Fabrics may be useful in Crimson, so there may be some overlap there.
Yeah, the goals would be a lot more stability, a lot more of the tests supported. I think that's the main progress marker for SeaStore; it'll look like OSDs don't crash anymore. That'll be good. And starting to evolve real performance information, to get a sense of which things are really important to attack next. But SeaStore actually has work being done on a lot of fronts, so it'll sort of depend on where different people put their focus.
My understanding is that the current crimson-osd is single-core. Yes, that's true. So what's the current thinking there?
I think the current thinking is still that stability is the more important component, but I do expect that to be addressed at least partially in the next year. It's going to be a fair amount of work, though. There will be a fair amount of underlying work to modify the messenger-OSD interface and the OSD backing-store interface, at least with BlueStore.
I think the main thing is that one of the biggest danger points of going to multiple cores is messing up the op ordering and correctness details, hopefully in fewer of the ways than we managed to in classic over the years, and having a decent test suite filled out will really help with that.