From YouTube: What's new with Crimson and Seastore?
Description
Presented by: Samuel Just
What's new with Crimson and Seastore?
Next generation storage devices require a change in strategy, so the community has been developing Crimson, an eventual replacement for ceph-osd intended to minimize CPU overhead and improve throughput and latency. Seastore is a new backing store for crimson-osd targeted at emerging storage technologies including persistent memory and ZNS devices. This talk will explain recent developments in the Crimson project and Seastore.
Event: https://ceph.io/en/community/events/2022/ceph-virtual/
I'm Sam. I'm going to give an introduction to the Crimson project and then talk about some of the recent development work that we have been doing.
Storage devices keep getting faster, but CPU throughput isn't really improving on the same trajectory. We do see an increase in core count, but not an increase in per-core throughput in quite the same way.
Here's a graphical representation of CPU usage in the classic OSD that I borrowed from Kefu and Radek's presentation at Cephalocon Barcelona, many years ago now. We can see several of the OSD's thread pools represented here: the BlueStore threads, the fn_anonymous finisher thread, the messenger threads, and the OSD op thread pool, tp_osd_tp.
So if we zoom in on the request commit callback: because the callback needs to reach into PG structures, which are also used in the OSD op thread pool, we spend quite a bit of time acquiring locks to guard access to these resources.
So there's quite a bit of CPU overhead inherent in the classic OSD threading architecture alone, even before we get to the implementation of any of the I/Os. The Crimson project is meant to solve this at its core. The project has two major goals: first, minimize CPU overhead per I/O, including cross-core communication, copies, and context switches; and second, exploit emerging storage technologies like fast NVMe. To that end, Crimson is a rewrite of the OSD daemon intended to address these goals.
It's important to note that it's IOPS per core that's the important metric here, rather than raw IOPS; with the classic OSD you can still drive a substantial amount of throughput.
So, broadly speaking, there are three main components of the OSD I/O pipeline: the messenger; the PG state, the blob of in-memory state used for operation ordering, recovery, and so on; and BlueStore, or whatever ObjectStore implementation is responsible for the actual I/O and transaction atomicity. For a small replicated write, the client message is read off the wire by the messenger thread, which places the message in the op queue. The OSD op thread pool then picks up the message and creates a transaction, which is queued for BlueStore.
Once BlueStore commits the transaction, the finisher thread picks up the completion callback and queues the reply for the messenger thread to send. Each of these thread handoffs requires coordination over the contents of a queue, and with the PG state in particular, more than one thread may need to access the PG data, requiring mutexes. By contrast, in Crimson the goal is to localize all of the parts of this pipeline onto each thread.
We're using Seastar, a user-space scheduling framework, to do this. Rather than passing around callbacks, Seastar uses future objects, which allow us to chain asynchronous handlers onto I/O results.
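To make that concrete, here is a minimal, generic Seastar sketch (not Crimson code; fake_io and its return value are invented stand-ins for a real asynchronous operation):

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

// Stand-in for an asynchronous disk or network operation.
seastar::future<int> fake_io() {
    return seastar::sleep(std::chrono::milliseconds(10)).then([] {
        return 42;
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Each .then() continuation is scheduled on the same shard once the
        // previous stage completes; no queues, locks, or thread handoffs.
        return fake_io().then([](int result) {
            std::cout << "io completed: " << result << "\n";
        });
    });
}
```

The reactor interleaves many such chains on a single thread, which is what lets a request's whole pipeline stay on the core that accepted it.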
Seastar handles switching among tasks as their I/Os become available. The other big component of the Crimson project is SeaStore. As I mentioned before, the other goal is that we want to be able to exploit new, emerging storage technologies like NVMe and zoned namespace (ZNS) storage.
How will we do that? SeaStore is a new ObjectStore implementation designed specifically for Crimson's threading and callback model. It avoids CPU-heavy metadata designs like RocksDB, and it's intended to exploit emerging technologies like ZNS and NVMe. So what is ZNS? It's a new NVMe specification intended to address challenges with conventional FTL-based flash designs. Conventional flash devices use garbage collection to empty erase blocks for reuse.
This tends to cause fairly high write amplification, which is to say each write to the device tends to result in multiple writes within the device, and this is bad for QLC flash, because QLC flash has poor write endurance: each cell can only be written a comparatively small number of times. Background garbage collection also tends to impact tail latencies, in that garbage collection within the device can compete with client I/O for device resources, resulting in periodic latency spikes.
ZNS addresses this problem by creating a different interface, where the drive is divided into zones, which can only be opened, written sequentially, closed, and then released. We're interested in addressing these devices both because we think they'll be specifically useful for QLC, but also because this write pattern tends to be good for conventional SSDs as well, since it reduces the amount of traffic the garbage collector needs to handle.
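As a purely illustrative sketch of the zone life cycle described above (this is not the NVMe ZNS command set or SeaStore code, just a toy model of the constraint it imposes):

```cpp
#include <cstdint>
#include <stdexcept>

// Toy model of a ZNS-style zone: writes land only at the write pointer,
// strictly in order, and space is reclaimed by resetting the whole zone.
struct Zone {
    enum class State { Empty, Open, Full } state = State::Empty;
    uint64_t capacity_blocks;
    uint64_t write_pointer = 0;   // next block that may be written

    explicit Zone(uint64_t capacity) : capacity_blocks(capacity) {}

    void append(uint64_t nblocks) {
        if (state == State::Full || write_pointer + nblocks > capacity_blocks)
            throw std::runtime_error("zone full");
        state = State::Open;
        write_pointer += nblocks;              // sequential writes only
        if (write_pointer == capacity_blocks)
            state = State::Full;
    }

    void reset() {                             // release the zone for reuse
        state = State::Empty;
        write_pointer = 0;
    }
};
```

Reusing a zone means copying out whatever live data remains and resetting it as a whole, which is the kind of write pattern a log-structured design like SeaStore naturally produces.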
Each object also contains a key-to-value mapping, which we use for a few things like CephFS directories and RGW bucket indices, as well as a data payload, which you're probably more familiar with as, for instance, the four megabytes of data associated with each RBD block.
That ordering is important for just the basic functionality of bucket indices and directories, and for the object namespace it matters because the OSD does object recovery and backfill in that order, specifically. So SeaStore's logical structure looks something like this. There are two basic sides to the house. On the left side here we have an object-to-onode tree, the onode tree, which points to onodes, and each onode contains metadata pointing at an omap tree and a set of extents, which are contiguous blocks of LBAs.
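Very roughly, the structure just described can be pictured like this (a hypothetical sketch; the type and field names are invented, not SeaStore's actual code):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using laddr_t = uint64_t;        // a logical block address in SeaStore's LBA space

struct Extent {                  // a contiguous run of logical blocks
    uint64_t object_offset;      // offset within the object's data payload
    laddr_t  laddr;              // starting logical address of the extent
    uint32_t length;             // extent length
};

struct Onode {                   // per-object metadata
    laddr_t omap_root;           // root of this object's omap key/value tree
    std::vector<Extent> extents; // the data payload, mapped as LBA extents
};

// The onode tree maps object names to onodes; the omap tree and data
// extents are then reached through each onode.
using OnodeTree = std::map<std::string, Onode>;
```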
So why do we use an LBA indirection? There are a few reasons. One is basically write amplification. Let's say we have references: A referencing B, referencing C, and we need to garbage collect node C, or just move it to a new ZNS zone so that we can reuse the zone.
We would first need to find all incoming references, which could be expensive without a secondary structure, and then we would need to write out a transaction updating and dirtying those references as well as writing out a new block C'. Using direct references means that we would need to maintain some way of finding the parent references; in btrfs, I'm forgetting the name of the structure, but there's a structure for it. And we would pay the cost of updating the relatively low fan-out onode and omap trees, since the keys in those trees are relatively large. By contrast, using an LBA indirection does require extra reads in the lookup path, but they're cacheable, and it potentially decreases read and write amplification during garbage collection, or indeed during tiering operations where we're moving blocks between tiers.
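The payoff of that indirection, sketched hypothetically (again with invented names, not SeaStore's interfaces): when the cleaner relocates a physical block, only the LBA-tree entry changes, and the logical references held in the onode and omap trees stay valid.

```cpp
#include <cstdint>
#include <unordered_map>

using laddr_t = uint64_t;   // logical address, what onode/omap trees reference
using paddr_t = uint64_t;   // physical address (segment or zone plus offset)

// Illustrative LBA tree: logical address -> current physical location.
std::unordered_map<laddr_t, paddr_t> lba_tree;

// Garbage collection or tiering moves the block and updates this single
// mapping; the incoming logical references never need to be rewritten.
void relocate(laddr_t laddr, paddr_t new_paddr) {
    lba_tree[laddr] = new_paddr;
}
```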
So the on-disk layout, when we're writing out journal segments, looks something like this. Each journal transaction contains a header, some number of deltas, represented here by these blue blocks, and then some number of logical and physical blocks, represented here by the magenta color.
So the left side here is the state prior to the transaction, and the right side here is the state after the transaction. D' and E' are the mutated versions of those two blocks, and A and B are new blocks being added to the tree.
If we're using ZNS append, we don't actually know the final location of A and B, so this transaction record format allows the E'-to-A and A-to-B references to be record-relative: they're represented as offsets within the record, so that they'll be correct regardless of where the record ends up. Then during replay, once we have the record's address, we can resolve those relative addresses.
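A minimal sketch of the record-relative idea (illustrative only, not the actual record encoding): a reference to a block written in the same record is stored as an offset into that record, and is turned into an absolute address once the record's final location is known.

```cpp
#include <cstdint>

// Illustrative only: with ZNS append we don't know where the record will
// land, so intra-record references are stored as offsets into the record.
struct record_relative_ref {
    uint64_t offset_in_record;
};

// Once the append completes (or at replay), the record's base address is
// known and relative references can be resolved to absolute addresses.
uint64_t resolve(uint64_t record_base, record_relative_ref ref) {
    return record_base + ref.offset_in_record;
}
```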
So the overall architecture looks something like this. Up at the top there's a SeaStore layer responsible for the ObjectStore interface itself. Below that are the portions, in blue, responsible for operations on logical data: the onode manager is responsible for dealing with the onode tree, the omap manager deals with the omap tree, the object data handler deals with the object contents, and so on. The layer in the middle, the transaction manager, handles all of these logical block mutations uniformly as operations on a logical address space, and the portion below that is responsible for translating those into operations on physical nodes, with the LBA manager being responsible for the LBA tree I mentioned before. The advantage of this architecture is that the things above the transaction manager do not need to be aware of garbage collection or tiering: the transaction manager and the stuff below it can transparently handle moving blocks around, and the stuff above doesn't need to know about it.
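As a hypothetical sketch of that abstraction barrier (an invented interface, not SeaStore's real transaction manager API): everything above it deals only in logical addresses and transactions, while relocation happens underneath.

```cpp
#include <cstdint>
#include <string>

using laddr_t = uint64_t;   // logical address visible to the upper layers
struct Transaction {};      // placeholder; a real transaction carries deltas

// The onode manager, omap manager, and object data handler would program
// against an interface like this; the LBA manager, cleaner, and tiering
// logic below it may move physical extents without callers noticing.
class LogicalStore {
public:
    virtual std::string read(Transaction& t, laddr_t addr, uint32_t len) = 0;
    virtual laddr_t alloc_and_write(Transaction& t, const std::string& data) = 0;
    virtual ~LogicalStore() = default;
};
```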
So the rest of this talk, I think, is to go over some of the recent work that's gone into Crimson and SeaStore. Because Crimson is effectively a rewrite of the OSD, its story as it matures will be a story of adding features that already exist in the classic OSD.
So the main goal for Reef is to have an initial, kind of experimental, release that people can try out for RBD workloads on replicated pools. To that end, a lot of work has been done stabilizing the BlueStore RBD and RADOS tests; there's a teuthology suite that we've now split into crimson-rados and crimson-rados-experimental.
The other top-line sort of thing for Crimson this year has been multi-core. Because our initial goal for Crimson was to improve per-core throughput, we initially focused on a single-core implementation, and we didn't worry about actually partitioning the data structures.
We were pretty satisfied with those results, so this year a lot of work went into splitting the PG state among multiple cores and introducing the internal architecture required to route messages between cores. The next step here will be further improvements to this, and adding multi-core support to the messenger natively.
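As a hedged sketch of what that cross-core routing looks like in Seastar terms (illustrative only, not Crimson's actual code; the placement rule is invented): each PG is owned by one shard, and incoming work is forwarded to that shard instead of being protected by locks.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

// Forward an operation to the shard that owns the PG. submit_to runs the
// lambda on the target reactor, so each PG's state is only ever touched
// by one shard and no mutexes are required.
seastar::future<> handle_op_for_pg(unsigned pg_id) {
    unsigned owner = pg_id % seastar::smp::count;   // hypothetical placement
    return seastar::smp::submit_to(owner, [pg_id] {
        // process the op against PG state local to this shard
        (void)pg_id;
    });
}
```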
Another thing that was important recently: if Crimson is going to start seeing wider testing, we needed to add some user guardrails to prevent people from accidentally creating Crimson OSDs and that kind of thing, so we've added an experimental feature flag. If you're actually trying to deploy it, there will be updates to the documentation that reflect this, but you need to set an experimental feature flag, and then there's also a flag that needs to be set on the whole cluster.
The other piece is that pools will need to be created as Crimson pools, and the monitor has some code to gate features that aren't supported by Crimson yet. For instance, you can't create Crimson erasure-coded pools, and you can't change the number of PGs on Crimson pools or set them up as tiers.
The intention ultimately will be to allow mixed clusters between Crimson and classic, especially during the transitional period when Crimson doesn't support everything, so you might be able to create a Crimson pool specifically for RBD and another pool for an erasure-coded RGW pool, that kind of thing.
So the current status that we're hoping to get ready for Reef is that RBD workloads on replicated pools will work. It'll include BlueStore and SeaStore backends, as well as the CyanStore in-memory backend, but the SeaStore one will be even more experimental than the BlueStore one. For deployment, cephadm should work relatively smoothly, and there's some documentation for that.
And for developers, you can, of course, use vstart with the --crimson flag. There's a stable crimson-rados teuthology suite which, by Reef, will be used to gate new PRs; tests that do not yet reliably pass, including the SeaStore ones, are in a crimson-rados-experimental suite.
So the next steps for Crimson are going to be continuing to expand test coverage; many of the thrashing patterns, for instance, are not yet stable. We want a scrub implementation, because without that we won't have particularly good confidence in correctness. Some work has already gone into refactoring the scrub code for use within Crimson, and that's mostly complete, so the remaining work is basically adapting that work into Crimson itself, much like we did with peering.
The other thing, for multi-core, is that the next step will be supporting multiple reactors, and generally doing performance testing and improvements based on profiling results. The other big theme of the S release will be performance improvements now that multi-core support is in place.
So Yingxin, I think, is going to talk a bit more about SeaStore, so I won't go over this in too much detail, but the major theme for SeaStore will be, or has been, stability. SeaStore is tested in teuthology now, in the crimson-rados-experimental suite, and it can reliably complete the RADOS API unit tests, which do a bunch of work.
A lot of work has gone into SeaStore's tiering implementation. A major goal is to support heterogeneous configurations combining small, expensive, fast NVMe devices with slower, bigger, cheaper QLC devices, so you get sort of the best of both worlds.
To that end, recent work includes a random block manager, which has been introduced to handle fast NVMe devices without needing to rely so much on garbage collection, and tiering support is in progress, with a lot of the architectural groundwork already merged. The S release will see a lot more of that become usable.
The garbage collector has been replaced with a generational implementation that does a better job of grouping extents by age, and there have been a lot of improvements to the way the LBA tree works, to reduce the number of reads we actually need to do and to improve the CPU efficiency of traversing the tree in memory.
We want to add multi-core reactor support to SeaStore; there's already a PR for the initial version of that, but I expect there will be quite a bit more work that needs to be done for the S release.
The next step for tiering will be to get an actually usable initial implementation with testing. SeaStore also does not support snapshots yet, so that's an important piece of outstanding work, and as with Crimson as a whole, an important theme of the next year will be further performance improvements.
Will Crimson support EC in the future? At some point in time, probably. I mean, well, okay, so long term the goal is to flat-out replace ceph-osd, so in that sense, yes. It may not be an immediate priority, for two reasons.
One, erasure-coding workloads aren't usually all that performance sensitive, so it's not as good a way to prove out Crimson. And the implementation will largely benefit from all of the other work being done; for instance, the implementation of scrub will be using abstraction barriers that will naturally support erasure coding when we get around to it.
So yes, it will eventually support it; exactly when will depend on when someone wants to work on it, basically. I'm not willing to commit to when Crimson might be ready for production use, but the intention is that in the S release, very knowledgeable users who have done a thorough job of evaluating it with respect to their use case may be able to use replicated pools with RBD.
All right, well... oh, a question just popped in.