From YouTube: Crimson/SeaStore Meeting 2022-07-13
Description
Join us weekly for the Ceph Crimson/SeaStore meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: All right, let's get started, I think everyone's here. Let's see, for me this week it's been some reviews. The second multi-core PR is very nearly ready; I should have it out tomorrow, and then I'll start working on the last one, which is actually the multi-core one. How's it going?
B: Yeah, hi everyone. So last week I have been looking at the review comments. Thanks for the review comments, Leo and Sam. I have addressed most of those comments, and I'll push the changes with that, and on top of that I was able to, you know...
B: Like I mentioned last week, I was able to get the OSD to boot on the ZNS null_blk device, and then I was looking at how to run IOs on top of that. I ran rados bench write workloads on it, and I was able to do the writes on the ZNS drive.
B: I was able to see the writes going to the drive and everything, but I have hit another issue. The device I'm keeping is a memory-backed device of about 16 GB, and after I run the writes for about two minutes on that 16 GB device, I think the GC kicks in. Once the GC starts, I see that the actual IO from rados bench kind of stops, in the sense that rados bench's IOs are not happening, but the GC starts doing all these reads and writes: it starts reading from the filled-up zones and writing to newer zones, and that seems to go on kind of forever; it's not exiting from GC.
B: So a couple of questions. Is it expected that when GC starts... because I see in the code that we block on GC. When GC is happening, do we stop the IOs? That's one thing.
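For illustration, a minimal sketch of the kind of gating being asked about, in Python rather than the actual SeaStore cleaner code: client writes wait whenever free zones drop below a reserve while the cleaner reclaims space. The class name, the reserve value, and the zone accounting here are assumptions, not the real implementation.

    import asyncio

    # Sketch only: not SeaStore's cleaner, just the "block IO while GC runs" pattern.
    class CleanerGate:
        def __init__(self, total_zones: int, reserve: int = 8):
            self.free_zones = total_zones
            self.reserve = reserve                  # assumed free-zone reserve
            self._space_freed = asyncio.Event()
            self._space_freed.set()

        async def wait_for_write(self, zones_needed: int = 1) -> None:
            # Client write path: wait here while reclaim is still making space.
            while self.free_zones - zones_needed < self.reserve:
                self._space_freed.clear()
                await self._space_freed.wait()
            self.free_zones -= zones_needed

        def on_zone_reclaimed(self) -> None:
            # Cleaner path: wake any blocked writers each time a zone is freed.
            self.free_zones += 1
            self._space_freed.set()

If the cleaner never actually frees a zone, writers stall indefinitely, which matches the symptom described above.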
A:
B: Yeah, that part I get, but what I'm saying is I have a small device, with each zone 64 MB and 256 zones, and when I start rados bench it starts writing and it goes up until about 190 zones; it completes writes to 190 zones. I can see that with, you know, certain tools that check the zone information.
B: So with those tools I can see that that number of zones are full, and once it reaches about 190 zones, rados bench stops reporting the current MBps; it goes to zero. Until then, I can see that it is writing at a certain speed, and then it goes to zero. I still have around 50 to 60 zones free, and then GC... I can see the debug log messages of GC happening.
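For context, the test device described above (16 GB, memory-backed, 64 MB zones, so 256 zones) can be recreated through the kernel's null_blk configfs interface; the sketch below assumes null_blk is loaded with nr_devices=0, that the size and zone_size attributes are given in MB, and the device name zns0 is arbitrary.

    from pathlib import Path

    # Sketch: a 16 GB, memory-backed, zoned null_blk device with 64 MB zones.
    dev = Path("/sys/kernel/config/nullb/zns0")   # arbitrary name; needs root
    dev.mkdir(exist_ok=True)

    settings = {
        "size": "16384",        # total size in MB (16 GB / 64 MB = 256 zones)
        "memory_backed": "1",   # RAM-backed, as in the experiment above
        "zoned": "1",           # expose it as a zoned (ZNS-style) block device
        "zone_size": "64",      # zone size in MB
        "power": "1",           # written last: instantiates /dev/nullb*
    }
    for attr, value in settings.items():
        (dev / attr).write_text(value)

Zone fill state can then be watched with a zone reporting tool such as blkzone report while rados bench runs.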
A:
B: Yeah, so I saw those parameters; I forgot the exact numbers, but I saw that you can tune the GC parameters. Currently I'm debugging that, so that's where I am; I'll ask if I need more help.
C: Sorry, I think for the ZNS PR there are some configurations that are discovered by the ZNS segment manager, but that information hasn't been propagated to the cleaner, the segment cleaner. Maybe there's a gap there, possibly.
A:
C:
A:
B: Yeah, yeah, so I did try to do the same experiment with a normal block device, I mean the same null_blk device but a regular one instead of a zoned device. I created a normal block device of 16 GB and then I kind of saw the same behavior: after a certain number of writes, the writes stopped, and from the logs I could see that it has written...
B:
A:
C:
B: No, yeah, I kind of expected that. I wanted it to, you know, fail with the out-of-space error, right? I wanted to see that. So that's why I was trying to.
A: It's not enough. An OSD that's out of space doesn't mean the cluster is out of space, right? It's much, much more complicated than that. With normal classic OSDs, they report usage statistics to the monitors, and once those get full enough the monitors will declare the pool full and client writes will start to fail. But some of that machinery, I don't think, is in place in Crimson. You are probably not going to see any no-space error.
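As a rough sketch of the flow just described, purely for illustration and not Ceph code: OSDs report usage to the monitors, the monitors flag the pool once usage crosses a full ratio, and client writes are then rejected. The names and the 0.95 threshold are assumptions.

    from dataclasses import dataclass

    FULL_RATIO = 0.95  # assumed full threshold, for illustration only

    @dataclass
    class OsdUsageReport:
        osd_id: int
        bytes_used: int
        bytes_total: int

    class MonitorModel:
        """Toy model of the monitor-side bookkeeping described above."""
        def __init__(self) -> None:
            self.pool_full = False

        def handle_usage(self, report: OsdUsageReport) -> None:
            # One sufficiently full OSD is enough to flag the pool in this toy model.
            if report.bytes_used / report.bytes_total >= FULL_RATIO:
                self.pool_full = True

        def admit_client_write(self) -> bool:
            # Once the pool is flagged full, new client writes are refused (ENOSPC).
            return not self.pool_full

The point above is that Crimson does not yet wire this up, so the test stalls rather than failing with a no-space error.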
A:
B: Right, yeah, right. I'm looking into the GC code now, trying to debug that. Yeah, that's all I had, thanks.
A: All right, agent, how's it going?
C: The first is the split of object data blocks, and that is merged, and there are follow-up works to do. I will try to further improve the conflict detection during reclaim. I also simplified the random block manager circular journal and reviewed the async cleaner trimming, and I think we probably can split the cleaner implementation for the random block manager and the segment manager. I will do a second review this week, and before HDD I think it might be good to reach consensus on the overall architecture.
A: Okay, cool. Should we...
C: I'm working on the list-objects backend work, still modifying the PR according to the comments, and I still need to debug the enumerate-objects part; there is still a bug there and I need to get back to it. So that's all.
C: Okay, Israel.
D: Oh, last week I ran into two parts of a dereference bug when I was developing the physical btree optimization. I've separated out a PR for the first part, and the second part is there, and the code to fix it is... I haven't submitted the PR for the second part, but I think it's getting closer. I'll also put up the PR for that within the next one day or two. After that, I will be back to implementing the optimization for the physical btree.
A: All right. I thought we could discuss the hard disk stuff a little bit, but does anyone have anything else they want to talk about other than that?
A: So, Shayhan, I thought I'd ask you: what is your goal for using the random block manager for HDDs? I had assumed that the whole point was to do writeback of dirty extents.
D: Actually, I think, or we think, that if the NVMe devices are large enough to hold all the data, then we do not have to write all the data back to the HDD; the data may still be accessed, even though not frequently. I think if there is available space on the NVMe devices, then why don't we put the cold data there as well?
D: We think that it's only when there is not enough space on the NVMe devices to hold all the data that we have to put the cold data onto HDDs. I don't know if this is reasonable.
A: Well, that certainly makes sense. I have two questions, though. One: I really want to support clusters that only have hard disks, so whatever design choices we make here have to make sense in that context.
D: Actually, what we were thinking is that for data overwrites that are 4K-aligned, we do not use mutation; we just use this extent-specific function. We treat the overwritten data as new extents, and when writing those new extents back to HDDs...
D: Oh, we have to put them together with the old, larger one, so the old one is not split forever on the hard disk; it's just split temporarily while there is overwritten data on the NVMe devices.
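A rough sketch of that scheme, illustrative only and not the SeaStore implementation: a 4K-aligned overwrite of an HDD-resident extent is kept as a new extent on the fast tier, and writeback merges the pieces back into the old extent so the split is only temporary. All names here are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Extent:
        obj_off: int      # logical offset within the object
        length: int
        tier: str         # "hdd" or "nvme"
        data: bytes

    def overwrite_4k_aligned(old: Extent, off: int, data: bytes) -> list[Extent]:
        """Overwrite part of an HDD extent: keep the old extent, add a hot piece on NVMe."""
        assert off % 4096 == 0 and len(data) % 4096 == 0
        hot = Extent(obj_off=off, length=len(data), tier="nvme", data=data)
        return [old, hot]                      # the old extent is only logically shadowed

    def write_back(old: Extent, hot_pieces: list[Extent]) -> Extent:
        """On writeback, merge the hot pieces into the old extent so it is whole again on the HDD."""
        buf = bytearray(old.data)
        for piece in hot_pieces:
            start = piece.obj_off - old.obj_off
            buf[start:start + piece.length] = piece.data
        return Extent(obj_off=old.obj_off, length=old.length, tier="hdd", data=bytes(buf))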
A: So, okay, there are two pieces here. The first is, even if you are performing... so I think part of what you were just saying is that when we get a mutation to an extent located on a hard disk, we split the extent and we write the newly split extent to the faster tier, since it's hot. That's likely a good use of time, right? Am I understanding that correctly? So I agree with that; there will be scenarios where that's the right heuristic. When we perform writeback, though... sorry.
A: I have a couple of reasons for that supposition. If you don't do it that way, then you're going to end up with very, very sparse random free space on the hard disk. This is a worst-case scenario: not only will you have a hard time finding free space to do large, contiguous writes, you will also be unable to do large contiguous reads, because your data will be heavily fragmented.
A: The advantage of something like a garbage collection system, or any log-structured file system in this context, is that the garbage collector will allow you to do large, sequential writes, which lets you get full bandwidth out of the hard disk, which you wouldn't be able to get otherwise, and during garbage collection, if we add the right heuristics, we can defragment the relevant extents.
A:
D:
A:
D:
A: And my instinct on all of this is to do as little HDD-specific code as possible, just because we don't want any device-specific code that we can avoid. One advantage: hard disks share essentially the same access mechanisms as any normal SATA SSD, so in that sense they're not special. The only area I think we need special heuristics for is allocation.
A: So if we choose the strategy I outlined above in this thread, we end up with an RBM implementation that can tolerate hard disks, and we can implement better allocation strategies for hard disks as we measure, while also retaining the ability to use a block segment manager for them. So we can test out both strategies and evolve from there, and I think that's the direction we should go.
C:
A: The counterargument is that for hard disks, doing an out-of-line write, or doing a non-sequential write, cuts the device throughput by a huge factor, like a hundred thousand; it's a ton, right? So you're actually willing to pay a lot of write amplification on a hard disk to avoid random writes.
C:
A: Well, except that if the NVMe, if the faster tier fills up, then you become bound by the write speed of the slow tier, right?
C: Yeah, but usually the writing is very sequential. It collects a lot of data and writes it together as a transaction to the cold tier, so it will exploit most of the bandwidth of the hard disk.
A:
D: Yeah, but I think that will mean that when we're writing data back to HDDs and each data piece is relatively small, we might end up making logically continuous data scattered on the HDD devices, and when there are sequential reads for logically continuous data, that kind of sequential read will be turned into random reads to the HDD devices, and I think that will be a problem.
D: If we want to support, say, databases using LSM trees or other big-data applications, do you think that's reasonable?
A: That's true, but the RBM device will make that worse, not better. The best solution to that problem is to modify the garbage collector to perform large, contiguous, logical writes. So, concretely, that means when we're garbage collecting an extent that is part of an object, we notice that it's an extent that's part of an object, and instead of performing a 4K write back to the cool tier, we check to see whether it makes sense to also write the adjacent four megabytes back, wherever they happen to be located.
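A small sketch of that heuristic, again illustrative rather than actual cleaner code: when relocating an extent that belongs to an object, gather the logically adjacent extents of the same object, up to an assumed 4 MB window, and write them back to the cold tier as one contiguous write. The helper name and the window size are assumptions.

    # Illustrative heuristic only (hypothetical names): coalesce neighbouring extents
    # of the same object when the cleaner relocates one of them to the cold tier.
    COALESCE_WINDOW = 4 * 1024 * 1024   # assumed 4 MB defrag window

    def plan_writeback(victim, object_extents):
        """Return the extents to rewrite contiguously instead of just the 4K victim.

        victim: (obj_off, length) of the extent being garbage collected
        object_extents: sorted list of (obj_off, length) extents of the same object
        """
        lo = victim[0] - COALESCE_WINDOW // 2
        hi = victim[0] + victim[1] + COALESCE_WINDOW // 2
        batch = [e for e in object_extents if e[0] >= lo and e[0] + e[1] <= hi]
        # Write the whole batch as one sequential write on the cold tier, so the
        # object's logically adjacent data stays physically adjacent as well.
        return sorted(set(batch + [victim]))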
A: I think for different use cases it may actually make sense to allow overwrites on a hard disk, so I think we probably want to experiment with both.
B: Oh, one quick question: there are these SMR hard disks, if you guys are aware of them, and some support is already there in BlueStore and the like. So if at all Crimson OSD is going to support HDDs as a slower tier of storage, is it a possibility to add SMR to that as well, because it mostly works like a ZNS drive?
A: Yeah, I would say we would just create an SMR implementation of SegmentManager, easy, right? I don't know if it'll work well, but it should be straightforward, in that, yeah, you're right, they behave very much like ZNS devices. Concretely, though, I haven't seen very much from the SMR hard disk concept in the last couple of years; it seems like it didn't really catch on. Am I wrong about that?