From YouTube: Ceph Developer Monthly 2022-11-02
Description
Join us the first Wednesday of every month for the Ceph Developer Monthly meeting
https://tracker.ceph.com/projects/ceph/wiki/Planning
A: I added some user-facing guard rails to Crimson, generally for two reasons. One, we want to make it hard to accidentally start Crimson OSDs, since that would really complicate a production cluster. Two, Crimson OSDs don't support everything regular OSDs do, so to the extent that we can, it's probably a good idea to guard off those features so that people don't accidentally use them.
A: So to that end there are two things I did. The first is adding a Crimson experimental feature flag and an allow-crimson osdmap flag. I didn't do anything especially complicated here: you have to have the experimental feature enabled to set the allow-crimson flag, and you can't unset it. Crimson OSDs won't boot at all if they don't see this flag in the osdmap. So that's how that part works.
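A minimal sketch of the boot-time guard described above, assuming hypothetical names (FLAG_ALLOW_CRIMSON, crimson_osd_may_boot) rather than the actual Crimson code:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical stand-ins for the real OSDMap types; names are illustrative only.
constexpr uint64_t FLAG_ALLOW_CRIMSON = 1ULL << 0;

struct OSDMap {
  uint64_t flags = 0;
  bool test_flag(uint64_t f) const { return flags & f; }
};

// Sketch of the guard: a Crimson OSD refuses to boot unless the cluster-wide
// allow-crimson flag (settable only with the experimental feature enabled,
// and not unsettable afterwards) is present in the osdmap it sees.
bool crimson_osd_may_boot(const OSDMap& map) {
  if (!map.test_flag(FLAG_ALLOW_CRIMSON)) {
    std::cerr << "osdmap does not have the allow-crimson flag set; "
                 "refusing to boot a Crimson OSD\n";
    return false;
  }
  return true;
}

int main() {
  OSDMap map;                          // flag not set: boot is refused
  std::cout << crimson_osd_may_boot(map) << "\n";
  map.flags |= FLAG_ALLOW_CRIMSON;     // flag set: boot proceeds
  std::cout << crimson_osd_may_boot(map) << "\n";
}
```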
A: The second is the pool flag. You can either set the osd_pool_default_crimson parameter, which will cause all pools to be created as Crimson pools by default, or you can give it on a per-pool basis with a crimson flag. It disallows a couple of features that Crimson can't deal with, like changing the number of PGs or using tiers, and the net upshot is that Crimson OSDs won't locally create PGs from pools that don't support them. That's pretty much it.
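A rough sketch of the pool-level guard being described; the Pool type and validate_pool_change are illustrative stand-ins, not the real monitor code:

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <cstdint>

// Illustrative pool model; field and function names are made up for the sketch.
struct Pool {
  bool crimson = false;        // set via osd_pool_default_crimson or a per-pool flag
  uint32_t pg_num = 32;
  std::optional<int> tier_of;  // cache tiering relationship, if any
};

// Reject changes that Crimson pools can't deal with yet: changing pg_num
// (no PG splitting/merging) and attaching tiers.
std::string validate_pool_change(const Pool& p, uint32_t new_pg_num,
                                 std::optional<int> new_tier) {
  if (!p.crimson) {
    return "ok";
  }
  if (new_pg_num != p.pg_num) {
    return "EINVAL: cannot change pg_num on a crimson pool";
  }
  if (new_tier.has_value()) {
    return "EINVAL: cannot add a tier to a crimson pool";
  }
  return "ok";
}

int main() {
  Pool p;
  p.crimson = true;
  std::cout << validate_pool_change(p, 64, std::nullopt) << "\n";  // rejected
  std::cout << validate_pool_change(p, 32, std::nullopt) << "\n";  // ok
}
```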
A: If that makes sense. So in that sense we actually are protocol compatible with classic, after a fashion, but, for instance, we don't support tiering at all. So when we do support tiering, we'll support it the same way classic does, up to the same feature flags, etc. So I'm hoping it won't be a problem, but this is likely something that's going to have to evolve with time.
B: Do we have these features, which are supported or unsupported, documented anywhere? I'm just looking at the upstream documentation.
E: The other thing I noticed, Sam, was that there was a config set global that you were doing. So it's not an OSD kind of thing; it's a global setting.
A: Oh well, the lab will have to work again before we can actually merge this, because I don't want to break the teuthology jobs. But the short answer is, when you're creating the cluster, you set the experimental unrecoverable-data-corrupting-features flag and the set-allow-crimson flag, and the monitor config option that causes pools to be created by default as Crimson pools.
A: It also does one other thing, the crimson pool flag: it disables the autoscaler.
A: That should work, yeah. Crimson can deal with peering just fine; it just doesn't have PG splitting yet.
B: The intent of talking about conditional or intelligent logging is that, I mean, we do have a lot of debug levels in there, but we have to change them manually when we are troubleshooting a situation. With conditional debugging or intelligent logging, I think developer-level changes would need to be done so that whenever there's a situation or a trigger, the debug logs can be enabled automatically for that condition, and then they could be toggled off again.
B: If a mechanism like that could be devised, it could be really useful in troubleshooting and root-causing the problem whenever it happens. In some situations we know that we require the debug logs from the time when the problem happened.
B: Many times we don't have, or can't find, those debug logs because they are not enabled by default. So, I mean, does this sound like a feasible idea?
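A small sketch of that idea, assuming a hypothetical set_debug_level() hook rather than Ceph's real config machinery: raise the subsystem debug level while a trigger condition holds, then drop it back once the condition clears.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-in for a per-subsystem debug level setter.
static int g_debug_osd = 1;
void set_debug_level(const std::string& subsys, int level) {
  g_debug_osd = level;
  std::cout << "debug_" << subsys << " set to " << level << "\n";
}

// Sketch of "intelligent logging": escalate logging while a condition holds,
// then automatically restore the default once it clears.
class ConditionalDebug {
  bool escalated_ = false;
public:
  void observe(bool trigger_condition) {
    if (trigger_condition && !escalated_) {
      set_debug_level("osd", 20);   // aggressive logging while the condition holds
      escalated_ = true;
    } else if (!trigger_condition && escalated_) {
      set_debug_level("osd", 1);    // toggle back off
      escalated_ = false;
    }
  }
};

int main() {
  ConditionalDebug cd;
  cd.observe(false);  // nothing happens
  cd.observe(true);   // e.g. a slow request detected: debug raised
  cd.observe(false);  // condition cleared: debug restored
}
```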
E: So this is definitely an interesting idea, and it resonates with something that Radek brought up in the context of an upgrade. I think, Brad, you are familiar with this one: the bug that we were seeing in our Gibba cluster in the LRC. Every time we were doing an upgrade, randomly one OSD would hit a crash and then just come back up fine, and we were never able to reproduce that kind of issue in teuthology testing or anywhere else.
D: It revolved around the inability to...
E: No, I just wanted to say that the idea was, you know, we knew that restarts triggered the issue, so something like cephadm, which was doing the restarts for the upgrades, could potentially have some kind of hooks to enable and disable logging. But then again, that would be blanket, right? You can imagine, in a thousand-OSD cluster, trying to enable logs for all the OSDs one by one; it was not logistically feasible. Anyway, Sam, go ahead.
A: Yeah, so the challenge is predicting ahead of time where you'd want to put a trace point like that. An obvious example is: a thing happens somewhere in an IO, and you wish you had logging from the beginning of that IO, right? It's easy to create a branch that does whatever custom logic you need to detect the condition you're interested in.
A: It's not clear to me how to allow you to create a trace point like that without going as far as the, I'm forgetting the name of it, the Linux kernel injectable trace point concept, which would be an interesting idea if someone wants to do a prototype with that, but...
D: We have an enhanced log that we store in memory, and if we get a core, if the daemon crashes, that gets dumped, and I believe there's a command that allows you to dump that. That's...
A: No, it's more like: let's say you want to dump some specific debug line, but only for very specific IOs, contingent on some parameter in the IO itself. DTrace would actually allow you to create a trace point with conditional logic that's basically arbitrary with respect to the parameters being passed in; it's shockingly powerful. To do something equivalent to that in Ceph, you'd have to create a branch like...
D: That would have to be an advanced user that would be able to do that, but I suppose we could come up with some SystemTap-type scripts, but...
B: I remember this in RGW as well, with multi-site, in one of the earlier customer situations, when we saw that some objects were not getting replicated from one site to the other, and we thought that maybe we would introduce conditional debugging.
B: This was discussed with, I remember, Matt Benjamin before as well; he's not here, but yeah, I remember thinking about this from the RGW multi-site point of view as well, where we could have conditional debugging, where we could...
E: And this could be thought of case by case. Sometimes there are invariants that are getting broken, and some things can be thought of as unacceptable values. Take the example of the dups issue that we saw: we had a bug where a data structure called the dups in the PG log was just growing infinitely.
E: So you can imagine that if we had some sort of threshold, which could have been a very conservative threshold, we would start to log things: okay, if you go above this threshold, we are going to aggressively start logging things, right? But then it has to be on a very case-by-case basis. And similarly, for things that end up in crashes, we could end up logging things at whatever the in-memory default is.
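A sketch of the conservative-threshold idea using the dups example; the threshold value and the log_aggressively hook are made up for illustration:

```cpp
#include <cstddef>
#include <iostream>

// Hypothetical escalation hook; in practice this might raise a debug level
// or emit a cluster log warning.
void log_aggressively(const char* what, std::size_t size) {
  std::cerr << "WARNING: " << what << " has " << size
            << " entries, above the configured threshold; "
               "enabling verbose logging for this PG\n";
}

// Conservative threshold: well above anything expected in normal operation,
// so the extra logging only kicks in when the invariant is already off.
constexpr std::size_t kDupsWarnThreshold = 100000;  // illustrative value

void check_pg_log_dups(std::size_t dups_size) {
  if (dups_size > kDupsWarnThreshold) {
    log_aggressively("pg_log dups", dups_size);
  }
}

int main() {
  check_pg_log_dups(3000);     // normal: silent
  check_pg_log_dups(2500000);  // runaway growth: escalate
}
```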
E: Right, so sometimes even that is not enough, right? We still are asking for a core dump, or we are looking for further logs and stuff.
B: Yes, I remember a lot of times, when troubleshooting a lot of problems, it was discussed that we need debug logs from the time the problem actually happened. But in many situations, since in a lot of production environments the debug logs were not enabled, we could not find the root cause easily, and then you need to do a lot of further troubleshooting. So that's how the idea came about.
D: This is because we're caught between two worlds. We're caught between a world that says we don't want any debug logging, because we don't want the overhead involved and we don't want to store huge logs; we don't want huge in-memory logs, because it's too much of an overhead; we don't want, you know, the impact on speed from having to log all this data.
A: I would say it's the other way around. We actually have tons of those: it's every assert in the code, right? Every assert says, okay, we claim that if this condition is violated, a very bad thing has happened, Greg. What you're observing is that there are cases where a bad thing happened, and then a long time later a crash or something happens, and we don't know what happened back at the original point. So for a system to notice when it happened and go...
D: ...grab a piece of information that's possibly gatherable, and that's possible to do, but you could probably do that now: just turn on all logs to their highest level. But...
C: If this is something where you have specific things that have occurred that were problematic, we have done things like, say: okay, we've detected a condition like a stuck operation; maybe we should do something like dump out all the operations that exist, or dump out the operations that are stuck and what state they reached.
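A sketch of that kind of mechanism, in the spirit of an op-tracker dump; the TrackedOp type and the age threshold are illustrative only:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Illustrative in-flight operation record.
struct TrackedOp {
  std::string description;
  std::string state;           // last state the op reached
  Clock::time_point started;
};

// When a stuck op is detected, dump every op that has been in flight longer
// than the threshold, along with the state it reached, instead of waiting
// for someone to raise debug levels after the fact.
void dump_stuck_ops(const std::vector<TrackedOp>& ops,
                    std::chrono::seconds threshold) {
  const auto now = Clock::now();
  for (const auto& op : ops) {
    if (now - op.started > threshold) {
      std::cerr << "stuck op: " << op.description
                << " state=" << op.state << "\n";
    }
  }
}

int main() {
  std::vector<TrackedOp> ops = {
    {"osd_op client.4123 write", "waiting for sub ops",
     Clock::now() - std::chrono::seconds(120)},
    {"osd_op client.4124 read", "started", Clock::now()},
  };
  dump_stuck_ops(ops, std::chrono::seconds(30));  // dumps only the first op
}
```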
A: It might be worth keeping notes when you hit conditions like this, and then you could present them as, I don't know, these are general categories of things that are happening, and then we can brainstorm good ways to attack them as a group. Like maybe a bunch of them are all about RGW multi-site; it might suggest a really specific kind of instrumentation that would be helpful in RGW.
B: I think there's a point to consider about feasibility as well: where to do it. A lot of times, I mean, we might have to identify a lot of conditions, or maybe one simple condition, that could help us understand when things went wrong, or maybe, just like you mentioned, Sam, identifying an assert where things went bad.
D: There's also the question of what level you go to. I mean, users aren't going to like it if we fill up their partition automatically under certain circumstances, you know. So it's always a delicate balance between not enough logging and too much logging.
D: You can go from a situation where, you know, a gig for your logs is plenty, to a situation where a gig is not going to last more than half an hour or ten minutes. And so, yeah.
A: Well, so hypothetically we could do more, but that's not really what I'm even getting at. The point is that all of those things are specific things that were created one at a time by a developer who identified an invariant; we didn't write a system that generates asserts, we wrote them individually. So I think Greg's right, most likely.
A: The problem is that whatever system you've been hitting these issues in, you're just hitting an issue with fairly poor visibility in the first place, and the answer is to think carefully about the nature of the problems you're hitting and why they're hard to debug, and then think of an actual subsystem you can add to that component that will make those problems in general easier to debug, things like the heartbeat timers in the OSD or the op tracker.
D: Yeah, there's also the sophistication of the user that's trying to debug a problem. Some issues are pretty transparent to a developer or somebody who has a high level of debugging experience, and for someone who has very little debugging experience they're all pretty opaque.
D: No, we could go to the trouble of developing an automatic framework that, you know, dynamically adjusts log levels and does X and Y, and then find out that 90% of the user base just turns it off and disables it.
A: I don't think 90% of the user base turns off the op tracker. So the fact that people can turn...
A: I will say that there is one possible, sort of triggered, concept. Right now we assert on pretty much every invariant, but some are more fatal than others, and it might make sense to add a utility in the code that allows us to dump the in-memory log lines to the log instead of killing the process. That might be an interesting middle ground.
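A sketch of that proposed middle ground: a non-fatal assert that flushes a ring of recent in-memory log lines to the log and keeps running instead of killing the process. The MemoryLog class and SOFT_ASSERT macro are illustrative, not Ceph's actual log code.

```cpp
#include <deque>
#include <iostream>
#include <string>

// Illustrative ring of recent high-verbosity log lines kept only in memory.
class MemoryLog {
  std::deque<std::string> recent_;
  std::size_t capacity_ = 1000;
public:
  void add(std::string line) {
    recent_.push_back(std::move(line));
    if (recent_.size() > capacity_) recent_.pop_front();
  }
  void dump_to_log() const {
    for (const auto& line : recent_) std::cerr << "[memlog] " << line << "\n";
  }
};

MemoryLog g_memlog;

// Non-fatal assert: on a violated (but survivable) invariant, dump the
// in-memory log lines to the on-disk log and keep running.
#define SOFT_ASSERT(cond)                                              \
  do {                                                                 \
    if (!(cond)) {                                                     \
      std::cerr << "soft assert failed: " #cond " at " << __FILE__     \
                << ":" << __LINE__ << "; dumping in-memory log\n";     \
      g_memlog.dump_to_log();                                          \
    }                                                                  \
  } while (0)

int main() {
  g_memlog.add("debug 20: submitted op 1");
  g_memlog.add("debug 20: submitted op 2");
  SOFT_ASSERT(1 + 1 == 3);  // violated: dumps the recent lines, does not abort
  std::cout << "still running\n";
}
```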
D: The DB was getting corrupted by something. It was never positively identified what the actual issue was; it was somehow corrected and just went away, but we were almost certain that the database was being corrupted underneath us, so the container system or the storage system, something to do with containers or the underlying storage, was corrupting the RocksDB database. We detected the corruption, but by the time we detect the corruption, all we can do is crash; there's nothing we can do about why it got corrupted.
A: So to expand on my point from before: the intervention I would suggest in a scenario like that would be exploring RocksDB's own APIs for dumping checksums associated with its files and things of that nature. Doing that on mount or unmount might give you a way to detect corruption between monitor starts, and that's kind of what I'm getting at. Even in this scenario, where it really wasn't a Ceph thing, there are sometimes visibility things you can add that would give more information.
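A sketch of what that might look like with RocksDB's public API; DB::VerifyChecksum() is a real RocksDB call, but wiring it into a mount-time check like this is only an illustration:

```cpp
#include <iostream>
#include <string>
#include <rocksdb/db.h>

// Open the store read-only and verify checksums of all data, e.g. at mount
// time, so corruption introduced between daemon restarts is caught early
// rather than at some later random read.
bool verify_store(const std::string& path) {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  rocksdb::Status s = rocksdb::DB::OpenForReadOnly(options, path, &db);
  if (!s.ok()) {
    std::cerr << "open failed: " << s.ToString() << "\n";
    return false;
  }
  s = db->VerifyChecksum();
  if (!s.ok()) {
    std::cerr << "checksum verification failed: " << s.ToString() << "\n";
  }
  delete db;
  return s.ok();
}

int main(int argc, char** argv) {
  if (argc > 1) {
    return verify_store(argv[1]) ? 0 : 1;
  }
  return 0;
}
```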
D: Yeah, I posted a couple of links in the chat, Sam, by the way, about SystemTap and userspace applications, just in case you're interested. They're not terribly comprehensive, but...
D: And also, potentially, you can set variables in functions and that sort of thing as well, you know, make an if statement behave differently, that sort of thing. So, you know, it's fascinating: you could ship a hotfix that doesn't even require the daemon to be restarted.
E: That reminds me: something like Jaeger tracing, which I was working on, would also allow you to do some of what you were describing in the example you gave about something getting stuck or slow in RGW multi-site. If we had the trace points in the first place, you could grab traces and go back, exactly.
A: Yeah, that would be kind of like the op tracker approach, except huge and across the whole cluster, and it has pretty much the same advantages. So that would be a great example of a really comprehensive visibility system that would allow you to gain information on a wide class of problems.