From YouTube: Ceph Crimson/SeaStore 2021-05-19
A: So what do we have on there again now? I would definitely like to look into the cephadm agent a bit more, trying to figure out what we need to do, whether we need to do it at all, what the gains are, and then, if yes, how, and what it means for the architecture.
A: So yeah, the cephadm agent. The problem is that the reconciliation loop that we have in cephadm doesn't scale, right?
A: Anything we are going to do is bound by creating SSH connections, executing stuff on the host, and then going to the next host. I mean, we are doing it in parallel with ten hosts at a time, but it's still going to be slow, especially for the dashboard, for users that want to have up-to-date information. And if we want to have fast failover within a few seconds, then we are also going to need to improve here.
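A minimal sketch of the loop shape being described, assuming a caller-supplied `refresh_host` coroutine standing in for the real SSH work (the function name and batch size are illustrative, not cephadm's actual code):

```python
import asyncio
from typing import Awaitable, Callable

PARALLELISM = 10  # refresh a fixed batch of hosts at a time

async def reconcile(hosts: list[str],
                    refresh_host: Callable[[str], Awaitable[dict]]) -> dict:
    """One refresh pass over every host, at most PARALLELISM at a time.

    `refresh_host` stands in for the expensive part: open an SSH
    connection, execute commands on the host, collect the state.
    Even with a batch of 10, total time grows linearly with the host
    count: 1000 hosts / 10 at a time = 100 sequential rounds of SSH setup.
    """
    sem = asyncio.Semaphore(PARALLELISM)

    async def bounded(host: str) -> tuple[str, dict]:
        async with sem:
            return host, await refresh_host(host)

    return dict(await asyncio.gather(*(bounded(h) for h in hosts)))
```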
A: Yes, so where are the biggest gains that we can get, and what should the architecture look like? I know that the current architecture of this cephadm agent, or daemon, is to provide a simple API endpoint that is scraped by the cephadm manager module. But I do not think this is going to be the best way to leverage this cephadm agent, mainly because we are still going to have a reconciliation loop that goes to every host, and instead of creating an SSH connection to every host we are creating an HTTP or HTTPS connection to every host. So we are gaining a lot less, at least from an algorithmic-complexity perspective: it's still O(n).
A: …be much more performant, and then we can actually decide what we want to do with the cephadm agent or daemon.
B: I was just going to say, it'll be a little bit weird: it'll be unclear what the manager should do if it hasn't heard from the agent. Should it go and poke at it, to make sure it has an up-to-date manager address? Or maybe both push and pull work, so it can pull if it needs to, if it hasn't heard from it recently.
A: Yeah, in general the loop that we have right now is just that simple. In any case, we are gaining a lot of new failure modes. Right now in cephadm, the reconciliation is really dependent on the order of things, right? I guess.
B: My suggestion would be that we should be thinking ahead, so that we're not preventing ourselves from doing something later, but we should initially focus on making the reconciliation loop, just all the refresh parts.
B: First have the agent maintain all the node state (the current containers, the devices, and the facts) and just have that be the thing that's being pushed to the manager in a scalable way. Then the manager can still actually take action using the existing loop, which I think is okay: if you're deploying a thousand OSDs it'll be slow, but that's less of an issue, I guess. Yes, yeah.
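A rough sketch of the kind of per-host state the agent would aggregate and push; the field names and helper arguments are illustrative, not cephadm's actual schema:

```python
import json
import time
import urllib.request

def build_payload(hostname: str, daemons: list, devices: list, facts: dict) -> dict:
    # Illustrative shape only: the agent aggregates what the manager's
    # refresh loop currently gathers over SSH.
    return {
        "host": hostname,
        "timestamp": time.time(),  # lets the manager spot stale reports
        "daemons": daemons,        # e.g. the output of `cephadm ls`
        "devices": devices,        # e.g. a ceph-volume style inventory
        "facts": facts,            # e.g. `cephadm gather-facts` output
    }

def push(manager_url: str, payload: dict) -> None:
    # In practice this would be HTTPS with proper client authentication,
    # as discussed below; plain urllib keeps the sketch self-contained.
    req = urllib.request.Request(
        manager_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```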
A: First, let's imagine we had that race condition already. Imagine you did the reconciliation loop and you want to deploy a new manager on host one, and then we have a race between the agent and the cephadm manager module.
A: The next thing that happens is the agent goes over all the daemons and aggregates the information about which daemons are running on one specific host.
A: The next thing that happens is that the manager daemon deploys a new manager on host one and, finally, the agent pushes the information about its aggregated state into the manager module, and suddenly the manager module kind of forgets that it ever deployed a new manager on host one.
A: It doesn't help if the agent caches the information: then we have a race condition between the cache of the agent, then the manager deploys a new daemon, and then the agent…
C: But I think that probably this is just an implementation detail. Okay, it's something that is going to happen, and probably what we need to do is to avoid any kind of change in the manager when you have pending operations. Or you put in some way to signal that you are doing an operation and it's not possible to change the manager until this operation has finished. Or, for example, block the manager from taking new operations until this condition has finished. So I think that we have several ways to do that. I think that what is most important at this moment is to have a high-level view of what we need, what the responsibilities of the agent are, and what model is going to be used.
C: I think my vision is that having the cephadm daemon running on each host and communicating the information to the manager, this push model, not the pull model, is the right way to do that in order to reach scalability. And we are going to move this part of the complexity that we have now in the orchestrator into the daemon model, the cephadm daemon, because the cephadm daemon has, for example, the running daemons…
A: Yeah, but does it really simplify the orchestrator? I don't think so. It's going to be more complicated. I guess it's worth it, but we have to be super careful.
A: And that's why I really do not want to enable the current daemon implementation, because it's prone to that race, and I know we had that race already between…
B: Yeah, yeah. I mean, I think that particular race we can resolve with some variation of a Lamport clock, so that we know if the information that's being recorded is older than whatever. Yes, so I think we can set that aside and we'll get to it. When the agent does a push, is it just going to use the manager CLI, basically just issue a CLI command?
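A minimal sketch of the Lamport-clock idea, assuming a hypothetical per-host counter that the manager bumps on every change it makes to a host, so that agent reports carrying an older value can be discarded as stale:

```python
from dataclasses import dataclass, field

@dataclass
class HostClocks:
    """Hypothetical per-host logical clocks, a variation on a Lamport clock."""
    clocks: dict[str, int] = field(default_factory=dict)

    def on_manager_action(self, host: str) -> int:
        # The manager deployed/removed a daemon on `host`: advance the clock
        # and pass the new value along with the remote action.
        self.clocks[host] = self.clocks.get(host, 0) + 1
        return self.clocks[host]

    def accept_report(self, host: str, report_clock: int) -> bool:
        # Only accept an agent report that has observed the latest change;
        # anything older is stale and would "forget" a fresh deployment.
        return report_clock >= self.clocks.get(host, 0)
```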
A: We can make it an API endpoint, right, with proper client authentication; that would work.
B: Yeah, the other thing with that is that then it needs to run inside the container. I guess this is a dumb question, but this should run outside the container, right? Because… yes, stuff on the host, right. Yes, yes, yes. So it's probably going to be basically the cephadm binary; it'll be cephadm agent.
A: If you have all hosts simultaneously trying to push information to the manager, the manager is getting overloaded.
A: I think it's kind of easy to cope with that, right? If it turns out that the connection times out, then we have to back off for a random number of seconds or minutes and then…
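A minimal sketch of that randomized backoff (the parameters and function name are illustrative):

```python
import random
import time

def push_with_backoff(push_fn, max_attempts: int = 5,
                      base: float = 1.0, cap: float = 300.0) -> bool:
    """Retry `push_fn` with jittered exponential backoff, so that all the
    agents do not hammer the manager again at the same moment."""
    for attempt in range(max_attempts):
        try:
            push_fn()
            return True
        except (ConnectionError, TimeoutError):
            # Sleep a random amount up to the current ceiling ("full jitter").
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return False
```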
C: Okay, so I think that something like the hardware is basically static. So if we just have the timestamp, in order to see whether we have very old information or not, we can deal with that. Maybe in the case of the daemons we need to think a little bit more, but basically it is: there is the list of daemons that must be running on this host, and these are the daemons that the host has communicated are running.
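A sketch of that desired-versus-reported comparison with a timestamp freshness gate; the report shape follows the illustrative payload above, and the threshold is arbitrary:

```python
import time

STALE_AFTER = 300.0  # seconds; an arbitrary freshness threshold

def diff_host(desired: set[str], report: dict) -> tuple[set[str], set[str]]:
    """Compare the daemons that should run on a host against what the
    agent last reported, trusting only sufficiently fresh reports."""
    if time.time() - report["timestamp"] > STALE_AFTER:
        raise RuntimeError("agent report is stale; fall back to a pull")
    running = set(report["daemons"])
    to_start = desired - running  # scheduled but not reported as running
    to_stop = running - desired   # reported as running but no longer wanted
    return to_start, to_stop
```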
A: Yeah, but still, it's prone to races, and having a solution that works 99 percent of the time is going to create a lot of headaches. If we have multiple monitors deployed on a single host, but you only want to have one monitor, and stuff like that, right? So we have to be sure that we are not prone to…
B: There'd have to be basically some coordination: whenever we call cephadm on a remote host, we'd have to pass along a timestamp that gets recorded somewhere on that local host, and the agent, whenever it does something, would have to record that, check it, adjust it, and so on. So, yeah.
B: I mean, I haven't looked at the current exporter code at all, but is there anything valuable there, or can we just rip it out and re-implement it?
B: Because, I mean, one of the things that I keep putting on my list and meaning to fix, and then going and looking at it and then not doing it: right now there's a list-networks command that I rely on to get information about ethernet interfaces and subnets and stuff, and then there's also a gather-facts command that was part of the exporter, I think, but I can't actually tell what uses it.
A: Yeah, Paul, was it? They're pretty demanding when it comes to introducing a new way of doing the same thing, kind of, yeah.
A: And now we can actually leverage a lot of things: the integration with systemd is good, so all the basic stuff is there and we can deliver it. So we have a head start when it comes to introducing the push model.
A: gather-facts actually ends up in the manager module already, so we just have to expose it; it's there. And we should probably add it to ceph orch host ls, I guess, but other than that, it's already there.
A: I think with a push model, when we have a failed daemon, we could achieve failover within a few seconds. Yep.
A: I mean, it's pretty much solved by systemd already: if we have a daemon that's failing constantly, then the systemd unit is going to be in an error state.
B: Oh, this all predates systemd by many…
B: But also, it wasn't necessarily just related to a single daemon restarting. I guess it did, but…
B: So I wonder, the tasks would basically be: add an endpoint…
A: No, no, no container. We can't put it into a container, because at some point it would be great to have the capability to actually deploy daemons on that host with the agent, and if we want to do that, we can't put it into a container.
B: Yeah, I mean, I think the agent will have to run the CLI inside the container, like a shell, but the agent itself will actually be running on the host, yeah.
B: I mean, if we make the agent check in at regular intervals, and then, if it doesn't check in, we do a pull attempt, and if the pull fails, then we mark the host offline, which is basically, I think, more or less what it does now, right? If we fail the pull and we hit…
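A sketch of that push-first, pull-fallback liveness check (the interval, names, and probe callable are illustrative):

```python
import time
from typing import Callable

CHECKIN_INTERVAL = 60.0  # seconds; illustrative

def host_status(last_seen: float, pull_host: Callable[[], None]) -> str:
    """Push first, pull as a fallback, and only then mark the host offline."""
    if time.time() - last_seen < 2 * CHECKIN_INTERVAL:
        return "online"        # the agent checked in recently
    try:
        pull_host()            # active probe of the host (e.g. SSH or HTTP)
        return "online"        # reachable, but the agent's push is lagging
    except (ConnectionError, TimeoutError):
        return "offline"
```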
B: And the deploy process could be… well, down here I said that one of the tasks would just basically be a CLI command that will explicitly, imperatively run the gather-facts code and phone it home.
B: So that could be the thing that the pull (not the push) does. Or it could do that, I guess, but it's still a little different.
B: I mean, I guess the failure event that we'd be worrying about is if, for some reason, the endpoint isn't responding: there's a firewall, or something is blocking you from being able to post to whatever the CherryPy endpoint, or whatever it's going to be.
A: If someone has a broken firewall configuration, then as soon as we have a manager failover, it can suddenly be that the agent can no longer access the new manager, because the firewall rule only provides access to that single manager.
A: Imagine we have a five-minute timeout, where agents need to push information every five minutes, and we have a load spike on the monitor that prevents the manager from accepting connections for a period of maybe five to six minutes; at that point we are marking all hosts as offline.
C: I think that means having some kind of control over the load of the endpoint, in order to see whether you are in a situation of saturation or not. And depending on that, just checking: if you are saturated, maybe it makes no sense to try to connect to the different hosts to see if they are alive or not, because you are in a situation where you know that you are not processing requests.
B: I mean, I don't know what library we use for the REST endpoint, but we can make it only accept a single connection at a time and just have them retry, wait, or whatever, so that'll throttle it a little bit.
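If that endpoint were CherryPy (which comes up later in the discussion), one way to sketch that throttling is at the server level; the handler and config values here are illustrative, not the actual cephadm endpoint:

```python
import cherrypy

class AgentIngest:
    @cherrypy.expose
    @cherrypy.tools.json_in()
    def report(self):
        payload = cherrypy.request.json  # the agent's pushed host state
        # ... hand the payload off to the manager's host-state cache ...
        return "ok"

# Throttle at the server level: a single worker thread plus a small accept
# queue, so surplus agents queue briefly or fail and retry with backoff.
cherrypy.config.update({
    "server.thread_pool": 1,
    "server.socket_queue_size": 10,
})
cherrypy.quickstart(AgentIngest())
```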
A: Should we make it so that, instead of a version, simply said, we use the hash?
A: Depending on your definition of quicker, yes.
B: We'll probably have a similar issue to the one we had before, where it's not just the version that's running, but also which version is deployed.
C: I think that we can avoid a lot of problems if we are very strict with the information that we are passing, and keep this information in the same version; then it doesn't matter too much what the version of the sender is. Maybe, but…
C: Well, we have a list of tasks, very high-level tasks, so it could be nice, at least for me: I need more explanation of these tasks in order to be clearer about what the things are that we need to do. And I think that maybe it is good to try to assign these tasks to different people on the team, and let's see when we can start with that.
C: Okay, maybe. I think that Sage or Sebastian, you are the best people to clarify the details of the tasks. So, well, I think that maybe we can try to start this assignment and see if we can start to work on that.
B: I have a question about the REST endpoint part. I seem to remember there's some weird thing where we're using CherryPy both for Prometheus and for the dashboard, or something like that, and it doesn't like having two instances of the… like, some static variables or something, yeah.
A: To avoid it, yes, we should…
A: Does it actually have to live in the manager, or can the cephadm daemon live in its own container?
A: …the manager, yeah. We're still thinking, I think. Really, we are close to the top of the hour.