From YouTube: Ceph Orchestrator Meeting 2021-09-07
A
Hello, hi everyone, and welcome to this week's orchestrator weekly. Looking at the topics, we have a presentation from Adam regarding the cephadm agent. Adam, do you want to start?
B
Yeah, I'll share my screen. So I'm going to present a little bit about the agent. I think most people in here actually already know most of this stuff, but anyway I'm going to go over it: sort of why we're doing it, what the architecture is going to be, and some of the important issues with it.
B
So to start: why? Why do we need this agent? The main thing here is scalability and performance. For one, SSH is currently the only way we really communicate with any of the hosts, and we've seen that that's pretty slow. We have our serve loop, and if we go through and try to SSH into every single host to do everything, it just takes too long, and we can't cover everything with parallelization.
B
Right now we're just parallelizing the metadata gathering. In the future there could be more: maybe if you want to deploy a lot of OSDs, you want to do that in parallel; just tell all the agents what they have to do and they can go do it. You don't have to worry about going through each host individually, one at a time, in the serve loop. It's also a push model, so it saves the manager the work of having to explicitly go and gather everything.
B
All the manager has to do is sit back with its HTTP server up, and it'll get the metadata sent to it. The other reason we need this is responsiveness; specifically we're talking about NFS here. If I remember right, NFS needs...
B
We
need
to
know
if
it's
down
within
a
minute,
and
so
if
we
have
the
serve
loop-
and
it
only
goes
off
every
few
minutes-
that's
already
too
slow,
and
even
then,
if
it
went
off
once
a
minute,
it
still
takes
a
while
to
go
to
sh
in
the
host
and
gather
the
metadata.
So
it's
just
it's
too
slow.
B
We can't do any HA for NFS with the current architecture, so that's why we need some sort of agent on the host to make things faster. Okay, so the basic architecture here is that the manager itself is going to have an HTTP endpoint; we're using CherryPy for that. Because we're worried about scalability and stuff, we want to be able to take HTTP requests from a lot of different places, so we really want a nice library here.
B
Have
our
own
http
server
we
have
to,
you,
know
worry
about
debugging
and
everything
this
one's
already
built
have
to
implement
it
in
here
and
then
for
the
host
themselves.
The
agent
will
be
a
non-containerized
system
d
unit
that
allows
it
to
run
commands
directly
on
the
host
really
easily.
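A rough sketch of what a CherryPy endpoint like that could look like (this is illustrative, not the actual cephadm code; the class name, path, and port are all invented):

    import cherrypy

    class AgentEndpoint:
        # Hypothetical handler for metadata POSTed by the agents.
        @cherrypy.expose
        @cherrypy.tools.json_in()
        def data(self):
            payload = cherrypy.request.json    # JSON body sent by an agent
            host = payload.get('host')
            # ... authenticate the sender, check freshness, store metadata ...
            return 'ok'

    cherrypy.config.update({'server.socket_host': '0.0.0.0',
                            'server.socket_port': 7150})  # port is a guess
    cherrypy.quickstart(AgentEndpoint())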
B
So
we'll
see
the
other
slides,
like
all
the
metadata
easily
gather
it's
easy
to
just
run
on
the
host
without
it
being
in
a
container
and
what
they'll
be
doing
is
we'll
be
sending
all
the
things
it
gathers
over
http
to
the
manager
on
that
server
that
it
has
waiting
here
and
then
for
messages
from
the
manager
to
the
agent.
We
have
a
raw
socket.
We
don't
really
want
to
have
to
have
an
http
server
running
on
every
single
host
agent,
so
we
just
have
a
socket
to
communicate.
B
It
should
be
enough
there
you
see
here,
we've
set
it
up,
so
we
can
today
receive
a
variable
and
json
string
which
basically
lets
us
send
whatever
you
want
to.
It
will
help
in
the
future.
If
you
want
to
extend
the
functionality,
that's
what
we'll
do
for
now
we're
just
getting
metadata,
because
we're
worried
about
the
responsiveness
and
scalability
stuff
right
now.
The
biggest
scale
problem
is
actually
gathering
metadata
in
the
serve
loop.
It
just
takes
a
while
because
most
other
things
you
want
to
do
once
in
a
while.
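A minimal sketch of the agent's side of that socket, assuming a simple read-until-close framing for the variable-length JSON (the port and framing are assumptions):

    import json
    import socket

    # Hypothetical agent-side listener for manager -> agent messages.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(('0.0.0.0', 7151))   # port is a guess
    srv.listen(1)

    while True:
        conn, _addr = srv.accept()
        chunks = []
        while True:
            data = conn.recv(4096)     # keep reading the JSON payload
            if not data:
                break
            chunks.append(data)
        conn.close()
        msg = json.loads(b''.join(chunks).decode('utf-8'))
        # ... act on the message, e.g. store a new counter value ...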
B
So, the list of daemons: this is the important stuff. It's one of the slowest things that runs; the 'ls' is super slow, and so the agent will do that and send it to the manager, so the manager already has it ready. Then we have the networks and host facts in here; this is just some information about the host that could be useful, and networks is pretty useful. And the ceph-volume output: this is helpful too, for the disks.
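Putting those pieces together, the report each agent pushes is presumably a JSON document along these lines (every field name here is invented for illustration):

    # Illustrative shape of an agent metadata report (all keys assumed).
    report = {
        'host': 'vm01',
        'counter': 42,                   # freshness counter, described later
        'keyring': '<agent keyring>',    # lets the manager authenticate it
        'ls': [{'name': 'mon.vm01', 'status': 'running'}],  # daemon list
        'networks': {'10.0.0.0/24': ['10.0.0.5']},          # host networks
        'facts': {'hostname': 'vm01', 'arch': 'x86_64'},    # host facts
        'volume': [],                    # ceph-volume inventory of the disks
    }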
B
We
don't
even
refresh
that
very
often
so
this
will
make
that
faster,
we'll
have
more
up-to-date
info
on
the
discs
on
the
host
and
in
the
future.
I
think
we
can
even
set
up
so
if
the
disks
change
we'll
immediately,
have
the
agent
send
more
data,
so
we'll
be
really
responsive
on
that
stuff
and
maybe
more
in
the
future.
So
right
now
it's
just
metadata
stuff,
but
we've
talked
about
the
possibility
of
once.
This
is
a
stable
thing.
B
Maybe we'd want it to deploy daemons; it could help with pulling up a lot of OSDs or whatever, but that's future work. We don't want to go there until this and some other stuff works.
B
So, we're talking about having a secure channel here, because we're doing things over HTTP and also the raw socket. The things we're worried about are making sure the messages are encrypted and making sure we're authenticating who's sending them. If we have messages that can't be sniffed by anyone, so nobody else can read them at all, and we also know exactly who they're coming from and who we're sending things to, then we have a pretty secure channel of communication.
B
In this case we have two channels we're worried about. There are the HTTP messages, which is things going from the agent to the server, like when we send metadata up there; we do that with a POST request, and we have to make sure that's secure. And then there's the raw socket: the agent itself has the raw socket, and the manager needs to be able to send information to that socket, so again, it needs to be secure.
B
So first we set up HTTPS, so we have encryption. Unfortunately, it doesn't seem like there's any native two-way authentication in CherryPy for SSL; that would be the ideal way to do it, but we can't quite do that. We'll get back to that in a second.
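For reference, one-way SSL in CherryPy is just configuration, roughly like this (the file paths are placeholders):

    import cherrypy

    # One-way TLS for the manager's CherryPy endpoint (paths are placeholders).
    cherrypy.config.update({
        'server.ssl_module': 'builtin',
        'server.ssl_certificate': '/var/lib/ceph/mgr/server.crt',
        'server.ssl_private_key': '/var/lib/ceph/mgr/server.key',
    })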
B
We generate a root cert, and we give that to the agent when we're deploying it; that's done over SSH, so we're not as worried about encryption or anything there. Then, since the agent has the root cert, it's able to use that to verify the manager: it can see that the cert the manager is using is one that's signed by this root cert, and it'll also check the hostname on the certificate the manager has. That way it knows that whoever it's sending data to is, in fact, the actual manager.
B
It's
not
just
some
random
person
and
the
manager
also
verifies
the
agent
like.
So
we
don't
have
two-way
authentication,
so
we
can't
verify
some
sort
of
ssl
cert
on
the
agent
side,
so
we
have
a
different
way
of
doing
it.
So
we
do,
is
we
generate
a
key
ring
for
the
agent
and
then,
when
the
agent
sends
metadata
back
to
the
manager?
It
includes
that
key
ring
and
we
verify
that
the
agent
on
that
host
is
supposed
to
have
that
exact
hearing.
B
If it doesn't, we just discard whatever it sends; but if it does, we can use that metadata and be sure it's from someone reliable. On the other side, the raw socket: this is just a raw socket, so we don't have any problems with not being allowed to use two-way authentication, and that's what we're doing there. The nice thing about this is that it covers both the encryption and the authentication.
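The manager-side keyring check can be as simple as a constant-time comparison against the keyring that was generated for that host's agent; a sketch, with invented names:

    import hmac

    # Hypothetical check: does the keyring in the report match the one we
    # generated for this host's agent when we deployed it?
    def agent_is_authentic(expected_keyrings, host, report):
        expected = expected_keyrings.get(host, '')
        provided = report.get('keyring', '')
        return hmac.compare_digest(expected, provided)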
B
So it's a similar thing: we have that root cert on the manager that we generated, but this time we're actually generating a cert for the agent itself. So, on top of passing the root cert to the agent, we're also passing this newly generated cert to the agent, and now the manager and the agent both have their own certificates, so they can verify each other's certs. The manager will verify the agent has a certificate signed by that root cert, and the agent can verify the manager's certificate is also signed by that root cert.
B
Since we're the ones making the root cert, and we're the only ones passing that stuff around, we can say that if both sides have a cert signed by our newly created root cert, then we're pretty sure this is a legit request from someone in the cluster. And again, because it's SSL, the encryption is already all covered. Okay.
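Python's ssl module supports exactly that kind of mutual verification on a raw socket; a sketch of the agent (server) side, with placeholder paths and an assumed port:

    import socket
    import ssl

    # Require the peer (the manager) to present a cert signed by our root.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain('/etc/ceph/agent.crt', '/etc/ceph/agent.key')
    ctx.load_verify_locations('/etc/ceph/root.crt')
    ctx.verify_mode = ssl.CERT_REQUIRED   # this is what makes it two-way

    srv = socket.create_server(('0.0.0.0', 7151))   # port is a guess
    tls_srv = ctx.wrap_socket(srv, server_side=True)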
Another big topic we had was metadata integrity; specifically, here it's about things being out of date.
B
Before, what we were doing in our serve loop was gathering the metadata first and then saying "apply specs" or whatever. Now we're doing that all asynchronously, so we have to be concerned, when we go through the serve loop and want to apply specs, with whether the metadata is actually up to date and reliable.
B
That can be an issue, because if it's not up to date, you could double-deploy a daemon. Say you've deployed a monitor on a host, and then the serve loop started up again; maybe you didn't have new metadata from the host yet, so you might still think there's no monitor there, and you could try to deploy it again. You don't want to have double daemons going on hosts. I think it's more of a problem with...
B
I don't know if it's mons specifically; it's probably just a problem in general with extra daemons getting deployed. Our solution to this is essentially just a counter. We were originally thinking of something like a Lamport clock, but I looked at it some more, and this problem is a bit simpler than what a Lamport clock covers. With a Lamport clock...
B
...you have a full queue with counter values in it, and you can use it to verify or access distributed resources across a bunch of different hosts. In our case we really only have one issue, which is verifying, for two specific events on two specific hosts, which order they happened in. So it's super simple, and that means we can use a counter. The way the counter essentially works is that the manager is in control of actually incrementing the counter at any point.
B
So it has this counter value, and whenever it changes the daemons that are on any given host, it updates that counter value. From that point on, it will only consider the host's metadata up to date if it receives metadata with that counter value.
B
So we just send that new counter value to the agent, and what the agent will do is, whenever it's about to start gathering metadata, it will see what its counter value is and make sure it attaches that to the message.
B
That way, if the manager gets metadata with the new counter value, it knows that not only did the agent see that counter value and put it in, but that it saw the counter before it even started gathering the metadata, which guarantees that the metadata is newer than the last time we deployed daemons on that host. That's essentially the same verification we have in our current setup, so we're not losing any sort of metadata integrity there, which is good, because that was one of the big problems with this asynchronous push-model system.
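Sketched out, the manager-side bookkeeping could look like this (the class and method names are invented):

    # Hypothetical freshness tracking for incoming agent reports.
    class HostCache:
        def __init__(self):
            self.counters = {}   # host -> counter the manager last pushed

        def bump(self, host):
            # Called whenever the manager changes the daemons on `host`;
            # the new value is then sent down to that host's agent.
            self.counters[host] = self.counters.get(host, 0) + 1
            return self.counters[host]

        def is_up_to_date(self, host, report):
            # A report only counts as fresh if the agent saw the latest
            # counter *before* it started gathering metadata.
            return report.get('counter') == self.counters.get(host, 0)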
Offline hosts: this is another thing.
B
So essentially, if it's been 50 seconds and we haven't gotten a message from an agent, then that host will get considered offline. If that happens, then we know: hey, we have to move a daemon around, if you want to have some sort of HA system. And what we do in that case is simply schedule a redeploy of the down agent.
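A rough sketch of that offline check, using the 50-second figure from the talk (the names are invented):

    import time

    OFFLINE_AFTER = 50.0   # seconds without an agent message

    # Hypothetical scan over the manager's last-contact timestamps.
    def find_offline_hosts(last_contact):
        now = time.monotonic()
        return [host for host, ts in last_contact.items()
                if now - ts > OFFLINE_AFTER]

    # For each host found here we would schedule a redeploy of its agent,
    # which doubles as a liveness probe: if the SSH for the redeploy
    # succeeds, the host wasn't actually offline.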
B
The reason we do that is it helps with two things. One, it can verify the host is actually offline, because we will try to SSH in to redeploy the agent; so if the host is in fact not offline, we'll be able to tell, because we'll SSH in and see that, and if the host is offline, then we won't, and we'll know for sure it's offline.
B
I'm just going to mention here: this is still a work in progress. I have this pull request open; I'm not going to open it right now, but it's there, if people are interested in going and looking at it or reviewing it themselves. It just went through a teuthology run last night; there are some problems with SSL on Ubuntu, so there are still things to be ironed out. I'm going to give a small demo; hopefully this cluster is actually up.
B
Yeah, this is really where it helps. You can see here we have three agents deployed; I have three hosts, vm00, vm01 and vm02, and there are some daemons on these hosts, just the default daemons you have with the monitoring stack stuff on. And you can see, if I run this again, these already all got reset; they're not all in sync anymore. That's just because of whenever they last had to send data; they really only get up to 20-something seconds before they'll send something back.
B
You see, it's always kept to a super small amount of time in between refreshes. That way we can always tell things are up to date; we can see things faster, and we'll be able to detect offline hosts faster. And that's really all there is to show for this stuff, I guess, besides the debug logs you can see here.
B
Anyway, if you can't read this: it says it refreshed the daemons on host vm01, there's a message about how we received up-to-date metadata from host vm01, and then there's just this automatically printed message about an HTTP message being sent. But basically that's all there is to show; again, because there's no new functionality here, we can just see that the refresh time is super low, which is the whole goal of this, and that the agents are able to run. As I said, this is on a CentOS set of VMs.
B
There seem to be problems with Ubuntu currently, but that'll get ironed out, and I think that's really it, unless anyone has any questions they want to ask.
A
B
I think it should be able to handle it, because CherryPy can have a lot of different threads running at once, and it doesn't take very long to handle the individual metadata once it's there; the actual processing of it seems really quick. I think the gathering is the slow part. So I want to say it should be able to handle that stuff, at least a lot better than it's currently being handled with the SSH and stuff, but I think it'll be okay. Obviously it's going to have to get tested.
A
So when we run out of the one-minute timeout in a large cluster, are we ending up in a thrashing situation where we are creating even more load? Imagine the manager is locked up for a minute or so, for whatever reason, I don't know.
B
I mean, my hope would just be to avoid that at all, just have it not take that much longer. I think we could maybe work with adjusting how fast we time out, if that's a problem. Also, in the future, we want to get it so the agent can run a bit faster and can push off without anything being slowed down, but there's no real way around it.
B
I think, if you're going to have something like this, if you put the timeout in and it takes too long, then it has to time out; there's nothing else really to do. So I think what we really have to do is work towards making it so that the agent can always get its request accepted in that timeout range.
B
Yeah, they do have to get reconfigured. I thought about that as some sort of future work; it's possible to do that reconfiguration over the HTTP or the socket. Right now it's just doing the SSH thing we have built in, because that's what we have; I didn't want to try to implement that in this basic version, but it's definitely something to go for afterwards. If it's just a simple reconfigure, it shouldn't need to change that much, if I just want to change, say, the target IP of the manager.
D
We could also imagine, yeah, we could also imagine that the agents learn what the standby managers are every time they send in their data, so that if they're having trouble connecting, they could try one of the standbys, and then they'd sort of seamlessly transition.
B
D
That could work, it seems. Going back to the thrashing question, it seems like maybe the way to avoid that is... say the issue is that the manager thinks five of the agents have timed out.
D
So I have a list of these five agents, and then I iterate over them, and each iteration is like this 20-second process of doing the slow SSH connection or whatever it is, and then you end up redeploying agents that are no longer slow. Maybe structure that loop so that we check and see if there is a slow agent, and if so, we take one of them and redeploy it.
B
D
B
Yeah, the client can verify the server with that library.
A
D
I guess one last thought: since there's a cert for both the client and for the server, I wonder if the agent cert piece of it can be used in place of, like, generating a cephx key for each of the agents.
B
The
only
reason
I
thought
I
needed
the
key
rings
is
because
the
cherry
pie
server
doesn't
do
the
two-way
like
for
the
thought
into
the
raw
socket.
I
can
just
use
the
two
authentication
and
the
cell
search,
but
when
I
want
to
verify
the
agent
from
the
manager
side
when
it
sends
metadata
over,
I
can't
it
doesn't
seem
like.
I
can
use
a
verified
search.
D
B
D
B
E
Yeah, sorry, this is Cory, jumping in. I just wanted to kind of get some tips and stuff, I guess, for how you guys imagine this being done. From my standpoint, I guess, what I've looked at so far and how I see it is: basically, you guys already have some utility functions for determining whether an OSD is safe to destroy.
E
So I imagine I just look for ones that are safe to destroy, have a queue of them, and destroy them as possible, based upon the requirements of maintaining the replication factor and stuff. Then use your standard commands, the same ones that end up being used if you were to do it manually, to replace them, watch for them to be drained, and then I guess it would take care of zapping them as well.
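The "safe to destroy" check is already exposed as a Ceph command, so a queue like that could poll it; a minimal sketch (the queue itself is hypothetical):

    import subprocess

    def safe_to_destroy(osd_id):
        # `ceph osd safe-to-destroy` exits non-zero while destroying the
        # OSD would still risk data availability.
        res = subprocess.run(['ceph', 'osd', 'safe-to-destroy', str(osd_id)],
                             capture_output=True)
        return res.returncode == 0

    # Hypothetical repave queue: only destroy OSDs as they become safe.
    pending = [3, 7, 11]
    ready = [osd for osd in pending if safe_to_destroy(osd)]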
E
D
Evacuated devices, or dead devices, in which case you wouldn't have to do anything.
D
F
A
There is a device id; maybe we can craft a specific drive group for that specific device id.
D
I don't know how common this scenario is, but repaving OSDs is exactly the user scenario we're trying to cover here. It seems like, if we just said that, in order for this to work properly, you have to have a drive group defined that will cover the devices as they become available, then that would sort of make our life easier and would also nudge users.
A
Currently, what was your use case again? I think it was changing min_alloc_size, right?
E
Good, but yes, the big use case for us is to change the min_alloc_size across a bunch of clusters that are pretty big. And then we just had another use case that we ended up doing manually, where we wanted to start using DB/WAL on some OSDs that weren't previously configured to use DB/WAL on NVMes.
A
G
A
Do you want to at least have a look at that Firestarter bluestore Ansible playbook, in order for us to avoid running into the same issues? Yeah, I'm sure one of...
A
What else do we need? To add some persistent state for us to track what the current state of repaving OSDs is?
D
A
D
Actually, so maybe the gap, then: the drain also has this thing where there are two modes, one where you're going to reuse the OSD id and one where you're not. Maybe we need to have a way so that, when we're zapping and the drive group applies, it knows that it should reuse the OSD id.
A
Reusing the OSD id is pretty simple: when creating those OSDs, we are just searching for destroyed ids.
D
So
one
other
concern
I
have
is
that
if
you
have
a
drive
group
that
says
let's
say
you
have
servers
that
have
like
eight
hard
disks
and
one
ssd
or
two
ssds
or
whatever
they
get
split
up
into
db
wall
partitions
and
you
do
and
if
everything's
empty-
and
you
have
the
drive
group
that
says,
use
this
for
wall
on
this
for
data
or
whatever
the
volume
like
figures
out,
that
the
ssd
should
be
divided
eight
ways.
D
But
if
you
delete,
if
you
zap
like
one
of
the
osds,
so
one
of
the
data
devices
and
you
delete
one
of
the
lvs
for
the
the
db
wall,
we
need
to
make
sure
that
that
volume
is
or
whatever
yeah
that
is
smart
enough
to
like
know,
to
recreate
the
db
lb.
That's
the
right
size.
E
A
In the cephadm test suites we aren't testing that; we are relying on the ceph-volume test suites to properly cover it. Guillaume, do you know if that's properly tested?
A
G
D
E
So, as far as the command to kick this off, and sorry if you guys talked about this when my headphones were disconnected, but do you imagine a new command, like a "ceph osd repave" kind of thing, and then they specify some kind of wildcard syntax? Or, I don't know, what do you imagine the input to select the set of OSDs that should be repaved? I guess, or just a whole host spec, something like that?
D
Maybe just, yeah, that might be just a good starting point, right? I think back when we talked about this, like two years ago...
D
The strategy we thought of was that, if you're going to repave... the complicated part is when you have these hybrid OSDs, where you have the SSD and four hard disks that are, whatever, using the SSD as the DB. You basically want to repave that whole set of devices, the SSD and the paired hard disks, all as a unit.
D
You'd want to destroy all those OSDs and then redeploy that whole thing as a unit. Maybe that isn't necessary, because ceph-volume is smart enough that you could do them one at a time, which would be nice, but there are probably cases where you do want to do all of them. For example, maybe there's stuff... well, I don't even know if it supports that disk setup well enough that we should worry about it.
E
D
A
E
Yeah, yeah, I think so. Let me, yeah, I'll take it back and kind of sketch it out more and stuff, and then I'll come back with more questions for next time, probably, or things that might need a second pair of eyes on something. But that was really helpful, thank you.
E
A
D
So the grace period is 90 seconds. I would say that we probably want to do the failover by the midpoint of that, if we can, so that they'll have plenty of time to go through their thing.
B
Yeah, a little bit of work needs to be done so it actually pushes every 20 seconds, because right now it's 20 seconds plus the time the gathering takes, which I want to fix in a follow-up. But it should be every 20 seconds, unless...
G
D
Pretty close to 50 seconds, and probably about a minute before that is probably still good enough in a healthy environment. So maybe we can just go with that, and we can make the agent interval tunable too, so if somebody has really tight requirements and wants the best performance or something, they could have the agent run faster.
B
We also have a config setting right now for the cluster overall, but if you also wanted to deploy individual ones with different settings, say NFS hosts wanted to have slightly faster ones or something, that should be doable.
A
Okay, the next topic I had on my list was testing reboots. We had a bad issue where mons got removed from the monmap when rebooting hosts, and that's pretty bad, and we really should avoid running into the same issue again. You mentioned that we have a power cycle thrasher and a kernel thrasher that we could use to test reboots.
D
Yes, the thrasher; and I think the code actually is in ceph_manager, it does reboot nodes.
D
Oh, it power cycles nodes, which should have the same effect. So we could actually do a full-on power cycle, or we could SSH in and run reboot, but I haven't checked to see whether there's any trick you have to do to, like...
D
B
A
Another thing that did pop up last week was log aggregation. There is a demand for cephadm to kind of make it possible to aggregate logs from all over the cluster, possibly for support cases; also, I don't know.
A
I was kind of a bit hesitant to add functionality to the cephadm manager module to aggregate logs from all over the cluster, because they are huge, or might be huge.
A
D
A
We're doing it in quarterly already: we run SSH on the remote host, aggregate all the daemon logs, and put them into a zip file.
D
Seems like maybe then you'd want something like a cephadm gather-all-logs, and then, like, get the key from the manager.
D
A
C
I think that there are external tools that are specialized in analyzing logs; that's another issue. Okay, so I think that it could be useful, for example for integration tests, to have all the information after the test; but in running clusters or in production systems, I think that having everything, or having just an aggregation of the logs, is not going to be useful.
E
F
D
B
E
On our side, we have log aggregation set up with Promtail and Loki, and I know there are other alternative solutions, ELK stacks and stuff, but that seems to be a pretty straightforward setup, and those tools already exist. So...
E
At least from our standpoint, it's easy enough to use those external tools for the log aggregation and for making the logs searchable and indexed and stuff, for all of these purposes, and I'm not sure what advantage there is to having specific support for it, I guess, besides maybe being convenient and easier to set up right away.
D
They're not mutually exclusive, right? Like, some users will want a full-blown ELK stack, but having like a gather-logs command might also be...
D
A
Yeah, it's... I don't know, I really don't like going into the realm of gathering log files. It feels to me that cephadm is really the wrong tool for the job, and it feels that we're going to invest a lot of time to implement this for a rather limited benefit.
D
Well, I think just having a simple command isn't a whole lot; I mean, it's narrowly scoped, it's not a whole lot of effort. It doesn't have to be part of cephadm; you could imagine just writing a quick little script that does the same thing. But taking that functionality and putting it inside the cephadm CLI tool at least seems nice, because I think lots of people will...