From YouTube: Ceph Orchestrator Meeting 2022-01-25
A: The first one is our timeouts. We discussed this a bit before; we have that issue. I don't have the tracker linked — actually, is that the right tracker? Yeah, that is the right tracker, so the RBD tracker is linked in the other one as well. Take an example of this issue in the manager module: if any of these SSH commands, like a ceph-volume command or something, ends up hanging, then the entire serve loop hangs permanently, so you have to restart it.
A: So there was some discussion about introducing timeouts to our SSH commands, but we never finalized anything, because it ends up being a bit tricky. If you try to set, say, a global timeout, it's hard to find a good spot for it: if you make it too long, it doesn't really do very much.
A: If you have to wait 20 minutes for this extended timeout, then you just loop back to the serve loop and it does the same thing again, so it's almost never doing anything anyway. But you can't make it too short either, because some commands, like a deploy command that actually pulls an image at the start, are going to take a few minutes. So either we'd have to do the sort of thing where different commands have different timeouts, or we'd have to have a really long timeout and just say it's okay to have it.
A: Or raise a health warning and have it be idle most of the time — something, one of those options, I don't know. Does anyone have any thoughts on that one?
B: I just want to mention that, because it's not totally orthogonal to have global versus per-command — okay, it's a global default.
C: There was one thing, actually: if I remember correctly, there is a 15-minute timeout in the SSH protocol that's super hard to get rid of.
C: Yeah, here — okay, here I found it, if you can actually see this one.
C: "Offline host hangs the serve loop for 15 minutes" — found by Daniel half a year ago, and it's super hard to get rid of this specific problem. It's kind of a weird thing, because we are persisting the SSH connections in our SSH cache, and as soon as we try to reuse an existing open SSH connection to a host that is no longer there, we're going to hang for 15 minutes, and there is basically no way to avoid that 15-minute hang.
C: But as this problem is with the SSH protocol and with the SSH implementation, and not with the Python binding, I very much think that we are still prone to the very same problem.
C: Where we know that a host is offline, we should probably first reset the connection if we suspect that the host is offline; otherwise we end up in this 15-minute thing.
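
A minimal sketch of the idea being floated here — drop and reopen a cached SSH connection for a host we suspect is offline instead of reusing it. This is not cephadm's actual connection cache; the class and field names are hypothetical, and asyncssh is assumed as the SSH library.

```python
import asyncio
import asyncssh


class ConnCache:
    """Hypothetical per-host SSH connection cache."""

    def __init__(self) -> None:
        self._conns: dict[str, asyncssh.SSHClientConnection] = {}

    async def get(self, host: str, suspected_offline: bool) -> asyncssh.SSHClientConnection:
        conn = self._conns.get(host)
        if conn is not None and suspected_offline:
            # Drop the cached connection rather than reusing it and risking
            # the long hang on a host that silently went away.
            conn.abort()
            conn = None
        if conn is None:
            # Bound the reconnect attempt so an unreachable host fails fast.
            conn = await asyncio.wait_for(asyncssh.connect(host), timeout=10)
            self._conns[host] = conn
        return conn
```
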
A: That's one thing we were doing, actually — on offline hosts we'd always still run check-host.
C: I know that the agents have an open port. Can we just — if we know that there is an agent — and Adam, correct me if my information is outdated — can we just connect to that open port on the agent, if we know that the agent should be there, and only if we cannot connect to that host then...?
A: Yeah, we could. The reason I've been avoiding that is — I guess it's fine to try that first and call check-host when it fails. I was worried about relying on the agent being a stable thing, like, if it's there it'll be working. But if we're just saying that whenever it fails we'll call check-host normally, and otherwise we'll just ping it, I think that would at least guarantee it's online.
C: Like I said, that was the reason to remove it: it doesn't give you any information. Ping might fail just because the ping protocol is disabled by a firewall, so there.
A: I kind of meant to send a message to it — just implement that a little smaller, because right now it's actual updates and stuff you'd send there. Send something like an empty JSON and just see if we get something back, because it does respond. We can do that, and at least if that works we can guarantee it's online and we're fine, and if it doesn't work, maybe reset the connection and try check-host.
A: I mean, it could be a starting point, though. If there is an agent up, we can try to use its port to verify the host is online, and then if that fails for whatever reason and we don't get anything back, we can be a bit more cautious and try to reset the connections instead of just running a normal check-host.
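
A rough sketch of that fallback, assuming a plain TCP probe: the agent endpoint, port lookup, and message format here are hypothetical, not the real cephadm agent protocol, but it shows the shape of "send an empty JSON, see if anything comes back, otherwise reset and fall back to check-host."

```python
import asyncio
import json


async def host_reachable_via_agent(addr: str, agent_port: int, timeout: float = 5.0) -> bool:
    """Send a tiny payload to the agent's listening port and see if anything comes back."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(addr, agent_port), timeout)
        writer.write(json.dumps({}).encode())  # "an empty json or something"
        await asyncio.wait_for(writer.drain(), timeout)
        reply = await asyncio.wait_for(reader.read(1024), timeout)
        writer.close()
        return bool(reply)
    except (OSError, asyncio.TimeoutError):
        # No answer: be more cautious, reset the cached SSH connection and
        # run a normal check-host instead of assuming the host is fine.
        return False
```
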
A: That would be a good starting spot, yeah. I guess if we already have a timeout there, then introducing some other timeouts is probably not going to help us too much. What we could do is, if we could recognize that the timeout actually happened, we could raise a health warning — that's one thing we're not doing.
E: So I think this touches on the other subtlety of this issue, because there's the issue of detecting that the host is offline, but in the case of the RBD failure the host was online, just in a livelocked state. We were attempting to inventory devices and were blocked in an uninterruptible state, so the process would never return. So the connection was valid, but it was simply holding the global cephadm lock, and when something like this occurs there's no real good indication of what the orchestrator is doing on that host.
E: It just appears to be hung or stuck, and I think in those cases we need some sort of trigger to say, you know, this is the operation we're attempting to perform, and raise a health warning — but that's not occurring. I've seen a similar thing when image pulls fail in the background: they'll just silently fail over and over and over on an individual host without progressing in the serve loop, which creates the appearance that the manager is hung. But it's really not.
C: Just today I looked into the locking of cephadm, and it turns out that we block indefinitely trying to get hold of the cephadm global lock — indefinitely, there is no timeout.
B: This is a little bit more ambitious, maybe, or just my own ignorance: are these commands fully synchronous, as in, for thread X or for host X the system only does this one thing until the command returns with a response? Or are there any async components where it's like, I've started operation X on host Y but I haven't gotten a response yet?
C: I mean, internally we're using the asyncssh library. We don't lose control of the — we're not losing control. We could... I think — Melissa, do you know if the asyncssh library supports some kind of timeout when doing SSH calls?
F: Yeah, I think there is a timeout. If the timeout expires before the process exits, there's a timeout error that it can raise, and it also returns an error if the process exits with a non-zero status. So it can return a timeout error.
A: When I remember looking at this, there are two basic asyncssh calls we're doing: there's the actual connect call, and then there's the run command. I think the connect one is probably okay — it's probably the one that still has a timeout and works like that. I think it's the run command that has the opportunity to hang forever. I think I looked at the asyncssh documentation before, and the default is just no timeout for those, whenever we actually execute commands.
A: We assume we have a good connection, and even if the connection actually is good, it will just last forever. But I'm pretty sure you could put timeouts on that. I think, when I was looking at it before, you get to set up some sort of SSH config object, and you could pass it in there and it would do something like that.
F: Yeah, I think you have to specify the timeout if you want there to be one. I don't think we specified it, at least not in the config — if it even uses the SSH config for the timeout.
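
A minimal sketch of bounding a remote command, assuming asyncssh as discussed. asyncio.wait_for is used as the bound here (newer asyncssh versions also take a timeout argument on run() itself); the function name and the 300-second value are illustrative, not anything cephadm currently does.

```python
import asyncio
import asyncssh


async def run_with_timeout(host: str, cmd: str, timeout: float = 300.0) -> str:
    # Bound the connect separately so an unreachable host fails fast.
    conn = await asyncio.wait_for(asyncssh.connect(host), timeout=30)
    try:
        # If the remote command hangs (e.g. a stuck ceph-volume inventory),
        # this raises asyncio.TimeoutError instead of blocking the serve loop.
        result = await asyncio.wait_for(conn.run(cmd, check=True), timeout=timeout)
        return result.stdout
    finally:
        conn.close()
```
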
A: Yeah, I don't think we currently do. It would be nice if we had a timeout for those commands — it could be a pretty lenient timeout, because again there are some things that are slow — but if it could at least raise a health warning saying that it timed out and which command failed, it would be an improvement.
A: And if that works, we could just throw that onto the function that runs cephadm, put it there, and use that to test whether it works properly.
A: I think these run commands are the risky ones; right now we don't handle it at all.
A: I think this RBD issue is not the only time this has happened. I'm pretty sure this was what was happening on the Gibba cluster at one point: the serve loop seemed to be hanging, and there was one host that had some hardware issues, so the ceph-volume call was failing there, I'm pretty sure. And then downstream there was similar testing where there was a similar issue as well — again it was a ceph-volume command hanging.
A: So it seems like we have two options here. Obviously there's what Sebastian just linked — that timeout for run commands might work — and there's what was just posted there, which is that we actually have a global timeout arg in cephadm.
A: Yeah, I can invite you to that, or get you on the Ceph community calendar — I think that's where I normally find this one — after this.
A: Okay, yeah, we're discussing options — I'm looking more at a global level right now with the timeouts. So there are two options: asyncssh has a timeout option for run commands that could work, and cephadm also actually has a built-in timeout argument.
G: And — go ahead — just so I understand: the idea is to rely on SSH for the timeout?
A: Then we also have what Michael Fritch posted: the cephadm binary has a built-in timeout arg that we've already set up. So we'd include that in all of our cephadm commands as a base timeout. It's possible we'd do the same thing there too: let it time out and then raise a health warning.
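
For illustration, a sketch of that second option as the manager module might assemble the remote command — the idea being that cephadm's own timeout argument terminates the work on the host itself rather than only cutting the SSH session. The exact flag spelling and the 900-second default shown here are assumptions taken from this discussion, not confirmed values.

```python
# Hypothetical helper: prefix every remote cephadm invocation with the
# binary's built-in timeout argument.
def build_remote_cmd(subcommand: list, timeout_secs: int = 900) -> list:
    return ['cephadm', '--timeout', str(timeout_secs)] + list(subcommand)


# e.g. build_remote_cmd(['ceph-volume', 'inventory'])
#   -> ['cephadm', '--timeout', '900', 'ceph-volume', 'inventory']
```
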
B: I was going to say, one way to think about the problem too: I did not see in the documentation whether the asyncssh timeout terminates the command or not. So if the connection is killed but, say, the process is still running — maybe it's got some kernel-level locks from running LVM commands or whatever — it's still there running on the host, whereas possibly the timeout option to cephadm might actually be able to terminate it.
G: In general I think that, regardless of whether asyncssh supports the timeout or not, we should not rely on some underlying component — in this case asyncssh. If there is some timeout, it should come from cephadm and be propagated to the underlying component, to make it more explicit.
E: The second part of this question is: what's a reasonable timeout, and are we walking down the path of implementing a crash-loop backoff like Kubernetes? What is our strategy here?
A: My thought for the initial version of this was just to have a fairly long one, and we'd raise a health warning saying something is very wrong here. It wouldn't be great — obviously this would be a pretty slow timeout, whatever it is — but at least we'd actually raise a health warning after, say, 10-15 minutes, and then you know something's wrong. And we also wouldn't have a fully hung serve loop; we would still be able to do things other than whatever it was trying to do there.
A: That was my initial thought of what we would do.
G: Are you referring to the one from yesterday? Because I looked at it, and that bug was basically this: cephadm is stuck forever waiting for that lock. So that was probably one particular case, but we have more cases like this, so maybe it depends on each specific use case, and the timeout value will be different.
A: Yeah, I mean, we have another example case in the other pad — there's a link to a tracker issue.
B: Do we have a very rough approximation of what the longest-running success cases are? Is it five minutes? Is it 10 minutes? Is it 20 minutes? That kind of scale. It doesn't need to be super exact, but it can help create an upper bound on what you would want for a success case.
E: I think the challenge is that many of the operations are actually quite fast — within a few minutes — but we do know that inventorying devices, especially on a dense node with ceph-volume, can take quite a long time. So hardware variability — yeah, that's it.
G: I mean, not printing it, but if there is somehow a way to get at the progress — some events and feedback.
B: If we know it might take a long time, it might be better to kick off an async process and then poll the results, but that's a bigger problem than the timeout thing — it's somewhat orthogonal, so I'm still in favor of the timeout. It's just that further down the road we might consider making some of these long-running background tasks async — not at the level of asyncssh, but like creating a systemd job or something.
C: Yeah, that's feeding into my point — okay, great. The agent already does the ceph-volume inventory asynchronously, so from the cephadm manager module's perspective it is asynchronous.
A: Yeah, it's not on by default right now, so it's not getting as much use — some people aren't making use of it yet — but it is.
A: It gathers the ceph-volume inventory stuff and also the daemons on the host, because that's also a fairly slow command — not as slow, we're talking maybe 10 seconds or something, or longer on some, like 10-20 seconds — but still fairly slow as far as durations go.
A: Yeah, and I think we're all kind of agreed that we want to try this cephadm timeout option; we just need, I guess, a good reproducer to test this on.
E: Another option is we could artificially create a stuck lock — yeah, artificially hold the global lock.
A: But yeah, I think if we can get some way to reproduce that, and someone can test that this timeout actually works and it comes back, then we can implement the health warning based off of that — that'll be a good starting point at least. As for the actual value we set the timeout to, it sounds like the easy way to start is to set it pretty high by default.
A: Maybe we can try to collect some information on how long these commands actually take, but it's always going to be hard to get a max time on things like these inventory commands and these image pull commands.
A: Yeah, so I guess: go with a configurable timeout based on this cephadm timeout flag, assuming that all works — so test that — and then we raise a health warning, and as part of the health warning we include information about the option, in case somebody is on a cluster with really slow internet, or maybe they have too many devices or too many disks on a host and it takes too long. And I guess the default for that can be a bit lower — a few minutes, five minutes or so — and they can raise it if they need to.
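
A sketch of what that could look like in the manager module, assuming the standard MgrModule health-check and module-option interfaces; the option name, check name, and five-minute default are hypothetical placeholders drawn from this discussion, not existing cephadm settings.

```python
from mgr_module import MgrModule, Option


class Module(MgrModule):
    MODULE_OPTIONS = [
        # Configurable, fairly lenient default that users on slow hardware can raise.
        Option('cephadm_command_timeout', type='secs', default=300,
               desc='time a remote cephadm command may run before a health warning'),
    ]

    def warn_command_timeout(self, host: str, cmd: str) -> None:
        # Surface which command timed out on which host instead of hanging silently.
        self.set_health_checks({
            'CEPHADM_COMMAND_TIMEOUT': {
                'severity': 'warning',
                'summary': f'cephadm command timed out on host {host}',
                'count': 1,
                'detail': [f'{cmd!r} exceeded cephadm_command_timeout; '
                           'raise the option if this host is just slow'],
            }
        })
```
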
G: Just a question: normally cephadm is launched manually, and sometimes from the Ansible playbooks.
A: So there are a few Ansible playbooks that do very specific tasks: they do some pre-flight stuff, they install some things, they do a purge or remove the cluster. Most of the use of it is going to be from the cephadm manager module, where it deploys it and then runs individual commands with it.
A: Yeah, I wouldn't even worry about it right now if anyone runs into a problem with it. I mean, I don't know what you would do if the purge times out — it's just going to fail anyway, and they'd probably end up in a similar spot: they're going to tell us what's hanging.
A: That sounds good for that topic. I think we said we're going to have a configurable timeout using the cephadm timeout flag, and if that works we'll raise a health warning telling people to maybe raise the timeout, or that something's wrong.
A: If we have an agent, we can check its port: we can send it a quick message, and if we get an acknowledgement back, we know the host is online and we're all good. If it fails, then we're a bit more cautious: we try to reset our connection and then maybe do a normal check-host or something, only if we have to.
D: The question I had is: how quickly would this procedure detect the offline host? Because for the NFS service we wanted to detect that the NFS host failed within on the order of 30 seconds.
A: Or a minute. So if it works, and we reset the connection before running check-host, it's fairly fast. I think the biggest risk in this situation is if, say, the serve loop was already running and it called check-host first, before we got a chance to detect it when one of the agents comes back or anything from one of those threads — then that one would be on the normal timeout.
A: But I'm not sure what to do about that other than just not caching SSH connections, because there's always going to be a risk that it could go offline at the exact moment right before we run this. You never know exactly what's going to happen — for any SSH command we do, there's a risk it'll do that.
C: List-networks and cephadm device ls — having to create four different fresh SSH connections, one for each cephadm command, is extremely expensive, but even for—
A: Could we — if that's the only spot where we're doing a bunch of runs in a row, and I don't know if it is — could we batch it at the start of that and keep the cache through those four commands, then reset it at the end? Or would that still be too much of a performance drag?
A: Yeah — basically everything can be done with SSH, because the agent is still optional, but right now the agent is responsible for collecting metadata on the host. It collects the daemons and the inventory stuff and returns it back to the manager, and then while the agent is active we usually don't run those over SSH; we just avoid it.
A: Over SSH, yeah — but we still need SSH for certain things; we can't deploy daemons with the agents or anything, so.
A: I mean, the reason we're not doing it right now is just because it's not considered a stable, reliable component necessarily — yes, there are a lot of changes going in — yeah, okay. I think it would probably be at least another year, even if we were to get all of this feature parity in; we'd have to make sure it's stable before we'd actually consider getting rid of SSH for anything.
A: Yeah, so we need those two functionalities still there, and technically we could do everything else. But again, that's all just this initial setup stuff you should in theory only have to do once — and I guess during upgrades as well, because you'd have to go upgrade the agents and deploy them on a new binary or whatever — but these would be uncommon operations, little one-offs, so it wouldn't be a performance thing at all at that point.
A: Yeah, I guess we can try to look into that in the future and see if we can get some more feature parity with the agent too, so it can do what SSH can do minus those two things we're talking about. In the meantime, implement this timeout stuff and try to use the agent port for some offline host detection if we can. We still haven't necessarily solved the worst-case offline detection for now, but I'm not sure there is a great solution currently.
C: This is going to be my last orchestrator weekly.