From YouTube: Ceph Orchestrator Meeting 2022-03-22
Description
Join us weekly for the Ceph Orchestrator meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: I was muted, yeah. So we can get started. I have a lot of topics today; maybe we'll add one later. The first thing we have on here is service discovery, and I think you added that one. Do you want to introduce it?
B: That's part of this pull request for adding some new endpoints for Prometheus service discovery. The idea is to be able to get the current scrape configuration over HTTP from outside. This way you can use it for the Prometheus we deploy in the cluster or for an external Prometheus the customer configures, and we get the same configuration for both cases.
B: So in this request, if you look at the changes, you will see that we are adding a bunch of endpoints. Right now the URLs are basically hard-coded, and Ernesto from the dashboard team commented that it would be good, at least, to think about having something for service discovery.
B: This way we can publish our endpoints there, and it will be much easier for clients inside the cluster to discover the different services we are providing and where they are listening.
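For context, Prometheus supports service discovery over HTTP (http_sd_configs), which consumes a JSON list of target groups. Below is a minimal sketch of the kind of payload such an endpoint could serve; the addresses, ports, and label names are illustrative, not the actual endpoints from the pull request under discussion.

```python
# Sketch of a Prometheus HTTP service-discovery payload. All values here
# are illustrative placeholders, not the pull request's real endpoints.
import json

def sd_payload() -> str:
    # http_sd_configs expects a JSON array of target groups, each with
    # "targets" (host:port strings) and optional "labels".
    return json.dumps([
        {
            "targets": ["10.0.0.1:9283"],  # e.g. the active mgr's exporter
            "labels": {"instance": "ceph-mgr"},
        },
        {
            "targets": ["10.0.0.2:9100", "10.0.0.3:9100"],
            "labels": {"instance": "node-exporter"},
        },
    ])
```

A Prometheus scrape job would then point http_sd_configs at the URL serving this payload, so internal and external Prometheus instances consume the same configuration.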
B: And to be honest, we didn't come to a conclusion in that discussion.
B: The problem with that, I remember now, is that to implement this kind of feature in a good way you need name resolution in the cluster, and right now, as far as I know, we are using IPs in cephadm. For example, in the endpoints we publish right now, we publish IPs and ports, and this normally points to the active manager.
A: Okay, so until we have proper name resolution we can't have this sort of more generic feature.
A: Yeah, it's supposed to be that you have one virtual IP, all the requests always go to that one, and then HAProxy handles everything else from there, routing them where they should go.
A: Yeah, which is something we had discussed doing a while back, the possibility of maybe doing that, but it never really came together. There was no reason not to; I think it was just a lot of work, so we didn't have a reason to prioritize it.
B: So you don't have to worry about that. And I think that, while not related to this, it could also help solve scalability issues we already have with the manager, just with the metrics reporting.
B: Right now we only have redundancy at the manager level: we have active and standby, but we don't have load balancing.
A: Multiple active managers, yeah, and then it gets complicated: you have to make sure they're consistent between each other, that they all know everything. There's a whole mechanism for getting the monitors to work like that, so that would be really hard, and I think it would be for the managers as well.
A: Well, in that case, it doesn't seem like we can really implement the generic version right now; that's more of a long-term project. It might be something we aim for in a future release, maybe.
A: Yeah, we would have to start with just plain generic manager HA or something, and then move on to the extra stuff with multiple active managers or whatever. Well, I guess that's more for the metrics thing; it wouldn't be needed for what you're doing with service discovery, which could just use the virtual IP.
A: Okay, that's good. So just to wrap that up: I guess in the meantime we're okay with doing things the way you're doing them right now, using this one endpoint.
A: Yeah, I still want to see if we can get Paul Cuzner to take a look at this; I'll have to see if I can get in contact with him. Other than that, I think the strategy in general is fine for now, and we can maybe change it once we have the work on the manager done. All right, good. Next topic, or do you have something else?
B: Nothing else from me, thank you.
A: The next topic we have here is an update on the Rook test failure. I assume this is your thing, Joseph.
C: Yeah, so this issue came up a while ago, actually, and I remember when it first came up I thought it was just a rados issue, because it was the first occurrence of it. Turns out it's been happening a lot in the rados suite too, so Neha asked me to take another look. What I've decided to do is just remove all the orchestrator commands being tested from the suite, basically remove the orchestrator from this test suite, and see if it still breaks. The idea behind this is that the Rook orchestrator isn't being maintained at the moment, so it's okay to remove it from the test suite, because we can just assume it's failing.
C: Oh yeah, no, just for the Rook test suite in the Ceph QA. The idea is to remove that, because it looks like right now it's failing due to the Rook orchestrator being tested. If we remove that, we might get a better idea of why this is failing.
C: I think it's just better to clean up the tests so that this doesn't happen in the future.
A: Okay. I do know there's a whole discussion with Travis about the future of the Rook manager module and what's going on there. I don't know if they're using it right now, but if nobody's using it at the moment, then I guess we can remove it from testing; nobody's going to fix it if something's broken. Yep, I guess that's it.
A: All right, thanks. Okay, we don't have it on here, but I thought maybe we could talk a little more about HA NFS. Did you have a chance to test with all three of those pull requests we were talking about last week?
D: I did, actually, and the combination of all three worked out much better this time. What I found is that the offline host detection is fairly quick when I unplug the network cable. However, detecting that the node has come back online is fairly slow; that again takes about 10 minutes.
D: The other interesting thing is that the NFS daemon was rescheduled. However, it took a bit longer than I was hoping, more in the time frame of about a minute and a half to two minutes, and in either case I crossed the grace period.
D: But when I did finally redeploy the NFS daemon, I had clients connected with hard mounts, just reading and writing to a file, and they were able to resume. It took a little longer than one might hope, but all in all the failover time was a few minutes, which is certainly better than what we were doing before. The other issue I encountered is that one of my MDS daemons was co-located on the same box that went down with the NFS, and it was not rescheduled, so I had to manually work around that. In a very small cluster, say two or three nodes, it seems like we need to reschedule some of these other stateless services as well, because there's definitely a dependency there between NFS and MDS.
A: All right. I think even at this point, with the work from that one pull request that's supposed to move the NFS daemons, if we just add the MDS to the list it has, it would do that, and then we just have to also have it checked for that offline host.
D: Right, and to be clear, all of my testing was without the agent. When I turned the agent on, I had trouble with the agent reconnecting when the node comes back online, so I just avoided that whole branch of the code.
A: Yeah, I don't think the agent would make it much faster, because with that extra loop for the hosts I don't think it's going to beat that time. It helps with things like checking whether daemons are down, but that's not really what we care about in this case; it's really just host-offline detection. So I guess the first question is: can we make it faster somehow?
D: You know, the more I was thinking about it, we're really just playing with timeouts and polling, and I think the only way to do better is to implement a proper heartbeat, and for that we might need to use the agent or something like that.
D: Yeah, I agree. And the other thing to be clear about: this is using the timeout through asyncssh. I don't think we have a remoto-based solution, so do we want to approach the backport of asyncssh to Pacific, or...?
A: First I want to see if I can do it with remoto, because the one thing we really need is that keepalive on the SSH requests; maybe we can do something similar for remoto. I was going to see first if that's possible, and only backport asyncssh as a last resort if there's no way to do it with remoto.
A: I'd like to avoid it if possible, and I'll have to check whether remoto has a keepalive request that lets us do something like that.
A: Yeah. It was originally written with the idea that the agent changes were necessary for the offline host detection, but we've moved away from that now, so there's no need to have it all as one thing. Honestly, the agent parts can just be removed; I don't think we're going to need them, at least right now. Even if we do the host detection using the agent, it sounds like we're going to try this heartbeat strategy.
A: Yeah, we don't want the agent stuff in the backports, right. All right, it sounds like it's almost there, just a little bit slow, but it was able to reconnect eventually.
D: Yeah. In the initial test I did, where it redeployed just the NFS daemon, clients reconnected within a minute or two, which of course exceeded the grace period, but it still worked out okay. The case with the MDS was much longer; I think it took five minutes or more before the clients were able to re-establish a connection. But in both instances it did actually work out; the clients were just hung in the meantime, since they were mounted.
A: Yeah, just a little longer than we'd like. With the MDS one, at least, we could probably just add something similar for the MDS, where we just move them. It should be even easier: I don't think we have to fence anything, we just have to move it. I'd have to check that, actually.
E: Yeah, Michael, how was the MDS set up? Was it active/standby, or were there two active MDS daemons? What was the setup of the MDSs?
D: Well, it was a fairly small cluster, I think only three nodes, so I had one active and one standby. That's why one went offline: I lost one of them.
A: Yeah, I guess we should just try that, because that would be a more proper test, and if it works that way, we can maybe push off the MDS HA stuff and say anyone who's very serious about this and has a large cluster will be okay, and we'll fix it for the tiny clusters. And then there's the other part of it, which was the issue of not having enough hosts to reschedule them on.
D: It didn't work, no. I had a separate conversation about this elsewhere; I believe it has a lot to do with the NFS protocol and maintaining consistency between the ranks. So if we have, say, two NFS daemons, we need to bring back two daemons with the same ranks; the NFS service can't continue with just one of them present.
D: We have to redeploy all of them. And if you watch the grace DB, even when one of the ranks goes down, it's not reflected in the grace DB, because nothing is actively manipulating it; that only happens during an add or remove. So as far as the other NFS daemon is aware, its peer is still active, even though it isn't.
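For reference, the grace database being described is the one NFS-Ganesha's rados_cluster recovery backend stores in RADOS, and it can be inspected with the ganesha-rados-grace tool. A minimal sketch follows; the pool and namespace values are placeholders for whatever the deployment actually uses.

```python
# Hedged sketch of dumping the NFS-Ganesha grace DB; pool and namespace
# are placeholders, not values taken from the cluster discussed here.
import subprocess

def dump_grace_db(pool: str, namespace: str) -> str:
    # "ganesha-rados-grace dump" prints the current and recovery epochs
    # and each node's flags (N = needs grace, E = enforcing grace).
    return subprocess.check_output(
        ["ganesha-rados-grace", "--pool", pool, "--ns", namespace, "dump"],
        text=True,
    )

# print(dump_grace_db(".nfs", "mycluster"))
```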
A: Yeah. So do you think we should try to get that in right now as well, or is it something you could work around? If you have enough hosts to do it on, it should work.
D: Yeah, it's not convenient in a small cluster, but I think the best thing we could do is just try to reschedule and then raise a health warning or something if we can't, and maybe also consider fixing the port conflict issue so we can co-locate two ranks on the same server.
A: All right, and we can probably get that in fairly soon and have it as the solution for the time being: if you have enough hosts it'll work, and if you have your MDS elsewhere it'll almost work. Apparently it's still a little bit slow; it needs to be about 30 seconds faster, and maybe we can see if there are ways to do that. I'm trying to think of where all the time would come from. Worst case it maybe takes 40-something seconds to detect the host offline, and then we have to redeploy, potentially multiple NFS daemons, though probably only one, since only one host went offline. I want to add up how it gets to a minute thirty or two minutes and see if there's anywhere we can shave some time off.
D: I would be interested to see whether, if we knew for sure it didn't need a daemon refresh or device refresh or anything, we could get it to happen within a minute thirty.
A: Let's see if we can narrow down what we'd need to be able to do that. But yeah, that's tough. I hadn't even thought about the fact that the refresh can take so long that it doesn't matter how fast you do everything else.
A: Yeah, you'll probably have it up in a couple of minutes at least, that is, if it has to do all the refreshes. If you got fortunate with the timing, so it didn't need to refresh anything and it detected the host in the minimum time, I think you could still do it in a minute or so.
A: There's a pull request that's open with a thread that just loops through and checks every 20 seconds, and on top of that there's a keepalive on the request; I think it's 21 seconds.
B: Let me try to link the pull request, you know the one I'm talking about. I remember this pull request you posted; you have a timeout of seven seconds.
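For reference, asyncssh supports this kind of keepalive natively. A minimal sketch follows, assuming asyncssh 2.x; the 7-second interval echoes the timeout just mentioned, and the probe count of 3 is an assumption chosen to illustrate how a roughly 21-second detection window could arise, not the actual values in the pull request.

```python
# Hedged sketch of asyncssh connection keepalive (assuming asyncssh >= 2.x).
# The interval and probe count are illustrative, not the PR's real values.
import asyncio
import asyncssh

async def run_with_keepalive(host: str) -> None:
    # keepalive_interval sends a probe on the connection every N seconds;
    # after keepalive_count_max unanswered probes, asyncssh drops the
    # connection and raises ConnectionLost, which a caller can treat as
    # "host offline".
    async with asyncssh.connect(
        host,
        keepalive_interval=7,    # probe every 7 seconds
        keepalive_count_max=3,   # give up after ~21 seconds of silence
    ) as conn:
        result = await conn.run("true", check=True)
        print(result.exit_status)

# asyncio.run(run_with_keepalive("ceph-host-1"))
```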
A: Then, when the thread goes to check, at worst it'll hang for the 21 seconds, say if the host went offline right before the check. Where's the other one, the one that does the...
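Putting those numbers together as a rough sanity check; the redeploy figure below is an assumption, not a measurement from the testing discussed here.

```python
# Rough worst-case decomposition from the numbers discussed above: a
# 20-second check loop plus a 21-second keepalive hang gives the
# "40-something seconds" detection estimate mentioned earlier; the
# redeploy time is an assumed placeholder.
poll_interval = 20       # seconds between offline-host checks
keepalive_hang = 21      # worst-case wait before a dead host is noticed
detection_worst = poll_interval + keepalive_hang   # ~41 s to detect
redeploy_estimate = 60   # assumed time to reschedule the NFS daemon
print(f"worst case ~{detection_worst + redeploy_estimate} s end to end")
```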
A: It handles all of the refresh stuff we were talking about: it refreshes the devices, the daemons and so on for each host, and that stuff takes a really long time. Among other things it runs ceph-volume inventory, which is slow, so if you have say 50 hosts, that's going to take forever, and every action we want to do in the serve loop gets delayed because we're waiting for that. That's where the agent comes in.
A: I mean, the solution we're trying right now is supposed to be for Pacific, which won't have the agent, and also the agent just isn't stable yet. We don't want our offline host detection to be reliant on an unstable component.
A: Yeah, once the agent has heartbeats and can be the more general mechanism, we could potentially even remove the offline host watcher thread that's being implemented. It should be faster, and it would handle both the refresh stuff and the offline host detection. So it would be good to be able to do all of that quicker.
A: So we could actually repurpose that thread if the agent had a proper heartbeat and we were confident in it. That could be future work, more for the next releases. I guess that's sort of where we are with HA NFS: it sort of works, but it's a little bit slow. If we ignore the MDS case, we're talking about two minutes to get from an offline host back into a working state, assuming you have enough hosts to reschedule the NFS daemons on. That's an okay state, better than where we were a few weeks ago.
A: So we'll see if we can backport those things, and then we'll have a decent solution in Pacific, and then we can try to work on a faster one over time.
A: We could just handle it the same way as the NFS one, I think, and then it would sort of be all right: we just reschedule the MDS daemons the way we do the NFS ones, because right now we're not moving them at all. That's part of the problem: we're relying on the standby MDS to come up after the active one goes down, which seems to take a while.
E: I'm on the CephFS team, so I work with that side of things. Maybe setting up this active/standby will help, and there are also config settings to lower the failover time; we can look at those and figure out what the default settings are. We could also think about having multiple active MDSs, whatever makes sense. All of this is targeted for Pacific, right, and for the OpenStack use case?
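The specific settings aren't named in the meeting; two commonly cited knobs for shortening MDS failover are standby-replay and the MDS beacon grace, sketched below with illustrative values. The filesystem name is a placeholder, and these are examples of the kind of tuning meant, not necessarily the team's recommendation.

```python
# Hedged sketch of two MDS failover knobs; values and the "myfs" name are
# illustrative only.
import subprocess

def ceph(*args: str) -> None:
    subprocess.check_call(["ceph", *args])

# standby-replay keeps the standby MDS tailing the active MDS's journal,
# which typically shortens takeover time after a failure.
ceph("fs", "set", "myfs", "allow_standby_replay", "true")

# mds_beacon_grace is how long an MDS may miss beacons before the mons
# mark it laggy and promote a standby; lower means faster failover but
# more risk of spurious failovers under load.
ceph("config", "set", "global", "mds_beacon_grace", "15")
```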
A: Yeah. So if you have two NFS daemons, they use the same port, so we can't put them on the same host right now, which means we're limited. Say you have a setup with three NFS daemons, and we want to reschedule them when one of the hosts goes offline, but the placement only allows three hosts. Then we can't do anything, because we can't put two daemons on the same host and there's nowhere else to put them. If we fix that so the port conflict doesn't happen anymore, we could, say, put two of them on one host and one on the remaining host, and the service could still stay up even when there aren't enough hosts for one each. Again, that's future work we want to do in the short term.
A: All right, that was the last topic I had in mind. Does anyone have anything else they want to talk about here?