From YouTube: Dynamic multi-cluster management with Rook for cloud native IaaS providers for the private clouds
Description
Presented by: Joachim Kraftmayer | Clyso
Over the last few years, we have been gaining experience with Rook in production. One of our challenges was to implement dynamic resource management between 50+ Ceph clusters. Kubernetes events dynamically and fully automatically distribute loads and capacity between Ceph clusters. This is done by removing single or multiple Ceph nodes from Ceph clusters while ensuring data integrity at all times. In the next step, the released Ceph nodes are integrated into other Ceph clusters as needed.
Okay, first I have a short introduction for myself. I mean, it's the first time for me in the U.S. in a long time; I'm more focused on Europe. I'm the Ceph ambassador for the DACH region and a member of the Ceph Foundation board. I was previously a SUSE Enterprise principal consultant, and I also worked for Red Hat as an enterprise consultant in Europe.
Sorry, is this better now? Okay, good. So, the company was founded in 2010, focused on infrastructure and platform as a service, based in Munich, and we are proud to be a member of the Linux Foundation, the Ceph Foundation, and also the Cloud Native Computing Foundation. So: dynamic multi-cluster management. First of all, I want to give some key facts, because this project already started in 2020, and I think it will go on for an additional two years.
So the main key points are: no vendor lock-in, open source, scalability, and so on. Pure IPv6. Everything in this project is Kubernetes-driven; microservices are the only workload, also in the core. And it's a project that's coming from the public cloud and going back into the on-premise private cloud. At the moment it has somehow a bit more than 200 petabytes of raw footprint in production, but it's also one of the projects where, at the end, we don't count the usable storage capacity; we just orient on the raw capacity, on how much the production environment will consume or already consumes.
So, the motivation for 50 or even more Ceph clusters, and why we need this dynamic multi-cluster management solution: we have different project requirements. We have exclusive use cases, we have shared use cases, we have failure domains, we have different availability zones, data centers, regions; we have security isolation. I mean, in the talk before... we spent a lot of time investigating how the Ceph CSI is doing encryption, and how Rook is doing this. And the underlay is completely built on, or relies on, Ceph: for the virtualization stack, for out-of-band management, for all the services you need to run a cloud in the Kubernetes way. Yeah, we have so many different requirements. Even this nice database as a service is something; that was a very, very good talk, and I think you could spend at least an hour on it.
So, when we come to the point that the project or even the user requirements are constantly changing... I think we talked about this today: can a client explain to you what kind of performance and what kind of capacity he really needs, and what the outlook is for the next two years? It's quite hard for them even to report:
"Okay, at the moment we have these IOPS or throughput needs, and that will grow", or perhaps it will shrink next year. Even when some of them update their software stack, they also don't have real information about how that will impact the back end. And out of this, it comes to a point where we can do some estimations, where we claim resources for specific customers, for specific projects, and even for the internal infrastructure as well. But we are never right; it's just an assumption, and then we have to adapt it afterwards again. And for this we implemented this dynamic multi-cluster management, so that we can somehow create an elastic back end that can adapt dynamically to the demands of the application stack above. So what does it mean?
One second: what does it mean? What we do is, we have different resource pools that are completely managed by Kubernetes, so we have no control over this. We get different events that tell us: okay, we need resources from this pool, and here one of the Ceph clusters is selected to give some capacity back to a common pool, because someone else is already somehow at the limit. And so we start processes to add and remove Ceph nodes from one cluster and introduce them to another one, completely controlled by Kubernetes. Rook and Ceph have the control over the data and how it should be managed, but the events that are created to distribute the data, or the Ceph nodes... yeah, that's one part.

Yeah, so once again: Kubernetes is informing us what Ceph should do; Ceph takes care of how it should drain Ceph nodes, and informs Kubernetes whether it will be possible, or even not possible, to drain them. Then it gives the nodes back, Kubernetes takes them into account and brings them to the next cluster again.
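As a rough illustration of that event-driven loop, here is a minimal sketch using the Kubernetes Python client; the event reason and the handler are hypothetical placeholders, not the project's actual controllers:

```python
# Minimal sketch: watch cluster events and react when the resource
# pools ask a Ceph cluster to release a node. The event reason
# "CephNodeRequested" and the reaction are hypothetical examples.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for item in w.stream(v1.list_event_for_all_namespaces):
    ev = item["object"]
    if ev.reason == "CephNodeRequested":  # hypothetical event reason
        pool = ev.involved_object.name
        print(f"pool {pool}: start draining a Ceph node for release")
        # -> here the real controllers would kick off the Rook/Ceph
        #    drain workflow described above
```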
So, I think that's not on this slide, but I will explain it later with an example. We also changed some parts of Rook; we also refactored Rook
at a certain point. We also asked: okay, who is responsible for what kind of management? And that's just one example, where we said: okay, we should delegate the responsibility for scaling the Ceph monitors back to Kubernetes. Because when Rook was introduced, there was only the Deployment available to manage replica sets, and I think two years later the StatefulSet was introduced to Kubernetes, and this one can take over, or can guarantee, that the Ceph monitors will be scaled one by one, in the right way.
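A minimal sketch of what that delegation could look like with the Kubernetes Python client, assuming, as in this refactored setup (not upstream Rook, which runs one Deployment per monitor), that the monitors live in a StatefulSet with a hypothetical name:

```python
# Sketch: let Kubernetes scale the monitor quorum. A StatefulSet
# creates and removes pods one at a time, in order, which is exactly
# the guarantee a mon quorum change needs. Names are assumptions.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

apps.patch_namespaced_stateful_set_scale(
    name="rook-ceph-mon",            # hypothetical StatefulSet name
    namespace="rook-ceph",
    body={"spec": {"replicas": 5}},  # grow the quorum from 3 to 5
)
```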
So, and this is the part I was talking about before: we get a Kubernetes event for adding nodes. We add the nodes to the CRUSH staging root, do the OSD tests, do the benchmarking, do the classification, because we have different fast and slow device or hardware storage classes in Kubernetes there, and then we wait, or check, whether the cluster load is below a specific threshold. And then we move the one node, or the multiple nodes, in CRUSH from the staging root to the production root, so that the customer experiences no difference in latency, IOPS, or throughput delivery.
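That staging flow can be sketched with the stock ceph CLI; the bucket and host names here are examples, and the real tooling is driven by the Kubernetes events described above:

```python
# Sketch of the CRUSH staging flow with plain `ceph` commands.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# One-time: a separate CRUSH root that no pool's CRUSH rule uses,
# so OSDs placed under it hold no production data yet.
ceph("osd", "crush", "add-bucket", "staging", "root")

# A newly handed-over node lands in staging; the OSD tests,
# benchmarking and device classification run against it here.
ceph("osd", "crush", "move", "node07", "root=staging")

# Once the tests pass and cluster load is below the threshold,
# promote the node into the production hierarchy; CRUSH then starts
# rebalancing data onto it.
ceph("osd", "crush", "move", "node07", "root=default")
```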
A
And
also
example,
how
this
is
implemented.
Is
that
how
you
remove
a
node
from
from
our
existing
root
cluster,
that
you
say,
Okay
I
have
the
oval.
The
kubernetes
controllers
told
you
okay,
we
need
a
fast
or
slow
note
from
this
cluster.
Please
free
it
up.
Then
Rook
is
verifying
the
user
capacity,
the
use
capacity
and
is
say:
okay,
we
have
enough
capacity
in
this
cluster.
We
can
give
it
back
to
to
you
then
the
next
step.
It's
very
do
the
verification
of
quality
of
service
requirements.
We also have here an extension to Rook, so that every RBD image has an IOPS and a throughput definition. Before we introduce the cluster, we do a basic test, or load test.
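Per-image IOPS and throughput limits of this kind can be expressed with librbd's built-in QoS settings; a small sketch, where the image name and limits are examples (the talk's extension lives in Rook, this just shows the underlying knobs):

```python
# Sketch: pin an RBD image to 200 IOPS and 50 MB/s using librbd's
# per-image QoS configuration options.
import subprocess

image = "rbd-fast/customer1-disk0"  # pool/image, example name

subprocess.run(["rbd", "config", "image", "set", image,
                "rbd_qos_iops_limit", "200"], check=True)
subprocess.run(["rbd", "config", "image", "set", image,
                "rbd_qos_bps_limit", str(50 * 1024 * 1024)], check=True)
```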
We try to find the boundaries of the existing cluster that we initially deploy, and then we can somehow create an estimation of how many RBD images of a specific storage class, say 100 IOPS, or 200 IOPS and 50 megabytes of throughput, we can deliver. And when we remove a node in this case, then we also have to reduce the number of images that we can deliver with a specific quality-of-service requirement. In the next step, we also verify that we can still handle the failure domain.
If you have, for example, a cluster deployed across multiple availability zones, then capacity-wise you could remove a node or multiple nodes, but perhaps you could not compensate for the failure of one whole availability zone anymore. That would also be taken into account in this process, and perhaps the removal would be denied.
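A toy version of that check; the policy shown (survive the loss of any single availability zone after the removal) is my reading of the talk, and all numbers are examples:

```python
# Toy failure-domain check: after freeing a node, the data must still
# fit in the surviving AZs if any one whole AZ fails.
def can_remove(az_capacity: dict, used: float,
               az: str, node_capacity: float) -> bool:
    caps = dict(az_capacity)
    caps[az] -= node_capacity  # capacity left after freeing the node
    # usable space in the worst case = everything minus the largest AZ
    worst_case = min(sum(caps.values()) - c for c in caps.values())
    return worst_case >= used

# Example: three AZs with 100 TB each, 150 TB used. Removing a 30 TB
# node from az1 still leaves 170 TB outside the biggest AZ -> allowed.
print(can_remove({"az1": 100, "az2": 100, "az3": 100}, 150, "az1", 30))
```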
Then, in the same way, we also say: okay, you perhaps have high usage on this cluster at the moment, and it would not be a good idea to put even more load on the cluster at this point; so we just wait.
Perhaps a day, or even a week, or we have a specifically defined maintenance window; then the load goes down, and then it's a good point to remove the nodes from this production cluster and, at the end, when they are drained, give them back to Kubernetes. And drained means that we really move the data away, so that the replication is always at three, or the erasure code is always completely fulfilled, so that we never end up in a situation where perhaps a node reboots, or something like this, and then you can only... yeah.
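In stock Ceph terms, draining a node and knowing when it is safe to hand it back can look roughly like this; the OSD ids are examples, and I am assuming the usual reweight-and-wait pattern rather than the project's exact tooling:

```python
# Sketch: drain a node's OSDs, then wait until every PG is
# active+clean again before releasing the node to Kubernetes.
import json, subprocess, time

def ceph_json(*args):
    out = subprocess.run(["ceph", *args, "--format", "json"],
                         capture_output=True, check=True, text=True)
    return json.loads(out.stdout)

for osd_id in (12, 13, 14):  # OSDs on the node being released
    subprocess.run(["ceph", "osd", "crush", "reweight",
                    f"osd.{osd_id}", "0"], check=True)

while True:
    pgmap = ceph_json("status")["pgmap"]
    clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
                if s["state_name"] == "active+clean")
    if clean == pgmap["num_pgs"]:
        break  # replication/EC fully restored; node holds no data
    time.sleep(30)
```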
So, what we also did in this time: we extended Rook around this question of data awareness, or, as I've always said: who is the leader at the end, who makes the decision about whether and how you can manage a Ceph node in Rook. We also removed some features, like the PG autoscaler, or disabled them.
Together with the customer, we also agreed at the beginning that they should not use CephFS, if possible. I think that will perhaps change in the next two years; the requirement will come up again and again that they will need CephFS. And we also agreed that they will handle the distribution of objects via RGW on the higher layer, not in the back end of Ceph itself.
What we also did: we extended the health states of Ceph. At this point, we have to differentiate between data replication, data health, and, for example, the status of the services. For this, we also have to suppress some health states of Ceph. Just a simple example is the crash warning, which we have to suppress because it's not meaningful in a setup like this. So what we did is, okay:
we set the warning for the crash topic to ignore, to suppress it, and just send the events to a central facility that will analyze what happened in each Ceph cluster.
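With stock Ceph, suppressing the crash warning while still collecting the reports could look like this; the central facility is the project's own tooling, so it is only stubbed out here:

```python
# Sketch: mute the RECENT_CRASH health warning cluster-wide, but keep
# shipping the crash reports to a central analyzer.
import json, subprocess

def ceph(*args):
    return subprocess.run(["ceph", *args], capture_output=True,
                          check=True, text=True).stdout

def forward_to_central_facility(crash):
    # hypothetical stub -- the real setup sends this to a central
    # service that analyzes what happened in each Ceph cluster
    print("forwarding crash", crash.get("crash_id"))

ceph("health", "mute", "RECENT_CRASH", "--sticky")

for crash in json.loads(ceph("crash", "ls", "--format", "json")):
    forward_to_central_facility(crash)
```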
We also extended Ceph, improved it with additional recovery options, so that we can also interact with the OSDs if there is really a problem. And we also extended the RBD metadata in Ceph itself for this project, so that you can store different metadata that's not in the direct context of Ceph, or even of RBD.
An example for the metadata: if you want to use it for a virtualization environment, and you want to add RBD images to a virtual machine, and you always want to have them again in the right order, you need something like a world wide name, so that you can really identify: okay, that's my disk at this position in the virtual machine.
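A small sketch of that idea using RBD image metadata; the key name, the value format, and the image name are my examples, while the talk's actual extension stores this in Ceph itself:

```python
# Sketch: tag an RBD image with a WWN-style identifier so the
# virtualization layer can always attach disks in a stable order.
import subprocess, uuid

image = "vms/customer1-disk0"            # pool/image, example name
wwn = f"naa.{uuid.uuid4().hex[:16]}"     # illustrative identifier

subprocess.run(["rbd", "image-meta", "set", image, "wwn", wwn],
               check=True)

# Later, the consumer reads it back to map the image to its slot:
out = subprocess.run(["rbd", "image-meta", "get", image, "wwn"],
                     capture_output=True, check=True, text=True)
print(out.stdout.strip())
```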
And yeah, like mentioned before, we already improved the RBD encryption for the Ceph CSI, but I think last week we decided to completely re-implement it, because we don't think that the passphrase to decrypt and encrypt the keys, the LUKS keys for each image, should be stored in the Ceph CSI.
So at the moment... or perhaps, no: last year we did a partial production setup that was around about 300 Rook clusters with between 16 and 35 OSDs each, so it's quite small. We tested several improvements in the setup, and it's still running in production.
Three new regions are being built up at the moment, and because of this, the data center failure domain is also still in progress. And also, now, after around about two years, the decision was made that we will not support virtual machines via the standard Kubernetes PV/PVC stack, because it makes the handling much more complicated, and the Ceph CSI is also somehow a limiter on using all the Ceph features that are really available. And I think with the SmartNICs...
yeah, it's still in progress. We had a stronger focus on this, especially also with NVMe over TCP for security isolation.
The one idea was to implement, or to directly use, the NVMe over Fabrics client of the SmartNIC and connect it directly to Ceph with NVMe over Fabrics as a security isolation, and also to hide even the information of the Ceph client from clients that have direct access to the hardware as well.
So what we achieved is that, from a client perspective, you have a hardware node and you see disks in the hardware node, but in reality it is just a mapping by the SmartNIC on this hardware node, where you have an RBD client; the config and the key live on the SmartNIC, which is taking care of this as well. It's one way to go, but it's not decided yet, and it's not 100% production-ready. Good, and yeah.
If you have questions... perhaps I did it a little bit wrong at the beginning, but it's also a quite complex project at the moment, and I have been involved in this for more than two years now, and it's really, really hard to compress this into 30 minutes. So if you have any questions, then please.
Audience: Sorry, maybe I missed this. You said you're not using the CSI driver for the virtualization stack?

As software for a normal Kubernetes cluster, so for client Kubernetes clusters, we will use it, or we are using it. But we also have a virtualization stack that's consuming the Ceph cluster, and the decision, I think, was made last week that we will not use it there anymore, that we have to re-implement it in the end.
We are talking about tens of thousands of virtual machines, and it brings no benefit for the underlay to handle it like this. And there are also, I mean, a lot of other services as well, and one is already taking care of the quality of service, and of how many images are allowed to be created on one single Ceph cluster, and this one can also take over the part that the Ceph CSI normally does. I mean,
from a general overview, it's quite complex: all the Kubernetes events, even how the decision is made where an image will be created in a Ceph cluster. And it's also not that transparent to us where this comes from and what the real use case for it is; so we're a bit blind in the underlay.
Yeah, correct; yeah, partially. I mean, we have to remove a lot of stuff from this, and we also have to extend our services there. I mean, it's not the main part for us as storage consultants; it's more or less the Kubernetes experts who make the decision, and we just support them in how they should do it.
Yeah, I think they tried to stay with Kubernetes the whole way and follow all of this, and then they realized after a while that it is not sufficient, and we have different other problems as well: how many pods you can deploy on one physical server, to get rid of these limitations too. But everything there is more or less related, yeah.
Any other questions? I mean, if you want, I can show you what we are also doing there as well, if you're interested. So, like I said, the NVMe over TCP SmartNIC support with 100 gigabit; and then we also have topics like ECN and DCTCP.
That's all part of the pure IP stack, the IPv6 stack. I think what I also did not mention is that the Ceph clusters and everything else just run with BGP, so no layer 2 anymore. Then we are working on the implementation of ublk with io_uring for RBD, and on backup, daily backup.
It's at exascale. We will improve Rook with the Ceph day-2 operations. We also still have a look at SeaStore and Crimson for the next OSD generation; in parallel, we try to improve the performance of BlueStore as well.
We are thinking about how we can integrate backup strategies into the technology stack, specifically for Kubernetes again. Because if you think about it, all or most of the backup solutions come from the Kubernetes side and always try to back up your whole PVC and all the metadata of your Kubernetes cluster. There are also some better approaches, where you can use the features in the underlay, in Ceph, to make it a bit more efficient, with data export for RBD as an example.
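At the RBD level, the efficient underlay export alluded to here is typically incremental; a small sketch with stock rbd commands, where the snapshot names and the target path are examples:

```python
# Sketch: incremental RBD backup. `export-diff` between two snapshots
# ships only the changed extents, instead of re-reading the whole PVC
# from the Kubernetes side every day.
import subprocess

image = "vms/customer1-disk0"  # pool/image, example name

subprocess.run(["rbd", "snap", "create",
                f"{image}@daily-2024-01-02"], check=True)

with open("/backup/customer1-disk0.diff", "wb") as out:
    subprocess.run(["rbd", "export-diff",
                    "--from-snap", "daily-2024-01-01",
                    f"{image}@daily-2024-01-02", "-"],
                   stdout=out, check=True)
```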
Yeah, and we are still on this topic with the Ceph namespace implementation in Rook upstream; I think that's been open for three years. So we will see if this... sorry, I mean the ticket for this is, I think, three years old, something like that. The last time we requested this, or asked for this, they just asked us: which customer is it, who wants to have this? Yeah.
Yeah... I mean, from what I know; but I will double-check this. Thank you very much, thank you for this. And also, yeah, storage as a service; and we're also working on some CephFS improvements. And we are really interested in the s3gw project as well, in what's going on there, especially what SUSE is doing with this S3 RGW; Patrick is the right name from the... yeah, yeah. Okay, yeah, I just saw it because of last summer at Foster, yeah. So that's it. Good.