From YouTube: Ceph Tech Talk: A Different Scale, Running Small Ceph Clusters in Multiple Data Centers (2020-07-23)
A: Okay, so we're here to talk about small Ceph clusters. This presentation is called A Different Scale: Running Small Ceph Clusters in Multiple Data Centers, and my name is Yuval.
This is a short summary of what we'll be talking about. I'll tell you a little bit about me, then we'll talk about why we should be talking about small Ceph clusters in the first place. Then I'll introduce our use case, and then I'll answer the question "is this overkill?" and some other questions that I often get from managers of sorts.
So, a little about me. I'm 35 and I have 17 years of experience working with different types of systems, including multiple storage systems at different scales. I was born in Israel and started my career during the compulsory IDF service as a system administrator. I'm currently living and working in Berlin, Germany, happily employed as a system engineer with a fairly broad work spectrum, which means I'm working on other things, not just Ceph. I have been working with Ceph since 2017.
A
So
the
first
version
that
I
touched
in
production
was
jewel
one
more
thing,
though
I
tend
to
think
outside
the
box,
but
I
try
to
get
to
know
the
box
pretty
well.
So why are we talking about small Ceph clusters? What's so interesting about them? The thing about Ceph is that it scales very well, and most people like talking about how large their clusters are and about their IO performance. That's understandable, because that's what people like to hear about and what they're accustomed to. When I started working with Ceph, people were excited about Kraken.
The first thing that comes to mind when someone says Kraken is something like "release the Kraken": some huge monster that wants to come and devour you whole. So it's logical to think of Ceph as something very, very big. However, sometimes Ceph can just be cute, and the new t-shirt kind of proves that.
Small clusters are not just cute, though; they're practical and definitely have their place. I believe small clusters have hidden benefits most people are not aware of. And of course, you always have the option to add disks or nodes as needed, so you can go from running a small cluster to running a larger cluster fairly easily.
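To give a feel for how little ceremony that expansion takes, here is a minimal sketch (the device path and hostname are made up):

    # Turn a fresh disk on an existing node into a new BlueStore OSD
    ceph-volume lvm create --data /dev/sdb
    # On Octopus with the orchestrator, the same thing from the admin node:
    ceph orch daemon add osd node4:/dev/sdb
    # Ceph then rebalances onto the new OSD on its own; watch progress with:
    ceph -s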
Now let's talk about our use case. We have several Ceph clusters running, always in different locations. We're currently running Nautilus in all locations, and we're planning on upgrading to Octopus in a few months. The clusters hold data and resources for orchestration and other services. Four-node clusters with three-way replication: that's what we're running in all data centers at the moment. We have three monitors and three managers in each cluster, running on the same physical machines, and we have KVM-based VMs that are also running on the same machines.
That's why we need to spec them accordingly: most of our clusters have around 256 gigs of RAM and 64 CPU cores, and we use Ceph as a back end for the RBDs, which are directly attached as virtual disks to those VMs.
So now to the question: is this overkill? I mean, the questions that we usually get from managers are: do we really need four nodes? Aren't two or three enough? Why are we spending so much money on expensive disks? Aren't we wasting space with three-way replication? Let's just mirror instead, because mirroring is cheaper, and some people might argue that it's more performant.
This sounds difficult to maintain; is this an over-engineered solution that wastes money and time? I'm here to say that the answer to at least most of these questions is no. The advantages of running this kind of setup, well, there are several of them. The first one is that you avoid split brain with three monitors, which is best practice.
When we started running Ceph, we had two-node clusters with two monitors, and we ran into quite a few issues with split brains, which were caused by network outages or other network issues. This is not something that is recommended, especially for production.
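The underlying reason is that the monitors need a strict majority to form a quorum: with two monitors, losing either one (or the link between them) stalls the cluster, so an even count buys you nothing, while three monitors tolerate the loss of one. Checking quorum health is quick; a sketch:

    ceph mon stat                              # one-line summary of the quorum
    ceph quorum_status --format json-pretty    # full detail on who is in quorum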
A
You
can
create
multiple
data
pools,
so
you
can
use
a
combination
of
hdds
and
ssds
ssds
and
nvmes,
or
even
like
all
three
of
them
in
the
same
cluster,
to
save
money
on
cold
storage
and
split
your
data
in
a
way
that
makes
sense
to
you.
So
you
can
put
some
of
your
data
on
ssd
some
of
your
data
and
hdd
somewhere,
that
on
nvme
that
makes
sense.
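Since Luminous, OSDs are automatically assigned a device class (hdd, ssd, nvme), and pools can be pinned to a class through CRUSH rules. A minimal sketch, with hypothetical pool and rule names:

    # One rule per media type, replicating across hosts within that class
    ceph osd crush rule create-replicated on-ssd default host ssd
    ceph osd crush rule create-replicated on-hdd default host hdd
    # Point each pool at the media that suits its data
    ceph osd pool set fast-pool crush_rule on-ssd
    ceph osd pool set cold-pool crush_rule on-hdd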
You also have an adjustable replication level. If you really, really have to (this is not something that we recommend), and you're stuck without additional disks, you can temporarily go down to two-way replication. Of course, that would require your cluster to rebalance, but with small clusters that don't have a lot of IO, that's usually not a major pain.
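For completeness, the knob in question is the pool's size; to repeat, this is an emergency move, not a recommendation (pool name hypothetical):

    # Temporarily drop to two-way replication; this triggers a rebalance
    ceph osd pool set mypool size 2
    # Note: with min_size still at 2, any single OSD failure now blocks IO
    # on the affected PGs, so don't linger here. When disks arrive:
    ceph osd pool set mypool size 3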
A
However,
we
strongly
recommend
against
this.
This
is
just
something
that
you
could
theoretically
do
if
you're
in
a
jam,
so
don't
try
this
at
home.
If
you
don't
have
to-
and
I
mean
you
need-
you
need
at
least
four
nodes
if
you're
going
to
run
a
three-way
replication
cluster,
it's
just
common
sense.
I
mean
I'm
not
going
to
expand
on
that
now,
as
I
mentioned
before,
it's
the
cluster
is
easily
extendable
stuff
scales
according
to
your
needs,
and
we
should
be
assuming
that
the
company
will
grow
and
not
shrink.
So if we start with four nodes, which is the bare minimum as I see it, at least, we can grow from there, and that's fairly easily done with Ceph. It's easily maintained; it saves many hours and frees up the engineers to work on other things. And engineers usually need to do other things, because, well, we're expensive and we like to have challenges, not just work on one thing all the time. Ceph is free and open source, so you can use it in production without any licensing fees, and it has great performance.
So in the long run it saves money, and the pig is happy. Now, running multiple small clusters versus one large cluster: automation is cool, and I'm not against automation, but it becomes slightly more optional on smaller clusters. Most of our production clusters have been set up manually, and that was a conscious choice, which of course doesn't really scale. If you have a really large cluster, with a lot of nodes running in parallel, you can't really micromanage them; that does not work. But on small clusters you can do that, if you need to, want to, and choose to.

Manual major upgrades take about 40 minutes. The last upgrade that I made, from Mimic to Nautilus, for example, took about 40 minutes per cluster, which is quite good in my opinion. And remember that that upgrade includes some changes to how things are done.
The config moved to the monitors instead of the config files, we were forced to move to ceph-volume instead of ceph-disk, and we had to activate messenger version two (msgr2). So everything had to be done within a fairly short period of time.
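For reference, those three Nautilus-era changes each map to a handful of commands; a sketch:

    # Move options from /etc/ceph/ceph.conf into the central config store
    ceph config assimilate-conf -i /etc/ceph/ceph.conf
    # Let ceph-volume take over OSDs originally created with ceph-disk
    ceph-volume simple scan
    ceph-volume simple activate --all
    # Enable the msgr2 wire protocol once every daemon runs Nautilus
    ceph mon enable-msgr2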
A
When
we
were
doing
the
upgrades
and
40
minutes
was
fairly
good
time
and
I
didn't
feel
like,
I
was
rushing
things
too
much
having
several
services
running
on
the
same
nodes,
so
that
monitors
and
osds
running
on
the
same
nodes
leads
to
better
understanding
of
their
limitations
and
how
they
interconnect
and
forces
you
to
be
more
careful
because
you
need
to
take
that
into
consideration
when
you're
rebooting
things
test
upgrades
and
stay
current,
deploy
in
same
scale,
test
setup
and
or
less
critical
clusters
and
watch
the
impact
before
upgrading
the
more
clusters,
because
you
want
to
make
sure
that
if
you
you're
going
with
a
newer
version
of
seth
that
it's
compatible
with
whatever
you're
running
and
whatever
you're
doing
with
it.
Different clusters require different strategies. We decided, well, we decided but were also kind of forced, to move to SSD, due to the low performance of HDDs and due to the fact that SSD is now the industry standard; you don't see a lot of clusters running HDDs anymore. In the smaller clusters, where we wanted to get rid of the HDDs directly, we just added the SSDs, waited for the cluster to rebalance, removed the HDDs, and waited for the cluster to rebalance again.
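That swap-and-rebalance cycle boils down to something like the following sketch (device path and OSD IDs are made up):

    # 1. Add the new SSDs and let data flow onto them
    ceph-volume lvm create --data /dev/sdc
    ceph -s                    # wait until all PGs are active+clean
    # 2. Drain an HDD OSD; repeat per disk, letting rebalance finish in between
    ceph osd out osd.7
    # 3. Once it is empty, stop it and remove it for good
    systemctl stop ceph-osd@7
    ceph osd purge 7 --yes-i-really-mean-it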
On the larger clusters, something that we hit during those migrations, due to the HDDs being so slow, was blocked IOs, so we actually had to reduce osd_max_backfills and osd_recovery_max_active.
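Both are ordinary OSD options; with the central config in Nautilus they can be tuned cluster-wide on the fly. A sketch, with illustrative values:

    # Throttle recovery traffic so client IO survives
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    # Pre-Nautilus equivalent, injected into the running daemons:
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'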
A
Now
replacing
nodes,
we
were
going
from
fairly
old
amd
opterons
to
intel
xeons
and
obviously
we
saw
a
big
boost
in
performance
for
those
of
you
who
are
also
running
vms
or
have
a
similar
case
to
ours.
We
had
to
obviously
turn
off
the
vms
and
reconfigure
them
for
the
new
cpus.
That's
a
bit
less
relevant
here.
So once you've done that, you have to rename the new node to the name of the old node and install the services. Then, if you have OSDs in the new node, you add the new OSDs in. Once you have all nodes operational in the cluster, that is, once you've done the previous steps on all nodes, you can remove the old OSDs and the old nodes from your cluster.
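Tearing the old nodes out once they're empty is a short sequence; a sketch with a hypothetical hostname:

    # Remove the now-empty host bucket from the CRUSH map
    ceph osd crush rm oldnode1
    # If the old node also carried a monitor, drop it from the mon map
    ceph mon remove oldnode1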
A
Now
one
more
thing
that
we
hit
or
another
true
story
is:
we
had
to
reduce
pg
num
free
pre-novels.
A
We
had
an
old
cluster
with
a
lot
of
small
osds,
which
we
wanted
to
replace
with
the
fewer
larger
ones.
Now
this
wasn't
out
yet
so
we
were
running
luminous,
pg
num,
the
number
of
placement
groups
in
a
pool
wasn't
reducible
or
auto
adjustable,
that's
something
that
was
actually
introduced
in
those
thank
you
guys
and
luckily
we
didn't
have
any
critical
services.
So
what
we
did
was
we
turned
off
all
services
and
renamed
the
old
pool.
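A plausible continuation, and the classic pre-Nautilus workaround, is to recreate the pool at the target pg_num and copy the data across; a sketch with hypothetical names (note that rados cppool has caveats, e.g. around snapshots, so per-image rbd cp is often safer for RBD pools):

    # Pre-Nautilus: recreate the pool smaller
    ceph osd pool rename rbd rbd-old
    ceph osd pool create rbd 128 128            # new pg_num / pgp_num
    ceph osd pool application enable rbd rbd    # Luminous+ pool tagging
    rados cppool rbd-old rbd                    # or 'rbd cp' per image
    ceph osd pool delete rbd-old rbd-old --yes-i-really-really-mean-it
    # Nautilus and later: just shrink it, or let the autoscaler decide
    ceph osd pool set rbd pg_num 128
    ceph osd pool set rbd pg_autoscale_mode on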
So, migration between clusters: we sometimes need to migrate services between clusters in different locations; that's just the use case that we have. The built-in mirroring capabilities are nice, but setting up mirroring in this case would be overkill, since we just want to move one or two RBDs, which is why we chose to use export/import over the wire.
A
Some
of
you
might
be
familiar
with
zed
standards,
which
is
a
lossless
data,
compression
algorithm
developed
by
facebook,
and
that
has
gained
the
traction
in
recent
years
due
to
the
compression
rate
and
speed.
So
we
decided
to
use
that
most
of
the
time
we
would
simply
want
to
copy
an
rbd
and
ensure
it's
created
in
the
target
cluster
with
the
new
default
features.
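In spirit, that's a pipeline like the following sketch (pool, image, and host names are placeholders). Because a plain import creates a brand-new image, it picks up the target cluster's default RBD features:

    # Stream an image to another cluster, zstd-compressed on the wire
    rbd export mypool/myimage - \
      | zstd -T0 \
      | ssh target-host 'zstd -d | rbd import - mypool/myimage'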
Then there's a variant that we only had to use if we had snapshots that we wanted to migrate along with those RBDs. The upside or downside, depending on how you view things, is that the RBD features are also maintained if you're using export format two to do the migration.
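Export format two is the --export-format 2 flag, which carries snapshots (and image properties) along in the stream; a sketch:

    # Format-2 stream: snapshots and image features travel with the data
    rbd export --export-format 2 mypool/myimage - \
      | ssh target-host 'rbd import --export-format 2 - mypool/myimage'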
A
So
you
have
to
keep
that
in
mind
and
another
pro
tip.
You
can
use
xxh
sum
64
to
verify
that
the
rbds
are
identical,
and
this
is
something
that
we're
going
to
see
in
the
next
slide,
exactly
which
commands
we
were
using.
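The shape of that check, with hypothetical names, is to hash the image bytes on both sides and compare:

    # On the source cluster ...
    rbd export mypool/myimage - | xxhsum    # xxHash64 is the default
    # ... and on the target cluster; the two digests must match
    rbd export mypool/myimage - | xxhsum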
A
Yeah,
so
that
was
basically,
my
presentation
was
went
fairly
quickly,
but
we
started
fairly
late.
So
that's
okay.
I
guess
we
will
move
to
the
questions
part
of
the
presentation.
Okay,
I'm
reading
the
chat
right
now
anthony.
That's
that's
great
that
you're
that
you're
basically
doing
the
same
as
us
I
mean
do
you
have
something
to
share
about
that,
something
that
you
would
like
to
share
about
that.
B: …

A: Yes, exactly. We didn't want to, I mean, so historically, that particular cluster, when we did the migration, had fewer disks. When we first set up that cluster, it had three nodes with 12 OSDs each, and we went to four nodes with four OSDs each. So the pg_num was very, very high, and we didn't want to keep going with that high of a pg_num. So...
B: Yeah, you know, I mean, you probably know this, but for everybody else listening: before Luminous 12.2.1 came out, the guidance was a PG ratio of 200, and there was a warning threshold. Sage pushed a change that turned the warning into a hard limit and sort of retconned the target ratio back down to 100. As I understand it, that was to reduce memory usage, to keep people from running out. But in practice, depending on what drives you're using, you can jack that threshold up and happily use a larger ratio. It does mean more memory, and you won't run a very high ratio on hard drives, for example, but SSDs can handle more PGs.
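The hard limit being discussed is mon_max_pg_per_osd; raising it is a single setting, as in this sketch (the value is illustrative, and every extra PG costs OSD memory):

    ceph config get mon mon_max_pg_per_osd     # see the current ceiling
    ceph config set mon mon_max_pg_per_osd 400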
A: Yes, that's correct. I mean, I'm fairly sure that we have reached over 400, now that I'm thinking about it; I'd have to look up the exact ticket where we did it. But I do remember that we got the warning and silenced it at some point. Then we realized that if we went through the procedure, we would go beyond the actual error limit, and we didn't want to risk that. Absolutely.
B: I think I can also give an example, not to steal your thunder, but in a previous life I had... You do that a few times and your PG ratio gets to 9,000. This was my deal; this was under Dumpling. And then think about what happens when the DC loses power.
B
You
end
up.
The
clusters
were
unrecoverable
because
there
was
not
enough
memory
for
the
pgs
to
appear
so
that
that's
that's
one
reason
of
why
why
it
is
good
to
keep
an
eye
on
the
pg
ratio
as
well,
as
you
know,
perform
and
memory.
A: And the first version that we actually ran in production was Cuttlefish, but luckily we didn't have anything as crazy as that.
Ah, okay: silence is golden, so yeah, I guess we're done here, guys. Thank you very, very much for being here and for listening to me ramble on about how we do things. And yeah, I hope that if you're not running... oh, I see that there are some chat messages. Okay, thank you.
You're very, very welcome, of course. Yeah, I mean, if you're not running Ceph yet, or if you're just thinking about implementing small clusters for any kind of use, you can just hit up Mike and he'll connect us, so we could exchange some details and I could try to help out if I can. Aside from that, again, thank you very much for being here, and thank you, Anthony, for the interesting insights into how you guys are running things. And yeah, have a great evening, everyone.