From YouTube: Ceph Tech Talk: Ceph at DigitalOcean
All right, hi everybody. I'm Alex Marangone, I'm a senior engineer at DigitalOcean and I work as part of the storage systems team. Our role in the storage systems team is to handle the entire lifecycle, from deployment to decommission, of all the storage back-ends at DO.

As you may already know, DigitalOcean is a large Ceph consumer. We are also part of the Ceph Foundation, and we haven't really done any public talk about our use case, so that's what this is about. Today it's going to be fairly superficial.
We're only going to brush the surface of what we do, what we serve, how we operate it and whatnot. The goal is that the world will hopefully be in a better place in 2022, and so for the next Cephalocon, if it happens, we'll be able to present some talks with deeper dives on our use cases and how we use Ceph. But today we're going to keep it more as an introduction to how we use Ceph at DO.

In terms of agenda, we're going to start with what DigitalOcean is, in case you don't know the company, and then we're going to dive straight into Ceph at DO, presenting some statistics and the products that use Ceph. Then, when I was building the talk, I was really thinking, hey, what should I talk about? I figured I would talk about operations: how we operate Ceph, the kind of processes we have set up, and the automation we have developed around Ceph.
And then there are a few issues that we've noticed, which I call "Ceph gripes". It's not a great term, and it's not a blame game; it's just some stuff we noticed that you may want to put on your radar if you start having to deploy Ceph at some scale. That should take us around 40 to 45 minutes of me talking, which will leave us plenty of time for any questions you may have.
All right, what's DigitalOcean? Very briefly: DigitalOcean is a cloud provider, and the core concept of DO is simplicity. It's simple to provision cloud resources, the UI is simple, it's simple to sign up (surprisingly simple), and all of that. That's really the core concept of DigitalOcean. The first product that was released is what we call the Droplet, which is our name for a virtual machine, and the main attraction of this product at the time was that every single Droplet came with a local SSD attached.

Even your five-dollar Droplet came with that in 2012, which was extremely attractive, and it's really what put DO on the map as a resource provider at the time. Since then we have expanded our product portfolio, starting with block storage in 2016, and many, many more products have been added since, including Spaces, which is our object, S3-compatible storage platform.

If you go to the website you'll see many more products, such as managed databases, managed Kubernetes, one-click app deployment and a lot of other very cool stuff. In terms of global footprint, we have data centers in eight regions, with each region having multiple data centers as well. And in terms of the lifecycle of the company, one very important event happened about six months ago with our IPO earlier this year.
All right, so let's talk about Ceph at DO. In terms of footprint, we use Ceph for two products: block and object. Block is what we call Volumes; for object, the name of the product is Spaces; and they are both backed by Ceph.

We have a third storage product in our portfolio called image backups, or bring-your-own-image, which today is not stored in Ceph, but that is something we are looking into and we hope to start making it part of Ceph next year, early next year. In terms of number of clusters, we have a total of 38 production clusters. Of those 38 production clusters, 37 are on Ceph Nautilus and one is on Ceph Luminous.
This laggard is simply due to the fact that for the block clusters we decided to repave all the OSDs from Filestore to BlueStore before the upgrade, and so we have one laggard there. That should be finished very soon, so at the end of next month we'll probably have all of our clusters on Nautilus. The entire repave process covered over 3,000 OSDs, so it took a bit of time.

In terms of actual usage, we have a mix of HDDs, plus QLC flash for the newer object storage clusters, and pretty much all of them (except for some specific cases) use erasure coding, with 5+2 and 4+2 k/m values.
So why did we pick Ceph? Ceph was selected way before I joined DO (we started with it in 2016 and I joined in 2018), but the reasons we picked it are pretty much the same reasons you'd see in any marketing document about Ceph. The scalability, especially the ability to add nodes horizontally, which is very straightforward to do and quite safe to do with the right set of processes, so we can expand the size of our clusters very easily.

The most important point in all of this, to me, is reliability, and that comes via the self-healing and the strong consistency that Ceph provides. We've had situations where something went wrong in a cluster and we had an outage, and you could see PGs in terrible states: down, incomplete, stuck peering, and what have you.
But every time we solved the underlying issue that caused the outage, Ceph came back on its own with very little to no intervention, and that is a huge, huge strong point of Ceph: we don't have to intervene a lot. There's obviously also the scrubbing and auto-repair, which is fairly cool to have. In terms of performance, we always want more performance, and if you give us more performance we're not going to complain; if you look at the Ceph user survey every year, asking for more performance is one of the top items. But generally speaking, Ceph performance has been very much acceptable.

We don't have many issues related to Ceph performance. It is what it is, and it is quite good as is; we always want more, of course, but it is acceptable. It also does not scale perfectly linearly, and we cannot expect that to happen, but it scales linearly enough that we can really predict the performance of a cluster from its size, which is a fairly nice property and improves our operations quite a lot.
Obviously, it works with off-the-shelf components, so no appliances, and that is a big win. It does mean that when you deploy new hardware you need to do more testing, and when you upgrade your OS you need to do more testing, but we are not tied to any appliance, which is a big win. And, also very importantly to us, it works with multiple products. As I said, we use it for object and block storage, and the ability to have only one backend greatly simplifies our lives: we only have to have one set of automation, and we only have to deep-dive, knowledge-wise, into one product and not several. That means we can handle more clusters with fewer people and less automation, and that is obviously a huge win for us.
All right, so let's talk about Ceph operations and what we do there. I've selected a few items to talk about; we're only going to brush the surface, but we do some very cool stuff, and I'm going to introduce a couple of open source tools we've developed that may be of interest to the community.

One of the core concepts of our team is that we should not scale the number of people in the team with the number of clusters; that would be pretty bad. So our goal is to automate as much as possible, with the final, perfect goal of never having to SSH into a machine. We're not quite there yet, but maybe someday.
In terms of operations and automation, we use AWX, the open source version of Ansible Tower (which I think is now named Red Hat Tower), and we still have some standalone tooling left here and there that we will want to turn into services in the near future, but for now the standalone tooling works pretty well.

All right, so in terms of deployments, nothing really exceptional going on there: we have our own in-house Ansible playbook to deploy Ceph in containers. We started containerizing Ceph as we upgraded from Luminous to Nautilus.
We don't use cephadm, for the simple reason that on Nautilus it's not available. When we go to Octopus, or rather when we go to Pacific, we may evaluate cephadm, but for now we have no plan to switch to it. So what we do in terms of deployments: we build our own Ceph packages, because we cherry-pick a lot of patches instead of upgrading Ceph directly. This lowers the risk of an upgrade going wrong; more on that in a bit. In terms of daemons, we basically use the ceph/daemon base image, which is just the image with the packages installed, and we have a minimal script to mount the OSDs. That's like 15 lines of bash, fairly straightforward.
In terms of operations, it's very nice to have all your OSD IDs in sequential order when you use Ceph, because if you see an alert that says OSDs 0, 1 and 2 have slow requests, you know that OSDs 0, 1 and 2 are on the same host and you know you have something to investigate on that host. But deploying them that way is super slow; it's going to take eight to ten hours, because you need to deploy all the OSDs one by one, sequentially.

So a neat trick we applied is that, because we know how many OSDs we're going to have per host, instead of deploying them one at a time we just pre-create the crush map before the deployment, and then we can deploy all the OSDs at the same time.
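As a quick illustration of that trick, here is a minimal sketch that pre-creates the CRUSH buckets so OSDs deployed in parallel still land under the right host and rack. This is a sketch of the idea, not our playbook; the host and rack names are made up, and the OSD ID pre-assignment part is not shown.

```go
// Sketch: pre-create CRUSH rack/host buckets before any OSD exists, so that
// OSDs can then be deployed on all hosts in parallel and still end up in the
// right place. The topology below is made up.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func cephCmd(args ...string) {
	out, err := exec.Command("ceph", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("ceph %v failed: %v\n%s", args, err, out)
	}
	fmt.Printf("%s", out)
}

func main() {
	racks := map[string]string{"node01": "rack1", "node02": "rack2", "node03": "rack3"}

	for host, rack := range racks {
		// A real tool would check `ceph osd crush tree` first so this stays idempotent.
		cephCmd("osd", "crush", "add-bucket", rack, "rack")
		cephCmd("osd", "crush", "add-bucket", host, "host")
		cephCmd("osd", "crush", "move", rack, "root=default")
		cephCmd("osd", "crush", "move", host, "rack="+rack)
	}
}
```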
So that's one of the things we do. One of the other reasons we have an in-house Ansible playbook to deploy Ceph, instead of using something like ceph-ansible or something else available in the community, is that we have a tight integration with our secret management platform: the playbook pushes the Ceph configuration, some secrets for monitoring, and the keyrings for third-party services that need to access Ceph, or for the hypervisors, to our secret management platform for later integration.
All of this combined lets us reduce the time to deploy a cluster: it used to be a couple of days with a lot of manual tasks, and now it's maybe half a day to deploy a new cluster. In the future, our main goal for automation is a zero-intervention deployment, where the hardware becomes ready from our infrastructure team and Ceph automatically deploys without us doing anything; we just get a Slack message or something telling us, hey, Ceph is deploying there, and we let it go.

In terms of monitoring, we use ceph health and ceph status in general, and that's the global information about our Ceph clusters, like capacity and all of that. We use ceph_exporter, which is an open source Prometheus exporter that we developed.
We don't use the exporter that is part of the Ceph manager for a couple of reasons. The first one is that when we started with Ceph it was simply not available, and we needed something, so we created ceph_exporter; and because we created it and it's working very well for us, we haven't really thought about moving to the manager one. But we also have some, I believe, justified concerns about the manager's scalability.

We noticed last week, as we upgraded one of our clusters to Nautilus, that the manager is struggling to keep up. We are starting investigations this week, so we'll report back to the community when we know exactly what's happening, but we suspect one of the manager modules is struggling to keep up with the data and is basically overloading the manager. The problem with the manager today is that it doesn't scale, because it's a single process, so the only way to scale it would be to put it on a more powerful machine, which is obviously very expensive.
So we're going to look into that this week and report back upstream to see what can be done. In terms of other monitoring, we have node_exporter running; that's the standard open source Prometheus exporter, so not much to say there. We have also built another exporter, called store_exporter, that is used to gather information from every single daemon in the cluster via the admin socket. Cool stuff we discovered through it, for instance, was bluestore_allocated versus bluestore_stored, which showed us the overhead of the BlueStore minimum allocation size on Luminous; that was 64K for HDDs on Luminous.

That default has since been changed in newer releases, but we saw that overhead, and based on it we decided to do some testing with a 16K minimum allocation size on Luminous. We found that it worked very well, so we actually switched to that. The exporter also exposes other information; most notably, something we noticed and haven't investigated yet is that with Ceph Nautilus it seems like the slow requests that get reported to the monitors might be undercounted.
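To give a sense of that kind of per-daemon collection, below is a minimal sketch that pulls those two BlueStore counters from a single OSD admin socket with ceph daemon. It is an illustration of the mechanism, not our store exporter, and the OSD id is arbitrary.

```go
// Sketch: read bluestore_allocated vs bluestore_stored from one OSD's admin
// socket ("ceph daemon osd.0 perf dump") and print the allocation overhead.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

func main() {
	out, err := exec.Command("ceph", "daemon", "osd.0", "perf", "dump").Output()
	if err != nil {
		log.Fatalf("perf dump failed: %v", err)
	}

	var dump struct {
		Bluestore struct {
			Allocated float64 `json:"bluestore_allocated"`
			Stored    float64 `json:"bluestore_stored"`
		} `json:"bluestore"`
	}
	if err := json.Unmarshal(out, &dump); err != nil {
		log.Fatalf("parse failed: %v", err)
	}

	if dump.Bluestore.Stored > 0 {
		// A ratio well above 1.0 means min_alloc_size padding is eating real capacity.
		fmt.Printf("allocated=%.0f stored=%.0f overhead=%.2fx\n",
			dump.Bluestore.Allocated, dump.Bluestore.Stored,
			dump.Bluestore.Allocated/dump.Bluestore.Stored)
	}
}
```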
We also have a canary process that runs regularly on each cluster to check performance and availability. To take block as an example, it regularly runs read and write workloads with different block sizes against a cluster, and that gives us two kinds of data: either it times out, which means we have a big availability problem on the cluster, or it completes and gives us performance numbers we can track.
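As a rough sketch of that canary idea, the snippet below assumes a dedicated pool and uses rados bench as the workload generator; the pool name, block sizes and timeout are made up. The timeout path is the availability signal, and in reality the bench output would be shipped as a performance sample rather than printed.

```go
// Sketch of a canary probe: small rados bench write runs at a few block sizes
// with a hard timeout. A timeout is treated as an availability problem.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func probe(blockSize string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// 10-second write bench against a hypothetical "canary" pool.
	cmd := exec.CommandContext(ctx, "rados", "bench", "-p", "canary",
		"10", "write", "-b", blockSize, "--no-cleanup")
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("block size %s timed out: availability problem", blockSize)
	}
	if err != nil {
		return fmt.Errorf("block size %s: %v", blockSize, err)
	}
	fmt.Printf("block size %s ok:\n%s\n", blockSize, out) // would go to metrics instead
	return nil
}

func main() {
	for _, bs := range []string{"4096", "65536", "4194304"} {
		if err := probe(bs); err != nil {
			fmt.Println("ALERT:", err)
		}
	}
}
```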
In terms of other Ceph operations: augmenting a cluster is easy, but it's still something where you can shoot yourself in the foot if you're not careful. We have two ways to augment Ceph today, which is not optimal; we want to converge on one, but we haven't gotten to that yet. From the block cluster perspective, it's pretty much the same thing you'll see out there on the mailing list or on IRC when people recommend how to augment safely.

We create all the OSDs in a dummy root tree with a crush weight of zero, then we move those OSDs to the right crush location, in the right racks, and we slowly upweight them to avoid any latency issues. If you just add everything at once, you're going to have latency issues on the cluster that will impact customers.
The only difference from what's usually recommended upstream is that we have developed a tool that automatically does the upweighting for you at a specified interval. By default we upweight by 0.2, I believe, until we reach the weight we want to be at. The tool is open source and available; it's called Archimedes, and it will automatically handle the crush upweighting for you, up to the weight you want it to be.
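The core loop behind that kind of tool looks roughly like the sketch below: step the CRUSH weight up and let backfill settle between steps. This is a minimal sketch of the idea, not Archimedes itself; the OSD id, target weight and the settle check are placeholders.

```go
// Sketch of gradual CRUSH upweighting: raise an OSD's crush weight in 0.2
// steps toward a target, waiting between steps so backfill stays bounded.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

func run(args ...string) string {
	out, err := exec.Command("ceph", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("ceph %v: %v\n%s", args, err, out)
	}
	return string(out)
}

func waitForBackfill() {
	// Crude settle check: poll `ceph health` until backfill is no longer mentioned.
	for strings.Contains(strings.ToLower(run("health")), "backfill") {
		time.Sleep(30 * time.Second)
	}
}

func main() {
	const osd, target, step = "osd.42", 7.27, 0.2 // hypothetical values

	for w := step; ; w += step {
		if w > target {
			w = target
		}
		run("osd", "crush", "reweight", osd, fmt.Sprintf("%.2f", w))
		waitForBackfill()
		if w >= target {
			break
		}
	}
}
```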
From the object point of view, we have a different process that I think is more interesting process-wise. The way we do augments is: we set nobackfill and norebalance to avoid any data movement, then we create all the OSDs in the right location with their final crush weight. You see a lot of peering happening at this stage, and a bunch of PGs are going to switch to backfill states, but nothing actually moves.

Then we cancel all the backfills with upmap overrides. So if you had a PG that was mapped to [0,1,2], and after adding your OSDs it became [0,1,3], we switch it back to [0,1,2] with an upmap override. That means all your PGs go back to active+clean, as if nothing had happened. To do these upmaps we use a tool called pgremapper; more on that in the next slide.
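The primitive underneath is the pg-upmap-items entry in the OSD map. A hand-rolled version of one such override looks roughly like the sketch below; the PG id and OSD ids are made up, and in practice pgremapper is what computes and applies these pairs for every backfilling PG.

```go
// Sketch: cancel one backfill with an upmap override. If PG 7.1a used to sit
// on OSDs [0,1,2] and a newly added OSD 3 changed its up set to [0,1,3],
// mapping 3 back to 2 returns the PG to active+clean with no data movement.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// ceph osd pg-upmap-items <pgid> <from-osd> <to-osd> [<from> <to> ...]
	out, err := exec.Command("ceph", "osd", "pg-upmap-items", "7.1a", "3", "2").CombinedOutput()
	if err != nil {
		log.Fatalf("pg-upmap-items failed: %v\n%s", err, out)
	}
	log.Printf("override installed:\n%s", out)

	// Later, the augment itself is just removing overrides in small batches:
	//   ceph osd rm-pg-upmap-items 7.1a
	// which lets that one PG backfill onto OSD 3.
}
```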
Because all the PGs are active+clean, we unset nobackfill and norebalance and nothing happens. The way we actually perform the augment is one of two ways. Either we run the balancer in upmap mode, and because the Ceph balancer always tries to undo an upmap before it adds a new one, it basically does our augment for us; but the concurrency of the upmap balancer is pretty low, and we don't want to change it because we use it for the entire cluster, so it's something like 10 PGs at a time. Or we undo the upmap overrides ourselves, in controlled batches.

The reason for batching is this: if you have all your backfills going on and you have flapping someplace else, the PGs from the flapping are going to need to recover, but some of them are going to sit in recovery_wait, blocked by the backfill reservations taken by the ongoing backfills. So they're going to be degraded for a bit, and then you're going to have another flap, which triggers more degradation, and so on and so on, until the augment is finished and the recovery finally goes through.

That means your degradation can be very long-lasting, and that's quite uncomfortable. By only undoing a certain number of upmaps at a time, we lower the time spent degraded. Say we only undo 10 upmaps at a time: that's ten backfilling PGs, and once they're done, whatever recovery needs to happen can happen, and then we start again, and again, and again, limiting the time you're degraded.
So both approaches basically either limit the impact on latency or limit the risk of data loss from long-lasting degradation. Now, pgremapper: pgremapper is a tool we open sourced about three or four months ago, so fairly recently, that is heavily inspired by a set of scripts from CERN, mostly written by Dan van der Ster, I believe. I put both URLs on the slide. pgremapper is in Go and the CERN scripts are in Python. So what can pgremapper do?
It allows you to cancel backfills for augments, like I just described, or to prioritize the recovery of recovery_wait PGs. We also use it on the block side: we still have a bunch of Filestore block clusters (well, it's mixed now), and when we recreate an OSD there, there is a lot of backfill going on, so we can recreate the OSD and cancel its backfills so the cluster goes back to active+clean, and then release them, say, ten at a time, so we don't prevent recovery from happening in another part of the system.

It also lets you undo upmaps, which I already talked about, and it lets you balance a crush bucket, which is similar to the functionality you get with the upmap balancer upstream, except it's localized to a specific part of your system: where the upmap balancer works on a per-pool basis, here you can work on a per-crush-bucket basis. It doesn't have the same complex considerations that the upmap balancer has.
If I recall correctly (I might be wrong), I believe it just looks at the PG distribution, so it tries to equalize the number of PGs you have on your OSDs within the bucket. That works in most cases, but it doesn't take crush weights into account and doesn't take into account whether you have a non-power-of-two pool, and all that. It also allows you to drain an OSD, which is a process we used when we repaved our Filestore clusters to BlueStore.

We drain the OSD, which is like marking it out, but much more controlled, because we don't do one or two hundred PGs at a time; we only do ten or so at a time, which limits the latency impact. It also allows you to remap a PG, which is basically a wrapper around the Ceph CLI to create an upmap exception. And finally, it allows you to export and import mappings, which can be useful in your automation.
That was useful to us for the draining and for faster OSD recreation, because if you have upmap exceptions but you destroy the OSD, the upmap exceptions are going to be removed from the OSD map, rightfully so. So you can export them before you destroy it and then import them back, to avoid having to rewrite them.

So yeah, check it out, digitalocean/pgremapper, and if you have any issues, submit them and we'll try to fix them as fast as possible. In terms of OSD lifecycle, we've developed a single tool, basically, to handle the entire OSD lifecycle from end to end.
We have this tool that allows us to diagnose, recondition, deploy, upgrade firmware, remove from the cluster, and locate OSDs, blinking the drive light on the chassis for a given OSD.

Most of the operations that are Ceph-based, like remove, create and recondition, just wrap around ceph-volume. It also wraps around ceph-disk, because we still have one Filestore cluster; we're going to remove that soon, but not yet. In terms of cool features, I think one of the coolest is really the ability to automatically upgrade drive firmware. It used to be, from an operational point of view: when you replace a drive, it's taken from a pool of spares, you put the spare in the machine, you look at its firmware, you then look at the latest firmware version we have validated, and if it doesn't match, you upgrade the firmware. Now what happens is: we replace the drive in the machine, and when we run the command to deploy the OSD, it automatically, before deploying the OSD, checks the firmware against the latest validated version, runs the firmware upgrade if needed, and then actually does the OSD deploy.
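Below is a very rough sketch of that check-before-deploy flow: read the firmware revision with smartctl, compare it to a validated version, and only then hand the device to ceph-volume. The validated-version map, device path and the vendor-specific update step are placeholders, not our actual tooling.

```go
// Sketch: before deploying an OSD on a drive, verify the firmware revision is
// the validated one; the vendor-specific update itself is not shown.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"regexp"
)

// Illustrative validated firmware revisions per drive model.
var validated = map[string]string{"EXAMPLE-MODEL": "FW123"}

func firstMatch(pattern, s string) string {
	if m := regexp.MustCompile(pattern).FindStringSubmatch(s); len(m) == 2 {
		return m[1]
	}
	return ""
}

func main() {
	dev := "/dev/sdb" // hypothetical device

	out, err := exec.Command("smartctl", "-i", dev).CombinedOutput()
	if err != nil {
		log.Fatalf("smartctl: %v\n%s", err, out)
	}
	model := firstMatch(`(?m)^Device Model:\s+(.+)$`, string(out))
	fw := firstMatch(`(?m)^Firmware Version:\s+(.+)$`, string(out))

	if want := validated[model]; want != "" && want != fw {
		log.Fatalf("%s (%s) is on firmware %s, validated is %s: update first", dev, model, fw, want)
	}

	// Firmware is acceptable: create the OSD on the device.
	out, err = exec.Command("ceph-volume", "lvm", "create", "--data", dev).CombinedOutput()
	if err != nil {
		log.Fatalf("ceph-volume: %v\n%s", err, out)
	}
	fmt.Printf("%s", out)
}
```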
From an operational point of view, that's a huge gain of time. Similarly, when an OSD goes down in Ceph, we need to determine whether the drive needs to be replaced or not, and it used to be a very manual and tedious process where you log into the machine, look at the logs in a couple of different places, and then make a determination; that was just a pain. So we developed diagnostic tooling as part of this tool that looks at SMART data, syslog, maybe some NVMe data, I don't recall, but basically it makes a determination on whether a disk needs to be physically replaced or not. If it does, we just swap the drive and redeploy the OSD fairly quickly.

This diagnostic tooling is probably not as mature as we want it to be, but as we see more different types of failures, we keep feeding more and more data into it.
And finally, in terms of automatic remediation: there are issues in Ceph that are recurring but don't really have an impact. A typical example of that is the Ceph manager running out of memory. It doesn't happen often, but because we have 38 production clusters, it used to happen maybe three or four times a day globally, and a simple restart fixes it. So we haven't really invested any time or resources into looking at why that happens and trying to fix it, because the fix is so simple and it's not impactful; we just have other priorities. So we have an automatic remediation service today that monitors every cluster in our fleet and looks for conditions.

In the case of the Ceph manager, it can look at, hey, how much memory is being consumed by the ceph-mgr process, and if it's over X gigabytes (I don't remember the exact number), we trigger an AWX job, which is just an Ansible playbook, to fix it. So in the case of the manager: the manager is over that amount of memory, run the playbook, which just restarts it. That dramatically reduces the number of alerts we get for this, and the interruptions for our cloud operations team.
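Below is a stripped-down sketch of that check-and-trigger pattern, assuming the mgr memory usage is already available from something like node_exporter and that the restart playbook is exposed as an AWX job template. The URL, token, template id, threshold and metric lookup are all placeholders.

```go
// Sketch of one automatic-remediation rule: if ceph-mgr RSS on a host exceeds
// a threshold, launch an AWX job template that restarts the daemon there.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

const (
	memThresholdBytes = 8 << 30 // e.g. 8 GiB, placeholder
	awxURL            = "https://awx.example.internal/api/v2/job_templates/42/launch/"
	awxToken          = "REPLACE_ME"
)

// mgrRSSBytes would normally query Prometheus/node_exporter; hard-coded here.
func mgrRSSBytes(host string) (uint64, error) { return 9 << 30, nil }

func remediate(host string) error {
	body := bytes.NewBufferString(fmt.Sprintf(`{"extra_vars": {"target_host": %q}}`, host))
	req, err := http.NewRequest("POST", awxURL, body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+awxToken)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("AWX launch returned %s", resp.Status)
	}
	return nil
}

func main() {
	host := "ceph-mon-01" // hypothetical
	rss, err := mgrRSSBytes(host)
	if err != nil {
		log.Fatal(err)
	}
	if rss > memThresholdBytes {
		log.Printf("ceph-mgr on %s at %d bytes, triggering restart playbook", host, rss)
		if err := remediate(host); err != nil {
			log.Fatal(err)
		}
	}
}
```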
It used to be that we'd page our cloud operations team for this: they'd log into the machine, restart the daemon, and come back, and doing that three or four times a day is a huge pain. So we have this automatic remediation service in place to handle it, and we plan to add more conditions to it as time goes on.
All right, that's it for Ceph operations; again, just brushing the surface, and I really hope we can deep dive more into that in person at the next Cephalocon. We'll have some fun. Now, in terms of the Ceph gripes: again, it's not blaming, it's stuff we noticed that you may want to be aware of in specific scenarios, and it's also stuff that we are looking at, or plan to look at, very soon.
The first one is that Ceph upgrades are hard, and we learned that with Nautilus. The process is simple, but there are many issues. The process has always been the same; I've been using Ceph since slightly before Argonaut, and it's always been the same process, which is very cool: you upgrade your mons, then you upgrade your managers, your OSDs, and whatever clients you may have.

It's always the same, though sometimes there are small extra steps, like you need to add something to your configuration, and all of that. That part is always very well documented. But you always have some small issues popping up that are not very well documented, and the documentation around upgrades is actually somewhat sparse, and that's a difficult problem to solve. You cannot expect engineers to think, every single time, "oh, I need to document that, and that, and that"; that just never works, and we know it.
To give a couple of examples of sparse documentation: a very simple one was that, from Luminous to Nautilus, the JSON output of ceph pg dump changed, which broke any tooling that relied on it, and that was not documented. It's not a big deal, in the sense that we caught it very early, before we even upgraded, and we fixed all of our tooling, but it does point to the documentation being lacking when a change that big is not documented.
The ceph-osd process will now, by default, disable transparent huge pages, which is a change from the operating system default and which actually caused issues for our Filestore HDDs. We fixed that and changed the configuration, but again, it comes back to documentation.

Upgrades can also be slow. It's just a fact of life: when there are on-disk checks or format changes during an upgrade, it's going to be slow, but it's important to know that when you plan your upgrade. Never plan an upgrade of a cluster with HDDs at 2 p.m., because you're going to be there for a while; prefer the morning.
We've seen that very recently with a message on the mailing list stating that if you created your cluster prior to Jewel and have never used CephFS, you need to stop at a specific version of Octopus before going to Pacific. That happens, and that's okay; I'm not complaining about that specific thing. It's just a statement that the history of the cluster matters a whole lot, and that makes upgrades even more complex, because how can you have proper testing if the history of the cluster matters, when staging systems are not meant to be long-lived? They're meant to be recreated a lot. So that's something we're trying to think about.

Something else we learned the very hard way earlier this month, about two weeks ago, is that a successful upgrade absolutely does not make a trend.
We upgraded 37 clusters recently. I may be rounding up or down, it doesn't matter, but about 30 of them went completely uneventfully: nobody who hadn't been told the cluster was being upgraded could tell we were upgrading. That was fantastic. On some block clusters, though, we saw issues: we hit a bug where all of our PGs went into snaptrim, which was scary because it looked like it was trying to replay the entire snaptrim history of the cluster. The first time it happened, it wasn't a huge impact.

The point is that upgrades, and testing this stuff in general, are extremely difficult, and that is an extremely complicated problem to solve. We're not saying "hey community, fix that please" and then walking away and waiting for it to happen.
We are going to look at that very, very closely as we start to look at Pacific within the next weeks or months. I don't know exactly when we're going to start, but it will be very soon, and as we do, we have to look at how we improve testing for Ceph, how we improve reporting our issues upstream, and maybe some collaboration with upstream on the testing part. We're not sure what that's going to look like yet, but we're definitely going to try to improve it for everybody.

The last gripe I want to mention is around RGW defaults.
If you're using RGW at a certain scale, you may want to be wary of a few things I mention here. Dynamic resharding: the naming I find a bit unfortunate, because it's not clear that dynamic resharding is going to block writes on your bucket, and you can see that on the mailing list, with a lot of people saying "hey, I have resharding going on and I can't do I/O". It works well up to a certain point, because in most cases, if you have small buckets, a dynamic reshard is only going to take a few seconds, or up to two minutes, and blocking writes for a minute is fine; it's not going to cause anything.

If you have a very, very large bucket that you're resharding, though, the reshard can take hours; ten hours, easily, if you have a humongously large bucket. What happens during that time is that the write requests going to that bucket are blocked on the RADOS Gateway, which takes up threads, and they are never timed out on the gateway; they're going to stay there indefinitely.
So if you're using dynamic resharding and you plan on having very, very large buckets, be wary of that. Another consequence of resharding in general is that it makes listing very expensive on the backend, way more I/O, and slower on the frontend, so that's also something to be aware of.

Beast on Nautilus has a couple of issues. The first thing you want to do if you use Beast is set the max connections option; I think Canonical made a patch for that, so you can set it as a configuration option. You want to set it to your somaxconn on the system, or even increase somaxconn on your system and then match it, because otherwise you cannot send a lot of concurrent operations to it with the default settings. One other important thing to know applies to Nautilus; I know Octopus and Pacific have improved this, and I don't know the final status in Pacific, so I'm only talking about Nautilus here.
Even though Beast is asynchronous, most of the librados calls underneath are still synchronous. So if you are in a situation where your cluster has some sort of issue and you have a few slow ops on the cluster, you may exhaust the Beast threads very quickly, because it has a very small number of threads by default, I think 500 or 512, and you can exhaust that and actually prevent any further requests from coming through because of it.

We found in our testing that Beast performs extremely well, impressively well, compared to Civetweb under normal cluster conditions, but when we have these types of issues, not so much. We tried increasing the number of threads in Beast and did some testing there, but increasing the number of threads in Beast does not scale as well as increasing the number of threads in Civetweb. For that reason, we are actually running Civetweb on Nautilus in our clusters today.
As we look towards Pacific, we are definitely going to look at this again and see if things have improved. The last thing I wanted to mention is old-shard removal. That shouldn't be an issue in 99% of cases, but at some point in Luminous a patch was added to automatically remove the old shards: if you run dynamic resharding, or if you run the radosgw-admin reshard command, it used to leave the old shards behind and you had to remove them manually; now it does it automatically.

However, if that happens on a very large bucket, you may DoS your cluster, because the delete operation is just a RADOS delete of the object, and if it has billions of keys, it's going to DoS your cluster, basically preventing the OSDs that hold the three replicas of it from responding to the cluster, and they're going to go down for a bit.
So the way we do it is that we developed a small tool that, using go-ceph, just iterates over the keys and deletes a specific amount at a time. I think we delete something like ten thousand entries at a time, every few seconds, and just keep going until it's done, and then we remove the object itself.
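Below is a minimal go-ceph sketch of that batched-delete idea, assuming the stale shard is a single object in the bucket index pool. The pool name, object name, batch size and pacing are placeholders, and a real tool would handle many shards and retry on errors.

```go
// Sketch (go-ceph): trim a huge stale bucket-index shard by deleting its omap
// keys in batches instead of issuing one giant delete, then remove the object.
package main

import (
	"log"
	"time"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	ioctx, err := conn.OpenIOContext("default.rgw.buckets.index") // hypothetical pool
	if err != nil {
		log.Fatal(err)
	}
	defer ioctx.Destroy()

	const oid = ".dir.stale-shard-example" // hypothetical stale shard object
	const batch = 10000

	for {
		// Keys are removed each round, so we always page from the beginning.
		kvs, err := ioctx.GetOmapValues(oid, "", "", batch)
		if err != nil {
			log.Fatal(err)
		}
		if len(kvs) == 0 {
			break
		}
		keys := make([]string, 0, len(kvs))
		for k := range kvs {
			keys = append(keys, k)
		}
		if err := ioctx.RmOmapKeys(oid, keys); err != nil {
			log.Fatal(err)
		}
		time.Sleep(2 * time.Second) // pace the deletes so the OSDs keep up
	}

	// Once the omap is empty, removing the object itself is cheap.
	if err := ioctx.Delete(oid); err != nil {
		log.Fatal(err)
	}
}
```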
All right, that's all I have; as I mentioned, very superficial. I really hope we can talk more at Cephalocon next year, I really do. Before any questions, I just wanted to mention that we're hiring. We've backfilled a couple of jobs already in the storage systems team, and there are other jobs out there at DO, with Ceph and without Ceph, like Kubernetes and whatnot.

So if you're interested, have a look at the DigitalOcean jobs page. I'm trying to maintain the list of jobs in the Ceph jobs pad as well, and I'll try to do my best on that. If you don't see a job that you like, or if you have any questions after this talk, please do feel free to shoot me an email at my digitalocean.com address. And if you have any questions now, I'm happy to answer them.
So I guess that's it. Thank you very much, everybody, for joining, and have a good day.

Yeah, just a second, Alex, can we get the slides?

Sorry, what was that?

Can we get the slides?

Yes, so I'll send them to Mike; I'm sure he has a platform to share them. I'm not sure, but I'll send them to Mike, of course, and we'll see.

Okay, thank you.