Description
Presented by: Dan van der Ster | Clyso
In 2013, the data storage team at CERN began investigating Ceph to solve an emerging problem: how to provide reliable, flexible, future-proof storage for our growing on-premises OpenStack cloud. Beginning with a humble 3PB cluster, the infrastructure has grown to support the entire lab, with 50PB of storage across multiple data centres serving a variety of use cases, from basic IT apps and databases to HPC and cloud storage.
So this is a 10-year retrospective on what it was like operating Ceph at CERN. Quickly, for those that don't know me: that's my email address, and I'm a Canadian guy from the University of Victoria. I've been on the engineering staff at CERN since 2008. I was managing Ceph there until last year, and then I handed off the reins because I became Chief IT Architect at CERN.
That's one hat. The other hat is that I've been proudly, happily supporting the Ceph Foundation, being a founding board member, and now I'm honored to be on the Ceph Executive Council, since 2021, when our Benevolent Dictator for Life moved on to greener pastures. And the latest news is that in April I'm relocating to Vancouver. I'll be taking a sabbatical from CERN, and I'm going to be joining Joachim and company to help build up Clyso in North America.
What's that cool picture behind me? That's the LHC at CERN, the Large Hadron Collider, a 27-kilometre ring, 100 metres underground. This is where I work these days. My house is right there somewhere; there's a ski hill going up, and when it gets cooking a little bit it melts the snow a little bit. Not too much! But this is what it's like 100 metres down: those are the superconducting accelerator magnets.
They accelerate protons up to the speed of light minus 11 kilometres per hour, just hitting the brakes a little bit. They bend those opposing beams together at four corners of that circle, collide them, and then take pictures with giant cameras like this. You can see someone who may or may not be me standing there, to give you an idea of how big that camera is. And then it takes pictures like this: this is a picture of the Higgs boson.
This is the whole reason we built the LHC at CERN, and this is the particle. Well, this is not the particle itself; this is the splash after that particle was created and then decayed very quickly. And that's the thing that gives everything mass; that's why we weigh something. It's really crazy: I'm a computer guy, but somehow I got my name on the ATLAS paper that discovered the Higgs boson, which then got a Nobel Prize. Okay, I don't have the prize.
So one would ask a certain intelligent system: what should we... oh, this. Unfortunately the text is right there. "Write an outline for a 10-year retrospective." "Sure, here's an outline." This is my CephGPT bot; okay, CephGPT wrote the outline for my talk. "I'll give you an outline": introduction, background of Ceph at CERN, a 10-year review of how it was to operate, and some lessons learned and future directions.
Of course, the outline it gives was more detailed than that, so I'll give a brief overview of our use and the importance of it in our infrastructure. This is real output from ChatGPT, guys. Unbelievable. And the purpose of this outline. So here's the introduction, the overview of Ceph at CERN. We started in 2013 with a 300-terabyte proof of concept that we tried to break. We couldn't break it, okay, and that allowed us to justify a three-petabyte cluster back in 2013 for OpenStack.
In 2014 and 2015 we did some R&D on erasure coding, writing the first ISA-L acceleration for erasure coding. We developed the first object striping for a particular use case that we had for physics data. In 2016 we upgraded our Ceph in place from three petabytes to six petabytes; this was proving that Ceph's organic storage growth really was working. Then we had eight production clusters. Then we got into CephFS and S3 in production, scaling out. Okay, I ran out of things; there's nothing so notable in those years, but we were scaling out, with business continuity and disaster recovery solutions.
By now we have around 17 clusters and around 100 petabytes on the floor. These two numbers are interesting, because 10 years is about five Moore's Law doubling periods, and that's almost exactly matching, right? That's like 96 petabytes, if you do the math. That's kind of interesting, but it also means that my Ceph budget did not increase in one decade.
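That doubling arithmetic is easy to check: five doublings of the original 3 PB land almost exactly on today's roughly 100 PB.

```python
# Five Moore's-Law doubling periods (~2 years each) applied to the
# original 3 PB cluster from 2013.
start_pb = 3
doublings = 10 // 2  # ten years, one doubling every ~two years
projected = start_pb * 2 ** doublings
print(projected)  # 96 PB, close to the ~100 PB actually on the floor
```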
That hundred petabytes is within a context: it's within an exabyte context of disks and tapes for the physics data, but we use Ceph for sort of everything else. Okay, so it's really proven itself; it has become the key piece. The purpose of this talk: if you don't remember your history, you're condemned to repeat it. A decade is a round number, and I've got 10 fingers, so 10 sounds like an interesting number, and it's kind of like a memoir.
Why Ceph? So going back, you know, we started with OpenStack. The management forgot to get a storage solution for the cloud, so, okay, let's see what's out there in open-source land. Ceph was the best: we tried Gluster, we tried our NetApps, we tried Ceph, and Ceph was far and away the best. We all know that story.
The key thing is we needed an organic system that would grow with our usage and our evolving use cases. Flexible storage, no data migrations. Data migrations, forget it: you can't just shift the data to another platform as you're growing. So, behind the scenes, transparent migrations; that was key, and it has proven itself for 10 years. This is it, it works. Also the best bang for the buck.
So these are more like, let's say, the initial challenges when you start, and I think this is probably echoed by many people. You want to install the software, and you need to make it work on this hardware, because the hardware comes from someone else, who decides which hardware we buy, in bulk, for scientific use cases. Okay, and it's always two gigabytes of RAM per CPU core, always the largest, cheapest hard drives (data centre hard drives, not desktop stuff), always in JBODs of 24 disks, so you can then do 24, 48, up to 96. And, you know: why do you need three replicas?
People are still asking that question. Why do you need SSDs? This is still something to justify. There's also some not-invented-here syndrome at CERN: there's a storage department of 50 developers at CERN, so using something from outside took quite some work. There's the impression that the only thing good about Ceph is CRUSH, for some reason. I mean, CRUSH is even broken.
Right, CRUSH actually doesn't work; I'll get into that a little bit later, a tiny bit. But the real value of Ceph is the engineering effort to make failures transparent, and the OSDs just heal. That's the real, true gem of Ceph. And also, the old-timers said: forget about any kind of software-defined storage, you'll never solve the latency problem, there will always be one or two extra hops, just forget it.
That's kind of still true, isn't it? But anyway, our architecture. I'm just following ChatGPT's outline, right? This is not creativity at all, I'm so lazy. It's the same host recipe since the beginning. We have server quads (I've got a picture on the next slide), dual Xeons, now AMDs, always two gigabytes of RAM per hyperthreaded core; now these are 256-gigabyte machines. We went from one gig to 10 gig to 25 gig networking on our standard stuff.
We have 100 gig in testing as well. Always dumb HBAs, never RAID controllers, and one or two of those JBODs. And I managed from the beginning to convince people that we need a little bit of flash. It was a reliable FileStore journal at the beginning, and now it's a BlueStore block.DB. On the flash, originally we were spending a bit extra on 5-drive-writes-per-day SATA SSDs, with five OSDs per SSD; now we've scaled down a bit.
Here are the photos. This is probably the same aisle, just to prove that nothing has changed. These are those server quads; you've probably seen these, there are four servers in there, and then those 24-disk JBODs. It's the same, and that picture is even from 2017. I just got bored of taking pictures, because it's always the same. All right, voilà. Network-wise, very simple stuff.
Originally we had this idea... actually, what it is: at CERN it's complicated to plan where you're going to get space in the data centre, so the servers just end up anywhere. So then you try to make a cluster across different switches, across different routers, you know, multi-path routing. And we learned that this is not the way to go with Ceph, because any kind of fault anywhere and you've got OSDs flapping up and down; it's just a disaster.
So now we usually put it all behind one switch, which is also slightly crazy. But if you have many clusters, and you teach users about trying to use many clusters, this probably ends up a more reliable system, with better failure modes. Most of our issues are related to router line card failures or packet corruption. Now we just use one cluster behind one switch; if the switch fails, the whole cluster goes down, inaccessible all at once. It's better, I think, than what we used to do.
Well, there we go, something to discuss after. We used to have one cluster for all; this was also the dream of Ceph, right? One cluster, and you create a pool for each of these different things. But we now have several clusters across several zones. Teaching your users about downtime budgets is like the most important thing; this is part of the Google SRE book, right? Downtime budgets. And spend your downtime budgets too.
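As a rough illustration of the SRE-style arithmetic (the numbers here are mine, not from the talk): an availability target directly fixes how many minutes per month you're allowed, and even expected, to spend on disruptive maintenance.

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

# e.g. a 99.9% monthly target leaves about 43 minutes for upgrades,
# switch reboots, and other planned interventions
for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget_minutes(slo):.1f} min/month")
```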
Okay, where are we on to now, CephGPT? I hope you noticed the Ceph logo on CephGPT. Yeah: performance and scalability, upgrades, challenges and success stories. So, performance. I guess I was a little bit tired when I wrote this slide: Ceph performance can be summed up as "it is what it is", right? It's not performance-first.
In that triangle of things that you choose between: consistency, availability, and performance... I mean, well, partition tolerance, but whatever. You know, we prioritize consistency and availability in Ceph, or partition tolerance; it's actually a CP system. But anyway, we don't prioritize performance first, right? That's Mark's job; always keep Mark busy. But the users always expect raw NVMe fio performance. Management will buy these devices that say a million IOPS, and I gave you a thousand of them, so: where are my billion IOPS, right?
That's the expectation, so we always have to remind people of the complexities of distributed, clustered storage across a network. But anyway, I think this isn't even a problem: in 10 years of experience, almost nobody understands their I/O workloads and requirements. Everybody wants that performance, but in practice we still throttle our attached storage to 200 IOPS or 500 IOPS, we allow bursting up to a thousand, and everybody's happy. Like 99% of people are happy.
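The talk doesn't say where this throttle is applied (limits like these often live in the hypervisor or Cinder QoS layer rather than in Ceph), but for reference, librbd in recent Ceph releases can express similar limits itself. A hypothetical sketch, with made-up pool and image names:

```shell
# Cap all images in a pool at 500 IOPS with bursts toward 1000
# (rbd_qos_* options exist in Nautilus and later)
rbd config pool set volumes rbd_qos_iops_limit 500
rbd config pool set volumes rbd_qos_iops_burst 1000

# Or override the limit for a single image
rbd config image set volumes/vm-disk-1 rbd_qos_iops_limit 1000
```

This is a config fragment, not a runnable script; it assumes a cluster with a `volumes` pool.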
So that's the scale of storage performance. However, from the operations side, you still really want this: it must be possible to take 100 petabytes of disks, put in a small amount of flash, and, you know, the goal is a cheap performance fix: make the cluster perform like 100 petabytes of flash, transparently, and it should be just invisible. FileStore had a simple way to speed up writes; this was nice.
I think a lot of Ceph operators appreciated this, and then lots of clever folks used bcache. I don't know if anybody in the room remembers doing this kind of thing. We used bcache. You still do? Yeah. But it's scary to put other things in between; I think it's scary. Okay, I know it's probably working in practice, but it's scary to put other things in between, because Ceph wants to scrub the disks and detect the durability issues immediately itself.
BlueStore is deferring small writes to the block.DB, but I'm pretty sure it's still bottlenecking on fsyncs to the spinning disks. So we don't quite get the same performance out of BlueStore as we did with FileStore, for writes at least. Maybe it's a config issue; we're talking to Mark about this. I just know from other file systems that I use, like ZFS...
...they have a persistent L2ARC cache and log device, and it works very well to sort of hide that spinning-disk slowness underneath. I think we should put something more focused on this, something like that, built directly into Ceph, instead of having to rely on bcache and other things. But there was one cheap fix. This is actually a plot.
It's unfortunate: I didn't look far enough forward when we set up our original monitoring dashboards 10 years ago to set the storage schema to keep 10 years of retention.
So we only keep five years of retention here, and this is a plot of... every minute, on every Ceph cluster, I run rados bench, four-kilobyte writes, just to see how long it takes to write the objects. This is on our biggest, oldest cluster. (The laser pointer doesn't work on this.) You see, originally it was under 10 milliseconds for this slow bulk storage.
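The probe itself is just the standard benchmarking tool shipped with Ceph; a minimal sketch of that kind of once-a-minute measurement (the pool name and duration are my guesses, not the actual CERN probe):

```shell
# Write 4 KiB objects for 10 seconds with a single in-flight op and
# report average/max write latency; test objects are cleaned up
# afterwards by default.
rados -p rbd bench 10 write -b 4096 -t 1
```

Requires a running cluster and a pool to write into, so it's shown here as an ops fragment rather than a runnable script.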
It got pretty bad there for quite a few years, 2018, 2019: up above 50 milliseconds for a 4K write. I mean, that's terrible, right? But what's interesting is that nobody complained, because it's protected: it's inside a VM, and who does synchronous I/O consciously? Almost nobody. It's all buffered, and Linux is flushing it behind the scenes, and the users didn't even really notice. But then, okay, we got some AMD EPYC CPUs and then nothing worked; it got completely broken. Is anyone from AMD in the room?
Okay, it wasn't AMD's fault. It turned out to be... you see, around the end of 2021, suddenly the latency went down, and it's because we learned that spinning disks, around the, let's say, 14-terabyte generation, across manufacturers, added what one of the vendors calls a media cache. And this thing specifically... I mean, I still haven't been able to dig out why they did it, like for exactly which use case.
It must have been in collaboration with one of the enterprise NAS vendors. But specifically, it accelerates synchronous direct I/O. So if you're using O_DIRECT and O_DSYNC at the same time, like Ceph does, it accelerates those writes in a fast part of the disk. Okay, so if you have the write cache off on these latest-generation drives: yeah, cool, magic fix.
Yes, the switch to turn this on and off is just turning the write-back cache on and off on the spinning disks: WCE on, WCE off. Out of the box they usually come with that write cache on, and recent-generation disks are then just completely unusable. But you switch those disks to write-through mode and it's like magic-fix mode: suddenly your whole spinning-disk cluster performs almost like SSDs again. This is a spinning-disk cluster, and that's really great.
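The toggle being described is the drive's volatile write cache, the WCE bit; a sketch of how you might flip it on a SAS/SATA drive (the device name is a placeholder; check your distro's tooling before copying):

```shell
# Show the current Write Cache Enable bit
sdparm --get WCE /dev/sdX

# Switch the drive to write-through mode (WCE=0) and persist it
sdparm --set WCE=0 --save /dev/sdX

# ATA drives can also be switched with hdparm
hdparm -W 0 /dev/sdX
```

These are ops fragments against real hardware, so they're not meant to be run as-is.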
We're still scared that something is not persisted, but that's a couple of years now. I mean, this looks too good to be true, right? Yeah, that's what it looks like, but so far so good. Okay, scalability. We had lots of fun with our big-bang scale testing; we did a couple of blog posts with Sage on this. In 2015 we did a 30-petabyte test.
Back when that was a big number. Then in 2018 we did a 72-petabyte Ceph test with over 10,000 OSDs, because Ceph actually has a, not hard-coded, but a maximum of around 10,000 OSDs per cluster. We just wanted to have a test where we could bump it up above that. We did that, and we found lots of problems: all of the maps were too large, and the cluster was flapping up and down.
All that work was going to the mon. In the past we moved all the non-critical workload off to the Ceph manager daemon, and as a bonus it also gives people who are Python devs an easy way to contribute to Ceph, through the modular interfaces of the manager. But still, I think we should avoid putting in more than a few thousand OSDs; two to three thousand is about the limit, and beyond that it gets sluggish, as I see some of you have noticed.
Yeah, exactly, you see some things. Okay, some improvements. BlueStore was a huge improvement. You know, FileStore was convenient, simple to understand, but there were a lot of painful things, like splitting the directories and then merging them back. This was quite painful.
People these days can just ignore that. You have a nice warm fuzzy feeling from BlueStore that it's always serving good data; you didn't have that with FileStore, and we have checksums now in BlueStore. Someone's shaking their head, no? Yes, okay, well, the jury's still out on that. I've got that warm fuzzy feeling, but it is pretty scary. One of my managers was like: oh, I...
...remember Oracle did the same thing, invented their own file system underneath, and it took, I don't know, that's before my time, but he said it took like decades before that stabilized. Maybe someone else knows. There are still some, you know... BlueStore is still scary, but major kudos to those low-level gurus who know how this works and know how to fsck a corrupted BlueStore. Wow. That's a place we can improve.
Most cluster operations are more civilized now; I've got some stuff on the next slide. CephFS, this is an improvement over time. Remember, in the olden days we had: Ceph is awesome, the block store is awesome, the object store is awesome, and CephFS was "almost awesome" for the longest time, and then it was declared awesome, right? And it continues to improve today.
Some of the challenges operating Ceph: we learned early on, and we still know, to make changes gently to a Ceph cluster. Don't do anything massive. I think in Berlin I had a talk about a "leap of faith": sometimes you type a Ceph command and then, oops, see you in three months.
In a large cluster, rebalancing is the normal state of the cluster, so tuning that, understanding that, feeling the heartbeat of the cluster, is important. In our operations we aim for HEALTH_OK with no backfilling at least once a week. If you keep doing operations that take weeks on weeks and weeks, things will accumulate and you'll be in trouble. We have some scripts to do this: gentle reweight, gentle split.
These used to be really used; I don't know how many people use them anymore, but Ceph is getting really good at this now. It's solving most of these kinds of operational problems: it tracks the number of misplaced objects and keeps that all within a limit, it has schedules for the balancer and scrubbing and so on. All of this got much better, but I think it's still way too easy to do that "oops". A couple of months ago someone was on the mailing list...
I don't know if they're in this room, but: "I did this, and then..." and it was just like, really: "we estimate it's going to take three months to complete". Jeepers, I can't imagine; that's a long time with no sleep. I think we should do something like Nomad and Terraform, where you can type a change plan, it'll tell you what it's going to do and how long it should take, and then you can apply it. I don't know, something like that; that kind of semantic would be really helpful for operators.
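To make the idea concrete, here is a hypothetical sketch of that plan-before-apply semantic; everything in it (the class, the numbers, the ETA formula) is invented for illustration, not an existing Ceph interface.

```python
# Hypothetical "plan before apply" semantic for cluster changes,
# in the spirit of Terraform; numbers are invented for illustration.
from dataclasses import dataclass

@dataclass
class Change:
    description: str
    misplaced_objects: int
    backfill_rate: int  # objects per second the cluster can move

    def plan(self) -> str:
        """Describe the change and estimate the backfill time."""
        eta_h = self.misplaced_objects / self.backfill_rate / 3600
        return f"{self.description}: ~{eta_h:.1f} h of backfill"

change = Change("increase pg_num 2048 -> 4096", 50_000_000, 2_000)
print(change.plan())  # the operator reviews this before applying
```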
Other challenges. Remember when non-uniform data placement was costing us all a lot of money? Unbalanced OSDs: if you had a 10% variance, this would cost millions, right, in lost space, in petabytes. The upmap balancer solves this problem. Problem solved. But, and this is what I alluded to earlier, CRUSH is broken.
There are some mailing list threads on this. If everything is the same size, it's fine, but as soon as you have some different sizes, the algorithm is actually wrong. There are threads about this; it's called the multi-pick anomaly. Maybe somebody smart wants to fix this one day, but it would require some deep work.
The built-in balancer is really great, but it's not perfect in all systems. This is where the power of the community comes in. TheJJ, are you here? That's you, yeah: people use your balancer.
This is great, you know, so it works really well. And this upmap tool is really a Swiss army knife for more experienced Ceph operators; it's a mechanism we can use to direct data exactly where we want. And you know, there's pgremapper from, I think, DigitalOcean, right, and upmap-remapped that we made. These kinds of things are quite useful.
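The primitive all of those tools build on is the pg-upmap-items table in the OSDMap; a minimal illustration (the PG and OSD ids are placeholders):

```shell
# Map PG 2.7's replica off osd.4 and onto osd.12, as an exception,
# without touching the CRUSH weights
ceph osd pg-upmap-items 2.7 4 12

# Remove the exception again and let CRUSH place the PG normally
ceph osd rm-pg-upmap-items 2.7
```

These commands need a live cluster, so they're shown as ops fragments.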
These are features we want to bring into Ceph somehow. Scaling CephFS remains probably the most challenging of the Ceph daemons, because the MDS is single-threaded, so you need many of them. It's very stable these days, very stable, but adding and removing them on the fly is not so practical if you've got, say, some AI workloads running there.
I need to upgrade, and, you know, it's hard to convince a thousand physicists that we're going to do an upgrade and they have to stop all their long-running, month-long jobs. So there's a bit more there, and it's still scary if there's a problem in CephFS; this remains a very scary thing. And snapshots are a bit too scary to enable at scale; I think there's someone in the room here who might have a tear in their eye thinking about it.
S3 scaling: one region is not very hard. I'll come out with our stack: it's Traefik front ends, it used to be HAProxy, and we switched to Traefik because we manage it all with Nomad, and we route to the RADOS Gateways. Performance comes from spreading across many buckets and spreading across many RADOS Gateways, but we group our RADOS Gateways using pattern matching on the bucket names, according to our users; these are different user communities. This is a good way to get nice performance.
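The routing idea is simple; a hypothetical sketch of the bucket-name pattern matching (the community names, patterns, and backend addresses are all invented, and the real setup lives in the load balancer's config, not in application code):

```python
import re

# Map bucket-name patterns to groups of RADOS Gateway backends,
# one group per user community (all names here are made up).
ROUTES = [
    (re.compile(r"^atlas-"),  ["rgw-physics-1:8080", "rgw-physics-2:8080"]),
    (re.compile(r"^backup-"), ["rgw-backup-1:8080"]),
]
DEFAULT_GROUP = ["rgw-general-1:8080", "rgw-general-2:8080"]

def backends_for(bucket: str) -> list[str]:
    """Return the RGW backend group serving a given bucket."""
    for pattern, group in ROUTES:
        if pattern.match(bucket):
            return group
    return DEFAULT_GROUP

print(backends_for("atlas-run3-data"))   # physics community group
print(backends_for("www-site"))          # falls through to the default
```

Grouping by bucket name keeps one community's heavy traffic from starving another's gateways.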
We suffered some issues, but overall it's working very well. Multi-region seems difficult, though. I heard maybe it's going to be fixed in Quincy or Reef; it's getting there, so we'll see. I'd like to hear more about that. I alluded to this earlier: Ceph makes it easy, too easy, to put all your eggs in one basket. But someone told me: don't put all your eggs in one basket. We learned that the hard way.
I don't know if the slides will be available afterwards, but if you're curious: in February 2020 we had an incident, and we made a talk, this one is from June that year, while we were all in lockdown, that was called "the bug of the year". Actually, little did we know that there was a bug of the century emerging at the same time that we were fixing ours.
What happened was that there was a bug in the low-level lz4 compression library, and it corrupted the OSDMaps on all the OSDs at the same time. I was having a coffee, and then, boom, I had to put it down, and that was an eight-hour, very painful, traumatic day. But everything recovered perfectly; all the data was intact, not one single bit lost, and now Ceph is stronger because of it.
Oh, I've got to go. Successes. I didn't want to list all the successes, because there are just too many, but I want to talk about why Ceph succeeds, right? It succeeds because it integrates with the platforms. We can't forget this: you've got to stay trendy, like Sage used to say. It was OpenStack, now it's Kubernetes and OpenShift and the CSI stuff, and then CephFS and S3 make it integrate with everything. That's why anyone can download something from GitHub...
...and it's got an S3 back end, right? That's probably why we're even here. It succeeds because it protects our data despite faults everywhere. Someone's going like this... I mean, no, come on, it does. We didn't have any corruptions in 10 years, no corruptions. It's a core building block that thousands are trusting, at least in our environment. It continues to grow because it has been flexible as we expanded; I think we've been through four hardware generations with the same bits. And the user community...
...has always been the strength of Ceph. Let's not forget it was built up because of all of these Ceph Days. The community manager is over there on his phone... oh, taking a picture, that's why. Yeah, COVID hurt us, really; the community is kind of... we're not doing the best, but this is the best thing to happen in a few years now. So I hope to see everyone at Amsterdam as well, because a good community is good for all of us.
Okay, whoa, this is a pretty long outline; I'm probably out of time. Lessons learned and future directions, key takeaways from 10 years of operating Ceph at CERN. All right, so what were my key takeaways from running Ceph at CERN over the past 10 years?
"Proper planning and design"... this is real, guys, it's terrifying: "proper planning is critical". You can read this later, it's hilarious. "Monitoring and maintenance are essential, resilience is key, testing is crucial, and collaboration is important. Running Ceph at scale is a complex task. It's important to collaborate with other organizations and individuals who have experience. This includes participating in the open-source community, attending conferences and meetups, and sharing best practices with other Ceph users."
Oh my God. So, yeah, I just wanted to thank my brilliant CERN colleagues who helped build our IT and storage infrastructure, the kind and knowledgeable community, the hosts today and those who helped organize this, and, of course, the giants who build Ceph: we're standing on your shoulders, really. That's the end of my talk. Thank you.
B: Thank you. Two quick questions. Five years of retention for you is like 25 for the rest of us, and I'm pretty sure I'm speaking for everybody here, so congrats on keeping all that data there. But how many operators do you actually have running this back end, like managing this on a day-to-day basis? Two warm bodies, 24/7, just on call?

A: Oh, no.
No, I mean, no, no, no. So CERN has... we have our data centre, it's old school (we're building a new data centre), and there's one person who sits there all the time, just looking for things to turn red, who then calls down a list of phone numbers. Okay, that's the on-call type stuff.
The Ceph team is two people, basically, two warm bodies, maybe with a couple of students. But that's part of a storage team which is like 40 or 50 people doing all kinds of things, which is part of an IT department with 250 people, right? You know, you don't need many people if you're focusing on one technology.