From YouTube: Ceph for Big Science - Dan van der Ster
Description
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Dan van der Ster, CERN Storage Engineer and Ceph Advisory Board member
This picture on my first title slide is the Large Hadron Collider, which is just beside Geneva, Switzerland, right next to Lake Geneva here on the side. You can see we have a giant tunnel going under the earth, accelerating particles and colliding them at four locations around the ring. The ring is actually 27 kilometers in circumference, and when the scientists or the technicians go down to do some work, they even ride bicycles around the tunnel, as you can see here.
When we collide those particles, like I said, they're going in either direction, and you can see in the animation the particles coming in and colliding. They collide inside these giant particle detectors, which are essentially giant digital cameras. This is an example of one. Can you see the scientist in this picture? He's down here.
And this is what the pictures look like. We collide these particles together to generate the fundamental building blocks of our matter. This is a Higgs boson, which was discovered just a few years ago; it's the fundamental particle which gives mass to everything in the universe. So this is actually what they look like: you can tell because there are a couple of elementary particles coming out in these different directions, this red one, and some other jets in this other direction.
To study this kind of physics we need a lot of computing power at CERN. This is a picture of the CERN data center. We have around 300 petabytes of storage and 230 thousand CPU cores, but in fact our computing is just one part of the Worldwide LHC Computing Grid, where we have around 150 data centers around the world at universities and research labs. One of those sites is our friends at the high energy physics institute here in Beijing, which we had the pleasure to visit a couple of days ago.
So that's a little bit of background about CERN, and now I want to explain how we have used Ceph, yesterday and today. Our adventure with Ceph started in February 2013, when we were just starting to get involved in building a cloud at our laboratory with OpenStack, and we had the question: how do we provide storage for OpenStack? At the time there was already one clear candidate, and that was Ceph, so we started a proof of concept.
We had just over one thousand three-terabyte disks and two hundred really nice SSDs for the OSD journaling, and this was the time of Ceph Dumpling, so we were using this quite a while ago. When we put it in production, it was just before the Christmas break. CERN closes for two weeks around Christmastime, and we didn't quite understand how many replicas we should have, so we even used four replicas over the Christmas break so that we would be very sure not to have any interruption to our nice holidays.
When we came back we rethought things, and now everything is three replicas and we're very happy with that. Over the following years things grew for us. After this three hundred terabytes, and then the three-petabyte production cluster, we started using Ceph for other use cases. We contributed part of the erasure coding libraries to Ceph. We wrote something called the RADOS Striper, which lets us stripe physics data in parallel across the OSDs. We also tried making a very simplified file system which we called RadosFS.
That one was not ideal, and we moved on from it to something else. We upgraded in 2016: we grew our three-petabyte block storage cluster to six petabytes, completely replacing all the hardware with no downtime, and by late last year we had eight Ceph clusters in production. Here are those clusters today.
The main use case is still OpenStack Cinder and Glance, where we have very close to six petabytes in production, and we also have a second, smaller cluster in a satellite data center with half a petabyte. We're now also fully in production with CephFS, where we have around 1.5 petabytes in total across three different clusters; I'll explain those in a bit more detail.
Then there's the cluster for CASTOR and XRootD: a physics use case where we have over five petabytes of physics data stored in Ceph. And then we also have what I consider to be a small S3/Swift object store for other use cases. Normally I would talk about block storage first, but I'll do things in reverse, because CephFS is now stable.
So let's talk about CephFS and how we have been using it. Like many users, we have a kind of filer use case for several different IT applications, things like Puppet, GitLab repositories, or OpenShift: you often just need a place to store files. We have been doing this for the last few years with virtual NFS filers.
These are basically virtual machines with Cinder block devices, using ZFS and a nice backup tool to make NFS, not highly available, but at least reliable and backed up. We have around thirty of these kinds of virtual filers and they work very well. Here's an example: this plot shows that we can do close to 50,000 stats per second on one of these virtual machines, which is quite high performance. But this kind of architecture is not very scalable: to manage the quotas you have to detach and reattach the block devices if the users are growing their space too much; it's also labor-intensive to create new filers; you have to manage them as precious individual machines; and you can't scale the performance horizontally.
So this is why CephFS is interesting. We started last year to evaluate OpenStack Manila with CephFS, because it was starting to match most of our requirements. It supports multi-tenancy, with each of the users isolated so they can't delete or affect the other users' data, and it also supports quotas. More importantly, it has easy self-service provisioning: the users can, either through an API or by clicking through their web browser, create a new virtual filer on demand, which is a very nice procedure. And it has scalable performance: we just add more OSDs or more MDSs as needed when we need more performance.
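As a rough illustration of that self-service provisioning, here is a minimal sketch using python-manilaclient; the endpoint, credentials, share type name, and size are placeholder assumptions rather than our actual configuration, and it assumes the client accepts a keystoneauth1 session.

```python
# Sketch only: create a CephFS share through OpenStack Manila, the same
# operation users trigger from the API or the web dashboard.
from keystoneauth1 import loading, session
from manilaclient import client as manila_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='https://keystone.example.org:5000/v3',   # placeholder endpoint
    username='someuser', password='secret',
    project_name='someproject',
    user_domain_name='Default', project_domain_name='Default')
manila = manila_client.Client('2', session=session.Session(auth=auth))

# Ask Manila for a 100 GB CephFS share; the share type name is an assumption.
share = manila.shares.create(share_proto='CEPHFS', size=100,
                             name='my-filer', share_type='cephfs')
print(share.id, share.status)
```

The user then adds an access rule for their cephx identity and mounts the path Manila returns, without ever opening a ticket with the storage team.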
We've been testing this since the middle of 2017, and indeed, at that time a single metadata server was a bottleneck. But now, with the Ceph Luminous release, multi-MDS is a stable feature, so we have this in production.
Here are a couple of pictures showing multi-MDS in production, just to confirm that this is stable. We had a pre-production testing environment with 20 different tenants, that's 20 different OpenStack projects, using this for several months with two active MDSs, and it was stable. On our production cluster we enabled this in January of this year, and in the picture on the bottom you can see, for example, that when you go from two active MDSs to three, Ceph just balances the metadata workload across the different MDSs, including the new one. So this code is working quite well and it's stable.
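For reference, raising the number of active MDSs is a one-line change. Below is a sketch of doing it programmatically through the Python rados binding's mon_command; the filesystem name is a placeholder and the command field names are an assumption based on how the `ceph fs set <fs> max_mds <n>` CLI maps to monitor commands.

```python
# Sketch: the equivalent of `ceph fs set cephfs max_mds 3`, sent as a
# structured monitor command. The 'var'/'val' field names mirror the CLI
# mapping and are an assumption; 'cephfs' is a placeholder filesystem name.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    cmd = json.dumps({'prefix': 'fs set', 'fs_name': 'cephfs',
                      'var': 'max_mds', 'val': '3'})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    print(ret, outs)   # Ceph then activates and balances the extra MDS
finally:
    cluster.shutdown()
```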
Now let's think about software-defined HPC. All of our computing is built with commodity parts, both the high throughput and the high performance computing. On the compute side we solve this with software such as HTCondor and Slurm, but on the other hand, typical HPC storage is not that attractive for us, mainly because we lack the expertise to operate those typical HPC storage systems, but also because we lack the budget. We want to build something with the lowest-price hardware and just add open-source software on top to enable the HPC workloads.
So we've done this with CephFS and Manila: 300 HPC compute nodes have been accessing a one-petabyte CephFS since the middle of 2016, and by today we treat HPC as just another OpenStack Manila user. Indeed it's quite stable, but maybe it's not the highest-performance option. So what about performance?
We were very interested when one user, Mr. John Bent, posted to the mailing list his contribution to Supercomputing 2017, a new benchmark called IO500. In the HPC world the top supercomputers are measured with the Top500; for storage there is now this new IO500 benchmark, and its goal is to share the good and the bad, the heroes and the anti-heroes, to try to drive progress in improving the performance of storage systems.
We've just started testing this on our CephFS clusters. The benchmark includes two of the traditional HPC storage benchmarks: IOR, which measures throughput, and mdtest plus find, which measure metadata. Each of these tests has an easy mode and a hard mode. In the easy mode the different parallel jobs write to individual files, which is relatively easy for a file system, but in the hard mode all these hundreds of jobs try to write to the same file, a parallel I/O case that's very difficult.
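To make the easy/hard distinction concrete, here is a minimal sketch, not the actual IOR code, of the hard pattern: every MPI rank writes small, unaligned records into one shared file at interleaved offsets. In easy mode each rank would simply open its own file instead.

```python
# Run with e.g.: mpirun -np 64 python shared_file_write.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nranks = comm.Get_size()

record = bytes(47008)      # small, unaligned record size in the spirit of "hard" mode
fh = MPI.File.Open(comm, 'shared.dat',
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
for i in range(100):       # 100 records per rank, all into the same file
    offset = (i * nranks + rank) * len(record)
    fh.Write_at(offset, record)
fh.Close()
```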
So here's a first look at this, and the disclaimer is that this is straight out of the box with no tuning, on Luminous version 12.2.4, tested just this March. The cluster where we ran this has just over 400 SSDs; they're located on the same machines as the CPUs where the tests and the clients are running, and likewise for the metadata servers.
Well, I can say it's in the number 10 position: our score of 2.46 is just below number 9's 4.25, but you can see that we're already ahead of number 9 in gigabytes per second, while our metadata performance is quite a bit lower. So this kind of activity is something we're going to spend some time on, to try to boost CephFS in these rankings and be able to use it as a real HPC storage system. Now let's move on to the RADOS Gateway, for S3 at CERN.
We use this because S3 is a standard object storage interface, where different developers inside CERN can just have access to storage that their applications already know how to use. We do this with a Luminous cluster with erasure coding, and we just run virtual machines as the RADOS gateways, with one single region. The types of use cases we have include physics data with very small objects; for example, each collision we would store in a different object, so we can have millions and millions of those.
We also use this for volunteer computing. Maybe some of you remember SETI@home, searching for extraterrestrial intelligence; this was generalized into a piece of software called BOINC, which lets users run a screensaver and donate their CPU cycles to physics research. The backend storage for that is on Ceph S3, and we use features such as pre-signed URLs and object expiration to make this all usable for the users, and it's all very stable.
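As an illustration of those two features, here is a minimal boto3 sketch against a RADOS Gateway S3 endpoint; the endpoint, bucket name, object keys, and credentials are placeholders, not our real setup.

```python
import boto3
from botocore.client import Config

s3 = boto3.client('s3',
                  endpoint_url='https://s3.example.cern.ch',   # placeholder RGW endpoint
                  aws_access_key_id='ACCESS_KEY',
                  aws_secret_access_key='SECRET_KEY',
                  config=Config(signature_version='s3v4'))

# Pre-signed URL: a volunteer's machine can upload one result for an hour
# without ever holding real credentials.
url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'boinc-results', 'Key': 'job-000042/output.tar.gz'},
    ExpiresIn=3600)
print(url)

# Object expiration: a lifecycle rule cleans up old results automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket='boinc-results',
    LifecycleConfiguration={'Rules': [{
        'ID': 'expire-old-results',
        'Filter': {'Prefix': ''},
        'Status': 'Enabled',
        'Expiration': {'Days': 30},
    }]})
```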
One thing we do in addition to the standard out-of-the-box RADOS gateways is to run HAProxy in front. This gives us easy high availability, so the RADOS gateways can be restarted at any time without any downtime, and we also use HAProxy to map some special S3 buckets to dedicated RADOS Gateway instances, so that some of our larger use cases don't affect the general service.
This is a picture from a couple of days ago of our Cinder and Glance use cases. On the left you can see that we're currently storing more than 4000 Glance images; these are not just system images but also some virtual machine snapshots. On the right we see that we have more than 5000 Cinder volumes attached, so you can see this is a very popular service in our cloud.
Our service is made available through several different volume types. The vast majority of users are using our standard volume type, which has QoS limitations similar to those of a single spinning disk. For our higher-IOPS use cases we let users create what we call the io1 volume type, which is also quite popular, and then we have different volume types with different qualities of service, related to power or location.
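A rough sketch of how such a volume type with QoS limits can be defined through python-cinderclient follows; the names and numbers are illustrative assumptions, not our production values, and the keystone endpoint and credentials are placeholders.

```python
from keystoneauth1 import loading, session
from cinderclient import client as cinder_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='https://keystone.example.org:5000/v3',   # placeholder endpoint
    username='admin', password='secret', project_name='admin',
    user_domain_name='Default', project_domain_name='Default')
cinder = cinder_client.Client('3', session=session.Session(auth=auth))

# QoS spec roughly in the ballpark of a single spinning disk (illustrative).
qos = cinder.qos_specs.create('standard-disk-qos', {
    'consumer': 'front-end',                 # enforced by the hypervisor
    'total_iops_sec': '100',
    'total_bytes_sec': str(80 * 1024 * 1024),
})
vtype = cinder.volume_types.create('standard')
cinder.qos_specs.associate(qos, vtype.id)
```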
This service continues to be highly reliable. Sometimes we brag to our colleagues that we've never lost one single bit that we're aware of. One reason we find it so reliable is that the quality-of-service limitations really let you make sure no single user can affect the others, so everything is very, very stable.
One of the interesting parts of running a block storage service is that your clients have such long uptimes that it's difficult to upgrade them. But thanks to some recent security exploits, Spectre and Meltdown, we had the opportunity to reboot all of the hypervisors, and this brings in the newest Ceph client, which is really nice. We still have some ongoing work here.
We're testing this first with the HPC workloads: we have these 300 to 400 SSDs in the HPC compute nodes, and I can say that technically this is working quite well. There are some cases where you need to use cgroups to isolate the different processes, but in fact the bigger issues that we see are related to our operations culture, because we have separate teams for storage and for processing, and so we on the Ceph team's side don't own those servers. We need to cooperate with the cloud and HPC teams and develop common procedures: when to intervene on nodes, how to drain nodes, how to upgrade, how to reboot. The cloud guys need to know how to operate Ceph and the Ceph guys need to know how to operate the cloud.
This is an area that needs to be solved for hyperconvergence to work in our environment. Now let me move to a kind of section where I give some user feedback, from me, Dan, as an operator. Upgrading from Jewel to Luminous in general went well, with no big problems. We're replacing OSDs, and all the new OSDs in our cluster are built with BlueStore using the newest tooling, the ceph-volume LVM format, and we're also converting existing FileStore OSDs to BlueStore with a script that's tuned for our infrastructure.
We're also very excited about the Ceph Manager balancer feature. Finally we can make the OSD utilization very flat across the whole cluster. We're actively testing this, and it's really convenient that it's all written in Python, so we can patch and adjust things just how we like for our environment.
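One handy side effect of the manager being scriptable is that checking how flat the utilization actually is takes only a few lines. This sketch pulls the `osd df` data through the Python rados binding and reports the spread; it assumes that command is reachable via mon_command on the release in use.

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    # Same data as `ceph osd df -f json`; routing via mon_command is assumed.
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd df', 'format': 'json'}), b'')
finally:
    cluster.shutdown()

util = [n['utilization'] for n in json.loads(out)['nodes']]
print('mean %.1f%%  min %.1f%%  max %.1f%%  spread %.1f%%' %
      (sum(util) / len(util), min(util), max(util), max(util) - min(util)))
```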
We can see the capacity of the cluster increasing as we add racks, then decreasing as we drain the older racks, then increasing again as we add a new rack of servers, and decreasing again; and on the right we can see the new OSDs filling up and the existing OSDs draining data as the data rebalances across the cluster. This is thanks to Ceph backfilling and recovery: it's the kind of intervention you can carry out transparently, without your users noticing any downtime whatsoever.
We really need something like rbd top, so we can just run that, see immediately who the heaviest user is, and then maybe go speak to them. On the performance side, there are some use cases where you want microsecond latency and kilohertz IOPS, such as databases, and for that we probably need persistent SSD caches on the local hypervisor.
Also, some use cases require encryption at rest or client-side volume encryption. We don't have that yet, but it would be really useful. And in the case of hyper-converged clusters, where we have different cells with local OSDs, we just need some tooling in OpenStack so that users always get a volume that's close to where their virtual machine is running.
On the CephFS side, for HPC, it's clear that we need to work a little bit on parallel I/O, to try to get the best performance possible in benchmarks like this one. Also, from the operations point of view, we need a simple way to copy huge amounts of data across different places in the cluster. For example, rsync could be made smart about how CephFS stores recursive change-time statistics, so we could have a CephFS mode added to rsync; maybe someone in the audience is interested in implementing such a thing for the general use case.
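A minimal sketch of the idea: CephFS already exposes a recursive change time as a virtual extended attribute on every directory, so a sync tool could skip whole subtrees that have not changed since the last pass. The paths and timestamp below are placeholders.

```python
import os

def rctime(path):
    # CephFS virtual xattr: "<seconds>.<nanoseconds>" of the newest change
    # anywhere underneath this directory.
    raw = os.getxattr(path, 'ceph.dir.rctime').decode()
    return float(raw.split('.')[0])

def needs_sync(path, last_sync_epoch):
    return rctime(path) > last_sync_epoch

# Example: descend only into subtrees touched since the last backup run.
last_run = 1520000000.0                          # placeholder epoch of previous pass
for entry in os.scandir('/cephfs/projects'):     # placeholder mount point
    if entry.is_dir() and needs_sync(entry.path, last_run):
        print('changed, would rsync:', entry.path)
```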
A
If
we
were
to
use
set
FS
in
general
for
all
of
our
POSIX
like
storage
at
CERN,
we
would
need
to
do
a
lot
more
testing.
We
would
need
to
be
able
to
scale
to
10000
or
even
a
hundred
thousand
clients,
but
for
this
to
work
really
well,
we
would
need,
for
example,
throttles
and
tools
to
discuss,
to
identify
and
block
or
disconnect
the
the
disruptive
users
or
clients.
So
a
set
client
top
is
a
similar
thing
that
we
need.
We would probably need native Kerberos without an NFS gateway, because gateways will be bottlenecks, and we need group accounting and group quotas for the non-Linux clients. Then, of course, we would use highly available CIFS or NFS gateways, but that still leaves the question of backup open: if we were to open this up to our 10,000 users, we would very quickly have the problem of how you back up a ten-billion-file CephFS.
So maybe we can eventually think about doing binary diffs between snapshots in the file system to enable this kind of use case. At the RADOS level we also have some challenges, such as some very large clusters with old configurations: old tunables, three petabytes of RBD data still with the Hammer tunables. How do we handle that cluster moving forward over the next five to ten years?
Now let's look to the future. This is high energy physics computing for the 2020s. We're currently in our second run of the LHC, generating 50 to 80 petabytes of data per year. In the early 2020s we'll shift to a new run with increased strength of the accelerator, and we'll be generating 150 petabytes of data per year. But then, in Run 4, it's estimated that we might need to store up to six hundred petabytes per year.
So we need a storage system for that, and additionally, with our global grid, we're thinking of something we call a data lake. It's different from the conventional industry data lake: what we mean is that we want a globally distributed file system with flexible data placement, so data lands where we want at the different universities, and ubiquitous access to all of the data.
With these kinds of use cases in mind, this is what motivates us to do these large-scale tests, which we call Big Bang tests, in cooperation with the Ceph core team. A couple of years ago we started this with our first 30-petabyte test, 7,200 OSDs, and we indeed found some limitations, which were fixed at that time.
We did a second run with the Jewel release of Ceph, and in this case we found some limitations in the messaging between the Ceph mon and the OSDs; this was part of the motivation to develop the Ceph Manager that we now have in production for Luminous. We repeated the test in the middle of last year with a 65-petabyte cluster, 10,800 OSDs, and this revealed just a few minor remaining issues, which were fixed, luckily, before Luminous was released. So now we can say that Ceph is scalable to more than 10,000 OSDs.
That's the end of my talk. I want to say thanks to some of my CERN colleagues; these are the team of people that do all of this work at CERN, the operations and development side of things for Ceph, the cloud, and HPC. And of course I want to say thanks to the whole community. This is a picture of the Ceph logo made up of all of the contributors to the Luminous release, and maybe you can even see your company in there. So thank you for listening.