From YouTube: Ceph Snapshots for Fun and Profit
Description
Ceph includes snapshot technology in most of its projects: the base RADOS layer, RBD block devices, and CephFS filesystem. This talk will briefly overview how snapshots are implemented in order to discuss their implications, before moving on to example use cases for operators. Learn how to use snapshots for backup and rollback-ability within RBD and CephFS, the benefits they bring over other mechanisms, how to use the new snapshot trim config options, and most importantly: how much they cost in
All right, welcome everybody back to the Ceph day track here. We're going to keep moving along with another CephFS talk, or pieces of Ceph, I guess: Snapshots for Fun and Profit. We're going to talk about all the various snapshot mechanics that exist within Ceph, whether that's the block device, the filesystem, or the gateway. So there are all kinds of options, and Greg Farnum here is going to give the presentation. He's a longtime core developer of Ceph, so I'll let him take it from here.
Hey everybody. Is that mic... well, okay, cool. So, this talk is Ceph Snapshots for Fun and Profit. My name is Greg Farnum; I'm a principal software engineer at Red Hat, and I've been working on the project for almost eight years now, can't believe it.
During this talk we're going to go through the origin of snapshots in Ceph, because that's important for some of the design decisions that have been made. We're going to look at how writes work inside of the OSD and how the snapshotting systems interact with those writes. We'll look at how snapshots work at a higher level in the RBD and CephFS systems, we'll look at how snapshot trimming works inside of the OSD, we'll look at some ways to control and throttle that and at the consequences of the implementation, and we'll look at some use cases. When I was practicing this earlier the talk ran a little short when I was just talking through it, but hopefully this one is more understandable; the last time I gave this talk it was a little too hard to follow.
So please, if you have any questions, raise your hands or jump up and down or something, because we should have enough time for Q&A while we're going through. Ceph started out at the UC Santa Cruz Storage Systems Research Center as a long-term research project. They were trying to build a successor to the Lustre HPC file system. Some of the research was sponsored by the national labs, Sandia and Lawrence Livermore, as they were setting up Lustre for the first time and realizing, wow, this has some downsides; we'd like to not have those downsides.
Things have changed a bit since then. There are a lot of open source and hardware companies contributing to the project, and it's a lot more cloud focused; that's why we're all here. Most customers are working with virtual block devices, RBD, or with the S3 and Swift interfaces of the RADOS Gateway.
But about a year ago, at the OpenStack Austin summit, the Ceph community was really proud to announce that we had a stable CephFS file system upstream, and so some of the vendors are now starting to push that down to some of their customers as well. If you've ever seen a Ceph talk, you've probably seen this slide: the Ceph project starts off with RADOS, the Reliable Autonomic Distributed Object Store, which provides the data durability and consistency mechanics.
On top of that we build various interfaces: a full file system with a metadata server and a custom client; the RADOS Block Device, which is just a client library that sits inside of QEMU, or inside of the Linux kernel, and other systems; or the RADOS Gateway proxy, which speaks S3 and Swift to the outside world and turns that into internal RADOS operations for itself. Snapshots were initially envisioned, as with the rest of the project, as a thing in the Ceph file system, and they were designed to be really easy.
B
Every
directory
and
stuff
of
s
has
a
hidden
dot
snap
directory
inside
of
it.
If
you
want
to
make
a
snapshot
of
that
directory
and
everything
underneath
it,
you
just
create
a
directory
inside
of
the
dots
adapter.
So
it's
just
a
make
two
dot
snap
/
snap
and
then
everything
underneath
that
has
a
new
snapshot,
that
you
can
reference
through
the
dot,
snap
directory,
dot
snap
and
see
the
files
of
the
state
when
you
created
the
snapshot.
That
was
that
was
a
big
goal
was
that
you
could
do
this
with
arbitrary
sub
trees.
B
You
didn't
need
to
specify
that
the
directory
was
special.
Before
you
made
a
snapshot,
you
didn't
mean
to
create
the
directory
in
some
special
way.
You
didn't
need
to
do
sub
volumes
and
things.
So
it's
just.
We
wanted
to
work
with
any
directory
in
the
system.
Your
home
derp,
the
at
like
as
a
user,
the
administrator
taking
a
snapshot
of
every
home
door
or
at
the
root
of
the
file
system
or
whatever
it
would
all
just
work.
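As a concrete sketch of that workflow from a client (the mount point and snapshot name here are made up for illustration):

```sh
# Assumes CephFS is mounted at /mnt/cephfs; paths and names are examples.
cd /mnt/cephfs/home/greg

# Create a snapshot of this directory and everything underneath it.
mkdir .snap/my-snapshot

# Browse the tree as it looked when the snapshot was taken.
ls .snap/my-snapshot

# Deleting the snapshot is just removing that directory.
rmdir .snap/my-snapshot
```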
Because of that, and the user accessibility, and the fact that in HPC applications when people are taking snapshots it might be a thousand nodes all doing it at once, those snapshots need to be cheap to create. But we do have one big advantage over some systems, which is that we have intelligent clients. The CephFS client is pretty smart; it does a lot of work. The RBD client, not that we knew about it then, but today the RBD client is pretty smart and does a lot of work too. So the clients can coordinate the snapshotting across OSDs; we don't need to flood all the OSDs with a synchronous message system that says: hey, there's this new snapshot that applies to these objects. And indeed, when Sage sat down with that system and worked out the first design, we took advantage of the fact that snapshots in RADOS are actually per object as far as the OSD is concerned. That's because the snapshotting is driven by object writes. When you take a snapshot in the Ceph file system it applies to the whole directory and everything underneath it, and if you take a snapshot of an RBD volume it applies to all the objects in the RBD volume, but we don't go out and touch those objects right away. When we have a write to them, we just send along a little bit of metadata.
The question was whether this works with CephFS on open files, and the answer is yes, but we're not going to go into too much detail on that.
So, in RADOS, in the OSD, consider normal writes without snapshots involved. You have object storage daemons; you probably already know this. Right now those consist of a user-space daemon that talks to an XFS filesystem. There's a new thing coming called BlueStore that manages disks directly; it's going to have a lot of advantages and it is being pushed forward, but most of this talk is going to focus on the FileStore on XFS, because that's what most people have, it's the most battle tuned, and it's the only one that a lot of vendors are supporting right now.
In terms of the network, when you have a raw RADOS client that wants to write something, it says: hey, I've got this object foo and I want to write to it. The client finds the primary OSD for object foo and sends it a message saying 'I want you to write this data'. The primary OSD sends that message on to all the replicas for object foo, and then it sends back an ack to the client once the writes have been committed to disk. Inside of the OSD there are a couple of different things that need to happen. It needs to look up the current object state, to make sure that the client is allowed to touch that object, that the object actually exists if it's not doing an object create, to see if it needs to change the size of the object, or whatever; that's one disk I/O if that data isn't cached. It packages up the write data for its replicas and for its local storage system, and then it sends that to the replicas over the network and to its local storage system to persist. And that's, you know, depending on the file system, on what the file system feels like doing right now.
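Just to make that mapping concrete, here is what an operator can see from the command line; the pool and object names are only examples:

```sh
# Which placement group does object "foo" map to, and which OSDs serve it?
# The first OSD in the acting set is the primary that coordinates the write.
ceph osd map rbd foo

# A raw RADOS write: the client sends the data to that primary,
# which replicates it before acknowledging.
rados -p rbd put foo ./some-local-file
```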
We call these snapshots 'self-managed', because we're bad at naming, but also because for CephFS and for RBD the clients are the ones managing the metadata about the snapshots. CephFS is responsible for knowing which objects are in this particular snapshot 42; it's not the responsibility of RADOS or anything like that. So, to allocate a self-managed snapshot:
The client just says to the monitor: hey, I want a new snapshot ID. The monitor does what we call a Paxos commit round; it allocates one internally (the monitors are strongly consistent) and writes that down to disk, and then it says: okay, client, here's your new snapshot ID. As far as RADOS is concerned, that's it: there's now a snapshot, and it's not associated with anything except that it exists. But that's all it takes to do the logical creation.
Now, the client probably actually has some data it wants in the snapshot. So at some later point it says: okay, I have this object, foo, let's call it, that is in my snapshot, which is just snapshot 42, and now I'm writing new data to object foo. So it sends a message to the primary that says: hey, write this data to object foo, and by the way, I know that object foo is a member of snapshot 42. The primary gets that, it sends it out to the replicas and then back, and everything's happy.
Internally in RADOS, the OSD looks up object foo, which is about one disk I/O, and it says: oh hey, object foo isn't in snapshot 42 yet, as far as I know, so I'd better make a copy of its current state and say that that copy is snapshot 42. That's a clone operation: in XFS that's a full copy of the object, and in BlueStore it's just...
Okay, sorry. So, graphically: we have the disk as it exists, we have this object foo, and it's got an xattr which contains its object info. We say: hey, I want you to look up the object info, so please read this xattr out of XFS for me, and we get it back, not from the client but from our file system. Then we say: hey XFS, we need you to copy object foo into this new location (I think what it actually looks like is a rename, as I remember: we copy it into a new location and then overwrite the original). So we say: clone the object, write this new data to the newly cloned object, and record the snapshot. That goes into the file system, and now we have the foo clone for snapshot one and this object foo, which has the new overwritten data. And also, in a LevelDB instance that we use to provide a whole lot of things, we've written down these two key-value pairs, from snapshot to foo and from foo to snapshot. That can mostly get coalesced into one commit if the file system feels like it; at times it might be a couple more, it depends. Then we say: hey, the file system did this, you can have the operation back now, and you're done. So that's sort of the local path, and you'll notice that, depending on what the file system feels like, it might be two I/Os.
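You can actually see those per-object clones from the command line once a snapshotted object has been overwritten; pool, object, and snapshot names here are illustrative:

```sh
# List the head object and any clones kept for snapshots of it.
# The output shows which snapshot IDs each clone belongs to and its size.
rados -p rbd listsnaps foo
```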
B
It
could
be
more
if,
depending
on
how
many
folders
it
decides,
it
needs
to
look
at
it
needs
to
go,
do
lookups
in
or
to
update
or
whatever
at
the
time
So, at a higher level, let's look at writes and snapshots from RBD's perspective. The RADOS Block Device stores virtual disks; you're probably broadly familiar with it.
When you take a snapshot in RBD you're running a simple operation, which I show later; I think it's 'rbd snap create' with a snapshot name on the image. The client goes to the monitor and says: hey, I need a snap ID, and the monitor goes through the process we just saw and says: here's your snap ID. Then the client writes that down in what we call the RBD header object, which is responsible for saying: we have this RBD volume, it exists, it is of size 10 gigabytes, and it supports these features. After that, every I/O carries the note 'by the way, remember snapshot 42', and the OSDs take care of it on their own. So the write path looks the same and nothing extra really happens from the client's perspective; it's pretty simple. CephFS is not hugely different.
We do have a metadata server that sits in between the OSDs and the client in order to provide file system namespace operations, like saying: hey, I need you to create this directory, or rename it, or allocate a new inode number. The client goes to the MDS to do that, but then, when it wants to write actual file data, it just talks directly to the right OSDs for the objects that are part of that file. So, graphically, the client says: hey MDS, I want to open this file, /home/greg/.gitconfig, for write, and the MDS says: okay, here it is. The client then writes the new version of .gitconfig out to an OSD.
If I want to make a snapshot of my home directory, the client says to the MDS: hey, I want you to make a snapshot, a mkdir of /home/greg/.snap/mysnapshot. The MDS has its own logs, which are journals too, so it persists the fact that there's now a snapshot in /home/greg and then responds to the client: okay, you've got a new snapshot, it's got snap ID 42. Then, when I later on (or maybe it's happening at the same time) say, hey, I want to open and write the .gitconfig file in Greg's home dir, the MDS tells me how to open the file, and the client sends off the new data to the object and says: by the way, this object is a member of snapshot 42. And again, that all happens in parallel.
B
You
go
to
the
MVS
to
open
files,
and
the
MVS
tells
you
that
it's
a
member
that
the
file
is
a
member
of
whatever
snapshots
it's
in
and
then,
whenever
you
go
talk
to
the
OSD,
is
you
just
set
that
up
so
sequential
or
it's
not?
It's
not
serialized,
it's
just
all
in
parallel
with
whatever
files
you
have
to
be
doing.
B
This
could
be
you
know
one
big
file
that
has
that
you're
writing
the
three
objects
at
once
on,
because
you're
doing,
oh
dear,
because
you're
doing
sixteen
megabytes
streaming
iOS,
it
could
be
three
very
small,
four
kilobyte
files.
It
just
all
happens.
Naturally,
so
that's
how
snapshots
get
created
any
questions?
Oh
one
in
the
middle
yeah
be
good.
Sorry,
I'll.
Every OSD daemon provides three different sorts of data streams, or forks, on an object: you've got the xattrs, the object byte stream, and what we call OMAP, the object map. That's a key-value store in the OSD, implemented with LevelDB or RocksDB if you're familiar with those. It's not a SQL thing; it's just a key-value store that you can write into, list, and read out of, and we use it to provide the OMAP interface and for some of our internal metadata, like this snap mapper thing. In the normal course of doing business, on a write, you don't actually do anything with it, but it is a thing that's being worked on in the background all the time, so there is a cost associated with writing into it. It's an ongoing cost you pay, though; it's not 'for this op we created an I/O', it's more like 'for these 50 ops we created one 4-kilobyte write to disk'.
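If you want to poke at those three forks of an object yourself, the rados CLI exposes each of them; the names here are examples:

```sh
# The object's byte stream.
rados -p rbd get foo ./foo.bin

# The object's xattrs (on FileStore this is where the object info and
# snapset metadata live).
rados -p rbd listxattr foo

# The object's OMAP keys, stored in the LevelDB/RocksDB instance.
rados -p rbd listomapkeys foo
```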
[Audience question.] So, replication happens in parallel with the local write to disk on what we call the primary OSD. It gets a write, puts it through some processing, and once it has approved it and ordered it with respect to other writes, it simultaneously sends it off over the network to its replicas and gives it to its local storage to persist. Thank you, yep. Okay, so we've seen the create path... oh, one more, sorry, yep.
[Audience question.] Yeah... right, yeah. We can talk about that afterwards. That's part of the reason there's this new BlueStore thing I've been alluding to: it handles the disk directly, so we remove the double logging, but that's not something we can really get into right now. Okay, so we've seen that creating snapshots is pretty cheap.
What about deleting them? The OSD sees that the OSD map has a deleted snapshot and says: I'm going to put that deleted snapshot into my queue of things to trim. Then, as it works its way through that snap trimming queue, it will list the objects that are in the snapshot and, for each of those objects, unlink the clone for that snapshot in XFS.
It will update the object's main info xattr, the one that contains the metadata about the object, and it will remove the LevelDB snap mapper entries for that object-and-snapshot pair. Visually, let's say we've been a little more ambitious: we've now got three objects that are in snapshot 1, and we've got an OSD map that says, hey, you need to delete snapshot 1. The snap trimmer runs through and says: all right, I need to delete snapshot 1; what's the next object that's in snapshot 1? And the answer comes back:
Oh hey, it's foo. So the OSD says: hey XFS, I need you to remove this foo_1 clone object, I need to update the info on foo to say that it no longer has a foo_1 clone, and I need you to remove the keys out of the snap mapper LevelDB instance. XFS and LevelDB make their changes: LevelDB crosses out the entries, XFS has the new info and has removed foo_1, and it says, okay, I'm done. And again the snap trimmer goes: hey,
what's the next object in snapshot 1? This time it's bar, and we walk through the same process for bar. It does it again: what's the next object? It's baz; walk through the same process for baz. Then we're done; we say, hey, what's the next object in snapshot 1, and the answer is: there isn't one. Maybe we're now deleting snapshot 2, or maybe we're just done and the snap trimming can stop for a while.
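From the outside you can see which snapshots a pool considers deleted but still subject to trimming; on releases of this era the OSD map records them per pool as an interval set:

```sh
# Pool entries in the OSD map include a removed_snaps interval set of
# snapshot IDs that have been deleted.
ceph osd dump | grep removed_snaps
```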
Sometimes it can be a lot more work than that. Sometimes XFS doesn't have any of that metadata or the directory entry in memory, so it needs to go fetch it. Sometimes XFS has journaled up, say, fifty unlinked files, and then you give it the fifty-first one and it goes: oh, no, now I need to actually go unlink things out of the folders I have in other places on the hard drive.
So it's a little unpredictable when scheduling this. It's a lot better in BlueStore, because all the metadata in BlueStore is just coalescible into the LevelDB instance, so it's sort of an amortized lookup and then an amortized write of the new keys. In particular, Ceph has historically had problems with throttling these trim operations because of the way we think XFS works, where it says 'all right, unlink this file' but hasn't actually done it inside of XFS, and so the work just pops up later on as something much bigger. So controlling the snap trimming in RADOS is very important.
As for the ways you control it: Hammer has sort of the classic version of snap trimming, the one that people who have used it a lot have had some trouble with, but it had the first, rudimentary controls. There were two main switches. You could change the maximum number of snap trims that it would be doing at a time; that is the number of files that every PG in the OSD would be giving to XFS to remove at once. So, say you have a lot of PGs, let's say 30 PGs that you're the primary for; with the defaults you'll be giving your XFS about 60 things to remove at once.
Then it sleeps for a configurable number of seconds. That defaults to off, but a lot of people have tuned it from 10 milliseconds up to like 5 or 10 seconds, even, because they didn't have very many objects but they just needed the trimming to be very, very background.
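On a Hammer-era cluster, those two switches look roughly like this; the option names are the ones I believe correspond to what's described above, injected into running OSDs, and the values are only examples:

```sh
# How many clones each PG hands to the backend to remove at once.
ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 1'

# Seconds to sleep between batches of trim work (0, i.e. off, by default).
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.5'
```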
In the Jewel release we made a lot of improvements. We moved the snap trimming from its own separate worker pool of threads, where it just contended with client I/O down in the disk layer, into what we call a unified op queue, where client I/O goes in, snap trimming goes in, and backfill and recovery go in, all through the same set of threads and the same queue. So we can prioritize them and say: okay, given the cost of doing all these operations and their priority to the administrator, what order do we want to go in? With that, you can set the snap trim priority; it defaults to 5, which is pretty low (client ops are 63, which is sort of the max). You can also specify how expensive you want to consider a snap trim to be, and it defaults to one megabyte of cost, which frequently is a little more expensive than it needs to be, but sometimes it's not quite enough.
You can still specify the concurrent snap trims and you can still specify the snap trim sleep, but the sleep was really embarrassing, because if you turned it on, it actually blocked the op thread that client I/O went through whenever it slept. So you could set a snap trim sleep of half a second and then no I/O would happen for that half second, including all of your clients', and it was bad. So you shouldn't do that. But someone pointed out this bug, and we did fix it.
The fix is in the upcoming 10.2.x point release, and that release also has a few new things. In addition to making snap trim sleep work properly, we added a new configuration option that specifies how many PGs the OSD will trim at a time. With these options, all the users that I'm aware of who have tried them are really happy with the way trimming works, because previously, if you deleted a snapshot that had a lot of objects in it, your throughput would just go away for a while; we'll look in a minute at why that happened. With these settings they managed to turn the trimming down far enough that it wasn't a problem.
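With the Jewel-era unified queue, the relevant knobs look roughly like this; again, the option names are the ones I believe map to what's described above, and the values are just examples:

```sh
# Priority of snap trim work in the unified op queue (client ops are 63).
ceph tell osd.* injectargs '--osd_snap_trim_priority 5'

# How "expensive" one snap trim is considered to be, in bytes (default 1 MB).
ceph tell osd.* injectargs '--osd_snap_trim_cost 1048576'

# How many PGs an OSD will trim at a time (added alongside the 10.2.x fixes).
ceph tell osd.* injectargs '--osd_max_trimming_pgs 2'

# The sleep now throttles only trim work instead of blocking the op thread.
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'
```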
The upcoming Luminous release has the same tunables as the previous one. So, there are some consequences to snapshots and the way they work. Every I/O to an object in a snapshot that hasn't already been registered as part of that snapshot copies the object, when you're using XFS. So if you're benchmarking random I/O: we occasionally have people come on the mailing list and say, hey, I took a snapshot and now my random I/O FIO benchmark is running at a thousandth the speed it was before. And we're like, well, yeah, that's because you're copying every object on every access, because you're taking a snapshot every second, and you're never going to win that race.
In general this is amortized across I/Os, so, you know, it works out as long as you don't take snapshots too fast for what your cluster can do and for the workload you're applying to it. So again, it's amortized, but if you take a snapshot of an RBD volume of a thousand objects and you write to every object and then you delete the snapshot, you've got about a thousand I/Os of work, maybe two thousand. And if you have ten primary OSDs with hard drives that can do 100 IOPS each, then that's a second of cluster throughput to delete that snapshot. Now, assuming you're using the defaults or have set up the snapshot trimming tunables as well, that won't be one solid second; it'll be distributed.
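As a rough back-of-the-envelope version of that, using the numbers from the example above:

```latex
\text{trim ops} \approx 1000\text{--}2000
  \quad\text{(one or two backend operations per snapshotted object)}
\text{cluster capacity} \approx 10 \text{ primary OSDs} \times 100 \text{ IOPS} = 1000 \text{ ops/s}
\text{time} \approx \frac{1000\text{--}2000}{1000} \approx 1\text{--}2 \text{ seconds of cluster throughput}
```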
But it is something you have to start thinking about, in those terms, when you're doing your cluster capacity planning. You'd better not create a cluster and then ask it to absorb an hour's worth of snapshot creates and an hour's worth of snapshot trimming every day if the cluster is already running at full capacity for 23 hours out of the day; you need to design the system to accommodate that.
So that's how snapshots work in CephFS and RBD. We also have this other thing called pool snapshots, which I made in my first year or two and which I'm a little sad about. The goal with pool snapshots was to make things easy for admins. I think these might have existed before RBD was even a thing, but after we created the RADOS Gateway. So the idea was, you know, maybe we want to make this thing so that admins can take copies of the current state of their cluster, and we want to use the same implementation inside of the OSD, and a really easy way to do that is to just put the snapshot in the OSD map and let it spread.
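Pool snapshots are driven entirely from the admin side; a minimal sketch, with pool and snapshot names made up:

```sh
# Create, and later remove, a pool-wide snapshot.
ceph osd pool mksnap mypool before-migration
ceph osd pool rmsnap mypool before-migration

# Read an object's contents as of that pool snapshot.
rados -p mypool -s before-migration get someobject ./someobject.old

# Note: a pool that uses pool snapshots can't also use the self-managed
# snapshots that RBD and CephFS rely on, and vice versa.
```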
There were some problems with that, though. Unlike our other snapshotting mechanisms, pool snapshots are not point in time. You can't use the real RBD snapshots, which are per volume and which are used by some of the replication systems people have built, on the same pool, and you can't use pool snapshots on a pool where you're using CephFS. Also, because it's pool wide, covering every object in the pool, snapshot trimming is a lot more expensive than for most per-snapshot removals. We throttle a lot more effectively now, so it's better, but it does mean that your pool sort of
has these giant, not consistency points exactly, but points where, when you do remove one, it touches a whole lot of data throughout the system. So you might have a use case for pool snapshots, there are some, but they're unlikely to be what you're after if you're looking at them, and so you should talk to the mailing list, or your support person or whatever, about what your goals are and what the right way to accomplish them is. There are also a few pain points in CephFS snapshots.
It's just a thing that hasn't been fully done; it's kind of hard. We know how to fix it, but it's still queued up, because other things, like multiple active MDSes, got prioritized in the last planning round. There are also a few hard edges and some narrow bugs when you have various combinations of features turned on, so CephFS snapshots aren't considered generally stable yet. I'm not sure if the file system team is turning them on for Luminous or not, but they certainly aren't on in Jewel. That said, you know, they're coming along, they're nice most of the time, and there are some good use cases for them, which I should have ordered next but are instead on the next slide.
So, in RBD: there's a doc page about how to use snapshots, and it's pretty simple. You run the rbd command and you say snap create, with this snapshot name on this RBD image, and it takes a snapshot of the image. You can also clone an image from a snapshot. When you do that, you've got your image foo, which you've snapshotted, and you can make an image bar that's in a different pool somewhere, one that might have different speed requirements or different durability requirements or something, or it might just be that you want a copy. Then you have this new image bar that starts off the same as foo was at its snapshot, but that diverges as you do writes.
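The command sequence for that looks roughly like this; pool, image, and snapshot names are examples, and on releases of this era a snapshot has to be protected before it can be cloned:

```sh
# Take a snapshot of an RBD image.
rbd snap create rbd/golden@v1

# Protect it so clones can depend on it, then clone it, possibly into
# a different pool.
rbd snap protect rbd/golden@v1
rbd clone rbd/golden@v1 vms/web01-disk

# Roll an image back to one of its snapshots.
rbd snap rollback rbd/golden@v1

# Remove a snapshot once nothing depends on it any more
# (flatten or delete the clones and unprotect it first).
rbd snap rm rbd/golden@v1
```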
There are some nice use cases associated with that. You can create a golden image, and then every time anyone wants a new volume it's just an overlay on top of your old image. You can take a snapshot right before you do an OS or a big package upgrade, and if the upgrade fails you can clone the snapshot and just resume from it.
If you want to take backups for your clients without them noticing, you can get a point-in-time-consistent hard drive image. It's not an fsfreeze-and-flush kind of safe, but it is crash consistent, and you can use that to back up somewhere else outside of RBD, or across to another RBD cluster or something, and you can use it in various ways to transparently migrate VMs around between pools or clusters and so on.
In CephFS it's a simple mkdir. By default everyone on the cluster can create snapshots, but you can limit it by UID range if you want to. You can use it for pretty much anything you want read-only data for. You can create point-in-time backups of a directory before making big changes; that does work with open files, in that as long as the data has been written into CephFS, the clients will flush it out correctly. You can use it as a poor man's git that works okay with binary data. You can use it as a basis for copying consistent data around. You can take snapshots of the home directories every day, to let your users ask you to recover files for them, or to do it themselves. And the Manila project, the file-share-as-a-service system in OpenStack, uses CephFS snapshots for whatever its users use snapshots for.
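A tiny sketch of that recover-a-file use case from a user's point of view, again assuming a CephFS mount and made-up names:

```sh
# Nightly snapshot of a home directory, taken by cron or by the user.
mkdir /mnt/cephfs/home/greg/.snap/nightly-2017-05-01

# Oops, deleted a file; pull it back out of last night's snapshot.
cp /mnt/cephfs/home/greg/.snap/nightly-2017-05-01/thesis.tex \
   /mnt/cephfs/home/greg/thesis.tex
```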
And we have come to the end of my slides a little early, so I'll take questions now. Or maybe I won't. [Audience question.]
One of the rough edges in the file system right now: CephFS supports a thing that we call recursive statistics, rstats, where usage information gets propagated up into directories. So when you look at a directory, instead of it being four kilobytes because that's the size of a block, it'll say: oh hey, there are 10 gigabytes of data in my descendants. What I think we'll probably end up doing is hooking snapshots into that, to have something like a snapshot rstat saying...
[Audience question.]
So, the interval set is a particular kind of data structure, and it's nice for snapshots because, if you've deleted all of the snapshots 0 to 100, it takes two integers to represent: it says, starting at 0, this set contains 100 entries. So as long as you delete snapshots from the tail moving forward, it stays a very small structure. If you have a more complicated backup scheme it can grow more, but it hasn't been a big problem for users.
Are there plans for the ability to create consistent snapshots of multiple RBD volumes? I believe that's a blueprint in progress, but I can't talk about it very much. There's a mechanism... yeah, someone, from Mirantis I believe, is working on a consistency groups feature for RBD volumes. Oh, Jason's here; sorry, Jason. You should ask him about things like that.