Description
Presented by: Anthony D'Atri
Full Ceph Month schedule: https://pad.ceph.com/p/ceph-month-june-2021
A: Intel QLC and the cost picture of Ceph on NVMe; first, the legal disclaimers.
A: Start with cost. TLC, excuse me, I'm still waking up and clearing my throat this morning. The QLC TCO crossover is coming soon, or is already here today; they're competitive now, especially if you consider some subtle factors that some of the online TCO calculators don't include, like the impact on your service, you know, how well your service can run with HDDs. Some people short-stroke HDDs, or limit the size of the HDDs they'll use, because of the interface bottlenecks and the recovery times that they experience with HDDs.
A: Other costs include terabytes per chassis, terabytes per RU, terabytes per watt, and the operational expense of swapping drives, of having to RMA and crush them.
A: A lot of operations, for example, won't bother RMAing something worth under five hundred dollars because of all of the hassle and effort; at some places it costs that much just to get somebody through the door if it's off, you know, in a remote site.
A: These are available in capacities up to 30 terabytes, and you can fit up to 1.5 petabytes of raw space per rack unit with the E1.L EDSFF ruler drives. You know, having come from a time (cough) when my first hard drive was a five megabyte, 14-inch pack, the density we can get today just blows me away. And the IOPS advantage that SSDs give you allows flexible capacity provisioning, where you don't have to provision for IOPS anymore; you can provision for your capacity.
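As a rough sanity check on that density figure, here is a quick back-of-the-envelope calculation in Python, using only numbers quoted in this talk (the 108-drive 2U chassis comes up later in the Q&A) and ignoring decimal-versus-binary rounding:

```python
# Back-of-the-envelope density check using figures quoted in this talk:
# 30 TB-class E1.L QLC drives and the 108-slot 2U chassis mentioned in the Q&A.
drive_tb = 30.0      # decimal terabytes per drive
drives = 108         # E1.L slots in the 2U system discussed later
rack_units = 2

raw_pb_per_ru = drive_tb * drives / rack_units / 1000
print(f"~{raw_pb_per_ru:.1f} PB raw per rack unit")  # ~1.6 PB/RU, in line with
                                                     # the "up to 1.5 PB per RU" figure
```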
A: Performance. SSDs, including QLC, are fast and wide. The Intel D5-P5316 NVMe QLC delivers up to 800K 4K random read IOPS, a 38% increase over the previous generation, and sequential read throughput more than double that of Intel's previous-generation QLC.
A
This
sata
drives
saturated
about
550
megabytes
per
second
and
the
pci
gen4
nvme
interface
crushes
the
the
side
of
bottleneck.
Two
more
osbs
per
device
improved
throughput,
iops,
intel
latency.
This
is
common
practice
with
nvme
drives.
The
nvme
interface
allows
you
to
do
that
and
you
know
work
around
some
of
the
serializations
and
the
code
and
so
forth.
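The talk doesn't name a specific tool for splitting a device, but one common approach is ceph-volume's batch mode; this is a minimal sketch of that idea, with the device path and OSD count as placeholders:

```python
# Sketch: create two OSDs on one NVMe device with ceph-volume's batch mode.
# The device path and OSD count below are illustrative placeholders.
import subprocess

def make_osds(device: str, osds_per_device: int = 2) -> None:
    """Run ceph-volume to split one NVMe device into several OSDs."""
    subprocess.run(
        ["ceph-volume", "lvm", "batch",
         "--osds-per-device", str(osds_per_device),
         device],
        check=True,
    )

# make_osds("/dev/nvme0n1", 2)   # uncomment on a host that should own this device
```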
A: Operational advantages. The RGW service for Ceph object storage, for example, is prone to hot spots and QoS events. One strategy to mitigate latency and interface bottlenecks is to limit the size of the HDDs used, for example to eight terabytes. Other things one can do are to limit scrub intervals, put a CDN on the front end, or throttle or cache on the load balancers in front of multiple RGWs; but sometimes OSDs end up waiting, especially when using EC on HDDs.
A: Replacing HDDs with Intel QLC SSDs for bucket data can markedly improve the QoS and serviceability of your clusters. Reliability and endurance: QLC reliability is better than you think, and it's actually more than you need, it turns out. Most SSD failures are firmware, which you can often fix in place on the drive.
A
You
know
which,
in
my
experience,
is
not
as
true
with
hdds
and
studies,
show
that
99
of
ssds
never
actually
exceed
15
of
their
rated
endurance,
which
was
a
bit
of
an
eye
opener.
Having
come
from
an
assumption
that
you
needed
one
theoretical
drive
rate
per
day
to
do
rbd,
for
example,
the
rgw
service,
you
know
in
one
case
has
been
calculated
for
seven
years
of
endurance
using
the
previous
generation
qlc.
A: Endurance: get with the program/erase cycle. That's a NAND joke, everybody, Google it. The 30 terabyte Intel SSD D5-P5316 QLC SSD, much like the Illudium Q-38 space modulator, is rated at over 22 petabytes of IU-aligned random writes; a one-drive-write-per-day 7.68 terabyte TLC SSD is rated at less than 15 petabytes of 4K random writes.
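To put those two ratings side by side, here is a rough Python check of the petabytes written implied by a drive-writes-per-day (DWPD) rating; the five-year rating period is an assumption the talk doesn't state, and IU-aligned versus 4K-random ratings aren't strictly comparable:

```python
# Rough petabytes-written implied by a DWPD (drive writes per day) rating.
# Assumes a five-year rating period, which the talk does not state explicitly.
def rated_pbw(capacity_tb: float, dwpd: float, years: float = 5.0) -> float:
    return capacity_tb * dwpd * 365 * years / 1000  # PB written over the period

print(rated_pbw(7.68, 1.0))    # ~14.0 PB for a 1-DWPD 7.68 TB TLC drive,
                               # consistent with the "< 15 PB" figure above
```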
A: If you need to, you can adjust it. Reliability and opex: drive failures cost money and quality of service. I've run Ceph clusters where, you know, a drive failure sets off escalations, and, you know, the fewer times any of us have to be awakened at four in the morning...
A: The better eight terabyte HDDs have a 0.44 percent annual failure rate spec, but turn out to have, on average, a one to two percent actual failure rate in production. Intel's DC QLC NAND SSDs, in practice, have an average failure rate of less than 0.44 percent.
A: They have a greater operating temperature range and a better UBER (uncorrectable bit error rate), and, you know, consider the cost to have hands replace a failed drive; I've been in situations where it was four hundred dollars just to get somebody through the door.
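To see why that AFR gap shows up in opex, a small Python sketch; the fleet size is a made-up placeholder, while the AFR and per-visit figures are the ones quoted above:

```python
# Expected annual drive replacements and remote-hands cost for a fleet.
# The fleet size is an illustrative placeholder, not a figure from the talk.
def annual_replacement_cost(drives: int, afr: float, cost_per_swap: float) -> float:
    expected_failures = drives * afr
    return expected_failures * cost_per_swap

fleet = 1000
print(annual_replacement_cost(fleet, 0.015, 400))   # HDDs at ~1.5% AFR  -> $6,000/yr
print(annual_replacement_cost(fleet, 0.0044, 400))  # SSDs at <0.44% AFR -> $1,760/yr
```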
A: Here are some figures comparing publicly available numbers: a couple of hard drive models; in the middle, a TLC SSD model; and then QLC SSDs, with random and sequential writes, and with a figure showing additional over-provisioning. This shows just how much improvement in endurance you can get by adjusting your over-provisioning.
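The slide itself isn't reproduced here, but the arithmetic behind the idea is simple; this sketch assumes the host-side approach of deliberately using only part of the device, which is an assumption for illustration since the talk doesn't prescribe a mechanism:

```python
# Effective over-provisioning when only part of a drive's capacity is used.
def overprovision_pct(physical_tb: float, used_tb: float) -> float:
    """Spare capacity as a percentage of the capacity actually exposed to writes."""
    return (physical_tb - used_tb) / used_tb * 100

print(overprovision_pct(30.72, 30.72))  # 0% extra beyond the drive's built-in spare
print(overprovision_pct(30.72, 28.0))   # ~9.7% additional over-provisioning
```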
A: Optimizing endurance and performance with coarse-IU QLC SSDs: it is beneficial to align the BlueStore min_alloc_size to the IU size, the indirection unit size of the drive, which is 64K for the new drives and 16K for the drives that are already out in the field. Writes that align to IU boundaries, or multiples of them, enhance performance and endurance, and you get that by simply aligning the BlueStore min_alloc_size.
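A minimal sketch of that alignment, assuming the 64 KiB IU quoted above; bluestore_min_alloc_size only applies to OSDs created after it is set, so verify the value for your drive generation and set it before building the OSDs:

```python
# Sketch: align BlueStore's allocation unit to the drive's indirection unit (IU).
# 64 KiB matches the newer coarse-IU QLC drives discussed above; 16 KiB the older ones.
# The setting takes effect at OSD creation (mkfs) time, so set it before building OSDs.
import subprocess

IU_BYTES = 64 * 1024  # assumed IU; confirm against your drive's datasheet

subprocess.run(
    ["ceph", "config", "set", "osd", "bluestore_min_alloc_size_ssd", str(IU_BYTES)],
    check=True,
)
```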
A: You can't, you know, very well align your metadata exactly, but that turns out to be a small percentage of the overall workload, and it doesn't have a large impact on the drives' endurance, because, again, the drives are so big. Some example use cases: RGW large objects; RBD when used for backup, archive, and media (I've experienced RBD used both ways); and CephFS, which has a four megabyte block size, so it seems as though QLC could be a good fit for CephFS, but that's testing we still need to do. Next slide: additional optimizations.
A
There
are
more
things
that
we're
exploring
aligning
the
the
roxdb
block
size
to
the
iu
size
and
exploring
rocks
db
universal
compaction.
A
I
believe
neither
of
those
is
is
by
default
and
they
may-
and
you
know
they
should
give
us
better
performance
and
better
right
amplification
at
the
expense
of
you
know,
using
a
bit
more
space.
But
again
when
you
have
30
terabytes,
maybe
that's
a
maybe
that's
a
good
trade-off.
Their
other
roxdb
tuning
may
be
beneficial.
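On the Ceph side, the knob where such RocksDB experiments would be expressed is the bluestore_rocksdb_options string; the value below is only a placeholder for the universal-compaction idea, the accepted keys depend on the Ceph and RocksDB releases, and setting the option typically replaces the default string rather than appending to it, so validate before use:

```python
# Sketch: where RocksDB tuning for BlueStore would be expressed.
# The option string below is a placeholder, not a validated recommendation;
# confirm the accepted keys for your Ceph/RocksDB versions, and note that
# setting this option generally replaces Ceph's default option string.
import subprocess

rocksdb_opts = "compaction_style=kCompactionStyleUniversal"  # universal compaction

subprocess.run(
    ["ceph", "config", "set", "osd", "bluestore_rocksdb_options", rocksdb_opts],
    check=True,
)
```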
A: Another route is to use Optane to accelerate the WAL and DB on a separate device, or to engage in other write-shaping activities. Crimson, you know, may bring us some optimizations for coarse-IU drives as well. And for RGW we're looking at separate pools for large and small objects with EC, especially with raising the BlueStore min_alloc_size, because RGW's fit for small objects isn't great and you can end up with a bunch of stranded space.
A: There's talk of there being scaffolding within RGW to have multiple pools behind the scenes, more transparently, and so those are some additional things that we can look for in the future. And here's a bunch of references, although PowerPoint really, really likes to make these gray; I'm still learning PowerPoint. And that's the end of the story.
B: Questions. I have some questions. Towards the beginning of the talk you mentioned the new ruler form factor SSDs that let you fit one and a half petabytes in a rack unit, which blows my mind too. Are those ZNS devices, or...?
A: Yuyang, maybe you can chime in here. These are not ZNS; I think... I'm not sure if we're shipping ZNS yet. Those are, you know, the long-format ruler drives.
A: The 30 terabyte, and Supermicro offers, in this fine print that you can't read, Supermicro offers a system that gets you around 1.4 petabytes, about one petabyte raw, and somebody else that I found offers a system that has additional ruler drive slots on the back, so, you know, it's a fairly deep chassis, because you have these long drive bays on the front and the back.
A: But as I calculate it, that's a 2U system that can take, you know, it can take 108 drives, I think, and so, you know, if you do the math, modulo base-2 versus base-10 rounding, in that case you get somewhere in the neighborhood of 3 petabytes of raw space in 2U. In situations where I had to fight tooth and nail to get a single rack unit, you know, literally a PoP that was in a closet in Sofia, Bulgaria, for example, I'm very focused on minimizing rack units.
C: Okay, yeah, sorry, I had some technical issues at the beginning; first it was the audio, now it's the microphone. So, this is Yuyang. Hi, Sage. So yes, I mean, no, this is not a ZNS device. The QLC drive that Anthony introduced today is block-device only, but it's using the coarse indirection unit, meaning that, you know, it's optimized for larger block sizes than the traditional 4K.
C: ZNS, you know, is under development at most of the vendors at this moment; I don't think there is a product that's already, like, available to the broad market yet.
C: Yeah, definitely, we do have a plan to offer the ZNS feature in the future with the ruler form factor.
B: Cool, okay, thanks. A couple of slides after that you mentioned over-provisioning to balance durability versus capacity. Is that, like, an explicit management operation, where you have to instruct the device to do that? Or is it just a matter of writing to a constrained set of LBAs in order to avoid that trade-off?
B: Cool. A couple of slides after this you're talking about reliability, and you mentioned the 0.44 percent AFR for SSDs; but earlier you'd mentioned that usually when SSDs fail, it's, like, a firmware fix and not, like, an RMA hardware failure. Is that AFR...?
C: I can take this one, yeah. So the AFR target for SSDs includes all the failures, not only firmware, although firmware may account for the majority of it. There's also, you know, hardware failure, and also, you know, different kinds of failures for SSDs. Usually, for Intel's NAND SSDs, we have an annual failure rate target of less than 0.44 percent.
C: Well, most of the time we are able to beat this target, and in our QLC NAND example we are way below this 0.44 target. And the reason is, you know, so for QLC NAND right now, Intel is the only one that's using floating gate technology, and, you know, as far as I know, nobody else is using floating gate SSD technology nowadays.
C: So this floating gate technology is really good for having large-capacity NAND components, as well as really fast NAND components, and also reliable NAND components that, you know, have a really good retention rate for their data. So that's why, you know, Intel's QLC SSD has a really low annual failure rate, which is quite different from what the market thinks: usually when people think about QLC, they think the quality is lower than TLC, which is not true at all.
C: The other technology, different from floating gate, is charge trap. So, you know, floating gate means you have two gates in each NAND cell, and you use that floating gate to store the zeros and ones of the data.
B: Okay. And then I guess the last question was: you had a slide where we were talking about the IU and the block size.
B: Yes, yep. I guess maybe it's just... yeah, one more back; one more... there we go. The question was: is the IU something that is exposed in, was it the kernel block device properties in sysfs, or wherever it is, so that you can tell what the preferred write unit size is? Like, is this something that we can make BlueStore automatically detect on the device, so that it automatically sets bluestore_min_alloc_size accordingly? Or is it something that you have to know?
C: Yeah, the answer is yes. I can't remember the, you know, the thing that you have to read, but definitely it is reflected. Okay.
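Neither speaker recalls the exact attribute on the call; as a hedged pointer, the Linux block layer does export per-device I/O size hints under /sys/block/<dev>/queue/, though whether a given coarse-IU drive reports its IU there needs to be verified per device. A small Python sketch:

```python
# Read the kernel's I/O-size hints for a block device. Whether a coarse-IU NVMe
# drive actually reports its indirection unit here must be verified per device.
from pathlib import Path

def io_hints(dev: str = "nvme0n1") -> dict:
    q = Path("/sys/block") / dev / "queue"
    return {name: int((q / name).read_text())
            for name in ("logical_block_size", "physical_block_size",
                         "minimum_io_size", "optimal_io_size")}

print(io_hints())  # e.g. {'logical_block_size': 512, ..., 'optimal_io_size': 0}
```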
B: Okay. And in that case, a pull request that makes BlueStore do that during mkfs would be extremely welcome, because at the end of the day we don't want users to have to know any of this, right? We want them to...
A: You know, I've seen people, you know, using object storage to mirror their git repositories, but then, you know, also to store, you know, TV shows dubbed into Portuguese, right? So a lot of it depends on your workload.
A: I will certainly work on that; it's a very good idea, automatically setting that, and I'll talk to our internal folks who are working on it.
B: The option, or even, like, a health warning or something if this allocation size is low. I don't know exactly how we'd want to do it, but giving some, providing visibility here, and ideally some sane default behavior, would be great.
B
And
the
one
other
thing
I
mentioned
here
is
that
I
mean
you
call
out
that
the
rock's
db
workload
is
a
pretty
small
fraction
of
this,
which
is
is
true,
but
there's
also
a
bluefs
allocation
size
tunable
that
could
be
set.
So
we
could
also
tell
bluefest
to
respect
a
larger
box
size
as
well.
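For completeness, a sketch of checking the BlueFS knob being referred to; the option name and its default vary by Ceph release (newer releases split it into shared and dedicated variants), so treat the name and the 64 KiB IU as assumptions to confirm against your version:

```python
# Sketch: inspect the BlueFS allocation size and keep it a multiple of the drive IU.
# Option names vary across Ceph releases (e.g. bluefs_shared_alloc_size in newer
# ones); confirm with `ceph config help bluefs_alloc_size` before acting on this.
import subprocess

IU_BYTES = 64 * 1024  # assumed indirection unit for the newer drives

out = subprocess.run(
    ["ceph", "config", "get", "osd", "bluefs_alloc_size"],
    capture_output=True, text=True, check=True,
).stdout.strip()

print(f"bluefs_alloc_size = {out}; keep it a multiple of the {IU_BYTES}-byte IU")
```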
A: Yeah, I actually asked one of our internal folks, who was described to me as a BlueStore expert and who works on that, I think he's in the PRC, and he thought that it wouldn't make a difference, or that the bluefs_alloc_size...
D: All right, well, thank you for the presentation, and thank you, everybody, for joining us for our final day of week two of Ceph Month. Be sure to join us for week three, and recordings of these sessions will be posted on the Ceph YouTube channel today as well.