From YouTube: Ceph Performance Meeting 2022-03-03
Description
Open Cache Acceleration Software: https://open-cas.github.io
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A
Okay, well, it looks like we've got a couple people here already, so we'll just maybe get started. So, Mikhail, feel free to share your screen and I'll turn it over to you.
B
All right, so let's start then. Hello, everyone. Once again, my name is Mikhail Sadinsky. I'm a cloud software architect at Intel, based in Poland, and my work is focused around caching, around storage caching, including Open CAS, and this is the topic that I want to show you today: what Open CAS is and what the features are.
B
All right, so a quick agenda. First I will do a very high-level Open CAS overview: I will talk about what Open CAS is and how it can be used. Then I will go through the features from a user or administrator standpoint. And finally, I will do a quick intro to the Open CAS architecture, a bit of technical detail, but not too much. All right, so let's start: what Open CAS is.
B
The important thing to note here is that you don't need to provision the backend device in any way. You can take your existing HDD with data, let's say, attach it to Open CAS, and all your data will still be there and can still be accessed; there is no data loss during this procedure. After you finish that quick startup, that quick configuration, all your data will be visible from a new block device like cas1-1, as you can see on the screenshot here from lsblk.
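A minimal sketch (not from the talk) of the quick setup just described, assuming the casadm utility from Open CAS Linux is installed; the device paths and the long option names (--start-cache, --cache-device, --add-core, --core-device, --cache-id) are assumptions based on the casadm tool mentioned later in the talk.

```python
import subprocess

def casadm(*args):
    """Run a casadm command and fail loudly if it returns non-zero."""
    subprocess.run(["casadm", *args], check=True)

# Start cache instance 1 on a fast device (placeholder path).
casadm("--start-cache", "--cache-device", "/dev/nvme0n1", "--cache-id", "1")

# Attach an existing HDD with data as a backend ("core") device; its data is
# untouched and becomes accessible through the new virtual device, e.g. /dev/cas1-1.
casadm("--add-core", "--cache-id", "1", "--core-device", "/dev/sdb")
```

After this, lsblk should show the new cas1-1 device, as on the screenshot the speaker refers to.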
B
I mean a one-to-many configuration, in which case you use a single cache device to accelerate multiple backend devices, and in this case the cache is shared. So if there is a need for more cache for one backend device, it can be borrowed from another one. It's simply one cache area accelerating multiple hard drives, or multiple other types of backends.
B
It's also possible to stack one cache instance on top of another. I don't have this on the slide, but CAS devices, I mean those cas1-1 or cas1-2, are regular block devices, so they can be used as input for another cache instance. So, for example, you can have a very, very fast cache on the very top, and you can use that very fast device to accelerate another cache instance that uses a slower cache device, and you end up with a kind of tree of cache instances.
B
And, as I mentioned, it's fully transparent; there's no data loss on the backend devices. How does it fit into Ceph, then? Since Open CAS exposes regular block devices, you can simply deploy an OSD on top of the CAS device. Instead of deploying it directly on a hard drive, you configure the cache instances and then start to use those exposed virtual devices instead of the original ones. And there are two possibilities again: the first picture is for one-to-one and the second one is for one-to-many deployments. In the first one,
B
Both approaches have pros and cons, but since in Ceph the traffic across OSDs should be more or less evenly distributed, the better choice seems to be the one-to-one deployment, because the cache metadata is not shared and that gives some performance optimization. A shared cache is more for cases where the traffic between backend devices is not distributed evenly.
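A hedged sketch (not shown in the talk) of what the one-to-one layout looks like from the Ceph side, assuming ceph-volume is available and that /dev/cas1-1 is the exposed CAS device sitting on top of this node's HDD:

```python
import subprocess

# Create a BlueStore OSD on the virtual CAS device instead of the raw HDD.
subprocess.run(
    ["ceph-volume", "lvm", "create", "--bluestore", "--data", "/dev/cas1-1"],
    check=True,
)
```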
C
B
No, we have more than one caching mode; we have four. We have write-through, we have write-back. As for data loss: at this point I mean that when you start a new cache instance, you don't need to do any kind of provisioning on the backend, so you can take an existing block device with existing data, plug it into an Open CAS instance, and you will still have access to this data through that Open CAS instance.
B
Okay, so let's get to the Open CAS features, what Open CAS can offer you. So, cache modes. Cache modes are the simplest feature that every caching software should have. We currently have four caching modes. Write-through is probably the simplest one: in this case, your data is always in sync between the cache and the backend storage, and because of that it accelerates only reads; writes are always targeted to both the cache and the backend storage, so there is no acceleration of writes. As for write-back:
B
Both reads and writes are accelerated, but in this case some data might be out of sync between the cache and the backend storage. So in case you want to stop the cache instance, you should first flush the dirty data, the data that is out of sync with the backend storage, and then you can safely detach the backend device from the cache instance. There is also background flushing, background synchronization.
B
With the write-only cache mode, we insert data into the cache only when handling write requests; read requests are handled in a pass-through manner, as they are served directly from the backend storage, of course assuming that the data for a given LBA is not in the cache already. In case it is in the cache, we of course return the data from the cache, because it's the most recent copy. And then there is write-around.
B
Write-around also accelerates only reads, so in this manner it's similar to write-through. The difference is that data is inserted into the cache only during read handling; writes are targeted directly to the backend storage, so it's kind of the opposite of write-only.
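A hedged sketch of switching between the cache modes described above at runtime; the --set-cache-mode command and the short mode names (wt, wb, wo, wa) are assumptions about the casadm interface, not taken from the talk.

```python
import subprocess

def set_cache_mode(cache_id: int, mode: str) -> None:
    # mode: "wt" write-through, "wb" write-back, "wo" write-only, "wa" write-around
    subprocess.run(
        ["casadm", "--set-cache-mode", "--cache-mode", mode,
         "--cache-id", str(cache_id)],
        check=True,
    )

set_cache_mode(1, "wb")  # accelerate both reads and writes; dirty data allowed
```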
B
It applies to the write-only and write-back cache modes, because flushing is responsible for synchronizing data between the cache and the backend storage. In Open CAS, flushing happens in two different ways: you can force Open CAS to flush all the data manually using the command-line utility, but Open CAS also flushes the dirty data in the background, and this background flushing is controlled by a cleaning policy. Currently we have two cleaning policies.
B
The second one is the ACP cleaning policy, which stands for aggressive cleaning policy and is best for HDDs, because this flushing policy tries to sequentialize data as much as possible. It simply selects the dirtiest region of the LBA domain in the cache and flushes that first: it divides the whole LBA range of the backend device into chunks, then flushes those chunks in order of their percentage of dirty data, so there is the highest possible probability of forming sequential regions.
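A hedged sketch of the two flushing paths just mentioned, assuming casadm's --flush-cache command and a --set-param interface for the cleaning policy; the policy name "acp" is taken from the talk, the exact flags are assumptions.

```python
import subprocess

def casadm(*args):
    subprocess.run(["casadm", *args], check=True)

# Force a manual flush of all dirty data for cache instance 1.
casadm("--flush-cache", "--cache-id", "1")

# Select the aggressive cleaning policy (ACP) for background flushing,
# e.g. when the backend devices are HDDs.
casadm("--set-param", "--name", "cleaning", "--cache-id", "1", "--policy", "acp")
```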
B
But it's not always the best strategy, because there might be cases where you have some important data in the cache, but from time to time very random traffic comes along that just touches blocks which are never used again. For that we have the nhit promotion policy, which inserts data into the cache only when a given LBA, a given block, was accessed more than a specific number of times within some time window. You can configure how many times a block should be accessed in order to be inserted into the cache.
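A hedged sketch of enabling the nhit promotion policy described above; the --set-param names ("promotion", "promotion-nhit") and the --threshold option are assumptions about the casadm parameter interface, not something stated in the talk.

```python
import subprocess

def casadm(*args):
    subprocess.run(["casadm", *args], check=True)

# Switch promotion from the default ("always insert") to nhit.
casadm("--set-param", "--name", "promotion", "--cache-id", "1", "--policy", "nhit")

# Require a block to be accessed at least 3 times before it is inserted.
casadm("--set-param", "--name", "promotion-nhit", "--cache-id", "1", "--threshold", "3")
```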
B
So, for example, you can classify data based on request size or LBA ranges. If you are using a file system, then you can classify based on file-system-related attributes like file name, directory or file size, and on some process-related things like PID or process name. And recently we also added support for the write lifetime hint, which is available in kernels starting from 4.12 or 4.13, I don't remember exactly, and it is supported, for example, by RocksDB.
B
It simply allows an application, a user-space application, to specify some priority, the expected lifetime of the data in a given file or LBA, and we can utilize this write lifetime hint in Open CAS to, for example, put data that is expected not to be touched for a long time, static data, directly on the backend storage, while very dynamic data that is very often overwritten is put into the cache with the highest priority.
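A hedged sketch of how a user-space application can attach the write lifetime hint mentioned above to a file; the constants are copied from the Linux uapi headers (F_SET_RW_HINT, RWH_WRITE_LIFE_*) and require a 4.13-or-newer kernel, and the file path is a placeholder.

```python
import fcntl
import os
import struct

F_SET_RW_HINT = 1036          # F_LINUX_SPECIFIC_BASE (1024) + 12
RWH_WRITE_LIFE_SHORT = 2      # data expected to be overwritten soon
RWH_WRITE_LIFE_EXTREME = 5    # data expected to stay untouched for a long time

fd = os.open("/mnt/data/hot.db", os.O_WRONLY | os.O_CREAT, 0o644)
# Tell the kernel (and, through it, a hint-aware cache) that writes to this
# inode are short-lived, so they are good candidates for caching.
fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", RWH_WRITE_LIFE_SHORT))
os.close(fd)
```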
E
Sorry, you said, I think you said that you are able to say which, I don't know, file name or directory will be kept out of the cache. Can you also say which one is the only one that will be in the cache, and all the others would be out of the cache? So one directory you want to be kept, and all the rest are not?
B
Yes, you can. Based on those attributes you can build some classification rules. So, for example, you can say that you want to cache directory A, directory B, directory C, directory D, but you don't want to cache anything else; or you can say that you don't want to cache directories A, B, C and D, but you want to cache everything else.
B
You can also assign different priorities to each directory. You can specify that directory A is more important for you, so you assign priority one to it; directory B gets priority two, and so on. You can even use some logic operations and combine those attributes together, so you might want to cache, for example, all files that are smaller than, let's say, 40 kilobytes and that are placed in directory A.
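A hedged sketch of the kind of classification rules just discussed, written as an Open CAS IO class configuration; the CSV columns, the rule syntax (directory:, file_size:, '&' as logical AND) and the --io-class --load-config command are assumptions based on the Open CAS documentation, and the paths are placeholders.

```python
import subprocess

ioclass_csv = """\
IO class id,IO class name,Eviction priority,Allocation
0,unclassified,22,0
1,directory:/mnt/data/a&file_size:le:40960,1,1
2,directory:/mnt/data/b,2,1
"""

# Class 1: files under /mnt/data/a smaller than 40 KiB, highest priority.
# Class 0 (everything unclassified) is not allocated in the cache at all.
with open("/etc/opencas/ioclass.csv", "w") as f:
    f.write(ioclass_csv)

subprocess.run(
    ["casadm", "--io-class", "--load-config", "--cache-id", "1",
     "--file", "/etc/opencas/ioclass.csv"],
    check=True,
)
```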
B
…avoid caching it, and you can configure when the sequential cutoff should be triggered. We have three options, actually: one is that you want sequential cutoff to be active only when your cache is full; the second option is that you want sequential cutoff always, even when you have a lot of free space in the cache; and the last one is to disable sequential cutoff entirely. And you can also configure
B
the minimum size of such sequential streams. It means you can configure how much sequential data has to be sent to a given block device in order to be treated as a sequential stream, and we can track multiple sequential streams that way, so you can have multiple applications writing to the same block device.
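A hedged sketch of configuring the sequential cutoff behaviour described above; the --set-param names and options (policy always/full/never, threshold in KiB) are assumptions about the casadm interface.

```python
import subprocess

def casadm(*args):
    subprocess.run(["casadm", *args], check=True)

# Treat a per-core stream as sequential once 1 MiB has been seen, and bypass
# the cache for it even while there is still free cache space ("always").
casadm("--set-param", "--name", "seq-cutoff",
       "--cache-id", "1", "--core-id", "1",
       "--policy", "always", "--threshold", "1024")
```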
F
B
F
B
F
B
All right, okay, let's get to the next one: manageability. As I said at the very beginning, Open CAS can be controlled using a command-line utility. This is good if you just want to play with Open CAS, to test it, but in a production environment it's not a very friendly method to use. So we have a configuration file that you can use: you can dump your whole cache layout into it, and then the system startup scripts will re-create the cache configuration on startup.
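A hedged sketch of persisting the cache layout in such a configuration file so that startup scripts can re-create it on boot; the /etc/opencas/opencas.conf path, the [caches]/[cores] section layout, and the device paths are assumptions based on Open CAS for Linux, not details given in the talk.

```python
config = """\
[caches]
## cache id   cache device                     cache mode
1             /dev/disk/by-id/nvme-CACHE_DEV   wb

[cores]
## cache id   core id   core device
1             1         /dev/disk/by-id/ata-HDD_DEV
"""

with open("/etc/opencas/opencas.conf", "w") as f:
    f.write(config)
```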
B
And we also provide statistics for the cache, to monitor the cache. We have about 30 counters showing how the cache is used and by what kind of data. We collect those counters at different levels: you can query for statistics at the whole cache instance level; you can query at the single backend device level, if you are caching multiple devices with a single cache device; or you can check statistics at the IO class level,
B
if you are using IO classification. And you can either output it on the screen in a human-readable format, as tables, or export it to CSV files for some machine processing.
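A hedged sketch of pulling those statistics at the three levels mentioned (whole cache, single core device, single IO class) and exporting them as CSV; the --stats selectors and the --output-format option are assumptions about the casadm interface.

```python
import subprocess

def stats(*selector):
    return subprocess.run(
        ["casadm", "--stats", "--cache-id", "1", *selector,
         "--output-format", "csv"],
        check=True, capture_output=True, text=True,
    ).stdout

whole_cache = stats()                      # whole cache instance
one_core    = stats("--core-id", "1")      # a single backend (core) device
one_ioclass = stats("--io-class-id", "1")  # a single IO class
print(whole_cache)
```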
B
H
B
All right, so let's get a bit more into technical details, a bit into the architecture: how Open CAS organizes its cache space. As a typical cache, we divide the cache space into cache lines.
B
You can configure what cache line size Open CAS should use; we support from 4k to 64k, in powers-of-two steps. It is important to select the optimal cache line size, because all the caching operations we perform, like mapping, eviction and so on, are performed with cache line granularity.
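A hedged sketch showing that the cache line size is chosen when the cache is started; the --cache-line-size option (value in KiB: 4, 8, 16, 32 or 64) is an assumption about casadm, and the device path is a placeholder.

```python
import subprocess

# Start cache instance 1 with a 16 KiB cache line, to roughly match a workload
# whose average request size is around 16 KiB.
subprocess.run(
    ["casadm", "--start-cache", "--cache-device", "/dev/nvme0n1",
     "--cache-id", "1", "--cache-line-size", "16"],
    check=True,
)
```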
B
We don't perform any padding or prefetch. So, for example, if you have a 64k cache line and you request 4k of data, you send a 4k request, then we would read only 4k of data from the backend and put that 4k into the cache, even though the full 64k cache line is mapped to the 64k region on the backend.
B
That prevents inflating the bandwidth to the backend device, and it's also important during flushing: we flush only the dirty sectors. So if you have a 4k cache line, for example, and only one 512-byte sector is dirty in that cache line,
B
we would flush only 512 bytes, not the full 4k cache line, so we avoid write amplification. Because of that, it is important to set the correct cache line size: if your cache line size is too large, you might end up with suboptimal cache utilization, like in this example. You can see that this is a 4k cache line size example, the smallest one, but you see that some sectors are mapped but invalid.
B
Only 20 contain valid data in the cache, and if your workload is, for example, a 4k random workload and you set a 64k cache line size, then your effective cache utilization would be 1/16 of your cache size. On the other hand, if the average request size of your workload is, let's say, 64k and you set a 4k cache line size, nothing bad happens:
B
your cache would be utilized perfectly, but Open CAS would need to track more cache lines than necessary, so it would consume more DRAM for metadata and more CPU cycles for the caching logic. So the optimal configuration is to match your average request size with your cache line size.
B
It's because our whole caching logic is contained in a caching library that we call the Open CAS Framework, which is platform independent, and currently we have two CAS products. One is Open CAS for Linux, Open CAS for the Linux kernel, which I mostly described in this presentation. We also have Open CAS for SPDK.
B
In order to integrate it with some storage stack, you need to wrap the Open CAS Framework into either a driver or your application and provide a top adapter and a bottom adapter. The top adapter is a layer on top of the Open CAS Framework which is responsible for accepting requests from the storage stack, so it needs to understand how the given storage stack sends requests: in the kernel it would be bio-based, in SPDK it would be bdev.
B
There is no data copy; only the IO request description is transformed. We don't copy data, the data always stays in the buffers that the application or the upper layers send to us. On the other hand, there is the bottom adapter, which is responsible for sending requests to the cache and the backend device. Whenever the Open CAS Framework wants to perform some IOs, and it has to perform some IOs, it has to put data into the cache, get it from the backend storage and so on, it doesn't perform them directly, because it doesn't know how to do that.
B
And two examples of how it works, which I partially already described: on the left-hand side you have the architecture for Open CAS for the Linux kernel; on the right-hand side it's for SPDK. So the top adapter, which is part of the Open CAS kernel driver, accepts requests from the block layer in the Linux kernel and translates them into Open CAS Framework requests. The Open CAS Framework performs all the caching logic: it decides where to put data, what to do with the data and so on.
B
Basically everything that a storage cache software needs: the caching engines, support for IO classification, cache partitioning, all the policies that we have (eviction, promotion and cleaning), the background cleaner implementation, and metadata handling. Statistics are also collected and managed inside the Open CAS Framework, and there is an API for management; all of these items are exposed from the Open CAS Framework through its API.
A
Out of curiosity, I know that there was some concern about showing direct benchmarks, but in the past it looked like you guys have seen some advantage from the way that you handle your promotion and evictions.
B
We compared this with dm-cache, and we found that dm-cache, especially for smaller request sizes, generates very high write amplification and read amplification.
B
It's because dm-cache organizes its data in chunks, but those chunks are much, much larger. I don't remember the minimum chunk size for dm-cache, whether it's 32k or higher, but that's not that important. More important is that when you, for example, read or write data, when you send a request that is smaller than the chunk size, dm-cache has to handle the whole chunk, because dm-cache doesn't track
B
A
One of the things I remember when looking at dm-cache is that there was a maximum number of chunks as well. I believe if you exceeded that, then your chunk size was automatically incremented by two; I don't remember what the limit was, maybe a million chunks or something like that, and then your chunk size was automatically increased. Do you recall that, is that correct, am I thinking about that right?
B
Right, yeah, there is a limit on the number of chunks, so you have to choose: if you have a big cache, you may not be able to choose the smallest possible chunk size because of that requirement. So even if the minimum chunk size is 32k, that limit might force you to use, let's say, a 64k or 128k chunk size, which would further increase the write and read amplification.
B
F
B
Memory consumption is directly dependent on your cache size, so on the number of chunks. This is why I mentioned that you should match your cache line size with your average request size, because if you choose too small a cache line size, then you would consume more DRAM than is really required.
B
We are in progress, we are doing some benchmarking right now. We have some preliminary results, but they are not ready to present at this point. We are doing extensive benchmarking right now, and we should have some data soon.
A
There was some benchmarking that was done versus dm-cache maybe five or six months ago. Are those results something that we can eventually share with the community, or would those still be governed by NDA?
B
Yeah, let me check that. I don't have those results right now, but let's check, and we can potentially follow up in one of the next meetings. Okay.
A
Yeah, thank you, Mikhail, thank you for presenting, this was really interesting. It would definitely be really interesting to see some of those earlier results, or the new results that you guys have been working on, especially with hard drives. So yeah, I would absolutely be interested in having you guys do a follow-up. Well, thank you very much. You're welcome.
A
I don't have anything else for people, so does anyone have anything for the last 10 minutes that they want to bring up or talk about before we wrap up? All right, well then, thank you, Mikhail, and thanks everyone for coming. Have a great week and we'll see you next week. Thank you, have a good one.