From YouTube: Ceph Developer Monthly 2020-10-07
A
Should we move on to another topic and, you know, get back to the workshop once our internet and things start working?
B
Yeah, I think a good idea is Scott Peterson on the second topic; let's see in the list here, I...
B
Maybe we can start with the third topic then, more efficient tracing. Gabi, do you want to take it away?
B
D
I'm first trying to explain why we even need this tracing. The tracing is trying to fulfill something that we don't have in our system. If you look at a simple system, like a standalone application, you can use a debugger. If you want to find out why something is not working as it should, you use a debugger like gdb, and the debugger instruments the operation: you get step-by-step execution, and then, online, you can decide if you want to skip a function.
D
If you want to step in, if you want to skip multiple iterations and so on; you can monitor and modify values in the program variables, and again, online, you can decide which parameters you are interested in, and you can also collect a predefined variable. And the problem with this mode is that it requires user intervention.
D
The debugger is blocked waiting on user input.
D
So it's very nice if you debug a simple user application with a single flow of execution. But when you go to debug high-performance systems, which can do tens of thousands to millions of IOs per second, there's no way a debugger can support that kind of flood; operations time out while it waits on the user, because in gdb you always have to decide what you'll do next. You have to watch something, you have to see it, and then, okay, you step to the next step.
D
But when we do things as humans, it takes seconds to operate, while a machine doing millions of IOPS has microsecond response times, which of course we cannot compete with. So the system would slow down, distributed protocol behavior will be broken because the flow-of-execution timing is different now, race conditions and timing bugs will be hidden, and in many cases operations time out, especially network operations.
D
So the way to address this is to use online tracing. Online tracing allows the system to progress without waiting for user input; the system collects and keeps changes in variables, and the user can review them postmortem. You just collect everything: execution can be replayed later from the collected data, and then the user can analyze the data and find what went wrong.
F
D
D
G
D
E
D
D
Allocating this buffer, copying the data over there, staging it here, working here, all these things: you need something to combine them, and the streams themselves don't have that, so there are some workarounds for this. If you want to control the amount of tracing, you need to build the system again; you need to change it and recompile. So people added these things they call trace levels: trace level 1 up to 20, and with every level you show more event traces. That's what you do at development time.
D
But in reality, when you have something not working, it might be something you put on priority 20, and if you turn on 20, it means the whole system is going to be flooded with information. So again, it's not flexible enough.
D
D
D
So I'm trying to build a set of requirements for my online tracing system. Requirement number one: it must be simple to operate; if it's not simple, people don't use it. It must be lock-free, otherwise it won't scale. If we're talking about machines now, you could have 50 or even 100 cores running on the same machine; if all of them had to take locks each time you want to trace, to put in an event trace, it just won't scale.
D
An online tracing system should not impact the normal execution. Usually the simplest kind of event tracing is something you do at debug time: at compile time you put in something that is a macro, and by compiling it out you remove all of them, so there's zero impact on performance.
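A minimal sketch of that compile-time approach (not the actual Ceph macros; the TRACE_ENABLED switch and the macro name are invented for illustration):

```cpp
#include <cstdio>

// Hypothetical build-time switch: define TRACE_ENABLED to keep trace points;
// leave it undefined for a release build, where they vanish entirely.
#ifdef TRACE_ENABLED
  #define TRACE(fmt, ...) std::fprintf(stderr, fmt "\n", ##__VA_ARGS__)
#else
  #define TRACE(fmt, ...) ((void)0)  // compiled out: zero runtime cost
#endif

int handle_io(int op) {
  TRACE("handle_io op=%d", op);  // disappears from release binaries
  return op;
}
```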
D
D
Data collection should cost nothing when you're not running; when you do run, data should be collected as an efficient binary trace. But even though you collect it in binary, it should still be displayed in a human-readable format: binary collection for efficiency, human readability so it's useful for us. And we need online filtering of event collection, which is something we don't have in the previous model.
D
I would like to be able to filter by event type. For example, let's say I find there's a problem in my scrubber code. So I say: this time when you run, even though scrubber is priority 20, I'm now trying to debug a scrubber problem, so please collect only scrubber events, nothing else; or maybe scrubber events and pg log events; or scrubber events, pg log and...
E
D
BlueStore. So you can decide which events you want to collect, and you should have a lot of events, so the granularity of filtering is good. The systems I've been using used two bytes for event types, so you could define, in theory, 16k event types; in reality I think people use something like 600.
D
For example, I need to be able to filter by pool id, pg id, onode id, device id, lba, block id, you name it. Or I could say: show me everything which has them; or show me only pool number five, only pg number 17, only onode id x, and so on. So you should be able to control how much you collect; that's limiting by filtering at collection time.
D
D
D
Again: there's filtering when you collect the trace, because you want to minimize the trace size; but then there's filtering when you view the trace. You try to look for something and say: okay, now please show me just events related to SCSI; or I just want to see events on the fibre channel, or on the network, show me tcp events, whatever. And of course the language shouldn't be complicated.
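A small sketch of what such view-time filtering could look like, assuming decoded events carry a few indexed fields (the Event struct and its field names are invented for illustration):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative decoded trace event with a few searchable fields.
struct Event {
  uint16_t type;     // e.g. scrubber, pg log, tcp...
  int64_t  pool_id;
  int64_t  pg_id;
};

// View-time filter: keep only events matching a caller-supplied predicate,
// e.g. "only pool 5, pg 17", without re-collecting the trace.
std::vector<Event> view(const std::vector<Event>& trace,
                        const std::function<bool(const Event&)>& pred) {
  std::vector<Event> out;
  for (const auto& e : trace)
    if (pred(e)) out.push_back(e);
  return out;
}

// Usage:
//   auto v = view(trace,
//                 [](const Event& e) { return e.pool_id == 5 && e.pg_id == 17; });
```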
D
So, trying not to elaborate too much: it must be simple. It should be a simple process to add a new tracing point, similar to adding a printf; if it's more complicated than that, people are just not going to add event traces. I mean, you could force its use in some very critical locations, but you want people to be generous with event tracing, and remember, because we can filter events at runtime, it's good to have more of them.
D
So it should be easy to do this, and I'm actually jumping to the last item here; one thing that makes it easy is that it should be globally available, without parameters. In our system, every time you want to put in an event and you don't have it in the object, you need to pass the cct object down the line. I've done it in one file and I had to change...
D
D
D
For example, let's say you're debugging the osd. You could say: okay, since I'm debugging the osd, I don't care about everything else; let's start with the default osd events. You could create an event set which most osd developers need, and for this event set you could then later say: okay, remove this one, maybe also add that one, and so on and so forth. You should be able to add multiple event filters in a single command.
D
You don't want to start issuing one command after another, and of course, if you could have a gui for this, that's the best. It should be easy to follow and search: the collected trace should be human-readable, with a strong logical query language. Again, you don't want people to need crazy scripting to be able to search a trace; the trace should be easy.
D
Lock-free operation: this is actually well understood, and you can see it in other tools like Jaeger and others. What we do is that every cpu core runs an independent stream of events, completely lock-free. The events are collected into a private buffer that can be written by only a single core, so you don't need to lock it.
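A toy sketch of the per-core idea, one writer per buffer so no lock is needed (the Record layout and sizes are invented for this sketch):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// One trace record; layout invented for this sketch.
struct Record {
  uint64_t ts;       // timestamp, used later for cross-stream merging
  uint16_t type;     // event type
  uint16_t subtype;  // event subtype
  uint32_t payload;
};

// Per-core ring buffer: written by exactly one core, so writes need no lock.
struct CoreStream {
  static constexpr size_t N = 1 << 16;
  Record ring[N];
  std::atomic<uint64_t> head{0};  // atomic only so a reader can snapshot it

  void emit(uint64_t ts, uint16_t type, uint16_t sub, uint32_t data) {
    uint64_t h = head.load(std::memory_order_relaxed);
    ring[h % N] = Record{ts, type, sub, data};  // cyclic: old events overwritten
    head.store(h + 1, std::memory_order_release);
  }
};
```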
D
D
D
Now, the timestamps are added to be able to merge the events and get ordering between two independent streams, but we also get something else from them; it's not something we waste, because timestamps are useful by themselves. You can judge how long an operation took. It's not going to be a performance-quality timestamp, but if you see that some step is taking a crazy amount of time, then you should be suspicious.
D
So the streams are merged in post-processing based on timestamps. You work on multiple streams, one stream per core, and you merge based on timestamps: you have multiple pointers, and when you advance, you are advancing all of them in parallel.
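That post-processing step is essentially a k-way merge keyed on timestamp; a sketch, reusing the hypothetical Record from the previous sketch:

```cpp
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Merge per-core streams into one time-ordered stream: keep one cursor per
// stream and always advance the cursor with the smallest timestamp.
std::vector<Record> merge(const std::vector<std::vector<Record>>& streams) {
  using Cur = std::pair<uint64_t, std::pair<size_t, size_t>>;  // ts, (stream, idx)
  std::priority_queue<Cur, std::vector<Cur>, std::greater<>> heap;
  for (size_t s = 0; s < streams.size(); ++s)
    if (!streams[s].empty()) heap.push({streams[s][0].ts, {s, 0}});

  std::vector<Record> out;
  while (!heap.empty()) {
    auto [ts, pos] = heap.top();
    heap.pop();
    (void)ts;  // ordering key only
    auto [s, i] = pos;
    out.push_back(streams[s][i]);
    if (i + 1 < streams[s].size())
      heap.push({streams[s][i + 1].ts, {s, i + 1}});
  }
  return out;
}
```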
D
D
D
That way, the compiler will optimize the code and it won't affect production code, because everything is outside: there's a branch, predicted false, which the cpu can execute in parallel, and if this thing happens, then it jumps outside to the tracing code, but the normal code path is not affected by this. So the code would look like this: if the trace level is bigger than some level, then run the tracing code.
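A sketch of that runtime check, with the branch hinted as unlikely so the hot path is barely touched (trace_level, emit_trace and the macro name are placeholders, not real Ceph symbols):

```cpp
#include <atomic>
#include <cstdio>

std::atomic<int> trace_level{0};  // changed online by the operator

// Out-of-line slow path: only reached when tracing is switched on.
void emit_trace(int level, const char* what) {
  std::fprintf(stderr, "[trace %d] %s\n", level, what);
}

// Hot-path check hinted as unlikely: production code pays one predicted-
// not-taken branch and nothing else.
#define TRACEPOINT(lvl, what)                                          \
  do {                                                                 \
    if (__builtin_expect(                                              \
            trace_level.load(std::memory_order_relaxed) >= (lvl), 0))  \
      emit_trace((lvl), (what));                                       \
  } while (0)
```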
D
I'm going to make a short stop here, because I want to show what I'm actually aiming for. So I'm going to jump to the end.
D
So that's actually what I'm looking for. When people think of event tracing... I think the event tracing that we've got in Ceph, and also in many other systems, is not really event tracing; it's a combination of a few things. In some cases it's just a system logger, so you can see a system starting up...
D
...nodes join, oh, somebody disconnected, a network link is being brought up, the link is down, these kinds of events. They are just logging things; it's not something suitable for high performance. My example is Wireshark. Is everybody here familiar with Wireshark?
D
It captures packets going over the network, and then later it can display them and help you find some problem with your protocol. Similar products have existed for many years.
D
D
Wireshark has lots of filters. When you run Wireshark, you can define what you want to collect. You can tell it: okay, I want you to collect ipv6 only, coming only from this internet address and going to this ip address; or maybe I can say I only want to see rdma over tcp. So you can define what protocol you want, you can define what peers, and the deeper you get inside...
D
...you have more and more filters. You can say: okay, if it's iscsi, I'm only interested in seeing read requests; and maybe only read requests coming to this lba.
D
I didn't use Wireshark for this, but I'm assuming it can do that; the tools I used were able to do that. So you have a crazy amount of information, and you need to be able to filter.
D
D
But still, after the data was collected in binary format, you can display it in a human-readable format. There's actually a very nice, very user-friendly gui to see everything, and you can get inside and you can search: show me packets coming from this, coming from here; show me packets with this kind of problem; show me packets with this kind of header; and so on and so forth. And of course you can view and search and everything.
D
So that's the example, or that's the model, I'm trying to emulate here. I'm not trying to compete with the system logger, and I'm not trying to emulate Jaeger, which is another tool to collect system state. With Jaeger and friends you can see that everything is running and that the network state is stable or not stable; you can see that a device is up or down, an osd up or down; but you're not going to get the granularity needed to see every single io.
H
Hang on: Jaeger is a structured event tracer, it's not a logger.
D
D
D
Maybe the wrong name; it's just a name I'm giving this, I'm just...
H
D
Sorry, I don't get you; maybe the sound here isn't... I'm trying to increase the sound.
H
D
D
Yeah, but I'm more concerned about a debugging utility here. The event tracer I'm talking about is something to debug your code. It's not something to show the customer, or to take a snapshot of the system and say: oh, I can see something that looks suspicious here. The event tracer I'm talking about is a debugging tool for developers.
D
B
D
D
D
It's simple, it's easy, and it doesn't take a long time to write. You don't need to be some amazing engineer to be able to do it; everybody should be able to do it on day one, there's zero training required here. Then there's the optimized mode, which I've seen used in the past. I don't know if people are still using it, because it's extremely efficient, but it's not so comfortable to use.
D
So what happens is this: you define an event and a sub-event. Sorry, there's an event type; for example, the event type is going to be a pg log event and the subtype will be trim, and then there's a central location where you have a json-like file, and you can say: for event type pg log with subtype trim, you're going to see this; and here you type what the parameters are going to be, what the names are going to be, and so on and so forth.
D
So it's very efficient, but it's not so comfortable. Later I'll try to show some examples, but the idea is you write some text description for this, indexed by event and sub-event, and you give a list of parameters; for each one you say what the type is, so it knows how much it has to read, and what the logical name is, so later you'll be able to search. So, for example, you could say: in this event I'm...
B
D
...going to have the osd number, which is 4 bytes, and the name for search would be osd. Then I'm going to have the onode id, which is another four bytes, and you would search by onode. Then I'm going to have the timestamp, which is eight bytes, and you would call it timestamp, and so on and so forth.
D
So by creating this file, in post-processing, when they reach this event, they know exactly how to display it, without paying any overhead: they just dump the data, which is prefixed by the event type and subtype, and then they jump to this table. So it's extremely efficient, but it's not crazy...
D
...uncomfortable. I've been working like this for many years, but I can say that in many cases you would see people being lazy and not adding events, because every time they want to add an event, they need to open this crazy json document and add the appropriate line. And then you also need to develop some utilities to make sure that everything matches: if you're passing five parameters, the json file has five parameters; and you also try to do some kind of sanity checking on them.
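To make the shape of that scheme concrete, here is a sketch of a binary record prefixed by event type and subtype, plus the kind of out-of-band descriptor such a json file would carry (all names and values are invented for illustration):

```cpp
#include <cstdint>

// Every record starts with type/subtype, so the post-processor can look up
// how to decode the rest from the central descriptor file.
struct TraceHeader {
  uint16_t type;     // e.g. PG_LOG
  uint16_t subtype;  // e.g. TRIM
};

// What one entry of the json-like descriptor file expresses, in struct form:
// for (type, subtype), the parameter names and byte widths, in order.
struct FieldDesc { const char* name; uint8_t bytes; };
struct EventDesc {
  uint16_t type, subtype;
  const char* text;  // human-readable template used at display time
  FieldDesc fields[3];
};

static const EventDesc kPgLogTrim = {
  1, 2, "pg log trim osd=%u onode=%u ts=%llu",
  {{"osd", 4}, {"onode", 4}, {"timestamp", 8}},
};
```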
D
D
So there's marshalling and de-marshalling, but it has to be efficient; later, in post-processing, they will come and see that this is an object of this type, and then they will call this function, which could be just an ostream operator or something like that.
D
D
So, the ways to make it efficient. The simplest one is that every time you trace an event, you also trace its printed format, which is inefficient, because you're going to repeat the same thing again and again. A solution I've seen is that you keep the instruction pointer and the file, so you can say: in this object...
D
H
D
If the binary is changed, because maybe you changed one line and recompiled, it's not going to work. So it works, it's fine, but there are some problems with it. Something else I discussed, which I don't know if it was ever implemented, but I suggested it in the past at my previous job, is to keep a single translation table per file.
D
The first time, you do a hash search into the hash table; the next time you see it, you find that it already exists, so you just store the index. The index is probably going to be 2 bytes, which should suffice; I don't expect to see more than 64k event traces per type. Remember, it should be indexed per type.
F
D
You create a table which can be stored in a separate file, because the event log is just moving forward, while this table is probably kept in memory; at the end you dump the whole table at once. It's not dumped together with anything else; it's something which actually should go into the header.
D
So you keep a translation table from the format to an index, and then you just store the index for every event. I mean, the event could appear in the trace thousands of times; actually, I've seen it hundreds of thousands of times, if the event is a very common one. Say the event fires every time you get an io request: you don't want the format stored again and again. You store it only once and you keep the index, and then in post-processing...
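A sketch of that interning idea: hash the format string on first sight, store a 2-byte index in the stream, and dump the table into the trace header at the end (container choices are for brevity, not performance):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Intern each format string: the first sighting does a hash lookup and
// assigns a 2-byte index; every later record stores just the index.
class FormatTable {
  std::unordered_map<std::string, uint16_t> index_;
  std::vector<std::string> formats_;  // dumped once into the trace header
public:
  uint16_t intern(const std::string& fmt) {
    auto [it, inserted] = index_.try_emplace(fmt, formats_.size());
    if (inserted) formats_.push_back(fmt);
    return it->second;  // 64k distinct formats per type is plenty
  }
  const std::vector<std::string>& header() const { return formats_; }
};
```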
D
D
So that's how you do the printf. That allows you to do printf-like, comfortable event tracing, but still be very efficient in the way you store things.
D
D
So say you decide that you're going to have 1k event types. It's crazy, it's a lot, 1k; I have never seen something go so big, but say you decided you want 1k event types. You define a bitmap, and every bit represents one event. You could start with an empty bitmap, or you could start with a default event map, or you could create an event set for every group or for some kind of io flow, like for BlueStore.
D
D
I want you to add events for pg log, for scrubbing, for tcp, and for BlueStore. You should use human-readable commands; don't let customers or users manipulate bits. And how this thing works: every event trace call starts with two parameters, event type and subtype.
D
Oh, okay, I didn't talk about that yet. Okay: every time we record an event, we put the event type and subtype. When you collect events, you always need to check; when you do online collection, the machine is always running. If event tracing is active, then you jump outside and you start to look: this was an event type like this. Then you check the bitmap: if the bit is set, you collect the event; if not, you skip it. And this thing you can do online, and you can change online.
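A sketch of that online check, one bit per event type, flippable at runtime (the 1k figure comes from his example; names are invented, and the unlocked hot-path read is acceptable here only because the flag merely gates tracing):

```cpp
#include <bitset>
#include <cstdint>
#include <mutex>

// One bit per event type; flipped online by operator commands such as
// "collect scrubber + pg log", with no rebuild or restart.
static std::bitset<1024> g_enabled;
static std::mutex g_enabled_mutex;  // only the rare control path locks

// Hot path: a single bit test. A racy read can at worst drop or keep one
// extra event around the moment the operator flips the bit.
inline bool should_collect(uint16_t type) {
  return g_enabled.test(type);
}

void enable_event(uint16_t type, bool on) {
  std::lock_guard<std::mutex> l(g_enabled_mutex);
  g_enabled.set(type, on);
}
```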
D
That's one of the biggest flexibilities of this model. When you do event tracing based on severity, from level 0 to level 20, and the event type you're looking for happens to be at level 20, by opening that you open everything and you get a flood. Here, you can just decide which event types are interesting for you; and it might be, as I said, scrubber, for most people the least interesting event, but if someone is debugging that code, for him it's the most interesting thing.
E
D
D
And another thing; actually, I should probably have used another slide for the last two bullets. You should also allow collecting by subsystem. For example, when you're running Ceph, you could say: okay, collect osd events, and then I'm going to tell you what else, but I just need the osd, or just BlueStore, or just RocksDB, or whatever. And then you can also limit event collection based on objects: just collect events if they belong to this pg; I don't care about other pgs, I just want this one.
D
You also need a way to view and search the trace. You want to view the trace forward and backwards, just like you'd search a file: if you're reading a page, you should be able to go up and down, you should be able to trace in reverse and forward.
D
The renderer should have a current location, so you can always say: okay, I've seen up to here, now show me the next page. It's a paging concept: you can see one page after another, and it always knows where it is. You can say: now scroll up, scroll down from the current location. You don't need to repeat everything each time, like I've seen people do: you grep for something, and then you say, from this point you filter, and then you do it again; it's complicated.
D
It should just remember where you are, and then you can say: search for the next event. You can give it the current location; usually there are three location macros, head, tail and current, so you can search from the beginning until you see something, search backwards from the end until you meet something, or, from here, search forward or backward until you see something. Then, when you view the events...
D
...you can do second-stage filtering. When you collected, you filtered; but when you view it, you say: okay, we collected a lot of information, because I didn't set up the event traces, somebody else did, but I'm only interested in this event, so just show me those; I don't want to see everything else. So you can tell it: show me only this, or show me everything except this, or show me the next object...
D
...the next appearance of an object. I'm looking now for an onode: show me the next time I can see the onode; or show me the next time the onode has this value; or show me the next time the onode id is this one, or is within this range. Or not the next time, but: I want to see only onodes between x and y. And of course, you can do combinations with logical operators.
D
D
How do you do this, how can you do the filtering when you actually implement it? The first thing: every trace event starts with event type and subtype. The event code always dumps them in binary format as the first thing in every event record. It could be two bytes for the event type and two bytes for the subtype, or maybe one for each; it depends how many events you want to create.
D
D
So it's location-based: the way for the event trace renderer to understand what it's looking at is based on position; the first bytes are always going to be event type and subtype. And you can use the same concept with more information. Let's say that in the osd, whenever we collect an event, we need some common parameters: we always need the pool id, the onode id, and so on and so forth.
D
So you can create a macro or function, call it the osd tracer, and that macro is going to ask you to insert pool id, onode id, and so on and so forth, and it will pack all this information and dump it in a single well-known location. So they always know that the pool id is the next four bytes, and the onode id is the four bytes afterwards.
D
So that's similar to what I explained before, where you create a json format to describe it, but here it's actually useful; and I've actually seen this used more commonly than the previous one, because you can create a family of event tracers, and this family could be defined at the beginning of every project.
D
The project manager or project lead defines the common macros for the project, and they're going to have like three or four json lines explaining how the event is going to look. Usually the beginning is going to look like this: event type and subtype, of course; then there are going to be pool id, onode id, block id, and so on; and after this there's going to be something else, which is, like before, a printf format which you can render at view time.
D
So usually what people do is description-based: for every parameter you want to be searchable, you start by giving it a name, say lba, and then you give the value; you attach the name before every such parameter.
A
C
C
D
So that's an example of how you could use it. You do an event trace, and of course you start by saying, if that's a pg log event, you're always going to start with the event type and the subtype; I'm going to use the formatting here, and it's going to be: this event is a trimming log. And then I'm adding some parameters. Now, the parameters: this is not printf, it looks like printf, but of course it's not printf. Every parameter can be a tuple; it doesn't have to be, but it can be. So you do it like this...
D
I'm going to give you a value, and that's going to be the name. So if you're going to search, you're going to search for onode, and that's the value, and the value is a four-byte unsigned integer; then the second one is going to be pool.
D
Actually, no, because they need to know that those are going to be here, so you don't need to describe those parameters. But okay, the last one, you say it's a character string, and here you don't give it a name; it's not interesting to search. Maybe it's some information you want, but you're not going to search by it; you keep this kind of information only for display, not for something you're going to search by.
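A sketch of a printf-looking call where each searchable parameter is a (name, value) tuple and unnamed values are display-only (this API is entirely hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One parameter tuple: a search name (empty => display-only) plus the value.
struct Param {
  std::string name;  // "" means keep for display, never index for search
  std::string repr;  // rendered value; a real tracer would store raw bytes
};

struct TraceEvent {
  uint16_t type, subtype;  // always first, as in the binary record
  std::string text;        // e.g. "trimming log"
  std::vector<Param> params;
};

// Looks like printf at the call site, but every argument is self-describing.
TraceEvent pg_log_trim(uint32_t onode, uint32_t pool, const std::string& note) {
  return TraceEvent{1, 2, "trimming log",
                    {{"onode", std::to_string(onode)},
                     {"pool", std::to_string(pool)},
                     {"", note}}};  // display-only, unnamed
}
```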
D
So, back to this: for every parameter you want, you can add a type, but there's going to be, of course, a limited number of types, because you don't want to type everything in the world; not everything is going to have a type, that would be crazy. And the way you do it: at the beginning of the event records there's going to be a header, and the header is going to have all the event types.
D
D
This is similar to the json thing I described before, but it's done automatically, on the fly, by the event tracing code. The json thing, when I was using it, was not very comfortable; people just don't like the idea of writing the code here and then jumping to another file and writing something. When you do it in this kind of code, it's much more intuitive and it's easy; it's not complicated, you just do it and it's good.
D
D
So if you enumerate all kinds of operations, and you've got remove, get, put, delete, whatever, the index is actually an index, but they're going to use the enumeration name. And other things are even more interesting: when it can see that you're using something of some type, it's going to create the type on the fly, so that later you'll be able to search by it. Say there's going to be a type like an onode-id type; then you wouldn't need to spell out this onode name everywhere.
Yeah,
I'm
here
the
network
is
slow,
so
I
can
hear
you
good
to
see
you.
I
Okay, one comment that I think might be a good idea to follow: we should strive to create the output, either when collecting or immediately on the spot in another process, in a formatter, in one of the formats for which there are already existing tools. You talked about Wireshark; there is, for example, KernelShark, and there is Trace Compass, for example.
I
I
Filtering, whatever; so it might be a good idea to emit a format that is known for tracing. That's the most important comment. And another one I would suggest, I think we mentioned it, is a feature that I had in the tracer I used at my previous job, which was temporary streams.
I
D
I'm familiar with that. Actually, something I forgot to put in my presentation is the event to stop the tracing. Usually tracing uses a cyclic buffer, so everything is going to overwrite everything else; but there is a showstopper.
D
You need to define: when something happens, I want you to stop the event trace, because otherwise you would collect forever. For example: when you see that onode, or when this sanity check is about to break, then stop the event trace.
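A sketch of such a stop trigger: writers append to the cyclic buffer until a sanity check fires and freezes it, preserving the events leading up to the failure (the flag and function names are invented):

```cpp
#include <atomic>

// Once frozen, the cyclic buffer stops overwriting itself, so the events
// that led up to the failure survive for postmortem analysis.
std::atomic<bool> g_trace_frozen{false};

inline bool trace_active() {
  return !g_trace_frozen.load(std::memory_order_acquire);
}

// Called from a sanity check, e.g. "this onode count went negative".
inline void trace_freeze_on(bool bad_condition) {
  if (bad_condition)
    g_trace_frozen.store(true, std::memory_order_release);
}
```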
I
J
H
Here's the goal, okay: let's say that we're collecting logs from a customer, and we have it set up so that the rate of generation is about, you know, 20 megabytes per minute or so. That's a lot, but it's not impractical at all to keep weeks of that, 20...
D
H
I understand that, Gabi, I'm just asking. Okay, so my question is: the idea is to write these to a file and roll the file, right? The idea isn't to roll a circular buffer in the middle, after the thread generates the event but before it's written to disk? Yes, it's in a circular buffer, but as long as the write-out process keeps up with the generation, it's not a problem. Is that the idea?
D
Yeah. So before we get there, just one critical thing to understand: when you trace into memory, performance is going to suck, but when you trace into a file, performance is going to be extremely slow. So usually, when you try to debug a distributed system, you aim to keep as much as possible inside your memory.
D
So that's why you do all this crazy filtering. When you put stuff in a file, yeah, sure, you could keep it... sorry, when I was talking about a cyclic buffer, I was talking about memory.
H
E
H
This is what the developer thought they needed, right. So the related question is: suppose the amount being collected is larger and we actually can't keep up with writing to disk.
H
H
Nearly as difficult, so they're...
H
D
H
H
B
H
So it would be... and these tend not to be high-volume events anyway, so I don't actually expect it to be a problem; but there's a difference between "it usually won't be dropped" and "it never will be". So I was wondering if you've given that any thought.
D
E
H
E
H
D
I'm sure people will come up with a lot more interesting ideas. I spoke last week with this guy, Tom, about his event tracer, and he said: yeah, I did it eight years ago, but since then I've done it for many other systems, and each time we do it, it becomes better. And he said the best one he knows now is the one the guys at Vast Data have, because, you know, they started with everything he'd got and then they added more. So yeah.
H
H
Some of the tricks that are present here are already present in the ooc log, and we do use it to capture all of the operations of a certain type pretty frequently. Now, it's incredibly non-granular, you have to use grep to search it, and it's a giant pain in the ass. It's also inefficient for all the reasons you've outlined, but we use it for these things anyway.
H
D
D
H
Yeah, I wanted to talk about that too. So, while a highly featureful log viewer would be great, I think the first step probably shouldn't be that; it should be a utility that just processes the binary trace, with a set of filters, into a text file. And my reasoning for that is that a log viewer is a great example of a place where you can generate kind of an unbounded set of features, and one way or another...
H
We will need a way to process a trace, with time and other event filters, into a text file, simply so that it can be compatible with other tooling. So I suggest we develop that part first. It's annoying, admittedly, that it won't have the features you outlined; those do seem good.
D
Well, actually, you can also develop something which takes the binary and then dumps it into text.
H
What I'm saying, what I mean, is that the utility that embeds those filters and time bounds, just a command-line utility that emits a text file, should be the first step, I think.
I
I
H
B
I think that could be a second step, though, right? It could be something understood by existing trace viewers, like trace...
B
H
B
H
So the other thing I wanted to ask about is the ceph context pointer that's passed around; there's a reason for that, it's not for funsies. Everywhere in the existing daemons there is a global ceph context and you don't need to pass it around. The exception would be any code that interacts with library code, because in those cases there could be multiple libraries within the same process with different config parameters.
H
Because there are utilities that interact with that code. In principle, though, if those were only ever called from the daemon, yeah, we could use the globals. It's a little complicated, but that's why; I'm not saying that it's not ergonomic, I just wanted to bring that up. Okay.
D
H
H
D
H
D
Okay, and the last thing I forgot to mention: I'm maybe pushing people to consider an existing tracing solution which does everything I explained, but there's one problem with it, actually a big problem: it's not supported. It's something which was made open source eight years ago and was never touched since. I spoke to the guy and he said he thinks it's like one or two weeks of work and it will be working again: fixing compatibility, libraries whose versions have changed, and so on.
D
No, no, this solution was developed for Infinidat. Infinidat is the third company made by the same guy, who made Symmetrix, and at every company he started...
D
...he was recreating the event tracer and refactoring what he knew. So the first one was made 30 years ago, which is probably the one I'm familiar with; the third iteration was done 10 years ago, and the guy who wrote it for him insisted on making it open source. So eight years ago it used to work, and it was their solution, and since then libraries changed, compiler versions changed, clang versions changed, and so on and so forth. So it's not that it never worked; it just...
D
It was not updated to build with the latest kernels and stuff like...
D
...this. By the way, he also mentioned he's going to speak with his friends at Vast Data, which apparently got the latest iteration of this design, and try to convince them to make their tracer open source. I don't know if he's going to succeed, but if that happened, then we would be getting the best of the...
D
A
Yes, yeah. I mean, just to sum it up: it sounds like we're kind of on board with your ideas, and the next step would be to just create a small prototype with some isolated piece of Ceph. Would you agree?
D
D
So I think his project was more advanced than the stuff I was describing, because it was just another refactoring of the stuff I'm familiar with, and I don't know exactly what is stopping it from functioning. I think it's just a matter of building it with the correct tools, but I don't know. Right, and...
D
D
E
H
B
I want to go back slightly to the idea of the Jaeger tracing as well. I think that's kind of a separate use case, and I'm not sure whether it would be fully captured by the same kind of mechanism, as the Jaeger use case involves some hierarchical...
D
D
I think the high-performance one is not going to synchronize; it's not built to synchronize with thousands of machines like that.
B
H
The way Jaeger works is it's just a bunch of different independent processes generating events; they're later aggregated based on a couple of well-known ids that are generated in a deterministic fashion. The generation part's not functionally different from what you're describing; the only difference is that you send them, with some degree of rapidity, to an aggregator.
H
But I understand if we don't want to do that immediately, or if we simply want to leave that under Jaeger; but I disagree that they're different use cases. The addition of the span id or a client id thing will be interesting in both; the specific span id may or may not be interesting in the on-disk log, but I suspect that it won't be detrimental, and that the rest of the event will be interesting.
H
In fact, I suspect that absolutely every trace event that would be useful for Jaeger would have an event right next to it for emitting to this log. This is why I am arguing that it would be ideal if they were not separate systems. I cannot think of a thing I would want to emit to Jaeger that I wouldn't also want in the disk trace.
B
Yeah, I agree it would probably be a strict subset, but it might be... I wouldn't say it's necessary for, like, the minimal start here; it might be a later step to incorporate the Jaeger aspect into this system as well.
H
B
H
I want the original source-level annotations to have this in mind; that's all.
H
H
B
Okay, anything else on this topic? Should we try going back to Varsha now? Do you think your screen sharing, your connection, is working?
F
F
F
Then the other way is with cephadm. Cephadm uses the ceph orch apply command, which deploys the ganesha daemons, but again, it does not provide an interface to manage the exports or the config. In both Rook and cephadm, the ganesha daemons are deployed with a minimal config, but they do not manipulate the exports; again, here...
The
ganesha
demons
are
deployed
with
a
minimal
minimal
conflict,
but
they
do
not
manipulate
the
exports
again
here.
F
...an export config object needs to be created. Then the other way is the dashboard. With the dashboard, the thing is, we need the nfs-ganesha cluster already deployed; it does allow us to manipulate the exports, but it also requires setting up a rados option, like the pool and namespace settings of the dashboard, before the exports can be created.
F
Next, let's look into the way we can create the nfs cluster: we can deploy the nfs cluster using the volumes plugin.
F
As you can see here, with the nfs volumes plugin we can create the cluster as well as manipulate the exports.
F
The cluster creation is very simple: it requires the cluster type and the cluster id to be specified. Let's look into the code for this. In other deployment ways you actually also need to create a pool before an nfs cluster can be deployed, but with the volumes plugin we create a pool for the user, and all the ganesha daemons share a single pool.
F
...dashboard. And for creating an export, we ask the user to specify the cluster id, the fs, the pseudo path and then the path; those are the minimum things to be specified for creating an export. But along with that, we also provide a way in which users can create their own export.
F
F
F
F
And in case the manager is restarted, we read all the exports from the rados pool with this particular method; but this does not happen every time. It only happens when the manager is restarted, or the first time an export is created; otherwise this method is not called. And along with export creation, we also create a user for every export.
F
F
So we create a user with the following caps, and this user is also removed when we delete an export, and it also gets removed when we delete the cluster. So basically, when we delete a cluster, all the exports get deleted, and along with that the pool objects are also deleted, but the pool itself is not deleted.
F
And we have the tests; the nfs tests need to be run with the rados suite. Basically, we are testing with teuthology against cephadm.
F
Getting it started: we can deploy an nfs-ganesha cluster in two ways, you know; one is with cephadm, and the other is without cephadm, using the test orchestrator.
F
F
F
From here, if cephadm is...
F
F
F
F
F
F
F
F
F
One is the integration of the dashboard with the volumes module. Currently, the exports created by the dashboard cannot be detected by the volumes module, and vice versa. Similarly, there are a couple of things which are different in the dashboard and different in the volumes module, but in the future we want the dashboard to use the volumes module to create the exports. Both of them create exports in almost the same, similar way, but the options provided are a little different.
F
We have covered the discussion in this particular tracker ticket; have a look. And second is compatibility with Rook. Currently the Rook module itself is in a very bad shape; most of the commands are not working, and that needs to be fixed. Also, the volumes module is compatible with cephadm; we also want it to be compatible with Rook, and there are no tests for Rook in teuthology, so we're also looking into adding tests in teuthology for Rook.
B
F
F
So that needs to be done, and there are a couple of other things that need to be figured out, how we want to do them; but we want to have an option in teuthology so that you can test either with Rook or with cephadm, and most of the other tests should be like a generic template which will go with either of the orchestrator backends. Yeah, that's it.
F
G
B
J
Okay, so my name is Scott Peterson; I work for Intel. A couple of you know me; for those of you that don't, I've previously worked on a persistent-memory-based HA write-back cache for rbd, which took like two years longer than anyone expected it to, and it's still not quite done.
J
The latest thing we're doing is this thing we call adaptive distributed nvme fabrics namespaces, and that's a mouthful of a name because that's how naming works in the nvme universe; and if you call it adaptive distributed namespaces, then Word keeps autocorrecting that acronym, and so we had to add another word in there. So, this crowd understands the basic goal here.
J
I just presented this at Storage Developer Conference. We all get, in this universe, that this picture on the right is how storage works in the cloud; the picture on the left is how nvme over fabrics is typically used. Basically, it's a really fast patch panel: you can connect a drive to a host, but that's not how people want to use their storage.
J
So our basic problem statement here is: well, nvme fabrics has a lot of features that people love, except for this patch-panel, you know, point-to-point-type connectivity thing. And so we asked ourselves: what would we have to do to nvme to be able to connect to anything? And it turns out all it takes is we add this yellow box here. I don't know if you can see my pointer; you probably can't see my pointer.
J
J
Okay, so the idea is that we take this point-to-point protocol, and in all the entities on the fabric we add one component called a redirector. Its job is to examine every io, look at the start lba for it, and, instead of sending it to the target, choose one of the targets based on a table of hints that it has accumulated from these places.
J
The basic idea: what this diagram is doing is walking you through the sequence of two ios. For the first one, the blue one, this host sends it to what we're calling here the wrong place. We're saying that the first io, number one, should have gone over here to storage node two; the host doesn't know that yet, so it sends the io to this storage node. That storage node knows that it should be over here, so it forwards it.
J
It also tells the host: hey, next time, send it there; and then subsequently the host does that. This is the basic idea: a host can learn from making mistakes. In practice, as we'll see towards the end here, it turns out that, the way this system actually worked out, they tend to learn it all when they connect; and this learning in response to ios, although we still like the idea, turns out to be harder than you'd think in the environments that we want to run this thing in.
J
So why are we talking about this for Ceph? Well, because... I'm kind of blasting through this; let's go through this brand-new slide here, which is just to summarize our whole idea. Instead of connecting hosts to one thing, we're now going to connect them to many things. If you're making a distributed storage system across many nodes, then there are going to be nvme fabrics targets in all of those nodes, and your hosts are going to connect to all of them.
J
You will then, as the architect of that storage system, arrange, as it says here, to basically let your storage system complete any of these ios no matter where they arrive, and then you will try to make that not happen. That's the basic idea: you can handle anything, and then you try to get your hosts to be smart enough to send it to the right place, so that you can avoid that second fabric hop associated with the gateway.
J
I don't know how much time we actually have here, so I'm trying to do this fast and also skip over the things that Ceph people already understand. So if there's anything unclear about this, go ahead, raise your hand, tell me to go back to the previous slide; that's fine, I'm expecting it. None of this is new to most of you, except these terms here.
J
J
Those things use this concept of location hints, where the redirectors next to storage are informed by some cluster manager; in Ceph's case it's actually a lot simpler. Things are enabled to communicate that to hosts in a standard or abstract way, so the hosts don't have to have any specific knowledge of the backend.
J
They just look at those hints and they do what they say. And then there's this term called the distributed volume manager, which is that boundary for all that closely coupled stuff in a storage cluster. This is the details of a redirector, but I think we've covered most of this; the distributed volume manager is the thing, basically, that decides where things are placed and how it will communicate that to hosts.
J
We expect that there could be many types of location hints, but obviously, in order for these things to communicate, you have to establish some that they all understand, and so we're working to come up with an initial set that probably works for most cases. The idea is that this will be extensible: a storage system may send a hint, an old client may not understand that hint, and then it will just ignore it; but we hope to, you know...
J
...jump-start this with a set of a few that probably work in most cases. They basically fall into the categories of simple, which just says this range goes here, and a group you could call algorithmic, which is essentially a function: you say this range of lbas has this function applied to it to select the target. The simplest of those is striping.
J
We all know how striping works, so a striping function hint would say: here's a set of targets, here's an lba range, here are the parameters for the stripe function; run this function on the start lba, look up the target, send it there. That's how striping would work. And the hash hint is what you use for Ceph, obviously. Now, our original idea was that this would work with other storage systems that use consistent hashing as their basis for placement.
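A sketch of what such an algorithmic striping hint might carry: parameters plus a pure function from start lba to target (all field names here are invented; the actual hint encoding lives in an nvme log page):

```cpp
#include <cstdint>
#include <vector>

// An algorithmic "striping" location hint: an lba range, stripe parameters,
// and the ordered target list the function indexes into.
struct StripeHint {
  uint64_t start_lba, end_lba;  // the lba range this hint covers
  uint64_t stripe_blocks;       // stripe unit, in blocks
  std::vector<int> targets;     // handles for the nvme subsystems (targets)
};

// Run the function on the io's start lba to pick the target.
int pick_target(const StripeHint& h, uint64_t lba) {
  uint64_t stripe = (lba - h.start_lba) / h.stripe_blocks;
  return h.targets[stripe % h.targets.size()];
}
```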
J
Chief among them was Gluster. I had never used Gluster, but I did spend a lot of time working in Ceph, as some of you know, so it took me a while to get back around to Gluster and discover that, while it uses a hash function for placing files, when it does block storage your whole block device is one file, so it doesn't actually spread it over bricks.
J
Maybe somebody's got a patch for Gluster that lets it do that, but until Gluster shards its block devices over bricks, this hashing isn't going to be much help there. But we did make it work with Ceph. So, I thought I would be able to get through this much faster; I'm sorry I have to drag you through basically everything I said at SDC, but some of these concepts I kind of have to explain.
J
An adn host sees any storage system as a dvm, and they all have a common set of behaviors that lets these hosts be as dumb as we can get away with. The basic idea is that everything complicated is the dvm's problem, which sounds like we're creating a lot of work; but when you consider that lots of distributed storage systems already exist and they already work, they've already solved all of these problems.
J
What we're really doing is just naming the important things that they all have to do and deciding how we're going to communicate that to hosts. Now, in the nvme fabrics world, there is this thing, this box you can barely read, sorry, wrong screen, this box right here with the mouse wiggling over it, labeled discovery service. This is how an nvme fabrics host, in theory, can be told the address of a discovery service, and then it can ask that discovery service...
J
...what subsystems exist on the fabric; subsystem is what nvme fabrics calls targets. It tells it the addresses and things, and if they have the right credentials, then they'll be able to connect to those things. So the idea is that an adn dvm will have one of those inside it. These hosts will be told about that discovery service, the discovery service will reveal all the nodes in that storage system, and they will then commence connecting to all of them. Once they connect to them...
J
So, in this little example here, you can see we chose to say that this hypothetical dvm has four nodes, because that's an interesting number: it's greater than two. A two-node storage system you think of as having just HA, or active-passive, and we want to highlight that this is intended for slightly more complicated things. So see all these yellow boxes here: this represents a volume provisioned from these ssds on all four nodes; it's in some way a little bit more complicated than just mirroring.
J
So let's consider that. Let's say this host is supposed to use this yellow volume here. This distributed volume manager knows this host's identity, so it will let it connect in the first place; when it connects, it will tell it about this yellow volume on all of these targets, so this host will see it everywhere. Now, this host has a redirector in it, so when it sees these yellow volumes, it notices that they are adn-enabled; basically, they have redirector features.
J
Then it basically lumps them all together as targets for its internal redirector, and at that point this volume is available to it and it can commence sending io to it. Now, it doesn't know where the extents are yet, theoretically. So what should happen next is these targets should all start informing it: the owner of this extent should say, this extent is here. In fact, it can happen any way at all.
J
All of these targets could tell it everything; it turns out that doesn't really take much data, so that's probably what will usually happen. And that even helps when, for instance, say there's some temporary fabric problem and this host can't connect to this server, but it can connect to the other three: they'll all tell it that it should send io for this extent here, but it doesn't have a path there, so it will send it somewhere else.
J
Remember that a dvm is required to complete any io that arrives anywhere inside it, so whichever one of these other guys gets the io for this extent has to complete it by forwarding. If you're building one of these things from scratch, that's one of the things you have to provide as a dvm architect. The beauty of building a distributed volume manager adapter for Ceph, obviously, is that it's a lot simpler, because inside, these things are all just sending it to rbd, and that already handles this.
J
So I want to quickly get to the point where we're talking about the Ceph-specific versions of this, so rather than go into all those details any more... yeah, okay, I think I just said all these things. I forgot this, actually; I probably should have put this slide at the end, I'm sorry about that. This basically highlights that if you were going to build an add-on package for Ceph that provided adn connectivity, it would need to do these things; the bottom line is it's just a daemon that runs everywhere.
J
J
We also see a couple of other advantages. First of all, we have constructed an adn reference implementation in spdk, and it has all the components you need to connect it to rbd images. It's obviously not production-ready; it's just a data path and the minimal configuration tools.
J
You need to set it up for one image, and I can show you in a minute that we did that and we proved that it worked. We also showed that an adn host uses fewer cpu cycles than that same host running, at least, kernel rbd. For various reasons, when I did this poc I stacked the adn stuff on krbd rather than the rbd bdev; long story short, I needed to show multiple things with one experiment, and that's how it had to work.
J
If you were going to build this for real, you would use the rbd bdev that spdk provides. At the bottom of this slide you can see this url, which is an rfc patch in the spdk Gerrit, and the one under it is a readme file that explains this.
J
So what I really wanted to get to in this presentation is, you know, how this connects to Ceph, and how we get away with not having librbd in our hosts. I think it's summarized by showing you what an adn host knows about the dvm when it's talking to Ceph: what does it actually see? This json file over here on the right of this slide represents what it gets. So, remember from the previous slide...
J
...all these hosts are going to see a bunch of targets; they're going to see the namespace that is exposed to them. We expect that your targets would use namespace masking, right: when a host connects, the target would say yes, you're authorized to connect, and it would look up in a table and say, here are the namespaces you should see; so when it lists namespaces, those are the only ones it would see.
J
Any storage array will do the same thing. I have to point out that the spdk nvme fabrics target doesn't actually have that feature yet, but lots of people want it and it's proceeding independently.
J
So when your hosts connect to these nvme fabrics targets in Ceph nodes, they're going to get a location hint that describes their volume, and it's going to be the hash hint, which looks like this json structure on the side. The important parts: this thing called label isn't actually something that the host will use; it's in this blob of information so that when we need to regenerate it, when something changes in the cluster, we can make it consistent.
J
So there's no magic here. What it tells the host: the namespace bytes and object bytes tell it how to turn an lba into an object number, the object name format tells it how to make the name of that object, and then there's the hash it uses; for Ceph that's going to be the rjenkins hash and the ceph_stable_mod function.
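A sketch of the lookup chain that hint enables. The stable_mod logic below mirrors Ceph's ceph_stable_mod; the rjenkins stand-in and every structure and field name are invented for illustration (the real hint encoding lives in nvme log pages):

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Stand-in for Ceph's ceph_str_hash_rjenkins(); the real hint mandates it.
uint32_t rjenkins_hash(const std::string& s) {
  return (uint32_t)std::hash<std::string>{}(s);
}

// Same logic as Ceph's ceph_stable_mod(x, b, bmask).
static inline uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  return ((x & bmask) < b) ? (x & bmask) : (x & (bmask >> 1));
}

struct HashHint {                    // illustrative contents of the hint
  uint64_t object_bytes;             // rbd object size, e.g. 4 MiB
  std::string name_format;           // e.g. "rbd_data.%s.%016llx"
  std::string image_id;
  uint32_t pg_num, pg_num_mask;
  std::vector<uint16_t> pg_to_nqn;   // hash table: pg bucket -> nqn index
  std::vector<std::string> nqns;     // nqn table: up-primary targets
};

// lba -> object name -> hash -> pg bucket -> nqn of the up primary.
const std::string& route(const HashHint& h, uint64_t lba, uint32_t block_size) {
  uint64_t objnum = (lba * block_size) / h.object_bytes;
  char name[128];
  std::snprintf(name, sizeof(name), h.name_format.c_str(),
                h.image_id.c_str(), (unsigned long long)objnum);
  uint32_t pg = stable_mod(rjenkins_hash(name), h.pg_num, h.pg_num_mask);
  return h.nqns[h.pg_to_nqn[pg]];
}
```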
J
J
The hash table is, of course, just a simplified version of the pg table. Inside Ceph you get a pg and then you do the rest of crush to figure out which of those things is where your data should go. For adn, since we're only talking about block, you really can only deliver io to the up primary anyway.
J
So that's all we tell it. We determine, with the ceph cli tools, which osd is the up primary for that pg, what host it is in, and then what the name, the nqn, of the adn target in that host is; and that's all it needs to know. So this hash table basically contains buckets which are indexes into the table of nqns.
J
B
J
Basically, every redirector in the osd nodes will have this entire hint. Now, obviously, every rbd image in the same pool will have exactly the same hash table, exactly the same nqn list; but this is the whole thing: the basic guiding principle of adn is loose coupling, and, you know, they'll do their best to get accurate placement information, but if they don't, it needs to work anyway. So that means all the redirectors in the osd nodes send the entire thing.
J
H
J
I have the file somewhere; I can tell you. So this pg table size, you know, that's a concern, because this all has to fit in a log page; but one of the reasons that the hash table is just indexes into the nqn table is because of that, because there's really no bound on a pg table size right now. There are reasonable constraints; it doesn't do the user any good to have an enormous number of pgs.
J
J
H
J
So, how these things are communicated: I don't really have the details here, I wasn't expecting to go into this detail, but hints are sent to hosts in an nvme log page; it's a page that has a bunch of small structures in it. The hash hint is one of those small structures, but because it has these enormous tables, that hint contains other log page ids: the nqn table is in a separate log page, and the hash table is in a separate log page.
J
So this means that if you're a host using several volumes on the same Ceph cluster, they're all going to have the same tables, so you only need to read them once. Those log pages are designed with a digest: they have a header, and the header has a digest.
J
So if you've already read, say, the hash table for one of your ADN logical namespaces, and it had all 320,000 buckets, and you've remembered what the digest for that version of that table is, then when you connect for another volume and you read the header out of the log page and it's got the same digest, you just skip reading the rest.
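In other words, the digest lets a host cache the big tables across volumes. A minimal sketch of that idea, with placeholder read functions rather than a real NVMe log-page API:

```python
# digest -> parsed table, shared across every volume on the same cluster
table_cache = {}

def get_table(read_header, read_body, parse):
    digest = read_header()                        # cheap: header only
    if digest not in table_cache:                 # first time we see this version
        table_cache[digest] = parse(read_body())  # expensive: e.g. 320,000 buckets
    return table_cache[digest]                    # otherwise reuse what we read before
```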
J
Right, and the same thing for the NQN tables. So, when you connect this to the NVMe over Fabrics world, there are some corners of the envelope that maybe aren't going to work so well. If you've got 10,000 nodes, that's a lot of NVMe over Fabrics connections, and if you're trying to use an RDMA transport, that might not work well.
J
Obviously, this works with NVMe over TCP as well. But anyway, there are tuning knobs there. It's also true that if you've got images in separate pools, then hosts really only need to know about the targets in the nodes that have OSDs they will actually use, so that could be managed too. All right, so let's just quickly go to how we proved this. Does anybody have any questions about how we abuse CRUSH here?
D
Could you explain how you have used the NVMe protocol? I mean, it's not a simple read/write; there is a read with some parameters which you...
J
Sorry, what do you mean? We basically only support reads and writes; we don't do other operations.
D
Sorry, go ahead. But before that, how do you pass the information associated with a request? I mean the...
D
And all this snap and clone, they are based on some kind of sequence number coming from the RBD?
J
Yep. So here are the results of our PoC. We're comparing conceptually these three cases, and as you can see, this is three runs. This network at the top is my Ceph network; this network at the bottom is my NVMe over Fabrics network. We do it here with just RBD; obviously, all the traffic is right here.
J
All these traces in Grafana are all my eight OSD nodes and my one client. In this case, we left the client with NVMe over Fabrics; we went to one of the OSD nodes, which then just delivered it with RBD. So you can see that almost all the traffic is still on the Ceph network, and it's also here again on the NVMe over Fabrics network. And then in the ADN case, we put in the hash hint. Now, in this PoC, the redirectors weren't actually capable of passing the hash hint in log pages yet.
J
That's basically the whole idea, though. So we also wanted to prove, or actually see, whether this really saved any CPU overhead, in my little toy virtual cluster, which was just barely large enough to show any of these benefits, but a whole lot easier to set up than a real one, where you need to actually get somebody to pay for 8 or 16 nodes and let you keep them for six months. The differences are... wrong screen.
J
Sorry. So this is RBD at 30K IOPS, and this one over here is ADN, same workload, same 30K IOPS, and these two red arrows indicate that the CPU utilization on that client dropped by that much. Not a huge amount, but measurable; whether that matters to you as a customer depends on who you are. And again, 30K IOPS is not a lot, and these are really small virtual nodes. We probably don't have time to go into it here, but this actually represents two kinds of CPU utilization: there are normal OS thread stats, and, because ADN is an SPDK app...
J
We have to extract thread idle time from SPDK, because SPDK threads always look busy, and then transmogrify that into unified CPU utilization stats. So that's what all these caveats here are: this is a derived number, combined from the two, and I have the scripts that'll show you how you can do that, if you care. I'm only mentioning it because someone who just reads this slide will say, hey wait, why isn't that pinned at 100%? It was, so we looked inside.
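A sketch of that kind of derivation: SPDK's thread_get_stats RPC reports per-thread busy and idle cycle counters. The RPC itself is real SPDK; the exact field names and the blending below are assumptions, not their actual scripts:

```python
import json, subprocess

def spdk_busy_fraction(sock="/var/tmp/spdk.sock"):
    # SPDK poller threads look 100% busy to the OS; ask SPDK how much of
    # that spinning was real work versus idle polling.
    out = subprocess.check_output(["rpc.py", "-s", sock, "thread_get_stats"])
    threads = json.loads(out)["threads"]
    busy = sum(t["busy"] for t in threads)   # field names: assumed
    idle = sum(t["idle"] for t in threads)
    return busy / (busy + idle) if (busy + idle) else 0.0

# Unified view: substitute this fraction for the OS-reported 100% on the
# SPDK cores, then combine with normal OS stats for the remaining cores.
```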
J
I didn't measure it like this, I didn't show it on this graph, and of course I've lost the Grafana, but here's the line: we did see, it said, seven percent of one core on each OSD node, and these are eight-virtual-core nodes. So yeah, there was a measurable CPU utilization increase on those nodes. Now, it wasn't quite as much as we expected. Let me see if I left that slide in.
C
So the redirector that lives on the OSD node, that's now using SPDK, is running at 100% CPU?
J
It's being used the same way; it's not an absolute cost, it's just a difference in capacity, exactly. If you can have it, put it in the list of poll functions that the poller runs, great, fine, as long as it happens. And all this was done with kernel networking; I didn't use DPDK networking here. It just wasn't important for this PoC, and you may or may not want to do that in a real Ceph cluster anyway. So it's orthogonal: you could use user-mode networking or not. We expect most people will not.
H
So, fio plus RBD, minus RBD plus this ADN thing, that's like, yes, 25 percent?
J
We are saying it saves you half of one core at 30K IOPS; 30K was the unit we were pitching to the customer that was interested in this. I'd have to think about it before I would make the general percentage statement that you just made.
J
The other thing to note here is this sort of tortured PoC. I think I left that slide out, but even with just plain RBD, versus ADN, it wasn't actually just plain RBD; it was always going through an SPDK app.
J
So we had an SPDK app that exposed NBD, and we ran fio in user mode to get to that, and then we either went directly to kernel RBD out of the bottom of the SPDK app, or we went over NVMe over Fabrics to another node, or we went through a redirector to go to one of the eight nodes. So all of the NBD overhead was always there, and all of the overhead of leaving the SPDK app to go to kRBD was always there.
J
The only thing that changed was what we did in the middle, in the SPDK app. So we think that cancels out all these other things; you would not actually build it that way, but that's the RBD case. The ADN case, however, was direct to whatever PCI...
J
So we think it saved some, and we're not quite sure why, and we were also surprised by the overhead numbers. We were even more surprised, I think I left it out, by what the latency did. I put that in here; back to here, the latency, paradoxically...
J
Right, latency got lower with ADN than with just going to kRBD. We don't understand why: exactly the same amount of RBD work had to happen, we just did it on eight nodes. So is that the answer? And this was, well, this was queue depth 64. So maybe that is the answer: maybe they just used hotter caches, and so they got better performance, and when you shove queue depth 64 at one node, it fell off the cache cliff. I don't know, but that was the result. Now, I wouldn't...
J
I wouldn't bet on it; I wouldn't count on this in production, right? This is probably an artifact of this test setup, but that's what the numbers said, so we're just reporting them. So that was weird. Was there another question in there that I skipped over, or did we... let's see, we've...
B
...talked about yet, regarding the redirection: with any of these kinds of setups, where you're sending I/O from the client to a separate system and then going into RBD from there, if you were changing the, call it a gateway if you want, but the thing receiving and translating that into RBD...
B
Is there any kind of handling of when that changes? Trying to avoid sending the same I/O request again, or making sure that I/Os that are already in flight don't cause correctness problems if they're still...
J
Yeah, let's jump back to this diagram. This is the first one we talked about, with the two different I/Os: the first one without a clue and the second one with a clue. What we don't show here is the responses. So the idea is that if you're a host, you initiate an I/O to one of your targets. If it has to forward it, it initiates an I/O to the target it knows it should really go to.
J
The I/O is still in flight the whole time, so when it completes here, it completes back to here, and then it completes back to here. So if, for instance, this guy died while this was forwarded, then yeah, in theory, there could be an I/O in flight here from this...
J
...taking it over, and there's a possibility for... yes, if it was already in flight. So there are a couple of things we haven't really, well, okay. From the host's point of view, it didn't finish. The application doesn't know about any of this, right? You gave the I/O to this redirector. The redirector knows there are multiple alternative targets for it, and its job, if it tries it here and that doesn't work, is to try it on one of the other ones.
J
No, that's... so we did that in the RWL; we had to do overlap detection for ordering, because we were replicating and we needed to make sure that the replicas are all identical. So we can't let that race happen twice and resolve differently in different places. Now, our basic position is that if you're an application and you issue concurrent writes to the same place, that's a bug.
J
You probably shouldn't have done that, and most block devices in JBOFs and storage arrays don't give you any kind of guarantee at all.
J
Yeah, basically it's a bdev I/O, and it's not complete until it completes to one of these targets. If it doesn't complete, then they try it on a different target. So the only race happens out here: I tried it here, this guy died, and this guy had it already; he's got it in his queue, or it's in flight.
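So host-side failover is completion-driven retry. A toy sketch of that rule, with invented names, just to pin down the behavior described:

```python
class TargetFailed(Exception): pass
class AllTargetsFailed(Exception): pass

def submit(io, targets):
    # A bdev I/O is not complete until some target completes it; on a
    # failure, try the next alternative target for the same bucket.
    for target in targets:                     # preferred target first
        try:
            return target.submit_and_wait(io)  # hypothetical initiator call
        except TargetFailed:
            continue                           # e.g. that node died mid-flight
    raise AllTargetsFailed(io)
```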
H
So the redirector here: if we move, for RBD at least, the movement of an RBD image from redirector one to redirector two means unmounting in one place and remounting on the other, right?
J
It changes, right. And so I contrived examples where, you know, at first I gave it a hint where I deliberately mangled this hash table, and you start up I/O, and then, when you go back and look at this graph here, it doesn't look beautiful like this, because the host has the wrong idea.
J
Okay, but then, when you update the hint, the host applies it and starts sending I/Os to a different place. So yeah, the idea is that there is no correctness issue, because all of the targets in the OSD nodes can complete any of the I/Os. If you send it to the wrong target, it still completes: you still give it to librbd, it still does the right thing with it, it just takes longer.
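That is why a stale hint is a performance problem rather than a correctness problem. A sketch of the target-side dispatch implied here, with invented names:

```python
import logging

def handle_io(io, local_pgs, librbd_submit):
    # Any target can complete any I/O, because librbd routes it correctly
    # over the Ceph network either way; a wrong guess by the host only
    # costs an extra network hop (latency), never correctness.
    if io.pg not in local_pgs:
        logging.debug("I/O for pg %s hit the wrong target; completing anyway", io.pg)
    return librbd_submit(io)
```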
H
I'm not worried about caching, because the application is where the caching should be done. That's fine, right? I agree, but, for instance, a particular image will have operations from many different clients. So if we wanted to do something clever, like remembering only the most recent client ID that did an operation on a particular object, that's impossible. Yep, yep. So there are disadvantages to...
H
...sure, there are disadvantages, and there's more. Another way of doing the redirector would be that you maintain the exclusive mapping and actually move it, and your redirector, instead of simply running the RBD operation, forwards it to the one that does have the RBD image mounted. That is an optimization you...
J
We could do that, yes; it's more complicated. So Ceph is kind of a beautiful example of a very simple DVM because, like we said at the beginning, normally a DVM's job is, if the I/O comes to the wrong place, you've got to get it to the right place somehow. In the systems where we considered doing this, like a pairwise HA JBOF with a PCI bus in between them...
J
...that's pretty easy in that case, because everybody can see all the SSDs, right? The minute it becomes a bunch of boxes, a bunch of commodity servers on a fabric, suddenly you've got another fabric hop involved. But in Ceph you don't have to do that. Also, a normal DVM needs to very closely manage these, what we call egress redirectors, to make sure that they all have perfect knowledge of where everything goes, because they're the source of truth, and in the Ceph case, in this implementation, that's kind of relaxed.
J
We could be kind of sloppy about it, because all we care about is performance. What you're talking about is tighter integration, where you're more tightly integrated with other RBD performance features, and then suddenly that would matter, and we think the architecture allows it, but that's not what I built now. The bad news is it kind of gets worse, as you guys have all probably realized by now: this thing is only talking about the hash function for what you have.
J
If you have a clone stack, the hash hint only talks about the top, right, not everything under it. So that means, if you've got a clone that differs by one object from its parent, all the I/O is going to go to the wrong place, and Ceph is going to have to forward it, because ADN doesn't know about that right now. If you decided to do striping over RBD objects, I don't know how popular that still is, but obviously that's not going to work here either.
J
I should mention that, because that's actually my straw-man proposal: you should turn on the shared object cache in all your egress redirectors. The reason I don't usually say that out loud is because then people say, wait a minute, I have to have SSDs in there, and they're just caches for stuff that's on my other OSDs; you're just trying to sell me SSDs? Yes, yes, we are. But yeah, you're right.
J
It would, and it's probably the cost equation for the customer, right, to decide whether they care that much, or whether they really just want to move their caching.
J
Oh, okay, well, sure, if you've got enough memory for that, yes. But to really handle clones and stripes, you would need a much more sophisticated way of combining the hints in the host, and we've deliberately drawn a line here and said no: let's get this out there and see what people think with this sort of minimum viable system, right?
J
We can all think of ways that you could combine these hints, and I can tell you I've spent some time thinking about it, and there's no super simple way to do it. There are ways to do it, but they're all kind of over the line of complexity that you'd like to have in a version-zero something. So we're not doing that, and if you are a person who wants to look at whether this works for you, you'll have to understand those limitations.
J
Basically. And, as you say, turning on caching in your targets will help tremendously.
J
It'll certainly make sense if you're trying... okay, so the basic motivator is: if you're a cloud provider and you're billing people for cycles in the hosts, then if you're spending those cycles on anything in your infrastructure, that's money. Yeah, you can account for them separately, you can, but...
J
That's right. You know, we would have been happy if all your OSD nodes were two-socket Xeon servers, as planned. Obviously, that's totally obvious.
J
The poster-child case is that you'd like to do bare-metal provisioning. You want to be free to give it whatever storage works for you. It makes sense, or it's at least handy, if it can see the world as a bunch of NVMe devices, and that interface means it can be very performant if you put very expensive storage behind it, and with this it'll still work...
J
...if you put something more capable and flexible and a lot cheaper behind it, though. Okay, fair enough. Now, it's tempting to say, well, how would you migrate things? How would you unify that kind of system, so you can have an NVMe JBOF and Ceph and maybe migrate volumes back and forth?
J
That's clearly out of scope. But did you consider...
J
SPDK is BSD-licensed, so we can't just bring Ceph code in here. There's something I'm not prepared to talk about yet that would be very flexible: basically, hosts could define their behavior.
J
Yes, every time the OSD map changes. But you've got, you know, hundreds of images and maybe thousands of clients, and you really only need to generate these two things once when the map changes, and because those log pages have digests, hosts...
J
Yes, how big is it? When we talk about putting this in pieces of hardware, that's a real problem. So if you're talking about, you know, hardware-accelerating this, you get to a point where you've got to make a choice and say: well, if your PG map is, you know, a thousand entries, we can definitely do that; if it's half a million, sorry, you're going to go to the slow path, we're going to have to do this in the software database. When you think about a hardware offload for ADN...
J
...you always have to have the software path there for the things that the hardware can't do. You know, if this becomes standardized in NVMe, there'll be a table somewhere that says here are the hint types you can have; in version X of the spec there'll be N of them, and then in the next version there'll be more. Your hardware, or the firmware, or device, or your OS may not understand those. So there...
H
...are other approaches to this too. I've suggested in the past that it's sort of irrational for an RBD device to be spread across an entire OSD cluster. We could constrain the placement to just a subset of the PGs: if, for instance, to get the parallelism you wanted on a particular entity, on a particular RBD image, you needed 64 or 128 nodes, then you could just constrain it to that.
J
Yes, yes, that would work, and it would be really easy: you could completely hide that in whatever generated the hint, right?
J
We're trying to resist the temptation to be clever about this, though, right? Ideally, like Jason said, we'd like to be able to run it in a very constrained environment and basically abstract the storage to hosts, so you can have a bunch of dumb bare-metal tenants. It seems to us that the attack surface is reduced when you do that too: you no longer have to have your hosts be authorized to make RADOS connections to things.
J
They can only do NVMe things, so they don't have to have the keys; they can't really view the state of the cluster at all, they just see block I/O. That may or may not be important to you as a customer. So let me just go back to... I mean, we're probably way out of time. Sorry I was late.
J
So, where we're at is: this reference implementation is out there. We only have the SPDK building block now. You could imagine you might like to try this with a kernel ADN client; well, we haven't written one of those. There could be one, but, you know, kernel development is so much fun that we're getting all the bugs out of it this way. And the beauty, from my point of view as the only guy working on this right now, is that I built one of these and you use it everywhere.
J
So I didn't have to make a client and a target; there's only one thing. So, like I said, it's out; there's an RFC patch.
J
There are also some NVMe ecosystem issues. ADN essentially does what you'd call a shared namespace: we have a namespace which you can see from multiple subsystems, and although I didn't think NVMe prohibited that, it turns out it actually does. So one of the things that breaks in NVMe is you can't do reservations on it. You can't do other things that have been defined since I started this, basically NVMe over Fabrics multipathing, or actually it's NVMe multipathing.
J
It's called ANA; that won't work with these. But my position is that you almost don't need it, because you've got multiple targets, so you can handle loss of a path. It's not as performant as ANA, which really has two paths to the same thing, but I was going for a different use case here. Reservations may or may not be important. Now...
J
...this reference implementation doesn't implement that, of course. The real issue, if you're an NVMe person: the things that a distributed or shared namespace presents problems for are the NVMe commands for namespace management, the things that, you know, create and destroy namespaces in actual SSDs. Now, if you're making a distributed storage system, you're not going to manage volume creation that way; it doesn't make a lot of sense.
J
So the option is that these targets just won't support those commands; they're optional commands anyway. But the bottom line is: there are standard extensions in progress for what they call dispersed namespaces, and I apologize for ADN's name; I didn't know about the dispersed namespaces technical proposal when I started this. Actually, it started well after this, and it specifically addresses all of these issues.
J
For other reasons, storage vendors would like to be able to expose a namespace from multiple subsystems; their sort of headline use cases are migrating between arrays and, you know, administrative maintenance tasks like that. If you're doing that, you need that capability at least temporarily. And so, if you're an NVMe person, you look at those standards and you see dispersed namespaces.
J
You might think, oh, that's the same thing. No. Where this is probably going is: if that TP proceeds and is adopted, then ADN would naturally stack on top of it. You'd say, well, you use ADN with a dispersed namespace and it adds these capabilities, which are essentially just a bunch of log pages with their contents defined and the recommended host behaviors.
J
So that's kind of where we are now. The code is there so that people who might use it, and might want to reconcile it with the ecosystem issues, can contribute and comment and tell me what works and what doesn't. But we don't have a package for Ceph yet, and it isn't really time to do that, I don't think. And in fact, since that's not part of the part of Ceph that I have worked on... where did I... let's go back to this slide.
J
So at some point I would consult you, just like this, to say: well, how should this be integrated into Ceph? If somebody wanted to create a package that added Ceph ADN capability, how would you do that? Would you use the iSCSI thing as a model? Is there a better model? That's sort of my question to you: what should we be looking at there, or targeting?
C
I mean, it's a standalone product, right, a project. So you have tcmu-runner and you have the ceph-iscsi packages, and they're not tied in; Ceph separately provides the core functionality, and tcmu-runner and ceph-iscsi are just clients to whatever librbd/librados package gets installed with them, because there are no one-to-one feature tie-ins. This is not like the Octopus release of iSCSI; it's iSCSI that might be using Octopus clients. So they're distributed and kept separate on GitHub.
J
Okay, so that's one way to think of it: an ADN adaptation layer could be a separate package that just uses normal Ceph libraries and does its thing. That would kind of limit us to the architecture we have here, where we just use RBD in the OSD nodes. If you want to do anything trickier...
C
And then the other question I have is on the fabrics back end: how many namespaces can you really have per target?
J
I can't remember if it's a 32-bit number or a 16-bit number, but obviously the real limit is going to be an implementation limit of whatever framework you're using. You are free to have more than one target, or more than one subsystem, so you can pack them a number of ways, right? The reason we talk about it as one is that that's the minimum number, right?
C
...in use concurrently that are now mapped across every single OSD node, right. And I think, really, unless you get to the point of what Sam mentioned, like, oh, a given RBD image can be pinned down to a given subset of PGs, right, but then you start doing that and it's like, well, is this really then just a gateway? Yeah, you know.
H
You would, so, yeah. The two major pieces would be somehow restricting the number of PGs a node can map to, or changing the overhead of the hints. In addition, you probably can't, in general, map every image on every OSD; that's extreme. So you would need some mechanism to only map on the OSDs they're supposed to be the target for, with, you know, support for moving them...
H
...to a certain set of PGs. So, those two.
H
It gets more complicated; Ceph gets you right across a lot of that, right. But you were asking what the architectural changes would be that you would make as it becomes more complicated, and I think this is what they would be. For the simple version, though, I don't see a reason why it would necessarily need to be included in ceph.git; an external project seems fine. This doesn't have a hard dependency on any particular version of Ceph, and it doesn't know anything about the Ceph internal protocols. As it becomes more useful...
J
Yeah, the things you might do to eliminate or reduce the RBD overhead in the OSD, you know: integrating it into Crimson somehow, in some way that makes sense, or at least maybe optimizing that transport connection to say, well, if you're talking to your local OSDs, maybe we don't need all this, no encryption.
H
Even that could be done within the client. You just tell the client, by the way, you're likely to... yes, the local connections.
J
That's cool. So, these things, you know, there are no immediate plans to try and do any deep integration; we figured we would wait until there was a clear need and the feature set, you know, shook out. So, unless there are any more questions, that's basically it. I did point you at the...
J
Got a second... where did it go? I wanted to point out this README file here at the bottom. Oops, wait for that... there, clicked on it. Rather than a PowerPoint, some people would rather read an explanation of how it works, with pointers to the scripts that do all this stuff. It doesn't sound like any of that's a big mystery: Ceph people understand that you can get all this information out of the Ceph CLI tools in JSON form, and that's what we did.