From YouTube: ZIO Pipeline by George Wilson
From the OpenZFS Developer Summit 2018
Slides: https://docs.google.com/presentation/d/1ohdmjsp9mejuSRKwDeU83o9297KaiHrqC-tV__kjO6E/edit?usp=sharing
Our next speaker is going to be George Wilson. Most of the people here, I think, know George Wilson: he's one of the earlier OpenZFS and ZFS developers, and he's done a lot of work on performance, on allocation, and on various aspects of ZFS, and he's probably had to save a lot of pools for you guys. So please, everyone, welcome George Wilson, who is going to talk about the ZIO pipeline.
With regards to this talk: I always find it interesting when I have to go and think about a subsystem that I haven't worked on in a really long time, and then you start realizing, gosh, I don't remember this. Why does it work this way? So a lot of this talk is going to be a little bit like that.
So, a quick shout-out to one of my fellow colleagues and partners in crime when it comes to ZFS. There was this tweet back in June that I saw, and I thought: that's kind of interesting; maybe that's a reason to have a talk about this at the OpenZFS summit. So that was the inspiration.
Initially, this was going to be the presentation; I was just going to stop here. Yes, exactly. As Brian mentioned, this is actually in the source base, and he was just letting us all know: hey, this exists, maybe we should do something about it. Hopefully this talk inspires somebody to go do something about it.
So, first of all, let me set the stage of where the ZIO pipeline lives and some things about it. It's kind of fascinating: for as small a number of lines of code as the I/O pipeline has, it's actually extremely dense code, and it does a lot of things that you might not recognize actually happen in that layer.
As you can see in this diagram, the I/O pipeline sits at the lower area of the stack. It's in an area that we refer to as the SPA, it's comprised of a couple of other components, and it works very closely with the vdev layer. It really is the framework for all I/O that is driven through ZFS.
As you're starting to do things, whether it's reads, writes, or even ioctls, they're coming through the ZIO pipeline. It's in this layer where we're actually going to do our translations from data virtual addresses (DVAs) to actual physical locations on disk. It doesn't handle that completely by itself; it works very closely with the vdev layer to do that, but it's instrumental in getting that portion done.
It's also here where we do any transformations, whether that's checksumming, dedup, or compression. A lot of the things that you're used to seeing, like `zfs set compression=lz4`, get set in an upper layer and passed down to the zio, and it's in this layer where that actually happens. And it's kind of neat, because the way the checksum and compression code works is somewhat pluggable, so we can extend it.
We've probably all seen, over the course of the years, that we've extended it and added new checksum algorithms and new compression algorithms; it all happens in here. And there's a bunch of other stuff: quite a lot happens at this layer, whether it's allocation throttling, selecting how we're going to do ditto blocks and which devices have to take I/O for them, or whether or not we do gang blocks. We'll talk a little bit more about those in a few minutes.
Think of a CPU pipeline staying a hundred percent busy by having four different instructions working on different stages. If you think about I/O, we can do a similar thing. If we break it up, we have fetching some data, where we actually get, from the user or the caller, whatever data buffer they're sending us; then we decode it.
So we know whether we're doing a read or a write, we initiate the I/O to the underlying device, and then we get some kind of response. In a sense, we can break this up into pipeline stages, and with ZFS we've done exactly that. They're more intricate than just a four-stage pipeline, and they differ based on the types that exist. So here are some of the basic pipelines that exist in the zio subsystem.
We have two different types for physical I/Os, for doing physical reads and physical writes. We also have logical reads and logical writes. And then we have a bunch of more specialized pipelines: for frees; for claims, which is a very unique thing that happens when we actually do a pool import; for ioctls, which are primarily used to flush out the write cache; for rewrites, which are primarily only used by the ZIL; and then these other ones, the null and the root pipelines, which we'll talk about.
So let's see what makes up these different pipelines. When we're talking about these, we actually have different types of I/Os that get associated with them. We just saw all these different pipelines that exist, but then there are six different types of I/Os that can utilize those pipelines: reads, writes, frees, claims, ioctls, and the special null one. These different types utilize task queue pools to do the I/Os and move them through the various pipeline stages. I call them task queue pools because they're actually comprised of two different task queues: we have a normal-priority one and a high-priority one, and the number of threads associated with each of these task queues depends on the I/O type.
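To make that shape concrete, here is a minimal C sketch of per-type task queue pools. The type names follow the talk; the thread counts and the two-priority layout are illustrative stand-ins, not the actual OpenZFS tunables.

```c
/* Toy model of per-zio-type taskq pools: one normal- and one
 * high-priority queue per type, with type-dependent thread counts. */
#include <stdio.h>

enum zio_type { ZT_READ, ZT_WRITE, ZT_FREE, ZT_CLAIM, ZT_IOCTL, ZT_NTYPES };
enum tq_prio  { TQ_NORMAL, TQ_HIGH, TQ_NPRIO };

struct taskq { const char *name; int nthreads; };

int main(void) {
    struct taskq pools[ZT_NTYPES][TQ_NPRIO];
    /* Writes get many issue threads; claim gets a single thread. */
    static const int threads[ZT_NTYPES] = { 8, 32, 1, 1, 1 };
    static const char *names[ZT_NTYPES] =
        { "read", "write", "free", "claim", "ioctl" };

    for (int t = 0; t < ZT_NTYPES; t++)
        for (int p = 0; p < TQ_NPRIO; p++) {
            pools[t][p].name = names[t];
            pools[t][p].nthreads = threads[t];
            printf("taskq %s/%s: %d threads\n", names[t],
                p == TQ_HIGH ? "high" : "normal", pools[t][p].nthreads);
        }
    return 0;
}
```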
If we're talking about something like writes, we may actually have a lot of issuing task queue threads, versus something like claim, which may only have one task queue thread. The whole general premise behind the I/O pipeline, and kind of a guiding principle, was that we never burn a thread. We never get to a point where, when you're driving through the pipeline, you actually go into a condition variable and block. Whenever we get to a point where we have to give up control, we transfer control to something else.
So here's the way the pipelines work, and I've broken these up to show the differences between logical reads and physical reads. You can see here that the upper section shows the physical reads. We have these stages indicated in red, the ready and done stages: these are our interlock stages, and I'll talk more about what they mean and why we have them. Then we have these green stages, where we're actually issuing I/Os; these are all the vdev communication.
This is where we're actually going to be talking to the vdev layer when we're driving I/O through the pipeline. And then the blue stages are computational stages that happen throughout the pipeline. So in this case, when we're looking at physical reads, we can see that we simply go through the ready stage, where there really might not be very much to do; we issue the I/O to disk; once it completes, we verify the checksum; and then we call done, which calls the callback and notifies the caller.
The write stages get more complicated. Here, when we're looking at physical writes and logical writes, we can see that there's a common stage that gets inserted, which is issue async. This is where we're actually going to transfer control from the caller's thread to one of these task queue pools; we're going to say: do this work on my behalf. Both write pipelines will do that for you, and they'll also generate the checksum.
The difference here is that we see these orange stages, which exist in the logical case, where we're going to transform data. This is where you're going to do the compression, the encryption, the things that you're used to setting as a property on the pool; they're going to be done in these stages of the pipeline.
It's also worth noting where the checksum generate code happens: you'll notice that the transformation and the checksum generate happen after the issue async stage. That's because we want to make sure that we can fan this work out to as many threads as possible to go compress and encrypt the data, and then, once that's stable and the data has finished transforming, we can actually generate the checksum that we're going to put on disk.
The logical case also introduces these purple stages, which are the allocation code. This is where we're going to first throttle the I/O, to make sure that we're not overloading a bunch of disks that potentially can't handle the workload; that's done by the DVA throttle. And then the DVA allocate code is where we're actually going to go talk to the metaslab subsystem and say: find me a block where you can actually write this. So that's in the write pipeline.
Writes have optional stages for dedup, where we want to go update the DDT (dedup table) as part of the write. But we also have this concept of a nopwrite, where it's possible that we're actually overwriting the data with the exact same content. If we're doing that, let's just abort the pipeline early and get a performance win: don't do the allocation, don't do any I/O.
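Here is a minimal sketch of that nopwrite decision, assuming the simplest possible types: if the checksum of the new data matches the checksum already stored in the existing block pointer, the write can bail out before allocation and I/O. The types and function name are made up; the real check lives in the zio code and additionally requires a strong (cryptographic) checksum to be in use.

```c
/* Toy nopwrite check: identical strong checksums => skip the write. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t word[4]; } cksum_t;   /* 256-bit checksum */

static bool nopwrite_possible(const cksum_t *old_bp_cksum,
    const cksum_t *new_data_cksum, bool checksum_is_strong)
{
    /* Only safe when collisions are cryptographically unlikely. */
    if (!checksum_is_strong)
        return false;
    /* Same checksum => same content: skip DVA allocate and vdev I/O. */
    return memcmp(old_bp_cksum, new_data_cksum, sizeof (cksum_t)) == 0;
}

int main(void) {
    cksum_t a = {{1, 2, 3, 4}}, b = {{1, 2, 3, 4}};
    return nopwrite_possible(&a, &b, true) ? 0 : 1;
}
```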
So that's what you see here with this optional stage. And then, as I mentioned, there are a couple of other special pipelines. There's the null, or root, pipeline, which is the simplest of all pipelines: it just has two stages, the ready and done stages, which we refer to as the interlock. The idea behind these is that they create a container for you.
We know that when it's done, all the children I/Os will have been done as well, and we'll show an example of how that works. And then there's the rewrite pipeline, which looks very similar to write, but the thing that's removed here are the allocation stages. It's kind of odd if you think about the fact that we're copy-on-write: why do we have a rewrite stage, when obviously doing copy-on-write means we go allocate a new block?
So, some other things that happen in the zio subsystem that are probably worth noting: gang blocks. I mentioned some stages where you could actually add those in. If we have a condition where the pool is mostly full or severely fragmented, one option is to fail the I/O, and oftentimes, by the time it gets to the zio layer, there's no way to tell anybody that we're about to fail anything. So that's not really a good idea.
The other option is to create gang blocks. Gang blocks take smaller constituent blocks and logically put them together to make a larger allocation. So assume that you have a 128K I/O that you want to do, but you have severe fragmentation, or you're mostly out of space, when it gets to the I/O pipeline. We would simply try to allocate that 128K block and find that we can't do it.
We can't find a contiguous block anywhere, so we create this concept of a gang header and gang members, and we break that allocation up into smaller chunks, and we'll keep nesting this all the way down to the point where we're allocating simple 512-byte sectors, on the assumption that, even with fragmentation, those allocations will always succeed. This is a huge penalty if you get into a mode where you're having to do this.
Because now, what used to be one I/O to go get your 128K block can turn into an entire tree of I/Os, as we build up and create all these little constituent blocks, stitch them together, and then pass the logical data up to the caller. But it's a necessary concept in order to ensure that we can use all the space that's in the pool, and it prevents us from just throwing up our hands and saying: sorry, I couldn't allocate that, and having to stop all transactions and not let anything go through.
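Here is a toy model of that recursive splitting, just to show how one allocation fans out into a tree. The failure threshold and the three-way fan-out are illustrative assumptions (the real code builds a gang header with up to three constituent block pointers via the metaslab allocator).

```c
/* Toy gang allocation: try the full size; on failure, split into
 * smaller pieces and recurse, bottoming out at a 512-byte sector
 * that is assumed to always succeed. */
#include <stdio.h>
#include <stdlib.h>

#define SECTOR 512

/* Stand-in for the metaslab allocator: pretend anything over 4K fails. */
static int try_alloc(size_t size) { return size <= 4096; }

static int alloc_gang(size_t size, int depth) {
    if (try_alloc(size)) {
        printf("%*sallocated %zu bytes\n", depth * 2, "", size);
        return 1;                 /* one leaf allocation */
    }
    /* No contiguous run: gang it into smaller children. */
    printf("%*sganging %zu bytes\n", depth * 2, "", size);
    int ios = 1;                  /* the gang header itself is extra I/O */
    size_t child = size / 3 < SECTOR ? SECTOR : size / 3;
    for (size_t done = 0; done < size; done += child)
        ios += alloc_gang(child, depth + 1);
    return ios;
}

int main(void) {
    printf("total I/Os: %d\n", alloc_gang(128 * 1024, 0));
    return 0;
}
```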
We also have this concept of child types. When we're going through the pipeline, most often we're dealing with logical children; these are I/Os that were requested by some consumer. So somebody said: go read this block. That read becomes a logical read, and most often in our case that's coming from the ARC. But sometimes, within the pipeline, we have to create new children to go do work on our behalf, and that's what these other types are for.
So when we're doing, say, a logical read, if it's a gang block, for example, it might create some gang children to go read all these little constituent blocks, stitch them together, and then transfer that data to the logical I/O, which will then be returned back to the caller.
Something like a container, like a root I/O, lets us get the error status and the completion status of a much larger set of I/Os. The way our dependency graph works is that we have concepts for notifications, which are these two stages, ready and done. These are places where we can actually get notified when an I/O completes that stage.
At ready, their content is finalized: they've gone through the compression stages, they've done the transformations, they've actually done their block allocation, and maybe they're on their way to disk. So that's where we can have these notifications. We can have an I/O that is currently sitting, waiting to get started, waiting for all its children to make it through those initial stages where they're doing data transformation and allocation; and then, once a child reaches the ready stage, it notifies the parent and says: hey, I'm ready.
A parent can't do its own internal modifications until the underlying child I/O has actually finished and gotten to the ready stage. And the stalls can be either very coarse-grained or fine-grained, depending on the operation that's taking place. In some cases you only want to wait for the vdev children to complete; that might be what you see in the vdev I/O done stage, where we issued a bunch of I/Os and we simply want to wait to make sure that the I/Os have made it to disk before we move forward.
So the only thing we really care about there are the vdev children; we don't care about logical children. This is something that you'll see throughout the code, and it's kind of a subtle thing to note, but it's actually a very powerful thing when you're consuming and driving the I/O pipeline.
So here's a dependency graph that we would see, and it's probably pretty common within ZFS: we create this root I/O, and if you remember, the root pipeline simply has two stages in it. We add these children underneath it. Maybe we add the first row of logical children, and they might be reads or writes, we don't really know; they might actually add some additional logical children underneath them; and maybe at the very bottom we get some vdev children, which are actually doing the physical I/O.
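A minimal sketch of that parent/child interlock, assuming the simplest possible bookkeeping: a parent keeps counters of children that have not yet hit their ready and done stages, and it can only pass its own ready/done stage once the matching counter drains. The field names are made up; the real code tracks this per child type and links parents and children through lists.

```c
/* Toy parent zio with ready/done interlock counters. */
#include <stdio.h>

typedef struct pzio {
    int children_not_ready;
    int children_not_done;
} pzio_t;

static void child_ready(pzio_t *p) {
    if (--p->children_not_ready == 0)
        puts("parent: all children ready, proceeding past READY");
}

static void child_done(pzio_t *p) {
    if (--p->children_not_done == 0)
        puts("parent: all children done, proceeding past DONE");
}

int main(void) {
    pzio_t root = { .children_not_ready = 2, .children_not_done = 2 };
    child_ready(&root); child_ready(&root);   /* both children hit ready */
    child_done(&root);  child_done(&root);    /* both children complete  */
    return 0;
}
```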
So what I'm going to start with is looking at: what is a zio? What is this thing that we've been talking about? How does it actually go through this pipeline? What are some critical things in that structure that allow us to do what we do with it? Here's a small portion of the zio structure (it's actually quite large if you look at it), and what I've sorted out to point out are some things that are interesting about it. For example, we have this concept of a bookmark.
The bookmark is a small structure that actually tells us which filesystem we're going to, what file we're operating on, what level of indirection we're in, and what block ID we're going to be modifying. So we have this context of exactly what I/O we're trying to do embedded in the I/O, from a ZFS object-model point of view.
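The bookmark is small enough to show in full. This mirrors the shape of the bookmark structure in the OpenZFS headers (`zbookmark_phys_t`); the typedef name here is shortened for the slide:

```c
#include <stdint.h>

typedef struct zbookmark {
    uint64_t zb_objset;   /* dataset / filesystem being operated on */
    uint64_t zb_object;   /* object (e.g. a file) within that dataset */
    int64_t  zb_level;    /* indirection level; 0 = data block */
    uint64_t zb_blkid;    /* block ID at that level */
} zbookmark_t;
```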
We also have this properties field. I mentioned that there are all these things we do when you write: you either do a `zfs set` of some property or a `zpool set` of some property. Those get passed down to the zio itself, and they're stored in this io_prop field. So, things like: how many copies do you want of this data? Do we want to create additional copies? What checksumming algorithm do you want to use?
I mentioned the I/O types; these are the different types of I/O: reads, writes, frees, claims, ioctls. The child type is whether it's a vdev child, a logical child, or a gang child. And then we also have these callbacks, where we can actually call back to the user that issued the I/O periodically as it's going through the chain. So we have places where we can call back when you get to the ready stage.
That's one of the notification points that I mentioned. As this I/O gets to the ready stage, it can call back to the consumer and say: hey, just letting you know, I'm in the ready stage; and the consumer might do something as a result of that, or do nothing. Likewise, the done callback is typically the most common one, where we'll actually return the response, whether it's returning the data or some status; that's where the consumer will get most of its information.
Also embedded in here are what stage of the pipeline we're in and what pipeline we're using. And it's worth noting that the pipeline referenced here may have changed: we may have started out doing a full logical write pipeline, and then that's been converted to a nopwrite pipeline. So things can change.
The other thing I was going to mention is that we also maintain the priority for the I/O, and although it's not really used and consumed by the I/O pipeline, it is passed down to the vdev scheduler. There's only one place in the pipeline where we actually look at the priority, and that's when we're dealing with the task queue pools, where we want to determine: do we want to send this to a high-priority task queue or a normal-priority task queue?
Okay, so let's talk about how we create one of these suckers. Here's a very common entry point, zio_read, and we can see that when we call zio_read, we want to do a logical read, so we're going to pass in a block pointer. The picture on the right represents what this block pointer would look like, and it has in there three DVAs, which are these vdev/asize/offset triples.
The thing to note here is this parameter, which is currently NULL; it happens to be the vdev that we're going to do the I/O to, and we pass it in as NULL because we're passing in a block pointer, and the block pointer is going to determine the actual physical devices that I need to talk to. We also pass in the stage where we want to start, so we're saying: start in the open stage when you're doing this I/O, and your pipeline is going to be a read pipeline.
The other thing to note is that this function doesn't actually do any I/O; it just simply returns back this structure. So at this point in time, it seems like we should be going off and doing the read, but we're really not going to do that just yet. We're simply going to create the structure, tell it what its intention is going to be, and then sometime later down the road somebody had better go off and do the actual physical I/O.
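Here is a hedged sketch of the shape of that entry point: fill in a zio with the block pointer, a NULL vdev (the BP's DVAs will name the vdevs later), the open starting stage, and the read pipeline mask, then hand it back without doing any I/O. The field names and the simplified signature are assumptions for illustration; the real zio_read() also takes the data buffer, size, flags, and bookmark.

```c
typedef struct blkptr blkptr_t;           /* opaque for this sketch */
typedef struct vdev vdev_t;
typedef void (*zio_done_func_t)(void *);

typedef struct zio {
    const blkptr_t *io_bp;
    vdev_t *io_vd;            /* NULL: derive vdevs from the BP's DVAs */
    zio_done_func_t io_done;  /* "done" callback for the caller */
    unsigned io_stage;        /* starts at the OPEN stage */
    unsigned io_pipeline;     /* stage mask for this zio */
} zio_t;

static zio_t
zio_read_sketch(const blkptr_t *bp, zio_done_func_t done)
{
    zio_t z = {
        .io_bp = bp,
        .io_vd = NULL,
        .io_done = done,
        .io_stage = 1u << 0,  /* OPEN */
        .io_pipeline = 0x7f,  /* stand-in for the read pipeline mask */
    };
    return z;  /* no I/O has happened yet; someone must execute it */
}
```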
Both of these functions, zio_wait and zio_nowait, will call the driver of the pipeline, which is zio_execute, and we'll talk about what that looks like. The other thing to note here is with zio_nowait: when you issue that, it's kind of deceiving; you'd think it's going to be an asynchronous I/O, but it may not be, because it's possible for an I/O to get to a point where it will issue and even return in the same context, because we didn't pass control to anybody else.
We do, however, add this special call to zio_add_child here, and the reason for this is that we have to have a way to keep track of truly asynchronous I/O. The case in point here is: let's say you've gone off and created a bunch of zios, you did a zio_nowait on them, they're off and running, and now you want to export your pool.
So what happens here is the concept of the godfather I/O. The godfather I/O is associated at a pool level, and it becomes the parent of all nowaited zios. When you go to actually export the pool, we're going to do a zio_wait on the godfather I/O, and when that godfather I/O completes, we're guaranteed that all those children have completed. It turns out that the godfather I/O is a simple zio root pipeline; it just has two stages, just like we saw in the I/O dependency graph.
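A minimal sketch of the godfather pattern, assuming a bare counter in place of the real parent/child machinery: every nowaited zio becomes a child of the one pool-level root, so waiting on the root means waiting for everything in flight.

```c
/* Toy godfather: adopt fire-and-forget I/Os, wait for them at export. */
#include <stdio.h>

typedef struct { int outstanding; } godfather_t;

static void zio_nowait_sketch(godfather_t *gf) { gf->outstanding++; }
static void child_completed(godfather_t *gf)   { gf->outstanding--; }

static void zio_wait_sketch(godfather_t *gf) {
    /* The real code sleeps on a CV; the godfather's done stage fires
     * when the last child completes. Spinning here is illustrative. */
    while (gf->outstanding > 0)
        ;
}

int main(void) {
    godfather_t gf = { 0 };
    zio_nowait_sketch(&gf); zio_nowait_sketch(&gf);  /* fire and forget */
    child_completed(&gf);   child_completed(&gf);    /* completions land */
    zio_wait_sketch(&gf);   /* safe to export: nothing outstanding */
    puts("pool export can proceed");
    return 0;
}
```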
So here's what zio_execute does. This is the heart of the I/O pipeline and the driver of how things make their way through it. It's a pretty simple function: it simply takes the zio and keeps running, calling into different stages, until you get to the done stage.
So if you start off and you're not in the done stage (say we started in the open stage), we'll simply increment the I/O stage, and when we increment the I/O stage, we may actually have to increment multiple levels, because not all pipelines have all the stages. These stages are power-of-two values, so we may have to skip over a couple of them.
We skip over those real quick, find the next stage that actually exists in this pipeline, and then determine, first of all: do I need to switch from the current task queue to something else? That happens if, say, you're on the interrupt task queue and you're in the process of trying to do an allocation, where you may have to go off and do a bunch of reads, because we want to load in some space maps or something. When that happens, we want to make sure we switch you back over to the issuing task queue, and that primarily happens because the issuing task queues have a lot more threads than the interrupt task queues. Once you're on the right task queue, being handled by the right thread, you'll simply call the zio pipeline table function, and you can see, based on the stage, these are all the various functions that will get invoked. So when you're trying to figure out, okay, what is this pipeline going to do?
Well, for example, if the pipeline has a stage like read_bp_init in it, it would call the zio_read_bp_init function; if the stage mask has the DVA throttle stage in it, then it calls the zio_dva_throttle function. All of these functions will either return a zio, which says "execute this zio next when we come back through this big cycle," or they'll return NULL, which says "I really want to stop right now." That happens when we have a stall point.
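Here is a compilable toy version of that driver loop, combining the power-of-two stage advance with the function table and the NULL-means-stall convention. The stage set, the table contents, and the names are illustrative, not the real zio_pipeline[] from zio.c.

```c
#include <stdio.h>

typedef struct zio zio_t;
typedef zio_t *(*pipe_fn)(zio_t *);

struct zio { unsigned io_stage, io_pipeline; };

#define ST_OPEN  (1u << 0)
#define ST_ISSUE (1u << 1)
#define ST_WAIT  (1u << 2)
#define ST_DONE  (1u << 3)

static zio_t *st_issue(zio_t *z) { puts("issue I/O"); return z; }
static zio_t *st_wait(zio_t *z)  { (void)z; puts("children pending: stall"); return NULL; }
static zio_t *st_done(zio_t *z)  { puts("done, notify caller"); return z; }

/* Index = log2(stage bit); OPEN never re-runs, so its slot is unused. */
static pipe_fn stage_table[] = { NULL, st_issue, st_wait, st_done };

static void zio_execute_sketch(zio_t *z)
{
    while (z->io_stage != ST_DONE) {
        unsigned next = z->io_stage << 1;
        while ((next & z->io_pipeline) == 0)   /* skip absent stages */
            next <<= 1;
        z->io_stage = next;
        int idx = 0;
        for (unsigned b = next; b > 1; b >>= 1)
            idx++;
        if ((z = stage_table[idx](z)) == NULL)
            return;  /* stall point: no CV wait, control goes elsewhere */
    }
}

int main(void) {
    zio_t z = { ST_OPEN, ST_OPEN | ST_ISSUE | ST_WAIT | ST_DONE };
    zio_execute_sketch(&z);  /* runs "issue", then stalls at the wait */
    zio_execute_sketch(&z);  /* a completion re-drives it to "done"  */
    return 0;
}
```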
So if we're going through the pipeline and all of a sudden we have to wait for a child, and that child isn't done, we'll return a null pointer back to the execute function, and it will actually short-circuit and break out right there. That's a point where the I/O itself is requesting the pipeline to stop. We don't go into a CV wait; we just simply transfer control to somebody else.
Okay, so let's look at more examples of how this happens. Here's the case where we're calling zio_write, and we're calling this from the ARC. Pretty straightforward: the ARC is going to get some information, then we're going to call zio_write, and we're going to get back the zio. Again, what do we expect to happen here?
This happens when we're syncing out and pushing out a transaction group, and this is how it works. We started off here, and we said: okay, the consumer of zio_write was actually arc_write. It gave back a zio, but it didn't issue it. So, okay, let's go figure out who is going to issue it. Well, the caller of arc_write happens to be dbuf_write. Well, dbuf_write simply takes that I/O that was returned from the ARC and assigns it to this dr_zio member; it doesn't actually issue it either. Okay, well, that's kind of interesting.
So let's look at the caller of dbuf_write. There are actually a couple of callers of dbuf_write, and I'm only showing one here, but in this case we see that dbuf_write gets called, we take the zio that was actually part of the dirty record, and then eventually we're going to call zio_nowait on it. So we're going to issue it asynchronously; we're not going to wait for it to complete.
What's interesting here is this recursive call to dbuf_sync_list. dbuf_sync_list is actually where dbuf_sync_indirect gets called. What this is going to do is start looking at all the dirty records and go from the highest level of indirection: create all the zios for those indirect blocks, go to the next level, create all the zios there, and work its way all the way down to the data blocks. And every single time it's doing this,
you'll notice that it passes in a zio, which happens to be the parent: the zio passed to arc_write gets passed in as the parent to zio_write. This has now built up that dependency tree, similar to my earlier example (my example is much smaller than the real mega-tree of zios), but we build up this huge dependency tree where at the very bottom are all the data blocks, then indirect blocks, more indirect blocks, going all the way up to our meta-dnode.
The other thing to note is we build this thing recursively, and it's when we start popping off from the recursion that we start issuing the I/Os. So, although I don't show it here, there's another function called dbuf_sync_leaf, which is going to sync out all the data blocks; it's going to be the first one to actually start issuing zio_nowaits. So it's going to ship off all the data blocks, and then we're going to return from this function.
So when it's coming through the pipeline, you're going to have all these I/Os coming through. The first data blocks are going to come through here; they're going to go async, so immediately they're going to hit that stage. This is being issued typically from the syncing thread; I think, Matt, we now have one thread per dataset for this.
It's also worth noting here that, as I mentioned, DVA allocate contacts the metaslab code; this is where we're going to do our copy-on-write. The vdev layer here is going to contact one of these functions, and chances are it's going to contact more than one of them. So we have this little nuance: every single time we do an I/O, no matter what your pool configuration is,
we always call vdev_mirror_io_start. The reason for this is that we're trying to figure out if we have the copies property or a ditto block associated with that I/O, and we treat that as a mirror. So rather than doing some special case for ditto blocks or copies, we just use the mirroring code, hijack it, and treat those I/Os as mirrors. So you'll always see a call to vdev_mirror_io_start; if you really do have mirrors, you might see a call to vdev_mirror_io_start twice,
and then eventually it calls vdev_disk_io_start to actually do the physical I/O to the constituents of that mirror. So let's look at that. Here's the allocate path: the allocate path will simply call metaslab_alloc; it'll request the allocation that's associated with this I/O. If that fails, this is where we're going to call the gang code. So when we have a failure, we'll call zio_write_gang_block, and zio_write_gang_block will create gang children to go
do this write again with smaller allocation units. So let's say we get through the allocation code just fine; now we get to vdev I/O start. This is that kind of subtle little piece of code where it says: as I go through here, if the vdev is NULL, call the mirror operations to do an I/O start. Every single time a logical I/O comes through here, the logical I/O's vdev is going to be NULL, so it's always going to call vdev_mirror_io_start to deal with the copies property and ditto blocks in the mirror code.
The mirror I/O start code also has special logic, which I don't show here, for when the vdev is NULL; it says: take the block pointer that was allocated, go figure out the vdevs associated with it, and create this mirror map for me. With that mirror map, it will then simply do this while loop and say: for each one of the children in this mirror map, create a child I/O.
So let's say we have two DVAs: we'll create two children, because they're probably going to two different vdevs. I'm going to create a vdev child I/O that's going to go towards child number one, and one towards child number two. So for ditto blocks, it looks like mirroring; and if you have real mirroring, it'll look the same way.
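A minimal sketch of that mirror-map trick: whether the block has multiple DVAs (ditto copies) or the vdev is a real mirror, the mirror code just walks its children and issues one child I/O per copy. The structures here are illustrative stand-ins for the real mirror map.

```c
/* Toy mirror map: one child I/O per DVA, exactly as for a mirror. */
#include <stdio.h>

typedef struct { int vdev_id; long offset; } dva_t;

typedef struct {
    int mm_children;
    dva_t mm_child[3];   /* up to three DVAs in a block pointer */
} mirror_map_t;

static void vdev_child_io(const dva_t *d) {
    printf("child I/O -> vdev %d @ offset %ld\n", d->vdev_id, d->offset);
}

int main(void) {
    /* Two DVAs (copies=2) behave exactly like a two-way mirror. */
    mirror_map_t mm = { 2, { { 0, 4096 }, { 3, 8192 } } };
    for (int c = 0; c < mm.mm_children; c++)
        vdev_child_io(&mm.mm_child[c]);
    return 0;
}
```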
So this is the vdev child I/O. This is another part of the pipeline that has kind of a sub-pipeline, if you will. zio_vdev_child_io has this thing called the vdev child pipeline, which looks like this little piece on the right-hand side: it's just the I/O stages that go to the vdev, plus the done callback, and that's it. So it's a very small pipeline, but it is used very heavily by all logical I/Os that are trying to do I/O to disks.
So every time a logical I/O comes in, it will actually create a vdev child to go do this work on its behalf. The vdev child will run through this pipeline; the logical I/O will also run through these stages, but what happens is it's simply waiting for its children to do the work on its behalf, and it doesn't actually do any real I/O.
The other thing to note here is the pipeline: notice how the start stage here is the vdev I/O start stage right-shifted by one. Unlike all the other pipelines, where we actually start in the open stage, this is one where we start the pipeline at vdev I/O start. So again, it's kind of a powerful thing with zio creation that I can specify where I start in the pipeline, so I can insert an I/O in the middle.
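The ">> 1" makes sense once you remember the driver loop sketched earlier advances the stage bit before running a stage. A two-line illustration, with a made-up stage value:

```c
/* Illustrative stage bit, following the shifting scheme shown earlier. */
#define ST_VDEV_IO_START (1u << 4)

/* The execute loop advances io_stage *before* running it, so creating
 * the child one bit below VDEV_IO_START makes VDEV_IO_START the first
 * stage that actually executes. */
static const unsigned child_start_stage = ST_VDEV_IO_START >> 1;
```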
I don't highly recommend it, but you can do this. There might be special needs you have, where you say: I want to be able to create an I/O that starts at DVA allocate, because I'm doing something funky; it has that capability. In this case it's a convenience, because we wanted to simply drive through the last portion of the vdev I/O stages. You'll also note that we actually pass in the vdev.
So, unlike the other cases, where we passed in NULL and we let the block pointer figure it out, at this point in time we've already taken the block pointer, we've figured out which vdevs we're going to do I/O to, and we're simply going to go do that here. I'm going to point out this little snippet here.
Has anybody ever looked at this code where, if it's a leaf, it's offset plus VDEV_LABEL_START_SIZE? Okay, some people. So this is something to note if you're ever trying to do a translation of: I have this I/O, and this is the offset in my block pointer; I want to know where that is physically on disk. What this is doing is: every time we go to disk, we are shifting and adding to that offset; I think it's four and a half megabytes, which accounts for the ZFS label on that device.
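The translation itself is one addition; here's a sketch using the talk's numbers (2 x 256K front labels plus a 4 MB boot reserve = 4.5 MB). The macro value is illustrative; check the actual VDEV_LABEL_START_SIZE definition in the headers before relying on it.

```c
/* Toy leaf-vdev offset translation: logical offset 0 in a block
 * pointer maps past the front labels and boot reserve on the device. */
#include <stdint.h>
#include <stdio.h>

#define VDEV_LABEL_START_SIZE ((2 * 256 + 4096) * 1024ULL)  /* 4.5 MB */

static uint64_t leaf_physical_offset(uint64_t logical_offset) {
    return logical_offset + VDEV_LABEL_START_SIZE;
}

int main(void) {
    printf("logical 0 -> physical %llu\n",
        (unsigned long long)leaf_physical_offset(0));
    return 0;
}
```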
So the front of the device has a 512K label section and another 4 MB reserved piece for boot; that's the four and a half megabytes, and we shift everything off of it, so that offset zero associated with the logical block pointer really is at four and a half megabytes. That's what that code is doing, and it's all handled in the pipeline. It's a little subtle thing, if you're ever having to figure out how to map a logical block to where it might live on the physical disk.
So I showed you the first half of vdev I/O start, with this funky little vdev-is-NULL case; I'm now going to show you the bottom half of it, which is where it actually does most of the work. So now we come through and we call vdev I/O start again; this time we're coming through it as a vdev child type, not a logical type, because the logical I/O has moved on to done now that it has kicked off its child. The very first stage that the child is going to do is vdev I/O start.
It's also worth noting this little queue I/O call: I may be coming in, and I call vdev_queue_io, and now I'm calling into the I/O scheduler, and the I/O scheduler may come back and say: hey, it's great that you want to run, but this guy over here is the one that really should run next. So you may come in as, say, the zio for vdev 1, and what you get back is a zio for vdev 4, and vdev 4 is the one that's actually going to
execute. So again, another kind of subtle thing here is that, even though you have started the I/O start stage, you may not be the one that's actually running and going to the next phase right away. You may end up getting queued, because you're being scheduled to run later, or maybe you're aggregated to become part of a much larger I/O.
Okay, so a simple case here: let's assume we have a mirror with two disks. This is effectively what it would look like. We would have a logical write; it goes to mirror I/O start, because its vdev is NULL; that creates a child I/O; that child is going to be associated with the mirror; that in turn calls back into vdev_mirror_io_start, where it's going to create two children, because we have two disks associated with this mirror; and those are also going to go to disk I/O start as time progresses.
Our logical I/O moves to vdev I/O done; it stalls, because its children aren't ready to complete. The child I/O associated with the mirror also moves to vdev I/O done, and it's going to wait; it's now got to wait for the two disks that are doing I/O to complete. So this kind of shows you how things progress through the system, in a very simplistic way.
One thing I will note with the pipeline, and you may not have caught this when we were first looking at all the pipelines: I showed that there are these transformation stages for writes, where we compress and encrypt, but we didn't see anything for reads where we decompress and decrypt. That's because they're handled slightly differently: they're handled with transformation stacks, which are set up in this read_bp_init phase. And so in here, you'll see the two things that are highlighted, where we actually decompress the data.
If you're asking for a block to be uncompressed when you're doing the read, then we'll decompress it in the I/O pipeline and pass you back the decompressed data. It's quite possible that in the new world today, when you're using a compressed ARC, this doesn't get called very often, because we're just simply passing back the compressed block anyway. But anyway.
Let's look at a couple of examples. I have this simple little DTrace script that shows me a pipeline based on different functions. In this case, I was tracing all zios that were being executed from dbuf_sync_leaf and dbuf_sync_indirect; this is where we actually do the syncing of a transaction group to disk. So, a couple of things to note: in the first pipeline we see here, we've added this optional nopwrite stage.
It just so happens that when we're doing the writes on a Delphix system, we add that stage in all the time, to see: can we avoid the I/O? The thing we notice is that it actually went to the allocate phase, so it did not avoid the I/O. So this nopwrite case was simply an extra stage that really didn't do anything.
So, I mentioned stall points. Whenever we stall the pipeline, the thread that's going to come back, the one that's actually now waiting, gets its stage reset, so that the next time it gets invoked, it will start off in that exact same stage. This is a case where we stalled: this I/O actually came through, got to zio_write_compress, and said: oh, my children aren't ready for me, I'm going to go off CPU. It gets woken up a second time, now 150 microseconds later, and completes the rest of its pipeline stages.
So here's one looking at reads, again much simpler, and we see a similar thing. This time we're in zio_vdev_io_done, and we see that getting invoked twice: the first time, it's the logical I/O coming through, now waiting for its vdev children to actually do that I/O on its behalf; and then it goes back on the pipeline once that I/O completes.
What do we see with this pipeline? We know the ZIL does writes; or maybe I should say, what don't we see with this pipeline? We don't see allocations; we see checksum generate, and then it goes to ready. Compare that to this one, where we did compression, checksum generate, nopwrite, DVA allocate. From this we know that the first one is a rewrite pipeline: it doesn't have the allocation phases in it. And if we look at the second pipeline that's listed here, what can we infer from just looking at that?
Yes, I can make that available to you. It needs some love, so I would love for people to scrutinize it and figure out how to use it in a better way. Surprisingly, for all the years that I've been doing stuff in the ZIO pipeline, I've never really had a script as good as this zio-trace one, which is probably about a week old.
The reason for it is that without the throttle, and we saw this quite a bit at Delphix, when you have a pool with a lot of devices that are very imbalanced, your spa_sync time is going to be determined by the slowest device that you're writing to. But if you have the throttle in place, the throttle gives out work initially as one big chunk and then gives it out a little at a time as things complete.
So now, if you have a slow device, but a bunch of fast devices along with it, that slow device maybe gets only the initial increment of work, and then the rest of the work gets dealt with by all the other devices; your spa_sync time just shrank. That was one of the reasons why we went to tackle that: we were seeing cases where you would look at, say, ten devices, and two of them were busy.
The rest of them are sitting idle, and the two that were busy tended to be either heavily fragmented or mostly full, and they're waiting for the last pieces of work to complete. And if you've ever monitored spa_sync and just did an iostat, or a zpool iostat, you'd see this huge ramp-up of work where we're pushing out all this I/O, and then it trickles off and trickles off.
Yes, so the question (let me make sure I get this right, Mark) is: does that account for large writes blocking out demand reads? That was a problem that we solved in a slightly different way. Originally, at Oracle, we had kind of a simpler scheduler, and we re-implemented that at the vdev layer, so now there are scheduling queues for each of the I/O types, and that's where the I/O priority is actually passed into the vdev layer.
So, even though you may have a ton of async writes coming through, we give higher priority to a demand read, and we pull off of that queue even though it may have fewer elements than the write queue does. At Oracle, we weren't doing that; we actually had kind of a combined queue, where everything got sorted in together.
So, with regards to the pipeline, no: the fact that we're using 4K physical sectors won't really impact the pipeline as much. It's possible that the gang code (I'd have to go back and look at this) may not be accounting for physical sector size and may actually try to do allocations smaller than the physical sector, but I'd have to see if that's the case, so I'm not certain.
Yes, so Brian made an observation, which is a great one: for those of you in the room, or even online, for some of the things that are presented here, it would be great if we actually had comments in the code (going back to Brian's original tweet) that explain some of these things. That's a great way to get involved and be part of the community. We welcome it.