From YouTube: ZFS Performance Troubleshooting Tools by Gaurav Kumar
From the 2020 OpenZFS Developer Summit
Slides: https://drive.google.com/file/d/1YzulcT7p7TvHF50aI-Rxg6CMZMIGnxL_/view?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020
All right, so let's talk about performance. The agenda is: we'll talk about some of the goals that we have, what tools are available when we have to debug a problem, then we'll take an example of a case that we looked at and how all these stats come together to figure out what's going on in the system, and then some key takeaways.
As a goal, we have to identify bottlenecks. More importantly, we need to be able to avoid pitfalls: sometimes we might start looking in the wrong direction, and how do we backtrack from there? And there are times when we make some changes, we do some tuning, and we should be able to really see the impact of it. The impact may not be very evident just by looking at the throughput or the latencies; it might be well hidden beneath the layers.
So please don't just jump in, grab perf or any other tool, and start debugging. You have to first understand the system. Let's talk about the internal mechanisms that we have within ZFS. There are a lot of metrics already built into ZFS: we have logs, and we have counters that can be really helpful to nail down an issue. If that is not sufficient, we have external tools, eBPF and SystemTap, that we can use to get more data out of the system. And oftentimes we underestimate the power of visualization, but as Matt showed with flame graphs in his previous presentation, showcasing all these metrics in some UI is really beneficial, and we'll talk about that in later slides.
Now, this one looks like a functional issue, not a performance issue. One of the easiest things that you can do is enable the code paths that return errors. After doing that, if you look at those logs, there's an interesting entry there: dsl_dir_tempreserve. Basically, we are trying to see if the incoming write can be served on the disk, and we are seeing an ERESTART error here, so looking at the code we can quickly find out what's going on.
You may not always have the leverage to deploy all these tools and make use of them, so your best tools available are the logs and the counters that you already have on the system. dprintf is what we are basically using here. The only issue I have with dprintf is that it's very chatty, and the log can roll over very quickly, so be careful about that.
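As a minimal sketch, assuming a Linux host with the zfs module loaded: the debug log that dprintf feeds is exposed through the dbgmsg kstat, and the zfs_flags module parameter controls what gets logged (the flag bits are version-specific, so check the zfs-module-parameters man page for your release):

    # Read the in-kernel ZFS debug log, where dprintf output lands
    cat /proc/spl/kstat/zfs/dbgmsg

    # Turn on extra debug logging; the bit values vary by version,
    # so verify against your zfs-module-parameters man page first
    echo 1 > /sys/module/zfs/parameters/zfs_flags

    # Clear the log (supported on recent OpenZFS)
    echo 0 > /proc/spl/kstat/zfs/dbgmsg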
It tells you what's going on, which is important. For example, we actually had a performance issue where there was a certain spike in latency for an appreciable amount of time, and when we looked at the pool history we saw that there were dataset deletes happening, a lot of deletes, and then we figured out it was a problem with the trims.
So it could be really useful. A cool tip: use the -i option. It can give you more information about when those events were happening; for example, it will dump the transaction group number. Looking further, we have zpool events, which is important because it can tell you if you're seeing checksum issues on a disk.
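A quick sketch of both commands; tank is just a placeholder pool name:

    # Pool history including internal events (-i); internal entries
    # carry the transaction group (txg) number for each operation
    zpool history -i tank

    # Recent events: checksum errors, degraded vdevs, slow I/Os, etc.
    zpool events
    # -v dumps the full payload of each event
    zpool events -v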
zpool events also tells you if your disk is in a degraded state, and more than that, I like to even see if a particular I/O is taking longer than usual. So I generally tend to tune this parameter at times; last I checked, it was set to a 30 second default.
What it means is that any time an I/O takes more than 30 seconds, it's going to raise an event, and it will be logged. So I generally put a small value here, just to see, out of all the disks I have, if there's any particular disk where I'm seeing higher latencies compared to the other disks. The typical output that you will get here is a delay report from ZFS; it will dump the path of the disk and a bunch of other information.
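The tunable isn't named in the audio; on recent OpenZFS, the 30-second default described here corresponds to the zio_slow_io_ms module parameter (older releases used different deadman-related names, so verify on your version). A sketch:

    # Default is 30000 ms; flag any I/O slower than 1 second instead
    echo 1000 > /sys/module/zfs/parameters/zio_slow_io_ms

    # Follow events as they arrive (-f behaves like tail -f) and look
    # for ereport.fs.zfs.delay entries naming the slow vdev path
    zpool events -f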
The next thing is, you also want to see whether you are hitting the write throttle. If the writes are coming in and you are not getting the performance, how do you look for write throttling? The stats to look at are the dmu_tx stats; it's a bunch of counters.
If you see the last two, the dirty delay and the dirty over max counters, getting incremented, that simply means that you're not able to drain the dirty data as fast as your incoming I/O. And if you see the initial ones, the top four, getting incremented, that simply means that you have memory pressure: your ARC is either shrinking or it's not growing, or there are issues with your memory, so you probably want to take a look at that part.
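These counters live in the dmu_tx kstat; a minimal way to watch the throttle indicators (counter names can differ slightly across versions):

    # All DMU transaction counters
    cat /proc/spl/kstat/zfs/dmu_tx

    # Watch just the write-throttle indicators once a second
    watch -n1 'grep -E "dirty_delay|dirty_over_max" /proc/spl/kstat/zfs/dmu_tx'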
Going one layer further down, we also want to see how the transaction groups are moving along, the most important piece being how they're getting synced to the disk. We do have stats for that: we can dump the txgs kstat and it will dump all the information about how much data we are syncing in each transaction group.
It shows how long the transaction group was open, how much time it took to get quiesced, how much time it waited to be synced, and the syncing time itself. As you can see, transaction group 43950 was waiting to be synced because the previous transaction group was taking that much longer, and hence this one was waiting. So this is a good way to figure out how your underlying infrastructure is behaving with respect to syncing.
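Those per-txg numbers come from a per-pool kstat; assuming a pool named tank:

    # One row per recent txg: dirty bytes plus otime/qtime/wtime/stime,
    # the open, quiesce, wait, and sync durations in nanoseconds
    cat /proc/spl/kstat/zfs/tank/txgs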
Again, the ARC is the heart of the system. We have arcstat, and we have the arcstats proc interface, which also gives a lot of information regarding whether reclaim is happening or not, or whether arc_no_grow is set, for example. All that information is useful.
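A sketch of both interfaces:

    # Raw ARC counters: size, targets (c, c_max, c_min), hit and miss
    # counts, memory_* reclaim counters, arc_no_grow, and more
    cat /proc/spl/kstat/zfs/arcstats

    # The arcstat utility that ships with OpenZFS summarizes the same
    # kstat at an interval, here every second
    arcstat 1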
And finally, we are at the pipeline stage and we want to look at the disk performance, how my disks are behaving. One of the easiest things you can do is just check whether any of your disks is in a degraded state, and then you can run zpool iostat to see how your I/Os are happening, what bandwidth you're getting from the pool, and so on. But you need more information than this; it's very high-level information.
What you really want to know is the time the I/Os spend in the queues versus on the disk, across different latency buckets. And last but not least, you also want to see whether these are small or large I/Os, because that again dictates the throughput.
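zpool iostat can surface all three views; a quick sketch:

    # Is anything unhealthy? -x prints only pools with problems
    zpool status -x

    # Per-vdev bandwidth and IOPS, refreshed every 5 seconds
    zpool iostat -v 5

    # -w: wait (latency) histograms, queue time versus disk time
    zpool iostat -w 5
    # -q: active and pending counts per I/O queue class
    zpool iostat -q 5
    # -r: request size histograms, which also reveal aggregation
    zpool iostat -r 5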
One thing I wanted to highlight is ZFS tuning: every system is different, workloads are different, and there are times when you want to tune something according to your workloads or systems. Here's a link for all the tunables that we have, and the good thing is that this is really very well documented; it's very easy to understand what a parameter means and what it can do for you. So this is a good place to start looking.
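On Linux, those tunables are exposed as zfs module parameters; a minimal sketch of reading and setting one (zfs_dirty_data_max is a real tunable, used here only as an example):

    # List every tunable the loaded module exposes
    ls /sys/module/zfs/parameters/

    # Read one
    cat /sys/module/zfs/parameters/zfs_dirty_data_max

    # Set it at runtime; persist it via /etc/modprobe.d/zfs.conf
    echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max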
Obviously, ZFS is running as a subcomponent in a bigger system where you have CPU, memory, disk, network, and so on, and any of these things can impact it. So it's important to look at the health of the whole system: how much CPU is being spent, whether the CPU is idle or saturated, whether you have memory available, and so on. I'm not going to go and talk about these; these are the basic commands that are available in Linux, vmstat and friends.
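For reference, the standard Linux health checks this slide refers to are along these lines:

    vmstat 1          # run queue, memory, swap, CPU summary
    mpstat -P ALL 1   # per-CPU utilization and saturation
    free -h           # memory available to applications
    iostat -x 1       # per-device utilization and latency
    sar -n DEV 1      # per-interface network throughput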
So let's talk about some of the external tools now. The way I'm going to talk about this is I'm going to take three examples that we really dug into in the past, and how we used different tools to help us figure out what's going on. There was no preference as such for why we chose one tool over another, because all the tools can do the same thing; it was just the choice of tool at the time.
One of the issues that we faced long ago was the ARC shrinking to c_min. Now we might say: okay, if the ARC is shrinking, that means you have memory pressure, right? But there was no load. We warmed up the ARC, we left the system as it is, we didn't do anything on the system, and still the ARC was moving back toward c_min. Then we said, okay, looking at other components, I don't see anything happening: there's no other memory-hungry application running which is consuming memory or leaking memory.
So then we said: okay, one of the things that I know is that there could be allocations with the reclaim flag that can also trigger reclamation. So we said, let's look at the allocations that are happening on the system at that time, and ftrace came to mind; it's the Linux in-kernel tracing tool. What we did was mount the debug filesystem and set a filter, and this filter says that for anything which has the reclaim flag set, and if the order is greater than four, meaning the allocation request is for more than 64 kilobytes, dump the entry into the log. Then we enabled this event, and the output we got was interesting.
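A minimal sketch of that ftrace setup; the mm_page_alloc tracepoint and its order field are standard, but the exact filter expression for the reclaim flag varies by kernel version, so treat this as an approximation:

    # Mount the debug filesystem if it isn't already mounted
    mount -t debugfs none /sys/kernel/debug
    cd /sys/kernel/debug/tracing

    # Log only page allocations of order > 4, i.e. larger than 64 KiB
    # with 4 KiB pages; add a gfp_flags clause for the reclaim flag
    # as appropriate for your kernel
    echo 'order > 4' > events/kmem/mm_page_alloc/filter

    # Enable the event and watch the entries stream in
    echo 1 > events/kmem/mm_page_alloc/enable
    cat trace_pipe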
We saw that there was a request of order nine, which boils down to 2 MB pages, and we could see the GFP flag GFP_TRANS_HUGE being set. Looking at the documentation for what this means, we figured out that it has to do with transparent huge pages, and the way we fixed this problem was to say: hey, we don't need this, let's just disable it.
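The exact commands aren't shown in the talk; on a typical Linux system, disabling transparent huge pages looks like this:

    # Current mode is shown in brackets: [always] madvise never
    cat /sys/kernel/mm/transparent_hugepage/enabled

    # Disable THP allocation and defrag at runtime
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag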
But I'll be honest: one of the things that intrigued us even at that time was the fact that we did see a lot of reclamation happening. In the arcstats we have the direct-reclaim and indirect-reclaim counters, memory_direct_count and memory_indirect_count, which tell how many times the kernel is trying to invoke the shrinker on the ARC as a result of the kswapd daemon or as a result of direct reclamation. And we did see what was mentioned in today's earlier presentation by George: there were so many counts, the counter was incremented thousands of times over a very short period of time. It always intrigued us, but we never invested the time to really dig into that until today.
I'm glad that we had that presentation from George, and I know that Matt has done a lot of work in this area, so thanks for that. At least now we know why; that fix is definitely required and it's a very good improvement.
[Moderator] I just wanted to let you know that you have a little bit over five minutes left.

Okay, thanks. So again, just a tip: balance between the ARC and the page cache. You can tune this parameter so that you don't unnecessarily shrink your ARC at the expense of the page cache.
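The audio doesn't name the tunable on this slide; one plausible candidate on Linux OpenZFS is zfs_arc_pc_percent, which ties ARC reclaim to the page cache size, but confirm the knob against the slide and your version before relying on it:

    # 0 (the default) disables the behavior; a non-zero percentage keeps
    # the ARC from collapsing under page cache scanning pressure
    cat /sys/module/zfs/parameters/zfs_arc_pc_percent
    echo 30 > /sys/module/zfs/parameters/zfs_arc_pc_percent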
We had another issue where we had similar testbeds but the performance was different. So we checked everything, and we said everything looks good from the backend side. So we said, let's start fresh from the top: what is ZFS seeing? We installed a basic SystemTap script and the results were interesting: we saw that one cluster was seeing all 1M writes versus the other one, which was seeing 64k.
It had to do with some client version. I'm not an expert in that, but at least by debugging on the ZFS side we were able to figure out what the issue was, and this is a sample output from the SystemTap script. So here we used SystemTap to our advantage.
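The script itself isn't reproduced in the talk; here is a minimal SystemTap sketch in the same spirit, histogramming the sizes reaching the ZFS write entry point. The function and argument names are assumptions that differ across OpenZFS versions, and debuginfo for the zfs module is required:

    # iosize.stp: histogram of write sizes as seen by ZFS
    global sizes

    probe module("zfs").function("zfs_write") {
        # uio_resid is the byte count of this write (version-dependent)
        sizes <<< $uio->uio_resid
    }

    probe timer.s(10) {
        print(@hist_log(sizes))
        delete sizes
    }

Run it with stap iosize.stp; two clusters showing 1M versus 64k buckets would stand out immediately in the histogram.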
I will skip through this slide and go to the next one. We were looking at one problem where we really wanted to look at what the different I/O sizes are and how frequently we're doing I/O on a disk. We could have used blktrace, but we decided on using eBPF. It was very useful and very easy to install; the BCC toolkit is a front end to eBPF, and there are already a lot of scripts there that can be used, and they're very helpful.
One of the scripts is biosnoop, and with it we just got the entire profile of which thread is doing what I/O, on what sector, what the size of the I/O is, and the latency that we're receiving from the block layer. So this is pretty cool.
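biosnoop ships with BCC and needs no arguments for the basic view:

    # One line per I/O: time, command, PID, disk, read/write, sector,
    # size in bytes, and block-layer latency in milliseconds
    /usr/share/bcc/tools/biosnoop
    # On some distributions the tools carry a -bpfcc suffix instead:
    # biosnoop-bpfcc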
So this brings us to the case study: how do we bring all these stats together? I think one of the goals was: how do we avoid using a lot of these external tools? We can definitely use external tools to figure out what's going on, but there are oftentimes cases where we don't have the leverage to use all these tools in a production environment on the customer side. So let's see how all these stats come into play.
So we had a problem where we were seeing not very good performance on NFS. It's a single-client, single-file write. As you can see here, we're writing a 50 gig file using fio, and the configuration is 12 cores, 32 gig of memory, and a 10 gig network. For us, the record size was 64k and compression is on. We disabled sync because we don't have slog devices, and for a very long time we were thinking that maybe the slog was the problem, maybe the sync I/Os were the problem, so we said: okay, let's disable it and see if we can get the throughput. And we have set the aggregation limit to one meg.
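A hedged reconstruction of that setup; the dataset name, fio job options, and tunable value are assumptions based on what the talk describes:

    # Dataset settings from the slide
    zfs set recordsize=64k tank/fs
    zfs set compression=on tank/fs
    zfs set sync=disabled tank/fs

    # vdev aggregation limit of 1 MiB
    echo 1048576 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit

    # Single-client, single-file 50 GiB sequential write
    fio --name=seqwrite --directory=/tank/fs --rw=write \
        --bs=64k --size=50g --ioengine=psync --numjobs=1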
This is the general I/O flow: we have an NFS client going over the network, the NFS server gets the request and sends it to ZFS, and ZFS writes to the disk. So let's isolate the problem. We said: okay, let's just short-circuit ZFS; we got one gig of throughput, so everything was good from the protocol side. So let's look at ZFS: we did a test directly on ZFS, and we again got one gig.
So everything looks okay from the ZFS side as well; so where is the problem? It's time to dig deeper now. Before we dug further into this, we wanted to set the stage. The question was: do we have enough metrics? So we added a lot of metrics, looking at some of the past problems that we have faced and things that we thought would make sense, like the workload pattern.
So then we looked at the ZFS write latency. It's just a profiling of the zfs_write function, which is the entry point for the write call, and we saw that most of the time the latencies were higher than 64 milliseconds.
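One way to get that kind of per-function latency histogram without adding custom metrics is BCC's funclatency; whether the zfs_write symbol is traceable under this name depends on your kernel and module build, so treat the target as an assumption:

    # Latency histogram of zfs_write in milliseconds, printed every 5 s
    /usr/share/bcc/tools/funclatency -m -i 5 zfs_write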
Now, that was intriguing, because we are writing to RAM, right? We write into the ARC, and I can do better than this even on a physical drive. So this was something intriguing, and the first thought that came to our mind was that it's probably because of the throttling.
So we looked at the throttling stats and we saw that we are indeed delaying. So we said: good, now we know why we are delaying the clients. But then we started looking a little deeper and we said: let's look into how much time we are delaying the client, and when we did the math, we were not able to account for the entire delay that the zfs_write call was showing.
So we said, let's look at the CPU and the ARC. Everything looked okay: the CPU was 50% idle, the ARC was completely full up to c_max, and everything looked okay from that angle. Then we looked at the pool's dirty data syncing, and we said: okay, we are able to sync almost two gig of data in four seconds, giving us a throughput of 500 meg. And then we looked at the I/O stats, because obviously, ultimately, you'll have to look at the disks.
Looking at how the disks were performing: the average was 10 milliseconds, which looked okay, not bad, and the queuing delay on the I/O side was very, very low. So it's kind of saying that the I/Os are not sitting in the queue at all; you just push into the queue and somebody pulls it out and throws it at the disk.
So it means that we are writing to the queue, but we don't have enough I/Os in the queue to aggregate, and that's why we are doing small I/Os. I will not go through this diagram, but it shows how the write pipeline works: you need to have enough I/Os to aggregate, and the zio write issue thread is the one that is pushing to the queue. So we needed to take a look at what this thread is doing.
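To see where the write-issue threads were blocked, sampling their kernel stacks is one approach; the z_wr_iss thread name below follows the standard ZFS taskq naming, but verify it on your system:

    # Snapshot the kernel stack of each write-issue thread right now
    for t in $(pgrep z_wr_iss); do
        echo "== thread $t =="
        cat /proc/$t/stack
    done

    # Or profile blocked (off-CPU) time for 30 s with BCC, then search
    # the output for the write-issue threads
    /usr/share/bcc/tools/offcputime -K 30 | grep -B 20 z_wr_iss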
We were able to see that most of the time they were contending for a lock, and when we looked into it, it was basically a dnode-level lock. Basically, we have a lot of readers who are taking the lock in reader mode, and there's a writer who is blocked because the readers are there, and it's waiting for the readers to give the lock away, and so on. Because of this contention, things were not falling into place.
Fortunately for us, this is already fixed, thanks to all the efforts by Paul; it was fixed as part of this commit. And why did we miss it? Because it was not part of the dot releases: it was not part of the 0.8 release, and it would have been nice if it had been included in 0.8.
So just a quick recap, a before-and-after comparison of the patch: once we took the patch, the performance got a boost from 600 to 1100, almost double, and we can also see the aggregation happening much more efficiently after the patch, because we have reduced the contention point; we are sending more I/Os to the queue and we are able to get the aggregation benefits.