From YouTube: Reconsidering tracing in Ceph - Mohamad Gebai
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Mohamad Gebai, SUSE Software Engineer
So this is what I'm going to be mostly talking about during this talk: how we can use tracing to do an analysis that allows us to make such a decision. I will then sneak in a small comment, a small analysis, about logging in general and in Ceph, and finally I'm going to do a small tracing demo, which I have already pre-recorded just in case.
So we have this hypothesis that changing the back end of bufferlist might improve performance. Bufferlist is simply a class that chains buffers together. It's a list of buffers, and it is currently implemented using a linked list from the standard library, which is just a bunch of nodes that are referenced by pointers; they're not contiguous in memory.
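As a rough sketch of the structure being described (the names here are illustrative, not Ceph's actual code):

    #include <list>

    // Illustrative simplification: a bufferlist chains buffers, and the
    // chain is a std::list, so the nodes are scattered across the heap.
    struct buffer_ref { char* data; unsigned len; };  // stand-in for a buffer handle

    class bufferlist_sketch {
      std::list<buffer_ref> _buffers;  // current backend: pointer-linked nodes
      // Hypothesis: a contiguous backend (std::vector, std::deque) might
      // be friendlier to the CPU caches.
    public:
      void push_front(const buffer_ref& b) { _buffers.insert(_buffers.begin(), b); }
      void push_back(const buffer_ref& b)  { _buffers.push_back(b); }
    };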
How do we measure the gain, if any? I mean, such a change might be really, really small and insignificant. Doing a rados bench or running fio might not really expose the benefit or the regression, or it might be covered by something else: as important as bufferlist is, a lot more goes into a workload than that. I think we can use tracing for that.
Ceph currently supports tracepoints, LTTng tracepoints. You can define a new tracepoint that way: here I have defined a tracepoint that is called bufferlist push_front. Quite easy. This is the current implementation of the push_front method of bufferlist: it simply calls an insert on the standard library container that is used. I have instrumented the code as such, just calling the tracepoint before and after the insert, and we'll see what happens between these two tracepoints.
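The shape of such a tracepoint definition and of the instrumented method, assuming made-up provider and event names (the usual TRACEPOINT_PROVIDER boilerplate of an LTTng-UST provider header is omitted):

    /* Tracepoint definition, in a tracepoint provider header. */
    TRACEPOINT_EVENT(
        bufferlist,                  /* provider (assumed name) */
        push_front_enter,            /* fired just before the insert */
        TP_ARGS(unsigned int, len),
        TP_FIELDS(ctf_integer(unsigned int, len, len))
    )

    /* A matching bufferlist:push_front_exit event is defined the same way,
     * and the method is instrumented around the container call: */
    void buffer::list::push_front(ptr& bp) {
      tracepoint(bufferlist, push_front_enter, bp.length());
      _buffers.insert(_buffers.begin(), bp);
      tracepoint(bufferlist, push_front_exit, bp.length());
    }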
The cool thing about LTTng is that it allows you to add context to events: every time an event is recorded at runtime, you can have that event carry extra information about the context of your system. In this analysis, I have added cache misses, L1 data cache load misses, and TLB misses. The TLB is simply a cache structure for resolving memory addresses from virtual memory to physical memory, and a TLB miss might be very expensive.
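Those counters are attached to the events when setting up the tracing session; with the lttng CLI it looks roughly like this (the exact perf counter names depend on the CPU and the LTTng version, and the event name matches the provider assumed above):

    lttng create bufferlist-bench
    lttng enable-event --userspace 'bufferlist:*'
    lttng add-context --userspace --type=perf:thread:cache-misses
    lttng add-context --userspace --type=perf:thread:L1-dcache-load-misses
    lttng add-context --userspace --type=perf:thread:dTLB-load-misses
    lttng add-context --userspace --type=perf:thread:instructions
    lttng start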
We are doing low-level profiling here: on x86-64, that's up to four additional accesses to memory for one TLB miss. I'm also looking at the number of instructions that it takes to do a push_front or a push_back. The number of instructions in itself is not really relevant, it's not really important, because it can change from one compiler version to another.
The approach I took for measuring the impact of changing bufferlist from a list to a vector is the following: a micro-benchmark that creates a thousand bufferlists, and in each bufferlist I append a thousand buffers. So we have a thousand bufferlists, and each bufferlist has a thousand buffers. I have an L1 cache size of 32K and an L2 cache of 256K, and during the micro-benchmark I elected to disable C-states, turbo boost, hyper-threading, and whatever might come in and interfere with the frequency of the CPU.
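The benchmark itself can be as simple as this sketch (reusing the illustrative bufferlist_sketch type from above; the CPU frequency settings are done outside the program):

    #include <string>
    #include <vector>

    // A thousand bufferlists, a thousand buffers appended to each, so the
    // instrumented push_back fires a million times per run.
    int main() {
      constexpr int kLists = 1000, kAppends = 1000;
      static std::string payload(64, 'x');  // buffer size; varied in later runs
      std::vector<bufferlist_sketch> lists(kLists);
      for (auto& bl : lists)
        for (int i = 0; i < kAppends; ++i)
          bl.push_back({&payload[0], (unsigned)payload.size()});
      return 0;
    }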
When you're done, this is what you get: hundreds of thousands of these lines, which don't look really interesting, and they look pretty boring to be honest. But if we dissect one, we see that it's actually easy: you have an event, each event has a timestamp, you have where that event was recorded, and then you have your context: the number of cache misses, instructions, L1 data cache misses and TLB misses.
This is the average latency of push_back for different data structures. Vector is in red, list is in green, and the double-ended queue is in purple. But the averages aren't really interesting; instead, let's look at individual push_backs. In all of these graphs, the y axis is always the latency, so it's really how much time it takes to do one push_back. The x axis is, in the upper left, the number of instructions; in the upper right, L1 cache misses, which is what we're interested in; the bottom left is the number of TLB load misses; and the bottom right is the number of branch misses.
If you look at the number of instructions for a list, all of the green dots have the exact same number of instructions every time, because the list always works the same way: you create a node, then you update the references to it, and it really is the exact same number. It's really interesting, because I, at least, always had the impression that I can't really see what's happening in the CPU, because it's so obscure and there are so many random optimizations. But the number of instructions is predictable.
You can actually predict how many instructions the next call will take. For the green dots, as I said, there's only one code path that is taken. For the purple ones there are two, which means that a push_back using the double-ended queue either takes three thousand instructions or it takes four thousand instructions, with the bulk being on the three thousand instructions, which is the fast path.
The fast path means there's enough space in the current memory block to append a new buffer; and sometimes, when there isn't, a slow path has to be taken, which is the four thousand instruction path. For the vector implementation, you have many code paths that are taken. You have one that is prominent, at a little over two thousand instructions, which is again the fast path, meaning there's enough memory, there's enough space to add a new element; but there are many other code paths that might be taken.
At first I wondered how that's possible: I know that there are only two code paths possible, either the fast path or the slow path. But if you think about it, you have to move elements when you're crossing boundaries, and the more elements you have, the more instructions have to be executed.
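You can watch those boundary crossings happen with a stock std::vector in a tiny standalone demo (not from the talk):

    #include <cstdio>
    #include <vector>

    // Appends are cheap until capacity runs out; then one reallocation
    // moves every element already stored, so the slow path gets more
    // expensive as the container grows.
    int main() {
      std::vector<int> v;
      auto cap = v.capacity();
      for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != cap) {  // the slow path was just taken
          cap = v.capacity();
          std::printf("realloc at size %zu, new capacity %zu\n", v.size(), cap);
        }
      }
      return 0;
    }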
So it's pretty straightforward, and you can confirm that by looking at the number of branch misses: you have either zero branch misses or you have one. You have zero branch misses when you have enough memory to append a new element, which is what the CPU expects because it's the hot path; but sometimes you have one branch miss, when there isn't enough memory left.
The upper right graph is the number of L1 misses, and again, we want to be to the left and down: left meaning few cache misses, and down meaning low latency. We have pretty much that for everyone, but for the list, for the green dots, we have plenty of outliers with many L1 misses.
The number of TLB misses doesn't really seem to have much impact. I mean, the more TLB misses you have, the higher your lower bound will be, but it's not really significant, so we'll just ignore it for now. The next step was to make the buffer size vary: I'm still appending a thousand buffers to a bufferlist, but I'm making the buffer size larger; let's see what happens as I increase it.
What else can we look at? I showed only four PMUs, performance monitoring units; there are many more that you can look at. A cool one is looking at the CPU cycles, because if you have the instructions and you have the CPU cycles, you can calculate the IPC, and you want that to be close to 1. You can look at the branch prediction misses, which I already talked about, and at page faults. In the micro-benchmark, page faults aren't really interesting, because nothing really happens.
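Computing the IPC is then just a ratio of two context deltas; a minimal sketch, assuming cycle and instruction counters were attached as contexts:

    #include <cstdint>

    // IPC over the interval between two events, from the instruction and
    // cycle counts carried in their perf context fields. Close to 1 keeps
    // the pipeline busy; well below 1 suggests stalls such as cache misses.
    double ipc(std::uint64_t instr_begin, std::uint64_t instr_end,
               std::uint64_t cyc_begin,  std::uint64_t cyc_end) {
      const std::uint64_t cycles = cyc_end - cyc_begin;
      return cycles ? static_cast<double>(instr_end - instr_begin) / cycles : 0.0;
    }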
The next step would be to be able to really run Ceph itself with different backends for bufferlist. What I did for this was take code from Adam Emerson's and Jesse Williamson's repo, where they have started to do this. Ceph crashes if you start Ceph with these changes, but you can run the unit tests, and this is what I used for this experiment. From these runs, the double-ended queue seems to really work best, but to really get to a conclusion we need to see how real Ceph uses bufferlists.
A small word on logging. Currently, this is how we do logging in Ceph: we use the dout function, we give it the severity and the string, and you get logs. There have also been talks about making this a little bit more efficient. So what I did was change this for an LTTng tracepoint, and by doing that we take out all the locking, because LTTng is a lockless tracer.
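Roughly, the change looks like this (dout/dendl are Ceph's real logging macros; the tracepoint provider and event names here are made up):

    // Today: severity plus a stream; the string is built and a lock is
    // taken on the hot path.
    dout(20) << "got op " << *op << dendl;

    // The experiment (sketch): emit an LTTng-UST event instead; no lock
    // is taken, and formatting/I/O are deferred to trace collection.
    tracepoint(ceph_logging, log_line, 20 /* severity */, "got op");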
It uses per-CPU buffers, so there's no locking, and it's pretty efficient. And I didn't see any difference; I don't see any difference between LTTng and dout, our logging in Ceph, which uses locking, and which, I think, uses a linked list to keep track of the logs and then flushes them at a later time. I didn't see any improvements, which means that the bottleneck here is the string manipulation. That is just what I think it is; I wasn't able to really go a lot further than that, but these are the early results.
You can also instrument functions if you want, and all you need to do for that is recompile the code with GCC's -finstrument-functions flag, which Ceph supports. So for this, the next demo, I recompiled Ceph with basically just GCC's flag and nothing else, and I traced the kernel and the function entries and exits.
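What the flag does, concretely: GCC emits a call to two well-known hooks at every function entry and exit, and lttng-ust ships a helper library implementing them:

    /* Hooks inserted by -finstrument-functions around every function.
     * liblttng-ust-cyg-profile implements them and records
     * lttng_ust_cyg_profile:func_entry / func_exit events. */
    void __cyg_profile_func_enter(void *this_fn, void *call_site);
    void __cyg_profile_func_exit(void *this_fn, void *call_site);

The helper is preloaded into the instrumented binary at runtime, along the lines of LD_PRELOAD=liblttng-ust-cyg-profile.so.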
This is Trace Compass. You have three views. The upper one is the control flow view: it shows the state of all of your processes on the system through time. Green is running in user space, yellow is blocked, orange is preempted. So you see swapper is preempted often, which is good, because swapper is the idle task on Linux, so we want it to be preempted whenever we can. It's really a list of all your processes; this is a vstart cluster, so you see all the Ceph threads as well.
It does an open, a write and a close back to back, so we can kind of guess what's happening here. If we look at the open system call, it's opening /proc/self/comm, which holds the name of the process. It does the open system call, it gets a file descriptor in return, number eight; it then does a write on fd number eight, and then closes the file descriptor, right? This might not be an issue in itself; it might be exactly what it's supposed to do.
But there might be something to be done here on the critical path, the critical path of the rados bench, because it goes through all of these kworker threads, and the orange depicts the preempted state; in other terms, the time between when the process is created and when it is scheduled on a CPU. You can see it's greater than the amount of time it actually runs, so there's a lot of wasted time here. We might be able to get away without creating all of these kworker threads; I haven't done a lot of analysis on this.
It's really cool, because sometimes you'll see loops repeating based on the colors, and sometimes inefficiencies really jump out when you're looking at this, sometimes not. You really need to know the code and know how things work to really make sense of it, but it can be quite powerful, and obviously all three views are synchronized, so you can see that nothing is happening here for the time it was preempted, which makes sense.
The cool thing with LTTng traces is that there's also a library, libbabeltrace, if you want to do automated analysis. So, for instance, we could easily write a script to compute how much time is spent in a specific function or a specific thread. I'm going to cut this short. This presentation is done thanks to Jesse Williamson and Adam Emerson, for unknowingly providing code for this demo.
I also want to thank the Ceph core team, which has been really inclusive. As a developer, you see it: even when you're a newcomer, a newbie, you really feel the openness, which has to be mentioned; I think it's really great, and it shows it's clearly an active effort. I also want to thank Leo for his help this week; I have heckled him with my presentation.