From YouTube: Let’s Turn Tracing on its Head
Description
Tracing has been around for a while and has proven its value for debugging performance issues and common day-to-day scenarios. Users commonly query for particular traces and analyze a waterfall view of a single trace. As with logs, though, the value comes not from a single event but from the aggregation of many events, and from breaking those aggregates down over different dimensions or tags. In the case of traces, it is possible to construct rich topologies and navigate the aggregate data from those topologies. In this Kong Summit 2019 talk, Omnition software engineer Constance Caramanolis introduces a different approach to consuming distributed trace data and the full value that can be realized from it.
So that is me. I work at Omnition; I'm a software engineer working on tracing and OpenTelemetry, which we'll get to a little later. But what really got me passionate about observability was my time at Lyft. I am one of the original maintainers of and contributors to Envoy, alongside Matt, and the benefit of that is that it gave me a lot of experience operating microservices and teaching other engineers how to operate them. I was on call for Envoy and other services at Lyft, and on-call sucks.
Microservices are great, right? They allow us to deploy code independently, among other things. So instead of having one deploy schedule, maybe once a week, you can deploy your own component whenever you feel like it, and you can use whatever technology you want. There's a lot of flexibility, and it's really powerful for engineers and applications to develop quickly. The downside is that things start looking like this.
I wish I could use yesterday's slide from the keynote, the really nice graphic that looks like stars all connected. It does get that messy really quickly, and so, as great as it is to develop on this kind of platform, it gets really hard to actually debug. So the three main points we're trying to solve with tracing are the following.
How do you actually know, if you have a metric saying "I have this error," that whatever example you pull up, the log or the trace, is actually correlated to it? Metrics are great, but they don't give you the context of who to wake up. So say you're all using Envoy now: you at least have consistent metrics, but then errors propagate across the system. So how do you know the difference between a service propagating an error and the service that actually originated it? And then, what is the customer impact?
How do you actually determine if the change you want to make at two o'clock in the morning is worth it? Do you make the risky change if it's for something that's used maybe once every three days, or do you risk making a change that will fix a bug right away that's impacting 90 percent of your customers? Not a lot of observability tooling right now will tell you this, unless you've wired it up manually yourself.
So right now most of you are probably using metrics and logs. They're great; they've gotten us this far. But as I was saying, they're missing context. A metric gives you one data point in time, and the same goes for logs. Logs can be very rich for that one instance of a service, but then how do you connect them across multiple services in the microservice world?
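To illustrate what does connect events across services, here is a minimal sketch of trace-context propagation via the W3C `traceparent` header. It uses plain Python and made-up service code, not any specific tracing library.

```python
# Minimal sketch (no real tracing library): a trace id carried in a W3C
# `traceparent` header is what lets two services' events be joined later.
# Header format: version-traceid-spanid-flags, e.g. 00-<32 hex>-<16 hex>-01.
import secrets

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    _version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": span_id, "sampled": flags == "01"}

# Service A starts a trace and calls service B with the header attached.
trace_id = secrets.token_hex(16)   # 32 hex chars
span_a = secrets.token_hex(8)      # 16 hex chars
outgoing_headers = {"traceparent": make_traceparent(trace_id, span_a)}

# Service B extracts the context; its own spans and logs now share the same
# trace_id, which is what a tracing backend uses to stitch both services'
# events into one trace.
ctx = parse_traceparent(outgoing_headers["traceparent"])
print(ctx["trace_id"] == trace_id)  # True
```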
Well, this is how we're doing tracing now. One is that we pre-aggregate the data. What this really means is that we generate some metrics, some SLIs, off of the traces. But what ends up happening is that once you've created this data, and the traces themselves usually tend to be sampled, once you get rid of most of the traces that actually produced it, you can't pivot on the data in any different way. You're really stuck with one way of measuring your traces.
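A minimal sketch of that pre-aggregate-then-sample pattern, assuming hypothetical span dictionaries rather than any particular vendor's pipeline:

```python
# Sketch of the pre-aggregate-then-sample pattern described above, using
# hypothetical span dicts rather than any particular vendor's pipeline.
from collections import Counter

spans = [
    {"service": "checkout", "error": True,  "tags": {"region": "us-west-1"}},
    {"service": "checkout", "error": False, "tags": {"region": "us-east-1"}},
    {"service": "currency", "error": True,  "tags": {"region": "us-west-1"}},
]

# At ingest we keep only a per-service error counter (the SLI) ...
errors_by_service = Counter(s["service"] for s in spans if s["error"])

# ... and then drop most of the raw traces to save cost.
spans = spans[:1]  # stand-in for aggressive sampling

# The counter still answers "errors per service", but a new question such as
# "errors per region" can no longer be answered: that dimension was never
# aggregated, and the spans that carried it are gone.
print(errors_by_service)
```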
A
Most
people
who've
ever
interacted
with
traces.
Actually,
probably
maybe
most
you
haven't
right.
Cuz
tracing
is
usually
viewed
as
you
look
at
one
example,
so
you
always
look
at.
Like
oh
I
know,
I
have
an
arrow
the
service,
maybe
I'll,
pull
up
a
trace
today,
instead
of
a
log,
and
so
you
see
in
terms
of
what
the
entire
call
stack
is
for
the
service
or
through
the
application.
But that only gives you a lot of data points; it doesn't really help you figure out where to look. And the next thing is sampling. With tracing we usually keep maybe a fraction of a percent, one percent, two percent, and if you use traces the way you use logs, usually the one you want is the one you didn't capture. So there's a better way; I'm going to call it the Omnition way of doing it, and it addresses all three of the points I made.
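For reference, this is roughly what the conventional head-based sampling being critiqued looks like. The sketch assumes the OpenTelemetry Python SDK, which postdates this talk; the class names below come from that SDK, not from the talk itself.

```python
# Roughly what conventional head-based sampling looks like: keep ~1% of
# traces, decided up front. Assumes the (post-2019) OpenTelemetry Python SDK;
# the class names are from that SDK, not from the talk.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))  # ~1%
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("GetConversion"):
    # 99% of the time this span is non-recording and is never exported, and
    # the trace you end up wanting is usually one of those.
    pass
```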
A
There
is
a
much
better
way
to
do
tracing
so
one
instead
of
sampling.
There
is
actual
way
for
you
to
capture
all
your
traces
and
store
them
right.
No
longer
do
you
actually
have
to
generate
them
and
choose
which
ones
to
drop
most
of
the
actual
burden
and
the
cost
comes
from
creating
it
on
the
application
side,
the
cause
in
terms
of
storage
and
transferring
it
is
actually
not
as
bad
as
people
think
it
is.
A
Instead
of
doing
the
pre
aggregation,
if
you
determine
later
on
in
terms
of
what
you
how
you
want
to
group
the
data,
now,
you
actually
don't
have
such
rigid
things
in
terms
of
like
oh
I,
looked
at
only
these
type
of
aggregated
data,
but
I
won't
look
in
a
different
way.
Since
you
have
all
your
traces,
you
can
actually
go
back
and
look
at
everything
on
a
different
pivot.
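A minimal sketch of that "keep everything, aggregate later" idea, again with hypothetical span dictionaries; a real backend would run this as a query over stored traces:

```python
# Sketch of "keep everything, aggregate later": because the raw spans are
# retained, the same data can be re-grouped on any tag after the fact.
# Hypothetical span dicts; a real backend would run this as a query.
from collections import defaultdict

def error_rate_by(spans, tag_key):
    totals, errors = defaultdict(int), defaultdict(int)
    for s in spans:
        key = s["tags"].get(tag_key, "unknown")
        totals[key] += 1
        errors[key] += s["error"]
    return {k: errors[k] / totals[k] for k in totals}

spans = [
    {"service": "currency", "error": True,  "tags": {"env": "prod", "region": "us-west-1"}},
    {"service": "currency", "error": False, "tags": {"env": "prod", "region": "us-east-1"}},
    {"service": "currency", "error": True,  "tags": {"env": "dev",  "region": "us-west-1"}},
]

# The pivot is chosen at question time, not at ingest time.
print(error_rate_by(spans, "env"))
print(error_rate_by(spans, "region"))
```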
And you're no longer looking at it as a funnel. If you have all of your traces for an entire application, you can visually represent what a service looks like, what the application looks like in terms of service dependencies. No longer do you have to reason from metrics like "this is the upstream request to that service, which calls another one, which calls another one." You can see things really nicely. So what do I mean?
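Here is a small sketch of how a service-dependency view can be derived purely from trace data: every parent-to-child span relationship that crosses a service boundary becomes an edge. The span layout is hypothetical.

```python
# Sketch of deriving a service-dependency graph from trace data alone: every
# parent-to-child span relationship that crosses a service boundary is an
# edge. The span layout is hypothetical.
spans = {
    "1": {"service": "frontend", "parent": None},
    "2": {"service": "checkout", "parent": "1"},
    "3": {"service": "currency", "parent": "2"},
    "4": {"service": "currency", "parent": "2"},
}

edges = set()
for span in spans.values():
    parent = spans.get(span["parent"])
    if parent and parent["service"] != span["service"]:
        edges.add((parent["service"], span["service"]))

# {('frontend', 'checkout'), ('checkout', 'currency')}: an always-up-to-date
# dependency view, with no hand-maintained list of who calls whom.
print(edges)
```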
This is a happy application. This is a visual representation of a demo application, but it is an actual visual representation of a live application, built using tracing data. There's no assembling of metrics involved, like, say, knowing what the upstream metrics are because we're using Envoy.
Once you look at a trace, you know who you're talking to, and you can build this from that. So now, whenever you're onboarding new people, you don't have to say, "hey, last week we called service ABC, make sure that API hasn't changed." You can just look at this and say: okay, well, now I'm calling service B, I'm calling checkout, the currency service, and this view stays up to date.
So whenever applications change, and microservice dependencies do change, because as soon as you get into the microservice world it's really fun to change things and change where the APIs are pointing, whatever static list you maintain will be out of date within days or weeks. So, kind of like what pagers like to do, errors are going to start happening, right? We have all these services that like to page themselves, and in a legacy system, or what I'm going to call a legacy way of doing observability, here is what happens.
These errors right now are paged based off of the success rate. That's what the red boxes here are, and this is what an application, sorry, not the application, the services, are seeing in terms of errors: upstream requests returning 5xx. The thing about that is that we actually don't know which service is causing it, so you will be paging all four of these service teams to investigate. Now, before we get to how to make that better, let's actually look at a trace. This is a typical view of a trace.
It goes across all of the services involved, and what's really unique about a trace is that you can actually pinpoint where an error started. So instead of saying that all of these services shaded red here are paging, since you know that at the bottom, in the currency service, GetConversion is the first one to actually throw an error, you can differentiate between the root cause and the services that are merely propagating errors. So when it comes to paging, you can say: I'm only going to page the currency service and let everyone else sleep.
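A sketch of that root-cause rule: within a trace, a span that errored while none of its children errored is treated as the origin, and erroring ancestors are treated as propagation. The trace below is hypothetical, loosely modeled on the demo she shows.

```python
# Sketch of the root-cause rule described above: within one trace, a span
# that errored while none of its children errored is treated as the origin;
# erroring ancestors are merely propagating. Hypothetical spans.
spans = [
    {"id": "a", "parent": None, "service": "frontend", "error": True},
    {"id": "b", "parent": "a",  "service": "checkout", "error": True},
    {"id": "c", "parent": "b",  "service": "currency", "error": True},   # GetConversion
    {"id": "d", "parent": "b",  "service": "shipping", "error": False},
]

children = {s["id"]: [] for s in spans}
for s in spans:
    if s["parent"]:
        children[s["parent"]].append(s)

root_causes = {
    s["service"]
    for s in spans
    if s["error"] and not any(c["error"] for c in children[s["id"]])
}
propagating = {s["service"] for s in spans if s["error"]} - root_causes

print(root_causes)  # {'currency'}: page this team
print(propagating)  # {'frontend', 'checkout'}: let them sleep
```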
So let's look at that again. Here's where we actually calculate the root cause, the currency service causing the errors, so we're going to stop paging on errors and start paging on root cause. This actually helps us identify who to wake up in the middle of the night, and more generally who to page. This is something that is unique to tracing; metrics, unfortunately, don't give you the context you need to determine this, unless maybe you added some headers and did some special processing on top of them, which is kind of ugly.
So now that we know that the currency service is paging, we're going to debug it. What could be causing the issue? Whenever I'm debugging, I usually try to come up with a hypothesis. Right now I see that the GetConversion path is returning errors. My first guess might be: maybe someone messed up that API. So, all right.
Let's run with that guess: I'm going to look for another instance of the GetConversion API and see if there were any errors. Well, this one, unfortunately, doesn't have any errors, and so, as Obi-Wan likes to say, these are not the droids you're looking for. But this is a very common pattern in debugging: we're usually pulling at threads, trying to figure out what the error could be. You look at logs and you look at all the different parameters: version, you know, version, input.
So, as we were trying to come up with a hypothesis (ignore the error text in the middle; it's supposed to be more to my right, but apparently PowerPoint decided to change that), what we're going to do now is this: since you have full fidelity, all of the traces, when you're breaking things down you can break down any service, or any part of your application, on these tags. So what I did here is try to find out: maybe it's an environment.
Maybe someone released something in development or staging or whatever, and it's something particular to that environment that's causing issues. But when I break down the currency service by environment, I see solid red everywhere. Here, solid red is bad and shaded red is not so bad; we'll get to that in a bit, but it's very much a visual cue for "do I know whether this is going wrong or not?" So we know that all of these environments are experiencing errors, which means this is not what I want to look into.
So, next one: what about region, right? That's another thing we can usually look into. Okay, so I know that us-east-1 has no issues, so I'm going to eliminate that from where I'm trying to debug. us-west-1 has some issues. The next thing I can do is break it down further within there. What I'm doing here is breaking down by instance, and I see that it's one particular instance that is returning all the errors, originating all the errors.
What's really cool about this is that you're just breaking things down. This is obviously a very static screenshot, but you're just looking at the service and at the different tags you've included on the trace, and these are just three or four clicks. Compare that to before: at least when I've done debugging using logs, I'd come up with a hypothesis, write a query for it, and then see what the results are. And these are just arbitrary tags; there's nothing special about instance or region.
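Reusing the hypothetical error_rate_by helper and span list from the earlier sketch, each of those "clicks" is just a different tag key; any tag you put on your spans (instance, user, version) can be broken down the same way.

```python
# Each breakdown is the same query with a different tag key; a tag a span
# doesn't carry simply falls into an "unknown" bucket.
for tag_key in ("env", "region", "instance"):
    print(tag_key, error_rate_by(spans, tag_key))
```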
They're just what's very common for most of us, what we think of as the most common culprits, so you could just as well break down by input, by user, anything like that. So now that we know how to find the root cause, let's talk about propagating this across the system. We know here that the currency service was the root cause of the errors, right?
So we know that. Now, the rest of the application graph you see here is actually scoped to anything that has a dependency on the currency service, or anything that interacts with it upstream or downstream. We want to find out: do the other services have any impact? So the checkout service is next, and what we actually see here is that it has no root cause. From this we're able to say that the checkout service is actually propagating errors across the system, rather than being the one causing them.
So once again, we don't need to wake these people up. Now, what about the front end, the most important one (depending on who you are, but for most people)? Is it originating errors? No, it isn't. And so with tracing, drilling in again, tracing can actually tell you the difference. The front end will be paging because it is seeing a lot of errors, but you don't actually need to page the front-end team anymore; same thing with the checkout service.
Now, more importantly: customer impact. How do you actually know if what you're doing, or what's broken, impacts anyone? With tracing, what you can do is this: we have this concept called workflows, where you can capture sets of transactions and turn them into a business workflow. So for Lyft it was requesting a ride, you know, payments.
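A sketch of the workflow idea: tag each trace with the business flow it belongs to (request a ride, checkout, payments) and report impact per workflow rather than per service. The trace records here are hypothetical.

```python
# Sketch of the workflow idea: tag whole traces with the business flow they
# belong to, then report impact per workflow rather than per service.
# Hypothetical trace records.
from collections import defaultdict

traces = [
    {"workflow": "checkout", "has_error": True},
    {"workflow": "checkout", "has_error": False},
    {"workflow": "browse",   "has_error": False},
    {"workflow": "payments", "has_error": True},
]

totals, failed = defaultdict(int), defaultdict(int)
for t in traces:
    totals[t["workflow"]] += 1
    failed[t["workflow"]] += t["has_error"]

for wf in totals:
    # "X% of the requests going through this workflow are hitting errors."
    print(wf, f"{100 * failed[wf] / totals[wf]:.0f}% failing")
```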
Then you can say that the 20% of your customers who are using this path are actually experiencing errors. Why this matters is that, at least in my experience, and from getting to talk to a lot of companies through my time with Envoy, it's always really hard to connect what a service four hops down does, and whether its bugs are actually impacting your customers. Tracing can help you do this.
"Can you update the library, please? I know this PR has been open for three weeks, but please just change this library version," or, you know, "replace this one thing here": that alone doesn't make things a lot better. What's also really cool about this is that it is actually being done by really big companies.
B: So you said we can ingest all the traces, but let's say I'm doing a million RPS, or 10 million RPS, for a service. The amount of trace data you get is huge, and network cost in the cloud is non-trivial; it adds up fairly quickly. We could have, like, a petabyte of traces a day very easily, yeah.
A: Billions of requests per second and all that, sure, but a lot of people actually aren't at that size. And part of it is, you know, especially if you do get to maybe a million spans per second or something like that, then depending on what your costs are, maybe you don't save all of them, but don't do one percent either; find something that's a little better. So it is definitely partly to challenge the idea that you need to sample right away; it depends on what your constraints are. I can't share the math on what it is, and I know I can't give you the whiteboard proof to show you, but it is actually not as bad as people think it is. When you play around with it, if you, say, try 10%, you'll actually be kind of surprised. Okay, follow-up.
B: So you said you can do the root cause analysis, but which microservice is actually causing it? What happens sometimes is that even unrelated microservices start throwing errors because one of them is throwing errors, right, and you can't really trace that with tracing. So even if, suppose, service A is throwing errors, and it does not even talk to service B, because service B is throwing errors, service A sometimes starts throwing errors too; so that correlation sometimes gets hard to, yeah.
A: I guess it depends on how strict you are; I feel that's a little bit about people relaxing the rules. I will say that I'm very traditional in my approach to forwarding errors. So if you see one, any form of error that you deem an error, 503s or whatever, don't rewrite it to a 200. Once you see a 503, you can either have retry logic for handling and fixing it, which might be a bit more of a corner case, but at least in my experience, most people, once they have an error, tend to just flush it through. It is something we can definitely adjust for and talk about, but it's mostly the common case, yeah. It also depends how strict you are: if you're bending the rules, don't do that; propagate errors the right way.
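As a sketch of that "don't rewrite a 503 to a 200" stance, here is how the failure could be recorded on the span so that root-cause analysis can see it, using the OpenTelemetry Python API (which postdates this talk); the handler and upstream call are made up.

```python
# Sketch of the "don't rewrite a 503 to a 200" stance, using the (post-2019)
# OpenTelemetry Python API; the handler and upstream call are made up.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("demo")

def call_upstream():
    return 503  # stand-in for a failing dependency

with tracer.start_as_current_span("GetConversion") as span:
    status_code = call_upstream()
    span.set_attribute("http.status_code", status_code)
    if status_code >= 500:
        # Record the failure on the span so root-cause analysis can see it,
        # and surface the same status to the caller (or retry) instead of
        # swallowing it.
        span.set_status(Status(StatusCode.ERROR, f"upstream returned {status_code}"))
```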
A: I think that's more dependent on how you want to use your traces. Oh yeah, the question, right: someone was asking me what my opinions are on baggage. I don't know if that's the right term, but some other tracing standards have provided the ability to add extra information, metadata or what have you, to your traces. I think it really depends on how you use it. From what I've seen, it hasn't been necessary, but then I've only seen the Lyft environment and a few other customer environments, and I haven't seen that need there. That doesn't mean I'm right; it just means that's been my experience.