From YouTube: OpenTracing Monthly Call - 2018-08-03
E
Hi, so yeah, I'm Jonathan. I used to work on the Canopy team at Facebook. I recently moved to another team, and Michael recently joined the team as well, so we're using this also as an opportunity to kind of move me off of the advisory board and move Michael onto the advisory board. And so you have Michael. Hi.
C
I'm Michael. I think I'm unmuted, yeah. So I'm Michael, I just recently joined Facebook and the Canopy team. Before that I was actually at Comcast for a while, and one of the things I worked on there was actually our sort of internal tracing system as well, which was the sort of X-Trace-ish, you know, Dapper-style one, so I've kind of...
E
Yes, okay, so yeah, we're going to be talking about Canopy, which is Facebook's distributed tracing and analysis system. This is kind of an amalgamation of a couple of talks: we published a paper at SOSP 2017 last year, two of our engineers talked at QCon in New York, and we're focusing this mostly on the instrumentation and representation side of Canopy.
E
We have instrumentation available in a number of languages: C, C++, Python, Java, and, because it's Facebook, PHP; other languages are sort of supported through C or C++ bindings. Our instrumentation is integrated into both our common RPC stack that's shared across all services and, deeply, into our www stack, so the overall page load process both on the client and server, as well as some other common pieces of infrastructure, and then we're also able to ingest data from other sources. So we have tracing in our mobile applications.
E
It also combines an extraction and processing framework, so given a trace that we receive from some source, we're able to run custom user code to extract trace patterns and information from it and write them to datasets that we can then do aggregate analysis on. And then there is a separate team that works on performance visualizations, and they work on both single-trace and aggregate visualizations for these traces.
E
So this is what our overall model looks like. We have sort of five basic objects in our trace. We have the overall trace that encompasses everything. The trace is broken up into a number of execution units in an explicit manner, and an execution unit represents a sequence of trace data that comes from a single clock. In practice, this usually represents either a single host or a single thread within that host, but it can be used for modeling other primitives as well. Within an execution unit, the unit contains a number of blocks.
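To make the object model concrete, here is a minimal sketch of those five objects as plain Python dataclasses. The class names, fields, and string-typed ids are illustrative assumptions for exposition, not Canopy's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Point:
    """A single timestamped occurrence inside a block."""
    point_id: str
    timestamp_us: int
    label: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    """An explicit, typed causal link between two points."""
    src_point: str           # id of the source point
    dst_point: str           # id of the destination point
    edge_type: str           # e.g. "rpc", "function", "continuation"
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Block:
    """A segment of work inside one execution unit."""
    block_id: str
    points: List[Point] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class ExecutionUnit:
    """Trace data sharing a single clock: typically one host or one thread."""
    unit_id: str
    blocks: List[Block] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Trace:
    """The overall trace that encompasses everything."""
    trace_id: str
    units: List[ExecutionUnit] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)
```

Blocks and points carry the data; edges carry causality between points, which is what the rest of the discussion builds on.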
E
We have a simple case with four events; this is sort of the classic RPC call-and-response set of events. These sort of break down as: we have a call to, and a receive on, some back-end service; that back-end service sends some response; and then the parent service records a complete event when it receives the response from the RPC call. We then take these events and interpret them into the model.
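A minimal sketch of how instrumentation might emit those four events, with the model built later by interpreting them; the emit function, the event field names, and the way context is passed to the callee are assumptions for illustration, not Facebook's actual API.

```python
import time
import uuid


def emit(event):
    """Stand-in for serializing an event and flushing it to the trace backend."""
    print(event)


def record_rpc_call(trace_id, service_name, do_rpc):
    """Client side: a CALL event before the RPC and a COMPLETE event after it."""
    rpc_id = uuid.uuid4().hex
    emit({"trace_id": trace_id, "rpc_id": rpc_id, "type": "CALL",
          "service": service_name, "ts_us": int(time.time() * 1e6)})
    # The (trace_id, rpc_id) context travels with the request to the callee.
    result = do_rpc(trace_id, rpc_id)
    emit({"trace_id": trace_id, "rpc_id": rpc_id, "type": "COMPLETE",
          "service": service_name, "ts_us": int(time.time() * 1e6)})
    return result


def handle_rpc(trace_id, rpc_id, do_work):
    """Server side: a RECEIVE event on entry and a RESPONSE event before replying."""
    emit({"trace_id": trace_id, "rpc_id": rpc_id, "type": "RECEIVE",
          "ts_us": int(time.time() * 1e6)})
    reply = do_work()
    emit({"trace_id": trace_id, "rpc_id": rpc_id, "type": "RESPONSE",
          "ts_us": int(time.time() * 1e6)})
    return reply
```

Because each side only emits flat events, a backend can interpret the two sides' events under different instrumentation versions, which is the decoupling benefit described next.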
E
So one benefit that we get from having this decoupling of events from the actual model is that we're able to update cross-system instrumentation without having to carefully arrange releasing instrumentation versions to both services at the same time. In practice, given service release schedules, it's impractical to assume that the instrumentation version on both sides of the boundary is going to be the same, so you need to have some compatibility across the boundary, and this decoupling allows us to, say, interpret events on one side of the boundary differently.
E
So, coming back to the explicit edges, this is probably the biggest single difference between a span-based model and our model. All of these things can technically be represented within spans. We've found the benefit of having explicit edges is that we haven't needed to change the structure of the trace to add additional features, and so for this I can walk through an example that we have in our current system.
E
However, we also have a second causality hierarchy, which is the function hierarchy, and so, you know, the function call stack sort of represents relations between parent functions and the child functions that they invoke. We've found that this is useful for representing nested blocks: you can have one block that is entirely contained inside the execution of a parent block, and we're able to use edges to say that this child block is part of this parent block without having to represent that nesting structurally.
E
The third causality is actually an interesting one, and it occurs in, I guess, more and more languages over time, like JavaScript and other continuation-based languages. Here you can imagine our schedule function queues up some future that will then be executed later on, and so in this case our causality isn't necessarily between the root functions that we're executing. In this case we've scheduled some function, and then we have some common infrastructure stack which is pulling these entries off of the queue and executing them later.
E
It's also executing other futures that are not actually connected to the one that we scheduled. So these are all, you know, parent-child relationships, ish, but what we've found is that they represent different parent-child relationships, and having edges, and specifically types for those edges, allows us to say that the function hierarchy relationship is different from the RPC hierarchy relationship, which is different from the continuation hierarchy relationship.
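A small sketch of what typed edges could look like, reusing the hypothetical Edge object from the model sketch above; the enum values and helper are illustrative, not Canopy's actual edge types.

```python
from enum import Enum


class EdgeType(Enum):
    RPC = "rpc"                    # caller's point to the callee's receive point
    FUNCTION = "function"          # parent block contains a nested child block
    CONTINUATION = "continuation"  # schedule point to the later run of the future


def add_edge(trace, src_point_id, dst_point_id, edge_type):
    """Record a typed causal link; tooling can then treat each type differently."""
    trace.edges.append(Edge(src_point=src_point_id,
                            dst_point=dst_point_id,
                            edge_type=edge_type.value))
```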
E
So one other example where we've found that having the ability to create explicit edges has been useful is representing application flow. One of the common tools for understanding our traces is critical path analysis, particularly for browser traces, and so we ran into a problem where we end up with, say, our JavaScript requesting a couple of resources, and our JavaScript thread tends to be fairly busy, and so we would get traces that look like this.
E
However, if we have additional information from our application, say when we actually end up using these resources, we can actually see in this case that we use resource two immediately, but resource one we actually don't need for some, you know, non-trivial amount of time. So we could have actually delayed resource one significantly without affecting our overall time, but it looks like resource two is actually our blocking resource, and so in this case we want to represent some sort of application flow.
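One way to record that application-level dependency is an explicit edge from the point where a resource finished loading to the point where the application first used it, so critical path analysis can follow the real dependency. The helpers below are hypothetical and reuse the earlier sketch's objects; they are not Canopy's instrumentation.

```python
def on_resource_loaded(trace, resource_name, loaded_point_id):
    # Remember where the download completed; arriving bytes alone do not put
    # the resource on the critical path.
    trace.annotations["loaded:" + resource_name] = loaded_point_id


def on_resource_used(trace, resource_name, used_point_id):
    loaded_point_id = trace.annotations.get("loaded:" + resource_name)
    if loaded_point_id is not None:
        # "application_flow" marks application-level causality, distinct from
        # the RPC / function / continuation edge types above.
        trace.edges.append(Edge(src_point=loaded_point_id,
                                dst_point=used_point_id,
                                edge_type="application_flow"))
```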
E
This has also allowed us to experiment with different representations of application-based logic. So, for instance, we've also experimented with saying that certain events must happen in order for other events to even be considered, and so again, in the page load process, there are some synchronization points where we know that we won't receive...
E
So, coming back to our metadata, we have sort of the standard string-to-string annotation map that's common among, you know, a lot of tracing platforms. These can be attached to any object in the trace, so points, edges, blocks, execution units, or the trace: all of them have a metadata object associated with them.
E
This also allows us to distinguish between annotations that users add and annotations that we absolutely must have for, you know, loading or displaying the trace. Custom is then sort of a general bucket for any annotation data that users add through their own instrumentation, and then error properties are typically used for noting errors in trace construction, as opposed to errors in the overall execution of the trace. So, for instance, we might use an error to indicate that the trace instrumentation never closed.
E
The other feature that we have is typed counters, and these are an explicit, separate type from the string annotation map. These are counters that have a numerical value associated with them, a particular type, and then also a precision. And so this allows us to say that 1,024 bytes is distinct from 1,024 milliseconds, which is distinct from 1,024 kilobytes. But it does allow us to say that if the user records 1,024 bytes in one place and one kilobyte in another place, those two values are actually equivalent.
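A minimal sketch of a typed counter along those lines; the unit table, the field names, and the way precision is carried are assumptions for illustration (the talk treats one kilobyte as 1,024 bytes).

```python
from dataclasses import dataclass

# Scale of each unit relative to its base unit.
_UNIT_SCALE = {
    "bytes": ("bytes", 1),
    "kilobytes": ("bytes", 1024),
    "microseconds": ("microseconds", 1),
    "milliseconds": ("microseconds", 1000),
}


@dataclass(frozen=True)
class Counter:
    value: float
    unit: str        # e.g. "bytes"
    precision: int   # significant digits to keep when aggregating (not used here)

    def normalized(self):
        base, scale = _UNIT_SCALE[self.unit]
        return base, self.value * scale

    def equivalent(self, other):
        base_a, val_a = self.normalized()
        base_b, val_b = other.normalized()
        # Counters in different base units (bytes vs. time) are never comparable.
        return base_a == base_b and val_a == val_b


# 1,024 bytes == 1 kilobyte, but 1,024 bytes != 1,024 milliseconds.
assert Counter(1024, "bytes", 4).equivalent(Counter(1, "kilobytes", 4))
assert not Counter(1024, "bytes", 4).equivalent(Counter(1024, "milliseconds", 4))
```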
E
We've also extended it over time to more types as we've needed them, and so we've introduced sets of strings that can sort of be appended to over time; we've used this in particular on, say, execution units or traces. And then also a stack frame type, capturing either, like, sampled profiling data or the stack frame at a particular RPC call.
E
So, coming back to, I guess, putting this all together, in between the metadata and our events we've run into, I guess, some fun challenges in modeling. Going back to our old instrumentation, where we just had call, receive, response and complete events: each of these has some associated metadata with them, and one problem we ran into was, well, when we wanted to extend this to, you know, blocks and points and execution units, a call event now does more than just create an edge.
E
A call event actually ends up creating a point and an edge to the RPC service that you're calling to, and so there's an open question of where the metadata actually applies. Does that metadata apply entirely to the point that it creates? Does it apply entirely to the edge that it creates? Is there a mixture between them? We sort of made the decision that a call represents the edge and the point is sort of a side effect of that, and so the metadata gets applied there.
E
But this does mean that, you know, when users are using the old instrumentation, they can't actually attach metadata to the original calling point instead, and so this is why we sort of extended the instrumentation over time to allow more places for this metadata to apply. And with that I will hand it over to Michael. Do you want to try sharing your screen instead, or do you want me to walk through the slides as you talk?
B
Up from... what's that? Oh yeah, presenter. I know how to use PowerPoint, everyone. Don't do that, man, disaster.
C
Right, great. So yeah, I'll kind of pick up from where Jonathan left off. You know, it was kind of interesting: before I came here, you know, I worked with, well, it's been open sourced now, but a very span-based tracing system. We called it Money, as in follow the money, and then we had all these clever things around it, like the Money Bank was where all the traces lived and stuff.
C
So it was fun, but, you know, we did run into some of the modeling issues that Jonathan was talking about, actually two in particular that we kind of ran into there, and then I read the Canopy paper and then I quit and came here, you know, and we were like, hey, this could actually be useful. One is we had these sort of situations where we had a trace on a particular system.
C
And, you know, a bunch of stuff is going on in the system, and you just wanted to sort of attach, like, a profile of what was happening on that system at various levels to the trace, and, you know, kind of the best way we could think of to do that in the sort of span-based model was this:
C
You have sort of a top-level span that represented, like, the entire scope of the execution; you start profiling and then end profiling when that thing closes, and you try and attach that profile to that top-level span. But then, in turn, you had to know that that span was kind of special, right, like that was the one that had the profiling information.
C
It wasn't that bad, but it was actually kind of clumsy as far as the tooling we were building around it went, and in sort of the Canopy model it's actually kind of more natural to just annotate the execution unit that represents that, like, request handling, right, because we use that to sort of naturally represent, you know, here's the entire span of processing an individual request (but not span in the tracing sense; those are cousins). And another interesting one was, if you just had to stick some work in a queue and you wanted to understand...
C
...you know, how long it was in there and when it came out. You know, that's actually pretty naturally modeled by another execution unit, with points for enqueue and dequeue and edges for causality, right. Whereas, like, if you sort of put it into, you know, you could model it as a separate span, and you start the span when it goes into and out of the queue.
C
But it sort of means a very, very different thing than what most of the other spans do, where it's like an actual RPC graph. It's like, oh no, you just have to know, you know, as far as your tooling and stuff goes, it's like, oh well, that particular span happens to be one that represents, like, this thing sitting in a queue for a while.
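A sketch of the execution-unit version of that queue example, with explicit enqueue and dequeue points and causality edges instead of a span whose duration secretly means time spent in a queue; it reuses the hypothetical objects from the earlier model sketch and is not Canopy's actual API.

```python
import time


def record_enqueue(trace, queue_unit, item_id, producer_point_id):
    """Model the queued item as a block in the queue's own execution unit."""
    block = Block(block_id=item_id)
    enqueue = Point(point_id=item_id + ".enqueue",
                    timestamp_us=int(time.time() * 1e6), label="enqueue")
    block.points.append(enqueue)
    queue_unit.blocks.append(block)
    # Causality: the producer's work led to the item entering the queue.
    trace.edges.append(Edge(src_point=producer_point_id,
                            dst_point=enqueue.point_id,
                            edge_type="continuation"))
    return block


def record_dequeue(trace, block, consumer_point_id):
    dequeue = Point(point_id=block.block_id + ".dequeue",
                    timestamp_us=int(time.time() * 1e6), label="dequeue")
    block.points.append(dequeue)
    # Causality: dequeuing the item is what starts the consumer's work.
    trace.edges.append(Edge(src_point=dequeue.point_id,
                            dst_point=consumer_point_id,
                            edge_type="continuation"))
    # Time in the queue falls directly out of the two points.
    return dequeue.timestamp_us - block.points[0].timestamp_us
```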
C
So those are a couple of things that we actually did struggle with from the modeling perspective, that we were actually pretty interested in, when we read the Canopy paper, to sort of help us out with. So, you know, just kind of worth noting: it was kind of an interesting thing to sort of see it from one side and now start to see it from the other.
C
So that said, I wanted to sort of move on a little bit to talk a little bit about what we're doing now and sort of where we're focusing. You know, I guess Facebook's probably grown quite a bit in the past, you know, X years, and we've got a ton of engineering teams, right. So one of the things we're focusing in on is getting the back-end APIs that, you know, Jonathan alluded to into a point where they're safe and, like, clear and usable. So that means, for us, actually sampling isn't enough.
C
We also need rate limiting, and I'll kind of go into that a little bit later. We also want these sort of somewhat tailored, you know, high-quality APIs, instrumentation layers that are for, you know, back-end use cases. Right now, you know, what we really want is to have just, sort of, most of the complexity in dealing with, you know, the underlying model handled by the instrumentation layer.
C
And an end user, alongside, you know, the sort of default one, would sort of do that in an easy way. You know, so what we really want to do is just make tracing on the back end just, like, really, really easy for folks that are building back-end services.
C
The PHP instrumentation we have that Jonathan mentioned actually kind of does that to an extent already, but, you know, we sort of expose a lot more of the underlying guts to back-end folks right now. And then another thing that this actually ends up being useful for is having, you know, sort of different APIs that are good for different situations. So, you know, one of the things that we did: we were just working with one of our teams that has some, like, really, really stringent sort of perf requirements...
C
...as far as memory usage goes, and then, sort of, you know, they really worry about things like thread contention. The sort of flexibility of the underlying model is actually going to let us fairly easily create, like, a fairly tailored one, or, you know, hey, if you need to...
C
...if you have a really high-performance, you know, RPC system, and look, you don't want things going on behind the scenes that could cause additional thread contention or memory allocation, use this API. You know, so that's one thing.
C
And then the other thing we're working on is actually a sort of a revamp of... if anybody's read the Canopy paper, there's, I think they're referred to as custom extraction functions or something like that, but it's essentially a DSL for working with traces that happens to run in our back end. We're working on a sort of revamped, you know, expanded version of that that'll run, you know, in a separate set of processes elsewhere and, you know, then be based on sort of Python rather than this completely custom DSL.
C
So those are really the two things we're working on now. That's actually super important for us, because we tend to look at traces in aggregate a lot, and we just sort of compute, like, summary, you know, information about traces, often, right, but that's stuff that's sort of covered in the paper. But I think, the way we're going to be doing it, on the safety and API clarity side:
C
This is sort of an overview of what the instrumentation stack really looks like for us. At the bottom layer, we've got, like, a layer of sinks that do nothing but, you know, serialize the events that Jonathan mentioned and flush them somewhere; you know, we've got sort of an internal Kafka-esque system that is used on top of that.
C
There's a trace model that really, you know, represents that trace model as an object model, but doesn't let you do things that don't make sense, right; like, you can't create, like, a block on a point, for instance, right. It just makes it a little bit easier to work with. But, you know, when you do things with that model, it'll give you pointers, so you can sort of keep references to them in your code and then flush them.
C
You know, it'll flush things to events usually right away, but you can do some things before that if you need to. And then on top of that, we've got sort of a set of code that deals with creating instrumentation layers, right, because we don't want people to have to worry about things like, you know, propagating context, either through thread boundaries in their system or, you know, across system boundaries. We don't want people to have to, like, really, really understand that trace model deeply, understand which parts are active, right.
C
We just want the underlying instrumentation to take care of that. We obviously don't want people to have to do their own rate limiting, because they won't, and then our system will get, you know, knocked over, so that's not great. But then the whole idea being, on top of that, we've got this sort of instrumentation kit to build back-end instrumentations. We've got a set of instrumentations that are either built or that we'll be building on top of that, and then we really want to have most folks...
C
...leverage sort of this high-level API that really just lets them do, like, a few simple things, right, and all of the more complex pieces are handled by the instrumentation, right. Like, at the end of the day, you know, it'd act something more like a logging framework, right: like, you log a point, or you create a point, and it goes into the right place on the trace, right. It goes on the right block...
C
...or the right execution unit, and, you know, and so on. And then, you know, maybe exposing, you know, a little bit of additional stuff at that high level that, you know, handles, like, the 80% use case, and then, if we need to do something more sophisticated, you know, people would have to go sort of layers down in this API stack to do it, right. And then, like I'd mentioned earlier...
C
...another thing that we're talking about doing sort of on top of this is creating just a really, really performant but much more constrained API that just does that RPC trace model for some, you know, particular use cases. And it's kind of nice that, like, not only does the underlying, you know, event and object model sort of give us that flexibility, but, you know, sort of the instrumentation that we've got built up lets us do that too.
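A sketch of what that "feels like a logging framework" top layer could look like: application code just logs a point, and the context the lower instrumentation layers maintain decides which block and execution unit it lands on. Everything here (the names, the thread-local context, the no-op behaviour when tracing is off) is an assumption for illustration, not Facebook's internal API, and it reuses the earlier model sketch.

```python
import threading
import time

# Thread-local context maintained by the instrumentation layer (the RPC or
# www framework), never by application code.
_context = threading.local()


def set_active_block(trace, unit, block):
    """Called by the instrumentation layer when it opens a block."""
    _context.trace, _context.unit, _context.block = trace, unit, block


def log_point(label, **annotations):
    """The 80% use case: drop a point into whatever block is currently active."""
    block = getattr(_context, "block", None)
    if block is None:
        # Tracing is off for this request (not sampled, or rate limited):
        # the call is a cheap no-op.
        return
    block.points.append(Point(point_id=block.block_id + "." + label,
                              timestamp_us=int(time.time() * 1e6),
                              label=label,
                              annotations=dict(annotations)))
```

Application code would then call something like log_point("cache_miss", key="...") and never touch blocks, edges, or context propagation directly.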
C
But if you don't even know that, you know, or if you want to compare, you know, data in aggregate before and after some event, right, like, you know, a deployment or something, and see what's happened, right, or if you just want to compute, you know, summary statistics off of something that can only be derived from, you know, a trace, right: that actually ends up being, I think, a more common use case for us than just looking at an individual trace.
C
So really, this is a domain-specific processing system for getting at that sort of stuff, right, and we've had a bunch of things that are built on top of it. At a high level, you know, really what we've got is sort of configuration-based: we've got our, you know, internal configuration system that you sort of put, you know, this Python-based, you know, DSL into, and that's run by this whole set of sort of machines that will go in and run those on a per-use-case basis.
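For a rough idea of the shape of such a thing, here is a hypothetical Python extraction function that turns each trace into rows of a summary dataset; the decorator, the registration mechanism, and the row schema are invented for illustration and reuse the earlier model sketch, they are not the actual Canopy DSL.

```python
def extraction(dataset):
    """Register a per-trace extraction function under a dataset name."""
    def wrap(fn):
        fn.dataset = dataset
        return fn
    return wrap


@extraction(dataset="rpc_latency_summary")
def rpc_latency(trace):
    """Emit one row per block that has both a call and a complete point."""
    rows = []
    for unit in trace.units:
        for block in unit.blocks:
            points = {p.label: p.timestamp_us for p in block.points}
            if "call" in points and "complete" in points:
                rows.append({
                    "trace_id": trace.trace_id,
                    "unit": unit.unit_id,
                    "latency_us": points["complete"] - points["call"],
                })
    return rows
```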
C
All right, so, you know, in the old system that's mentioned in the paper, we kind of ran the moral equivalent of this all as sort of one, you know, set of infra that was also doing a bunch of other things. You know, in this one, the use cases are actually split out into separate, sort of, you know, chunks of machines for different, you know, use cases, essentially going back to different users of the system, right, that are going to be doing vastly different things.
C
So those are really the two big things we're working on for the time being. And then, yeah, the other important thing about this that I didn't mention, which is different from the stuff in the paper, is that this actually does allow us to do sort of ad hoc queries, right, which was not possible with the old system. So that's super useful, especially when you don't even know what you're... yeah, yeah, one more after that, yeah.
C
Anybody at, like, a big company has almost certainly come across the use case where somebody's like, hey, there's this stuff that'll, like, magically propagate, like, this ID around, and, like, I want that to propagate my session ID. So, you know, the notion of having a more abstract way of using the underlying thing, right, wherever you've got the tracing instrumentation, of being able to propagate context through, which I know is, you know, in OpenTracing, baggage, and the paper that, you know, Jonathan Mace I think wrote, like, has some really good stuff in it. But I think one of the interesting things is, like, once you open up that capability...
C
You know, like, how do you keep people from doing really, really bad things with it, and such bad things that you end up having to turn it off, is kind of an open question, and I think it's kind of worth thinking about, like: is there a difference between, you know, that sort of really common use case of, like, hey, I just want to propagate an ID sort of within my system boundaries for some sort of session or something, versus the broader thing.
C
The thing about the sort of Canopy data model is that it's actually really, really good at doing single-node traces as well, and we get, like, really detailed traces of mobile clients and www, and folks actually want to get, like, really detailed traces of their own back-end systems, you know. But it's like, well, how do you then take that really detailed view of a little piece of the system and work it into an overall, you know, distributed trace that's probably broader? You know, you sort of create a lot of noise for people.
C
I just want that broad view; but then sometimes, you know, the person that's, like, looking at that really detailed view might want to look at something, you know, a few layers back. So there's actually a couple of different ways that we've talked about modeling that, and there's some different things we've talked about doing with that in terms of, like, viz tooling, you know, and so on, but it's actually kind of just...
C
You know, it's kind of built into the question of how we actually get, like, a good, solid, end-to-end trace while still having, like, these chunks of, like, really, really detailed trace at various parts of the system in there, and, you know, they really are different things with different audiences. You know, so yeah. So those are the two things that I think we're thinking about and, you know, may or may not do anything useful with. Cool. So, anybody have any questions?
A
I have a question about metrics and aggregates, actually. I'm wondering, I mean, obviously you have an event-based system and you're rolling up some amount of aggregates out of that, you know, into your tracing system, but are you doing kind of all metrics extraction based on this system, or do you have a totally separate metrics system? And if so, are you kind of sharing... are you using the tracing system for context propagation, and, like, how do those two things relate to each other?
E
So we have, there's, an independent metrics-based system for sort of, like, operational management. The traces will tend to, the traces can share some of the data from those metrics. There are some caveats usually there, where, like, the metrics captured are system-level, but say we want to capture request-level metrics within a trace; but we can sort of pull in the same, like, overall system CPU utilization and things like that.
A
So, well, I mean, it sounds like you actually have things separated between system metrics, and then maybe your application-level metrics are coming out of the tracing system, but the degree to which you may want to dimensionalize some metrics, is that all happening... you know, that context tends to get propagated in the tracer, which is... I'm just wondering how that really...
E
...all of these requests, in some sampled fashion, to understand, like, request-based utilization through the system: we're currently using tracing for that. Like, this is fundamental at Facebook: like, context propagation and tracing are fundamentally tied together, for better or worse, and so that's where, like, as Michael said, we end up with these cases where people are like, man, I really need to propagate a context, and, like, let me turn on tracing, and we're, like, you know...
C
Yeah, and that, like, we don't have a decoupled, generalized context propagation system, and I think, you know, to that point, like, how do we do that safely with the number of teams that we have is sort of an open, sort of, operational question that, you know, we want to think about and see if we can tackle at some point. But yeah, it's interesting with this many different folks, sort of, you know... yeah.
C
The other big thing we do with the API cleanup is we're actually adding pervasive rate limiting as well, after, you know, after the sampling. Sampling actually ends up not being quite enough for us, because, you know, if somebody's doing a coin flip that they expect to be on, like, a tiny percentage of traffic, like maybe in a region that, you know, is being used for, like, you know, some experiment, and then suddenly a lot of traffic fails over to there...
C
...you know, suddenly you can sort of get this explosion of traffic just because of that. So we're actually adding rate limits both on trace starts and on trace size, you know, before we can actually sort of start new traces, to cover that. So that's actually one important safety piece. As far as generalized context prop goes, you know, I think it's something we've been talking about for a while and starting to think about.
C
I do think that opening up a sort of completely generalized system and, like, saying, you know, like, any engineering team in an organization of our size is free to, like, go and attach baggage to this thing is, like, a non-starter. We have, you know, we've got systems that, you know, are extremely memory sensitive, where, you know, the engineers on those teams would sort of rightfully just, you know, say very loud things. And then I think...
C
...if you look at some of the safety that was sort of in the paper that Jonathan wrote (don't know if he's around), you know, essentially it comes down to, like, having a principled way of, like, you know, capping the amount of size, the amount of data, that gets propagated. But then it sort of has this, like, downside of, like, oh, maybe you really, really rely on a particular piece of data, you know, and now it's not there. So, to the extent that I've thought it through, I do think...
C
...it's really, really worth thinking about how you separate out the use case where somebody wants to propagate an ID within their system bounds, and then sort of attach metadata to it after the fact that can be processed by, like, some other system, right, and have it emit that data, but sort of, like, then not have that ID cross, you know, be propagated outside of the bounds of their particular...
C
...you know, set of systems, and then separating that out from, like, the generalized context prop, which should be very strictly controlled and really only used for, like, a specific set of blessed things, with, like, a decent amount of process around putting things in them up front, right, and, like, some set of configurations that, like, can't be changed outside of, like, you know, review by some accountable team. So, you know, to the extent that I've thought through it, like, that's sort of where I've landed.
C
Yeah, you know, and I think stuff like that, I think, is, you know, it's good, but then I think you do sort of run, at least in our world, into this thing of, like, well, okay: what if, you know, the data in your most prioritized namespace is larger than, like, the amount of data that, like, the most, you know, sort of, the most conservative team is willing to accept? You know, so you kind of just, like, need...
C
...some... I think you do need to, like, sort of really tightly control, like, you know, the generalized thing, and then try and figure out how to build the more, you know, "hey, if you want to do something within your own system bounds" thing on the same, you know, context prop. But, you know, so I think we really like the notion of having, like, some sort of system bounds.
E
Like, you know, you may have some session ID that's propagating over multiple individual www requests, but each www request should have its own ID, and, like, making sure users understand, like, what are the boundaries where things cross over, and are able to do it in a safe way... I think, like, it's a very, very open question on our side, like, how to make that work.
C
So the way we're planning on doing rate limiting is we're planning on still doing it at trace start and not killing traces that are in progress. So, like, for instance, like, you know, from the point of view of, let's say, an individual node: you know, you have a one-in-a-thousand coin flip, right, but when we configured that coin flip we had, like, two nodes running, and then for some reason now there's a thousand running. What would happen is you do the coin flip...
C
...the coin flip would pass, and then there'd be an additional rate limit check, and we would have a set of centrally configured rate limits that say, essentially, okay, this, you know, this policy gets to start five traces per second. It would check the rate limit after that, and the rate limit would then fail, and that's also where it would check the trace size rate limit, right. So we're only going to do it at start. We don't want to try... you know, we're not going to try to get into, like, hey...
F
I mean, we've done something similar, but not for regular sampling, but for, like, the specific debug sampling that people were abusing, so we rate limit those. But the reason we didn't do it for the regular sampling is because we extract some extrapolations from the statistics we get from the overall traces, and there the probability of sampling is actually very important. So if you start rate limiting that, you cannot do extrapolations anymore, so yeah.
C
That's something we'd have to watch out for. We do that sort of thing in some cases as well. You know, the intent of the rate limits is just for sort of our own safety; we don't intend them to be things that are going to be hit under, like, normal operation, you know, and we're going to monitor them, right.
G
For us, it's also good for the case of, maybe not this, the whole new-tracing scenario case, but maybe I have this distributed request, and somewhere in that request we continue on because it didn't have instrumentation or something like that. And now maybe it's, like, a huge... like, of course everyone hits it and they want to add a ton of, like, information, and now their service has 10x the amount of data pumping out than it was before, and this gives us, like, a good way of understanding that.
H
One of ours is here right now; he was working on more of a sampling-based approach where we dynamically changed the rates based on how much they're outputting. I mean, we have both, as you said, because of the debug override that we allowed in the system, so, like, we still need that case. But yeah, I mean, changing the probability is a nice way of doing it, if you can; it seems like it's pretty static right now.
G
Or size, size is a better indicator. We've got such diversity in what traces look like, and especially when you're, I mean, not in the case where it's a new tracing scenario, but updating an existing one, like an update to an existing sort of thing that a ton of people hit, it's really hard to understand, like, is this going to make us fall over immediately.