A
All right, everyone. Thanks for joining us for today's CNCF live webinar, "Uncovering Hidden OTel Traces in a Standardized Manner." I'm Libby Schultz and I'll be moderating today's webinar. I'm going to read our code of conduct and then hand things over to Steve Waterworth from Asserts and Taylor Dolezal, head of ecosystem at CNCF. A few housekeeping items before we get started.
A
During the webinar, you are not able to speak as an attendee, but there is a chat box on the right sidebar where you can say hello, tell us where you're watching from, and leave all of your questions; we'll get to as many as we can at the end. This is an official webinar of CNCF and as such is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct, and please be respectful of all of your fellow participants and our presenters.
A
Please also note that the recording and slides will be posted later today to the CNCF online programs page at community.cncf.io, under Online Programs. They're also available via the registration link you used to join today, and will be on our Online Programs YouTube playlist. With that, I'll hand things over to Steve and Taylor to kick off today's presentation. Take it away.
B
Awesome. Well, howdy, howdy, and welcome, everyone. Like Libby said, if you have any questions at all during the session, please feel free to throw them into chat, and we'd love to surface those and get to as many as we possibly can. I'm really excited to be joined by Steve from Asserts, and we're going to be covering a lot of things as it pertains to open source, and would really just love to dive in and get started.
B
One project that I've heard quite a bit about from many folks has been OpenTelemetry, also referred to as OTel for short. So, Steve, OTel me more; I'd love to hear about it.
C
OpenTelemetry is a vendor-agnostic, truly open observability toolset. Trivia fact for you: it is the second most popular project in the CNCF as far as source code activity goes, commits and the like, second only to Kubernetes. So there's a lot happening with it. Where we're predominantly interested in it is the tracing side of things because, as we know, with cloud native applications, microservices, and the like, being able to do distributed tracing can be a very good diagnostic tool when things aren't going quite as well as you'd hoped they might.
B
I find it funny that the second most popular project in the CNCF portfolio right now is one focused on telemetry and observability. It's great to see that it's doing so well in that space, and that folks just want to know what's going on. One other project that I've seen paired with it quite often, kind of like chocolate and peanut butter, has been Prometheus. Is that one that you can talk a little bit more about?
C
Chocolate and peanut butter? I'm not sure that's a good pairing, but there we go, each to their own. Yeah, so for me, Prometheus is sort of the flip side: that's all about time series metrics, and it's probably one of the more mature projects in the CNCF. If I just flick over to my crib sheet, just to be absolutely certain... I think it graduated in 2018, something like that. Yes, it did; it graduated in 2018.
B
It's been interesting to see what folks are using Prometheus to measure, and it seems kind of like a dark art for some when it comes to figuring out the right way to craft their perfect dashboard or their single pane of glass. I know that can be a little bit difficult for some folks, but I feel like once you have that set up, you have a really good view into the things you actually want to see. I know we use that at the CNCF for things like DevStats.
B
Looking at the various projects, and at folks, companies, and other types of observability as it comes to our projects: who's contributing to them, what the project health might look like in some metrics there. So I know that we're, yes, biased, but we are very happy to have the project, because it helps provide a lot of insights on that.
C
Yeah, for me this is great, because it's very easy to get it set up and running, particularly if you're running it as a container; and it's not a difficult install if you're running it natively on the operating system. It's very easy to get Prometheus up and running, and in a Kubernetes environment, using the Prometheus operator is an absolute no-brainer. That's like the easiest way.
C
It's a simple Helm install and you're there. And getting those metrics in: there's a whole bunch of exporters for various other software components, and a lot of other CNCF projects expose a Prometheus metrics endpoint by default anyway. So it's very quick and easy to get Prometheus running and fill it up with a few million metrics on absolutely everything.
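As a rough illustration of the Helm path Steve describes, a minimal sketch of standing up the operator stack (the release name and namespace here are arbitrary placeholders):

```sh
# Add the community chart repo and install kube-prometheus-stack,
# which bundles the Prometheus operator, Prometheus, Alertmanager and Grafana.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```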
B
I've really liked as well that within each of those components, when you take a look at Prometheus, you can actually inspect the data in many cases, and even download it in CSV or other formats and continue to work with it, if you don't like your view or you're locked to a specific view within your organization. I'm not sure if you have any Easter eggs or tips or tricks when it comes to that.
C
It's great when I've got everything installed and all my configs set up right. But again, I've now got to start building health rules, because I want to get alerted on things, and I want to be able to see the data, so I'm now going to start building a whole bunch of dashboards, and I want to be able to link from one dashboard to another.
B
It at least gives you a framework to iterate on, and what looks good or is relevant to you right now might not always be the case, so giving you that ability to change things and make it modular is quite helpful. And, like you said, I've seen people use it for all kinds of wild things, from tracking all the times you fed the dog or went on a walk, to measuring personal fitness and anything else in that category.
B
There was one group just outside the Bay Area, in Napa Valley, and I had a friend send me some images: this winery actually uses it to measure soil wetness and humidity and all these other things too, and they have all of these Prometheus graphs projected onto their little IMAX-esque viewing deck. So really cool, a really cool and wild way to see people using it.
C
Yeah. And then, you know, this time we're talking about observability, and there's more to observability than just metrics. Metrics are certainly your starting point, but as we've said, with OTel we've got the traces, and logging is as old as the hills; most organizations will already have some logging solution, be it open source or proprietary.
C
So you alert on something when maybe a metric isn't at the value we hope it should be, like increased latency or excessive resource consumption, and you want to dive into that, and then you want to be able to go and look at the traces for that transaction, or the logs from the container. How do you...
C
How do you pull all that together? That's where it gets really difficult, doing that manual correlation, particularly with distributed systems, when a single request could well hit, you know, a dozen different microservices, possibly in different Kubernetes clusters, maybe even in different data centers. So how do you pull all that together? That's sort of the work that Asserts has been doing: adding that layer of intelligence and automation on top of these great open source tools, to help you pull it all together.
B
We've talked a little bit about how folks are using OpenTelemetry and Prometheus, and I think you've covered this a little bit, but I'd like to dive deeper into why people are using those solutions for their problems.
C
I think the key area there is that you're avoiding vendor lock-in. The open source tooling is so good now that there's no requirement to pay for proprietary agents. Being able to collect observability data and store it is commoditized; you can do that very easily and at minimal cost. Obviously it doesn't run on thin air, so there's a bit of a compute cost somewhere, but yeah.
C
You certainly don't have to be paying licenses to an organization in order to collect observability data. In fact, in many cases the free tools are better than the proprietary tools, in that there are no limits on the custom metrics that you can have, and the maturity and range of the collectors that are out there is often surpassing what is available commercially.
B
And I think that, while it might in some cases be a little bit more difficult to set up the things that you care about initially, you're going to have that much longer-term satisfaction, especially around cost. It's just a bit of a wall to go through the first time, but afterwards it's fairly smooth sailing, keeping up with the nautical terminology. Yeah, when people are using OpenTelemetry and Prometheus, what kinds of challenges have you seen folks run into?
C
There's a bunch. You've broken free from that license cost, you've embraced open source, you've got all this fantastic data. We've already touched on it: turning that data into information is a challenge; there's a lot of work there. And then there's the correlation aspect of it, being able to pull data from different places and have it all related, maybe, to the issue you're working on. And then one of the other challenges...
C
One is also data volume. Because it is so easy to go collect all this data, you end up paying a penalty on storage costs, particularly around tracing. Metrics are tiny: I think Prometheus is about one and a half bytes per sample, something like that, if you have a look at their documentation. But tracing is much more the worst offender there, with a span being about 2 KB by the time you've got all the baggage in it, and a particular transaction may be like 12 spans.
C
So you can soon be into a few kilobytes per trace, and then you've got millions, if not billions, of traces if you're a busy site. That's very quickly a lot of data, and yeah, a lot of storage cost and processing power as well.
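(As a rough back-of-envelope using Steve's figures: 2 KB per span times 12 spans is about 24 KB per trace, so even 10 million traces a day is on the order of 240 GB of raw span data per day, versus roughly 1.5 bytes per Prometheus sample.)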
C
What most people come across is: oh, this is a lot of data, we can't possibly trace everything; what we need to do is some sampling. And OTel offers various sampling strategies, but they're all a bit of a blunt instrument. You can say, well, I'll just take 10%, which is great, but then Murphy's Law comes to the fore and says: when there's a problem, the traces I need were the ones that weren't sampled, so I'm still blind.
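For reference, that blunt take-10% approach looks roughly like this in an OpenTelemetry Collector config (a minimal sketch; the backend endpoint is a placeholder):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep ~10% of traces, chosen at random
  batch:
exporters:
  otlp:
    endpoint: backend:4317    # placeholder trace backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]
```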
B
So, we've heard a lot about OpenTelemetry as well, and I remember the early days of folks focusing on tracing and being told: you can set up tracing, but it's something that you have to instrument your application for. Have you seen that change when it comes to OpenTelemetry, and the amount of effort needed to get started looking at function calls and delving a little bit deeper when adopting OpenTelemetry?
C
Yeah, OpenTelemetry has done a lot of work on the tracing side. As I said, it is the second most active project, and there's a lot of automation in there now. It tends to be very language specific: some languages make themselves easier to instrument automatically, and some of the more compiled languages, like Go, are a little more difficult to do. But certainly something like Java, which has had a standard for the Java agent since about Java 1.5...
C
...if you can remember that far back. So there's a standard API for it, and it's very easy to automatically instrument your Java application, say. For some of the others the automation is maybe not quite as advanced, but certainly for things like Go there's a whole bunch of middlewares, so if you're using Gorilla or Gin to do your request routing, there's wrappers for that.
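As a sketch of what those Go middleware wrappers look like in practice (the service name is a placeholder, and span export still requires an OTel SDK exporter to be configured elsewhere; for Java, by contrast, auto-instrumentation is typically just `java -javaagent:opentelemetry-javaagent.jar -jar app.jar`):

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
	"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
)

func main() {
	r := gin.Default()
	// One line of middleware wraps every route in a server span.
	r.Use(otelgin.Middleware("user-service"))
	r.GET("/login", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"ok": true})
	})
	r.Run(":8080")
}
```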
B
I think that's nice. I know we've also gotten a little bit further along with things like transparent proxies for service meshes and those kinds of concerns as well, so it's great to know that it doesn't take as much time or effort to get those things instrumented, and that we're starting to see more capability right out of the box.
C
That does add another layer of complexity, though. It's like all things in engineering, swings and roundabouts: you gain in that you haven't got to go and reconfigure each service manually, but then you're adding another layer of complexity with the service mesh. But then a service mesh can do lots of funky things for you as well.
B
I saw a question come in asking whether OpenTracing is now within OpenTelemetry, and yes...
C
Now you're going back into the dim mists of time, in computing terms anyway; probably like a year ago in real time. So, OpenTracing was probably the first open standard for doing distributed tracing, and a lot of the commercial products are actually built on top of those OpenTracing standards. And then OpenTelemetry came along, and it has a broader remit than just distributed tracing.
C
It does also include metrics and logs, although the support for metrics and logs isn't as mature as it is for tracing. If you actually go and look at the various project statuses, most of them are pretty much there with mainline, generally-available releases on the tracing side; you look at metrics and logs, and there are still a lot of alphas and betas with don't-deploy-this-in-production type caveats on them. So yeah.
B
Amazing. It's wild to look back and see which projects within the CNCF have gotten merged or archived, and things of that nature. I remember reading about OpenCensus and OpenTracing, also back in the dim mists of time, when I was working at Walt Disney Studios, and getting really excited seeing those things; but it was so many projects to kind of put together. So now, seeing those culminate together as one, as OpenTelemetry, I think was really helpful.
C
I also like the concept they have with the OpenTelemetry Collector. It sort of acts like a patch board, that's the best way to describe it, I suppose. Your various services, or your service mesh, send the data to the Collector; you can set up various receivers in the Collector, so it'll receive the metrics and trace spans, and then, optionally, you can configure processors, so it can actually massage the data before passing it on.
C
So we can do that, and we'll get on to what Asserts is doing with that in a little while. And then it can dispatch that data to one or more backends. So if you've got Zipkin and Jaeger, there you go: you don't have to choose, you can actually have it go to both, or off to one of the cloud providers; you can use Google Cloud Trace or AWS X-Ray.
C
You could use one of those as your trace store. Or, of course, Jaeger, which is probably the most popular one in the open source world.
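A minimal sketch of that patch-board idea in Collector YAML (the endpoints are placeholders; Jaeger here is fed over OTLP, which it accepts natively in recent versions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin, otlp/jaeger]   # fan out to both backends
```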
B
When I was working in a previous role, one of my colleagues was talking a little bit about annotations. They were implementing some service mesh workloads, well, I'll say that five times fast, and they were talking about annotating that and losing that annotation about halfway through.
B
So they didn't get to see that full traceability until they had the aha moment and said: oh no, this actually needs to be annotated each step of the way. As you're passing along this thread, or this call, that's something that you have to be mindful of.
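For context, the trace context that has to survive each hop is, in the W3C standard OpenTelemetry uses, a single `traceparent` HTTP header; drop it at any hop, as in this story, and the trace splits in two. An example value (the IDs below are the W3C spec's sample values):

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# format: version - trace-id (32 hex) - parent span-id (16 hex) - flags (01 = sampled)
```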
B
And with that, I'd like to transition into where it is that you work, talking a little bit about Asserts and what you're doing with OTel and Prometheus. But really, I'd first love to hear: what are you doing at Asserts? What's the company about?
C
Yeah, so, as I said, we're working on providing a layer of intelligence and automation on top of these great open source tools. Both the founder and I previously spent time at AppDynamics, so we've got this background in APM, monitoring, observability, call it what you will, I suppose. We sort of had this epiphany that there's all this great open source tooling out here now to collect your observability data.
C
So, you know, don't reinvent the wheel; why would you do that? Just use the great open source stuff that's there. And we realized the problem, then, is turning all that data into useful information and doing the correlations, and we thought: well, how can we help people do that?
C
So, as we've said, we've built this layer of intelligence and automation on top of these great open source tools that provides that correlation information and also helps manage the data, so you're not drowning in data; we distill the data down. So let's talk about the metric side of things first. You've really only got two use cases for the metrics. In the short term, you want as much as possible for troubleshooting.
C
In case anything goes wrong, you want fine-grained metrics on absolutely everything; but it's expensive to keep that long term. And the other use case, of course, is long-term analysis and reporting. You know, we've got a CI/CD pipeline, we're throwing out release after release after release; it'd be useful to know whether we're making things better or worse. Are the services getting faster and less error prone, or are they getting slower and more error prone?
C
Storing everything forever is expensive, so we've essentially automated that. We take your existing Prometheus, which typically has 15 days of retention; if you're still troubleshooting after 15 days, you've got other problems. So we essentially run queries on that data, run it through a set of rules, and then store low-cardinality data long term. That knocks the data volume down to about 10 percent of what it was, so it's really easy, or relatively easy, to store that long term.
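The plain-Prometheus analogue of that query-through-rules step is a recording rule; a minimal sketch (the metric and label names here are hypothetical):

```yaml
groups:
  - name: long_term_rollups
    interval: 1m
    rules:
      # Collapse high-cardinality per-pod series into one low-cardinality
      # per-service rate that is cheap to retain long term.
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_request_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))
```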
C
So then you can still do your trend analysis: hey, have we made this service better? More errors, fewer errors? Is it going faster, is it going slower? And also for customer metrics: are people buying more, are they buying less, is customer engagement getting better as performance improves?
B
I like that, and I like what you said around just storing the right data and actually being able to act on it. You know, if I fill my garage with all of these things or packages, if I just keep pushing things into there because they're important, okay, that's great, but then I've filled up my garage and I can't park my car there. You can use the same analogy with a closet or any kind of room.
B
But you know, if I just pulled in all of the mail, that would include my junk mail too. So I like that you're taking the time to focus on making this data actionable, and really being able to focus on that.
C
The other thing we do is help people get started. Like I said earlier, it's really easy to stand up Prometheus and set up a bunch of collectors, exporters, and scrape configs, and if you're using the Prometheus operator on Kubernetes, it's even easier. So yeah, you've got this data, but you've got no real way of understanding and visualizing it. So the Asserts product ships with a curated library of pre-built dashboards and health rules for all the common technologies, so from day one you can be effective.
C
You can actually be productive and start using the data you're collecting without having to spend weeks or months building dashboards and writing health rules. Of course, it's not going to be one size fits all; there's always going to be some uniqueness to each environment, so of course you can still write your own. And the dashboarding we're building, again, just leverages the open source: we embed Grafana in the product.
B
That's really helpful, and I think folks would be overjoyed to hear: hey, we can save you a couple of weeks or months of time. Even for folks that have implemented OpenTelemetry and Prometheus and Grafana and these other tools, do you help them make their stacks better? I won't name who, but I have definitely heard folks say: hey, we set this up four years ago, and we really haven't touched these rules since. Is that another kind of problem case that you help solve for?
C
Yeah, so since it's a curated library, with each new release there may be updates to the rules as things change; you get newer releases of the software components you're running, and they may behave slightly differently, so those rules are constantly tweaked and massaged to be most effective. Like I say, you always have the ability to override, tweak, tune, or disable one of our health rules.
C
If it's nagging you, and you think, oh, actually, I don't need this, this is fine in my environment, I don't care about that, you can squelch it down and turn it off. And say you've got something unique: maybe in your environment there's a particularly important message queue, and if the queue depth is greater than five, oh dear, we're in trouble. That's a rule very unique to you; okay, yeah, you can just add that in there and you'll get notified about it.
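That queue-depth rule is a one-liner as a standard Prometheus alerting rule; a hypothetical sketch (the metric name is made up):

```yaml
groups:
  - name: custom_health_rules
    rules:
      - alert: CriticalQueueBacklog
        # Hypothetical gauge exported by the message broker.
        expr: important_queue_depth > 5
        for: 2m          # require the condition to persist before firing
        labels:
          severity: critical
        annotations:
          summary: "Queue depth {{ $value }} exceeds 5 on the critical queue"
```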
C
Yeah, well, the way we handle that is: as you know, in any large system there's always something running a little hot or going a little slow, so you get this constant chatter of alert notifications, and the vast majority of them probably aren't actually that important. There may be something that could be tuned later, but you certainly don't want to be woken up at three o'clock in the morning to be told: hey, CPU consumption on this container was a little hot for a minute. Oh, who cares?
C
So we anchor alerting on SLOs. If it doesn't impact that overall SLO, then we're not going to alert you. We still record that those things happened, but you're not going to get that emergency page or Slack message or whatever at four o'clock in the morning telling you to panic; only if the SLO is in danger of breaching, or has actually breached. So we monitor the SLO burn-down, and if we see a rapid acceleration in burn rate, we try to get ahead of it rather than wait for it to smash through.
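That burn-rate idea can also be expressed in plain Prometheus; a sketch of the common multiwindow pattern for a 99% SLO (1% error budget), reusing the hypothetical error-ratio recording rule from above plus an assumed 1h-window variant:

```yaml
groups:
  - name: slo_burn
    rules:
      - alert: HighErrorBudgetBurn
        # Fires when errors burn the 30-day budget ~14x faster than allowed,
        # over both a long and a short window so a brief blip does not page.
        expr: |
          service:http_request_errors:ratio1h > (14.4 * 0.01)
            and
          service:http_request_errors:ratio5m > (14.4 * 0.01)
        labels:
          severity: page
```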
B
I've seen folks implement some alerting, implement SLOs, and have some key metrics or key uptime, deliverability, and reliability factors that they're trying to aim for, yet they will set monitors and alerts on objectively the wrong thing. And, like you said: this container was running hot for two minutes, but Kubernetes is going to reap that and bring it back anyway.
B
So it's not that much of an issue; or the autoscaler just hasn't kicked in yet to adjust for this influx of traffic that we're looking at. I think there have been hours of those stories, many of them fun in retrospect, not in the moment at all. But for folks looking at implementing monitoring and alerting, making sure it's the right kind as well, that's also really helpful, and it sounds like you have that.
C
There's another layer on top of that. One of the really clever things, and I don't understand how they do it, it's some very clever engineers that wrote it all, but one of the very clever things we do is analyze all the metric labels. A Prometheus metric obviously has its value, but it also has a whole bunch of labels describing what the metric is about, so we analyze those metric labels. And similarly, traces have tags, which are the metadata about the trace.
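For illustration, this is the kind of metadata being mined (the label values are hypothetical; the span attribute names are from the OpenTelemetry semantic conventions):

```
# A Prometheus series: one numeric value, everything else is labels.
http_requests_total{service="user", pod="user-7d9f", namespace="prod", node="ip-10-0-1-12"}  1027

# A trace span carries analogous key/value tags (attributes), e.g.
#   service.name="user", db.system="postgresql", net.peer.name="users-db"
```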
C
It's not just the timing, so we analyze the trace tags too. From that we can build up a graph database of how everything is interconnected. It's not just service to service, which is what tracing gives you; it's also the stack that it's running on. So I like to think of it as a four-dimensional graph of your application topology: it's service to service, which is your X to Y; the stack, which is your Z, the depth; and then we record it all over time.
C
So say a request came back slower than we wanted; we didn't want that, we definitely wanted it at 500 milliseconds or less. So what went wrong? Without that graph, you're relying on maybe your own knowledge to know that, hey, this user service uses this database and this cache, and piecing it together that way, or having to ask a colleague. But hey, Asserts has done this for you: when that incident is generated, it automatically traverses that graph database and collates everything together onto one dashboard, so you have all the information you need to troubleshoot.
B
Everybody loves a good scavenger hunt, especially with their metrics, trying to figure things out. I think that's a great point for when SLOs go bad, or even if you didn't break an SLO but you had a really impactful event and your team was still kind of scrambling to meet that SLO.
C
Like I said, when that incident happens, it's always like: okay, well, that login service is now taking a lot longer than the target of our SLO, so that generates an incident, and you'll get notified. We just use the standard Prometheus Alertmanager, so that can go to whatever...
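The fan-out Steve alludes to is standard Alertmanager routing; a minimal sketch (the keys and channel are placeholders):

```yaml
route:
  receiver: pagerduty-oncall
  routes:
    - matchers: ['severity="warning"']
      receiver: slack-alerts
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <your-pagerduty-integration-key>
  - name: slack-alerts
    slack_configs:
      - api_url: <your-slack-webhook-url>
        channel: '#alerts'
```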
C
...all the usual candidates there, PagerDuty and the like, so you'll get notified. Then you can go into the dashboard, and as I say, it's all on that one dashboard. So the SLO was against an endpoint on the user service, but that user service has dependencies. It obviously runs somewhere: if we take Kubernetes, it's a service, so there's a pod, and that pod will be running on a node within a cluster. But that service may use a cache, it may use a database, it may call a whole load of other things; there could easily be a dozen microservices involved in that. So which one is causing the problem, or which more than one? You want to be able to go and investigate and check everything out, and as I said, the graph database we've built understands all those relationships, all those services, and the stack.
C
So what it does is traverse that database and pull in everything that's immediately connected around it, and put all of that onto one dynamic dashboard for you, so you're not having to fish around trying to find out what's going on where. And if any of those dependent services have had any issues, they're highlighted as well, so you can see that maybe it's not actually our user service itself: it's reliant on this database, and this database was running a little slow. So then you can go and investigate that.
B
I love it when it's just a capped-resource kind of problem. It's much less fun when it's like: oh, this null type is undefined, what is going on?
B
It's great to have that telemetry to dive deeper. Amazing. For folks that have any questions, I'd love to urge you to throw those into chat and we can get to those. I've got a couple more to go, but we'd love to hear from all of you and get some more questions answered.
C
Yeah, in a similar vein, on that theme of troubleshooting: as I was saying, you've got all your metrics and all the dashboards immediately accessible from that dynamically created dashboard, but equally, from there you can jump out into logs. So you've got your existing logging solution, maybe you've got an ELK stack; well, we'll jump out to ELK, and you'll arrive in ELK with a deep link.
C
The time range and the search query are already filled in for you, so you're straight away looking at the appropriate container logs or subsystem logs, whatever it happens to be that's linked across. And then the same thing with the tracing; and we do a really quite clever thing with the tracing. As I said, we have our own OTel Collector module, and we're using that for two purposes, really.
C
First of all, we're analyzing all the trace tags; the OTel Collector module we've got essentially creates a bunch of Prometheus metrics from all the spans that it sees, and they get scraped in, so that helps us build our graph. But also, we're looking at the timings, so we're building baselines, multi-period baselines, for each endpoint.
C
So, therefore, we know whether a particular call to an endpoint is normal or not: is it slower than normal, or was it normal? Because, if you think about it, ideally most of your requests will be handled in a prompt and error-free manner. They won't be interesting; they'll be perfectly normal requests that came through, not a problem at all, and you're generally not that interested in those. It's only the slow and error ones that you want to go and delve into, and go: oh, well, why did this one go wrong?
C
And if you think about it, if you've got an SLO of 99%, that means in the worst-case scenario you're only expecting one percent of those traces to be interesting, to be slow or erroneous. So heck, why am I trying to collect all of them, or even 10 percent of them? So what our OTel Collector module does then, having sent all its metrics up...
C
...is call back and pull down the baseline information. So, for each endpoint that it sees, it knows when a span comes in whether that's slower than normal or not, and therefore whether it's interesting.
C
And if it's just a regular one, then we just drop it, because, hey, we don't need to fill our storage up with all these perfect traces. So, taking the 99% SLO type argument, that's going to reduce your stored traces down to about one percent of trace volume, which...
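The generic open source analogue of keep-only-the-interesting-traces is the Collector's tail-sampling processor; a minimal sketch that slots into a traces pipeline (Asserts' baseline-driven logic is its own, and the threshold here is made up):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the whole trace is seen
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]   # always keep failed traces
      - name: slow
        type: latency
        latency:
          threshold_ms: 500       # keep traces slower than the target
```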
B
Good, I came in under the limit on that front. Yeah, I like that you talked a little bit about the ELK stack as well, and I'd like to focus on that data collection strategy. So does that mean you don't necessarily have to change it with OpenTelemetry and with Prometheus? Can you keep the same measures and measurements that you used to have?
C
Absolutely. The idea is we sit on top: if you've already implemented the open source, great, well done, we love you, and we just sit on top of what you've already done. We're just providing that layer of intelligence and automation to make your life easier, on top of the hard work you've already done, and freeing you from having to manage and maintain hundreds of dashboards. You'll probably be able to knock that down to just a handful of ones very specific to your business; the rest, the commodity dashboards...
C
...we've done for you, and the same thing with those health rules. And, as I'm saying, we solve that real problem of how you correlate across the distributed system. How do you go from one service to another, and from traces to logs and back to metrics? How do you manage all of that? Well, hey, we're doing that for you.
B
I think that's one of the most painful things that I've dealt with in other roles and responsibilities: being told, okay, you have to rip out everything that you have installed, and you install this new operator, and no, you have to use our agents, and everything like that. It's so much nicer to be able to leverage what already exists, and then it makes for an easier adoption path, at least in my experience.
C
Yeah, that's what I think the open source tooling is all about: you can implement these great open source tools and you're not tied in anywhere. You can use that data wherever you want: you can choose to send it off to some licensed cloud software company, or you can use one of the big cloud providers as a service, you know, Prometheus as a service, which the big cloud providers do, or you can run it yourself.
C
The freedom is yours; you can choose to do with it what you want, and that's definitely our philosophy. We're not saying: all that hard work you've done with open source, rip it out and start again, install our agent. No, we just want to sit on top of the data you've already got and allow you to do a lot more with it.
B
I think one thing that helps too is that I took a look at the Asserts site, and one thing I thought was interesting was the intelligent sampling that you have, because cost is a core concern a lot of people focus on, especially right now with workforce reductions and everything else. So, by utilizing things like that with Prometheus and OpenTelemetry, can you talk to some of those cost-cutting concerns that kind of come into play?
C
Yeah. Like I said, by doing the intelligent sampling of those traces, you're going to significantly reduce the amount of storage you need. Traces are big and there's a lot of them. It's a horrible balancing act, because if you collect all the traces, there's a lot of them and it's really expensive: you've got egress charges if you're sending them somewhere, and it'll certainly break through the free tier of that cloud provider.
C
But turn the dial down, and when you go looking for that one trace: oh, we didn't get that. So yeah, doing that intelligent sampling is the best of both worlds. It's going to really compress your data down, so you don't have the cost associated with trying to process and store all those traces; but having compressed it down, you've still got the really interesting ones for when you're trying to do that problem solving: why didn't that work, where did that go, what threw that error? You can go and look at it.
B
My last question, and then I'd love to tie it up and talk a little bit more about any calls to action, or anything that you'd like to point out with Asserts. When it comes to understanding the relationship between your data and automating that correlation, are there any tips, tricks, or things that Asserts offers that help out with that, when it comes to OpenTelemetry and Prometheus?
C
You know, it's that graph database that we build by analyzing the metric tags and the tracing tags to build up that relationship model, so we know what services are talking to what, and where they're running. That makes your life a lot easier; you're not relying on that tribal knowledge. You might be an engineer trying to troubleshoot a payment gateway, but you're dependent on...
C
...some other services. Right, but I don't know how they're deployed, so then you've got to call somebody else in, and then they're going: oh yeah, but that uses this database thing, and I know nothing about that.
B
Amazing, amazing. Yeah, no, I can't agree more with wanting to jump more into the code and focus less on, you know... I like the idea of code, but writing useful code is always more helpful.
B
Some interesting things I've seen within the community are like the OpenTelemetry demo, which I'll link in the chat for folks: many different programming languages, and options to start understanding what's possible with OpenTelemetry. And then you can tie that together with something like Asserts, or craft up a dashboard that's really helpful for you. But I like that the community is focused on providing an actual use case of how to put these things together, so it's not just wishing you well and kind of leaving you.
C
Obviously, if you're going to do a more serious production install, rather than running it on Docker Compose, perhaps the best way of doing it is the Helm chart, so you can deploy it into Kubernetes, and then you have all the benefits that Kubernetes gives you, of scalability and self-healing and the like.
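For reference, both paths Steve mentions come straight from the OpenTelemetry demo's own docs; roughly (the Helm release name is arbitrary):

```sh
# Quick local run with Docker Compose
git clone https://github.com/open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo && docker compose up -d

# Or deploy to Kubernetes via the community Helm chart
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-demo open-telemetry/opentelemetry-demo
```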
B
Awesome. Well, I don't see any more questions rolling in, so with that we'd love to give a final call for them. But otherwise, Steve, do you have any parting thoughts, wisdoms, mantras, or anything else that you'd like to share before we spin down today?
C
B
There's a good podcast I'll link to later, where somebody went into, for like an hour, the deep mechanics of why that actually works, the state machines and everything else; but that's another fun conversation for another time. Awesome, well...