From YouTube: Webinar: The Open-Source Observability Playbook
Description
Observability within distributed systems is essential to understanding how your application is performing and establishing application reliability. In this talk, we will discuss the rise of microservices in the cloud, the pillars of observability (metrics, logs and traces), and open source tools for tracing such as OpenTelemetry, as well as OpenTracing and OpenCensus. The goal is to understand the tools themselves, as well as discuss best practices in using them. Using the right tools and best practices can accelerate application development, ensure that applications perform as expected, and reduce user impacts.
Presenter:
Hen Peretz, Head of Solutions Engineering @Epsagon
A
The Open-Source Observability Playbook. I'm Jerry Fallon, and I'll be moderating today's webinar. We'd like to welcome our presenter today, Hen Peretz, Head of Solutions Engineering at Epsagon. Just a few housekeeping items before we get started: during the webinar, you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen; please feel free to drop your questions in there, and we'll get to as many as we can at the end.
A
This is an official webinar of the CNCF and, as such, is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of the Code of Conduct. Please be respectful of your fellow participants and presenters. Please note that the recording and slides will be posted later today to the CNCF webinar page at cncf.io/webinars. With that, I'll hand it over to Hen to kick off today's presentation.
B
Thanks, Jerry. Can you guys see my screen?
B
So let me just do a quick introduction about myself. I'm Hen Peretz, and I'm running the solutions engineering team at Epsagon. We are building a monitoring solution for modern applications, whether microservices or serverless: everything that is basically distributed.
B
Our tool gives you the ability to add tracing and logging pretty seamlessly, and also to correlate between metrics, which is what we call full observability for those modern applications. In this talk, I'm going to cover quite a few things.
B
The first thing I want to do is some sort of recap of the old-school observability methods that are currently used, I'm guessing, by a few of you already here in the crowd, and how you are able to actually achieve observability easily. I'll also do some kind of live demos; hopefully nothing will actually blow up.
B
It's been a while since I did that. Something we're also going to touch on is how open source tools today give you the ability to implement some sort of do-it-yourself observability. They have improved significantly throughout the years, and I think they can be an excellent start for people who want to start off by just gaining some more understanding of how the system actually works. But first things first.
B
So let's just ask ourselves: why do we even need monitoring? It can be for various reasons, but I think the main one is just to make sure that our business works. If we have a website that is currently running and accepting payments from customers, or getting some orders inside the website, and that website is actually working properly, that means that our business is actually working properly. And then we can think about what types of aspects we want to monitor.
B
I think I read this in Google's SRE book; I'll share the links later.
B
There are four golden signals that you want to get from your system in order to understand whether it's working properly and whether it's healthy, and they are pretty intuitive and logical to understand. The first one is latency: if your system is suffering from large latency when it comes to requests, it's affecting your customers, and they are having a bad experience. The second is traffic: how many people are actually going into that website. If that traffic spikes or drops low, that's something you want to monitor in order to be able to understand whether your system is healthy or not.
B
The third is errors. This is something that is not that trivial when it comes to distributed systems; it's not that easy to understand whether you have errors in your system, and just being able to put some sort of spotlight on the errors can be extremely useful. The last one is saturation, which means how many of my services are being highly, frequently accessed.
B
So all those monitoring tools are pretty much agent-based, and the downside to that is that they only collect host data, and they only collect metrics. Where that becomes a problem is that it doesn't give you a full understanding of what's going on within the services. You're able to see that your payment service, for example, is running in, let's say, a container, so you can see the CPU and you can see the metrics; but that doesn't really give you enough understanding of whether the database is being heavily accessed, or whether something inside your own business logic is actually causing the errors, which is what we actually need in order to troubleshoot.
B
We need some more debug data there, so metrics are not enough. This is the first place we try to go: whenever an engineer gets a call saying that one of the services is acting out, or that something is acting a little fishy, the first thing you want to do is get as much debug data as you can, and we go directly to the logs in order to get that. And today, this is how logging works.
B
If you think about it, you have your logs being written to the output, probably to some file in your containers, or you remotely move them to some other vendor, and they're basically being dumped either locally or remotely. An agent just collects that data and sends it either to your own proprietary Elasticsearch that you run yourself, or maybe you actually push it to some log vendor, to be able to handle high traffic.
B
So everything is being done using agents that move logs from one place to another, and if you look at what is actually being done today, those methods of logging and monitoring can actually work pretty well.
B
That is, if you have a monolithic application and you have only one system. But think about taking logs from different types of services in your system and making sense of them; that's where the hard part comes in. And if we think about how software has changed over the years, you can see the trend in probably any graph that you will look up online: there's a huge trend of companies making the shift from monolithic applications to distributed systems. Lambda adoption has been very high.
B
Serverless has been highly adopted, and so have containers. A lot of companies made the switch to containers, which makes their systems much more highly distributed. And that shift makes a lot of sense, because it's much easier to develop: as a developer, I don't need to worry about the host that my code is running on.
B
I only need to worry about the business logic, and I'm able to use as many third-party APIs as I want in order to query different types of data sources around the internet. If I need some service, I can just use an existing one instead of implementing it myself, and that has significantly improved the pace at which we can develop. As a startup company, it's extremely useful to know you have those tools available online: if I want to implement payments, I can just use Stripe.
B
If I want to run something and not worry about the host, I can just run it on a Lambda function and pay as I go. This allows me to move my business much faster, but it also comes with a few downsides when it comes to actually understanding what's going on. It sounds a little bit shiny at the beginning, but once you run it in production, and I'm pretty sure most of you are familiar with that...
B
It becomes pretty hard to track those things down. To continue with the challenges that engineers and DevOps face: troubleshooting becomes extremely hard, because you're not sure whether just using those logs and metrics can be very efficient for a highly distributed application. And if you look here on the right, we have a few services.
B
Each line represents a service, and the only thing that is actually visible is the communication throughout the services. As an engineer, to be able to actually identify a trend throughout my system, I need to correlate between different types of logs, and that is something that is not possible using the basic logging capabilities. And if I can't really properly monitor my system, how can I actually develop new things?
B
This just leads us to what we call the three pillars of observability. You probably saw it around the web, sometimes referred to as MELT, which stands for metrics, events, logs, and traces. What it means is that, in order to gain full observability of distributed systems, I need to be able to take the metrics, take the traces, and take the logs, and combine them all together.
B
That gives me a clue of what's going on end to end through my system. That is the only efficient way to not only know what's going on in production, but also to be able to see what's going to happen once I push new services through my development process. And when it comes to troubleshooting, it's going to be extremely useful to have all those capabilities at hand.
B
Now I want to start off with the first part. A lot of this talk is going to be pretty technical, just showing you how we can take a Python application that is running on Flask, which is probably the most common framework today, and try out how we can implement logging best practices. Hopefully you can actually leave this talk with something useful, something that you can try out with your own teams. So I'm going to talk a little bit about logging.
B
Logging is, I think, the number one tool for engineers, the number one debugging tool. It allows you not only to display appropriate information about your system, but also to assist you whenever you want to debug something: you just print something to the screen. Now I'm just going to list a few best practices, and then we're going to run through a few examples.
B
For me, the top best practice for logging is, first of all: print things in JSON; print things structured. The reason I'm saying that is that if you're just going to print some plain text that tells you service A called the database, it's going to be extremely hard for you to scale up with that and put it into a log aggregator in a way that will actually make sense. But if you're going to put it in JSON-structured data...
B
Then you'll be able to go to your log aggregator, whether it's Elastic or not, and once it's indexed, you can actually filter all the events that contain that field without getting any false results. If you're going to use something that is not structured, I'm going to share what that actually looks like. And I think the most important thing is to actually automate that, because let's remember that everything we print to the screen is stuff that we actually wrote in advance.
B
So if we are looking to improve, let's say, our logging capabilities or observability capabilities, we must be able to not think all the time, while we code, about what we are actually going to print. We need to have that process fully automated, because otherwise it will just be prone to errors. We don't want a situation where every time we write one line of code, we then write another line of logging code.
B
That just makes your code look bad and also very hard to maintain, because for every change you make, you also need to change the prints, and eventually you're going to start to have some inconsistency. So I'm just going to jump into the demo. We're going to use Python and Flask; hopefully it will work, since I haven't done it for a while. Let me see here, so I clear this. There it is, so let me just actually jump to the code.
B
Okay, so I'm just going to run here. What we actually have here is a Flask application; it's pretty straightforward. You import Flask, and what you can do with it is actually pretty neat, because I am able to use the logger, if you are familiar with it.
B
This is just an example for Python, but if you're familiar with the logging library, you're also able to override the default formatter, which means that every time I print to the screen, even if it's a warning or anything like that, it will put it in a structured manner, meaning that I no longer need to format every print myself.
B
If you only change and add an additional class, called StructuredLog here, it will be able to automatically change the way it prints to the screen. And what we have here is just a simple hello endpoint that is accepting anything coming from the default website, and it's just printing a warning that a user got in at some endpoint. Now, actually, I think this is going to work better, like this.
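As a minimal sketch of the idea described here, a custom formatter on Python's standard `logging` module can turn every log call into a JSON document. The class name, service name, and stage values below are illustrative assumptions, not the exact code from the demo:

```python
import json
import logging

class StructuredFormatter(logging.Formatter):
    """Render every log record as a single JSON object."""
    def format(self, record):
        payload = {
            "service": "orders-api",   # hypothetical service name
            "stage": "production",     # hypothetical deployment stage
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

# Attach the formatter once; every later log call is structured automatically.
logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("user got in")
```

Because the output is valid JSON, a log aggregator such as Elasticsearch can index the `service`, `stage`, and `level` fields and filter on them without false matches.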
B
Yeah, I'm probably breaking the code now, but I just want to show you how easy it is to actually implement that kind of thing. I'm just making a minor change, because I just noticed that this screen doesn't really print anything.
B
Now we're just going to curl it. Perfect. So I just ran a request to my Flask application that has been running, and now we can see the structured log. You can see the log here is pretty structured; I can put it into Elastic and easily index it, and this means that for every request, I know the service name, and I know the stage, whether it's production or not.
B
I know the level, and I know the message. Now, if you want to know how many errors I had in production from that service, you can just filter by service, stage, and the level of the message, which is pretty useful. It's not just printing something to the screen that will later on be pretty hard to follow. Moving on, the next thing is monitoring best practices.
B
And what I mean by that is that I need to have a single dashboard where I can view all of the resources in my services, whether I'm currently monitoring my database, my Flask application, or anything else that comes to mind. I don't want to switch between screens, because every time there's an issue, I'm going to have to ask: okay, where do we monitor our database?
B
And if I want to actually take those things and put alerts on them, it's going to be extremely hard to maintain once you have different types of dashboards. So if you're choosing one dashboard, just go with it, and make sure that all the metrics are being pushed to that dashboard. The second thing is: define the critical metrics that you have. You don't want to be alerted late at night on anything that breaks in your system.
B
You want to define the critical metrics where you say: at that point, it's something that you have to fix. Meaning that if my database is a little bit loaded, I don't want to get any alerts; but whenever the database reaches the point of no return, when it's going to be stuck and no longer available for my customers, then that's the threshold I want to be pointing at. The third thing is: don't think about metrics only as infrastructure metrics. Think about metrics as something that can help your business, so make sure you use custom business metrics. Take the example of a website that is currently, I don't know, shipping things to customers and has some orders in line.
B
If I have a dashboard that tells me that this month, or this week, or in the last few days, we got 20 orders, or 10 orders, or something like that, I can also understand what the state of the business is. And it's something that's pretty cool to share with everyone who is not specifically an engineer; it can be shared with the sales engineers.
B
They can also use those dashboards, and I also highly encourage you to take all those groups in your system and put them on the same dashboard; obviously give each one of them their own view, but everything needs to be in the same place. It will be much more useful and efficient for your company. And if we think about what we want to monitor at the application level: take, for example, every call that we are making to some API.
B
We want to see what the average duration is. If we push messages onto a queue, we want to see how many messages are currently sitting in the queue. It's just making sure that you know which type of resource in your system is prone to errors; for example, I might have a queue that is currently handling all my deliveries.
B
If that queue breaks, for example, then I want to be alerted on that. But I also want to be aware whenever there is a business issue: for example, if that message queue has received no orders at all in the last hour, and that's typically something that doesn't happen, then I also want to put alerts on that. And there are a lot of other things you can go through, like the HTTP codes that you want to monitor.
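To make the custom-business-metrics idea concrete, here is one hedged sketch: emitting a business metric as a structured log event, which a log aggregator can then count and alert on. The `emit_metric` helper and the metric names are hypothetical, not part of any specific tool:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("metrics")

def emit_metric(name, value, **dimensions):
    # Hypothetical helper: a business metric is just a structured log event,
    # tagged with dimensions so the aggregator can filter and threshold on it.
    payload = {"metric": name, "value": value, **dimensions}
    logger.info(json.dumps(payload))
    return payload

# e.g. one order placed, tagged with service and stage for later filtering
emit_metric("orders.created", 1, service="orders-api", stage="production")
```

An alert rule in the aggregator could then fire when the sum of `orders.created` over the last hour is zero, matching the "no orders in the last hour" example above.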
B
But I think the most important thing when it comes to those metrics is to have your services emit them and print them, for example as the structured log that we saw, and to do that automatically, which is something I'm also going to jump in and show now.
B
So I'm just going to jump to this example; I'm using Flask in this example as well, so I'm going to continue with it. Let's say we want to measure every request coming to my server and know what the average duration is. I need to be able to calculate that, and my first best-practice tip was to not do things manually, so we're going to use what's called a middleware in Flask. It's just the ability to hook in every time there's a request coming in to Flask.
B
The Flask application is pretty standard, just something that returns hello world. The part that is pretty juicy comes here in the middleware, which actually hooks into the before-request and after-request with those two functions. What we do is, first, whenever the request comes in, we time it; we take the current time. The second thing we do is print to the screen what the duration was, along with some additional information. Let's see what that actually looks like.
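A minimal sketch of the middleware described above, assuming a plain Flask app (the logger name and the JSON field names are assumptions for illustration):

```python
import json
import logging
import time

from flask import Flask, g, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("apm")

@app.before_request
def start_timer():
    # take the current time when the request comes in
    g.start_time = time.time()

@app.after_request
def log_request(response):
    # print the duration plus some additional context as a structured log
    duration_ms = round((time.time() - g.start_time) * 1000, 2)
    logger.info(json.dumps({
        "path": request.path,
        "status": response.status_code,
        "duration_ms": duration_ms,
    }))
    return response

@app.route("/")
def hello():
    return "Hello, world!"
```

Because the hooks run for every route, the timing data is collected automatically: no handler has to remember to log its own duration.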
B
It's called apm. Cool, so our Flask application is currently running; we're going to send another request. Awesome. So now, if I just go back to the code, you can see that we're also still using the structured log that we saw. That structured log gets a record to print, and if you pass something in under the message, it also makes sure to add that as additional information, along with the file name and the service.
B
The stage that we saw before still keeps the same structure, so it can also be extremely useful for indexing later on. And if I go back to the example: we ran it, and we saw the level, which is warning; the stage, which is production; and the new thing that we just added, which is how much time it actually took. I can see the duration was around nine milliseconds here, and the status code was 200.
B
The fact that it's JSON means I can actually parse it, especially in Elastic or anything that supports log aggregation. So I can filter out all the requests, or do aggregations like how many requests were 200 and how many were 500, and set that as my metric. Or even duration: if this API is going to take more than 30 seconds, then I want to be alerted, because the AWS API Gateway times out around that point.
B
Now, what we have so far is pretty nice, but there are a lot of things that we know are missing. It's really nice if you have an application that is currently running at a very low scale; but if we think about a very distributed system, we need to be able to correlate those metrics and logs.
B
We need to be able to take different services in our system and make sure we correlate between them, and this just leads me to explaining distributed tracing. I'm sure you guys have heard about this; probably a lot of you are actually practicing it and using it in your teams. I'm just going to show a few kinds of do-it-yourself approaches.
B
That is, distributed tracing, and how to use open source tools to implement it. Like I said, open source tools today have made a huge improvement when it comes to that, so it can be pretty easy to do, and I'll also show that in a few minutes. But before that, let's just step back and understand: why do we even need distributed tracing, and what is it? A distributed trace is basically storytelling.
B
For example: a client went to my website and pushed an order, and that triggered a service that talks to a database, and that database put a message on a queue, and that queue triggered another thing, and then I returned, and then I sent an email to my customer. So I'm closing a full loop of asynchronous events, but they're all initiated from the same place. Being able to draw that and understand how it flows is how distributed tracing actually helps us.
B
It's just the ability to understand what is happening from end to end in my system. Doing that yourself is easy, but it requires a lot of maintenance. If we're talking about small companies, it might work; but if we think about scaling up, it means that your entire engineering team needs to be aware of it and needs to do it.
B
I think only very tech-savvy companies, like Uber or companies like that, can allow themselves to have full teams dedicated to observability code who share those tools with the other teams. And the truth is that today, when a company is just getting started, it's pretty hard to do that yourself, and it doesn't even make sense to do it yourself, because you want to focus on your business.
B
You want to be able to deliver products to your customers rather than have a niche team that does solely observability. And if I just talk about what we have today in terms of the landscape of tools that allow you to do distributed tracing: as part of the CNCF, we have OpenTracing and OpenCensus, which have now joined into OpenTelemetry, and also Jaeger and Zipkin.
B
These allow you not only to generate traces, but also give you the tools to ingest and visualize them. When it comes to OpenTelemetry, that's the name of both the standard and the set of libraries that allow you to do some sort of automated instrumentation.
B
If you are not familiar with them, I strongly invite you to try them out; they're pretty easy to set up, and we're just going to do that in the example very soon. Now, talking about how to actually do it: in this example, we're going to use OpenTracing, and in order to generate traces, we need to understand what we want to have inside those traces.
B
A trace means that every time a service runs, like we saw with the request going through my Flask app. If we just track back here, I can see that the request, once it ran, just printed a message to the log, and for me, that message can be what I call a trace. A trace is just a log print that tells me what this application did.
B
Which resources it actually used, how much time it actually took, whether there were exceptions or not; just some raw information about what this operation actually did. And OpenTracing is basically a standard that allows you to code and create those kinds of traces; in their lingo, such a record is called a span. The first thing you want to do is to be able to instrument every call you have.
B
So if I have a Flask request that is also accessing the AWS SDK and putting an object on S3, or putting a message on SQS, or calling a third-party API, or even calling my Postgres database, I want to be able to automatically trace those without the need to manually create those spans. This means that for every request and response I have, I'm going to create a span, and I also want that span to carry some context, meaning that if I'm calling a database, it's not enough for me to understand how much time it took.
B
I also want to understand which table, which query, and some additional information that can help me debug later on. And the most important thing when it comes to creating spans is to be able to think about a span not only as part of a single service, but in an end-to-end scope: like I said before, a request can go and put a message on a message queue that will later asynchronously trigger another service.
B
I'm sorry; do say if I'm going too fast, but you can also just watch it later and put me on half speed. That's what people usually do, right?
B
Okay, so let me just jump into the tracing. I'm using the OpenTracing library, and I'm just using their tracer and their format here. If we remember the Flask middleware from before, with the before-request and after-request hooks: this is my Flask application, still the same hello world; it doesn't do anything fancy, but the middleware here is now creating spans. Before the request actually happens, I want to extract the trace context from the HTTP headers of the Flask request.
B
I can see some additional information, and I can also use that context to create a new span. In this example, if, for example, someone called my Flask application and that calling service was also traced, there will be a correlation ID in the headers that will tell me: hey, you belong to this distributed trace, and now I can continue it. Every time you create a span, you're either continuing a distributed trace or you're creating a new one.
B
So here, we're just going to create one, or be a continuation of another one. Now we also want to add some additional context, like what the URL is and which user actually requested it, so I can later on filter by user or by that URL, which will allow me to understand things much more powerfully. I also want, after the request, to be able to record the status code and the duration that it actually took. So let's see how it actually looks.
B
Good, yeah. So now we have the captured span, and everything that is printed is just the actual span; it's not printed in a very structured way, but the span itself is being sent to Jaeger, and we can show later what that actually looks like. Now we have the traces: that Flask application created a span, and that span is being ingested.
B
So I can use Jaeger, for example, which can be pretty useful for ingesting traces, and it can handle any scale you want; it all depends on where you actually run it. It also allows you to do some searches around the traces and do some visualization. There are also other tools that allow you to set up alerts, meaning that every time there is a span that contains an exception, you want to be alerted on it, and a good example of a tool for all of this is Jaeger.
B
So if you know that you have a kind of flow in your system that is currently handling orders, like the example that I'm using throughout this talk, you can understand how much time is being spent in each part of your system. And the most important thing about creating spans is tagging them and adding some context, so that once you troubleshoot, you can see not only identifiers, like the user ID, the customer ID, the device ID, but also things that are relevant to your flow control.
B
You can see what type of event is actually happening, things that are related to your business logic, and you can also add business metrics to each span. You can record how many items I currently have in the cart, how many minutes have actually been watched in my video offering, or anything of that sort. So adding context to spans can be extremely useful, not only for troubleshooting, but also for later on searching, filtering, and then creating metrics around that.
B
It's another thing that can be extremely useful and life-saving for a business. The next thing that is super important to add to a span: I'm saying, let's not stop there. Let's not only add the things that we think we need; let's just add everything. For example, if I'm calling a database, I don't only want to record what type of table, or what the length of the query was, or what the result is.
B
I want to see the entire query to that SQL or NoSQL database, because I can use that query to see what my most frequent query to that database is, and I can do optimizations around the database. I can see exactly which query ran and caused an error, because getting a notification that my database failed is not enough for me to troubleshoot; I want to understand exactly what happened. And let's say, for example, I'm now querying Stripe for a payment.
B
They also give you very informative details about what actually went wrong, and think about the time this will actually save you. If, for example, something actually broke and you have a 500, you know the exact service, you know the customer it happened to, you know everything; but now you're trying to understand what actually happened. Without that, the engineer will go to your dev environment and will try to, you know, reproduce it, and reproducing takes time.
B
It's not accurate. And if you're already printing, in advance, all the information that you need, then you'll be able to troubleshoot on the exact events that happened; you can see the exact event that happened with all the relevant information.
B
Now, this just leads me to a mindset that I think you guys should actually use: use tracing not just as one of the pillars alongside the logging and metrics pillars, but as the glue between them. I can go from traces to the exact log; I can print exactly where the logs reside through the trace that I print; I can print things like the container ID, or whether it's running on a Lambda.
B
What the Lambda function is; I can easily correlate between them, or even print the request ID of that Lambda function. Then I can easily correlate between the trace and the environment, and also vice versa: I can go to the environment and to that invocation, search for that request ID throughout my logging, and then I will find the exact traces that happened. And that is possible because I'm using structured and automated logs, so I don't need to worry about whether my team actually does that.
B
So we're coming very close to the end. I'm just going to talk very briefly about what are, for me, the best practices to gain full observability of a system. When it comes to that, I think teams just need to have things automated for them. You don't want your engineers focusing on doing it themselves: creating spans, printing things to the logs.
B
You don't even want them thinking about that. You want to have a tool that you just plug in, and make sure that everything is already automatically captured; you don't even need to maintain it. Because if you need to make the decision whether you're going to focus on fixing your delivery system or fixing your logging library, you're always going to choose the delivery system, and then it's going to be extremely hard for you to track back and actually do it. So you're going to end up with some legacy observability code.
B
That is pretty hard to maintain, and, especially for small startups but even for huge enterprises, it's not something you want your team to actually focus on. You need to think about it exactly the way I would not implement my own payment system and would instead use the third-party tools that are currently available.
B
I don't want to implement observability in my team either, because there are companies that do exactly that (side note: Epsagon), but I'm saying this as an engineer, and it's something we also encourage in our own company. We also use Epsagon to monitor our own Epsagon environment, and this means that our engineers can focus on actually building the system and not worry about monitoring it. And you want that observability tool to support any environment.
B
So if you're going to switch between serverless, Kubernetes, ECS, AKS, Azure, GCP or anything on-premise, you don't want to worry that you need to change the way your code operates. You want to have a tracing library or tracing tool that will work in any environment, because companies often make this shift, and the first thing they are curious about is how they will be able to monitor once they move to Azure or something like that.
B
So you need to eliminate that from your thinking, because you can use a tool that will support any environment. And you want to be able to connect every request that belongs to a transaction. For example, you might have a Lambda function that is putting a message on a database, which triggers another function, which is talking with one of your containers or one of your legacy services.
B
You want to be able to connect all of those together into one transaction, and then have the ability to take that data and actually search and analyze it, so you can gain business-related metrics and also be able to search very quickly and troubleshoot. That alone will help you quickly pinpoint those problems.
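The chain described above can be illustrated with a minimal sketch (this is not a real tracing library; in practice a tool such as OpenTelemetry handles this for you): each hop injects the trace context into the outgoing message, and the next hop extracts it, so every span ends up sharing one transaction ID:

```python
import uuid

SPANS = []  # stand-in for a tracing backend

def start_span(name, carrier=None):
    # Continue an existing trace if the carrier holds one, else start fresh.
    trace_id = (carrier or {}).get("trace_id") or uuid.uuid4().hex
    span = {"name": name, "trace_id": trace_id}
    SPANS.append(span)
    return span

def inject(span):
    # Attach the trace context to an outgoing message (e.g. queue attributes).
    return {"trace_id": span["trace_id"]}

# Hop 1: a function writes a record that triggers the next service.
producer = start_span("lambda:enqueue-payment")
message = {"body": "charge customer 42", **inject(producer)}

# Hop 2: the triggered service extracts the context and continues the trace.
consumer = start_span("container:process-payment", carrier=message)

# Both spans now share one trace id, so a backend can stitch the full
# transaction together and let you search it end to end.
assert producer["trace_id"] == consumer["trace_id"]
```

The names and message shape here are invented for illustration; the point is only that the context travels inside the same carriers (queue messages, database records, HTTP headers) that connect the services.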
B
So, just to summarize: modern applications today are very distributed. They are using a lot of abstraction layers. You don't need to worry about servers; if I'm talking about Kubernetes, you don't need to worry about creating those containers. Everything is being managed automatically and everything is abstracted away from you, because you want to focus on your business logic. And in order to monitor that, it's not the standard monitoring and logging that will assist you.
B
You need to use something much more advanced that will also inherently implement distributed tracing, and this is exactly why distributed tracing becomes a much more crucial component in any environment. And just as a side note from me: if you're running a small company and you want to try all those things, those open source tools are definitely available, and they are pretty good.
B
But if I'm thinking about scale and production, don't implement something yourself unless you actually need to. If you want to be professional in payments, then yeah, implement your own payment system. But if you only want to use it as a tool, then you need to choose the best tool for you to use, and obviously you don't want your engineering team focused on building stuff like that. So that's it for me. Now it's time for the Q&A session.
A
Thanks. Thank you, Hen, for a wonderful presentation. We now have some time for some questions. I believe we already have one here in the Q&A box: quote, "support any environment". Would you be able to provide an example of a tool that is limited to a set of environments?
B
Yeah. So if you think about running your own, if I take Jaeger as a service, for example, that you can run in your own environment: if you're going to do the switch to another environment, then you're going to have to also copy that over to the other environment.