From YouTube: 016 Observability in Istio and Gloo Mesh
Description
Collecting telemetry and logging in a service mesh across multiple clusters can be complex. In this session, we look at common multi-cluster observability patterns and how to solve problems like large-scale collection, querying, and aggregation.
Scott: Thank you, everybody, and welcome to our talk, Observability in Istio and Gloo Mesh. I'm Scott Weiss, architect at Solo.io.

Today we're going to talk to you about observability and what that means. Just to start off with a rudimentary, general definition: observability can be understood as the ability to understand the internal states of a system based on knowledge of its external outputs. Digging into detail, what we're really interested in here is approaching this from a microservices point of view.

How do we gain observability into our microservices stack? Just to explain the problem a little bit: we're aware that there's a transformation going on, a shift from monolithic applications to microservice architectures, and that shift brings some challenges with it. Just to give a sense of the scale, at large companies you have hundreds of microservices that are all interdependent and communicating with one another.

This leads to a situation captured by a tweet we love at Solo: "We replaced our monolith with microservices so that every outage could be more like a murder mystery." Basically, all of this scale and separation, this modularization of components, makes things more difficult to track and understand.
So that's where observability comes in, in this context. In the microservices world we talk about three different types of data, three different types of telemetry: tracing, logging, and metrics. We'll go into each one of them and explain how they work and how they help us understand the microservices stack.
First up is tracing: how does traffic flow through our system? When we have a system of microservices and an error occurs somewhere in that system, we want to understand the context for that error. For example, the bufio scanner error that we see here occurs somewhere in our system. That's great, but we want to understand the context for that error; with metrics and logs alone, we only see the picture from each individual instance.
We want to have an understanding of the context in which these things are occurring. Otherwise it's kind of like debugging without a stack trace, so we need to actually record the traces that happen within our system in order to understand who invoked what. Just to give an explanation of how tracing works: the edge service that initiates a flow of requests creates a unique identifier and initializes a context, and that context gets passed down in headers to each backend service that gets called in the chain.
This allows us to capture timing, put arbitrary metadata on a context, and then reassemble the call tree that we collect in a UI. Just to give an example, you can see here that we can construct these spans out of the individual traces of each step in the request chain.
It allows us to propagate the context between applications, so the applications themselves can be aware of it, and it allows us to find latency bottlenecks and the sources of errors inside our request flows. To give a little bit of a clearer idea of what's going on: in the old world, before service mesh, our application would have to instrument all of this on its own. It would have to initialize the context, and it would have to propagate the context.
What service mesh has added to the equation is that having sidecars for our applications allows us to automatically generate the trace context and propagate it to a storage implementation, for example an OpenTelemetry implementation. The only limitation is that the application still needs to connect the context between different requests: when it receives a request from one service, it needs to know that a subsequent outbound request is sharing that context.
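As a concrete illustration (not shown in the talk), these are the request headers an application would typically copy from an inbound request onto its outbound requests so the sidecars can stitch the individual spans into one trace; this is the common B3 set documented by Istio, and the exact list depends on the tracer configured:

```yaml
# Headers to forward from inbound to outbound requests (B3 propagation,
# Istio's default tracing setup; adjust for your tracer of choice):
tracing_headers_to_forward:
  - x-request-id
  - x-b3-traceid
  - x-b3-spanid
  - x-b3-parentspanid
  - x-b3-sampled
  - x-b3-flags
```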
Harvey: Next up is logging. Logs record discrete events in our applications, such as database operations and other critical computations. Logs provide us with detailed information with an expansive context, and by virtue of this they are often used for debugging specific problems. When paired with a log management system, it's common to impose a well-defined structure on the logs, which enables searchability; this is also known as structured logging.
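To make that concrete, a structured log entry might look something like the sketch below; the field names are illustrative and not prescribed by Istio or Gloo Mesh:

```yaml
# A structured log entry expressed as fields rather than free-form text,
# so a log management system can index and search on them (illustrative):
level: error
time: "2021-06-01T12:00:00Z"
service: reviews-v1
cluster: cluster-2
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
message: "bufio.Scanner: token too long"
```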
The third type of telemetry is metrics. More concretely, metrics are quantitative data that describe the internal state of the system, and again, the exact metrics depend on how an application is instrumented. Some commonly seen metrics include measures of hardware usage, such as CPU and RAM consumption, as well as measures of network activity, such as requests per second and response latency. Metrics provide us with high-level insights over time that summarize the overall performance of the system, and for this reason they're often the first signal to check when evaluating overall system health.
So, on to Istio's metrics instrumentation. Istio provides instrumentation for what are called the golden metrics. Golden metrics provide a bird's-eye view of overall system health, and there are three typical categories: latency, traffic, and failures. Latency metrics provide us with a measure of how slow or fast a service is; it's the time taken to serve requests, and it's typically measured in percentiles.
So a 99th-percentile latency of 100 milliseconds means that 99% of requests are served in 100 milliseconds or less. With Istio, you get the istio_request_duration_milliseconds metric, which is a distribution that allows for subsequent computation of percentiles. Traffic metrics provide us with a measure of how in-demand a service is, and this is usually measured as the number of requests per second. Failure metrics are similar: they provide a measure of the number of requests that have failed, and when combined with traffic metrics we can generate a success rate or failure rate.
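For example, with the standard Istio metrics scraped into Prometheus, the p99 latency and a success rate could be derived with recording rules along these lines (a sketch; the label names and time windows may need adjusting for a given setup):

```yaml
groups:
- name: istio-golden-metrics
  rules:
  # 99th-percentile request duration per destination workload, computed from
  # the istio_request_duration_milliseconds histogram buckets.
  - record: workload:istio_request_duration_milliseconds:p99
    expr: |
      histogram_quantile(0.99,
        sum(rate(istio_request_duration_milliseconds_bucket[5m]))
        by (le, destination_workload))
  # Fraction of requests that did not return a 5xx, per destination workload.
  - record: workload:istio_requests:success_rate
    expr: |
      sum(rate(istio_requests_total{response_code!~"5.."}[5m])) by (destination_workload)
      /
      sum(rate(istio_requests_total[5m])) by (destination_workload)
```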
Scott: Thank you, Harvey. So when we look at real-world use cases, at how customers today are actually running their mesh or running their clusters, they are running multiple clusters. Sometimes those clusters are replicated, with clones of services, and they're running, let's say, in different regions or on different clouds.
There are really a number of complex situations that we get into, and this increases the burden for observability. Like Harvey mentioned, when it comes to Istio running and managing a single cluster, the responsibility for collecting the metrics is left to Prometheus, and Prometheus can quickly become challenging to scale when you're dealing with multiple clusters, because you have to set up a Prometheus federation. The burden is really left on the user, and this is what we've seen time and again working with customers.
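For context (not shown in the talk), federating per-cluster Prometheus instances into a central one typically means scrape configuration like the sketch below, where the target addresses and the match selector are placeholders:

```yaml
# Central Prometheus pulling selected series from each cluster's Prometheus
# via the /federate endpoint (placeholder targets and match selector):
scrape_configs:
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"istio_.*"}'
  static_configs:
  - targets:
    - prometheus.cluster-1.example.com:9090
    - prometheus.cluster-2.example.com:9090
```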
So what we've done is build out solutions for them in our product, Gloo Mesh, to help address and tackle some of these: how to handle multiple clusters, how to handle multiple meshes, how to handle meshes and clusters that are being shared by multiple tenants, how to handle meshes and clusters that are running on multiple clouds, as well as the different permissions and personas that users have in order to interact with these.
To summarize some of the challenges we've come across: metrics attribution and log attribution can often be missing the context of which mesh they're a part of, because different meshes are not aware of each other. You may have different meshes; for example, you may be running App Mesh if you're on Amazon, and then you have Kubernetes clusters that are living on GKE or running in VMs.
Those clusters may also have their own meshes installed, and you may have cross-cluster traffic going on, and we want to add the context of which cluster and which mesh these requests are a part of. Another problem is the aggregation of the data. This is an ongoing issue, and it can be quite challenging, in particular at the scale of the number of metrics that wind up being stored; there are a number of best practices out there for solving that, or for reducing the cardinality of the metrics that are being collected.
But the aggregation question still becomes more challenging when dealing with large environments. And finally, another question is how to actually integrate with third-party tooling: once we've figured out how to aggregate the metrics, how do we then provide those to tooling like Kiali and Grafana to get more visibility into our system?
So we've solved this for our users through the use of an aggregation layer, which funnels metrics, logs, and, in the future, traces as well into a single pane of glass that is then accessible to a user or a third-party application, and which can take data from disparate sources.
Another feature of this system that we've developed is the pluggability of data sources, depending on how users want to collect their metrics. Certain users already have a solution with Thanos, a Prometheus-based solution that they're using to scrape across clusters; some are using Datadog to manage their clusters and collect their telemetry data. We allow the pluggability of a data source, which plugs into our server and can then be propagated to our UI.
The agent connects to a server running on another cluster over a secure mTLS connection, which uses client certificates in order to verify the identity of each agent. The agents connect to the server and then begin to push data: the Envoy proxies push their metrics and logs to the agent via a gRPC service.
Harvey: Okay, so let's take a look at Gloo Mesh's observability features in action. But before we get started, let's first review the architecture. As we can see in this diagram, on each managed cluster the Envoy proxies are configured to send their access logs and metrics to the local Gloo Mesh agent, which then forwards those access logs and metrics to the central Gloo Mesh server. This acts as our central repository for all of our observability data, from which we can access it.
In our scenario we have two clusters. Both of them are managed, and one of them is also hosting the management plane. So, as you can see here, we see both the Gloo Mesh server, indicated by the enterprise-networking pod, as well as the agent; in the other cluster we also have an agent, and we've deployed the Bookinfo app.
So you might wonder how we have configured the Envoy proxies to send their metrics and access logs to the Gloo Mesh agent. To do this, we leverage Istio: its mesh config has an option where you can declare an access log service as well as a metrics service, and for both of these we've set the enterprise agent as the sink for the access logs and metrics. Istio then does the work of configuring this across its fleet of proxies.
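In Istio's mesh config, that looks roughly like the sketch below; the agent address and port are assumptions based on a default Gloo Mesh install rather than values shown in the talk:

```yaml
meshConfig:
  enableEnvoyAccessLogService: true
  defaultConfig:
    # Stream Envoy access logs to the local Gloo Mesh (enterprise) agent.
    envoyAccessLogService:
      address: enterprise-agent.gloo-mesh:9977
    # Stream Envoy metrics to the same agent.
    envoyMetricsService:
      address: enterprise-agent.gloo-mesh:9977
```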
So at this point it'd be a good idea to check out the metrics and get a higher-level view of what's going on, and this matches what we're seeing: details is fine, reviews-v3 is fine, but requests to reviews-v1 seem to all be failing, as indicated here. So now let's drill down and look at a few example requests; maybe that'll give us a clue as to what's happening. For this we want to use access logs, so the first thing we need to do is configure the collection of access logs by creating this access log record.
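The resource being applied here isn't shown in full in the transcript; below is a sketch of what a Gloo Mesh Enterprise access log record selecting the Bookinfo reviews workloads might look like, with the API group and field names recalled from the docs of that era and the namespace used as a placeholder, so they should be verified against the version in use:

```yaml
apiVersion: observability.enterprise.mesh.gloo.solo.io/v1
kind: AccessLogRecord
metadata:
  name: access-log-reviews
  namespace: gloo-mesh
spec:
  filters:
  - workloadSelectors:
    - kubeWorkloadMatcher:
        namespaces:
        - bookinfo            # placeholder namespace for the Bookinfo app
        labels:
          app: reviews        # collect access logs for the reviews workloads
```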
So we've created this access log record. It might take a few seconds to take effect, and now we're going to connect to the Gloo Mesh server: it has an endpoint from which you can stream the received access logs. So let's connect here, and we're looking at access logs coming either from reviews or from the product page, since those are the pertinent workloads.
It looks like this authorization policy is saying: restrict reviews to requests that contain the header foo with the value solo.io. So that explains it.
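An Istio authorization policy along those lines would look something like the following reconstruction; the policy and namespace names are assumptions, but the header condition matches what's described:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: reviews-require-foo-header
  namespace: bookinfo        # assumed namespace for the Bookinfo app
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
  # Only allow requests that carry the header foo: solo.io;
  # everything else to the reviews workloads is denied.
  - when:
    - key: request.headers[foo]
      values: ["solo.io"]
```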
When we look back at our access log and search for foo, you see that the request header is not even present, which would explain the request failures.