From YouTube: Hoot Live Episode: Understanding Istio Metrics
Description
Join Scott Weiss, Architect at the Office of the CTO, on our next Live Hoot Episode on April 6. Scott will dive into Istio Metrics.
About us https://www.solo.io
Questions? https://slack.solo.io
Code Samples: https://github.com/solo-io/hoot
Suggest a topic to cover here: https://github.com/solo-io/hoot/issues/new?title=episode+suggestion:
We've gone from the monolithic world to the microservice world, which has made things a lot more complex in terms of understanding our systems and identifying performance issues. Just to give you a sense of the scale these things reach: these diagrams were actually taken a couple of years ago, but they were published by the companies shown here, showing their microservice graphs.
With communication going on between hundreds of microservices, it can become a pretty challenging problem to figure out: where is the bottleneck? Where is the source of an error? Do we even have an error? Is something happening that's a sign of a service failure or outage? There's a tweet that we love:
"We replaced our monolith with microservices so that every outage could be more like a murder mystery."
This basically frames the problem, which is how to identify performance: how are things performing in our system?
So that's the high level; now let's get into some of the details. What metrics do we actually get with a service mesh, other than those produced by the application itself? How are we... hold on, I'm just going to pause one second and check that the stream is good.
Great. Just throw up any questions in the chat if there are any. Okay, so what metrics does Envoy give us directly? Before we get into the Istio piece, let's just look at what happens when you stick an Envoy proxy in front of some traffic and use it to proxy that traffic. Envoy will give you a bunch of data out of the box.
That data is really helpful for understanding connection issues within a single workload. This could be an invalid configuration for the proxy, connection failures when we can't reach the upstream we're trying to connect to for some reason, or TLS handshake errors. These are all things that correlate with metrics that Envoy emits. Here's an example of some of the metrics that Envoy provides us; these are fairly low level.
They're often useful for debugging the mesh itself, as well as, let's say, TLS errors, or when a service is offline and refusing connections, and so on. We're not going to dig into the Envoy side too heavily here, but just know that this is something you'll get, quote unquote, for free.
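As a rough illustration, you can pull these low-level stats straight from any sidecar's Envoy admin endpoint. A minimal sketch, assuming a Bookinfo-style deployment (the namespace and workload names are assumptions):

```sh
# Dump raw Envoy stats via the sidecar's admin endpoint (port 15000 by
# default in Istio sidecars) and filter for connection/TLS failure counters.
kubectl -n bookinfo exec deploy/productpage-v1 -c istio-proxy -- \
  curl -s localhost:15000/stats | grep -E 'upstream_cx_connect_fail|ssl'
```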
Now, when Istio comes into the mix, it gives a bit more of a high-level set of metrics, which are really very useful for understanding service health in our system. So say we have hundreds of microservices: Istio maintains a number of these istio_-prefixed metrics, which are used to track things at a higher level, and they're aggregated across all the services in the mesh. They give us an understanding of, for example, request latency, which Istio measures through Envoy via a custom filter that Istio has added to its installation of Envoy.
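To see those metrics being emitted, you can ask a sidecar for its stats in Prometheus text format. A sketch, with the same assumed workload as before:

```sh
# pilot-agent proxies a request to the sidecar's stats endpoint; the
# istio_* series are merged into Envoy's own stats output.
kubectl -n bookinfo exec deploy/productpage-v1 -c istio-proxy -- \
  pilot-agent request GET stats/prometheus | grep '^istio_'
```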
It's measuring the latency of requests, so we get a global metric for the distributions of the latencies of all the requests going into our services. Those are distributions, so they're calculated for different percentiles.
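In PromQL terms, computing a percentile from that histogram looks roughly like this; the workload selector and the local port-forward to Prometheus are assumptions:

```sh
# p99 request latency for one destination workload over a 5m window,
# queried through the Prometheus HTTP API.
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{destination_workload="reviews"}[5m])) by (le))'
```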
Then Istio also keeps a counter of every single request that happens in the mesh: this is istio_requests_total. Each request that happens is labeled, so we know what the source was, what the destination was, what the namespaces are, what the clusters are. We have all that information on istio_requests_total, so we can split it up by service, or look at it by errors. Just using those two metrics, we can actually get three of the four golden metrics.
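For instance, that one counter can give you both traffic volume and error rate. A sketch of the kinds of queries involved (the service name is an assumption):

```sh
# Requests per second into a service, split by calling workload:
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(istio_requests_total{destination_service="productpage.bookinfo.svc.cluster.local"}[5m])) by (source_workload)'

# Share of those requests that returned a 5xx response:
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(istio_requests_total{destination_service="productpage.bookinfo.svc.cluster.local",response_code=~"5.."}[5m])) / sum(rate(istio_requests_total{destination_service="productpage.bookinfo.svc.cluster.local"}[5m]))'
```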
If you're familiar with the golden metrics, they include latency, meaning how fast or slow our services are; traffic, meaning how much demand there is for our services, which we get just by looking at the actual request quantity; and failure rate, which will be the percentage of requests that fail. The last one, which is not in here, is saturation. Saturation has to do with the actual capacity of the services, and that requires a bit more.
It's not something Istio can give us on its own. We can take the traffic data that's here and understand what the capacity of our services is through various measures, for example by looking at CPU and memory usage, to get that full-picture view.
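A minimal sketch of filling that saturation gap from outside the mesh, assuming metrics-server is installed (for kubectl top) and cAdvisor metrics are in Prometheus:

```sh
# Quick per-pod CPU/memory snapshot (requires metrics-server):
kubectl -n bookinfo top pods

# Or CPU usage per pod from cAdvisor counters in Prometheus:
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(container_cpu_usage_seconds_total{namespace="bookinfo"}[5m])) by (pod)'
```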
You can see how, for IT ops, for any kind of operations, this is super valuable. Just having Istio installed, just putting it down as a transparent mesh without applying any configuration, and collecting all this data out of the box, is pretty powerful. But there is one limitation to this design, which is when you have multiple Istio control planes.
These metrics don't get correlated with one another, and Istio is unable to aggregate them as it should, because each Istio control plane represents a logical boundary for the set of all this data.
So, just to summarize some of the gaps in the current ecosystem today, where customers are moving and where we're looking to build the solutions customers need: these are situations where you have multiple clusters, and you may have multiple meshes. They may be running in the same cluster or across clusters; there's really a many-to-many possible relationship there.
I won't get into all of these in this hoot; it's going to be a relatively short one today. But we will look at the multi-cluster, multi-mesh setup and understand how we've been able to attack that problem.
So, to summarize some of the limitations here: cluster boundaries can lead to improper metrics attribution. When a request crosses a trust boundary, meaning you have a multi-cluster setup where traffic leaves through an egress gateway that terminates the mTLS inside an individual mesh, and then goes to a remote ingress that initiates a new TLS session, the metadata that Istio uses to generate metrics, the information that tells it the sender and the recipient of the traffic, doesn't survive the hop.
The context is lost. Essentially, Istio will only know that the egress was reached; it's not going to know that a service in cluster A is actually talking to a service in cluster B. So you need some kind of orchestration or tooling on top to reconcile that. Another problem is the aggregation itself.
We've seen various approaches in this space. It can be done with Prometheus federation, where you have a Prometheus instance set to scrape the Envoys in each cluster, and then a centralized Prometheus that goes and scrapes each of those, federating across the multiple Prometheus instances. But doing this is fairly complicated, and it's a large operational overhead managing so many instances of Prometheus. You'll have to focus on things like synchronizing storage between them, and the cardinality of the metrics can become pretty intense.
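For reference, the central side of that federation setup looks roughly like this; a sketch, with the endpoints and the match filter as assumptions:

```sh
# The central Prometheus scrapes each per-cluster Prometheus's /federate
# endpoint, pulling only the istio_* series.
cat <<'EOF' >> prometheus.yml
scrape_configs:
  - job_name: 'federate-istio'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"istio_.*"}'
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example.com:9090'
          - 'prometheus.cluster-b.example.com:9090'
EOF
```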
It can just be difficult to manage all of that metrics aggregation. And then, once you have all of those metrics integrated or aggregated,
we have the question of how to actually leverage them in their raw form. Some of the software out there today, like Kiali and Grafana, does a nice job of integrating with Prometheus or some other metric store, but adding those higher-level insights still requires either modifying the metrics themselves so that they contain the context of, for example, which cluster boundaries are being crossed, or making the third-party software aware of these differences, aware that it's observing a multi-mesh environment. So what we've done, and what we're working on with customers, is to provide a single source of truth for observability: a single pane of glass that can be used to aggregate the metrics from different clusters, unify the formatting of the data, clear up the differences between our meshes, and correlate the metrics that need to be correlated across different mesh boundaries.
So, the tooling that we're building out is kind of like a platform for all things mesh. Right now I want to demo for you how Gloo Mesh works today, which will
aggregate metrics across multiple clusters. I'm going to switch over to my demo here, and as I'm doing that, if anyone has any questions, feel free to drop them in the chat; I'd be curious to see them.
I have this kind of setup running here already.
I want to generate some traffic, and we're going to use that to generate some metrics. So I'm going to port-forward to the product page of the Bookinfo application, which some of you may be familiar with if you've played around with Istio at all, and I'm going to use hey to generate some traffic against it.
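Roughly what that looks like, assuming Bookinfo sits in its usual namespace and hey (https://github.com/rakyll/hey) is installed locally:

```sh
# Expose productpage locally, then drive steady load against it:
kubectl -n bookinfo port-forward deploy/productpage-v1 9080:9080 &
hey -z 2m -c 2 -q 10 http://localhost:9080/productpage
```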
The next thing I'm going to do... actually, what I'd like to do first is explain what's going on a little bit. So let's get all the pods here and take a look at what's actually running.
Oh, thank you! I just got a notification that I'm not sharing my screen. Whoopsies, good call! Let me turn that on; all right, sorry about this. Okay, so let me restart these pieces of the demo and I'll just show you that. No crashing, all right, geez.
On the remote cluster, you see I have the Bookinfo pods, and I have this agent that runs: this is the Gloo Mesh agent, and it's collecting what's actually happening under the hood from the Envoy sidecars. So you see the ratings pod here and the reviews pod; they're running two containers.
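If you want to confirm the sidecar injection yourself, a quick check (namespace assumed):

```sh
# List each pod with its container names; injected pods show the app
# container plus the istio-proxy sidecar.
kubectl -n bookinfo get pods \
  -o custom-columns='POD:.metadata.name,CONTAINERS:.spec.containers[*].name'
```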
One of those containers is the Envoy sidecar, and that sidecar is configured to push metrics directly to the agent. The agent then sends them over a secure channel, across the cluster boundary, into the management cluster. This is one of those concerns we talked about before: when you send metrics, you want to make sure it's over a secure channel; you obviously don't want them traveling in plain text. This is one of those problems customers are coming up against, so we've integrated it into our solution.
If we look at the management cluster, we also have Bookinfo running there. So this cluster is doubling as both a management plane, a control plane, as well as a data plane, because we have our pods and our Envoys running there. And you'll see we have more pods running; we also have an agent running there.
This agent will connect to the management pod and stream to it. Both agents are streaming their metrics up to the management pod, and the management pod is aggregating them. So let's generate some metrics here.
The stream is going, and now we're going to look at the metrics endpoint, where we're aggregating the metrics. Please work... and it's not working. Oh. Why are we not working?
Hold on, let me share this.
There's a certain setup step that you have to do for Istio, and then everything...
It's empty. All right, I'm sorry, everybody. I think there's something wrong with my setup, and unfortunately I don't think it's going to be fruitful to debug it here. There are videos I can point to where we already have demos of this, and like I said, this is something you can test at home, so please try it out and let us know if you run into any issues.
I'm pretty sure I skipped a step in my local setup, and somewhere along the way it stopped working. But essentially what you'll wind up with is a single Prometheus where you can actually query metrics, and you'll be able to see the references to each workload. They'll be cross-cluster, so you'll be able to see the cross-cluster metadata.
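A hypothetical example of what querying that aggregated Prometheus could look like; the exact cluster label names depend on the setup and are assumptions here:

```sh
# Cross-cluster request rates, split by originating and receiving cluster.
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(istio_requests_total[5m])) by (source_cluster, destination_cluster)'
```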
And there are options if Prometheus is not your preferred metrics backend. So again, I'm really sorry about that demo; I would have to go back and recreate my environment from scratch, which I don't want everyone to sit through. So why don't I just jump into some questions?
Are there any questions? I don't see any.
...to a single pane as well, which we can see. And we will be working on traces as well, providing similar functionality there in an upcoming release. So anyway, thank you, and again, I'm so sorry about that failed demo.