From YouTube: Real-time troubleshooting of K8s applications
Description
Don't miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe in Amsterdam, The Netherlands from 18 - 21 April, 2023. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
A
You are not able to talk as an attendee, but there is a chat box on the right-hand side of your screen. Those of you saying hello, thank you, and please continue to do so; also leave your questions in the same spot and we'll get to as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF code of conduct. Please do not post anything, or ask questions, that would be in violation of that code of conduct, and please be respectful of fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF online programs page at community.cncf.io, under Online Programs. They're also available via the registration link that you used today, and the recording will be on our Online Programs YouTube playlist. With that, it's over to our presenter and Nick to kick off today's presentation. Y'all, take it away.
C
Hey, thanks Libby, thanks for the introduction, and thanks everybody for joining today's live webinar and demo. Let Nick give a shout-out as well.

All right, so I'm going to jump straight ahead to the topic at hand. Hopefully you guys can see my screen and everything looks good.
C
So today's topic is going to be an interesting one: how can we do real-time troubleshooting of Kubernetes applications, especially when those applications start showing problems and performance issues? I'm going to dispense with the legal notice, etc., and I thought we should give you a little bit of background on who we are, in case you don't know who OpsCruise is.
C
We are a relatively young company based out of the Bay Area, and our focus is almost exclusively on how to provide observability for cloud-native applications. Because it's cloud native, we are an active participant and member of the CNCF community, and, as you will see, we are pretty much built entirely on CNCF and open source instrumentation. I don't want to read through this, but you can see we've been working with a number of customers and a number of partners.
C
We are venture-backed, and the other thing I'll point out, because we are focused so much on open source and the CNCF, Prometheus being one of its first projects, is that Julius Volz, who you will know if you are following Prometheus and CNCF instrumentation, is on our advisory board. We're glad to have him. So we'll go straight into a little bit of background on Megazone, and I'm going to hand this off to Nick. Again, as I said, you can obviously find out more by looking at our website, opscruise.com. So, Nick.
B
Thanks. So Megazone Cloud was founded in 1998. We mainly focus on helping customers utilize the cloud better. We are headquartered in Seoul, Korea, and we have offices here in Palo Alto, as well as Toronto, Canada, Tokyo, Hong Kong, Vietnam and Shanghai, and we recently opened one up in Australia as well; that's the latest office. Our main focus is our customers' use of cloud and helping them use it properly.
B
We work with various partners, such as OpsCruise, through ISV partnerships in Korea, and one other thing that I do here in Palo Alto is look for leading-edge technology companies to bring their technologies to Asia, to help the customers in Korea as well as reduce the gap between the U.S. and other parts of the world.
B
We also provide a service called HyperBilling to help customers with the billing on their cloud usage, so multi-cloud billing services, as well as SpaceONE, also known as Cloudforet, which is a Linux Foundation project that helps manage multiple clouds in a single portal. Next slide.
B
So the reason why we started working with OpsCruise was that internally we were facing challenges. Like I mentioned, we have Cloudforet and SpaceONE, which is provided as a SaaS product.
B
We wanted to make sure that we stay within the SLOs and provide the right level of quality of service to the customers, and we saw the same pattern with our own customers as well. One of the largest mobile telecom companies, a mobile service provider in Korea, was actually working with us and asking for help on their Kubernetes environment. We were trying to solve their problem: they were facing having to use multiple Kubernetes tools, it's siloed, it's hard to maintain and operate, and they needed to translate the metrics and perform complex correlation.
B
The problem with this correlation is that unless you know exactly what you want to do and you know every aspect of the metrics, it's very difficult to do, and getting that information and providing it to DevOps, to enhance the DevOps practice and skills, is not an easy thing.
B
It takes way too long for newcomers to get trained and start using the environment and the tools properly, and keeping up with new releases of the various open source projects is not an easy task. And it's not just us; it's our customers, and also globally: it's hard to keep the talent when you hire new operations people or DevOps engineers.
B
It takes at least three to six months for them to understand what it is and what they need to do to keep the environment up, make it better and upgrade it. So the solution we were looking for was something that we could automate but easily adopt, and that we could train others on easily.
B
We also wanted a single pane of glass with all the integrations and telemetry coming into the environment, and it doesn't hurt to get machine learning and AI-assisted troubleshooting, because we all know we don't have enough people to do all the troubleshooting by ourselves; if we can get help, that's always better. And we wanted easily understood SLOs and quality of service for our DevOps and service owners. So that's why we partnered up with OpsCruise, to try to help our customers with these challenges.
C
It's all yours now. Thanks, Nick. I think you set the stage, and hopefully many of you recognize or empathize with the issues that Megazone and their clients were facing in Kubernetes. I would say, having been working in this area for six or seven years, it's not an easy problem. So, to get right to the heart of it: Kubernetes application performance troubleshooting is not just about Kubernetes.
C
It's about everything that sits above it and below it. Think about what has happened with cloud-native services: just the number of objects, containers, services. We've seen three thousand, five thousand containers in a single cluster, and a large number of nodes. And it's not just that: you have service-to-service calls, you have SaaS entities that are not managed by Kubernetes, there could be external calls to APIs, and then, of course, the reason you go cloud native and agile is so you can constantly make changes.
C
You can make changes to the services, you can scale out, scale in, change the code version, and so on. So now, on top of this complexity in scale, you have dynamic changes that affect everything, because every time you add a new service or take one out, your dependencies, who's talking to whom, have changed. And if you don't do the deployment right, how do you know that caused the problem? Or at runtime something happens that you didn't know about in the infrastructure, or something else changed, or the configurations you set changed.
C
All of that is not just a Kubernetes problem. It goes all the way up: obviously there is the dependency on infrastructure below, but it also goes all the way up to the microservices and therefore the applications. So the question is: if the application does fail, how do you know whether it's at the code level, something in a third-party service, Kubernetes, or the infrastructure?
C
And so that makes it complex. The good news, and this is part of the reason we love working inside the CNCF ecosystem, is that all of the telemetry you need is there: metrics from Prometheus; flows, whether I'm looking at layer-4 bytes and packets or even layer 7, whether it's from Istio or eBPF, the extended Berkeley Packet Filter, which tells us request rates, response times and error rates; and events from kube-state-metrics.
C
So you can capture the changes you made; the logs that are coming in, whether at the application level or for specific container-level issues; and, of course, traces via OpenTelemetry. So if you look at it, between Prometheus, Istio, Kubernetes, Fluentd and, of course, open source like eBPF, Loki, Jaeger and Zipkin, you have all of this telemetry, as well as information on the configuration out there. So what's the challenge? Just like the number of objects: for every object there are metrics, flows, events, logs and traces.
C
So we have a fundamental cardinality problem when we are trying to debug and figure out, in real time, how we help Ops know what's going on, to troubleshoot a problem when something is slowing down in Kubernetes. That's the focus right now. As I said, and I'm just re-emphasizing, all of the things that we are looking at, all the metrics and telemetry, are already available, so I'm going to look at that quickly. If you look at all the standard metrics, you'll recognize most of these logos.
C
But of course you also want to capture cloud-level metrics, because that gives you the infrastructure information on the VMs you are using, or the persistent volumes and the storage. What we want to do is leverage all of these, and this is how you should think about it: all of that is available, so leverage it to figure out how the pieces are tied together, because that's going to tell you, contextually, what those things mean across the different telemetry.
C
What you want to do is stream processing. I'm not going to go into a whole lot of detail here; you can always look this up, and maybe some of you are doing something similar. Collect all the telemetry and bring it together contextually, so you know how the pieces align and link to each other, and you're not looking at metrics, logs, traces, flows, or any events happening across the infrastructure, Kubernetes and the app in isolation, because then you are doing all the work.
C
If you can pull that together and get the topology, meaning who's talking to whom at the application level, down through the dependencies to the infrastructure, that gives you better context around what's going on. And then, of course, look at the flows, understand behavior, and, as this sequence says, be able to isolate the cause. So you have all this information, and the context across this telemetry and the configuration changes is very important. I'll emphasize this again: the whole idea is to get enough information across all of this.
C
So we know what the state of the application was at the time the problem was detected. This is where analytics comes in, and this is where you need automation, because there is no way one or a few SREs are going to be able to do this manually without some automation; that's the whole point. The point to make here is that all of the data is available; as close to real time as you can, can you pull out the insights you need, so that by the time a problem does happen,
C
you are able to figure out what the problem is. So let's talk a little bit about what that means for automating cause isolation. If you're a computer science geek and you've looked at this problem, there's something called a non-polynomial-complete, or NP-hard, problem. What does that mean? It means that the number of possible combinations across the data that you have, when trying to get a sense of what is happening in time and space, makes this very, very hard. But you know what's interesting?
C
I've talked to customers and people who are dealing with this, and they say one of their best sources for isolating a problem is their senior SRE who's seen it all. So, because it's not a simple problem and it's highly non-linear, think about how we solve this problem as humans. We leverage information about the IT stack. We know that when a container is working and it is sitting on a node, it is using the node's resources, and we know that those resources are coming from the cloud.
C
We know that if there's a shared service, that shared service can become a bottleneck when multiple things are talking to it, especially things like databases, etc. These are aspects of knowledge that really good SREs use. They also do what I call, as you can see in the second bullet, following the breadcrumbs: they will look at the dependencies, who's talking to whom, because they know that if the alert is happening somewhere, there may be a slowdown somewhere else.
C
They start by looking at the alerts, then at the metrics, logs and traces, and, let's say, if you have a service that has a certain kind of behavior, say it's I/O intensive, they'll start looking there and asking, hey, could this be the problem? So expert SREs who do cause isolation use all of this information. So why not follow that?
C
Instead of trying to look at everything and throwing all the information at you without being able to narrow it down, the whole idea is that an automated system has to follow these breadcrumbs: understand, use knowledge, use the information appropriately, and narrow down to a very small set of objects, which gets you closer to the likely cause. I would venture to say that perfect cause isolation in real time is not theoretically possible.
C
However, if you have enough information and you have used the information correctly, you can get to it very, very quickly, and that kind of decision system is what most SREs use. The block diagram on the right is basically saying: when you see the alert, look at where the alert is, the source, and then start looking around it. What is the performance? Who was it interacting with? Was there an issue, depending on the type of alert, in Kubernetes? Was it in the infrastructure?
C
Were there problems with saturation? Based on that, you start looking at and eliminating the possible causes that you don't have to look at. So that's what the dynamic decision system is, and that's what we're going to talk about: how to leverage all the information and extract insights that will help isolate the problem.
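[Illustration: a minimal sketch of one pruning step of such a decision system. The dependency map, metric values, thresholds and service names below are all hypothetical; the point is only the shape of the logic: start at the alerting service, keep neighbours that look unhealthy (slow, Kubernetes events, saturated node), and drop the rest.]

```python
# Hypothetical, minimal sketch of one "follow the breadcrumbs" pruning step.
# Inputs are plain dicts standing in for real telemetry; nothing here is a
# vendor's actual data model.

# Who calls whom (edges point from caller to callee).
dependencies = {
    "web": ["cart-cache", "catalog"],
    "cart-cache": ["cart-server"],
    "cart-server": ["postgres"],
    "catalog": [],
    "postgres": [],
}

# Per-service signals gathered from metrics, kube events and infra checks.
signals = {
    "web":         {"p95_ms": 6500, "k8s_events": [],                   "node_saturated": False},
    "cart-cache":  {"p95_ms": 6100, "k8s_events": [],                   "node_saturated": False},
    "cart-server": {"p95_ms": None, "k8s_events": ["ImagePullBackOff"], "node_saturated": False},
    "catalog":     {"p95_ms": 40,   "k8s_events": [],                   "node_saturated": False},
    "postgres":    {"p95_ms": 12,   "k8s_events": [],                   "node_saturated": False},
}

SLO_MS = 4000

def suspicious(name):
    """A service stays on the suspect list if it is slow, has Kubernetes
    events, or sits on a saturated node; otherwise it is eliminated."""
    s = signals[name]
    slow = s["p95_ms"] is None or s["p95_ms"] > SLO_MS
    return bool(slow or s["k8s_events"] or s["node_saturated"])

def walk(alerting_service):
    """Breadth-first walk from the alerting service, pruning healthy branches."""
    suspects, queue, seen = [], [alerting_service], set()
    while queue:
        svc = queue.pop(0)
        if svc in seen:
            continue
        seen.add(svc)
        if suspicious(svc):
            suspects.append(svc)
            queue.extend(dependencies.get(svc, []))  # only follow unhealthy branches
    return suspects

if __name__ == "__main__":
    print(walk("web"))  # ['web', 'cart-cache', 'cart-server']
```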
So, before I go into demo mode and show you how this works, as an example I want to give you an idea of what you're going to see in the demo and what instrumentation is being used.
Our deployment architecture is shown here. Effectively, all the blue that you're seeing, as you'll recognize, is open source instrumentation sitting inside the Kubernetes cluster: cAdvisor, node exporter, your familiar Prometheus components, deployed as DaemonSets in the cluster, and Promtail is being used to collect the logs for Loki.
C
So that's also a DaemonSet. I would add that in the node exporter, in order to understand flows, and not just bytes and packets coming into and out of the node and into and out of the containers, we are leveraging eBPF. So our node exporter also uses an eBPF collector, so we can look at what L7 metrics, sorry, request rates, error rates and response times, are happening at the level of a container within a node. We can get that information.
C
We also have Jaeger in most of our environments, so we can collect traces. So the four primary things that you're seeing here are Jaeger for traces, Prometheus for the metrics and flows, Loki for the logs, and then, because we want to look at changes that are coming in, we're also capturing kube-state-metrics; that's one way to see what changes have happened within the cluster. Then what we do, essentially, once this is instrumented, and you can do this yourself with a Helm chart, is that we basically run four, plus one additional,
C
gateway pods, as we call them, one for each type of telemetry. So we're collecting the metrics through the orange one you see, the Prometheus gateway, which collects all the metrics from Prometheus. The log gateway bar is basically collecting all the logs through the Loki collector. The Kubernetes gateway is collecting kube-state-metrics and changes, so we know what the configurations are and what changes are being made. And then, finally, the fifth one you're seeing is the cloud gateway, because we want to know what the infrastructure is, so we understand the dependencies.
C
All of these essentially collect data and, just as you may be familiar with, scrape on a classic interval that you set up, whether it's 30 seconds or one minute, depending on your scenario and bandwidth. The data is collected, compressed, and sent to a SaaS service, which is shown below in the orange hexagon, which is where our controller is.
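[Illustration: a rough sketch of the general scrape-compress-forward pattern described here. The Prometheus URL, the query and the ingest endpoint are placeholders, not the product's actual API.]

```python
# Rough sketch of a scrape-compress-forward loop, roughly the pattern the
# telemetry gateways follow. The endpoints and query below are placeholders.
import gzip
import json
import time

import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"   # in-cluster Prometheus (assumed)
BACKEND_URL = "https://ingest.example.com/v1/metrics"      # hypothetical SaaS ingest endpoint
SCRAPE_INTERVAL_S = 30                                      # 30s or 60s, per the talk

def scrape():
    """Pull one batch of samples via the standard Prometheus HTTP API."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "container_cpu_usage_seconds_total"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def ship(samples):
    """Compress the batch and forward it to the backend."""
    payload = gzip.compress(json.dumps(samples).encode("utf-8"))
    requests.post(
        BACKEND_URL,
        data=payload,
        headers={"Content-Encoding": "gzip", "Content-Type": "application/json"},
        timeout=10,
    )

if __name__ == "__main__":
    while True:
        try:
            ship(scrape())
        except requests.RequestException as err:
            print(f"scrape/forward failed: {err}")
        time.sleep(SCRAPE_INTERVAL_S)
```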
C
The controller is basically taking all that information through the staged pipeline I showed previously, processing it and extracting from it, bringing it into context, discovering the topology, figuring out what the dependencies are, both at the service level and down from Kubernetes to the infrastructure, and then looking at the data and figuring out, using machine learning, what the expected behavior is. So that's roughly the scenario.
C
So, today, the specific use case that I want to talk about in the next 20 or so minutes is an application slowdown that we have detected and how we can analyze it, especially in the Kubernetes kind of scenario, a Kubernetes problem that caused an application slowdown, and how we do that. Now, I want to make a note here that in the case we are looking at, we are not using tracing.
C
Clearly there are ways of using tracing for this; that's a whole other problem, and it's a separate capability, something we have called trace paths. In fact, there is a CNCF live webinar on that; if you're interested you can follow up with us. But in today's case we're going to do root cause analysis for a Kubernetes issue affecting an application, with no tracing enabled. All right, so at this point I'm going to switch my screen to the demo tab; let me see if I can do that.
C
All right, I think that's the one. Libby or Nick, if my screen is showing up, just confirm.
C
So, in fact, if I hover down, you can see this basically shows requests coming in. We have an example load balancer talking to nginx, the load balancer going into this container, this service going into another service, etc., all the way down to, hey, there is a Postgres database, and it's actually running on AWS. So this application map that I'm showing you here is being built from the data that we are getting, and obviously you can organize it.
C
You can organize it by labels on the application, or by the namespaces that are running on it. In fact, there are different applications running here that I'll show in a minute. It's running on a five-node Kubernetes cluster, and I'll show that in a few minutes, along with the different paths. This is a really small application; this is the test bed that we use, with pods, containers and SaaS services. It is in fact running on AWS, as you can see from the load balancers, and there are actually multiple clusters, but they're connected together.
C
It's a multi-cluster environment, but we won't focus on that today. And if you want to know what the different namespaces are, I think I can, no, not this screen, let me share this tab. If you can see this, I can actually search by namespace, and what you're seeing here is the application called shopping cart, a small e-commerce application. You can also see a robot-shop application, which is an IBM application, and the OpsCruise,
C
sorry, there is some slowness here. Yes, so I'll start again. What I've done is I've tried to show you the different application namespaces, what's been deployed in this cluster, and what I was showing was that today we'll focus on this little e-commerce app called shopping cart. And there is more there: there is our OpsCruise deployment in its own namespace, robot-shop, which is another e-commerce application, and online boutique, which is used for tracing, all of these.
C
You might want to check that. So, to give you an idea of why this matters: when I look at any of these containers, for example nginx, with traffic coming in from the service, in fact, if you look at this as I've highlighted, you can see, because we collect not only the Prometheus metrics but also the flow metrics, the average response time between this container and its corresponding service. This is actually very useful if you think about it.
C
A lot of enterprises use this kind of environment and ecosystem for monitoring, and with OpsCruise they can see that dependency right away. Why do I say that? Because when we're doing root cause analysis, if I'm not involved with this cart service and there's a problem here, then I don't have to look here, or in some other application service like this one; I don't need to look at it. But if the data is flowing through where there's a problem, I know what to look at.
C
I have narrowed down the focus, and I can see visually what's going on. That's one key part: knowing the topology and the dependencies is the first thing that most of us use. This is what an expert SRE will say: I don't need to look here if the problem is over there. Well, how do I know that in real time, as Kubernetes is changing? I should be able to see this dependency and how the data is flowing.
C
So let's go one level deeper here. I'm going to zoom in on this shopping cart app; hopefully you can see while I'm doing that. If I look at that, and I'm going to shift my screen here so we can see it: the key here is being able to see, for this container, all of the telemetry in context, which is what I was saying earlier. I'm going to move this up a bit. So, metrics:
C
What are the metrics coming in? Obviously from Prometheus, etc.; I can get that. What are the events? If there are no events, I know this one doesn't have any, but often there are lots. What are the logs in there? I can look for specific logs; for example, is there a problem on this container, on this nginx, has something happened? Anything that I'm recording I can look for, errors, etc. I'm not going to go back, but I have all of that in context, in what we call a quick view. And what is it talking to, in case you're
C
All right, let me get onto a better connection; sorry, Libby. What I was getting at is putting things in context. We talked about metrics and logs, but also the connections, because this is what we want to know when we're trying to see who's talking to whom and where the problem is. So in this case, just to quickly summarize: I know what is inbound, I can disambiguate that, and I even know how much data is coming in.
C
I know what else is coming in, and actually, if you look at this, if I click on this, let me see if I can find the right one, it might be hard to do because I'm trying to show it on this pod itself, there are multiple connections, because there are different ports involved. And outbound, as you can see, it is talking to this other nginx controller, and in fact you can see that on the screen.
C
So what we were trying to point out is that if there are trace spans involved and I want to go look at the trace, I don't have one explicitly here, but we can get trace paths as well as in-service performance. More interestingly, what about the configuration? We can pull in the actual Kubernetes manifest, so we know what has been designated: how much the resources are, what volumes it is talking to, whether things are healthy, what the rate is at which we are scraping, the timeout settings, all of that information, including the namespace.
C
Everything is right at your fingertips, so you don't have to go switch to kubectl commands. This is important, and any changes that happen, we will update and present here, so having everything together in context is very, very useful.
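[Illustration: the same information, the pod spec, resource requests/limits and probe settings, can also be read with the official Kubernetes Python client instead of kubectl; the namespace and label selector below are examples only.]

```python
# Minimal sketch: read a pod's spec (what you'd otherwise get from
# "kubectl get pod -o yaml") with the official Kubernetes Python client.
# The namespace and label selector are examples only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("shopping-cart", label_selector="app=nginx")
for pod in pods.items:
    print(f"pod: {pod.metadata.name}  node: {pod.spec.node_name}")
    for c in pod.spec.containers:
        req = c.resources.requests or {}
        lim = c.resources.limits or {}
        print(f"  container: {c.name}")
        print(f"    image:    {c.image}")
        print(f"    requests: {req}  limits: {lim}")
        if c.readiness_probe and c.readiness_probe.timeout_seconds is not None:
            print(f"    readiness timeout: {c.readiness_probe.timeout_seconds}s")
```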
I'll do one more switch before I get into the root cause problem. For this application, as I showed, there are multiple containers in this environment. What does the node map say? The node map shows where those application pods are placed.
C
There are five nodes here, and I can see which of these nodes have which containers and, more interestingly, how much is being used. So another view that we're able to pull together from that is the usage against the requests: which containers have the highest requests, what the request and limit are, etc., for both CPU and memory, on a node-by-node basis, so we can see whether you're over-provisioning or under-provisioning. So, for example, here the request on this one, this Prometheus node
C
exporter, is set at 200, and the usage is already exceeding that, so the reason it's red is because it's burstable and it might be prone to eviction. So that gives you another view, to help right-size the environment.
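[Illustration: a sketch of computing the request-versus-allocatable view per node with the Kubernetes Python client. Quantity parsing is simplified to cores and millicores, just enough for the example.]

```python
# Sketch: sum container CPU requests per node and compare with what the node
# can allocate, similar to the node-map view described above. Quantity parsing
# here only handles plain cores and millicores.
from collections import defaultdict

from kubernetes import client, config

def cpu_millicores(quantity):
    """'200m' -> 200, '2' -> 2000. (Real parsers handle more suffixes.)"""
    if quantity is None:
        return 0
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

config.load_kube_config()
v1 = client.CoreV1Api()

allocatable = {
    node.metadata.name: cpu_millicores(node.status.allocatable["cpu"])
    for node in v1.list_node().items
}

requested = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces().items:
    node = pod.spec.node_name
    if node is None:
        continue
    for c in pod.spec.containers:
        req = (c.resources.requests or {}).get("cpu")
        requested[node] += cpu_millicores(req)

for node, alloc in allocatable.items():
    used = requested[node]
    flag = "  <-- check: requests near/above allocatable" if used > 0.9 * alloc else ""
    print(f"{node}: {used}m requested of {alloc}m allocatable{flag}")
```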
So, going back to this application: what happens when we have a problem? In order to look at that, I'm going to jump into, not this, but the alert view, so I'm going to switch this screen again.
C
Let me know if it's showing up. I actually picked out an alert, 105.915, so Nick, I'm going to rely on you: can you see that my screen has changed to the alert window? Yes, it has. All right, so the primary example that we'll use today is an RCA analysis that we are doing on a service level objective breach. We do this automatically because we're collecting flow metrics on that ingress on the shopping cart I showed you, and it was run a little while ago.
C
So I can just go through this if I click on it. This is where things start getting interesting in terms of how this is automated. We capture alerts automatically based on explicit alerts from Kubernetes and the infrastructure, predictive alerts using ML, which I'll talk about in a few minutes, but also when there are delays on the service level indicators, and I'll go back in a minute and show that to you. Actually, I should do that now.
C
Let me go back to this example and show you that, for this service, the one feeding into that SLO on the ingress, there is something called SLO/SLI. You can see that something has been set: the suggested value is this one, and that's done automatically, by analysis by the system. We can also look at what the current max is.
C
Someone has manually, the user has, set it at four seconds, and you can obviously change that; if you're looking at an outbound connection that is customer-facing, you can set it accordingly. So that's where we can set SLOs, right in this app map. And so, what we're looking at now, sorry, I'm switching back to the tab here. Can you see my tab again?
C
Yeah, so I'm going to share this tab. What I was going to show you, and I think I missed it on the other tab, is that on the data coming in on the ingress side we can detect an SLO/SLI breach. And while we can do this by using machine learning to determine what the expected value should be, here someone has set it at four seconds, and this is what we'll use as an example of where the breach is. So the service level objective for this application,
C
on the ingress side, has been set at four seconds. All right, now I'm going to switch tabs, so bear with me, and then I'm back to that alert that I showed you here. I apologize for jumping back and forth, but I'm trying to give you the context of what the SLO was and what the system does in this automated way. So for that SLO breach on that shopping cart app, there is a breach detected because, on the flow,
C
if you look, it says it's automatically detected because, as you can see, against the four-second SLO we are at 6.657 seconds, and there's a lot more detail here. Obviously this one is a very simple rule; you can see how you can set it up: the max across what's coming in, on the graph, based on the response time on the flow, and if it's more than 4,000 milliseconds, that's what triggers it.
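[Illustration: the rule described here, breach when the max response time over the window exceeds 4,000 ms, expressed against the Prometheus HTTP API. The metric name is a placeholder, since the talk doesn't give the actual flow metric names.]

```python
# Hedged sketch of the latency SLO rule described above: breach if the max
# response time over the last 5 minutes exceeds 4000 ms. The metric name
# "flow_response_time_milliseconds" is a placeholder, not a real exporter metric.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
SLO_MS = 4000
QUERY = (
    'max_over_time('
    'flow_response_time_milliseconds{namespace="shopping-cart", direction="ingress"}[5m])'
)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    _, value = series["value"]          # instant vector sample: (timestamp, value-as-string)
    observed_ms = float(value)
    if observed_ms > SLO_MS:
        print(f"SLO breach: {observed_ms:.0f} ms > {SLO_MS} ms for {series['metric']}")
```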
C
It's a latency-based alert, and all of the details and the aspects of it that we already have in context are given here. But the most important thing from the user's perspective is: we saw an SLO breach; the question is, how do I know why this breach happened? So what we've done in this system is that the decision tree running in the background, the AI engine, does this analysis, and this is where it starts getting interesting. So Nick, I'm assuming my screen has changed.
C
Can you see the screen? I'm on the analyze tab. Now, this is interesting. What you're seeing here is done automatically, in the background, whenever an SLO alert kicks off; remember, the decision plan is kicked off when we detect a problem. It is actually saying, okay, for that service there are five connection paths, and the highest-latency path has been pulled out. The whole idea is that, knowing the context, we've narrowed in on the path, and everything that's red here is basically saying all of the hops have high latency along that path.
C
So I'm going to close that, and you can actually see laid out here, in this high-latency path: nginx coming into its pod and container, to the web server service, its pod and container, to a cart cache, a caching element, its pod and container, to the cart server, its pod and container, down to the database server. And then, oh, I think I am actually on, am I on the right one? I might be on the wrong one, actually. Let me go back and see if I got the right alert here, the one I want to look at.
C
Okay, I think I have a different alert here. The one I want to show you is slightly different, so let me go back and retrace. That one is also an interesting one, but it is not a Kubernetes one. The one I want to use, and I apologize, I'm doing this in real time, I think it's this one, which has a Kubernetes-related issue. This one also has a breach, this one even higher, the same case at a different time: 15.454 seconds, exceeding the four-second service level,
C
and so forth, all the way to the back end, the cart server. What's really interesting to note is that this is not the only path. Normally a user would go and ask, hey, what are the possible paths that this nginx container, sorry, this service, depends on, and there are three paths. You could have more; you could have 10 or 15, depending on how complex the application is. But the reality is that what slowed down is the path that has the highest latency, and, not surprisingly, that's the one that's got the other alerts.
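[Illustration: a toy sketch of the path-ranking idea, enumerate the dependency paths below the alerting service and keep the one with the largest end-to-end latency. All names and numbers are made up.]

```python
# Illustrative sketch of ranking dependency paths by end-to-end latency and
# keeping only the worst one, as the analyzer view does. All data is made up.

# Edge latencies (caller, callee) -> observed response time in ms.
edge_latency_ms = {
    ("nginx", "web"): 900,
    ("web", "cart-cache"): 2400,
    ("web", "catalog"): 30,
    ("web", "payments"): 55,
    ("cart-cache", "cart-server"): 2900,
    ("cart-server", "postgres"): 150,
}

# Adjacency derived from the edges above.
children = {}
for caller, callee in edge_latency_ms:
    children.setdefault(caller, []).append(callee)

def all_paths(node, path=None):
    """Enumerate every root-to-leaf dependency path starting at `node`."""
    path = (path or []) + [node]
    if node not in children:
        return [path]
    paths = []
    for nxt in children[node]:
        paths.extend(all_paths(nxt, path))
    return paths

def path_latency(path):
    return sum(edge_latency_ms[(a, b)] for a, b in zip(path, path[1:]))

paths = all_paths("nginx")
worst = max(paths, key=path_latency)
print(" -> ".join(worst), f"({path_latency(worst)} ms)")
# nginx -> web -> cart-cache -> cart-server -> postgres (6350 ms)
```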
C
The system can then isolate that and say: let me just show you the one that's really relevant, and that's key. So that's the first thing, the decision system and automatic causal analysis. Let's walk through that.
So obviously this is a situation where it was slow; we've already detected that. If I go to the web server, it says, hey, this is slow because its response time is higher, so the service is slow. That's not surprising. If you analyze it, it'll say, yep, it's slow because the response time coming back from downstream has slowed things down.
C
So that's not surprising; this slowdown was expected. What about the next one? That is also high. This one also has an SLO breach, this cart cache. Let's look at that again. It also has its own analysis tab, and it is also higher than expected, and this one, I believe, is an ML-based alert, actually, because the response
C
time has been higher than expected, and it captured that. It says, okay, there are further dependencies further down, so something is slowing everything down; going back, going back again to that source. Let's go down to the next one. What about this? This one was low. What about this cart cache? This says the network metrics are not normal, and if I click on it, you'll see there is something called an RCA tab, where it's analyzed beyond just being high, meaning triggered based on the expected service level for response time. If I click on it,
C
this is where it starts getting interesting. What you're seeing here is what we call our fishbone analysis. That means it's categorizing that container's behavior based on memory, CPU, file system, any I/O dependencies, the demand side, meaning what is coming in and the responses to it, the supply side, meaning, in our case, what is going on downstream, and even configuration changes. And if you look, one of the things that sticks out is, hey, the number of errors has actually increased in this case. That's one.
C
Second, the packets here decreased: the data that was going outbound has decreased, sorry, a 100% decrease, so there's nothing coming in and nothing going out either. So this is the suspicion for why the network-level metrics look wrong, and this has been detected just by having an expected-behavior model that was learned over the past. The reason this matters is that threshold-based alerts are usually high-water marks, and the challenge with that is that when things drop below normal, you typically won't detect it.
C
But if you know the expected behavior, and data comes in but data doesn't go out, or isn't received on the downstream side, this is where an expected, predicted behavior model helps.
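[Illustration: a toy comparison of a high-water-mark threshold versus a band learned from recent history; the learned band catches the drop to zero in outbound traffic while the threshold never fires. Data is synthetic.]

```python
# Toy illustration: a high-water-mark threshold misses a drop to zero, while a
# band learned from recent history (mean +/- k*stddev) flags it. Data is synthetic.
import statistics

history = [480, 510, 495, 505, 490, 502, 498, 507, 493, 501]   # normal outbound packets/s
current = 0                                                     # traffic suddenly stops

# 1) Classic high-water-mark alert: only fires when the value is too HIGH.
HIGH_WATER_MARK = 1000
print("threshold alert fired:", current > HIGH_WATER_MARK)      # False -> missed

# 2) Expected-behavior band learned from history.
mean = statistics.mean(history)
std = statistics.stdev(history)
k = 3
lower, upper = mean - k * std, mean + k * std
anomalous = not (lower <= current <= upper)
print(f"expected band: [{lower:.0f}, {upper:.0f}] packets/s")
print("behavior-model alert fired:", anomalous)                 # True -> caught
```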
So, going back to what we detected: the reason we flag this, and the analysis says there's something wrong, is the fact that there are more errors coming in. So, going back along that chain again, that's what we detected here. What does the cart server say?
C
We are starting to see something here. It says the cart server does not have any pod to serve requests. So if you are an SRE and you know this dependency: why is this slowing down, and why is there also no data? Well, if there is no pod here, there is no data coming back and no requests going in, so that itself is telling us that the network metric anomaly detected on the ML side and this pod not serving requests are connected. So we have a problem
C
that's further away, downstream, until we come here. If I go to this container, it says, well, in fact, I already gave you the answer: there's an ImagePullBackOff error here, and if you look at this, it says, hey, the cart server container is terminated. In fact, if I click on it, it'll just say, actually, I don't have a container there; it's terminated.
C
So if it's terminated, of course, what that says is that I'm not going to have anything to perform the cart work and respond back. If I go to the container, now we can start looking at what has happened. The analysis will now say what is going on here: there is no pod running. And now, if I click on analyze, we use the same fishbone, except it's not for a container;
C
this one is specific to the Kubernetes problem. If you look at it, the same fishbone that you saw on the container side here has classes for pod scheduling failures, node failures, startup failures and runtime failures, and this is predefined; remember, we talked about curated knowledge. Someone who is familiar with Kubernetes knows how startup failures, deployment failures and runtime failures can happen. And if you look at this, it says the pod is constantly in transition, and it's saying the container is not ready. We are collecting this from the Kubernetes state events.
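[Illustration: the signals mentioned here, container not ready, back-off restarts and the bad image name, are visible in the pod's container statuses and namespace events through the Kubernetes Python client; the namespace is an example.]

```python
# Sketch: surface the "container not ready" / image-pull signals the fishbone
# uses, straight from the pod status and events. Namespace is an example.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("shopping-cart").items:
    for status in pod.status.container_statuses or []:
        waiting = status.state.waiting
        if not status.ready and waiting is not None:
            # Typical waiting reasons here: ImagePullBackOff, ErrImagePull, InvalidImageName
            print(f"{pod.metadata.name}/{status.name}: "
                  f"not ready, reason={waiting.reason}, message={waiting.message}")

# The same story shows up in the namespace events (what the state/event gateway reports).
for event in v1.list_namespaced_event("shopping-cart").items:
    if event.reason in ("Failed", "BackOff"):
        print(f"{event.involved_object.name}: {event.reason} - {event.message}")
```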
C
There is a back-off restart. Why is there a back-off restart? Because the image is not loading: it keeps going back and trying to pull it, the container is not ready, the pod doesn't come up, and what happens further down is that the service that calls it does not respond. There's one more thing that's detected automatically, dynamically: there is an image name problem, the name is incorrect, and because it's an invalid image name, not surprisingly, it continuously tries to pull it and never gets ready, which means that, as a result,
C
this service starts seeing network errors, and errors that increase the response time, and, as a domino effect, that propagates all the way back to the nginx at the front. I know we'll run out of time, but what I wanted to point out is that you have to be able to pull together all of this information in context, eliminate the paths, look at the dependencies and analyze each of them. If you noticed, we were really looking at flow metrics to understand why this was slow, all the way down to the cart, and ML-detected metrics to understand why that problem happened.
C
The nail was a bad image name, and that propagated all the way up to create an SLO breach. Okay, so I just wanted to give you that. I know I wanted to show you a couple more examples, but having specific contextual knowledge of the system and analyzing things in sequence, bringing all of that together, is the key here for us to be able to understand how this problem unfolds. So I'm going to go back and bring up this other deck here.
C
As I said, it can never be perfect, because you don't have 100% of the information at 100% time granularity, but you can use all of this information to solve the problem. In this case, we were not even using traces. So, just to summarize before we go into Q&A and throw it open for discussion: as you're probably well aware, if you're using Kubernetes, there are multiple issues that can impact application performance, and the cardinality, the complexity of space and time, and the dynamism make this RCA quite challenging.
C
We can't solve the problem by blind correlations; that's not going to help, because the number of possible correlations is going to be very large, so we have a cardinality problem there as well. But if you look at how to resolve this, and how to do it very well, it's really, as I said, to follow the breadcrumbs and eliminate the things that are not relevant, and that's where a decision system is needed, one that works at runtime and does that automatically for you, so you're not spending the time. Leveraging curated knowledge is absolutely important.
C
There is no such thing as blind correlation with blind ML that works; it'll just lead you down false paths. That means understanding the full telemetry and how the configuration is changing. And, finally, the message for all of you who are following the CNCF: you can leverage all the open source CNCF instrumentation.
A
Look, there was a question from Oliver: does OpsCruise show metrics for services in the cloud provider that hosts the Kubernetes cluster?
C
Exactly, yes, we do. Let me share this tab. Yes, of course you can, because you can collect that data. What we do, then, is use the data that we're getting on the infrastructure. I think the example I'm showing here, though, is what we collected, for example, from the load balancer, which you have to specify to the cloud, whether it's this one or another. So here's an example of collecting data from AWS on the load balancer that we showed; it's also visible, Oliver, in the app map.
C
Which we were talking about earlier. So, for example, these are Postgres, oops, sorry, Postgres metrics, and you can get the queue depth, etc., from the cloud vendor itself. Obviously it's not Prometheus or open source, so we have to be able to pull that in to do that. Hope that answers your question, Oliver.
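[Illustration: pulling an equivalent load-balancer metric from the cloud side is a standard CloudWatch call; a sketch with boto3, where the region and the load balancer dimension value are placeholders.]

```python
# Sketch: fetch ALB target response time from CloudWatch, the kind of
# cloud-provider metric discussed above. The region and load balancer
# dimension value are placeholders.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/shopping-cart/0123456789abcdef"}],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'avg={point["Average"]:.3f}s', f'max={point["Maximum"]:.3f}s')
```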
C
I'll just, if there are questions, post this, in case folks want to follow up later, want to talk to us, or have a general question on CNCF metrics and monitoring, whatever it is.