From YouTube: Live Kubernetes Debugging with Elastic Stack
Description
Kubernetes Community Days Bengaluru'21
Your Kubernetes app is down. Your users start ranting on Twitter. Your boss is standing right behind you. What do you do? This talk walks you through a live debugging session without panicking:
- What do your health checks say?
- Where does your monitoring point you?
- Can you get more details from your application's traces?
- Is there anything helpful in the logs?
- What the heck is even deployed?
We are using the Elastic Stack in this demo with a special focus on its Kubernetes integration with metadata enrichment and autodiscovery in combination with APM / tracing, metrics, logs, and health checks.
Hi and welcome. I'm Philip, and without further ado, let's jump into today's topic. My session is about live Kubernetes debugging with the Elastic Stack. I work for Elastic, the company behind Elasticsearch, Kibana, and Beats; maybe you've heard of the ELK Stack or the Elastic Stack, and that's what we do. So we're deeply ingrained in the whole area of debugging, monitoring, and telemetry, sometimes called observability. Whatever you want to call it, we probably have some tools to help you, and I will show you what you can actually do with them.
I'm running a very simple application, the Spring Boot Pet Clinic, on Kubernetes, but I don't want to focus too much on the application itself. It's simple, and we have included some intentional errors to make things break a bit so we can actually debug them. What is a bit more interesting is the actual architecture.
Here you can see we have an ingress coming into NGINX, which is backed by Node and a React application on top of that. From there it fans out to my Spring Boot application, again behind NGINX, or to a Python application, and those are backed by MySQL and Elasticsearch respectively. We'll start monitoring and debugging those with the Elastic Stack. If you've never seen the Elastic Stack, this is what it looks like: you have Beats, which are lightweight agents or shippers.
They can collect data such as log files or metrics and then forward them either to Logstash, if you need a big, heavy tool that is very powerful (to keep it simple today, I will skip that), or directly to Elasticsearch, which is what we will do. Kibana on top of that can then visualize what is going on.
So here we have Kibana, and you can see we have collected around 27,000 or so log events. We can open one of these up to see what we have. You can see this was collected over TCP, and we have some information about the operating system. You can see that it's running on Kubernetes, and we add the Kubernetes metadata from the API server, so you can see what the pod name is, which node it is running on, what the namespace is, all of these pieces of information that you might want to use.
You could filter down on those, but since we are mostly interested in the infra namespace, I will just leave it at that for now. You can see what we have collected here: we got a redirect when we tried to request the /api endpoint, and this went through NGINX. But since these are too many events to go through manually, let's filter down a bit: we have the kubernetes fields, and then we can use the labels.
After a few moments you can see that we now have only 8,000 or so events, and those still include a lot of the services that we're running. They will include a lot of NGINX and MySQL logs, and we could, for example, just filter those out as well. I know that those are being collected by the so-called modules, recorded in the event.module field, so I can use a more structured filter here.
A
Here,
I
could
say,
is
not
one
off
and
then
I
could
just
exclude
the
ones
that
I'm
not
interested
in,
for
example,
I
could
say
I
don't
want
nginx
and
my
sequel,
and
now
I
filter
down
to
that
and
out
of
my
8
000
or
so
events.
This
will
filter
it
further
down
now
we're
down
to
7000
events,
and
when
I
open
one
of
these
up
here,
you
can
see
we
have
a
lot
of
metadata
that
you
can
collect
if
it's
useful,
but
you
don't
have
to
what
I'm
interested
in
right.
What I'm interested in right now is this one here: you can see there is some debug message that probably comes from my Java application. And yes, this was written by Jackson, the library, so it has been generated by my Java application. Before we dive deeper into that, the question is: how do we collect all of that information?
If you want the blueprints for how to get that, in the Beats repository under deploy/kubernetes we have the configurations for running Filebeat, for example, to collect those log files for you. Obviously, you run that as a DaemonSet. Let me quickly show you my configuration. Here I have my Kubernetes infrastructure; I'm heading straight to production, and I won't show you how we deploy the application itself.
Most of what is here is just the right configuration for Filebeat: where it should collect the information, where to get passwords from, where Elasticsearch is running, what configurations to mount, and so on. All of that is here. What is more interesting is that I have a ConfigMap, and that one actually contains the configuration that I've applied to Filebeat. Here you can see we're running this on Kubernetes with autodiscover, so it automatically knows what is running if you run it on Kubernetes and set it to autodiscover with the kubernetes provider.
We can just filter down on that, which might make our life a bit easier, and then we also define a so-called multiline pattern. If you've ever touched Java, you know that it writes these pesky multi-line statements, and if you break them up, a single line from a stack trace doesn't really mean anything. So what you have to do is keep stack traces together, and that's exactly what I'm defining here.
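To make the multiline problem concrete, here is a minimal sketch (the class and message are made up, not from the demo) of the kind of output a single Java logger call produces; the multiline pattern in the Filebeat configuration is what keeps those physical lines together as one event.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Minimal sketch (hypothetical class, not from the demo) of why Java logs need multiline handling.
public class OwnerLookup {
    private static final Logger log = LoggerFactory.getLogger(OwnerLookup.class);

    public static void main(String[] args) {
        try {
            throw new IllegalArgumentException("unknown owner id: 42");
        } catch (IllegalArgumentException e) {
            // This single call emits several physical lines:
            //   ERROR OwnerLookup - lookup failed
            //   java.lang.IllegalArgumentException: unknown owner id: 42
            //       at OwnerLookup.main(OwnerLookup.java:...)
            // Filebeat's multiline settings are what stitch them back into one event.
            log.error("lookup failed", e);
        }
    }
}
```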
Oh, that's not where I want to go; I mean the resources. In Logback I defined my log level, and this is where I said this is the level at which I want to log. And here you have the pattern: the log level, the actual logger, and then the message, which could be that stack trace, and that would then look something like this. Let's switch to another view that we have here; this is called the Logs UI.
This is a bit more like the tail -f that you maybe miss or that you want to use: it just streams the log events that you want to see, but you can still filter down here. We just mentioned that our application is using tags, and the Java log messages were tagged with petclinic-server, so I will filter down on that one, and then we will only get the relevant Java messages.
Now you can see these are the Java log messages that we have collected, and I could also open one of them up. If I do, you can see again that we have all the metadata around it. What is also interesting is that we have a log level. So, for example, I could just say: please only give me the errors, because debug is nice, but I'm not going to look through all the debug messages right now; I'm only interested in the errors.
The question now is: how did we get the debug level broken out here when the actual message was just one big string? You have probably guessed it: regular expressions. I know there is the saying that the plural of regex is regrets, and that might be true, but here we are using them to parse this apart. So how did we actually do that?
If I head back to my configuration in the ConfigMap, there was one piece of information that I added here: I have defined a pipeline in which I describe how to parse this log message. I hope you remember the Logback pattern of how we were writing the logs out, because now we need it. That petclinic-server pattern is in my ingest pipelines, under petclinic-server, and here I have written the right regular expression, or grok pattern, which is a named regular expression, to actually parse that message.
So here you can see: the log level part extracted the log level, then we broke out the handler, and then we extracted the reason, whatever came after that. This is how we break the message up and how we got the nicely broken-out log level and then the reason, which is just the rest without the log level, for example.
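Grok is essentially a named regular expression. The exact pipeline from the demo isn't shown here, so as a rough sketch, the same kind of extraction expressed as a plain Java regex with named groups might look like this (pattern and sample message are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Roughly what a grok pattern like
    //   %{LOGLEVEL:log.level} %{DATA:handler} - %{GREEDYDATA:reason}
    // does, expressed as a named-group Java regex (illustrative, not the exact demo pipeline).
    private static final Pattern LINE = Pattern.compile(
            "^(?<level>TRACE|DEBUG|INFO|WARN|ERROR) (?<handler>\\S+) - (?<reason>.*)$",
            Pattern.DOTALL); // DOTALL so a multiline stack trace stays in "reason"

    public static void main(String[] args) {
        String message = "DEBUG org.springframework.samples.petclinic.owner.OwnerController - Fetching owner 42";
        Matcher m = LINE.matcher(message);
        if (m.matches()) {
            System.out.println("log.level = " + m.group("level"));   // DEBUG
            System.out.println("handler   = " + m.group("handler")); // the logger name
            System.out.println("reason    = " + m.group("reason"));  // the rest of the message
        }
    }
}
```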
That is how we collected this. By the way, if you want to make your life easier, we would recommend that you use a structured log format such as JSON, and we have actually prepared one for various programming languages by now. If you want to log in Java, there is one called ECS Logging Java, and ECS stands for Elastic Common Schema. It's basically a naming convention that we use across everything we collect, so we know what the semantic meaning of the different fields is.
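As a rough illustration of what that gives you (the encoder setup lives in the Logback configuration, which isn't shown in the talk, so treat the class name and field values as illustrative): with ECS Logging Java plugged in as the Logback encoder, an ordinary logger call comes out as one JSON document per line, already using Elastic Common Schema field names.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class VisitService {
    private static final Logger log = LoggerFactory.getLogger(VisitService.class);

    void scheduleVisit(int ownerId) {
        // Plain slf4j call; with the ECS Logging Java encoder configured in logback.xml,
        // this is written as a single JSON line roughly like (field names per ECS, values illustrative):
        // {"@timestamp":"2021-06-19T10:15:30.123Z","log.level":"INFO",
        //  "message":"scheduling visit for owner 42",
        //  "log.logger":"VisitService","ecs.version":"1.2.0"}
        // No grok needed: the log line can be ingested as-is.
        log.info("scheduling visit for owner {}", ownerId);
    }
}
```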
That saves you the pesky parsing. I always tell people: yes, some people like writing regular expressions, but that's a bit like Stockholm syndrome, where you get so used to doing something that at some point you accept it as the only right way to do it. Or maybe it's a bit of job security, because while you can maybe write your regular expressions, nobody else can read them. So it's a bit like Perl.
People might complain that things are not as rosy as they could be, and the next thing you might want to do is APM, application performance monitoring, or tracing. To give you an idea of what that looks like, I pick my Spring service, and you can see it here. This is an agent that you add to your application, and it collects timing information from the app. How do you include that? This is not a DaemonSet.
The agent is, depending on your programming language, either part of your build process, or you can just attach it at runtime, which is the case in Java, where you have the concept of a Java agent. For example, in my case, in my Java application I have a Dockerfile, and in that Dockerfile we add the Java agent right in that place.
We just baked it into the image, and then you can add some configuration parameters where you say: okay, the APM server, which is where you actually send that information, is located here, and this is where you forward that information to.
Okay. Once you have collected that information... oh, and by the way, you don't have to bake it into the image; you can also attach it at runtime. If you want to see how to do that in Kubernetes, we have a very nice blog post where we use an init container to attach the agent when the container comes up, so you don't even have to bake it into your images; you can just attach it at runtime if you want to. I'll skip the details because the blog post describes that very well.
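Besides the -javaagent flag baked into the Dockerfile and the init-container approach from the blog post, the Java agent also offers programmatic self-attach. A minimal sketch of that option (not what the demo uses) could look like this:

```java
import co.elastic.apm.attach.ElasticApmAttacher;

public class PetClinicApplication {
    public static void main(String[] args) {
        // Self-attach the Elastic APM Java agent at startup instead of passing -javaagent.
        // Requires the co.elastic.apm:apm-agent-attach dependency on the classpath;
        // settings such as the server URL and service name are typically supplied via
        // environment variables (e.g. ELASTIC_APM_SERVER_URL) or elasticapm.properties.
        ElasticApmAttacher.attach();

        // ... then start the application as usual, e.g. SpringApplication.run(...)
    }
}
```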
I will focus on what you can get out of the tracing. With the APM agent, for example, you can see where you spend your time, and most of the time it's roughly a 50/50 split between my database and my application. We only have these two weird outliers here, and when I hover over them, you can see them down here as well.
We have very similar outliers down here, so let's zoom into that area. Now I could, for example, say I want to see both, or just focus on one; I'll pick this one here. You can see most of the things were very fast, but right here we're suddenly spending a lot of time in the application, like 99 percent in the application, and our response times are suffering tremendously while the request rate is actually pretty stable. So it's not that we have more load in the system.
Maybe we just have something bad in our application, and now we can look at what is happening here. For example, let's start with getOwners, just to show you what that can look like. Here in getOwners you can see how long your requests took. Let's pick one of the slower ones; but this one only took 60 milliseconds.
Let me head back. Here you can see the impact, and the impact is basically the product of how many times that transaction is run per minute and its average duration; a transaction that runs a hundred times a minute at two seconds each weighs more than one that runs twice a minute at ten seconds. For example, you can see that update owner got very slow on average, so let's figure out what is going wrong there.
Here you can see that most of the requests were again very fast, but there was one outlier that was really slow, and when I head to that one, okay, we can see a 400; that's not good. Then we can see where we spent our time in the timing diagram: in validate zip code. We spent almost 40 seconds in the validate zip code method. That's weird; we should probably check that out.
So I'm heading to the class with validate zip code: in line 33 we're returning something, and in line 12 we're looking at some code. Let's quickly search for that. This is the class I'm interested in; this is where we're doing something, and this is the return. But what is the actually interesting part?
It is this regular expression here, and it is a very bad regular expression. You can actually see, if I close this one and click on the entire transaction, what input parameters we were using. For example, this was the request body that we sent, and if you look at the zip code, it looks like a very weird zip code; it's a very long one, and that's exactly the problem here.
Short zip codes, with four or five digits, for example, will be very fast, but the longer the zip code gets, the slower the regular expression becomes. Now maybe you will say: Philip, this is very contrived; who writes bad regular expressions and brings their application down? Unfortunately, that happens more often than you might think.
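The exact expression from the demo isn't shown in the talk, but a classic sketch of this failure mode is a nested quantifier that makes the regex engine backtrack exponentially on long input that almost matches but ultimately fails, so validation time explodes with input length:

```java
import java.util.regex.Pattern;

public class ZipCodeCheck {
    // Hypothetical "bad" pattern, a classic catastrophic-backtracking shape: the nested
    // quantifier in (\d+)+ makes the engine try exponentially many ways to split the digits
    // whenever a long input almost matches but ultimately fails.
    private static final Pattern BAD = Pattern.compile("^(\\d+)+$");

    // A harmless alternative with the same intent: 5 digits, optionally a dash and 4 more.
    private static final Pattern SAFE = Pattern.compile("^\\d{5}(-\\d{4})?$");

    public static void main(String[] args) {
        // ~28 digits plus a trailing non-digit: already takes seconds with BAD,
        // and the runtime roughly doubles with every extra digit.
        String zip = "1234567890123456789012345678!";
        long start = System.nanoTime();
        boolean bad = BAD.matcher(zip).matches();
        System.out.printf("bad pattern:  %b in %d ms%n", bad, (System.nanoTime() - start) / 1_000_000);
        System.out.println("safe pattern: " + SAFE.matcher(zip).matches()); // fails fast
    }
}
```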
For example, there was this Cloudflare outage a while ago, and when you scroll down somewhere in their write-up, they say something like: oh, the CPU spiked because of a bad regular expression. That's just one of the many cases where you will run into bad regular expressions, so this is more real-world than you might think. Thanks to APM you can actually figure out what is going wrong in your systems, because you get timing information that would be very hard to get otherwise.
Also, to compare where logs fit in when you have tracing: traces are kind of logs as well, but the point of traces is that they have much more context. A log statement is just output from one specific line that you manually instrumented, saying: please write out this piece of log information. A trace has the entire call stack and the timing information, so it has much more context around it. Logs are much more common, but traces are a very good addition to get more information and a broader insight. Sometimes you also want metrics, for example, because metrics are very nice for getting an overview of how your system is doing.
Let's look at a dashboard. Here we have a combination of Metricbeat and Packetbeat data: Metricbeat collects metrics either from the operating system or from an application like MySQL, and Packetbeat is a lot like Wireshark.
Okay, now we can suddenly see a spike here, and when I hover over it, you can see this is the instance that I'm interested in: the pet clinic MySQL instance, cc59-something. You could just exclude this one, or you could, for example, filter down; we have built this filter for the infra pet clinic MySQL,
the cc-whatever instance it was. Once you filter down on that one, you will see, okay, this is the only one that is left, so this is our spike here. What could be the reason for this weird spike? We could look at the general metrics for our system. This is the general resource usage from the operating system, and you can see these are all the hosts that we have in our system.
You could also filter that down to specific Docker containers, of which we have a lot, or to the pods that Kubernetes is providing, and you can already see, when I hover over this one here, okay, this one has a higher CPU usage than the others. This is pretty much the same thing we have seen in the other views. Let's look a bit at the actual metrics for this instance, which you can do here.
All of this information is provided by Metricbeat, set up pretty similarly to what we have seen with Filebeat, so I'll skip that configuration file. You can see here this is that specific instance ID, and you can see how much CPU usage you have; here it started growing and it's higher, but otherwise memory usage is low and the network traffic is also not very exciting.
So from the host perspective I'm not really sure what is going on. But when I go back to that dashboard, you can actually see that we have other good pieces of information. For example, here we have which processes are running, and you can see, when I hover over that, a db backup is running. Down here you see the processes ranked by how many resources they're using.
One of the things that I think is important is to go from having one little island of tools for every single thing that you want to collect to having this bigger map, and that's kind of the idea of the Elastic Stack. We try to provide this map: you have logs, which you're probably already using; if you have metrics, you can include and correlate those; and then you can add traces on top of that.
If you want to try all of this at home, we have demo.elastic.co, which has quite a few features. You just head over to it, you get to a dashboard, you can pick what kind of use case you're interested in, and then you can just dive in and play around in Kibana and see if it makes sense for you. I hope it does.