A
Hello everyone, welcome to Cloud Native Live, where we dive into the code behind the cloud native world. And welcome from me as well: I am Ali Talvasta, a CNCF ambassador and a product marketing manager at CAST AI. I'm very happy to be here, and especially happy to be here with amazing topics and presenters. Every week we bring a new set of presenters to showcase how to work with cloud native technologies.

A
They build things, they break things, and they answer your questions live, today as well. You can join us every Wednesday at this time. This week we have presenters from OpsCruise: Cesar and Alok are here with us to talk about next-generation observability using open source monitoring. But before we get to the topic, I also want to remind you to join KubeCon + CloudNativeCon North America Virtual, October 11th to 15th, to hear the latest from the cloud native community.

A
That's already next week, so now is high time to get your tickets if you haven't yet, and see you there as well. Also, this is an official CNCF livestream and as such is subject to the CNCF code of conduct, so please do not put anything in the chat or questions that would violate that code of conduct.
B
The topic today is one that's always been of interest: how do you get observability for cloud native applications? And specifically, the question that I think is always on everyone's mind, given where we are, is: how do we leverage all the technologies, especially the monitoring and instrumentation coming out of the CNCF and open source, to achieve that? So I'm going to set this up with a short introduction, state the problem, and give you our philosophy and approach to solving it by leveraging CNCF and open source technology, as an example of what all of us can do as we move to cloud native or actively run applications in production.
B
With the legal side covered, let's start with what exactly the observability problem is, and specifically with the new sets of challenges that cloud native applications create. There are a couple of different factors. Most of you are aware of them, but they're worth highlighting. Probably the number one is obfuscation. The microservices and managed cloud native services that make up an application have dependencies all the way down to infrastructure-as-a-service and platform services, but now we have Kubernetes in the middle, so there is some obfuscation: you don't see those dependencies. That also creates what we call multiple points of performance loss: a service can be used by multiple services, even while it's being brokered and allocated by Kubernetes or by the cloud vendor. So that's one problem.
B
The second, related problem is dependencies, and they exist at two levels. We talked about applications down to infrastructure and platform services; the other level is dependencies across the services themselves. This is because you have a very large number of objects: you could have thousands of microservices talking to each other, with long dependency chains, and it's not obvious when those dependencies come into play or how they may impact each other. After all, cloud native applications are fundamentally complex distributed systems, and of course it's not just managed Kubernetes and managed services.
B
You have SaaS as well as APIs. What makes it even more challenging is dynamism. What was great for agility, the ability to add, remove, and change services and to auto-scale, means the structure and those dependencies are dynamically changing. This creates a significant visibility challenge, really an observability challenge: knowing what is actually going on at any time, even before you account for the load changing as well. So, highly complex.
B
The good news is, and this is where CNCF and open source instrumentation and monitoring come into play, that we have data at almost every level. But given the scale and complexity, the sheer amount of data makes understanding what is happening extremely hard. So the real problem has become scale, complexity, and getting to the right insights. So what do we need? This is where we start.
B
We need to extract that structure automatically, because at that scale we can't do it by hand. So that's the first requirement: capture the structure and dependencies dynamically. Second, if you want to understand what those dependencies mean, you need to understand what the applications do. A database works differently from a queuing system or, say, a load balancer. So when you're looking at them, you can apply knowledge of typical IT operations: shared services can have issues like noisy neighbors; Kubernetes can restart applications, and a pod has to be ready before traffic arrives; there are allocation limits. All of this is the lens a subject matter expert uses, and we need to embed it. When we look at the application, what is its current state? If it's dynamic, we need to understand, for every component, what is coming in, what is going on, what the workloads are, what resources it is using, and what services it is calling. That, in the end, is what you really need to provide.
B
So then the question is: how do we get the data needed to build this application understanding? This is why we embrace the CNCF and open source. What we essentially have to do, and what we are doing, is building an analytics layer that processes the information, and it's not just simply about metrics.
B
It's about traces, about the flows between services, about changes in the configuration in Kubernetes as well as the cloud, and about the logs that provide this information. If you look across the landscape today, think about OpenTelemetry: all of these are now available as open source instrumentation. You don't need proprietary agents anymore. You can deploy Prometheus, you can deploy Grafana, and you can collect it all.
B
You can get traces with standards like Jaeger, metrics with standards like Prometheus, and flows with eBPF, and so on and so forth. So our thesis, our strong belief, is: embrace open source and the CNCF, and leverage this information to do what we need to do, which is understand, contextually, what's going on in the application, processing the data into a real-time understanding of it.
B
What we're going to show you today in the demo is how we take the data coming in from this open source monitoring and essentially build out that structure, what we call the application graph, and then, as we understand the interactions and dependencies of each component that comprises the application, build out a behavior model.
B
Now, this behavior model is not simple. We can't predefine which metrics to build it from. In fact, we make no assumptions: whether it's a generic container or a known container like a database, or say a queuing system like Kafka, we use all of the available information to extract what the unknowns are and what influences them at any time, so we can predictively understand what is expected of that component. Once we learn this, deviations will tell us if there's a problem.
B
In fact, we want to do this in situ, while the application is running: observe it, understand the behavior across all the applications, and use it to detect deviations that indicate problems; in fact, it should tell us about emerging problems. And then, because we know the structure, understand the behavior, understand how components are supposed to interact, and know how Kubernetes plays a role, we can do global dependency analysis.
B
That means checking the configuration, checking the changes, cross-checking with the events and the logs, but also looking at what the expected metrics are, because that helps you isolate the problem domains and isolate the faults. If we can pull this off, we reduce the space and time the ops teams are searching through and reduce the effort it takes to resolve problems.
B
We can now also pull in traces on top of the flows to add more granularity and visibility. So that, in a nutshell, is the approach we are taking. Think of it as real-time telemetry processing; from all of that, the idea is to provide actionable insights and, if you will, the actions you can take to correct problems as you see them. And of course, the best way to show this is a demo.
C
Absolutely, thank you. Thank you, Alok. That's a really important slide you have up there, but we're actually going to jump back to it in just a second. I'm going to share my screen.
C
Awesome. So this is the OpsCruise landing page. Now, before we jump into a lot of these things, what I will do is actually deploy OpsCruise into a cluster while we're here, and I just want to show the simplicity of deploying not just OpsCruise but also all the underlying tools Alok was mentioning: Loki, Prometheus, and so on.
C
All of those are just a couple of commands away. This is already a running cluster, but I'm actually going to deploy into a separate cluster that I have here, so we're just going to switch to that really quickly. Let me clear my screen, and first we'll look at kubectl get nodes.
C
This is just a single-node cluster; the point is really to show the deployment. It's an EKS cluster that I've got running here, and I'm going to show the pods we have running. It's very bare-bones: just the AWS node components, CoreDNS, and kube-proxy.
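As a minimal sketch of the inspection steps being run here (the cluster name and region are hypothetical, not taken from the session):

```bash
# Point kubectl at the demo EKS cluster (name/region are illustrative)
aws eks update-kubeconfig --name demo-cluster --region us-east-1

# Confirm the single worker node is Ready
kubectl get nodes

# A bare EKS cluster shows only the aws-node, coredns,
# and kube-proxy pods in kube-system
kubectl get pods --all-namespaces
```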
C
So if I switch over into my VS Code, you'll see we have a few commands: we're going to add a couple of repos and run a helm repo update. By the way, Helm is our preferred choice for deploying most of these tools; it makes everything easy. It's the Kubernetes package manager, so definitely check it out if you haven't used it.
C
Then we have a couple of commands: a helm upgrade --install for the OpsCruise components, and another helm upgrade --install for the actual underlying open source tools. Again, Prometheus is going to get deployed, as well as node-exporter and kube-state-metrics; we'll talk a little more about the architecture right after this. And then we're also going to deploy Loki itself.
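A minimal sketch of the Helm sequence Cesar describes. The OpsCruise repository URL and chart name below are assumptions for illustration (the real ones aren't shown on screen); the Prometheus and Loki charts are the standard community ones:

```bash
# Register chart repositories and refresh the local index.
# The opscruise repo URL and chart name are placeholders.
helm repo add opscruise https://example.com/opscruise-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install (or upgrade) the OpsCruise gateway components
helm upgrade --install opscruise opscruise/opscruise \
  --namespace opscruise --create-namespace

# Install the open source collection stack: the Prometheus chart
# bundles node-exporter and kube-state-metrics by default
helm upgrade --install prometheus prometheus-community/prometheus \
  --namespace oc-collectors --create-namespace

# Loki for log storage, with Promtail shipping container logs to it
helm upgrade --install loki grafana/loki-stack \
  --namespace oc-collectors --set promtail.enabled=true
```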
C
Let's give that a go; this is just reporting status. It has successfully updated the repositories; the OpsCruise gateways are being deployed (it found the release doesn't exist, so it's a fresh install, now successfully deployed); the underlying open source pieces are deploying; and finally Loki is deployed as well, and we're all set. That will take a moment to settle in, so I'll look at that cluster again.
C
And we can just see the pods starting to come up now, so we'll check back on that in a little bit, but right now we're going to go back to the existing cluster. Actually, before that, I'm going to bring back the slide Alok was just sharing, because I do want to talk a little bit about the architecture. One sec while we bring that up; in the meantime, are there any questions that have come in?
C
Fair enough. So I do want to share that last slide Alok was showing, which is our architecture. We just deployed, but what is it that we actually deployed?
C
As Alok was mentioning, the whole purpose of these tools is to observe, and to observe intelligently and easily, without the need for heavy, typical proprietary agents. I think the industry has really standardized on a subset of tools, a lot of them from the CNCF, and that is what we leverage. The monitoring layer, or I should say the data collection layer for monitoring, is really standardizing and becoming very easy to ingest; there's not always a need to go with heavy proprietary tools.
C
So we're leveraging that. In this example architecture, this is a five-node Kubernetes cluster, and across the top are just your workloads, whatever your applications are: you might be running Node.js or NGINX or MongoDB inside Kubernetes; whatever you're running, that's across the top.
C
In the next two layers, the reddish and blue colors, you have the open source components. You have Prometheus as well as Loki for metrics and logs, and then in the blue you have the exporters and collectors. For example, we leverage node-exporter for node-level metrics and cAdvisor for container-level metrics, and it's of course important to not only look at the container itself; this is why we use both. You don't only need container metrics, but also that whole infrastructure layer of the actual Kubernetes nodes running the workloads.
C
So you need visibility into both. Another really cool tool is the kube-state-metrics exporter, which gives you the state of the objects inside Kubernetes. All those exporters, of course, feed into Prometheus, which makes the data collection really simple. Promtail, a component of the Loki stack, grabs all the logs from the actual workloads that are running, all the container logs, pod logs, and node logs as well, and feeds them into Loki.
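For reference, the scrape wiring described here boils down to Prometheus jobs like the following; a minimal sketch using static targets and conventional default ports (an in-cluster deployment would normally use kubernetes_sd_configs service discovery instead):

```bash
cat > prometheus-scrape.yml <<'EOF'
scrape_configs:
  - job_name: node-exporter        # node-level metrics (default port 9100)
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: cadvisor             # container-level metrics (standalone cAdvisor, port 8080)
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: kube-state-metrics   # state of Kubernetes objects (port 8080)
    static_configs:
      - targets: ['kube-state-metrics:8080']
EOF
```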
C
Now, those are the open source components we're leveraging here; this is what we just deployed with those commands you saw. And then, of course, you've got the actual underlying pieces. We've simplified it here, but the actual pieces of the cluster are running inside Linux nodes: you've got the kubelet running on each one of those nodes, and then you've got a Kubernetes API instance.
C
Like we mentioned, it's super important to have events, and super important to have the objects that are inside your Kubernetes cluster, so we also query the Kubernetes API directly to do discovery of those objects as well as event collection: all the Kubernetes events, whether it's replica sets scaling or a failure to schedule a pod onto a node because of an image failure; all the different types of Kubernetes events, we grab all of them.

C
The other thing I mentioned earlier was the gateways. Because all this data is already being collected, we need a way to grab it and feed it out to OpsCruise. So what we do is run these super lightweight singleton pods that you see here; it's basically one pod per telemetry type.
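The same event stream being collected here can be inspected by hand; a quick sketch of surfacing the kinds of failures Cesar lists:

```bash
# Warnings surface scheduling problems, image-pull failures, and
# similar faults across every namespace, oldest first
kubectl get events --all-namespaces --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp
```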
C
You have the metrics gateway here, which leverages Prometheus's remote-write capability: Prometheus writes the metrics out. We also grab the Kubernetes objects using the Kubernetes gateway. The cloud gateway pulls from your cloud, whether you're using EKS, AKS, or a GKE cluster; whatever the variant is, we will go in, because you need insights into more than just the cluster itself.
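The remote-write hookup mentioned here is plain Prometheus configuration; a minimal sketch, with a hypothetical gateway endpoint (the real OpsCruise URL isn't shown in the session):

```bash
cat > remote-write.yml <<'EOF'
# Fragment of prometheus.yml: stream every scraped sample out to an
# external metrics gateway (the URL below is illustrative)
remote_write:
  - url: http://opscruise-metrics-gateway.opscruise.svc:9090/api/v1/write
EOF
```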
C
The cluster is a great starting point, but you also need insights into the other services tangential to your cluster: things like the load balancers handling the connections coming into your cluster, or maybe RDS instances you're using, say, on AWS, those cloud databases you're calling from your cluster. It's important to highlight those and be able to observe them in context as well. So again, the cloud gateway, as well as Jaeger for tracing: all of those are just super lightweight pods that package the data up and send it off to OpsCruise.
B
And on the monitoring plane, because we are sitting on the host, we are not touching the containers, not putting in sidecars, and we don't have to touch the application code. This is, again, leveraging what anyone running production can already deploy; all we have to do is collect from those open collectors, with the minimum amount of touch, which simplifies both the deployment process and the data collection process. Also, the data stays where it is; we don't have to store it away and lock it up.
C
Yeah, exactly. I think one of the big things, I mean, we are talking about a mixture of OpsCruise as well as the tools themselves, which things like the CNCF have enabled to exist, so Prometheus et cetera. And even though we are talking about OpsCruise, OpsCruise is pluggable.
C
The fact is that with these modern architectures for observability tools, including OpsCruise's, even if you don't have this layer there, all this data is still there, and that's the important piece. As I was mentioning, the ease of use and the commoditization of the actual collection tools has made this really impressive: your data is there, it's easy, and you're not tied to a specific vendor or to some sort of proprietary implementation.
C
All these open source tools allow you to collect and keep that data and leverage it as needed: for business analytics, in this case observability, but also for whatever else, capacity planning, etc.
C
So I'm going to jump back in; I'll just move this out of the way.
C
Yeah, I want to make sure that this is up and running; it looked like it came up pretty fast, but I will check again. So all of these are running, and you'll notice, and I'm going to show this to you inside the cluster as well, that we have a couple of different namespaces: the collectors namespace and the actual OpsCruise gateway namespace.
C
You'll notice things like cAdvisor, Loki, Promtail, and kube-state-metrics as well; the Prometheus instance is up and running, here's node-exporter, and then our gateways are there too. So let me jump into the demo cluster itself, where we're going to get into some more of the cool things we're doing with those open source tools and all the data we're getting. But this is just the cluster itself again.
C
I just really wanted to show that deployment. I'll refresh my screen.
C
Make sure everything's up to date, and here is the actual deployment we were just looking at in the command line. As we saw, it's a single-node cluster inside EKS, so you'll see that node, and you'll see these components: you have Loki, Promtail, the OpsCruise gateways, node-exporter, etc.
C
So we're now building this really interesting view, which I'll show details on, all within a couple of minutes, while we just reviewed the architecture. This is all done out of the box; again, super cool that we're able to leverage the open source tools for grabbing all of this. Now I'm actually going to jump into the actual demo cluster itself so that we can see more detail.
C
This is our view, and what you're seeing here is a lot of data being collected and represented. If you're familiar with eBPF, we're leveraging the Linux kernel's eBPF capabilities to build this view where you see the data flowing across. We do support tracing, and there are a lot of really awesome cases for tracing, but there are also a lot of cases where you might not want to do tracing, maybe to avoid overhead, et cetera. So the eBPF capabilities of the Linux kernel really allow you to build this kind of topology and structure view without forcing the need for tracing.
C
I think that's one of the really cool things that modern Linux implementations have allowed us to do.

C
Now, before going into all the details, I just want to show one more time the underlying pieces, in a slightly more complex cluster. You'll notice there's a bunch of filters and such across the top. Again, the open source tools make this really easy, because when you deploy these tools they send things like labels off to Prometheus, and we can stitch those labels together and make this data really easy to ingest.
C
So now we can actually leverage the filters attached to your workloads to slice into different entities: you can filter by app, or you can filter by namespace, and all of this is just incredibly easy to do now with these modern tools. I've built a view of just the underlying pieces using these filters, so I'm just going to apply it: this data collection layer, which shows us the OpsCruise components as well as the open source tools.
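The same slicing works in plain kubectl via label selectors; a small sketch with illustrative label keys and values:

```bash
# Filter by app label within one namespace
kubectl get pods -n oc-collectors -l app.kubernetes.io/name=prometheus

# Filter by a Helm release label across every namespace
kubectl get pods --all-namespaces -l release=loki
```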
C
So in this case we have a five-node cluster; these are the five nodes and some of their data. Again, a lot of these components run as DaemonSets, so if I zoom in a little to node-exporter, you'll see five instances of node-exporter and five instances of cAdvisor.
C
Kube-state-metrics is a singleton. You can see all of this, and how Prometheus actually goes out and scrapes the metrics from those endpoints; you'll notice the arrows going outbound, because that's the direction the traffic flows. Then you'll see the OpsCruise gateways: Prometheus, as I mentioned, leverages remote-write to feed the Prometheus gateway for OpsCruise, and on the left side you'll see the Loki components.
C
So these are the actual pieces running inside the cluster, which is where we're getting all that data, and you'll see they're posting out to our Amazon instances, which is where we're housing this particular demo. Excuse me, sorry about that. So I'm going to go back, just clear that filter, and start showing some of the really cool things the underlying toolsets allow us to do.
C
Again, we're leveraging eBPF to show you this view of the flow between the different components. You'll notice the different pods, for example, and you'll also see, as I mentioned, that it's important to have a view into the other pieces your infrastructure is touching, not exclusively Kubernetes. We're running in AWS, so you'll see things like this Elastic Load Balancer, and we can actually click into it; this is what we call our quick view.
C
If you click into the actual load balancer, you get data relevant to a load balancer: its DNS name, its private and public IP addresses, the ports that are exposed, along with metrics. Now, this is a metric snapshot; we can look at some metrics in a bit.
C
I'll show you how the context for that works, but again, all this is being pulled from the underlying source, in this case CloudWatch, and it works for all the entities. So here we have actual pods; let me pick something a little more interesting, maybe an NGINX box.
C
If I click on the NGINX pod, again you see all the data from the underlying tools. If I hover on this, you can see connection data and architecture data, things like the performance we're observing: this NGINX controller is calling out to the NGINX service with a response time of 57.23 milliseconds on port 30000. So architecture validation becomes super easy because of the data we're collecting from those tools. And if I click on the pod itself, it brings me into that quick view again, which you'll notice is pervasive throughout the platform; again, we're leveraging the native data from Kubernetes.
C
So the label that was attached to this pod in the manifest is automatically picked up, and as you saw earlier, the view I built was based in part on some of these labels and namespaces. But why is it important to have all this data: things like the namespace, the IP address, the start time?
C
All these things are important, and just off the top of my head I can give you a lot of different examples. It's important to have the start time to make sure that the latest ConfigMap you applied is actually in place.
C
If you know you applied the ConfigMap on October 6th, which is today, and you're seeing that the start time was February 16th, you know that ConfigMap is not in use, because the pod itself has never bounced.
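That check is easy to script; a sketch with hypothetical pod and namespace names:

```bash
# If this timestamp predates the ConfigMap change you applied today,
# the pod never restarted and is still running the old configuration
kubectl -n demo get pod nginx-ingress-abc123 \
  -o jsonpath='{.status.startTime}'
```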
C
Another example is namespaces. A lot of the companies we work with have giant clusters: they might have 50- or 60-node clusters, even hundreds of nodes in a cluster, and that might be a single cluster across the entire enterprise just for non-prod. So what happens there? You might have, say, seven instances: prod, pre-prod, stress, QA1, QA2, QA5, all these individual instances. Well, how are you going to determine which slice of the application you're looking at?
C
Usually that's going to be segmented by namespace, so it's important to be able to not only look at that, but also leverage filtering on it inside your observability platform. The other thing, of course, is the context that Alok was mentioning. It's really important to have context, and I'll actually click on a container to get slightly richer data.
C
So one of the really cool things we do is stitch all this data together, again a facility that's available to us because of the underlying tools.
C
The underlying tools do such a good job of sending over labels, et cetera, that being able to stitch this data together richly allows us to do some really cool things, like contextually giving you access to things like metrics. So if I'm looking at this ingress controller container, as you see in the upper left corner, I can click on metrics, and of course metrics matter, because you need to know what's happening with your workloads: what does my CPU look like?
C
We have a view for that, and I'll show you some of what we do with all this data, how we make recommendations and highlight pieces that could use higher or lower resource settings, but we'll jump to that in a sec. Again, you get whatever data is available for this particular entity: network received bytes, memory utilization, and so on.
C
One thing I'm seeing here, just by looking at this, is that this pod is severely oversized, but I'll go back. There are also things like events; I won't jump into every single one of these in the interest of time, but there are Kubernetes events and logs, and I think logs are an important one. So let me click on logs: again, we're looking at this NGINX controller, and now we're straight into the logs for it. The context is important.
C
Now, I'm just giving you a sneak peek of what's actually under the hood. In reality, while it is kind of fun to go in and explore all of this stuff, the ML is really what brings a lot of it together; I just want to show you what's underneath. So again, you'll see some of the links; connections is just a table view of what this is talking to or what's talking to it, and you can see the Elastic Load Balancer, as you'll notice by this little arrow.
C
Now, I'll jump into the ML in just a bit and show the real magic behind that, but I want to show a couple of other views first, so I'll go into the node view. This is just another view of the underlying data. It's important to be able to see what's running where: you might have a particular host that's problematic, and you might want to know what pods are running on top of it.
C
I mentioned we have five nodes: one, two, three, four, and the fifth all the way here on the right. Here we're breaking up these views into, basically, a kubectl get pods with a filter on each individual node, but showing them all at the same time. So you can see the workloads; you'll see cAdvisor running here, CoreDNS, and you can also see data related to the node itself.
C
Say you're moving off of a Docker runtime and onto the CRI-O runtime, for example: you can check the actual config of the node itself and get details, like seeing that this is still a Docker container runtime, but also things like the kubelet version, the kernel version you're leveraging, and the operating system images. And again, just like we saw for the pods themselves, it's important to have the node-level metrics.
C
So this is again a high-level snapshot of the metrics for the node, but you can jump into it; there's a timeframe selector here, and it's just as important to understand how your nodes themselves were behaving at any point in time, whether they had some sort of spike, etc. Now, we have alerts that will automatically notify you of that, but it's great to be able to go in and explore at will.
C
Before I actually jump out, I do want to show the balancing view, because I didn't mention it. In our balancing view, we show you how many resources an individual pod is consuming. Hold on, let me refresh my screen.
C
I don't think I made a sacrifice to the demo gods today.
C
It should come up; I think it's just missing... oh, there you are. So we show you resource data for CPU, memory, and disk, and you can see, for example, let's just pick on cAdvisor: we have the cAdvisor pod, and you can see that it has no requests and no limits set. So that might be something to explore.
C
We can also see that it's BestEffort, and the current CPU utilization is 195 millicores, while the average is about 124 and the max is 220. So this can help when you're optimizing cluster workloads and trying to understand them.
C
Yeah, so again, this view really is specifically around that: being able to optimize workloads and make sure your limits are properly set. You might have some that are crashing, and we'll alert on that, but when you're looking to proactively go in and identify things, this view is perfect for it.
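Acting on that kind of finding might look like the patch below; a sketch only: the DaemonSet and container names are illustrative, and the request/limit values are derived from the usage numbers just quoted (average around 124m, max around 220m CPU). Setting requests moves the pod out of the BestEffort QoS class:

```bash
# Strategic merge patch: the containers list is merged by name, so
# only the resources stanza changes (this triggers a rolling restart)
kubectl -n oc-collectors patch daemonset cadvisor -p '
spec:
  template:
    spec:
      containers:
      - name: cadvisor
        resources:
          requests:
            cpu: 150m
            memory: 128Mi
          limits:
            cpu: 300m
            memory: 256Mi'
```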
C
Again, you know, with modern applications, we know that Kubernetes has won the orchestration battle, so tons of modern applications are of course running inside Kubernetes. So based on the Kubernetes API data and some of the other open source tools, we built this view exclusively for Kubernetes resources, where you can see things like the deployments, replica sets, and daemon sets that are in your cluster.
C
This is a nice map, but it's also clickable. For example, I can look at pods: I click on pods, and now I'm looking, again through the magic of those labels automatically collected by the underlying tools, at a lot of what builds these views. You've heard me mention it two or three times at this point, but it really is almost magic.
C
Being able to grab this data around namespaces, how many pods are running in those namespaces, and any failed or anomalous pods; anything with an auto-detected anomaly will show up here, but you'll also see the distribution of your workloads.
C
Right, so I can look at the namespace they're part of, their IDs, their status, the host they're attached to, and the IP address of the pods themselves, along with the labels attached to them, whether they're part of a ReplicaSet or a DaemonSet, and then quality-of-service data along with created time and start time. And again we come back to this quick view, built with the data from all those different tools.
C
You can look at all this data, the labels and all the metadata, and you're back into it. There are other views that are very similar, but what I want to highlight is the richness of the data and the contextuality: having all these tools, all the metrics, all the config, all the events, stitched together, and that richness provided to us in the context of all these views.
C
The next thing I want to jump into is the alert view, and Alok, I think this is a lot more pertinent to some of the stuff you were talking about, so please feel free to chime in.
C
With all this data we're now receiving, once we stitch it all together, the one thing we're hoping to drive home is the smart layer, as Alok showed in the slide, the smart layer on top of all these tools. Because it is important to have metrics, traces, network data, config data, and change data, but what you do with that data is the real challenge I think we're all facing today.
C
The data is often siloed. You might be using proprietary tools for one piece of data and open source tools for another, or even if you're fully open source, you might be looking inside Prometheus for one thing, directly inside Loki for another, and going to the Kubernetes command line for other things. So that is one of the challenges we're trying to solve; I think everybody's trying to solve it.
C
If you're into brain science at all, you'll know that context switching is a big problem; it's a big drain on us. So that's one of the things we're trying to avoid: it wastes resources and time, it makes your teams less effective, and it increases outage duration, which in many cases means lost money and lost opportunities.
C
In a healthcare environment it could mean losing health, losing important time to care for patients. There's a myriad of things it could affect, but the point is you need to be more effective; you need this context so as not to waste time and cycles. That's really what this screen represents.
C
It's really the culmination of all that data we were just showing, combined with the smart layer Alok was talking about. We have the ability to set thresholds, just like any tool.
C
You can set thresholds directly inside, say, Alertmanager in Prometheus and create alerts on that, and we have that capability too; I'll actually jump out here for a sec to show it. But that's not the philosophy we want to lead with. I mean, if you really want to, you can come in here, select a metric, and apply a threshold.
C
You can say, all right, create an alert if I'm over 200 milliseconds, and we'll even provide automatic threshold suggestions for workloads that have been running: this is the current max, etcetera, so the suggested threshold here is 0.35, because in this case the CPU doesn't go over that. That's a little bit of our ML in play, but static thresholds are not what we want to rely on.
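For comparison, the static-threshold approach being described corresponds to an ordinary Prometheus alerting rule; a sketch with a hypothetical metric name, using the 200 ms figure from the example:

```bash
cat > threshold-alert.yml <<'EOF'
groups:
  - name: latency
    rules:
      - alert: HighResponseTime
        # metric name is illustrative; 0.2s is the 200 ms threshold above
        expr: service_response_time_seconds > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Response time above 200 ms for 5 minutes"
EOF
```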
C
There are more significant things we can do around really stitching all this data together and around the behavioral models that are created with it. So let me find an alert.
B
While you're finding it, Cesar, I think one comment is worth making.
B
Instead of trying to guess, optimize, and tune thresholds, and instead of waiting for a workload to hit a saturation limit, it's better to detect when the problem is actually happening. This is where you want the intelligence to understand the behavior: is it working correctly, as opposed to waiting until it hits the limit and keels over and dies?
C
Exactly, and that's a real challenge. You and I were talking about this last night: especially with the ability to scale out that Kubernetes provides with replica sets, workloads are running in much tighter windows than they used to. So it's a lot harder to set thresholds nowadays, because workloads might be running in a really small window; everybody wants to maximize their resources, and some might be running at capacity. So when real deviations occur, a machine is going to find them much more easily than a human; otherwise you'll have one of your ops folks literally scouring through dashboards.
B
Let me double down on that. This is where I think a lot of people realize what has changed in Kubernetes with auto-scaling. Let's say you decided that CPU at 85 percent is what you're worried about, so you set a threshold; but demand increases and you can auto-scale, which means when you hit 85 percent you increase the number of replicas. That's not a problem, so why are you alerting? I know I can auto-scale, so that threshold is going to create all these false alerts.
B
Every time you auto-scale and scale back, the container is behaving fine, the application is behaving fine: increased demand increases resource usage, and hitting the limit just before auto-scaling kicks in does not warrant an alert; that's a false alert. If the application wasn't behaving correctly at a time when it's not supposed to hit 85, that's something I'm worried about. How do you detect that? That's the key, and we can't do this manually across hundreds and thousands of containers.
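The scale-out policy Alok describes is exactly what a HorizontalPodAutoscaler encodes; a one-line sketch (the deployment name and replica bounds are illustrative). The point is that crossing 85 percent CPU should trigger scaling, not an alert:

```bash
# Scale between 2 and 10 replicas, targeting 85% average CPU utilization
kubectl autoscale deployment webserver --cpu-percent=85 --min=2 --max=10
```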
C
Okay, thanks for that, Alok. So this is an example of one of our alerts. Now, the machine learning has done its work, but the work it's done is on the data being provided by Prometheus, and on top of that we're bringing in things like logs from Loki and context for you to look at.
C
Plus any events that are happening. So again, this whole rich construct, this really rich object model, is essentially what gets built by bringing all those open source tools together. I'll highlight just a couple of things; there's a lot of info on the screen, but the important part is that, for starters, we're seeing that some metrics are not normal in this particular cart-cache container inside this particular pod.
C
I won't go into these, because we have a more interesting view of the actual violated metrics, but I do want to call out what Alok was just mentioning. Many times you might set alerts on some metrics, but I've been in monitoring for something like ten years, and I don't recall ever seeing somebody go in and set a threshold on container filesystem reads, or some of these more esoteric or less well-known indicators of performance, and I think that's a real shame.
C
But again, this is something you don't have to do, because the ML will do it for you. You're already getting this data from node-exporter, from cAdvisor, from eBPF; you might as well actually leverage it, as opposed to just collecting it and then doing nothing because nobody actually knows what to do with it.
C
This is an example of all the metrics that were taken into consideration by the ML to trigger this particular alert, but we'll go to the more interesting view of the analysis. I'm actually just going to give you a little taste and then jump out of this, because it links to a larger piece. You'll notice we're highlighting some of the issues.
C
We have what we call a fishbone RCA, which shows you the different categories of metrics and configurations that are probably important to this scenario. The filesystem is not involved; it's all grayed out, green and gray meaning good or not involved. But I'll highlight a category: configurations have changed, the supply-side workload is having issues, which is why some of these are red, the demand-side workload is having issues, and so is CPU. We'll jump back to this.
C
I just wanted to give you a taste of this auto-detected anomaly found by the ML from all that underlying data we've collected. But let me step back for a second to show something here. We have another alert saying that we have an SLO breach on response time for this particular service, the internet service: it's saying we have an SLO value of 2500 milliseconds and we're actually responding at over 3500 milliseconds.
C
So we're at three-plus seconds of response time, which is of course important: somebody deemed this SLO important enough to set, and now it's being violated. So what our ML is doing is looking up the stack, down the stack, upstream, and downstream to identify where there are actually issues that could be affecting and causing this particular SLO violation.
C
Even just visually, we can see that there are some clear problem areas here in the red, and this is really no work being done by you; it's behind the scenes by the ML. I'll highlight what these red pieces mean in a sec; I just want to click on this NGINX, and we get a tabular view, again an amalgamation and integration of multiple data sources.
C
You have the actual SLO violation, over by 41 percent, and because of eBPF we can see the flow of the requests and identify the highest-latency path. In this case it goes from NGINX to the web server to the cart-cache, and from the cart-cache to the DB server, which you're seeing back here. Our ML has also learned what's normal, so the expected behavior for the cart-cache and for the DB server, and both are out of normal: you're at over a second and over 2.4 seconds respectively.
C
We know this is not some sort of increased-request issue, because we're also bringing in data related to the URL request count, and that is actually going down; so it's not an increased-request problem. So what is it? We could jump into each of these individual components if we wanted to, but we know we don't have to, and nobody's going to do that, because as operators we're trying to resolve issues as fast as possible.
C
So what I'm going to do is actually click on this red, and it takes me back into that alert we were just looking at. What happened, and I'll just go back a sec, is that for this particular SLO violation the ML actually brought in that completely discrete anomaly we were looking at earlier.
C
It's completely separate, but the ML brought it in and said: you know what, this is likely a contributor to, or a cause of, your SLO violation. And now we can actually go into those metrics we were looking at earlier, which, by the way, are charted automatically here. You don't get a message saying your response time is slow and then have to go to some other tool and a different dashboard.
C
It's all in here, charted automatically for you. You can see that the response time has increased by close to two thousand percent. You can see that the response bytes themselves have increased: the size went from one meg to close to eight. The max incoming response time for requests coming into this particular cart-cache container has increased by 1500 percent.
C
CPU utilization has increased by about 50 percent, but finally, the real piece is the image. This is now leveraging all the open source data and bringing it together to say you have an image regression: you were running version 0.6 and now you're running version 0.4. So that's really the issue: we have a bad image, a well-known failure point; it's broken. That's really the cause. I'm out of time, so I'll hand it back to you, Alok.
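As a manual cross-check of the regression found here, the running image tag is one field away (the deployment and namespace names are hypothetical):

```bash
# Print the image (and tag) each container in the deployment is running
kubectl -n demo get deployment cart-cache \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```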
B
I want to cap this with a couple of comments and leave time for the folks attending to have Q&A. If you looked at what we just did here, that response-time change against the SLO came from eBPF, an open source component that's already there; we didn't have to do anything.
B
The metric changes came from Prometheus and eBPF for every container, along with the flows. The configuration change we detect from kube-state-metrics, for example. When we see an event change, or there's a log or event, we can bring that in. We haven't shown the trace drill-down, which is still in the works and will be out shortly; you'll be able to pull in the trace at that point to see exactly what happened and confirm even further. But the point is that all of this data is there, freely available.
B
Open source monitoring has made available all the data we need, for telemetry and even for changes, dynamically. Let's embrace that and add this intelligence on top. You can do it; we are just showing you one way to do it and make your life simpler. That was the whole problem. So, if I were to summarize and then open up for questions, I'd emphasize exactly that; I think it's worth highlighting again.
B
We have all these open CNCF standards: OpenTelemetry is there, metrics, logs, traces, eBPF. And yes, if you have a service mesh like Istio, we can deal with that too, but you don't have to touch anything; Kubernetes even gives us the changes as they happen. Once you have that workflow, that's where the intelligence comes in, and that's what we're emphasizing.
A
Great, excuse me. A really great, amazing presentation, and a very good demo; always happy to see those. Thank you so much. As said previously, now is the time to ask your questions: you can leave them in the chat and I'll help moderate the Q&A. But let's get started, and let's kick off with a few of my own questions. I know you mentioned a bit about Prometheus, as well as other CNCF and open source tooling; do you want to expand on how these play into what you showed?
B
I think we did, and if you go back to that screen, Cesar, you noticed how we deploy Prometheus itself, just like you would today in your Kubernetes cluster as a daemon set, and enable the cAdvisor metrics and node-exporter. That's all we are pulling together.
B
What we've added, because we needed what we call layer-seven metrics, is that we now leverage eBPF as well, another open standard, which gives us coverage of networking not only at the bytes-and-packets level but also request rates at the URL level and response times; essentially, it gives us the golden signals. So nothing out of the ordinary, as long as we have that coverage.
C
We're using Prometheus as a time-series database, so leveraging the exporters easily sends that data in: it's scraped by Prometheus from all those exporters. So all the metrics you saw, whether they're network metrics, the cAdvisor container metrics, or node metrics, all of that is being fed into Prometheus, and that's where we grab them from. It's actually a really key component of the observability layer.
A
Perfect. We don't have too much time left, but I do want to ask a question, because we covered kind of the latest and greatest of observability and monitoring here. What do you think the next steps for this scene are going to be, whether for OpsCruise, what are the upcoming features, or for where the whole space will be moving in the future, and what will be the focus?
B
We are really working on cloud native. One of the things we didn't show is that we are adding tracing, OpenTelemetry with Jaeger, which brings more and more capability to the causal-analysis pieces, so teams can act on fixes. We also didn't show, for example, some issue types we are adding, whether they're related to Kubernetes faults and failures at start time and run time, or at the application level; we are adding capabilities that are more knowledgeable about known applications.
B
Application awareness: how does Kafka behave, which metrics to look at; even those have Prometheus exporters. So now we can drill down deeper to understand a specific problem with Kafka, or a specific problem with an open source database like MySQL. That gives us even more granularity to understand what the problems are, and of course we start adding traces.
C
In the overarching space, I think the whole concept of bringing multiple sources of data together is really where the industry is starting to converge, and adding that intelligence; OpsCruise is not the only company doing that.
C
I think we're doing some pretty cool things, but you'll notice more and more integration of the tools, because there's a lot of value to be had from not isolating them. I think the industry is starting to realize that, so that, and putting smarts into whatever platform you use, is kind of the next phase, the future.
B
For anyone going cloud native: embrace the open source monitoring, and then add the intelligence where you really need it. That's what we see in the community, and it's taken a while, but it doesn't matter whether you're an enterprise customer with a lot of legacy applications moving to the cloud or a new startup.
A
Perfect, and perfectly on time as well, as far as the timing of the session and the Q&A goes, so it's wrapped up nicely there too. Thank you, everyone, for joining the latest episode of Cloud Native Live. It has been really great to have OpsCruise's Cesar and Alok talking about next-generation observability using open source monitoring.
A
It has been an absolute pleasure, and we're amazed and happy to see so many attendees joining in as well; thank you so much for tuning in. We will be bringing Cloud Native TV every Wednesday going forward, but next week is KubeCon, so we will take a break, because there's already so much going on in the cloud native space next week; no need for our livestream.