From YouTube: How to sleep better at night and survive on-call with Robusta Automations by Natan Yellin
Description
Natan will present robusta.dev - a new and rapidly growing open source platform for Kubernetes monitoring and troubleshooting. Robusta sits on top of your existing monitoring tools (e.g. Prometheus) and tells you why alerts happen and how to fix them. It makes root cause analysis easier and gives you better manual troubleshooting tools for looking inside microservices when automated analysis fails.
A
Hello, I mean, thank you for having me. It's such an absolute honor to be here today. The topic of my presentation today is how to sleep better at night and survive on-call with Robusta automations, and we're essentially going to be speaking about how you can take DevOps philosophy and apply the same principles that we use with infrastructure as code to how we handle alerts. But before I get started, I want to very briefly give some more background about myself.
So I've been a developer for more than 15 years, with lots and lots of open source involved. I've also worked in cyber security professionally in recent years, and I've been blogging for about as long. Recently I've started creating YouTube content, and in recent years almost everything I do has been around Kubernetes, at Kubernetes startups, both as a developer and in customer-facing roles, and today as co-founder of Robusta.

Okay, so I'm going to talk about three topics. The first topic is the history of automation, and specifically the history of DevOps automation.
So I want to start off with a trip to the past: I want to go back 15 years, to how I developed software. When I got started, I was working on some WordPress websites, and I did some Python code and web applications in Django. Back then, there were essentially a few different stages to how you developed software.
First, you built the software. I would check out code locally, I would make my changes, and I would build it locally. I would run some tests locally, and if everything worked, I would go on to the next stage: the deploy stage.
So I would upload files onto a virtual host, upload them with FTP. I would do an apt-get install, install a bunch of dependencies, move files into the right place, and now I'm deploying my web application. Next step, I had to configure it, so I would go into a web console. I would set up some DNS records; if I deployed a new version of the application, maybe I would stop the old one and start the new one. I would configure stuff manually, maybe configure some rules for Apache or nginx, set stuff up. Next stage: so, I've done everything.
Well, I was really excited. I took my application, put it online, messaged all my friends on IRC, put a post on Hacker News, and it really blew up, and now everyone is coming to the website. And now I have a problem, because now there's real load on my server. So I add another server and deploy another copy: essentially I go and do that deploy-and-configure stage from scratch, and now I have another copy of my application running.
Lastly, something is going wrong; something isn't working right. So if I'm lucky, I have some monitoring in place, and I see that there's an issue, and now it's time to respond to that and fix the issue. So I respond: I connect to the server over SSH and start to investigate what's going on. I run ps, I try to fix the issue, and I fix the issue manually. And almost every single part of this has changed in the past 15 years. There's been a fundamental shift, and in each case, what we do today is totally, totally different from what we did back when I got started.
So I want to show that what we did is essentially go through each and every one of these stages and automate them. For build, I'm no longer just building code locally and then running tests on my local machine. We have CI/CD, like Jenkins, like GitHub Actions, like CircleCI, and essentially my code, every single night or after every single commit, is being built; we're running tests, we're doing all that automatically. So I can't accidentally go right ahead to the deploy stage without running tests, or when something isn't building.
An apt-get install would do something different if it's a CentOS host, an Ubuntu host, or a Debian host, and all that had issues, because you were really dependent on the machine that you were running on. Some machines would have different dependencies, different versions of the software you need. Maybe you're deploying a PHP application and there's PHP 5 on one host and PHP 4 on another; don't catch me on the versions, I haven't done PHP in a very long time.
But the point is that we no longer deploy software that same way. We deploy with containers, and essentially what I've done, by automating this and by using Docker, is shift this entire deploy phase, or a big part of it, to build time. Now I have a Dockerfile. I build a container; that container is self-contained, has everything I want inside of it, and when it comes time to deploy, I just take that container and put it somewhere else. I'm not dependent on the local machine. It doesn't matter what the dependencies are on the local machine, as long as it has Docker, because my container is self-contained; it has everything in it. Every time I commit and do a build in the build stage, that builds my Docker container as well, and deploying is just so much easier.
I take a Docker container and get it running on the right machine. Very easy, and this has all been automated with Docker. Now, moving on to the configure stage: we also no longer do this manually. I no longer go and just set up records or configure stuff by hand. We have infrastructure as code, whether it's Kubernetes and YAML manifests, or the underlying infrastructure and tools like Terraform. I run kubectl apply, and the instructions for everything I used to do as a person are now contained in the YAML file or in the Terraform file. That is exactly how to set stuff up; I just apply that file and everything happens. So there's no longer a manual stage here.
Moving on to scale: of course, we do auto-scaling. HPA and VPA, the horizontal pod autoscaler and the vertical pod autoscaler, or even auto-scaling groups on AWS if you're not using Kubernetes. The scale phase is also totally, totally automated, and that's no longer something that we do manually.
So if we look at each of these first four stages: we used to do stuff really manually, or we would automate it with in-house, ad hoc solutions that each company invented for themselves. Today, for each one of those, we have a platform, a really great piece of open source technology that just does it automatically, gives us all these extra features, and is a whole lot more robust.
If I move on to the last stage: the last stage, in many ways, is still left behind. When alerts occur, and when we respond to those alerts and go and investigate them, very often we are still connecting over SSH or running kubectl exec. We're still investigating stuff similar to the way we investigated in the past, and that's the area that, with the Robusta open source project, we're trying to really automate and bring into the modern day.
What I mean by this is: back in the day, if you wanted to automate the build, you had some ad hoc, in-house solution and you would run a bunch of scripts. Maybe you check something out, you run autoconf and ./configure, you run a bunch of stuff, and you have a script that wraps that, and all these scripts that you write. You no longer do that; you're just writing a YAML file. Sure, that YAML file has some scripts in there and some stuff contained in it.
The same in that Dockerfile, for example, where you're defining how to build it, or if you're building on GitHub Actions and there's a YAML file there that defines all the stages. So we still have some of those same scripts, but they're wrapped in a configuration file with high-level instructions; it's not just an ad hoc script. And the reason these configuration files are important is that they run on an underlying engine that adds all these extra features for us.
For example, if a GitHub Action fails, then I automatically have the logs, I can see exactly where it failed, and I see that in the GitHub user interface. If I'm using a Dockerfile instead of an ad hoc solution for packaging stuff, now there's caching: each layer builds upon the previous layer, and I benefit from Docker's layer caching. So by moving to these automated platforms, we suddenly get all these extra features, and as a user, it's a whole lot easier.
I can write a short config file, often in YAML, and I get stuff that previously I would have had to write a whole lot of code for. Maybe the best example of that is auto-scaling. If I look at auto-scaling, I'm just putting one line in the YAML file, or one short YAML file, defining how to auto-scale stuff, and now, if something dies, it'll get brought back up.
I mean, that's the self-healing aspect, not auto-scaling; but if I change the number in there, or if the CPU load changes, then suddenly stuff will auto-scale, and I don't have to write all this code or all these scripts that say "if this, then that, and do this." I'm just saying what I want to happen, and the underlying engine, Kubernetes, will handle all the implementation details.
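As a concrete illustration of that declarative style, here is roughly what a Kubernetes autoscaling definition looks like. This is a minimal sketch; the deployment name and the thresholds here are made up:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # scale out when average CPU passes 80%
```

Nothing in this file says how to add or remove replicas; you declare the desired behavior and Kubernetes handles the implementation details.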
So by giving a configuration file, I can work at a much, much higher level, and there are huge benefits to that. And the second part of this, which is equally important, is that when I have a configuration file, I can take that file and stick it in git. The big shift that happened here, for each of these phases, is that there used to be knowledge that was inside my head, and that became machine knowledge.
If I want to deploy my software, like create a new version of it, then I don't have to go and run a bunch of commands; it's just a Dockerfile, and Docker is doing it for me. So essentially, in each of these stages that we spoke about, the knowledge that used to be inside my own head has become knowledge in a git repository, and that means I can audit that knowledge and I can share it with my team.
Can we apply that to how we handle alerts? Now, alerting itself, and the system for defining alerts, that's all very well known. There's great technology out there, so you're probably already using Prometheus or Datadog or Dynatrace or AppDynamics or New Relic or some other system, and they're great alerting systems. I mean, personally, I love Prometheus and Alertmanager; they're great systems to define alerts. But what happens when those alerts fire? Someone still has to go and investigate.
So can we really try to automate that process of taking the alert, understanding what it means, understanding how we fix it, and maybe even fixing it automatically? What we'd like to achieve are really three things that we've achieved in each of the previous examples of automation. One: we want to make it all work faster. We want a faster response to alerts.
I was speaking to a company recently, I believe in England, and they said to me: our on-call team has to respond within three minutes, because that's the maximum downtime that we're allowed in the entire month; that's the SLA that we promise customers. So in large companies and teams, you really have to respond fast, and therefore having some automation is really important.
That's how you meet the SLA, the agreement you have with your customers. Two: knowledge sharing. Just like taking stuff and putting it in a Dockerfile or a GitHub Action, or using infrastructure as code and YAML files, you can share the knowledge and take knowledge that was previously in your head, like the instructions for how to set up a server, and now it's just the Terraform plan; you just apply that file.
So can we take the knowledge that's inside your head for how you respond to alerts, and make that just another file in git? And lastly, we want to respond to alerts better than we do today. If you look at all the previous examples of automation, then obviously GitHub Actions is way better than just having an in-house solution with some scripts that you write, because there's a platform: it's open source, it has lots of features, they keep adding features, and you benefit from those features.
So can we do the same thing for the response? Can we really respond to alerts better by automating the way we do it? Okay, so I'm going to look at three pieces of open source technology that you can use to do this. I'm going to speak most about Robusta, the last one, which I work on personally, but I want to mention the other ones, because they're also great pieces of technology, and they're all open source. So first is StackStorm.
StackStorm is the oldest technology here, so it's very mature. It's used to automate alert response and other workflows, and to do automatic remediations. The advantage, and the disadvantage, of StackStorm is that it's not Kubernetes-native. So it will work on everything, but it's not specifically built for Kubernetes. Moving on to the next one:
Argo Events. Argo Events absolutely is built for Kubernetes; it's a great piece of technology. And the advantage and disadvantage of Argo Events is that, the way Argo works, everything is just a pod, a container, a Kubernetes resource. So with Argo you can say: when something happens, go and automate this by running a pod, by running this container. But then you have to supply the details of what that container does. Moving on to Robusta: the advantage and the disadvantage of Robusta is that we're really specific. We're really focused specifically on responding to alerts and remediating them, and we have domain-specific knowledge about specific alerts. So we're a little more specific; we're not a general-purpose framework like the others, but for the use case of responding to alerts, we've tried to really build it dedicated to that use case. But all three of these are great pieces of technology. And the Robusta philosophy really is three things. One: Kubernetes-native, so it's built specifically for Kubernetes.
It can work elsewhere, but the focus is on Kubernetes. Two: batteries included. So when you install Robusta, it should just work out of the box. If you send it your Prometheus alerts, then you should get value even before you go and configure anything, because we have built-in knowledge about common alerts. And three: it should be easy to use. We're trying to save you time; we're trying to help you automate stuff.
Now, conceptually, a trigger is something that occurs in your Kubernetes cluster, or outside of it, that kicks off an automated workflow. So an example trigger would be a Prometheus alert firing: there's something wrong with your cluster, and that triggers a workflow. Moving on to the next part: an action is what you do when that trigger fires. An action can gather extra information; you can take that alert and investigate why it happened. An action can also fix stuff.
You can say: okay, I know what this problem is, it's a known trigger, so I'm going to go and automatically apply a fix. And there are many different types of actions. And lastly, a sink: that's the destination, where the data is sent, where you get a notification about it. That could be a Slack channel, for example. And I'll give some examples now of each of these different concepts. So triggers would be Prometheus alerts,
Kubernetes changes (anything that happens in your cluster, for example a new deployment being rolled out), and there can also be manual triggers, where you say: okay, right now I want to trigger an automated workflow. But there are many, many more triggers; I'm just giving three examples here. For actions, again: we have some of the actions built in, and you can add on your own, but an action could be going and running a profiler.
For sinks: MS Teams, Opsgenie, Telegram; you can send stuff to a webhook, you can send stuff to a Kafka topic, lots and lots of different things. And if you need something that doesn't exist, then just open an issue on GitHub and we'll get around to it very fast. The last comment I want to make about the architecture is that everything is strongly typed. So if you have a trigger with a Prometheus alert that kicks off an automated workflow, then we know, in the Robusta system,
that we have metadata about it. We know: this trigger is an alert about a pod, or this is an alert about a node, or this is a trigger that occurred because you rolled out a new Kubernetes deployment. We have data about what triggered that automated workflow, and that data passes to the next phase. This is really, really important, because it means: let's say you had a trigger that was a Prometheus alert, and that alert is that the node is running out of disk space. So we know that this trigger, this alert that's firing,
is about a specific node. So when that passes on to the action, you can have actions like "go fetch a graph of the disk usage," and the action will automatically know how to take the right node from the trigger, and which Kubernetes object it should apply that action to. Or if you have a Prometheus alert that a certain pod is crashing, then that's the trigger, and it passes to the action, and the action receives that specific Kubernetes pod, so you can then have an action like "go fetch that pod's logs."
Now I want to give a specific example, because I've really been speaking about architecture at a high level, so here's one concrete example. You have a crashing pod in your cluster, and the automated response is a very simple one: go fetch the logs and show me the logs, so I can see why it crashed. Essentially, this is the world's simplest example of automating your alert response.
Before this, you'd get a message in Slack that says "your pod crashed," and then you'd go manually and fetch the logs. Now the logs are fetched automatically for you, and you get a message in Slack that has that extra information. So you no longer need to open up the command line, connect to the cluster, and then fetch the logs; all the data is right there in the alert itself. This is a really simple example, but it's one that I think is good for demonstrating the general concept.
Okay, so I'm going to just pause for a moment; I see there's a question already in the channel. Someone has asked: can you please share which tool has been used for the slides? And the answer is, I've been using Canva. Canva is an excellent piece of software, it's a SaaS platform, and I have the paid version, and I use that to create the slides.
So that's what I used to create these. If there are other questions, I'm going to answer more questions later on, but please: I love it when people ask questions, I love audience participation. It makes me feel a lot better as a speaker, so please feel free to write your questions in the chat.
Okay, moving on. So we saw one specific example here of an automated workflow, and now I want to show how this automated workflow was written behind the scenes and how it actually works. If you want to configure a workflow like this, where every time a pod crashes you get a message in Slack with the logs for the crashing pod, then how would you actually go and configure that?
So it's really easy: it's just YAML. Here I have those three parts that I mentioned earlier: triggers, actions, and sinks. The trigger is a Prometheus alert called KubePodCrashLooping, and if you don't have a Prometheus alert like that, don't worry: when you install Robusta, it can give you all those default alerts too. The action is the logs enricher: we're going to take this alert and we're going to enrich it with the logs. And you might be thinking: the logs of which pod?
The logs of the pod that crashed. Over here we get the data from the alert, and automatically the actions will run on the relevant pod. And then the destination we're sending it to is Slack. There's a whole lot more stuff you can do here. By default, for example, this is rate-limited: so if the same deployment is crashing and you have a thousand pods, you're only going to get one notification every 60 minutes, or you can configure stuff like that.
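For reference, a playbook along the lines described here looks roughly like this in Robusta's YAML configuration. This is a sketch; the exact keys and the sink name can vary between Robusta versions, so check the docs for your release:

```yaml
customPlaybooks:
  - triggers:
      - on_prometheus_alert:
          alert_name: KubePodCrashLooping   # standard kube-prometheus alert
    actions:
      - logs_enricher: {}                   # attach the crashing pod's logs to the alert
    sinks:
      - main_slack_sink                     # hypothetical sink name, defined elsewhere in the config
```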
So there actually are a whole lot more options here, but this is the general concept. Moving on, I'm going to show a few more examples, from different categories. The first category I'm going to show is how you can investigate known alerts.
So when you install Robusta, I said batteries included, so out of the box we include in Robusta remediations, investigations, and automated workflows for your common alerts. You just add one webhook to Prometheus.
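Wiring that up is typically an extra Alertmanager receiver pointing at Robusta. A sketch, with an illustrative in-cluster URL; the real endpoint depends on your install, so verify it against your own setup:

```yaml
# alertmanager.yaml (fragment)
receivers:
  - name: robusta
    webhook_configs:
      - url: http://robusta-runner.default.svc.cluster.local/api/alerts  # illustrative
route:
  routes:
    - receiver: robusta
      continue: true   # keep routing alerts to your existing receivers too
```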
The alerts get to Robusta; you can continue getting your old alerts, and you can send the Robusta alerts to a new Slack channel, and now your alerts, if they're in our library of known alerts, will come with extra information. So the first example of that is CPUThrottlingHigh. You get high CPU throttling, that's a known alert, and unfortunately, when there's high CPU throttling, it's not always immediately obvious how you fix it. You could fix it in multiple ways. Maybe the issue is that your pod has an incorrect CPU request.
Maybe the issue is that you have a CPU limit which is configured wrong. Maybe the issue really is that you don't have enough CPU and you need to add on more CPU. Maybe there's something else on the pod that's interfering, or you defined things wrong. So there are a lot of different reasons why that can happen, and essentially the enricher, the automation for this specific alert, runs a decision tree and figures out why the alert is occurring.
Moving on to the next example: let's say you have a memory leak and your pod gets killed. Now, if you have a memory leak, your pod probably gets killed, comes back up, and gets killed again a few minutes later, or an hour later, depending on how fast you're leaking memory, and that's really inconvenient. And the obvious thing you're going to want to know is: why is this pod leaking memory?
I mean, what you would do normally, if you were running a normal server back in the old days, is SSH into that server and then run something like jmap or jstack, or some other debugging tool. You would grab a memory dump and then see what's using up the memory, and you could take two differential dumps, say 10 minutes apart, and see what's allocated but not freed. So we actually have stuff like that built into Robusta to do exactly this.
So here's an example with Java, where you can see the top objects that were allocated in that application, and you can run this automatically every time, for example, that your pod is about to be killed. So that's a really cool thing that you can do, and then you get a message in Slack with the data. Now I see that there's another question here: can we also configure an event with a specific log, an event for a specific log stream, generally, with any pod as well, with Robusta?
For example: a pod is running, but we would be interested in a specific keyword or error code generated by the application. Yes, you can. There are two ways you can do this. The first way: if you're using Elasticsearch, you can define an Elasticsearch monitor; Robusta has built-in support for triggering these with Elasticsearch monitors. So that's one way. And the second way is not in GA yet. It's in beta, and I don't know if we have it public in our GitHub repository, but please message me if you need it: you can actually use built-in functionality in Robusta itself, and then you can add on the trigger within Robusta, native to Robusta,
for specific keywords. I don't believe that's in GA yet, so if you want that, then message me and I'd love to have you beta test it. Moving on to another example: we're going to look at another known alert that you can automatically investigate with Robusta, and again, batteries included, so you just install Robusta with your Prometheus alerts and this will all happen out of the box. You don't need to configure anything, but you can also write your own, of course, for your own alerts.
So let's say you have the common Prometheus alert NodeFilesystemSpaceFillingUp; in other words, you're running low on disk space on the node. Then, by default, we will run an automated workflow to investigate that, and it will tell you which pods are using the space, how much disk space is being used, how much is being used by the pods, and how much is being used by the node itself, by the host.
Moving on to the next topic: you can also use Robusta for remediation. So let's say you're auto-scaling, and your autoscaler reached the maximum number of replicas, and it's 3 a.m. and there's lots of load on your servers, and you just want to go back to sleep and do a proper fix in the morning. So you can get a message in Slack that says: okay, you reached the maximum for the HPA, would you like to raise it by 30%? There's just a button there in Slack, and you push that button,
and it does the automatic remediation. You can also run these remediations entirely automatically, without even asking you in Slack. We typically recommend that you do ask the person in Slack to confirm, but you can also do it entirely automatically. Moving on, here's another example. This one isn't quite alert response, but it also really helps with investigating stuff. We have a lot of features in Robusta around change tracking, so you can kick off these automated workflows not just when there's an alert, but also for any change that happens within the Kubernetes cluster.
So, for example, here I've set up an automated workflow so that every time you deploy a new version of your application, we go to your Grafana and add an annotation there, a vertical line, showing you that at this exact point in time, a new version of your application was deployed. And that's really useful, because then you can see a correlation between issues that occur, like CPU going up, and your deployments. To configure that, I'll show you the YAML for it.
So here the trigger is just on deployment update, so when you update a deployment, and the action is "add deployment lines to Grafana": an action that will go to Grafana and add deployment lines, and you specify here which dashboard and the API key. So you really see here the power of this as a general-purpose automation engine, one that's really, really dedicated to alerting use cases.
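A playbook like the one just described might look roughly like this. Again a sketch: the parameter names shown here (API key, dashboard ID, Grafana URL) are approximations of the action's real settings, and the values are placeholders:

```yaml
customPlaybooks:
  - triggers:
      - on_deployment_update: {}
    actions:
      - add_deployment_lines_to_grafana:
          grafana_url: https://grafana.example.com    # illustrative
          grafana_api_key: "<your-api-key>"
          grafana_dashboard_uid: "<dashboard-uid>"
```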
And the last topic I want to speak about is: okay, what's going on under the hood? It's great that I showed you can fetch the logs, and you can add an annotation to your Grafana, and you can do all these things, but what if you want to do something that isn't one of our 70 built-in actions? Well, it's really easy to write your own actions; you just do it in Python. Under the hood, everything is just Python functions, so here's an example for the logs enricher that you saw earlier.
It's just a Python function, and here you get this event; that's the event that you get from the trigger. So you get the data with the pod: it's a pod event in Python, you can say something like event.get_pod(), and here you have the pod, and then you call a function like pod.get_logs(). We have a rich Python API. It's ultimately based on the Kubernetes API client in Python, with some higher-level stuff, but we make it really easy now to write your own actions yourself.
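Robusta's real API looks a bit different, but the shape of such an action can be sketched in plain Python. The class and method names below (Pod, PodEvent, get_pod, add_enrichment) are illustrative stand-ins, not Robusta's actual types:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the framework's typed event objects.
@dataclass
class Pod:
    name: str
    log_lines: list

    def get_logs(self) -> str:
        return "\n".join(self.log_lines)

@dataclass
class PodEvent:
    pod: Pod
    enrichments: list = field(default_factory=list)

    def get_pod(self) -> Pod:
        return self.pod

    def add_enrichment(self, text: str) -> None:
        # In a real system this data would be forwarded to a sink such as Slack.
        self.enrichments.append(text)

def logs_enricher(event: PodEvent) -> None:
    """Mimic the built-in action: attach the crashing pod's logs to the alert."""
    pod = event.get_pod()
    event.add_enrichment(f"Logs for {pod.name}:\n{pod.get_logs()}")

# Usage: simulate a crash-loop trigger firing for one pod.
event = PodEvent(pod=Pod("payment-api-7d9f", ["OOM imminent", "Traceback ..."]))
logs_enricher(event)
print(event.enrichments[0].splitlines()[0])  # → "Logs for payment-api-7d9f:"
```

The point of the pattern is that the trigger hands the action a strongly typed object, so the action never has to work out which pod the alert was about.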
So you can easily add on your own stuff. I know, for example, I believe the guys from WhiteHat Jr have written a bunch of actions internally to restart deployments in certain situations where they're unhealthy, and I know there are a bunch of other cases where companies have gone and written their own actions and opened PRs to add those back to Robusta. So we're seeing real interest in people writing their own actions around this, especially at larger companies.
Okay, so that's it. I'll pause here for any more questions, and I also want to say: I really love hearing from people. So if you're listening to this talk and you liked it, or if you didn't like it, then please reach out and let me know what you liked or didn't like. I really love to hear from people; it's the most satisfying thing about speaking. And I'm very active on Twitter and LinkedIn, so please add me on LinkedIn.
I approve everyone. Feel free to follow me on Twitter, and please reach out; I'd love to hear from people. If there are any questions, then I'll stop here.
B
Hey Natan, that was really an insightful session. It's really informative, and I think most of the crowd liked it; your presentation is very colorful and very engaging.
And they are liking the UI as well. I think we have got a compliment saying it's aesthetically really good, from Shakaib Arif, thanks a lot. And thanks for being open to accepting the invitations on LinkedIn and Twitter; I think that will be really helpful for people who would like to collaborate with you, and maybe work with you. If...
A
Ten seconds, yeah, I'll just show it, because I know people like it. So it's really easy to do these slides. I'm using the pro version of Canva, and they don't pay me, I actually pay them, but you can go in here and search for, say, "sleep." Let's do that in Hebrew: I can search in here for "sleep," and it'll find me graphics of people sleeping, and I can put those in. So that's really nice, and it's easy to use, so I recommend people do that.
B
Cool, cool, yeah. So I'm pretty sure people are going to have some questions. If there are any questions, I think Natan is going to be available on the Slack channel for the answers. So yeah, thanks, thanks a lot, Natan.