From YouTube: OpenShift Coffee Break | Observability with Dynatrace
Description
Get your espresso ready for another OpenShift.TV Coffee Break as we welcome our special guest Henrik Rexed from Dynatrace to share with us real world production grade observability and monitoring stories with Dynatrace on OpenShift!
A: So hi, thank you for joining us today. Henrik, would you please tell us a bit more about yourself and what we are going to be talking about today?
B: Sure. First of all, it's a pleasure to be here. For the coffee break I brought some tea — I didn't know which type of beverage I was allowed to drink today, so I have the tea and I have the coffee, in case I need to switch from one beverage to the other. Otherwise, my name is Henrik Rexed, and I am a Cloud Native Advocate at Dynatrace. My main focus at Dynatrace is everything related to cloud-native technology, of course around observability. So my focus is observability on Kubernetes, OpenShift and related technologies as well. I'm very honored to be here and to present how Dynatrace can save your life from a broken cluster. As a little teaser about what I'm going to present: I have picked two real production outages that happened to a real big brand of the industry, and we'll explain how they happened, how we can resolve them and, of course, how Dynatrace is going to help you there.
A: Thank you so much, Henrik. I really appreciate the fact that it's going to be, I would say, pragmatic, because we're going to be talking about real-life situations and understand how the monitoring stack that you provide can help address those issues. Okay, so without further ado.
A: Maybe just to set the context: I've been working on Kubernetes and OpenShift for many years — about six years now — and I have been in those situations where you are trying to understand what the hell is going on, and because those technologies use so many layers, sometimes it can get really hard to get back to the root cause and really understand whether it is coming from the cluster, from networking, from storage, from an application — is this a performance issue, is this a service that is completely down, etc.? So I hope that by the end of the session we can better understand how to troubleshoot those kinds of issues and better investigate, I would say, the problems that we face when we are working with a Kubernetes cluster or OpenShift, for instance. So okay, let's go ahead.
B: All right, let's do that. I already introduced myself, so I'm not going to go over that again, but just a small teaser: prior to Dynatrace I was working as a performance engineer, trying to optimize systems and understand issues — that was my main goal previously. I still do that with observability, of course, but because performance engineering is still in my heart and I love performance engineering activities, I am producing some content for a podcast and YouTube channel called PerfBytes.
B: So if you want to learn about performance engineering, check out PerfBytes. Otherwise, a year and a half ago I started another YouTube channel called "Is It Observable" — you can hear it in the name, it's about observability. I'm producing content about how to get started with tools like Fluentd and Fluent Bit, what a service mesh is, a couple of things always related to observability in general, so check it out. The channel is quite young and it deserves feedback to improve the content.
B: So what are we going to learn over the next couple of minutes — or almost an hour? A couple of things. The two production issues that I have selected today come from a well-known brand of the industry, and they are both related to service mesh. Service mesh is not bad — quite the opposite, it's great — but there are a few things that you have to consider when you start working with a service mesh. So we will see those problems in detail, and we'll see how we could have avoided them if we had been using Dynatrace. It will also be an excuse for me to share the latest and greatest news about the Dynatrace product: we'll see that Dynatrace has an AI engine called Davis, and we'll see how Davis can detect those problems based on Kubernetes data. All right.
B: So that's one of the things a service mesh does: it manages the communication of your microservices. When you start building a microservices architecture, first you need to provide features — you focus on building those features in the code — but your microservice is going to interact with other services in your cluster. So there are a couple of things that have to be handled, whether we code them or not. There is the retry logic, and the retry logic is very simple: I have service A and service B, I try to reach out to service B, service B is not responding, I need to do a couple of retries and, after a number of retries, I will throw an error. The second use case that we need in microservices is authentication: if I need to authenticate with service B in a specific way, I probably need to code that piece in my code, so I will have to handle that. And last but not least, there is also a crucial component of security: SSL certificates. I may not code that — I will probably ask someone to generate a certificate — but again, it's great to have a certificate, and we have to rotate it to make sure that it stays secure in the long term.
B: Of course, there are other features that are important: we need to get some observability out of our microservices; if we do blue-green deployments we probably want to be able to do traffic splitting; and there are many other types of features related to microservice communication. So the main advantage of a service mesh is that it handles those features without any extra line of code.
B: Basically, you build your application container with only the features that you need, and all the rest is managed by a sidecar proxy that handles those different features — observability, security, authentication and so on — through that sidecar proxy. The service mesh handles the communication by adding new CRDs, and those CRDs, as we'll see, help us configure the service mesh itself and inject the right proxy rules into our different pods.
A: In fact, yeah — sorry, Henrik, maybe just for the people who are not really familiar with what CRDs are?
B: Yeah. What I was saying is that if you want to configure your different proxies, of course you're not going to touch every single instance of each pod and add an extra container there by hand — no, it doesn't work like this. A service mesh has a control plane; it's the master component of the service mesh, I would say, and you interact with the control plane.
B: You define: I want to do traffic splits, I want to define virtual services, I want to do SSL. You configure everything from the control plane and then, when you deploy a new pod or a new workload in your cluster, the control plane injects the proxy automatically into your pods, which is fantastic — it means I don't have to code anything. The features that I need are injected automatically on our workload. All right.
B: So now you know most of the details required to follow the story. Once upon a time, a big-brand booking platform that you have probably heard of — and probably used for your vacations or even for business trips — had a major production outage. And what was the issue?
B: All the nodes of the cluster were fully saturated. At the end of the working day everyone went home, and just before they left everything was perfect; when they came back the next day, everything was completely crushed. They discovered that morning that they were facing an abnormal number of pods in the cluster — more than 500 extra pods had been added to the cluster, and they were doing nothing.
B: So they were naming those pods "zombie pods": pods that stayed running in the cluster, doing nothing but eating resources. So what was the initial reason for that behavior? First, this company tries to provide recommendations on their website, and they do some analytics through a recommendation job. So they have some cron jobs running continuously in their production environment that collect some behaviors and provide the right recommendations, and they had been doing that for many years. And just before the incident, they had deployed a fresh new service mesh.
B: The nodes didn't have any memory or CPU left, and the cluster was very hard to operate; even more than that, other workloads had difficulty being scheduled, so it was impacting more than just that small namespace or that small application. So let's have a look at cron jobs. I don't know how much you know about them, but cron jobs allow you to schedule a job in your cluster, and the definition of the object in Kubernetes is very simple.
B: You define a template with the container that will hold your job — that will basically run the batch that you have designed — and the idea is that, as designed, it's scheduled: Kubernetes launches a pod, it runs, and once the pod has ended, the pod is deleted and the memory is released.
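To make the shape of the object concrete, here is a minimal CronJob sketch of the kind being described — the name, schedule, image and command are hypothetical placeholders:

```yaml
# Minimal CronJob sketch: a batch container that runs and exits.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: recommendation-job          # hypothetical name
spec:
  schedule: "*/15 * * * *"          # run every 15 minutes (illustrative)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: batch
              image: example.com/recommendation-batch:1.0   # placeholder image
              command: ["/bin/sh", "-c", "run-analytics"]    # placeholder command
```

Once the batch container exits, the Job is considered complete and the pod is cleaned up — which is exactly the assumption the injected sidecar breaks, as described next.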
B: The problem is that when you have a sidecar proxy, your job ends, but in the pod you now have two containers: one container that has finished its task, and a second container, which is the proxy — and the proxy is a long-running process, so it never ends on its own. So even if our batch has ended, the proxy is still running, and therefore our pod is not deleted by Kubernetes.
B: The second one is the concurrency policy "Replace". If you define this, Kubernetes, instead of adding a new pod, will replace the existing one. So in the end you will have at most one pod running for that given job — or more, if you allow parallel jobs — but again, it's going to have far less impact than thousands of pods continuously being scheduled in the cluster.
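As a sketch, that mitigation is a single field on the CronJob spec (the rest of the object is as in the earlier example):

```yaml
# Limit the blast radius of stuck jobs: replace the previous run instead of piling up pods.
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Replace   # alternatives are Allow (the default) and Forbid
```

Note that this does not fix the root cause — the sidecar still keeps each pod alive — it only caps how many stuck pods can accumulate.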
B: All right, so that's the really simple configuration aspect. Now, how can we actually resolve this? Because in the end we need to force the proxy to end. The first solution — which is not the acceptable one, but let's have a look at it — is using a file: I have my app running and, when my app is ending, I write something into a text file, and the proxy looks at that file on a very regular basis.
B: And if there is something interesting in the file, it will basically stop the job. All right, that seems doable, but again, if I'm using Istio, it means I would have to rewrite the Envoy code, and I don't want to go in that direction for sure. So what I will do instead is take solution 2, which is perfect, because in fact, if I look at it, all the service meshes of the industry — Istio, Linkerd and the others — have thought about that problem.
B: They have an endpoint on their proxy to which you can basically send an HTTP POST request, and it will kill the current proxy. So in the end my app is running; once I have finished my batch, I send an HTTP POST locally, inside the pod — depending on the service mesh, you will have different endpoints to interact with — and this shuts the proxy container down, and boom, problem resolved.
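A minimal sketch of that pattern, assuming a Linkerd-style proxy whose admin port exposes a shutdown endpoint; the exact port and path differ per mesh and version (for example, Linkerd's proxy admin endpoint listens on 4191, while Istio's pilot-agent exposes /quitquitquit on 15020), so check your mesh's documentation before relying on them:

```yaml
# CronJob container that runs the batch, then asks the injected proxy to exit.
containers:
  - name: batch
    image: example.com/recommendation-batch:1.0   # placeholder image
    command:
      - /bin/sh
      - -c
      # Run the job, then POST to the sidecar's local admin endpoint so the pod can complete.
      - run-analytics; curl -s -X POST http://localhost:4191/shutdown
```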
B: Okay, so now we've seen how we can resolve it, but how can Dynatrace help you here? A couple of things. First, Dynatrace, as you probably know, is an observability platform. In the cloud-native space, and especially on Kubernetes and OpenShift, we have an operator that we deploy in the cluster, and that allows us to collect the various pillars of observability.
B: So the logs, of course; Kubernetes events, which are a rich source of information; metrics; traces; and way more — also continuous profiling, if you want. What is the value of that operator? A couple of things: you can deploy it from a UI, you can deploy it from Helm charts — there are plenty of different ways of deploying it. We need to have visibility of the various objects of the cluster, because we have a couple of different components that will be deployed.
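For orientation, the operator is typically configured through a custom resource. A heavily simplified sketch might look like the following — the resource kind, field names and capability values reflect my understanding of the Dynatrace operator and may differ across versions, so treat it as illustrative rather than authoritative:

```yaml
# Illustrative DynaKube custom resource: full-stack OneAgent plus an ActiveGate
# that talks to the Kubernetes API (field names may vary by operator version).
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: dynakube
  namespace: dynatrace
spec:
  apiUrl: https://<your-environment-id>.live.dynatrace.com/api   # placeholder tenant URL
  oneAgent:
    cloudNativeFullStack: {}        # instrument nodes and application pods
  activeGate:
    capabilities:
      - kubernetes-monitoring       # query the Kubernetes API for cluster/workload health
      - routing
```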
B: So the first thing that we deploy is an admission-controller rule, so that every time a pod is scheduled we are able to inject the right code module that will instrument your code. The other thing we can do — in fact we are quite flexible — is deploy it with different observability modes. We have the full stack, where we instrument your code and also look at your nodes.
B: We have an application-only mode, where we just instrument your pods and that's it, so it's pretty flexible. But in the end we always have a component called the ActiveGate, and this ActiveGate is a component that basically interacts with the Kubernetes API to get the health of the nodes, the workloads, the namespaces and more. So this component is crucial, and there are plenty of features related to that ActiveGate component. Dynatrace also provides out-of-the-box alerting for OpenShift and Kubernetes.
B: If you look at a cluster, a cluster is like an onion: there are different layers. You have the user sitting outside of the cluster; you have the cluster itself; you have the nodes and the namespaces, and within a namespace you have workloads, which spawn pods, and pods have containers. We have built predefined alerting that helps you keep track of what's really going on in the cluster.
B: First, if you're outside the cluster — meaning you're a user — we want to pay attention to the response times and the failure rates, because if there is something like a memory problem, or heavy throttling, then the response time will be impacted. So our AI engine, called Davis, will automatically look at those components.
B: So you'll be aware of what's going on, and the same for the workloads. Even better: if you have some workloads in a pending state, it's a sign that there is a memory problem or a configuration issue; if your pods are not ready, or they keep restarting in a CrashLoopBackOff error, you sometimes want to be alerted quite fast — those will be handled automatically. All right, I get your point, I know that you want more — that alone is not enough to operate your cluster — but don't worry.
A: And so just a quick question, Henrik: is there a way to define your own metrics — well, not metrics, but your own alerts based on some existing metrics?
B: Of course, of course. Dynatrace is a huge platform — to be honest, today you can do tons of things — but we have lots of customers that operate tons of clusters, and they don't want to go and build their metric expressions and then add the thresholds, because it would take a lot of time. But of course, if you know exactly which metrics you want, you have the ability to do your own custom anomaly detection.
B: We call it alerting, but of course it's anomaly detection, and it will also be taken into account. What is great as well — and that's pretty much what I was going to say on that slide — is that when you operate clusters with Dynatrace, you probably have more than one cluster; you probably have 10, 20, 50, 60 or 100 clusters that you have to operate.
B: Usually what happens is that I have some alerting that is predefined globally for Dynatrace, but it happens that some clusters are special — let's say this is the production cluster managing our AI discovery or our analytics — and I have different types of thresholds, different types of requirements, so I want to change the alerting just for that cluster. You have the option to customize the alerting at the cluster level, but also at the namespace level.
B: So it's pretty flexible — we have different levels of configuring those alerts: either through the out-of-the-box alerting like I mentioned, or through a metric expression where you extract the metrics and define the alert on them; you can also extract something from the logs, convert it into an event, and that event could become an alert as well. So it's very flexible for creating alerting in Dynatrace.
B: One thing that we did: this out-of-the-box alerting was something that we provided to our users — I think they have had it in their hands since last week or something like that — and we also created predefined dashboards. For your Kubernetes clusters we also provide out-of-the-box dashboarding. Of course you can create your own dashboards — you don't have to stick to those — but those dashboards are designed to give you pretty much a cluster overview.
B: A workload overview, the namespaces, and, what is great, we have improved the user experience so that from a dashboard you can easily jump to the right screens, and from the screens you can go back to the right dashboards with the right filters. So there are a lot of small improvements, but it helps specifically when you have to troubleshoot. So enough of talking — let's go to something live and let me show you the screens. For this I have prepared an environment.
B: I have deployed a couple of things — of course, this is just a suggestion, and here's the GitHub repo related to this environment. I have two namespaces with two applications. One is the OpenTelemetry demo application provided by the OpenTelemetry community, so it's fully instrumented with OpenTelemetry, and I'm using the OpenTelemetry operator to be able to deploy OpenTelemetry collectors to forward the traces and the metrics back to Dynatrace. I also have Linkerd here — I decided today to use Linkerd.
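As a rough sketch of the collector wiring being described — an operator-managed collector that receives OTLP from the demo application and forwards it to a Dynatrace tenant — something like the following; the endpoint URL, token handling and exact CR fields are assumptions based on common OpenTelemetry operator usage, not taken from the demo repo:

```yaml
# Illustrative OpenTelemetryCollector resource forwarding traces/metrics to Dynatrace via OTLP/HTTP.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway                # hypothetical name
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      otlphttp:
        endpoint: https://<your-environment-id>.live.dynatrace.com/api/v2/otlp   # placeholder
        headers:
          Authorization: "Api-Token <token>"   # placeholder; store the token in a secret in practice
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlphttp]
        metrics:
          receivers: [otlp]
          exporters: [otlphttp]
```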
B: For this specific talk, this use case — the story would be the same with Istio, but it's just to highlight the usage of a service mesh and reproduce the pattern that I just explained. And by the way, just this morning I almost killed my cluster, so I was trying to recover just before we connected. All right, so let's see that in detail.
B: Where is my browser? Oh no — where is it? Here it is, okay. So first, let's start everything from the dashboard perspective. One small note: when you are using OpenShift, you have an advantage compared to, I'd say, a user who uses another flavor of Kubernetes.
B: OpenShift gives you details on the master nodes — lots of things that are very important, especially when you have to operate. So we have a predefined dashboard for the control plane of the OpenShift cluster; we also have a dashboard for etcd. But in my particular case I want to show you the dashboard that we have for the cluster. So here I have two clusters deployed in this environment.
B: I'm able to easily jump to my cluster details, and from the cluster details page I can see the cluster overview: how many nodes I have, the workloads, the various events happening in the environment. And what is great is also the node perspective. By the way, you can see here that this morning, when I connected, I saw that I was about to lose everything, so I was able to recover just before it was too late — but that's a small note.
B: So here, if I look at the nodes, I can see their status and the usage of each node — in my case I'm using cloud-native full stack, so I have the nodes instrumented and the app instrumented. Then, if I jump into the node perspective, I can see all the details of what's going on. In fact, in this particular cluster, for example, we can see that we have a spike here — you can see it in the CPU usage, CPU user.
B: Oh, now I need to zoom out. Okay, let me go to the last two hours — I'm surprised. Let's call it a demo effect. So let me grab this piece here, where we have those spikes in blue, and you see the system load here quite often has some spikes. So let me grab that CPU load and look at it here.
B: What I'm going to do here is say: I have some pattern that I see in the graph, and I want to understand what is actually causing it. So I'm going to ask the AI engine, Davis, to analyze all the metrics. No correlation — all right, that's a bad example. So let's take this spike here; let's do this traffic one.
B: Davis is analyzing all the processes that are actually running on this host and it will correlate, and it says: first we have a process CPU usage — it's the load generator service. So we can see that it has already attached it to the right process. So now, just by looking at this specific graph, I know that this behavior was related to one specific process — and in our case, because we are running in a cluster, it's going to be a pod.
B: So once I have that information, I can go to where we have this load generator service. In fact, here I have limited the impact — because if I hadn't limited it, I would probably have 50 pods running here — but I have limited the impact, and this is the cron job that I explained, the one that was basically consuming resources. As you can see, at the moment we already have three of those pods running and eating my resources, so that's basically how I recovered this morning, to avoid losing my cluster entirely. And what is great here: you can see that on this particular workload we already have an alert saying that the workload is not ready, and, as you can see, there are also exclamation points at the top, and we have all those alerts that are going on.
B: First, I had a notepad ready — you can see that for all the various workloads we have a couple of pieces of information. We also have a failure-rate increase, and that's different: it means that something has impacted the user perspective. So here Davis is trying to analyze and figure out who is responsible for that behavior. As you can see, with all the patterns — at least the ones you're aware of — we have pending pods.
B: We also have this hipster-shop namespace that has a resource-quota saturation, so I can easily go back to the right namespace and look at that problem in detail. As you can see, I was not able to schedule the load generator services on this one.
B: So, depending on where you are, you can easily jump to a workload, to a service or to the pod. Here I'm on the workload screen; I can easily get the service response times, or I can also look at the pod itself, jump into the pod screen giving me the CPU, the throttling, all the different details, and look at the logs and the events of this particular container, this particular pod, and so on. What is great is that we also have the notion of services, so here I can see the definition of that particular service.
B: And if it's been instrumented — here we have a relation with the application fully instrumented by Dynatrace — once it's been instrumented I can see the traffic that has been happening on this particular pod, and I can even go further and say: I want to look at, let's say, this particular request, /cart.
B: I want to look at the distributed traces. So here we'll see all the traces that have been collected during that time frame. I'll grab this one, for example, and I will see exactly all the steps related to that specific transaction. So you can do tons of things, from the infrastructure perspective all the way to the application layer. All right, so that was the first story. I'm looking at the time, because I talk a lot — but we still have time, all right.
A: So, Henrik, if you don't mind, can we just pause quickly there, because you've shown a couple of really interesting features that provide a lot of value when trying to troubleshoot those issues.
A: As most people who work with Kubernetes know, there are already some community tools that go hand in hand with Kubernetes to do monitoring and to provide some kind of dashboard with graphs, resource consumption, etc. — like Prometheus and Grafana, and the combination of those — but I think what you showed here is really something powerful, because those other types of tools don't really provide the drill-down capability, where you say: okay, I see there's a spike here.
A: But let me drill down and really try to understand what's going on. They will give you the information: okay, today at 10 a.m. there was a spike in CPU. But how do you correlate that to the actual workloads that triggered that spike in CPU, or that spike in memory consumption?
A: And I think here, because you are just clicking through, etc., it's really a powerful capability, I would say — especially knowing that there's some analysis happening in the background to say: okay, this is the normal behavior. Because you have trends, I guess, and the AI tools are observing that this is the normal behavior, nothing has happened — and now we can detect that there's a spike because there's a big shift compared to the normal behavior, I guess.
A: That's also where the AI, you know, magic happens: it's able to understand that abnormal behavior, and it's also able to correlate it to other things that happened elsewhere and say: okay, this thing that you see here is basically caused by something that was down there. It would take, I would say, a lot more investigation to be able to get there if you are using traditional tools, I guess.
B: Yeah, I mean, that's true. I think alerting based on thresholds is great, but sometimes you also want to understand whether you had a change of behavior, like you mentioned, and Davis has a baselining approach where it detects changes over time and, if there is a change, it will basically try to understand why we have this change. But moreover — I mean, this is a demo environment, so...
B: ...well, it's my own tenant — but if I click on a real alert impacting users — maybe I have another example, let me check — when we have a problem impacting users, what Dynatrace will do is try to estimate the number of users impacted by that problem, and also, if you're using cloud regions, you will see: okay, only these users — say, thousands of users out of whatever total — were impacted, specifically in the region of New York.
B: So it will also help you figure out who has been impacted, and it goes even beyond that. I think when you are troubleshooting, usually there is fire, there is pressure, so you need to be able to take decisions in time, and I think if you have the right tooling and the right information in your hands, then you can make the right decision. I also didn't mention it, but once we have a problem with a pattern that we have detected, that we understand and control...
B: ...we can trigger a remediation. So we can reach out to Ansible and say: hey, do this — because I see, I don't know, a pod saturation on the quote service, do this, because I have a specific playbook that I want to run — and that is possible. So you can even think about auto-remediation processes fully triggered by the AI, which I think is fantastic, because you can recover in a faster way.
A: That's really awesome — and actually that was one of the two questions that I wanted to ask for that section.
A: So that's, I would say, the first one, and it's pretty amazing to be able to do that, because, say, for instance, you said that the pods are being spread across the nodes, and, as we probably know, there is some IP allocation that happens for the pods, and that creates some files down there in system files for the containers, etc., and getting rid of that takes some simple troubleshooting.
A: Well, of course you have to kill the pods, but you also have to remove those allocated IP addresses if that doesn't happen properly via the kubelet, etc. So that's one example where having that Ansible integration — where you say: okay, we have reached our limit on pod scheduling capacity, so there's this thing that we need to trigger to be able to remediate that issue — is really, really good, you know, to be able to associate those.
A: My other question was: of course, as you said, the sooner you know what happened, the better off you are — the more time you have, of course, to address the issue — and so my question was regarding notifications, emailing. What systems can you connect to in order to send out, basically, a critical issue? Do you connect to things like Slack, or any messaging or, I guess, emailing systems?
B: Yeah, we have a Slack integration, and we can send notifications in various ways. We used to have — I mean, this was a couple of years back, people are less interested in this now — this Davis component that was running on Alexa and Google Home and others, so you could say, "Hey Alexa, give me the status of my environment," and Alexa would do it, and then, "Alexa, open the dashboards and show me the problems."
B: So I mean, it was nice. I don't think people are really using it in real production.
B: But it's a way to showcase that we are able to send notifications to various systems, and then you can make different types of usage of those notifications.
B: So, I'm just looking at the time — we have 20 minutes, so I'm going to try to be a bit faster on the second one. The second horror story that I have is, again, the same company, again related to recommendations, but here it's more on the UI side: how we are going to display the recommendations to the users.
B: So here is the problem that happened: the recommendation service relies on a back-end service, and this back-end service, for weird reasons, started to crash in the middle of the day. When they started to troubleshoot and understand the problem, they discovered: oh, we have placed a CPU limit on a given workload, and that CPU limit has caused an OOM kill. That's weird, right?
B: When the problem occurred, we can see — we have extracted the graphs showing the consumption — that we had a spike of CPU usage, but at the same time we had an increase of memory usage as well. So how could that happen? In fact, a couple of things: the ad service is Java-based, and Java-based means garbage collector.
B: So let's say that we are running with a memory usage which is quite high; to be able to release the memory, the Java app will naturally launch garbage collection to clean up the memory actually used by the Java application. And here we are using a service mesh, so when we do the communication, the pod has to go through the sidecar proxy, and the recommendation service — there are several communicating services — will reach out to the ad service.
B: But at that particular moment the ad service was running into memory issues, so the garbage collection was trying to run — but there was a CPU limit and, as you may know, when you have CPU limits defined and you reach the limit, you get throttled. So if I run a garbage collection and I am being throttled, it means that I'm not able to clean up the memory, because my task keeps getting paused: I try to clean up, and I'm paused; I try to clean up, and I'm paused — so the memory is never released.
B: So what was actually happening is that the sidecar proxy sitting next to the Java application did not have any CPU limit. So when requests were coming in, even if the pod was suffering, the traffic was still being sent to the pod — and sent, and sent, and sent — until a moment where there was no way of freeing up the memory. So what happened? We reached the memory limit, and Kubernetes basically killed the container. So this is how it could happen.
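To make the mismatch concrete, here is a sketch of the shape of the pod after injection — the application container carries requests and limits, while the injected proxy carries none, so nothing slows the traffic down while the throttled JVM falls behind (names and values are illustrative):

```yaml
# Illustrative pod spec after sidecar injection: limits on the app, none on the proxy.
containers:
  - name: adservice                       # Java-based application container
    image: example.com/adservice:1.0      # placeholder image
    resources:
      requests: { cpu: 200m, memory: 256Mi }
      limits:   { cpu: 300m, memory: 512Mi }   # GC gets throttled against this CPU limit
  - name: proxy-sidecar                   # injected by the mesh control plane
    image: example.com/mesh-proxy:stable  # placeholder image
    # no resources section: the proxy keeps forwarding traffic unconstrained
```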
B: Oh yeah — the very famous OOMKilled event, that's true. And the only way to avoid that situation is to define precise requests and limits for your sidecar containers. But, as I explained at the beginning, you don't define that yourself, because the sidecar is injected through the control plane. The good news is that every service mesh has annotations that you can add in your workload definitions where you can define the actual CPU and memory limits for the proxy.
B: So for Istio there is an annotation for the Envoy proxy, and Linkerd also has specific annotations, and this is a way of controlling it. Which means: if I define the right CPU limits in this particular case, then I avoid the situation where my proxy is basically causing the death of my container.
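As a sketch of those annotations — to my knowledge Istio uses `sidecar.istio.io/*` annotations and Linkerd uses `config.linkerd.io/*` annotations to size the proxy, but verify the exact keys against the mesh version you run:

```yaml
# Illustrative pod-template annotations that size the injected proxy.
metadata:
  annotations:
    # Istio / Envoy sidecar sizing (assumed keys; check your Istio version)
    sidecar.istio.io/proxyCPULimit: "200m"
    sidecar.istio.io/proxyMemoryLimit: "256Mi"
    # Linkerd proxy sizing (assumed keys; check your Linkerd version)
    config.linkerd.io/proxy-cpu-limit: "200m"
    config.linkerd.io/proxy-memory-limit: "256Mi"
```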
B: So how can Dynatrace help you? Similar to what I showed you at the beginning: we use Davis — Davis is our AI, we have had it for a couple of years — and we started to add Kubernetes events into the algorithm that Davis takes into account. So everything related to throttling, for instance CPU throttling: as you know, if you are throttled, then your program is being paused, so you are not able to do actual work, and usually it impacts the response times as well.
B: The second thing is: if you have OOM kills, or out-of-memory events, what happens is that your pod is killed and you generate some failures, because the pod is not responding anymore, so you will probably get some errors from other services. The same thing if you have evictions — of course, there will also be an event. But there's also something very important: whether you have an OOM kill or some throttling, it may mean that I have decided to change something.
B: We can see where the problem has appeared, so it even shows you where that behavior is actually happening. The second thing I mentioned: we also have these screens — sorry, I showed them before — where we are adding more and more things; we will have the response time, the failure rates and more. What we want is that, from the moment you have a service mesh embedded in your application, you get this visibility at the service level.
B: You can see that we have 16 users impacted and 90 affected service calls, so we have the details of which services are actually having the problem, and in the end we see that the root cause here is an OOM kill. So it has detected that we have a failure-rate increase in this particular case, and that failure-rate increase was introduced by this OOM kill.
B: So in the end, by looking at the problem card provided by Davis, you know exactly: okay, I have an impact on my users, and those users were impacted probably because the workload definition is different and we have some more problems. So there are a lot of things that we provide through Davis.
B: The exclamation points are basically all the problems that Davis detects, and then you can walk through the problems and look at the response-time degradations or all the other related problems. But there's also something else that we have introduced — not related to our topic today, but I just want to mention it. If you were wondering what the icon next to it is: we also do runtime application scans.
B: So here, for example, we see that we have eight vulnerabilities in our environment, and we can drill in and look at all the containers that are running. This is not offline scanning, it's runtime scanning, and based on this we can figure out problems. For example, here we have HTTP request smuggling, so we can see the problem in detail, where it's been happening, and look at the details of it. So we also cover the security aspects on top of the troubleshooting.
B: So if troubleshooting could be impacted by a security issue, you will also be aware of it through that module. All right, I think that's it for this, so very briefly, the key takeaways. First, as you saw here, when you use a service mesh there are things that are introduced due to the fact that we're using sidecar containers. Sidecar containers are great — they provide lots of great features — but they can also have side effects, so you just have to be aware of those.
B: So, very simply: define resource quotas on the namespace — I mean, you know, it's important and it will prevent a lot of disasters. Defining limits on your sidecar containers is also crucial because, as you saw, if you don't do that, for some reason the sidecar could introduce some problems. If you use cron jobs, again, if your service mesh is going to inject the sidecar proxy, make sure to create the right process to be able to actually stop the job and not keep it running forever.
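For the first takeaway, a minimal sketch of a namespace-level guard rail (the names and sizes are placeholders; size them to your own workloads):

```yaml
# Illustrative ResourceQuota: caps how much CPU/memory and how many pods a namespace can consume.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: recommendation-quota     # hypothetical name
  namespace: recommendation      # hypothetical namespace
spec:
  hard:
    pods: "50"                   # e.g. stops a runaway CronJob from piling up hundreds of pods
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```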
B: Otherwise, in terms of Kubernetes, of course, if you don't use a system like Dynatrace, make sure to have the right alerting on the cluster, on the nodes, on the namespaces and on the workloads, and make sure to also consider events in Kubernetes, because events are a rich source of information for understanding what is actually happening.
B: One more thing: I have this "Is It Observable" channel covering different topics. If you want to see more about service mesh, I have dedicated content about it, so check it out. Again, the channel is quite young and the content could be improved — it deserves feedback — so check it out and reach out to me.
B: Let me know if you have some recommendations for me. A few small housekeeping notes, just for those who are not aware: we have a joint webinar between Dynatrace and Red Hat happening on Tuesday the 29th, so next week — check it out if you want to see "Automating the Cloud Native Enterprise with Dynatrace and Red Hat" and register. I think a colleague of mine, the awesome Christopher, is presenting, so I would definitely recommend connecting to it; he always has great stories. A couple of additional resources:
B: Of course, we have produced a lot of content about Kubernetes in general, OpenTelemetry and things like this — check it out, it's available on the Dynatrace YouTube channel. If you want to learn more and see more of the product in live action, we have a series called Performance Clinic that has been renamed Observability Clinic, so check it out — there's a lot of content that might be interesting for you. All right, if you have any questions...
A: I've seen a couple of features where I'm not sure I totally understood what was meant, so I'm rephrasing to make sure that's what you said. You said, for example, there's an issue with the OOM kill that happened after a change of a specific setting — for example, you decreased the memory setting in the YAML files, etc.
A: Is my understanding correct that Davis is able to hold the definitions of the different YAMLs that have been used, compare them over time, and correlate what works and what doesn't work, and say: okay, we had this YAML definition, it was working fine; now we have a new YAML definition with this change in this field and it doesn't work — and it says: okay, this is maybe what introduced the error? Is that how it works, or did I get it wrong?
B: I don't have all the details, but yes, the idea is this: sometimes, by making a workload definition change, we don't measure the impact properly, and we can introduce throttling or OOM kills — or even, say, I make a human mistake: I make a typo in the name of the image, so instead of version 1 I put version 12 because of a typing mistake, and we generate a CrashLoopBackOff error because of that. Then Davis will notice.
B: It will look at the definition that we had previously and say: hey, maybe that's the issue. So the idea is to say: maybe the issue is related to this. Same thing if we have heavy throttling impacting users — in that case it would say: okay, the throttling is heavy, and in fact we just saw that yesterday you changed your workload, so it may be related to the fact that you changed the CPU limits, for example. And the same thing for OOM kills.
B: So if I change the memory limits, I may also get OOM-killed because my pod is facing an out-of-memory situation. So I think it's still useful to be aware that our change has somehow impacted the behavior of our services, so that we can basically do the right thing faster than if we had to dig into all the metrics that we have.
A: Okay, that's really nice. So yeah, it's sort of what I understood.
A: I just wanted to make sure that was it, because it's really nice to be able to compare data over time and provide that feedback. Otherwise you would have to go and dig into maybe your Git repos — if you're using GitOps to deploy and things like that — and start comparing manually to see what the diffs are, etc. Here it seems like Davis is doing that sort of comparison for you and hands you some hints about what could be the issue.
A: So that's really cool. My other question — I'm waiting for questions in the channels, but since I don't see any coming, I'm asking my own ones, and hopefully they will be useful for the audience as well: is Dynatrace able to provide some hints about things like, "Oh, you forgot to define CPU requests and limits, etc., for this workload"? Is it able to say, "Okay, maybe you should try this or that"? Do you have that kind of approach, or no?
B: At the moment we don't do that. It's not something that we're looking at right now; we are more trying to take into consideration all the objects and components and the impact of those components in a given situation. So we are mainly focusing at the moment on increasing the value that we provide when observing Kubernetes clusters.
B: We have not yet touched this use case of helping users tune their workload definitions, so that is not there yet. I know that there is a partner of Dynatrace called Akamas — they come from the performance engineering world and they were doing auto-tuning. When you do performance tuning activity, you want to tune, and to do that you run a load test, you look at the measurements and the results, and then you make some change.
B: They have an AI engine that does it automatically for you, so in eight hours they are able to achieve something like 25 percent of performance improvement or cost reduction with the AI. And now they have introduced a new feature where they do live tuning of your clusters: they take some sensitivity tests, they run them, and then, based on those results, they make a decision to make a change. They don't always change things directly, because maybe people would be concerned about that.
B: They will make a commit in your repo and then, of course, if you approve it, it will be committed, and then, if you have a GitOps deployment process, it will be deployed automatically in the cluster, and then you will be able to see the benefit of the recommendation provided by the Akamas AI engine.
A: All right, okay, cool — thanks a lot. Final question: we've seen a lot of capabilities around using the tool to actively monitor the workloads — you know, you as a person using the dashboard, etc. — but how can you plug that into, say, a DevSecOps pipeline to try to identify some of those issues earlier, during the staging and other phases, before actually getting into production environments?
B: That's an excellent question. Dynatrace has a couple of open-source initiatives that we have launched over the years. There's one called Keptn — Keptn has now changed, so we have version one and version two.
B: Version two is called Keptn Lifecycle Toolkit, and the first one was a platform that could do either CI/CD, or just remediation, or just quality gates. And here what I want to emphasize is the quality gates — we have that component.
B: The notion is that you may want to evaluate a dev or test release and be able to promote your release from a dev cluster to staging and then, if staging goes well, make it a release candidate for production — and do that automatically. Keptn can do that: it has this component of quality gates.
B: You define your SLIs, you define your SLOs. My SLI — my indicator — could be in Dynatrace, so it could come from a dashboard or from a metric expression query, a way of querying a specific metric; or it could be in Prometheus, so we can do a PromQL query.
B: So basically you define your SLIs, you define your SLOs, and you can have maybe 10 to 15 or 20 SLOs covering availability, network, security — the different requirements that you cover with the various SLOs, with different weights — and in the end every SLO gives you points, and you can say: okay, if I have a score in the dev environment of 90 out of 100, then yes, automatically deploy to staging. So then you can think of lots of automated processes where you deploy, you run some checks, and you can go further with a new version.
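A rough sketch of what such a quality-gate definition can look like — this follows the Keptn v1 slo.yaml format as I understand it; the SLI names, criteria and weights here are invented for illustration:

```yaml
# Illustrative Keptn SLO file: each objective scores an SLI, and the total score gates promotion.
spec_version: "1.0"
comparison:
  compare_with: single_result
  include_result_with_score: pass
objectives:
  - sli: response_time_p95              # hypothetical SLI defined in the matching sli.yaml
    pass:
      - criteria: ["<=+10%", "<600"]    # no worse than 10% vs. last run, and under 600 ms
    warning:
      - criteria: ["<=800"]
    weight: 2
  - sli: error_rate
    pass:
      - criteria: ["<1"]
    weight: 1
total_score:
  pass: "90%"        # e.g. promote to staging automatically above 90
  warning: "75%"
```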
B: With Keptn Lifecycle, we are basically part of the cluster — we changed Keptn to make it easier to deploy and easier to configure — so we hook into the Kubernetes scheduler. Every time you deploy, we do pre-deployment checks, then we deploy, and then we do post-deployment checks, and those checks are very flexible: you can run jobs, you can run scripts — there are plenty of things you can do. So, for example, if I'm deploying my service and it relies on my database...
B: ...if the database is not there, then obviously my service will start, but because the database is not there, I will probably get a CrashLoopBackOff error — it will keep restarting, trying to reach the database. Basically, we can do pre-deployment checks where we say: oh, there is a dependency on the database — okay, pre-deployment check: is the database here? Yes, it is — okay, so we can deploy.
B: So the Keptn Lifecycle Toolkit now is pretty much designed to make your life easier in Kubernetes, and the great thing is, because we are part of the scheduler, you can use GitLab, you can use Jenkins, you can use any CI system on the market — from the moment something is deployed to Kubernetes, we do these checks before and after, and it makes your life easier to manage deployments in Kubernetes.
A: And you know what, actually — I have a special interest in that stack, and I think, if you're okay with that, we will want to invite you again for another session — maybe you, or a colleague who works on that topic — to go into more detail about this specific topic, because it's at the core of, you know, when you are trying to automate your CI/CD or DevSecOps pipelines.
A: All right, so thanks — thanks so much, Henrik. We are just a bit over time, so since there are no further questions, I will, again, thank you very much for your session. It was very valuable, and I hope our viewers found it valuable as well.
A: Again, if you are not already subscribed, please like and subscribe, and keep in mind that there are other sessions on OpenShift TV. You have the links to Henrik's channel as well; if you want to subscribe, feel free to go there and check out his awesome content. Thanks so much for joining us today and see you soon. Thank you guys, bye-bye.