Description
Get your espresso ready for the OpenShift Coffee Break as we welcome our special guest Liran Cohen, CTO at Sosivio, demonstrating Sosivio's integration with OpenShift for real-time monitoring, observability, automated troubleshooting, and root-cause analysis on Kubernetes.
Andrea: And welcome, everyone, to the OpenShift.TV Coffee Break. Good morning, good morning, everyone — and Stefan. We have two guests today, and I'll let them introduce themselves and the companies they work for; a little bit of a surprise here. The floor is yours, guys: why don't you introduce yourselves? Meanwhile, I'll have my coffee. It's good.
Stephen: Hey everyone, and thanks for joining. Stephen Thorn here, Solutions Architect at Sosivio — anything business, commercial or marketing related, I'm involved with.
Andrea: Excellent. So what did you guys prepare for us today? What does Sosivio do, for those who are not familiar with it?
Andrea: Let's share the screen. Cool.

Andrea: We can't leave without a demo on OpenShift.TV — we have to see the real thing at some point.
Stephen: So Sosivio has been around for a few years now. While we do work with OpenShift, we also work with any distribution of Kubernetes, and we work in completely disconnected environments, so everything we're talking about and demoing today is just another application running 100% in the cluster itself — no data offloading; it's not a SaaS tool.

Stephen: It does this in a very resource-friendly way as well. We have leveraged lean AI and ML to predict and prevent critical failures in Kubernetes environments. We have domain experts in both Kubernetes and AI — one of our many Kubernetes experts, Liran Cohen, is here, and we'll dive into the product a bit more in a second. We are a US-based company headquartered in San Francisco, California, and our R&D is led out of Tel Aviv, as you can see on the right side of the screen here.
Stephen: So what do we do? What's the problem that we're solving? I'm sure, unfortunately, some of you have dealt with troubleshooting on Kubernetes. You know that it can be a quick process, or it can take a few weeks to solve a single issue — it just depends on the complexity. The typical troubleshooting process is: you get an alert, maybe from a platform or a tool you have in your environment, or, worst case, a customer complaint.

Stephen: You then need to go in, prioritize, and understand the impact of that issue. Then you go through and identify the root cause, and once you do have the root cause — what's the remediation? You implement it, and hopefully that works and fixes it; if not, you're going back a few steps and trying to fix it again.

Stephen: Like I said, that can be pretty time-consuming. What Sosivio does is automate that process for you in a real-time fashion. I'll turn it over to Liran to talk a bit more about Sosivio.
Liran: Sure, yeah. First of all, if there are any questions from anyone in the audience, I'll gladly answer; if not, just to add something important: before Sosivio I used to be a principal architect and consultant for Red Hat. I used to travel around Europe and Israel, and actually this is where Sosivio was born.

Liran: The idea — what was missing when I was giving professional services to banks, insurance companies, automotive companies, telco companies — was that when something broke, it was very difficult to find out why, and we didn't have all the information. The information was in Kibana, Elasticsearch or Prometheus, and you had to go dig deep, try to do some kind of correlation, or pull it into an external file, just to extract all the information, put it on a timeline and try to understand what happened.

Liran: So Sosivio does all that for you in real time, meaning it fetches all the information from the relevant entities — pods, deployments, replica sets, whatever you're looking at — gathers it, puts it on a timeline, and then creates a sequence of events in a couple of minutes. In the meantime, if there are no other questions, Andrea, keep me posted.
Andrea: At the moment there are no questions from the audience. My question would be — you said, and maybe you're going to talk about it, that you can work in air-gapped, disconnected environments.

Andrea: Is there a procedure where you have to somehow bring information that comes from other — let's say anonymized — experiences or patterns into those air-gapped environments? That was my first question. You don't have to answer now; maybe it'll become clear as you go through.

Andrea: The second is: you probably have a way to estimate the resource requirements based on some parameters of the environment — the number of nodes, the number of pods, of metrics or whatever it is — and I'd like to understand that aspect a little. If people want to do this implementation, what additional resource requirements does it put on the cluster?
Liran: Okay, so let's start with the second question: resource consumption. Resource consumption is something I was really concerned about even before Sosivio, when I used monitoring tools — because if the overhead is 30-40% of cluster resources, that's a big deal, right? It's cost, you need more nodes, and then you start troubleshooting the nodes that are running the applications you use to troubleshoot other applications. That becomes crazy. So Sosivio's footprint is negligible.

Liran: Of course, it depends on the cluster size — the bigger it is, the lower the relative footprint, because we distribute the analysis over multiple pods; it doesn't happen in one big pod or in one storage location. But remind me of the first question, right, about...
Andrea: The first question was: in many cases where diagnostics is involved, especially if it involves ML, you recognize patterns, and there may be patterns that have not occurred in a given cluster but have occurred somewhere else, and you want to take that experience and bring it into that cluster.

Andrea: They may be more cluster-related than application-specific, but I would imagine there's some sort of bootstrap of intelligence — that's my mental image, so it may be completely wrong — and that you...
Liran: You're definitely right: the engines come with, let's call it, pre-trained data — sequences that they already know, malfunctions we have already detected and seen and that are embedded into the product. So you won't have to first find them on your cluster. But even if you don't have those, once a malfunction happens for the first time, we find the sequence and we store it inside your cluster, so it doesn't need to be connected anywhere to understand new sequences or new problems.

Liran: Once something happens, the pattern of severity between the different signals will appear in what we call the discovery engine, the sequence will be discovered, and from that point on that sequence will be detected every time. In the next version you will probably also get that sequence from us as part of the pre-trained data set for the engines.

Liran: And yeah — you know, most of the environments, probably on OpenShift, the ones that I used to work with, none of them had an internet connection. Sometimes it was even worse: the whole organization didn't have an internet connection, especially if you're talking about government or army organizations that are completely disconnected.
Andrea: Anything else, or shall we jump into the demo? No questions — you want to go to the demo first. Okay, perfect, let's just do that.
Liran: The main screen that we see right now has a lot of information aggregated to show you the state of the cluster. You have, for example, health scores, which I'll get into in a couple of minutes — they tell you what the status is.

Liran: It's divided into three different categories: platform, which is basically the hardware, the nodes and the OS of the cluster; application, which is what is actually running inside the pods; and deployment, which is basically the state of the deployments, replica sets and services — whether you have endpoints for each service,

Liran: whether you have all the replicas on the correct version for every deployment and stateful set, and so on and so forth. And these are the AI-predicted failures, meaning the actual sequences that we find, and you can see that they progress as time goes on: they start as a warning, they become urgent, until they reach full failure, which is the red. We'll show you exactly the list of them, how you see them and how you deal with them.
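[Editor's note: as an illustration of the kind of per-deployment and per-service check described above (all replicas ready and on the current version, endpoints present behind each service), here is a minimal sketch using the official Kubernetes Python client. It is not Sosivio code; the namespace name is only an example.]

```python
# Minimal sketch: flag deployments whose ready replicas lag the desired count,
# and services that currently have no ready endpoints.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a pod
apps = client.AppsV1Api()
core = client.CoreV1Api()
ns = "demo"                        # example namespace

for d in apps.list_namespaced_deployment(ns).items:
    desired = d.spec.replicas or 0
    ready = d.status.ready_replicas or 0
    updated = d.status.updated_replicas or 0
    if ready < desired or updated < desired:
        print(f"deployment {d.metadata.name}: {ready}/{desired} ready, "
              f"{updated}/{desired} on the current version")

for svc in core.list_namespaced_service(ns).items:
    ep = core.read_namespaced_endpoints(svc.metadata.name, ns)
    addresses = sum(len(s.addresses or []) for s in (ep.subsets or []))
    if addresses == 0:
        print(f"service {svc.metadata.name}: no ready endpoints")
```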
Liran: Then there's the load average of the nodes, which we — and customers — found very useful to see at first glance. Then we have something very interesting here: the automatic application profiling that Sosivio offers. Here you see two states, the over-allocated and the under-allocated deployments in our cluster. We'll deep-dive into it — there's also a full screen for it — but right now I would like to go directly to the AI-predicted failures, the things that we see here.

Liran: If we click it, or if we go to our Command Center, we can see the full list of failures along with, let's call it, the predictive severity of each sequence, and you can see that we have many of those. In fact, we have 107 failures, 68 predictions that things are going to break, and two warnings where the sequence is just starting to build.
Stephen: Liran, it's frozen on the cluster overview page.
Andrea: Why don't you stop sharing for a moment and then start sharing again? Okay, let's see if... I don't understand why it's frozen.

Andrea: Okay, okay — meanwhile, let's see: is there a way you can continue with the presentation in the meantime?

Stephen: Sure.

Andrea: Let us know when you're ready again.
Liran: So we can see some of the failures that are here — there are a lot of failures; as I said, this is a demo cluster, but there is still a simulation that we run on it — and you see that it's constantly updating. Every time a sequence completes, meaning a failure starts and eventually fails, the information here updates. We can see things like the panics that you see here inside pods — and you can actually see the panic itself,

Liran: why it panicked. There are a lot of demos here: that one was a division by zero, which is something simple, and we have them for Python, Go, Java, Node.js and C#.

Liran: And there are also failures that are not related to the code itself — for example, a pod stuck in ContainerCreating, where you can see that basically the readiness probe has failed and so on; it couldn't create the container. So basically you can see everything that happens on the cluster.

Liran: Let's set this to show 30 issues and see if we have something else. Here we have a configuration issue — someone set the deployment to imagePullPolicy: Never — and this one is CreateContainerConfigError, meaning you're probably missing a PVC or a Secret; in this case the Secret is missing. So basically, everything that happens on the cluster that is critical for you to understand, know about and deal with, you can see in the Command Center.
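[Editor's note: the configuration failures mentioned here (imagePullPolicy: Never with no local image, a missing Secret or PVC causing CreateContainerConfigError) surface in the pod's container statuses, so you can list them yourself for comparison. A rough sketch with the Kubernetes Python client; this is not Sosivio's collector.]

```python
# Rough sketch: print pods whose containers are stuck in a waiting state,
# together with the reason Kubernetes reports (ImagePullBackOff,
# CreateContainerConfigError, ErrImageNeverPull, ...).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for pod in core.list_pod_for_all_namespaces().items:
    for cs in (pod.status.container_statuses or []):
        waiting = cs.state.waiting
        if waiting and waiting.reason not in (None, "ContainerCreating"):
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"[{cs.name}]: {waiting.reason} - {waiting.message}")
```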
Liran: In the mass of information that is Kubernetes, and with the different tooling for looking at it, it is very difficult to differentiate what is important and what you need to deal with from what is just information that someone spewed out — because, you know, developers often just put in log messages for their own troubleshooting later, and as far as the environment itself and your service are concerned, most of it is irrelevant.
Stephen: Do you want to talk about sequences and, kind of, the DNA of a malfunction — that aspect of it?

Liran: Yeah.
Liran: You need to get all the information from all of these moving parts and put it on a timeline. As you can see, this is just a four-event sequence, but we have much bigger sequences, up to 18 or 20 events in a sequence. Meaning: let's say you had a panic inside the pod, which returned a 500, and because of that the pod crashed and then it restarted — all of these things are put inside that sequence, and on the timeline that becomes the signature of this malfunction.
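[Editor's note: the "signature" idea — an ordered sequence of event types that identifies a malfunction independently of which pod it happened to — can be sketched very simply. This is only an illustration of the concept with made-up event names; it says nothing about how Sosivio actually encodes or matches sequences.]

```python
# Conceptual sketch: strip the "who" from a list of timeline events and keep
# only the ordered event types, so the same failure pattern can be recognized
# on any other pod or deployment. Event names are hypothetical.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float   # seconds since epoch
    entity: str        # the pod/deployment it happened to (dropped below)
    kind: str          # e.g. "panic", "http_500", "container_crash", "restart"

def signature(events: list[Event]) -> tuple[str, ...]:
    """Order events by time and keep only their kinds."""
    return tuple(e.kind for e in sorted(events, key=lambda e: e.timestamp))

timeline = [
    Event(10.0, "checkout-7f9", "panic"),
    Event(10.2, "checkout-7f9", "http_500"),
    Event(11.0, "checkout-7f9", "container_crash"),
    Event(15.0, "checkout-7f9", "restart"),
]
print(signature(timeline))
# ('panic', 'http_500', 'container_crash', 'restart') -- the same pattern can
# now be matched on any other workload, regardless of the pod name.
```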
Liran: So basically, once a failure happens, we take the events that happened, without the "who it happened to", and those events in turn can apply to any other deployment or pod or whatever entity it is.

Liran: I'm trying to find one here — for example, this is an OOMKill, and you can see that the sequence here is much, much longer: it's actually 13 events long, and you can see everything that happened to the deployment, on the node itself, to the pod. Because, for example, with OOMKills you have different types. You have an OOMKill that is instigated by the node not having enough memory, where there is nothing you can really do about it except move the pod somewhere else,

Liran: to another node — that is the only way to fix it. If you don't have a sequence of events, if you don't see exactly what happened, and you just see that it was killed by the kernel because the cgroup limit was crossed, then there's no way to understand how to fix it. Giving more memory to this pod on this specific node will not fix it —

Liran: sorry, for that pod it will fix it, but for pods that were killed because the node ran short of memory it won't: the node simply doesn't have enough memory to make the pod run.
Liran: That's it. Steven, did you want to mention anything else about the Command Center?
Stephen: I mean, I'm sure we're going to want to talk about the application profiling piece, so I don't know if you want to jump to that now. Yeah.
Liran: So back when we started sharing, we saw the over-allocated and under-allocated widgets on the main screen, and I want to dive into them for a second and look at the application profiling.

Liran: The way we do it is we look at pods, and basically deployments, from the second they started up until this point — meaning we look at the dominant behavior of that pod or set of pods. We determine, first of all, whether the dominant behavior is erratic, meaning there are spikes — memory or CPU, it doesn't really matter — and we also check whether it's calm, meaning it doesn't change much.

Liran: Based on that dominant behavior, we decide with an algorithm what you should put as the recommended values for that pod, set of pods or deployment. The reason we do that is that you can imagine a pod running like crazy, spiking all week, and then on Thursday night it stops spiking because there aren't a lot of users, or whatever. If you then try to profile it based on the last hour on Thursday or Friday — or even the last day — you will profile it wrong.
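[Editor's note: Sosivio's profiling algorithm is proprietary and based on the "dominant behavior" of a workload. The sketch below only illustrates the general idea of deriving a request/limit recommendation from observed usage samples, using simple percentiles, so the contrast with profiling "the last hour of a quiet Thursday" is concrete. The numbers and percentile choices are assumptions, not Sosivio's method.]

```python
# Illustration only: recommend a request near typical usage and a limit above
# the observed peak, given usage samples covering the workload's whole lifetime
# (so a calm Thursday night doesn't dominate the profile).
import statistics

def recommend(samples_mib: list[float]) -> dict:
    samples = sorted(samples_mib)
    p50 = statistics.median(samples)
    p99 = samples[min(len(samples) - 1, int(0.99 * len(samples)))]
    peak = samples[-1]
    return {
        "request_mib": round(p50 * 1.1),           # headroom over typical use
        "limit_mib": round(max(p99, peak) * 1.2),  # headroom over the peak
    }

# One week of made-up memory samples: spiky on weekdays, calm at the end.
week = [300 + (i % 50) * 8 for i in range(2000)] + [80] * 200
print(recommend(week))         # profile over the whole history
print(recommend(week[-200:]))  # profiling only the calm tail is misleading
```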
Liran: So this is how Sosivio does it, and you can see that, for example, for this pod the request is currently 80 megabytes, which is what we recommend, but the limit is much, much higher than what we recommend. With CPU, this pod probably doesn't do very much, so the recommendation is really low. And these are the values that we have seen: the average and maximum for CPU and memory during the dominant-behavior period.

Liran: Let's try to find something that is a little bit more spiky.

Liran: Now, just to explain a little bit how all of this magic is done — how do we get this sequence that you see here? This is a panic indicator, actually; this is the full sequence. But if we choose a sequence — let's choose one from here.

Liran: Inside Sosivio you also have the insight-to-metric view, which you glimpsed a couple of minutes ago when I ran through it. This gives you a lot of information that comes directly from our collectors, before any analysis — it is the raw data that goes through the data-swirling mechanism, which I want to touch on once we finalize the demo.

Liran: So basically there's a lot of information that Sosivio reads. We have our own proprietary collectors and our own proprietary database, which is a graph database combined with a document store, and that information comes at a very high granularity. For example, I'm sure every developer would love having something like this in their hands. Let's try to find something that is doing some work.
Liran: Yeah, I guess Chrome doesn't like my computer too much.
Andrea: Okay, so just to recap: how did we get here again?
Liran: So we spoke about how we get the information for the sequences, and why it's very difficult — I haven't seen any other tools that have this type of information at this granularity, which is what I was missing as a consultant, as I mentioned before. The way we do it is using the data-swirling methodology and, of course, our own proprietary collectors and database. Just to show you the granularity of information we have inside Sosivio: for example, we're looking at CPU right now, and you can see that every five seconds you get a read, and you can see that it's 0.302 cores.
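[Editor's note: a five-second CPU read in "cores" can be approximated by sampling the container's cgroup CPU accounting and dividing the delta of CPU time by wall-clock time. The sketch below assumes cgroup v2 and its default mount path; Sosivio's own collectors are proprietary and may work differently.]

```python
# Sketch: sample cgroup-v2 CPU usage every 5 seconds and print cores in use.
# Assumes /sys/fs/cgroup/cpu.stat is readable (run inside a container on cgroup v2).
import time

CPU_STAT = "/sys/fs/cgroup/cpu.stat"

def usage_usec() -> int:
    with open(CPU_STAT) as f:
        for line in f:
            key, value = line.split()
            if key == "usage_usec":
                return int(value)
    raise RuntimeError("usage_usec not found")

prev, prev_t = usage_usec(), time.time()
while True:
    time.sleep(5)
    cur, cur_t = usage_usec(), time.time()
    cores = (cur - prev) / 1_000_000 / (cur_t - prev_t)
    print(f"{cores:.3f} cores")            # e.g. 0.302 cores, as in the demo
    prev, prev_t = cur, cur_t
```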
Liran: So it's very, very precise. If something spikes, you will see it immediately. You don't need to dig deep, because all other tools — especially the ones based on the Prometheus node exporter — are doing averages, so it's very difficult to find something that spikes, or to see when you are being cut off while requesting a lot of CPU. The same goes for memory, by the way. Also, if you notice here, the red line is the limit that is configured for this pod.

Liran: The blue one is the request, and you also have the average and the maximum. If I remove the limit, you can see that it zooms in a little bit; if I remove the request, it will probably zoom in more. You can see the accuracy of our reads here. The same goes for memory. And by the way, the green is very important: that's the maximum CPU and memory the pod ever consumed. We also have the same for threads and throttling, which is taken directly from the CFS.

Liran: This one is not throttling right now. Then there are voluntary context switches and non-voluntary context switches, which carry their own indications: voluntary context switches basically indicate high I/O — either network or storage, most of the time — and non-voluntary ones actually indicate deadlocks or loops inside the code, things that get stuck or take a lot of time.
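[Editor's note: the voluntary/non-voluntary context-switch counters referred to here are exposed by the Linux kernel per process in /proc/<pid>/status. A small sketch to read them directly, independent of Sosivio:]

```python
# Sketch: read voluntary and non-voluntary context switches for a process.
# High voluntary counts often mean the process is waiting on I/O (network or
# storage); high non-voluntary counts mean it keeps being preempted, e.g. busy
# loops or CPU contention.
def context_switches(pid: int) -> dict:
    counters = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches",
                                "nonvoluntary_ctxt_switches")):
                key, value = line.split(":")
                counters[key] = int(value.strip())
    return counters

print(context_switches(1))   # e.g. {'voluntary_ctxt_switches': 1234, ...}
```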
Liran: So that's the granularity of information we have. We also have the network connections from each pod. It will update in a — oh, this one is not connected anywhere; let's try to find one that is. Let's take, for example, a CRUD service — it should give us some network information. You can see that all the HTTP information is here: we can see the APIs this pod is actually accessing, how many requests, what the maximum latency is, and whether it returns 200, 300, 400 or 500.

Liran: We can see the information this pod sends and, of course, the information this pod receives. So it's very convenient to see whether you have latency between pods here, and toward which API. Sometimes the latency doesn't emanate from network problems, but from a remote pod that takes time, or maybe a database that is heavily loaded. So it's very easy to see. As you saw a second ago, you can see a lot of APIs being accessed — let's find something with multiple APIs.

Liran: We had one a second ago and I closed it — yeah, here you see two APIs. Maybe one of these APIs is taking too long to respond, so you can also see the maximum latency per API, which is very, very convenient when you're working in such a distributed environment. If you go to the network connections, we can also see the state of connections — and this is very helpful for starting to detect load balancer or firewall issues between your cluster and somewhere else.

Liran: For example, if you had a load balancer with a connection limit, or a firewall that is just dropping connections, you'll probably see FIN_WAITs or CLOSE_WAITs — connections basically waiting to go to the next step but unable to. So that's also very convenient to work with and very useful information.
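[Editor's note: a quick way to see the connection-state distribution mentioned here (CLOSE_WAIT / FIN_WAIT piling up when a load balancer or firewall silently drops connections) is to count TCP socket states on the node or in the pod. The sketch uses the third-party psutil package; Sosivio gathers this through its own collectors.]

```python
# Sketch: count TCP connection states; many CLOSE_WAIT / FIN_WAIT sockets can
# hint at a firewall or load balancer silently dropping connections.
from collections import Counter

import psutil  # third-party: pip install psutil

states = Counter(conn.status for conn in psutil.net_connections(kind="tcp"))
for state, count in states.most_common():
    print(f"{state:12} {count}")
# Example output: ESTABLISHED 42 / CLOSE_WAIT 17 / FIN_WAIT2 9 ...
```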
Liran: The same metrics we have for pods we also have for whole nodes, of course, with load average and a lot of other information. But something very interesting here is that we can also see the DNS latency of that node, which affects all the pods on that node.

Liran: So that's also a lot of information you can use to determine whether the problem is in the pod, a node, the whole network, and so on. Again, the same granularity also applies to node information: what the usage is, what the maximum usage was, how much the aggregated limits of the pods on that node are, the same for requests and, of course, the average; and network connections again — sometimes nodes have VLANs or firewalls between them, or I've also seen IPsec encryption between nodes.

Liran: We also have a couple of other things that are worth viewing. Our health checks: to simplify it, when you're troubleshooting you don't just read logs, right? You run commands: you check the deployment, you check that everything is deployed, you actually run some curls, maybe a dig to check DNS resolution. So Sosivio runs a lot of these tests automatically and continuously on the whole cluster — on the platform parts, the application, as I said, and the deployment. These are the health checks.
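[Editor's note: one of the health checks described — measuring DNS resolution latency from a node or pod — can be reproduced with a few lines of standard-library Python. The service name and threshold below are only examples; Sosivio runs such probes continuously across the cluster.]

```python
# Sketch: time a DNS resolution the way a troubleshooter would run `dig`,
# and warn when it is suspiciously slow (threshold is an arbitrary example).
import socket
import time

def dns_latency_ms(hostname: str) -> float:
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000

name = "kubernetes.default.svc.cluster.local"   # example in-cluster name
latency = dns_latency_ms(name)
print(f"{name}: {latency:.1f} ms")
if latency > 100:
    print("DNS resolution is slow on this node - check CoreDNS load")
```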
Liran: Let's see — all of these things are failing, or failed recently. For example, the Kubernetes API service is continuously restarting; that's probably because it's under a lot of load — this is the demo environment and we are actually simulating a lot of failures.

Liran: But when that happens, you usually don't see it: it happens, it restarts, and that's it — you just have slow API communication. You run kubectl and maybe it takes three minutes, or two minutes, or half a minute. So this can give you a lot of insight into what is wrong with your cluster. And again, this is just for the platform; we have the same thing for the application — you can see exactly which one is in which state — plus a lot of other tests that we have here, and, of course, the same for the deployment.
Andrea: I have questions. So effectively you have a monitoring capability, like these health checks — these are the actual failures that have happened — and then you have the capability of, let's say, predicting with a certain confidence that some patterns are occurring and that they may lead to failures of different types.

Andrea: Can that be somewhat externalized to, for example, ServiceNow, so that if people are not watching the console all the time they can get alerts, then come look at this and understand what the recommended remediation is?
Liran: We don't have an integration with ServiceNow at the moment, but Splunk, Prometheus and Grafana — we already have integrations for those. So if your NOC or your DevOps team has a screen in front of them that they look at, you can push all the information from Sosivio — all these failures — to it, and if there is a failure, you can click it, go into the specific dashboard and view it.
Liran: Something very important that you didn't see in the demo is that Sosivio is completely built on RBAC, meaning all the permissions you have on the cluster are applied in Sosivio. So if Steven is the admin and I'm a developer with permission to see only my namespace, and I use the same username — meaning I use OAuth to log into Sosivio, which we also support —

Liran: then, of course, the same RBAC that applies to me on the cluster itself will apply to me in Sosivio, which is very, very convenient, especially in production environments where you want to let people see things but you don't want to let them touch anything. That's one of our big customers — as you said, Stephen, a top-10 international bank; we can't say the name. It's a huge environment, a very big environment.
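[Editor's note: the "same RBAC as on the cluster" behavior rests on Kubernetes' own authorization checks. As a hedged illustration of how any tool can ask the cluster what the logged-in user may see, here is a SelfSubjectAccessReview via the Kubernetes Python client; this is not Sosivio's implementation, and the namespace is just an example.]

```python
# Sketch: ask the API server whether the current credentials may list pods in a
# given namespace - the same RBAC decision the cluster itself would make.
from kubernetes import client, config

config.load_kube_config()   # the user's own kubeconfig / OAuth token
authz = client.AuthorizationV1Api()

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="dev",     # example namespace
            verb="list",
            resource="pods",
        )
    )
)
result = authz.create_self_subject_access_review(review)
print("allowed" if result.status.allowed else "denied")
```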
Liran: Actually it's multiple environments, and they use it, of course, in development and testing and all the rest, but they found a very nice use case: when something happens in production, they don't have to give the developer kubectl access — they can just open Sosivio, and the developer can debug whatever they want. They don't have permission to touch anything and they don't have direct access to the cluster.

Liran: That's one of the things they found very valuable and are really using on a day-to-day basis: they gave the tool to all of their developers, and they reduced — Stephen, give us the numbers — the requests to the DevOps people by...
Stephen: I think they said it helps them reduce by up to 90% the time it takes to fix these issues, because they have thousands of developers.

Stephen: They do have some expertise in-house, but a lot of things get escalated up to those in-house experts, and they're spending a lot of time dealing with developer issues which may not be super complicated and could be solved by the developers themselves — they just don't have that knowledge yet. So now they're seeing what's going on, they have the root cause presented to them right there, and they don't have to escalate nearly as many things anymore.
Liran: And that's a very, very important point, Andrea, because these guys know Kubernetes — I know, because we've spoken with them many, many times; they are experts. But how many experts do you have in an enterprise when you have, I don't know, three or four thousand developers?

Liran: Ninety percent of the requests to those DevOps and Kubernetes experts are gone, and that frees them to go do important things like automation or cluster upgrades, instead of just answering: "Oh yeah, you set imagePullPolicy: Never — remove it," or "the image tag is not correct," or "you put a capital letter in the repository name." These things happen on a day-to-day basis. Of course, for someone who is very familiar with Kubernetes,

Liran: it looks like, oh yeah, that's two seconds. But for developers who don't deal with Kubernetes — and in fact some of them don't even have kubectl access; they do everything through their CD pipeline — they just try to deploy and it doesn't deploy: "Andrea, my pod doesn't deploy." Okay, go to kubectl; now you have to do a describe, or a get -o yaml, or whatever it is you need to do to understand what happened. Instead, give the tool to the developer and they see it immediately.
Andrea: So not only, let's say, in production, but also in the earlier test phases, while developers are still involved, and they can catch things. And I guess the system can be trained to detect new patterns once a developer has understood what is happening — or how do you do that?
Liran: Well, the system always finds new sequences, but at the moment we actually curate them manually. Everything that happens on the cluster creates a sequence. What we do right now, as we develop our AI engines, is keep a set of sequences that we know are 100%-for-sure failures; we know exactly what those sequences stand for, and we give you recommendations for each and every one of them. In the future

Liran: there will be an engine that handles any sequence automatically — and by the way, you don't need to train the discovery engine; it finds everything. The problem is taking those lists of events that you've seen and translating them into something that a developer or DevOps person will understand.
Andrea: Okay, I'm checking — no, there are no questions related to our topic.

Andrea: Liran, you mentioned — we were talking about it, Steven and I, while you were restarting the share — the data repository that you have to have on the cluster. Do you have to install it? Does it rely on some database that you have to install as well, or is it its own type of repository?
Liran: So again — you have good questions today, Andrea. This is a good question, because there are a couple of things we forgot to mention. One: Sosivio does not require any prerequisites on the cluster. We don't need storage, we don't need a special cluster, a special node, a GPU or anything else. We just run as yet another application on your cluster: you deploy it, it creates a namespace with several pods in it, and that's Sosivio — you don't have to prepare anything for it.

Liran: The second very important thing is that you don't need to change anything in your deployments or your pods. There is no sidecar required for all the information that you've seen: everything is extracted by our collectors from the nodes, and there's no need for any instrumentation or preparation in the applications themselves or in the other pods.
Liran: It's actually an in-memory, as I said, graph and document store — a graph database combined with a document store.

Liran: It is highly available, meaning you can kill any instance of it and it immediately replicates itself back; it's actually very difficult to cause data loss. If you wanted to wipe the database, you would have to shut down the whole deployment, and even that takes time. Basically it is super fast: it takes all the information that the collectors ingest and streams it to the right locations. It does not require any disk, because it is in memory, and it's completely distributed.

Liran: Again, we do have requests for long-term storage — people want to keep information for, say, three days, three weeks, or three months — and that will naturally require a PV or PVC. But right now Sosivio doesn't save information for that long; we don't need it.

Liran: The idea, or the concept, is: if a pod died just now, the information from it is relevant for, let's say, 30 minutes or an hour; after an hour it's irrelevant, because either there's a new pod, or the pod is not there, or someone redeployed it. It doesn't really matter what happened — what does matter is the sequence. I don't care about all the logging information and signals, but I do care whether it was OOMKilled or whether there was a panic.
Liran: We do take information from the Kubernetes API, but not from Prometheus; we have, as I said, our own data collectors. It's basically a DaemonSet that runs on every node.
Andrea: Okay, okay, excellent. I don't have any more questions. Have you got anything else to tell us about today regarding the technology, or some use cases that have happened with your customers?
Liran: So, with this — do you guys see my screen? Yeah.

Liran: Sosivio actually helped us find a very hard problem that had been around for three months — in the Israeli Army, actually — running OpenShift on top of OpenStack. There was a problem with Neutron, with the SDN underneath OpenShift, and nodes were spiking to eight seconds of latency. There's no way to see that, right? It happens for three seconds,

Liran: a hundred pods are timing out, and you can't see it anywhere else. We caught it here. So that's one very important case. Also DNS issues, you know —

Liran: we had an issue where one pod was running an application that didn't use a DNS cache. There were thousands of requests to services, the DNS was bombarded on that node, and latency was something like 500 milliseconds to get a resolution. We found the culprit afterwards, but just knowing that the DNS has a problem and is loaded —

Liran: that is, first of all, reassuring: you understand that the problem is just on a specific node or set of nodes and not on the whole cluster. The second thing is simply finding who is bombarding the DNS, which was easy, because you can go to the DNS logs. Trying to think of other use cases...
Liran: I can go on about problems, but I think one of the customers — I won't say who — was running on Amazon, actually, not OpenShift; I know, that's a bad word here. Just to give you an idea: we ran application profiling and we found that they were about 80 percent over-allocated. You know how it is: developers want the pod to run, so if they need, say, half a core, they set a request of two or three cores and a limit of six cores, and everybody's happy.

Liran: But the company pays half a million dollars — I don't know the exact numbers — and if you just run application profiling and go pod by pod, you can see the waste. In some clouds we even saw requests for the Calico pods, or the SDN pods that are supposed to come with the environment: the request was half a core for each, which they almost don't even use. So just reducing that cut costs

Liran: significantly. Cost is a nice thing, but with application profiling, cost wasn't the initial intent — it's a nice side effect. The initial thought behind it was to not under-allocate, which is where the problems are: you don't want to be throttled and, God forbid, you don't want to be OOMKilled. We have a lot of examples, by the way, where we put it into a customer's environment and we see it immediately — the information flows into the product within 10 minutes; you already have results.
Andrea: So you'll add more and more of these sequences, which are basically doing an automatic diagnosis of problems as they come in, the more engagements you have with customers, and then you add them to the recognized patterns of the product — that's how I understood it. Let me go back to my initial question.

Andrea: Is there an idea of allowing customers, in the future, to train the system themselves — especially if it is application-specific? Is that something that...
Liran: We started talking about it — creating something that is open to the world, where everybody can contribute, which is cool, because, as I said, we do have all the sequences: everything that goes wrong on the cluster, we find it. Letting people say, "okay, I know this one, this is my application, this is how it fails, I want to keep that" — and someone else can take it, maybe for RabbitMQ, and this is how it fails in that scenario. Right.
Liran: Sorry — I think we're getting towards the end of the hour anyway, so let me finish up.
Andrea: No, no, I think it's quite interesting, especially, you know, in the spirit of open source. But like I said, I can think of a number of projects I've been working on — one recent one — that could have benefited a lot, primarily because we're talking about very, very large infrastructures, with hundreds of nodes and loads of applications running on them, with different characteristics, that make use of storage in very, let's say, creative ways. So something that allows you to do that...
Liran: We actually had some collaboration with Red Hat: on an older version of OpenShift we found something wrong with the kubelet at one of the customers, and we caught it immediately. So we shared all the information and had quite a productive session there — with the customer too, of course.
Andrea: And I'm thinking a little bit — there are cases where remediation, actually solving the problem, really does require a patch.

Andrea: And while you wait for that, the only thing you can do is watch, predict when it's about to break, and apply the sort of workarounds that will keep

Andrea: the system running. It's not ideal, but that's what you have to do; and if you don't have something like that, you have to do a fair bit of manual intervention to keep certain parameters and certain applications under control. And it's not easy.
Liran: And what do you do when the next version is deployed? Yeah — everything changes, right? You probably have a deployment every two weeks, or three weeks, or one week, and all the parameters suddenly change: CPU requests, memory, the sequences themselves, what the pod requires network-wise. Maybe it used to send just one HTTP request; now it sends 500. You don't know.
Andrea: So it looks like we don't have any questions coming from the audience — I don't have full visibility; there's one that I had to take care of separately. If you guys agree, we can come to a conclusion of today's talk: some final thoughts, and links I can give to our audience in case they want to know more about it. If you can give me some links, I'll put them in the chat.
Liran: So, first and foremost, this is an OpenShift session, and we have a certified OpenShift operator. You're welcome to go to OperatorHub and just download it and install it on your cluster. It doesn't put anything on the cluster other than the Sosivio namespace.
Andrea: People can go there, and they can probably get in touch directly with you guys — and definitely check the docs. So I guess that's all we had time for today. Let me just copy the email here — today I'm alone, so I have to do everything.

Andrea: And to all our listeners: I'll see you in the next episode of OpenShift.TV. Thank you all very much.