From YouTube: AIOps on OpenShift with Sunny Siu and Tushar Katarki
Description
AIOps on OpenShift with Sunny Siu of ProphetStor and Tushar Katarki of Red Hat.
Filmed on October 28th, 2019 in San Francisco
A: But that's the exciting thing. I've been at Red Hat for several years now, doing OpenShift product management for three or four years, and I've done several of these Commons gatherings. So it's always great to see you all here. Thanks particularly to our customers and partners for coming and talking; we always appreciate that. So without further ado, let me introduce Sunny.
B: Yes, I'm a co-founder and president of the company ProphetStor. We develop AIOps solutions focused on the OpenShift workload, and thanks to Red Hat we get introduced to a lot of different customers who are trying to understand their workloads and optimize cost and resources on the cloud while supporting the OpenShift workload. I'm going to give you a little more detail after Tushar gives an introduction.
A: But just to tee it up with a teaser: earlier, when Discover was talking about their use case, they were talking about how you set CPU and memory requests, how much they need, and the limits for that. Some of the feedback that we hear from customers is that, although that's a good thing, it's actually hard for customers to guess what those values should be.
A: So not everybody knows how to set them, or what value to set for a particular app or pod for the CPU requests and limits. And the other thing is, even if they know, it changes over time. So I think ProphetStor has a fantastic solution for this, which you'll hear about. That's kind of the teaser.
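(For readers of this transcript: a minimal sketch of what "requests and limits" means here, using the official Kubernetes Python client. The image name and resource values are illustrative assumptions; choosing those values well is exactly the problem the speakers describe.)

```python
# A minimal sketch of CPU/memory requests and limits on a pod, built
# with the official Kubernetes Python client. The numbers here are
# illustrative guesses -- which is exactly the problem being described.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves
    limits={"cpu": "500m", "memory": "512Mi"},    # hard cap before throttling/OOM
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="demo"),
    spec=client.V1PodSpec(containers=[
        client.V1Container(name="app", image="registry.example/app:1.0",
                           resources=resources),
    ]),
)
# Print the manifest this would submit (no cluster needed for this step).
print(client.ApiClient().sanitize_for_serialization(pod))
```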
A: So, without further ado, let's get right into it, and I do have this clicker I can use. So what is AIOps? I mean, IT operations today is time-consuming, and that kind of gives IT a bad rap. How do you change that using AI and machine learning? That's AIOps: platforms and software systems that combine data and AI functionality to enhance and replace a broad range of IT operations processes and tasks, things such as availability and performance monitoring, event correlation and analysis, and IT service management and automation.
A: So why should we care about this? I think we intrinsically understand why this is important. For example, we want to know what's happening to a computer, and if there are alerts, if there are things that are going to fail, we want to know about that. And you can imagine, at a cluster level or a multi-cluster level, when there are hundreds of nodes and thousands of projects, there is no way to keep up with that manually. This is the classical case.
A: According to this analyst, I think this is IDC: 50 percent of IT assets will have the ability to run autonomously using embedded AI, and so this is the path to these smart IT and facilities systems.
A: So what is the evolution of the AIOps capabilities? What are the steps, what are the phases? This actually gives you a nice picture. On the left, it starts with monitoring: how do you observe that complex system? After that, what kind of insights can you get from the data that you have collected? Can you use machine learning to get those insights? And then, finally, can you act upon it using some automation? So if you think about the evolution on the right side, you'll see data collection and visualization is the first step. It's no different from any other machine learning pipeline and workflow.
A: Then you want to understand the patterns and discover them in that AIOps workflow, and then you want to do some kind of prediction. In this example, we talked about how you might be running out of CPUs at some point, so auto-scaling could be one example of it. Or it could be making predictions about network attacks, if you are worried about security. So that's another application. Those are some of the predictions that you can do, and then you do some kind of root cause analysis: you determine that this is the problem, and then you do a remediation of that.
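(A toy sketch of that predict-then-act step. This is not ProphetStor's or Red Hat's algorithm; it just fits a linear trend to recent CPU samples and decides whether to scale before capacity runs out. The threshold and sample values are invented for illustration.)

```python
# Toy illustration of the "predict, then act" loop described above:
# fit a linear trend to recent CPU utilization samples and extrapolate.
import numpy as np

def forecast_cpu(samples: list[float], horizon: int = 6) -> float:
    """Fit a linear trend to CPU utilization samples (0-100%) and
    extrapolate `horizon` steps past the last sample."""
    t = np.arange(len(samples))
    slope, intercept = np.polyfit(t, samples, 1)
    return slope * (len(samples) - 1 + horizon) + intercept

# One sample per 5 minutes; utilization creeping upward.
recent = [41, 44, 48, 47, 53, 58, 61, 66]
predicted = forecast_cpu(recent, horizon=6)   # ~30 minutes ahead

if predicted > 80:
    # In a real system this would call the cluster autoscaler or
    # patch a replica count instead of printing.
    print(f"Predicted CPU {predicted:.0f}% > 80%: scale up before it happens")
```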
A: So, real quick, what have we done? It's been an evolution for us, from an OpenShift perspective and a Red Hat perspective, in how we are approaching this. I'll start with the top left here. We first started with introducing and strengthening the observability piece. If you go back as far as OpenShift 3.7, we introduced Prometheus in tech preview, and then in 3.11 we GA'd it and had an operator for it. So Prometheus and the corresponding alerting stack are kind of the start of this observability story for us. Then, with the Istio service mesh, we now have OpenTracing.
A: Metering has GA'd in the most recent release, which allows you to generate reports based on CPU and memory usage, et cetera. Then we are doing this other thing called Telemetry, in which we are collecting this Prometheus data and not confining it to a cluster: we are aggregating it as a service, so that we can do some data mining with it, and I'll talk a little bit about that. By the way, the technology or platform that we use to collect it is Insights, which has been there for some time from Red Hat. Insights basically gives you some insights about the deployed system from a Red Hat point of view; I won't go into the details. So once we have the data, we collect it, we analyze it, we develop some insights, and then we bring that to the customer. So that's the first piece, the observability piece of it. The second piece of it is really this:
A: You can get an alert; okay, we got an alert, and now you can do some automation tasks with it using Ansible Tower, so that's an integration that we have. We have other things in the portfolio, like Red Hat Decision Manager, which is a rules-based system, and business process automation.
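(A hedged sketch of that alert-to-automation hook: a tiny HTTP receiver for Alertmanager's standard webhook payload that launches an Ansible Tower/AWX job template over its REST API. The Tower URL, job template id, and token are hypothetical placeholders, not details from the talk.)

```python
# Sketch: receive Alertmanager webhook posts and launch a remediation
# job template in Ansible Tower/AWX. Placeholder values throughout.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

TOWER = "https://tower.example.com"   # hypothetical Tower/AWX endpoint
JOB_TEMPLATE_ID = 42                  # hypothetical remediation playbook
TOKEN = "REDACTED"                    # OAuth2 token for the Tower API

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)    # Alertmanager webhook JSON format
        for alert in payload.get("alerts", []):
            if alert["status"] == "firing":
                # Launch the job template via the Tower REST API.
                requests.post(
                    f"{TOWER}/api/v2/job_templates/{JOB_TEMPLATE_ID}/launch/",
                    headers={"Authorization": f"Bearer {TOKEN}"},
                    json={"extra_vars": {
                        "alertname": alert["labels"].get("alertname")}},
                )
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 9000), AlertHandler).serve_forever()
```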
A: Some customers I know are using some of these advanced techniques to create rules on how to react to some of these conditions, and then how to automate it using business process automation. And then there is the connected customer, which is really our program wherein customers with OpenShift 4 are sending us this telemetry data.
A: Just based on the telemetry data, we are able to determine that there are bugs in the system. So that's the connected customer. We want to do more than that, obviously, and that's where some of the Open Data Hub work is really helping, so this is something exciting for us. Then, finally, we have the automation piece, because of course now you want to automate all of this. We heard about operators and the Operator SDK; we talked about the immutable host, which is our operating system.
A: We have improved the install experience with operators in OpenShift 4, and we have a whole bunch of other things that we are doing in this space. So this is the example of what I was talking about, the OpenShift telemetry. This is the dashboard that we see at Red Hat for the connected clusters I was talking about. You can see that we get some information: there's a graph on the dashboard which shows the number of connected customer clusters.
A: Are there any errors in the system? So, double-clicking on this, the AIOps examples that we are doing are things such as log anomaly detection and outliers: we are analyzing the logs that we collect and determining if there are any outliers, and we use that to improve the system, the product itself.
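(A minimal sketch of the idea behind log outlier detection, not Red Hat's actual pipeline: reduce lines to templates by masking numbers, then flag templates that almost never occur.)

```python
# Frequency-based log outlier detection, in miniature: lines that share
# a template (after masking numbers/hex) are common; rare ones stand out.
import re
from collections import Counter

def template(line: str) -> str:
    """Mask numbers/hex so 'took 3ms' and 'took 5ms' share a template."""
    return re.sub(r"0x[0-9a-f]+|\d+", "<N>", line.lower())

logs = [
    "GET /healthz 200 took 3ms",
    "GET /healthz 200 took 5ms",
    "GET /healthz 200 took 4ms",
    "panic: runtime error at 0x7fd3 during upgrade",
]
counts = Counter(template(l) for l in logs)
for line in logs:
    if counts[template(line)] <= 1:   # rare template => outlier
        print("outlier:", line)
```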
A: The second one really is cluster roll-out monitoring. One of the important features of OpenShift 4 is the over-the-air updates, so we are able to push over-the-air updates to the connected clusters.
A: Now, what we determined is that if we monitor that upgrade process, and if there are any anomalies, if there are any problems with it, then we are able to detect that and take corrective action based on it. Similarly, we are doing anomaly detection with metrics, which is the Prometheus anomaly detection, and then we are doing the workload prediction and resource optimization, which is something that Sunny is going to talk about with their technology. So this is kind of the big-picture vision.
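(A minimal sketch of metrics anomaly detection against Prometheus, again not the actual Red Hat implementation: pull a series over the standard HTTP range API and flag points more than three standard deviations from the mean. The endpoint URL and query are assumptions.)

```python
# Pull a metric from Prometheus' range-query API and flag z-score outliers.
import time

import numpy as np
import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def fetch_series(query: str, hours: int = 6, step: str = "60s") -> np.ndarray:
    """Query Prometheus' /api/v1/query_range and return the first series."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": end - hours * 3600,
                "end": end, "step": step},
    )
    resp.raise_for_status()
    values = resp.json()["data"]["result"][0]["values"]
    return np.array([float(v) for _, v in values])

values = fetch_series('sum(rate(container_cpu_usage_seconds_total[5m]))')
z = (values - values.mean()) / values.std()
anomalies = np.where(np.abs(z) > 3)[0]   # points more than 3 sigma out
print(f"{len(anomalies)} anomalous samples out of {len(values)}")
```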
A: I won't go into a whole lot of detail on this, but you can kind of guess what we are doing here. We are collecting the data both at the cluster level and above it, we are aggregating that, and then we are taking action with AI and ML. The other thing that's important is the word "we" here, and that's why I'm here with Sunny: obviously, a lot of these things we are not able to do on our own. On the other hand, with OperatorHub, we are encouraging our partners to take advantage of the open ecosystem to bring all these tools and technologies to you. So that's kind of the introduction to what we are doing. Right, over to Sunny, to double-click on exactly what's happening with ProphetStor's Federator.ai. Thank you. Thank you.
B: Alright, so in the remaining 10 to 15 minutes I'm going to give a bit more detail on how our solution works on OpenShift, and the value proposition. I just want you to see this slide, because it shows a survey by RightScale, now a part of Flexera. They did a survey earlier this year of around eight hundred enterprises, and about half of them, around 400, have more than 1,000 employees. And this is the conclusion: they said that cloud cost optimization is the number one priority for the majority of these enterprises.
B: And the other conclusion is that it doesn't matter how long you have used public or private clouds: cost continues to be the number one priority. Our solution tries to address this very important issue. What we have learned from our customers, OpenShift customers, and other partners is that usually the CIO will get a shock about the bills when the developers are just using the public cloud freely.
B: Okay, so our solution is trying to address that, and let me tell you a bit about how it works. Specifically, the pain point we address is that if you deploy your applications on the cloud, most users or developers will not know exactly what resources on the cloud are needed to support the applications. Right now it's all guesswork, and adding to this, your application workload is quite dynamic: containers are very dynamic and sometimes short-lived, but you deploy many of them on a weekly or daily basis.
B: For a machine learning workload, it would be GPU resources, which are very expensive and charged on an hourly basis. And if you are in a major enterprise, you may have many different divisions and projects, and each one of them might ask the CIO office for some cloud resources. Again, they don't know what kind of resources they will need to support the application, so it's all best guess, and it usually takes a long time.
B: Last time at Red Hat Summit, one of the major OpenShift customers from Europe, in the automotive industry, said that in certain cases some of the divisions only use 10% of what they get allocated. So there is a lot of wastage. This slide just shows a very high-level overview of how our solution works; Federator.ai is our solution.
B: We can show that, compared with native Kubernetes, not using any mechanism at all, some workloads can get up to 70% in cost savings, so the ROI is quite significant. And as I said earlier, if the customers allow us, we can execute and dynamically auto-scale the cluster for them. Just to summarize the reason why we can do this: we dynamically and continuously predict all the workload and resources on different time scales; the time scale could be the next hour.
B: And this is a screenshot of the actual solution. The upper part is the CPU prediction. The blue curve represents the actual customer workload, and our dotted line represents our prediction. As you can see, it doesn't exactly overlap, because we are just doing prediction, but it follows the general pattern quite well, and the one in the red box is actually our prediction. The green line represents our resource recommendation.
B: Customers are actually not really interested in the particular curve of the actual workload; they're interested in what resources are needed to support the application workload. In this case, the upper graph is the CPU, and the one below represents the memory workload. We give a margin, because we want to make sure it never runs out of memory, so we give a margin of 15 to 20 percent, and this can be configured by the user. And we are right now a Level 5 certified operator; thanks so much for the Red Hat support in getting this done.
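(A toy sketch of the recommendation arithmetic just described: take the predicted peak and add the configurable 15 to 20 percent safety margin so the workload never runs out of memory. The numbers are invented for illustration.)

```python
# Recommendation = predicted peak plus a configurable headroom margin.
def recommend(predicted_peak_mib: float, margin: float = 0.15) -> int:
    """Return a memory request (MiB) with headroom over the prediction."""
    return int(round(predicted_peak_mib * (1 + margin)))

peak = 870.0                        # predicted peak memory over the window
print(recommend(peak))              # 1000 MiB with the default 15% margin
print(recommend(peak, margin=0.2))  # 1044 MiB if the user configures 20%
```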
B: Of course, some customers negotiate a certain break, but the rest, 95% of the customers, will use the standardized costs. And as I said earlier, we provide much better workload management than the native Kubernetes mechanism. This is another screenshot: once we learn about the workload, because the cost varies with different workloads, we can compare providers. This is a very small cluster we deployed, and it's an actual customer workload.
B
It's
just
example
in
this
particular
case,
it
turns
out
that
Amazon
is
the
cheapest
so
and
just
want
to
show
you
some
use
case.
This
is
a
leading
market
research,
company
market
research,
firm
based
in
Boston.
They
are
migrating
a
number
of
on-premise
VMware
based
workload
to
your
oven
shift
on
AWS,
and
they
have
no
idea
how
to
sized
AWS
cluster
to
support
this
continue
eyes,
workload
so
running
our
tools
right.
B
This
is
the
one
a
different
use
case,
but
the
GPU
cluster
is
a
pretty
sizable
cloud
provider.
Is
a
government-funded
high
performance
computing
center,
allowing
all
the
GP
resources
to
be
used
by
the
enterprise
in
Taiwan,
University
and
research
labs?
They
have
over
2,000,
GPUs
and
I
have
to
say
they
spend
about
close
to
20
million
dollars,
buying
those
GPUs
systems
from
Nvidia
over
9000
CPUs.
B
So
it
turns
out
that
they
already
lk
over
90%
of
the
cheapy
resources
to
these
users
right,
even
though,
most
of
them
only
using
3%
of
it,
okay,
so
very
inefficient,
but
now
using
our
tools
to
analyze
or
the
work
law,
they
can
raise
from
30
percent
utilization
to
80
percent.
So
that's
very
significant
increase
in
utilization
and
the
return
on
the
investment
ROI
and
using
our
toys
is
almost
10
times.
B: Okay, and in addition to providing this GPU workload visibility and resource prediction, we also give them some performance anomaly detection. And that's basically all I have to say. If you want to learn more about the solution, go to our website or just send me an email. And again, I want to thank Red Hat for inviting me to give this talk. Thanks, Tushar.