From YouTube: "Beyond AIOps: How to Open Source Operations", Marcel Hild (Red Hat), OpenShift Commons Gathering 2021
Description
Beyond AIOps: How to Open Source Operations
Marcel Hild (Red Hat)
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, visit https://commons.openshift.org
A: I'm going to talk about the topic "Beyond AIOps". I've been looking into AIOps as a manager now, having started off as a senior software engineer and then principal software engineer in the AI Center of Excellence in the Office of the CTO, working on all things AI and operations and how that evolved over time. So it'll be a little bit of a journey, because my perspective on it changed over time.
A: So let's start with the ground definition from Gartner, which says: "AIOps platforms are software systems that combine big data and AI or machine learning functionality to enhance and partially replace a broad range of IT operations processes and tasks, including availability and performance monitoring, event correlation and analysis, IT service management, and automation." That's a whole lot of words.
A: Some people might interpret that as "AI is going to replace IT operations". Maybe that's a spin some people out there are taking. But if you take a closer look at those AIOps features, it boils down to pretty "simple", in air quotes, stuff: finding a baseline for your metrics, simulating the future, finding some correlation with your incidents, doing anomaly detection, which basically means I'm predicting the future and, if reality deviates from the prediction, I'm seeing an anomaly. And hopefully it helps you find the root cause of your problems.
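That anomaly-detection recipe (fit a baseline, predict forward, flag deviations) can be sketched in a few lines of Python. This is a minimal illustration of the idea, not anything shown in the talk; the rolling window, the threshold `k`, and the sample metric values are arbitrary assumptions.

```python
# Minimal anomaly-detection sketch: the baseline is a rolling mean,
# and a point is an anomaly if it deviates more than k standard
# deviations from that baseline (i.e. from the "predicted future").
def detect_anomalies(series, window=5, k=3.0):
    """Return the indices of points that deviate from the rolling baseline."""
    anomalies = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = sum(recent) / window
        std = (sum((x - mean) ** 2 for x in recent) / window) ** 0.5
        # Flag the point if it is more than k std-devs away from the baseline.
        if abs(series[i] - mean) > k * std + 1e-9:
            anomalies.append(i)
    return anomalies

cpu_metric = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]  # spike at index 7
print(detect_anomalies(cpu_metric))  # -> [7]
```

Real AIOps tooling would replace the rolling mean with a proper forecasting model, but the "deviation from prediction equals anomaly" framing is the same.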
A: But it's not like you have an AI running your system or doing the job for you. It's not even the case that AIOps is a product in a box. I think it's more like a marketing term nowadays, like when "cloud" hit the market and everybody was doing cloud, and you'd talk to different people and everybody had a different opinion about what cloud actually means. So these days you see a lot of products, not just old ones but also new ones, in the monitoring space and the tracing space, slapping "AIOps" onto their products. And people think: oh yeah, I'm going to buy this product and then, boom, I'm cloud native, I'm AIOps, and my ops people will do more interesting stuff, because I now have a product doing the tedious work. But in reality it's more like running smaller experiments across your ops people and then developing the capabilities that leverage AI to bring you to the next level. I'm seeing it more as a cultural shift, similar to what we saw with dev and ops during the DevOps movement.
A: So the developer people started using the tools from the ops side of the house: with cloud native, you as a developer can deploy a whole complex scenario that previously was only available to the ops people, because you needed to spin up so many servers and such. Nowadays you can do that with just one command, cut and pasted from the internet, and boom, you have a multi-node deployment. And similarly, the ops people started using the tooling from the development folks, becoming YAML engineers and codifying their operational landscape.
A: So using the tools from those two domains brought us DevOps, and SRE folks are an implementation of DevOps. I think it's similar with AIOps: you see the DevOps people using tooling from the AI side, from the data science side. That might start with just using EDA tools for data analysis, Jupyter notebooks and the like, to detect the signals in the data.
A: So you start out with assisted AI, which basically means the AI helps you discover things. It tells you: look, here's an anomaly; here's a correlation between some things. But in the end you have to make the decisions. I think that's where we currently are. Wrapping that in a box gets you to augmented AI, where smaller parts of my operational domain are taken over by autonomous AI agents doing some rollbacks, or merging patches, or something like that.
A: And if you take that even further and you have an autonomous AI running your deployments, then at some point we're there: we have really massive scale, where a smaller number of people manage larger parts of the data center. And that's where we want to get to, right?
A: So analysts are saying 100 times cost reduction for operating infrastructure if we reduce the operational costs. And the idea here is building up the competence, encapsulating that competence in something that you create inside your ops team, gathering the observations, and then feeding that back into building the competence and building out the tooling. So every ops team needs to go through this cycle. It's not a vicious circle, it's a continuous improvement circle, right?
A: So that's kind of sad, right? Because in the end, with every environment, with every deployment, with every customer, you are taking your data and then you are training your model in your AIOps product. Nice, good. But you don't get a pre-trained thing from a vendor, because the only thing you get from the vendor is the tooling to train the models.
A: And for this we have to go a little bit back in time. Before open source, there was code, right? Code was the secret: we compiled it into binaries and we made money out of those binaries. On the left side, the code was more valuable than the operations of the code; operating it at scale was left to the folks in the basement, so that was more or less an afterthought.
A: So if the value in IT is in ops, and ops are proprietary, then open source has a problem. And this is something that Matt Asay, a cloud and open source executive at AWS, is asking: what happens if you open source everything? That's exactly what Yugabyte did when they dumped the open core model.
A: Instead, they released all of their code as open source. So they are saying that open core doesn't work, which is good, but also that releasing the complete stack as open source is the better example now, because essentially the value is in operating the software, and they are also giving people the Yugabyte Platform to run this database for their customers.
A: So are they really open sourcing everything? No. "Everything" is the code, but not the ops platform. That's still left to the customer, or you're buying it from a service provider. And don't get me wrong, there's nothing bad about that, right? You can move these capabilities out to people who do this for you. But democratizing software through open source brought so much innovation, and I think the same should be true for operations.
A: So we're trying to do this with the Operate First initiative. Operate First is an initiative to operate software in a production-grade environment, bringing users, developers, and operators closer together. Ideally, Operate First becomes a partner to "upstream first" as a basic tenet of our workflow, Red Hat's workflow being the open source workflow: upstream first, meaning that if we productize something, the work goes into the upstream project first.
A: So what we're currently doing with the Massachusetts Open Cloud and OpenInfra Labs at Red Hat is launching this initiative, where we want to operate upstream projects at scale. I mean, we're starting small, so we're not at that scale yet. But we want to embrace upstream communities to give them a chance to operate their projects in a cloud-native environment.
A: Hopefully we also identify the bugs and the shortcomings, or the edge cases, that a customer would run into, right? And we're spicing this all up with OpenTelemetry, OpenTracing, open ops: sharing all those best practices and tools and deployments with the community, so that we can replicate them into other deployments and learn from each other. That's a whole lot of words; in the end, what you have to imagine is: think of a cloud provider with full visibility into the operations. And it makes complete sense.
A: So wouldn't it make sense to have ops and developers working closely together in a transparent cloud, working on the ops piece, and then picking and choosing some of the obstacles that they run into, the tedious tasks, the chores that they have to do, and codifying that in an operator? Instead of the developer thinking, or the product manager thinking, "oh, this is something that the customer is usually doing, so we'll codify that in the operator", when it's not really a problem.
A: If it doesn't fix it, I'm going to report it, so I'm taking one step in the contributor funnel. And eventually, if I'm super involved, I can resolve the problem myself, because I'm contributing back to the code, because everything is open in open source software development. This is not the case in operations. In operations, every ops deployment is de facto a snowflake, and it's behind walls, which is completely fair, because you have to deal with privacy and such.
A: But if you think about AIOps, using AI to train your operational models and your operators that are running in auto mode: does it make sense to always train them from scratch? In this example here, enter transfer learning. That's an AI technique where you train your model on a certain set of data, then take the knowledge inside that model so that you can train a second model on less data, right?
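That transfer-learning idea can be shown with a deliberately tiny sketch, assuming nothing beyond the definition given in the talk: fit a one-parameter model on a large source dataset, then reuse its weight as the starting point for training on a much smaller target dataset. The datasets, learning rate, and step counts below are all made up for illustration.

```python
# Toy transfer-learning sketch with a one-parameter linear model y = w * x,
# trained by gradient descent on mean squared error.
def train(data, w=0.0, lr=0.01, steps=100):
    """Fit the weight w by gradient descent and return its final value."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

source = [(x, 3.0 * x) for x in range(1, 10)]  # plenty of data, true w = 3.0
target = [(1, 3.1), (2, 6.2)]                  # very little data, true w = 3.1

w_source = train(source)                         # learn on the big dataset
w_scratch = train(target, w=0.0, steps=5)        # cold start, few steps
w_transfer = train(target, w=w_source, steps=5)  # warm start: transfer

# After the same 5 steps, the warm-started model is already close to 3.1,
# while the cold-started one is still far away.
print(round(w_scratch, 2), round(w_transfer, 2))  # -> 0.7 3.02
```

The point is only that knowledge from the source task cuts the amount of target data and training needed, which is exactly the argument for sharing operational models instead of training each deployment from scratch.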
A: So AIOps in a community starts with discussing, collaborating, and locking in standards, like Prometheus becoming the de facto standard for metrics these days. For logs this hasn't happened yet, but I think we're on a good way there; the same goes for model exchange. Developing these standards is super crucial, because only with standards, once you have standardized on something, can you build on the same foundation. So let's grow collectively and codify that operational experience.
A: In the end, operations gets democratized. Democratize operations: everybody should be able to operate stuff at scale. Think of it as "brew install ops-center" or "git clone operations": don't start from scratch.
A: So then you can use your data as a competitive differentiator, or whatever your product is actually selling, and not so much your operational excellence. You should be building on top of the data that you collect from your customers, that the customers give to you, or that you know about your landscape, not so much on operating your cloud. And we're also prototyping this with Open Data Hub, which you heard about here today and which you're going to hear about later on. That's only natural, because it has some AI in it.
A: It's also a young project, which makes it good, because you can still influence them: they are still building out their operational ideas and capabilities, so we have the potential to influence them. And obviously you need users and workloads: users, users, users, right? Without anything happening on your platform, you won't produce any data and you won't produce any issues, and then you just have an idling cluster sitting there, which is kind of boring. So we're also doing these things with certain other workloads: container-native virtualization, Mesh for Data, OpenShift itself, ACM, the Advanced Cluster Manager, and other emerging-tech projects being onboarded there. But as it's a community, everybody out there can onboard and run their experiments there. We also do a lot of things with research.
A: We have a telemetry working group; there are a lot of threads going on. It's slowly, slowly starting, so it's the perfect time to chime in. My call to action here is: get your access-all-areas card for the ops center, because it's so easy to be onboarded there. Essentially, all you need right now is a Google mail address.
A: Maybe we'll change that to something different, but it should really be read-only by default for everybody out there, so that you can deploy your workloads, of course in collaboration with the community, and then we solve issues there, right? So you onboard via our onboarding infrastructure, you get compute, and in return you're giving away the data that your compute produces: your metrics, your logs, and such. Click on Operate First and you're going to land on this page here, where we have bucketed things into data science, users, operators, and blueprints. On the data science side, you can follow along with what we're doing there on the AI research bits. Most of it, I think all of it, has an ops touch, so you won't see image detection stuff there yet; it all has to do with how we operate.
A: Moving on to the operators bits: we have documentation for onboarding your workloads. We use Argo CD, so we're following a GitOps approach here, getting all the best practices in a cloud-native deployment model, so to say. Another interesting aspect of this, or perspective on it, is that if you want to make a pull request against a service or a running system, you need to replicate the setup somehow. So we also have tools to replicate the setup on CRC, which is a Kubernetes cluster on your laptop, or onto other environments. So hopefully we will have guides to deploy the same setup into...
A: I don't know: AWS, Azure, or bare metal deployments. So hopefully we grow this environment into other data centers over time, and I think that's a super crucial part, because, as I said earlier, SRE is usually a process that everybody, every customer and every project, has to set up on their own, and they have to write their best practices on their own.
A: We are really documenting them out in the open, so that you can actually do a git clone of the decisions that we're taking and the processes that we're documenting, and then you just adjust them to your needs and your demands, and you don't have to start from scratch; you can build upon our best practices here. And, as I said, it's a community: every page has this "contribute to this page" link, which then takes you to this slew of GitHub repos, because we are a little bit spread all over the place, from that Operate First...
B: Awesome, Marcel, thank you very much, and I'm really glad that you brought the Operate First cloud to our attention, because I hadn't seen it before today, so I really appreciate that. And the data science projects and workflow stuff on that site is just awesome. So I encourage everybody who's listening in now to take a look at that. I hadn't heard of it before today, so I have learned something new today, and I'm thrilled to see it. That's good, awesome. So, Audrey, you've joined us.
C: Marcel, I do. So, hey Marcel, how's it going? I'm going to take on the persona of somebody that's, like, kind of brand new to AIOps.
A: It's also tickets and issues, bugs and the like. That's why we're feeding our alerts into GitHub issues. So yes, we want to make everything open, so completely transparent. You might see some passwords encoded, you might see emails encoded, or phone numbers if somebody is on call. Obviously, we don't want to expose any privacy-identifying information, right? But if you're going on Facebook, you're also giving away your privacy-identifying information.
A: So maybe that's just the deal: if you are in there, use your internet pseudonym and avatar, and not your real name, right, if you're worried about this. I would really treat it as open as it can be, because that's usually the roadblock if you want to get access to data. Even within our company we have trouble getting access to the internal data, although it's all internal, because you have to go through infosec and all that stuff, since it might contain some sensitive data.
A: So let's do it in the open from the beginning. We're not there yet, but that's really the aim. In terms of big data processing: yes, Open Data Hub has Kafka in it, it has Spark in it, so we have the possibility to crunch big data. I hope that we have big data at some point; right now I think we have two issues, so it's not that big, but we're starting to collect stuff. And from the setup perspective, we are connected to the Northeastern side, I think they call it NERC or NESE or something, which is another research domain. So I think we have unlimited storage; at least that's what I'm assuming, right?
C: Okay, so I guess I would ask one more question. Do I have time for one more question, Diane? A little one? A little one, okay. This one could probably be kind of a yes-or-no one. So we're going ahead, we're gathering all this data. Do we have something in place that will reduce what I would call noise? There might be some kind of spurious data that comes up. Maybe there's some data where we could spot trends, where somebody's trying to do something with the system. Is that in the works for AIOps? Because otherwise we would just have huge amounts of data that, you know, I don't think we'd really need.
A: That's an absolutely great question, because that's the challenge for the AI community, I would say. I'm not really an AI researcher, but I know that we have the same problem of reducing noise in identifying cats in images. You don't want to identify the cat that's in the background as a cat, but only the cat that is actually moving, for example. Or like these adversarial attacks, where you only change one pixel, right?
A: So I also don't want my AIOps agent to evacuate a cluster only because somebody deployed something that has an emoji in a log message; that's an adversarial attack on the ops agent. So exactly: collecting all that data, and then having the scenarios where you have an outage or something, and then retraining your model so that it works better. So yes, I don't have it yet.
A: We have some PoCs doing something in that domain, and we're seeing interesting projects there, from research and also from IBM, looking at extracting log templates from log files, predicting time series, and correlating time series. But the problem most of them have is: "oh, we need to have your data, then we can train our model." Well, no, we don't have any data yet. And that's exactly the problem that we're trying to solve here: provide a common set of data.
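The log-template extraction mentioned there can be illustrated with a toy sketch: collapse log lines into templates by masking out the variable parts, so that lines produced by the same print statement group together. Production tools (the Drain algorithm, for instance) are far more sophisticated; the regex and sample log lines below are illustrative assumptions only.

```python
import re
from collections import Counter

# Toy log-template extraction: mask the variable tokens (numbers, hex IDs)
# so that log lines produced by the same print statement collapse into
# a single template that can be counted and compared across deployments.
def extract_templates(log_lines):
    """Return a Counter mapping each template to how often it occurred."""
    counts = Counter()
    for line in log_lines:
        template = re.sub(r"\b0x[0-9a-fA-F]+\b|\b\d+\b", "<*>", line)
        counts[template] += 1
    return counts

logs = [
    "connection from 10.0.0.1 port 443",
    "connection from 10.0.0.7 port 8080",
    "disk usage at 91 percent",
    "disk usage at 97 percent",
]
for template, n in extract_templates(logs).items():
    print(n, template)
# -> 2 connection from <*>.<*>.<*>.<*> port <*>
# -> 2 disk usage at <*> percent
```

Template counts like these are exactly the kind of structured signal that time-series prediction and correlation can then be run on.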
B: So that is a great aspirational goal, and I think it's doable if we do it as a community effort, so thanks, Marcel, for doing that. I'm going to queue up our final speaker for the day, Sherard Griffin, and then I'll come back on with some resource links as well. But I really want to thank everybody who's persevered with us today. It is fluid, so we've run over a little bit, but we should be able to wrap this up in the next 15 minutes.