From YouTube: Site Reliability Engineering, Managed Services and the path to the future Sasha Rosenbaum (Red Hat)
Description
OpenShift Commons Gathering 2021
Site Reliability Engineering, Managed Services, and the path to the future
Guest Speaker: Sasha Rosenbaum (Red Hat) @divineops
https://commons.openshift.org/index.html#join
A
So while people are coming back in, this is going to be a talk on SRE, managed services and the path to the future. My name is Sasha and I work for Red Hat.
A
So, by way of introduction, oops, I've been in this industry for a long time, and I started off as a developer. I have a computer science degree, and then I went and had all sorts of different jobs in technology, most of which didn't exist when I was a kid, so you couldn't even choose them as a career path.
A
By and large, I like solving problems with people and technology, and I like to believe that the world is getting better every day and that we are in a good industry to help it get better. That's why I'm excited to be at this conference today and be talking to you all, and maybe we'll come up with new ideas.
A
Okay, hopefully it doesn't happen again. Okay, so anyway, awkward. I'm going to be quoting this book a lot. This is the book on Site Reliability Engineering, the first one by Google, co-authored by a whole bunch of awesome people. I really like the book.
A
It's a really good book to get started with if you're just getting into SRE concepts and you want to understand what they're all about. I'm going to start with the sentence that I least like in the whole book, and that sentence is: "SRE is what happens when you ask a software engineer to design an operations team." Like, I really, really don't like this.
A
This is usually my face when I see a definition like this, because I think it's very elitist, and it assumes that developers are cooler than ops people and that ops people couldn't come up with the idea of automation, and, you know, Google had to come in and solve all the world's problems. That really kind of isn't what happened. The definition that I do like is that SRE is roughly Google's implementation of DevOps.
A
That definition is actually also in the SRE book, so I didn't make that one up. So we started off with DevOps more than a decade ago, and this happens to be a picture from one of the first two years of DevOpsDays Chicago, and I'm the only person in this picture of organizers who identified as a developer.
A
There was an awesome person, and she's right here and waving at me, Bridget, and she helped grow this conference into, like, a global enterprise that thousands of people in hundreds of cities show up to. And again, we were all talking about automation and, you know, how do we get to a better place where we get to solve more interesting problems?
A
So if you were alive in the 90s and you remember what they looked like, you know that getting a new server up, if you were lucky, took you three months, right? Because you had to actually file a procurement order and you had to wait for the actual physical server to show up. And then you had to build a server rack and you had to wire it up and configure it and install stuff and whatever. Also in the 90s.
A
A
So if you had a couple of maintenance windows like that, that would be less than two nines, because two nines only gives you 3.65 days a year of downtime. So that's just planned maintenance, right? And we took down servers for planned maintenance for whole weekends.
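The two-nines figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 365-day year; the function name `allowed_downtime_seconds` is made up for illustration and does not come from the talk or the SRE book.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day year, no leap days


def allowed_downtime_seconds(nines: int, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Downtime permitted over a period by an availability target of N nines.

    Two nines means 99% availability, so 1/100 of the period may be downtime;
    three nines means 1/1000, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return period_seconds / (10 ** nines)


if __name__ == "__main__":
    for n in range(2, 6):
        seconds = allowed_downtime_seconds(n)
        print(f"{n} nines: {seconds / SECONDS_PER_DAY:.2f} days, "
              f"or {seconds / 60:.2f} minutes per year")
```

Under these assumptions two nines works out to 3.65 days a year, and each extra nine divides the allowance by ten, so five nines leaves roughly five and a quarter minutes.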
A
I'm loving this. And we used to think that speed and reliability can't be friends, and dev and ops can't be friends, because devs are just incentivized to push to production as quickly as possible, and that breaks things, and ops are incentivized to keep the lights on, and they carry pagers and they get paged in the middle of the night and they hate change, right?
A
It's all about incentives. But the allegory that works better for software development is actually that it's like riding a bike, right? Our inherent assumption is that if we go slower, we break things less, but that's actually not always true, not in all domains, and software development is kind of like riding a bicycle. If you go too slow, it actually breaks more. You can't keep your balance.
A
A
Well, the biggest thing was that effective automation requires consistent APIs, and that's something we didn't have. So one of the words that pops up a lot in this talk is APIs, and you need them to be able to automate anything. Clayton just talked about, you know, the bigger control plane and being able to automate something at an across-clouds level, right? But we started off at the basic level, so you had to start with an operating-system-level API, and then with Linux.
A
So it was a real problem that people were trying to solve, which actually brings me to one of my favorite transformation stories, which is the story of PowerShell, championed by Jeffrey Snover. That's the CLI, scripting language and configuration management framework that shipped with Windows in 2006. Before that, Jeffrey went through five years in his career where he was on the verge of getting fired every day, because angry executives were yelling at him: "What part of effing Windows do you not understand? Admins don't want APIs." Turns out that admins
A
do want CLIs and APIs and want to automate things, and fortunately, automation won in this battle. Every wave of automation enables the next wave of automation, right? Next, we got to infrastructure-level APIs. So this is another quote from the SRE book: "Central to Borg's success and its conception was the notion of turning cluster management into an entity for which API calls could be issued." Right?
A
So basically, we arrived at the idea that we needed an API for the entire infrastructure, and it had to be consistent, and it had to be manageable by automation, automatable. And it wasn't just Google. Obviously it was Amazon, it was Azure, right? Everybody was kind of arriving at the idea that there was this pressure to deliver adaptable services at scale, and you need APIs to do that.
A
There was another thing that was happening in kind of a slightly different part of the industry, which was, if you weren't Google or Microsoft, and if you didn't run, like, a gazillion servers and couldn't custom order the server racks the way you wanted them, you still needed automation. And so companies like Puppet and Chef and Ansible were starting to build that automation for sort of your own data center, right? What's new is the service level objective, and that is business-approved availability.
A
So there's this concept that 100% reliability is actually unsustainable, unnecessary and also extremely expensive, right? So even if we talk about not 100% but five nines, which was everybody's holy grail, right, that's five minutes 26 seconds a year of downtime. Available downtime: that's all you can have with five nines. And the major question is: will your users even know that you're that available? And the resounding answer to that is no.
A
A
A
A
So I can get a promotion at the end of the year, and so my best interest is to test the hell out of it before I push the ops people to push it, right? Because I only have three minutes left in my error budget for the quarter. So SLOs and error budgets actually help us align speed and reliability, right, in a way that makes everybody more successful.
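The error budget idea can be sketched as simple bookkeeping: the SLO fixes how much downtime a period allows, and whatever has not been spent yet is what a risky release can draw on. The sketch below is illustrative only; the 99.99% SLO, the 90-day quarter, the downtime figures and the function names are all assumptions, not numbers from the talk.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, period_days: int) -> float:
    """Total downtime (in minutes) that the SLO allows over the period."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * period_days * MINUTES_PER_DAY


def remaining_budget_minutes(slo: float, period_days: int, downtime_so_far: float) -> float:
    """Budget left after subtracting downtime already incurred this period."""
    return error_budget_minutes(slo, period_days) - downtime_so_far


def can_release(remaining_minutes: float, expected_risk_minutes: float) -> bool:
    """Hypothetical policy: ship only if the release's estimated worst-case
    downtime still fits in the remaining budget."""
    return expected_risk_minutes <= remaining_minutes


if __name__ == "__main__":
    # Assumed scenario: 99.99% SLO over a 90-day quarter, 9.96 minutes already spent.
    left = remaining_budget_minutes(slo=0.9999, period_days=90, downtime_so_far=9.96)
    print(f"Remaining error budget: {left:.2f} minutes")
    print("Risky release allowed:", can_release(left, expected_risk_minutes=5.0))
    print("Well-tested release allowed:", can_release(left, expected_risk_minutes=1.0))
```

With about three minutes left in this scenario, a change estimated to risk five minutes of downtime is held back while a better-tested one goes out, which is the incentive alignment described above.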
A
A
Of course, observability is another concept that's related, and that again talks about how much you know about how your services are doing. This is important to me, and I know some other people who, you know, carried a pager in their life.
A
You need a good signal-to-noise ratio, because if you're paging people about every single thing that's not important, they're going to stop responding to pages, and if you're not paging people when their help is actually required, that's also a problem. We could also dive into who should carry a pager, but we won't. Anyway, I do want to say, when you talk about SRE and automation, it always sounds like automation's going to solve everybody's problems, so there is a little bit of caution in here.
A
A
Then the second part of it is that automation drift starts immediately. So you write a service, you write automation for the service, and then you update the service. Then you have to update the automation, right? So you immediately start accumulating those differences between the automation and the actual services it runs. Automating one-offs is inefficient.
A
I could spend six hours automating a task that actually takes me six minutes to do manually, and if that was a one-off and it's never going to happen again, then I just wasted time.
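The six-hours-versus-six-minutes trade-off comes down to a break-even count: how many times the task has to recur before the automation pays for itself. A minimal sketch follows, where `break_even_runs` and its `automated_run_minutes` parameter are hypothetical names introduced for illustration, and ongoing maintenance of the automation is deliberately ignored.

```python
import math
from typing import Optional


def break_even_runs(automation_minutes: float,
                    manual_minutes: float,
                    automated_run_minutes: float = 0.0) -> Optional[int]:
    """Number of runs after which automating costs no more than doing it by hand.

    Returns None when an automated run saves no time, in which case the
    automation never pays off on time alone.
    """
    saved_per_run = manual_minutes - automated_run_minutes
    if saved_per_run <= 0:
        return None
    return math.ceil(automation_minutes / saved_per_run)


if __name__ == "__main__":
    # Six hours of automation work against a six-minute manual task.
    runs = break_even_runs(automation_minutes=6 * 60, manual_minutes=6)
    print(f"Automation pays off after {runs} runs")
```

For a genuine one-off the break-even point is never reached, which is the sense in which the time was wasted.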
And then importantly, very importantly, all systems are socio-technical, right? So the goal of this is never to automate humans completely out of a system. I mean, we make this error all the time. We're like, oh, we're going to automate all the things, because humans are the problem.
A
I mean, humans are the problem a lot of times, but also they're the solution, right? Because the second law of thermodynamics states that the universe goes towards chaos, right? So all systems left unsupervised tend towards chaos, and entropy always wins in the end, so you need a human to maintain order.
A
So let's talk about what the future is. Clayton said that he doesn't know what the future is. I do. So, no, it's not, but I think there's a certain kind of level of goals that we all have, right? We kind of are striving towards the same thing. The future is already here, it's just not all evenly distributed. So I know, if we talk about, like, the five nines and all the fancy automation things,
A
people also have, like, 70 years of what we call legacy code, and that's the actual thing that makes them money, and they have to run the business, you know? So I think, and I'm biased, because I work for Red Hat, you know, and I'm on a managed services team, right? So I think the future is managed services, and managed services can be defined in, you know, many ways, right, but it's all about platform as a service, right?
A
We've been talking about platform as a service for a really long time, right? And we've wanted platform as a service for 10 years or probably 20, right? All we want is to get to the point where we can run our applications. And there were many attempts to implement a PaaS. Maybe some of them were more successful or less successful.
A
A
Some spreadsheet somewhere runs on Excel on someone's laptop, like, it just happens. And we know that effective automation requires consistent APIs, and we know that every wave of automation enables the next wave of automation, which is why I'm happy that I'm at KubeCon, because I think that Kubernetes is potentially something that will allow us to proceed to the future and have that consistency, and have that consistent API across different systems and different deployments. And so 85% of global IT leaders agree that Kubernetes is the key to cloud-native application strategies.
A
I don't know if all application strategies, but, you know, cloud-native application strategies. So the point is, everybody wants to have a piece of Kubernetes, which is cool. And the other thing is, we all have open source now, like, open source won, and it provides us with a way of setting up a standard and letting people kind of share knowledge.
A
What we have in common, and work together to define what that consistent API looks like. But the problem with open source, and yes, I think probably everyone is going to have this slide in their presentation, the problem with open source is that, you know, you have the proliferation of services and tools and all the things. And that's a real-world picture of someone trying to run Kubernetes in production.
A
So that's what it usually ends up like. But you do have an advantage today compared to just a few years ago. Like, if you want to get out of the data center management business,
A
you can go to the cloud, and if you want to get out of the Kubernetes management business, you can go to OpenShift. Again, like I said, I'm biased, right? So I'm on this team that works on these services that are called Red Hat Cloud Services, right? And we run on top of different clouds. Actually, you can pick your favorite cloud and run your managed OpenShift on one of those clouds. And OpenShift is kind of an opinionated, turnkey way to get all the bells and whistles that you need inside your Kubernetes.
A
So you don't have to browse that CNCF slide and identify which project is maintained by a single maintainer on weekends, and, you know, depend for all your security on something that Joe is maintaining in his garage when he has free time. And there's the whole thing which, like, we do actually run SRE for the folks who rely on these Red Hat cloud services, and on top of different clouds, which is an interesting problem to solve, because we don't own the infrastructure, right? We are running SRE on top of infrastructure we don't own, which is exactly the same problem as every company in the world that's not Google, Microsoft or Amazon, right? So we're trying to solve it for other people, which is cool.
A
You know, at Red Hat we also went through a journey. So when we first started offering these services there was, you know, an SLA of two nines, and now it's four, and we're trying to get even better and better, you know, because continuous improvement is a thing.
A
So if you compare traditional organizations with cloud-native organizations, in the traditional manner, again, we have this proliferation of different infrastructure and we have this proliferation of different platform services, right? And again, as you standardize, you're just getting to enable people to automate this complexity and to standardize on something that they can all share across the board.
A
There's this picture. I like this picture. It comes originally from Hans Moravec, talking about AI taking over the world, which probably eventually will happen, I don't know. Basically it's like a picture of this water rising in the landscape, right? And so AI gradually takes over people's jobs. We're not talking about AI yet here, we're talking about automation, but it's still happening, right? The API kind of gets higher and higher.
A
A
Instead of doing something that's going to be table stakes in a few years. So to the extent possible, you want to outsource your SRE to your platform provider. And last but not least, I wanted to mention something that Red Hat is working on. And so first of all, ideas are open source, which is why I learned a bunch of the ideas in these slides from other smart people, and hopefully other smart people learn ideas from me sometimes. And we know that open source won because it's cool, but we're now facing a slightly different challenge than we did before.
A
So, you know, in open source, we are always trying to incorporate the knowledge that we learn back into the code base, right? So, upstream first, we're trying to share. But now that we're moving to "everything is a SaaS", right, we're having this problem again, where the platform is proprietary, right? So we're no longer sharing knowledge. We're no longer contributing the knowledge back upstream. And so Red Hat is, this is super initial stages, but we're starting this new initiative that's called Operate
A
First. It's a concept of incorporating operational experience back into software development, right? So you can find some of these concepts on the website, it's operatefirst.cloud. And this is an effort to basically get people started with a playbook for learning how to run SRE, and also a playbook for sharing operational knowledge across different clouds, right, so we can all learn from each other in terms of how we run these services.
A
So that was all I wanted to share with you today, and I'm Sasha. You can find me on Twitter. Especially follow me if you like cat videos. And I'd be happy to continue this conversation because, like I said, I think we all learn from each other all the time.