Description
Reliability in production environments: design best practices with Kubernetes to deliver 99.99% service uptime.
---
KCD Africa 2022 is the 2nd iteration of the Kubernetes Community Days Africa, a CNCF-powered free community event. Visit https://kcdafrica.com for more information.
A
Okay, the next speaker is Frank Adul, who's going to be leading us on the topic of site reliability engineering with Kubernetes. Frank is an engineering graduate with over 60 years' experience building and developing distributed systems, and he has worked as a senior software engineer, principal software consultant, automation and release engineer, platform engineer, and much more in the cloud ecosystem. So it's good to have you here, Frank.
B
B
All right, okay, that's cool. I'm excited to be here today, and it's a privilege to be part of the CNCF today to talk about site reliability.
C
C
To talk about what site reliability actually is: we've heard
B
the term, the buzzword right now: "I'm an SRE."
C
I mean SRE, and DevOps engineer as well. Basically, an SRE is an engineer who ensures systems are reliable. That is how I define it, and that is how I coined it. So
C
the stack of the delivery chain: you start from the design stage and go through to the monitoring stage as well. You cut across the entire stack within the SDLC process, and you also work
B
C
with the developers very well, and also with the ops team. So there's a lot of responsibility in being a site reliability engineer, whether in the tech space, within an organization, or as a consultant.
C
So, let's dive in. These are the key areas. I've had so many questions; I've been asked privately, on LinkedIn and on Twitter as well: what is the difference between SRE and DevOps? It's a bit tricky to separate the two. Basically, what an SRE does is extend the functionality of DevOps itself, and there is a school of thought on this among some of my colleagues.
C
Some IT professionals argue that DevOps should not be coined "DevOps engineering", and that DevOps is a culture. So there is an argument and a school of thought around that as well.
C
So basically, SRE leverages DevOps design to change and extend the functionality that DevOps practices bring to the tech stack. So, as an SRE, an SRE consultant, or an SRE practitioner within a team,
C
you have a lot of key responsibilities that fall on you within that space. You are most concerned with reliability, availability, and system observability. I was very happy when the service mesh topic came up; it was a very cool topic, and it has given me very good leeway to start on system observability, plus strategies for high availability and DR, and, most critically, automation, built-in fault tolerance, and GitOps.
C
I was pretty excited when I saw the GitOps session yesterday; it was pretty cool. The containers part was also taken care of today. So I think they have made my presentation a bit easier for me. There is also what we call SPOF, single point of failure, and monitoring. This is where some of the issues come in for an SRE: what do you have to monitor?
C
What are the key functionalities that you have to monitor? What are the metrics the business is interested in? Basically, for you to be successful as an SRE, whether within an SRE team, as a consultant, or as a core engineer within the team, you should see yourself as a technical customer within the team. A customer out there doesn't know when the system is going to fail.
C
C
It's a very full topic, so I will try as much as possible to dive into some of the key areas that you have to monitor and some of the tools you can use. And this is the extended functionality part that comes in, in terms of management: as an SRE, you handle on-call plans and you prepare incident management tools as well.
C
You do RCAs, which is root cause analysis.
C
C
This is a practice that is very, very crucial for the team to sustain: the blameless postmortem. There is a template for it as well. What it means, basically, is that when anything happens, the entire team takes full responsibility, rather than pointing at someone else: "It was his call," "It was that call," "It was bad code that caused the problem."
C
So when you have a blameless postmortem that is documented, it is much easier to keep that issue from repeating itself. And most critical is the software design part.
C
You are always part of the architectural design discussions, because before the software is released, you are part of every SDLC process in terms of architecture. Into the architecture you bring all the key metrics: the service uptime, the way the microservices are being built, what kind of mesh system is used. So there you're looking at observability, you're also looking at availability, and so on. So the software design part is very critical in this.
C
So now, let's dive into the next one. I always say this: there is nothing as fun and as challenging as working in this area. A Kubernetes environment is very fun, and so is working in the cloud native space. I've never regretted it. But sometimes it comes with very big pains as well, trying to resolve issues that can leave you wondering why you chose this field.
C
Okay, next one: availability. I made mention of HA and DR. I think SREs can now say that we are a little bit advantaged when working in an environment that is Kubernetes based or cloud native driven. Kubernetes was built
C
production ready: production ready in the sense that it's fault-tolerance ready and also high-availability ready. I like the way Gabriel put it when he mentioned that it is constantly checking that every state is being maintained.
C
So one of the critical tools for an SRE to have within the Kubernetes space is the ability to write CRDs and deploy operators well. When you have that behind you, it makes managing the infrastructure a bit easier.
C
When you have CRDs, you are extending the functionality of the Kubernetes environment itself. I agree with the earlier speaker when he mentioned that it gets to a point where Kubernetes cannot take care of itself anymore and you need to enhance it. That is where you need the tool set: the CRDs and the operators. So it is key for you, as an SRE, to have that tool set. You need to learn how to write CRDs.
C
Sometimes you have to go as far as writing some Golang as well. I write in Go too, and a little Python as well. Now, in terms of availability: fine, it's built ready, it's fault-tolerance ready and all that, but you have to come in with some level of architecture that you intend to have. So you'll need your three nodes, just the normal standard:
C
one master and two worker nodes running your workloads, which is the normal setup. Now, for you to be much more fault tolerant, you will need an extra three nodes, where one acts as a load balancer, separately, and the other two act as masters as well: a second and a third master. That is the fully highly available environment that you are planning, because all these things start at the planning stage.
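Why three masters rather than two can be shown with a little arithmetic. A Kubernetes control plane keeps its state in etcd, and etcd needs a majority of its members (a quorum) to keep accepting writes. The sketch below is only an illustration and assumes each master also runs one etcd member; the function names are hypothetical and not part of any Kubernetes tool.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    # Compare the single-master setup with the three-master setup.
    for size in (1, 2, 3, 5):
        print(size, "members -> quorum", quorum(size),
              ", tolerates", failures_tolerated(size), "failure(s)")
```

Under that assumption, going from one master to two adds no failure tolerance at all, which is one reason the highly available layout lands on three.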
C
Remember I talked about software design: you are always part of the design process. This is where you have to come in as an SRE to plan things out. If you're running in the cloud, you're lucky: you just do the configuration and add some nodes. But if you're not so lucky and you're on bare metal, which happens to be the environment I work in most of the time,
C
you'll definitely need to deploy this yourself, manually. You need to get extra nodes in there. And I know you'll be wondering what the etcd is doing out there. You are an SRE; you care about performance. For a performance-driven architecture, it is crucial to have a separate etcd node or cluster running separately. With that, you can be sure of your speed and performance.
C
Here I use HAProxy. You can use any other proxy out there; you can use NGINX or another proxy to act as the load balancer, and also have your ingress controllers working as well. So that is that. Please keep your questions coming; if you have questions, I can see them. Then, in terms of reliability, which happens to be very crucial:
C
I can give you a case scenario that I had in the past when I was in the e-commerce space.
C
Fridays are usually very heavy with traffic, and certain hours of the day are heavy on traffic as well. You'll see a huge spike of traffic coming in, and how will you be able to mitigate that? A very clear example, for instance, is Jumia doing a Black Friday, or doing a promo at a certain hour of the day.
C
How can you mitigate that incoming traffic? No matter how much throttling you do, you'll still face resource drain issues, so managing resources is critical to service reliability. Some strategies at this level are vertical scaling
C
You have vertical scaling running, which is your VPA, the Vertical Pod Autoscaler, and you also have your Horizontal Pod Autoscaler, which scales out as well. It kicks in when usage goes above the threshold, say 60 percent, that you have set for your HPAs.
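The threshold behaviour can be made concrete with the calculation the Kubernetes documentation gives for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below is a simplified illustration using the 60 percent target from the talk. The function name and the min/max defaults are assumptions for the example, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 60.0,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified HPA rule: scale in proportion to how far the observed
    utilization is from the target, then clamp to the allowed range."""
    if target_utilization <= 0:
        raise ValueError("target utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Four pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90.0))
```

With four pods at 90 percent against a 60 percent target, the rule asks for six pods; at exactly 60 percent it leaves the count unchanged.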
C
C
You use your CRD to do that. If you want the full functionality, you'll have that ready to kick in at the threshold to handle the load. I believe we have all experienced this; way back, two or three years ago, there was, for instance, the JAMB portal. Sorry, I'm just using that as an example. It gets so bad
C
that no one could access the site at a particular time because of traffic. You'll hear that there's so much traffic that the site is inaccessible, or it's pretty slow and everything takes a lot of time. This is the problem. So as an SRE within the team, you need to make sure you mitigate against that. How do you do that? You come up with strategies, and this is part of the strategy you have to come up with.
C
You have your HPAs to increase the number of pods running within your clusters. You also need to include vertical scaling within your environment. This is so interesting because it happens a lot. If you're running in the cloud, you already have a big advantage: you have your autoscalers running in the cloud and they work pretty well. So how do you handle that in a bare metal situation?
C
That is where physical nodes have to come in, which you have to fit in yourself. So that is that. Also, to improve traffic handling and latency:
C
teams often fail to have a caching strategy at the back end, between the service and the database layer. Most times they miss that part out, and there is always a struggle: there is constant back-and-forth communication, which results in slow queries. So, for query performance, it is necessary to have some caching strategy. You can use Redis; Redis is pretty cool.
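One common way to put a cache between the service and the database is the cache-aside pattern: look in the cache first, fall back to the database on a miss, then store the result with an expiry. The sketch below is a minimal illustration only. It uses an in-memory dictionary as a stand-in for Redis so that it runs without a server, and `TTLCache`, `cached_query`, and `load_from_db` are hypothetical names, not anything shown in the talk.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it so the caller reloads from the database.
            del self._entries[key]
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)


def cached_query(cache, key, load_from_db):
    """Cache-aside read: serve from cache, otherwise query and remember.
    A result of None is treated as a miss and is never cached."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value)
    return value
```

Repeated reads of the same key within the expiry window reach the database only once, which is where the latency improvement comes from; the trade-off is that readers may see data up to one TTL old.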
C
Something else is a tool I've worked with as well, which is also part of cloud native; a very good tool to work with. There are other caching systems out there that you can work with too. The whole essence is to make sure you improve query performance at the database layer, and this is going to improve latency at the API layer. I think the previous speaker made mention of API calls and all that, so
C
this is an extension of what she mentioned. Now let's move on to the next one. Please keep the questions coming if you have any. In terms of observability: thank you, Jibri, for covering service mesh and all. If you missed that, you can watch the video to really grasp what a service mesh is. He made mention of sidecars there; very good.
C
So in the SRE world, we hear words like injection of sidecars into a running service, and it's the same thing as what he explained earlier.
C
So it's very crucial for an SRE within any team to have eyes on the services that are running at the back-end level. If you have a service mesh running, you can directly scrape or monitor the endpoints for the interconnectivity, to check which is failing and which is not working. This will also help you mitigate against service degradation.
C
It's something I see almost every day in some teams: they have a lot of services that are degraded and are not picking up again because of the visibility strategy they use. There are different strategies you can put in to make that work: you can have retries, you can have single transactions.
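As an illustration of the retry strategy, here is a minimal sketch of retrying a failing call with exponential backoff. It is not tied to any particular service mesh or library; `call_with_retries` and its parameters are hypothetical names, and a production version would usually add jitter and retry only on errors known to be transient.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on any exception with exponential backoff.
    The delay doubles after each failed attempt; the last error is re-raised
    once all attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter is injected only so the behaviour can be exercised without real waiting.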
C
There are a lot of strategies that you can put in to make that happen, and all of this happens at the design stage, during the architectural design. If you remember, when I started I talked about the SRE being part of the software design plan. So you are in every meeting, always contributing your own experience, making sure the software that is about to be delivered is in 100 or 99 percent conformity with best practices.
C
So you cannot imagine a service being put out there when you are an SRE within the team and there is no service mesh, no way to improve the interconnectivity of the microservices that are running. You could say, fine,
C
we are running an SPA, just a single application. You can still have some level of observability there: you can have an endpoint scraped periodically to check whether that endpoint is up. So it's not only for microservices; you can have that done in an SPA environment too. And then there is distributed tracing.
C
I remember when I got into a particular team, there were a lot of issues, and you didn't know where the issues were coming from. You could not even tell whether it was a 200, a 502, or a 504; all you saw was "internal server error", which is quite vague. Having the developers or software engineers start tracing and checking things out takes a lot of time, and it becomes so mundane and repetitive.
C
As much as possible, you eliminate every repetitive task, or automate those processes as much as possible. So one of the things you can do is bring in distributed tracing. A very good tool you can use in the microservices space, which is part of cloud native, is what we call Jaeger. You can use Jaeger to do this. It runs on AWS, it runs on Azure, it runs on Google Cloud, and it also runs on bare metal.
C
It's something I've been exposed to a couple of times, and it's one of the tools that I use. With distributed tracing, the developers can tell: is it a 200?
C
Is it a 504, or is it a 500? It comes up and drills down to the exact service that is failing, so you visualize it and you see it.
C
I've started a whole big topic here that I could talk on and on about so we can all gain clarity. Then there is APM, application monitoring. Now you're drilling down to a particular endpoint directly, and you are scraping it every second. You could use Prometheus for that. You could use enterprise tools for that; you could use New Relic as well.
C
You can use Datadog as well, but my go-to tool, which is quite flexible for me, is Prometheus. I use Prometheus to do this almost all the time, depending on the environment I find myself in. So you set that up with it, and you set alerts on the endpoints that you're monitoring.
C
If it's failing, you see it first, before the client or the customer sees it, so you can mitigate it. When that happens, that is where you do your rollback almost immediately. A very common scenario is when a new release is out and the endpoint fails. So what happens? You have to do a quick rollback.
C
That is where your GitOps comes in; your GitOps automation comes in to make sure you can do that. I will talk about that in the next couple of slides. Also, back-end observability is very critical.
C
You have to monitor your database. Right now there is even a new role called database observability engineer: basically taking care of the database part, collecting logs, checking query latency, and doing their best to make sure there is full observability of the backend services running on the database side. Also, a lot of us have heard about Fluentd. Fluentd is a very good logging tool that I think every SRE out there should have as part of their tooling.
C
With Fluentd, you have full visibility into the services that are running. You can do a whole lot with Fluentd. If I'm given the opportunity some other time, I could talk extensively about Fluentd in the SRE space. I think I should move to the next slide now; my time is almost gone. Okay, fault tolerance. For fault tolerance within the SRE space, you use a chaos method, a chaos mechanism.
C
Fine, you have your QA, and your QA is working. For fault tolerance, you can also automate that process within your pipeline, your build pipeline; automate the full pipeline that is running. You could be running in a traditional Jenkins environment, for instance, and if you're lucky enough, you could be working in a GitHub Actions environment. So you automate the entire process, with your QA within it. And, most importantly, there is something you have to add to make your chaos method complete within the pipeline:
C
you need to have endpoint checks within your pipeline. Before the release happens, the pipeline makes an API call to the endpoint. If the endpoint fails, the release should not get into production; in fact, it should not even get into the sandbox area, and should not even make it to staging.
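The pipeline check described here can be sketched as a small release gate: run every endpoint check, and allow promotion only if all of them return a success status. This is a hypothetical illustration, not a Jenkins or GitHub Actions feature; `gate_release` is an invented name, and each check is passed in as a callable returning an HTTP status code so the example runs without a network.

```python
from typing import Callable, Dict, List, Tuple


def gate_release(checks: Dict[str, Callable[[], int]]) -> Tuple[bool, List[str]]:
    """Run each named endpoint check and decide whether to promote.
    A check fails if it raises or returns a non-2xx status code.
    Returns (ok_to_promote, names_of_failed_checks)."""
    failures = []
    for name, check in checks.items():
        try:
            status = check()
        except Exception:
            failures.append(name)
            continue
        if not 200 <= status < 300:
            failures.append(name)
    return (not failures, failures)


if __name__ == "__main__":
    ok, failed = gate_release({"health": lambda: 200, "orders": lambda: 503})
    # A real pipeline step would exit non-zero here to stop the promotion.
    print("promote" if ok else f"blocked by: {failed}")
```

In a real pipeline each callable would wrap an HTTP request to the freshly deployed build, and a blocked result would fail the job before staging.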
C
So it's a very critical area, and a systematic way to make sure that before you go to production everything is at a hundred percent, because the business focus is to make sure that the services out there are at one hundred percent. They don't want to hear that it's ninety-nine percent; for the business, everything is a hundred percent. So the chaos method is one of the critical steps before you go to production.
C
You have to make sure that happens. Also, the mesh system has made fault-tolerance testing very easy. With the diagram here, you can see that you have the load balancer coming in, then the ingress data plane, the ingress control plane, and the service mesh control plane, all hitting the service proxies. Jibri talked about service proxies and all that.
C
So this is the intercommunication between the microservices themselves, and it also improves security for inter-service communication. From what you can see here, they are all communicating separately and independently, each hitting its own service proxy. Each of them has a separate policy that enables service discovery and helps avoid service degradation.
C
What kind of mechanism do you have? How many seconds, or what constitutes a service being degraded? Is it after three tries that the service is considered degraded? And if it's degraded, is there a replica, is there replication? That is where you have a strategy within the K8s space, where the StatefulSet comes in. With these in place, you can be as fault tolerant as possible.
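The "degraded after three tries" rule can be written down as a small state tracker. The sketch below is an assumption about one reasonable policy (consecutive failures, reset on the first success), not the behaviour of any specific mesh; `DegradationTracker` is a hypothetical name.

```python
class DegradationTracker:
    """Marks a service as degraded after N consecutive failed checks and
    clears the flag on the first successful check."""

    def __init__(self, failure_threshold: int = 3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        # A success resets the streak; a failure extends it.
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def degraded(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold
```

Writing the rule down this way forces the design questions from the talk into the open: what the threshold is, and what should happen (for example, routing to a replica) once `degraded` becomes true.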
C
Automation: the critical part for every SRE. I think this is one of the core functions within the SRE space: automation, code compliance, infrastructure as code, and having Helm act as your app management tool set. Like I said here, it is one of the cheapest yet most expensive practices. Some good practices for code quality:
C
you can have linters to make sure your infrastructure code is well formed, using linters to check your YAML, and also have version control. There's also a use case, since, like I said, it is one of the cheapest practices: the case of failover.
C
Let's take a practical example. You have your infrastructure built using Terraform, for instance, and you spin up the EKS, AKS, or GKE environment that you're managing. If something happens to the region or to the data center, you can easily roll back: you can spin up the infrastructure again in a short period of time. Compare that with having to do your snapshots and having to do geo-replication:
C
that is quite expensive. So IaC is cheap, but it's expensive in terms of usability, because sometimes that is not put into perspective. So within a team, the SRE ensures that every piece of infrastructure is built as code, not through a UI.
C
The UI is highly discouraged because it is a manual, error-prone approach, and because of the amount of time it takes to bring things back up if the infrastructure goes down. Another very good use is for your backups, your SSL certs, and your installations. For instance, if you want to install some software or some packages through the pipeline, you can use Ansible to do that. It makes the entire infrastructure more robust and quite easy to maintain.
C
Okay, I think this was dealt with yesterday, and I have very limited time. GitOps was dealt with thoroughly yesterday, and I was pretty happy with it. They talked about Flux and Argo CD.
C
There is so much talk about it: should I go for Argo CD, or should I use Flux? Flux can only observe one repo at a time; I don't know if they've been able to update that. I think the lady from Weaveworks can correct me on that. With Argo CD, you can observe multiple repos. So before you use a tool, you should know the reason why you are using that particular tool.
C
Remember I talked about rollback, and how quickly you can get back up again to maintain some level of reliability and availability. GitOps automation has come to stay, and we have Argo CD and Flux to do that. I prefer using Argo CD because of the multiple repos that I work with.
C
I think it's one of the areas where Flux has to improve; I don't know if they have a new version that can do that at the moment. GitOps also gives you a source of truth and release integrity, and it becomes ownership driven. Next, containers and containerization. I really enjoyed the container session that was done by Grits; I really loved it, I enjoyed myself, and I learned as well. So we all know the regular container health checks.
C
We can tell you, okay, fine, the life cycle: you have the liveness probe, readiness probe, and startup probe, and they're pretty cool. But there is more to it. If you are working in a real cloud space, on a data platform or a data streaming platform, for instance, they may want some other Java-based, Python-based, or Golang-based applications to run side by side with the containers.
C
What you have to do is use an init container: you initialize the software first, before the containers that hold the images actually run. There are sets of scripts that you will not find within your images, so it's something you'll have to write. I think we all know this command, a very simple command that we can remember.
C
You have this sleep: sh, echo blah blah, sleep, and all that. And you have the init containers that initialize what the containers you need to run side by side depend on. With that, I think I can show you very quickly how that works. By the way, if you're wondering what kind of UI I'm using, this is one of my tools; it's what we call Lens. So this is an example of init containers running.
C
So this is how an init container runs. It runs side by side with the app that has the images running. The advantage of this is that you don't have your app throttling, or your image throttling or shutting down unnecessarily, because you've taken the heavy weight off: the code for setup, everything, is in the other container. So it's like you're doing a separation of processes.
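To illustrate the setup being described, here is a sketch that builds a Pod manifest with one init container and one app container and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. One detail worth stating precisely: in Kubernetes, init containers run to completion, in order, before the app containers start, so the setup work is separated in time rather than running alongside the app. All names, images, and the shell command below are hypothetical placeholders, not the ones shown in the demo.

```python
import json


def build_pod_manifest(app_image: str) -> dict:
    """Return a Pod manifest whose init container prepares a shared volume
    before the application container starts."""
    shared_mount = {"name": "workdir", "mountPath": "/work"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "init-demo"},  # hypothetical pod name
        "spec": {
            "initContainers": [{
                "name": "setup",
                "image": "busybox:1.36",
                # Placeholder setup script in the spirit of "sh, echo, sleep".
                "command": ["sh", "-c", "echo preparing > /work/ready && sleep 5"],
                "volumeMounts": [shared_mount],
            }],
            "containers": [{
                "name": "app",
                "image": app_image,
                "volumeMounts": [shared_mount],
            }],
            "volumes": [{"name": "workdir", "emptyDir": {}}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pod_manifest("example.com/streaming-app:1.0"), indent=2))
```

Keeping the setup script in its own container means the application image stays small and the app container's resource limits are not spent on one-off initialization.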
D
B
All right, so single point of failure. You have to have a multi-region
C
deployment kind of strategy. You need to have your AZs, as in the diagram here. I think I need to be fast. Sorry, Abubakar, I'm taking more than my time. Maybe you could go to the video afterwards and look through it to see how this works. You have multi-region, and also, okay. Okay, thank you. Thank you. Okay, all right, so let me go ahead.
C
Okay, cool. So, the best way to avoid failures in your production environment is to have multi-region, your DRs, and, as I mentioned, IaC, which is infrastructure as code; I'm not going to delve into that again. You have your disaster recovery mechanism and your multi-region as well. In terms of your DRs, it's always very critical to have your database structured into the DR process as well.
C
So your DRs are for your database. DRs are very expensive, so before you do that, you should know it's going to cost a huge amount of money to go ahead with. You have the transaction logs as well, and also redundancy, having your cluster and then a master.
C
All this helps with your replication and also with your data performance. I think I will rush through the other one so I can take questions. Okay, interesting. I think I'm on the last part now. Cool, this is it. Now, we've heard about SLOs and SLIs.
C
I think, Abubakar, you need to bring me back again so I can talk more about SLOs and SLIs. In the DevOps community or the cloud native community, some people really understand what an SLO is and how to set one. I remember my very first consulting engagement: we wanted to draw up an SLO for their application, for the company, and it was very difficult for me. I had to go through the Google documentation to understand what an SLO was. I wasn't looking at it from a business perspective;
C
I was looking at it from an engineering perspective. Once I started looking at it from a business perspective, it was quite easy for me to relate to what an SLO means to the business and how important it is. You're making a call to an application, streaming on Netflix, for instance, and the business wants to make sense of what the response time is. Very critical is your latency. This is a typical example of an SLO dashboard that was made in-house with PromQL, which runs on Prometheus, and it was built as a Grafana dashboard.
C
You have the throughput, and the error budget, which is something I could talk about for six hours. Error budget is quite a huge topic. Take any company that is running at 99.9 percent:
C
what that means is that, over a whole year, you can calculate the amount of time they're actually losing money, and that is very critical for the business. Any time I present an SLO dashboard, the business gets excited about it, so they put more pressure on the team to make sure they perform optimally. I got into a team that was at 95 percent, and that was a huge loss in terms of error budget, but in the engineering world 95 percent was considered okay, a good uptime.
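The gap between those availability figures is easier to see as downtime. The allowed downtime for an availability target is (1 − target) × period, so over a 365-day year 99.9 percent allows about 8.76 hours, 99.99 percent about 53 minutes, and 95 percent about 438 hours (roughly 18 days). The sketch below does only that arithmetic; the function name is a hypothetical helper.

```python
def allowed_downtime_hours(slo_percent: float, period_days: float = 365.0) -> float:
    """Hours of downtime permitted by an availability target over a period."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return (1 - slo_percent / 100) * period_days * 24


if __name__ == "__main__":
    for target in (95.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours per year")
```

Putting a revenue-per-hour figure next to these numbers is one way to present the error budget in the business terms the talk recommends.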
C
C
Why do we end up with a bad error budget? One of the causes of a bad error budget is the tech stack. The environment that was at 95 percent was running a legacy application. I'm not trying to say some languages are bad, but if your application needs to be interoperable, you need to make sure you are running a modern stack.
C
So that was a great improvement for the business over a whole year, when it was calculated. For you to have an effective error budget, you need to measure that particular service for a minimum of 14 days. That is when you can start to come up with something. SLO maturity actually starts from 30 days.
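Once a measurement window such as 14 or 30 days has accumulated, the error budget can be computed from request counts. The sketch below assumes a request-based SLI (good requests divided by total requests) and reports how much of the budget has been used; the function and field names are hypothetical, and in practice the counts would come from a monitoring system such as Prometheus.

```python
def error_budget_report(good_requests: int, total_requests: int,
                        slo_percent: float) -> dict:
    """Summarize a window of traffic against an SLO.
    sli_percent:      measured share of good requests
    allowed_failures: failures the SLO permits for this much traffic
    budget_consumed:  fraction of that allowance already used (1.0 = exhausted)"""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    actual_failures = total_requests - good_requests
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures > 0:
        consumed = actual_failures / allowed_failures
    else:
        # A 100% target leaves no budget: any failure exhausts it.
        consumed = 0.0 if actual_failures == 0 else float("inf")
    return {
        "sli_percent": 100 * good_requests / total_requests,
        "allowed_failures": allowed_failures,
        "actual_failures": actual_failures,
        "budget_consumed": consumed,
    }
```

For example, one million requests with 500 failures against a 99.9 percent target gives an SLI of 99.95 percent and about half of the budget used.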
C
If you build an SLO dashboard now, you're not going to get the exact picture right away; you have to wait for 30 days before you can really make sense of the dashboard that you've made for the business. The business loves dashboards. This is one of the beauties of being an SRE: you communicate with the business. You are a business person within a technical team; you are the technical business person within the team, making sure all these parts are followed and taken care of.
C
In terms of incident management, there is a template for that. You can chat me up and I can share it. For RCAs there are usually templates as well, and for postmortems there's also a template to work with. Then there's on-call. Thankfully, there is a lot of software now that does on-call rotation. You can use Squadcast to make that happen.
C
You can use New Relic's built-in on-call, or you can have your own internal on-call system to handle that. The whole essence of on-call is to make sure, well, I think I will teach that another time as well. There are procedures that come with on-call, which come in when the software goes down: who do you have to talk to? Who do you need to speak with? Now, in our traditional setting,
C
the software goes down and everybody goes haywire. We don't know who is responsible, who is the next person, who is the right person to speak with. So SRE brings processes into the software paradigm. You are part of the entire SDLC chain, right to the very last part. Everybody goes to sleep, but as an SRE you are thinking about how to make sure the software stays up and is always available. So there is a lot to talk about
C
in being an SRE, so much. This is quite a limited time for me to talk. I would have loved to do some demos as well, but time is not going to permit me. I can quickly show you something; I don't know if I can share my screen right now.
D
Okay, where's my screen now? What should I share now? Okay, I think I can share something now. All right, so
C
Cool. All right, so this is a dashboard, a streaming dashboard built with Kafka. The business wants to see this; the business is interested in this and wants to see how things are happening. This is how the business makes sense of what is happening in the back end, and this is a classic example of how this works.
C
This is Kafka running at the back end for data stream processing, which is live right now. This is happening live: you're getting live streams and seeing the calls that are happening. So this is a practical example of what an SRE's day-to-day is. This is how I see the SRE: you are part of it, you build it, you monitor it. Those are the three core areas you have to look into. So thank you.
C
I think I'm good now.
E
Awesome. I have been sharing everywhere that it seems as if Illyria Yoda planned this schedule with such vision and carefully placed your session to be the last one. You've been touching on almost all the previous sessions from yesterday. Comments have been going around on that. This is very interesting, and definitely, if we had known, we'd have made it one hour altogether. Think of it: one hour.
E
Yeah, but we'll definitely reach out next time we are hosting more events, so that you can share more, especially on the SLOs and SLIs. Yeah.
E
We don't have any questions in the chat, but I've shared your LinkedIn URL so people can reach out if they have any questions to ask you. And I don't know if you have a Twitter handle; most of our comments
E
Okay, so if you want to reach out to Frank, it's at foodrank on Twitter, so you can have conversations about things you want to learn around SRE. SRE is a major thing; there is a lot of interest, and a lot of companies are looking for more talented SREs, literally. Okay. Thank you very much for your time, Frank and all.