From YouTube: SLOconf: Left Shift your SLOs - by Michael Friedrich
Description
Everyone talks about security shifting left in your CI/CD pipeline: tools and cultural changes enable teams to scale and avoid deployment problems. SLOs are left out of this. What if a software change triggers a regression and your production SLOs fail? As a developer, you want to detect these problems as early as possible. This talk dives deep into CI/CD pipelines and discusses ideas to calculate and match SLOs in the development lifecycle, early in your pull or merge request for review.
Hello everyone, today I want to show you how to left-shift your SLOs. A little bit about myself: I'm working at GitLab as a Developer Evangelist, and you can find me online as dnsmichi, which is d-n-s-m-i-c-h-i, on Twitter, on LinkedIn and everywhere else.
A
Now,
let's
dive
into
a
little
bit
of
history
when
we
have
been
slowless,
meaning
to
say
well
we're
turning
back
time
a
little
and
saying.
Well,
we
had
the
traditional
monitoring
state
black
box
monitoring
slowly
have
been
adding
metrics
to
our
monitoring
journey.
We
had
traditional
sla
reporting
at
some
point,
probably
generating
some
pdfs
for
our
managers
and
and
also
calculating
the
service
level
availability
for
our
customers.
So we were tracking state changes over time, maybe adding some metric data points and trends, and in the end we kind of needed to learn SLA, SLO and SLI, and that's it. Let's dive a little more into the history and find some situations where it could have been useful to develop some rocket science and also add SLOs.
We were thinking about a CPU overload with threads involved and considered using lightweight coroutines in C++, meaning to say those are like fibers or asynchronous routines, stackless: the function's state is put on the heap when the function is being paused and the stack unwinding happens, ready for later continuation. Sounds easy, and we continued implementing that. Well, and then we had to debug it, because it went into production.
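To make the stackless part concrete, here is a minimal C++20 sketch, assuming a trivial task type rather than the actual code from this story: the compiler allocates the coroutine frame, the locals plus the resume point, on the heap, so the function can be paused and continued later.

```cpp
#include <coroutine>
#include <iostream>

// Minimal C++20 stackless coroutine sketch (not the code from the story).
// The coroutine frame lives on the heap, which is what lets the function
// pause and resume without keeping a dedicated stack.
struct Task {
    struct promise_type {
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };

    std::coroutine_handle<promise_type> handle;
    ~Task() { if (handle) handle.destroy(); }
};

Task handle_api_client() {
    std::cout << "request received\n";
    co_await std::suspend_always{};   // paused: state lives in the heap frame
    std::cout << "response sent\n";
}

int main() {
    Task t = handle_api_client();     // created suspended (initial_suspend)
    t.handle.resume();                // runs until the co_await pauses it
    t.handle.resume();                // continues after the pause
}
```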
It went into a release, and there was a certain crash, in production only, and only with something like 100 to 1,000 API clients or a little more than that. The memory was corrupted at some point, and it also sort of got exhausted, which pointed to some kind of leak. We were not sure about it, and lots of time went into debugging.
This was when we could have created a staging environment and defined our SLOs, because we actually want to ensure that the stack and heap memory meets the operational requirements. So we define a service level objective saying: hey, we want to ensure that heap and stack memory stays at the defined usage level, for example 2 gigabytes of RAM, because we cannot really exceed that from an agent perspective. On the other side, we actually needed to measure that in real production environments; "it works on my machine" is not a valid argument when the application is crashing. We also probably would have needed some sort of chaos engineering for our API requests. None of that was there, and this is just one of the many situations where SLOs would have been helpful.
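As a rough illustration of what checking such a memory budget could look like, here is a hypothetical sketch, assuming a Linux environment where the resident set size can be read from /proc/self/status; it is not the actual agent code:

```cpp
#include <fstream>
#include <iostream>
#include <limits>
#include <string>

// Hypothetical illustration of the 2 GB memory SLO: read the process's
// resident set size (VmRSS, reported in kB) from /proc/self/status on Linux
// and compare it against the budget. A non-zero exit fails a CI quality gate.
long current_rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string key;
    while (status >> key) {
        if (key == "VmRSS:") {
            long value = 0;
            status >> value;
            return value;
        }
        status.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    return -1;  // VmRSS not found (e.g. non-Linux platform)
}

int main() {
    const long budget_kb = 2L * 1024 * 1024;  // 2 GiB budget, in kB
    const long rss_kb = current_rss_kb();
    std::cout << "VmRSS: " << rss_kb << " kB, budget: " << budget_kb << " kB\n";
    return (rss_kb >= 0 && rss_kb <= budget_kb) ? 0 : 1;
}
```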
Another story is when it goes really slow. You maybe remember the time when there was a report about GTA Online: the loading times were like eight or ten minutes, and a user reverse engineered the binary, guessing from the assembler code how often a specific function was being called, and worked around it by creating a preload DLL, saving 70 percent of the loading time.
Well, it could have been prevented in our development process. The mitigation as a developer would have been to measure the loading time at application timing points in a first iteration, add some metrics in a second iteration, then look into tracing spans to figure out what's going on and make the application usable. The thing is, we still need to add some sort of service level objective to achieve, and this is now an SLO in a staging environment.
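A first iteration of those timing points could be as simple as the following sketch; load_assets and the phase names are hypothetical placeholders, not the real GTA Online code:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

// First-iteration timing points: wrap the suspected phases of the loading
// path in simple wall-clock measurements. load_assets() stands in for
// whatever the real application does during loading.
void load_assets() {
    std::this_thread::sleep_for(std::chrono::milliseconds(150));  // placeholder work
}

template <typename Fn>
void timed(const char* phase, Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    // In a second iteration these prints would become real metrics,
    // and later on, tracing spans.
    std::cout << phase << " took " << elapsed.count() << " ms\n";
}

int main() {
    timed("parse_config", [] { /* ... */ });
    timed("load_assets", load_assets);
}
```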
We want to ensure that merge requests and pull requests are directly deployed from a CI/CD pipeline into a staging environment, and we're defining end-to-end test scopes, saying: well, the user is now doing the login, that is one of the test scopes; then the user is playing, and also experiencing things like logout and login issues, and so on, all for defining an actual SLO.
Let's just imagine saying: hey, the login time needs to be lower than two minutes, but that is for low-latency connections, for example, a metric which we can correlate from application performance monitoring and user monitoring in a certain way. If it's a high-latency connection, we want to raise the allowed login time to five minutes, because there could be traffic involved and other things. And when we detect a certain failure of the SLO, we have a quality gate.
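A minimal sketch of such a latency-aware check, assuming a hypothetical 100 ms cut-off between low- and high-latency connections, might look like this:

```cpp
#include <chrono>
#include <iostream>

using namespace std::chrono;

// Hypothetical latency-aware SLO check: the login-time objective is two
// minutes on low-latency connections and is relaxed to five minutes on
// high-latency ones; the 100 ms cut-off is an assumption for illustration.
bool login_slo_met(seconds login_time, milliseconds connection_latency) {
    const seconds threshold =
        connection_latency < milliseconds(100) ? minutes(2) : minutes(5);
    return login_time <= threshold;  // false would trip the quality gate
}

int main() {
    // A 3-minute login fails on a fast link but passes on a slow one.
    std::cout << login_slo_met(seconds(180), milliseconds(20)) << '\n';   // 0
    std::cout << login_slo_met(seconds(180), milliseconds(250)) << '\n';  // 1
}
```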
This brings me to defining an SLO process. In order to shift left, we need to update our culture, we need to update our workflows, and we need to define an actual process for that, meaning to say we need to be aware of quality gates in our CI/CD pipelines. So, for example, we can use Keptn as a quality gate, and we can combine it with Prometheus for the metrics measuring the SLOs, but we could also add chaos engineering, like adding LitmusChaos to it. In addition to quality gates, we also need to define application observability: adding the traditional application performance monitoring and enriching it with logs and tracing, combining everything into OpenTelemetry. So our SLOs are actually defined early enough in the CI/CD pipelines, with quality gates and with application observability.
Now, when we want to measure everything, we certainly need to find ways to automate things. So probably defining an SLO means we generate it from a predefined DSL, and I would recommend watching Andrew's talk around this: how we define it on GitLab.com, matching several things, even symptoms as SLIs, in order to go SLO.
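Purely as a hypothetical sketch, and not GitLab's actual DSL (see Andrew's talk for that), generating SLOs from a small declarative format could start like this:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical only: a tiny declarative format of the shape
// "name objective_percent window_days", parsed into SLO definitions
// from which tooling could generate rules and quality-gate config.
struct Slo {
    std::string name;
    double objective;   // e.g. 99.9 (percent)
    int window_days;
};

std::vector<Slo> parse_dsl(const std::string& text) {
    std::vector<Slo> slos;
    std::istringstream in(text);
    Slo s;
    while (in >> s.name >> s.objective >> s.window_days) {
        slos.push_back(s);
    }
    return slos;
}

int main() {
    const std::string dsl =
        "login_availability 99.9 30\n"
        "loading_time_p95   95.0 7\n";
    for (const auto& s : parse_dsl(dsl)) {
        // From here one could generate monitoring rules or gate config.
        std::cout << s.name << ": " << s.objective << "% over "
                  << s.window_days << " days\n";
    }
}
```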
And this brings me to basically summarizing what I've learned in my past: developers need to get the SLO feedback. It doesn't help when someone is paged or someone watches a graph; it needs to be there, directly integrated into the merge request or pull request. You need to get the details: why was the pipeline failing, what exactly is the SLO, and more details to read up on. And, of course, we need to train the teams to adopt it.
So we could have prevented introducing the coroutines, because the memory corruption would never have reached customers in large-scale environments, the GTA Online loading algorithm would never have been introduced in that way, and, of course, we would have added more code quality checks. This is everything we can do for improving SLOs, and in order to go even more SLO, we really want to correlate the SLOs with our instrumentation and observability capabilities in the future of our DevSecOps workflows.
We need to train everyone and make everyone aware that this is a benefit, and also make use of dynamic resource environments, because this is a huge benefit of auto-scaling virtual machines in AWS or whatever cloud is needed. We can also reuse the power of container clusters and add chaos engineering, saying: well, I want to really battle-test the application before it reaches production. And there was once an SLO that made it to prod.