From YouTube: GMT20210113 Kubernetes fire drill
A: So I actually haven't been involved in any of these yet. What I was thinking we would do is walk through a fictitious scenario, to see what ideas people have. The first question I'd like to pose is: what metrics would you look at, or what would you do, in the first five minutes, or even the first minute, of the incident?
A: I think this varies depending on where you are in the incident, so I'd like to start off with just: alerts have fired. What is the first thing you do?
B: Sure, yeah. So we open the SLO dashboard and see that, you know, errors have started to increase and alerts are firing.
A: Okay, so the next question I'm going to pose here. Obviously what we would do after this is maybe look at logs as well to try to narrow in, but one thing I'd like to do immediately is rule out a cluster-specific issue. So how would you do that? You want to rule out whether this is infrastructure or application.
D: So this happened to me yesterday, although for a different service. You can go down to the Kubernetes overview tab on the generic dashboard and inspect the different clusters on various series. When I've had this happen in real life it's often been a bit indirect: one cluster has substantially higher or lower, say, memory or CPU, which can be used for drilling down further, usually in the GKE console.
D: So yeah, we've got a metrics catalog function to generate a couple of extra bits for dashboards marked as having stuff in Kubernetes, which will soon be almost everything. You've got some Kubernetes detail dashboards here, which I'm not going to go to yet, and you've got a Kubernetes overview fold-out.
D
These
are
typically
partitioned
by
a
cluster.
So
the
case
of
gear
we've
only
got
zonal
clusters,
but
you
might
also
see
the
regional
cluster
here
for
some
services
and
as
as
expected
for
a
healthy
service,
we
can
see
quite
tight
banding
for
resource
usage,
one
symptom
at
a
higher
level
that
might
lead
to
looking
for
cluster
level.
Problems
is
sudden
drops
of
like
a
third
right,
because
we
typically
have
like
the
three
zonal
clusters.
D
So
if
you
see,
for
example,
error
ratio,
abdex
go
to
two-thirds
or
one-third,
that's
often
a
bit
of
a
clue.
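A minimal sketch of the per-cluster split D describes, assuming Prometheus-style metrics with a `cluster` label; the metric name `http_requests_total` and its `status` label are illustrative stand-ins, not the actual metrics-catalog series:

```promql
# Error ratio split by cluster: with three zonal clusters, a single
# unhealthy cluster shows up as one diverging series here, and as a
# roughly one-third step in the aggregate error ratio or apdex.
sum by (cluster) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (cluster) (rate(http_requests_total[5m]))
```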
D: If you already sort of suspected the cluster thing, you could inspect the workloads in the GKE viewer, but I would prefer to start with the logs. So let's do this for git HTTPS, like Puma. Am I still sharing? I think so?
D
Yes,
you
are
sorry
that
wasn't
like
yeah
I
got
paid
for
like
7
am
so
I'm
still
I'm
still
recovering.
What
I
meant
to
say
in
in
actual
english
was,
I
would
go
to
the.
What
I
usually
do
is
go
to
the
logs
and
start
splitting
error
time
series
charts
by
interesting
metrics.
So
right
here
I
could,
I
could
add,
a
split
series
for.
A
Actually,
on
the
on
the
agenda,
I
put
a
quick
link
if
you
just
want
to.
D: Yeah, Kubernetes region, nice. Yeah, so there's nothing super interesting here, but if one of the clusters was borked, you would perhaps see a substantially higher number of errors in one of them.
B: Right, and that's pretty much exactly what you're about to put together in your query as well.
D: Craig, mine was not going to be as pretty; mine was just going to be four lines. I don't even know how to draw charts like this. What is...
B: All right, so my guess is that this is going to be somewhere not in gitlab-helmfiles, because that's our infra stuff, but in the Kubernetes workloads repo. So let's start with that.
B
Values
and
gprod
in
a
tab,
okay,
so
this
is
including
a
bunch
of
stuff,
including
the
gprod
one.
So
let's
maybe
go
take
a
look
at.
B
This
and
the
g
prod
one
okay
g
prod.
This
looks
like
the
place
we
want
to
make
that
change.
So,
let's
see
if
we
have
a
a
sub
section
for
the
the
git
service.
A
So
what
we
have
here
are
this
is
like
the
base
value
for
the
minimum
max
number
of
replicas
and
then
underneath
web
service.
You
can
define
multiple
deployments.
This
is
the
traffic
splitting
that
was
recently
introduced
into
the
helm
chart
which
allows
you
to
have
different
deployments
for
different
https
traffic.
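A minimal sketch of the values shape being described, assuming the webservice chart's per-deployment map; the keys, paths, and numbers here are illustrative, not the real gprod values:

```yaml
gitlab:
  webservice:
    minReplicas: 2          # base values, inherited unless overridden
    maxReplicas: 10
    deployments:
      web:                  # catch-all deployment for non-git requests
        ingress:
          path: /
      git:                  # dedicated deployment for git-over-HTTPS
        ingress:
          path: /git        # illustrative; the real routing rules differ
        hpa:
          minReplicas: 50   # per-deployment override of the base values
          maxReplicas: 150
```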
E: I think that this is, well, not that straightforward. This is a detail that you need to know in order to fix something, so maybe just include a comment about: hey, if you need to increase the capacity, increase the webservice as well, or just increase this instead.
B: Yeah, that could also be in the runbook for the git service. I think that would be a useful thing to have. I'm...
E
Not
I
think
in
the
file
is
probably
a
better
place
to
have
this,
because
it's
more
direct,
because
when
I
find
a
file
where
I
can
tune
values,
I'm
less
inclined
to
look
for
a
run
book
to
see.
Okay,
I
can't
find
what
I
need
to
tune.
What
what
they
need
to
do
and
then
basically
just
having
a
pointer,
would
probably
be
a
good
thing
there.
So.
A: There it is. So this is what we're overriding; this is the base values file, and you can see...
A: We have deployments defined here, and right now we have two deployments, web and git. The web deployment is the catch-all: any request that's not a git request. Currently on Kubernetes that's only covering WebSockets for the interactive terminals, so it's not a lot of traffic. Eventually, or very soon, we're going to be creating a new websockets deployment for that specifically.
B
Got
it
cool
so
jumping
back
over
to
the
the
gprod
one?
I
guess
this
is
what
we
would
edit.
So
let
me
just
go
ahead
and
do
that
now.
B: For git HTTPS, was it? Yes, yes: webservice from 150 to 250. Yeah, I'll just go ahead and submit this, if that's okay.
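The change itself would be a one-line values override, along these lines; whether the 150-to-250 bump is the minimum or maximum replica count isn't spelled out on the call, so the key below is an assumption:

```diff
 gitlab:
   webservice:
     deployments:
       git:
         hpa:
-          maxReplicas: 150
+          maxReplicas: 250
```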
A
Right
now,
this
project
requires
approval
from
delivery,
but
I
guess
like,
depending
on
who's
online,
you
may
you
may
need
to
override
that
you
can
see
the
dry
run.
If
you
haven't
seen
this
before,
you
can
see
the
dry
runs
that
are
happening
in
the
deployment
pipeline
on
ops.
If
you
click
the
link
igor
for
ops
deployment
pipeline.
A
A
Then
you
would
need
to
override
the
approval
settings
to
merge
it
in
the
on
the
project,
because
I
think
everyone
has
the
has
edit
capabilities
on
the
project
itself.
Hi
dude.
We
should
probably
make
it
so.
At
least
managers
have
merge
approval,
access
and
probably
all
sres
as
well.
B
G
We
should
not
have
override
approvals.
We
should
have
everyone
actually
practice
more
in
this
repository,
submit
more
mrs
and
go
to
the
trainee
process
that
craig
muscle,
for
example,
is
going
through,
and
then
it's
going
to
be
trivial
for
us
not
to
have
a
specific
named
folks
in
there
like.
I
don't
want
delivery
to
be
the
gate
here.
I
just
want
to
ensure
that
every
sre
has
worked
in
this
repository
know
what
they're
doing
and
then
they
can
serve
each
other.
A: Cool, so this is your diff. If we were in an incident, we would look at this, we'd probably say okay, we'd merge it, and then it would get applied to the clusters.
D: Technically it does a helm diff, which is the diff between the manifest that helm has generated from your branch and what it generated last time. That's usually identical to what you described, Hendrick, but if someone has made a manual intervention, it doesn't show up in the diff. So that's a bit of a gotcha that usually doesn't matter.
H: I have one comment before I have to drop off in a couple of minutes here. I wanted to make sure, and I already pinged Alberto about this, that we're actually leaving this call with owners for some of these actions. I see some really great stuff here, like Hendrick's mention that we should comment about how traffic is split.
H
Know
I
love,
I
would
love
to
see
some
of
these
things
just
generate
issues
like
we
don't
need
to
necessarily
like
action
on
everything
but
like
if
we're
getting
things
in
our
in
the
reliability
backlog.
For
sure
I
would
love
to
see
that
as
the
outcome
of
this
and.
E
What
how
can
we
so
if
we,
if
we
see
that
the
we've
already
reached
the
capacity
limits
of
the
notes
that
are
in
the
they're
in
the
cluster?
How
do
we
go
about
this?
Is
this
auto
scaling
or
how
will
we
increase
it
just
in
terraform
and
bump
up
the
number
of
the
nodes,
or
how
would
that.
A
Yeah
this
this
would
be
in
terraform,
and
we
can
take
a
look
at
that
now.
If
we
have
time,
maybe
I
think
we
have
enough
time
or
does
anyone
else
want
to
take
a
crack
at
this.
A: The regional cluster has, you know, a bunch of node pools, and node pools, you can think of them as the groups of VMs that we deploy to for different services. Each node pool has a default min and max which we use; I don't remember off the top of my head what it is, but you can also override it. You can actually look at the zonal cluster config. So this is the zonal cluster config for us-east1-b.
B: And I think we had the git URL to that in the terraform file as well, just for discovery purposes.
A: So for the default node pool config, which is what is used whenever you add a new node pool, these are the defaults.
A: Ten. So you can see, then, that we've overridden it in some places. Let's take a look for git. We have not overridden it, but I think that was because we were running with much fewer than 30 VMs.
A: So these are the clusters: we have the regional cluster and the zonal clusters. We'll just go to us-east1...
A: No, because it wouldn't go up to 25 if... it looks like it's one to 50, so we already set this to 50. So...
F: Is there any way for people to stubbornly edit this here in the web UI, and then terraform comes back and stomps on them? Yeah.
E: I guess the next time someone applied a change via terraform it would show up as a diff, so you would at least see it.
A: So here's where we set the max node count to 50. We would just increase this in terraform to 100, or whatever we need.
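A minimal sketch of the Terraform shape being edited, assuming the standard google_container_node_pool resource from the Google provider; the resource and pool names are illustrative, not the real config:

```hcl
# Illustrative node pool for the git service in one zonal cluster.
resource "google_container_node_pool" "git" {
  name    = "git-pool"                          # assumed name
  cluster = google_container_cluster.zonal.name # assumed cluster resource

  autoscaling {
    min_node_count = 1
    max_node_count = 50 # the value to bump (e.g. to 100) when the pool is capped
  }
}
```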
A: If you've never seen this part of the Google console, Workloads is also interesting because it allows you to see deployments.
A: So we can look at a deployment; it gives you some basic metrics and monitoring, and it shows you all of the pods.
F: So this whole thing is sort of a Kubernetes web UI, where you can click instead of running CLI commands? Yeah.
A: Yeah, this was the first link I pasted, but I didn't really cover it: this link here is for log events on the cluster. We have an index in elasticsearch for GKE. You can use this to see what's going on on the cluster in general, if you want to rule out a cluster problem. We would see errors here; for example, if there was a problem with scaling the number of pods, we would see that here.
A
You
see
a
lot
of
scary
messages
if
you
look
at
it
right
now,
but
this
is
like
normal
stuff
that
happens
when
we
cycle
pods
and
go
through
a
deploy,
so
nothing
to
be
concerned
about
like
these
503s.
A
You
can
see
like
when,
when
containers
are
created,
another
problem
that
we
sometimes
see
is
like,
if
we're
unable
to
pull
an
image.
For
example,
we
deploy
to
the
cluster,
we
do
an
application
update
and
the
image
doesn't
exist
on
dev.
You
would
see
errors
here.
E
Would
we
have
metrics
for
these
container
pull
failures,
or
would
that
be
in
the
in
the
crash
loop
back
off
thing
in
kubernetes
or
where
we,
where,
basically,
where
would
we
see
these
these
kinds
of
errors.
A
Yeah,
that's
I'm
not
sure,
actually,
whether
we
would
alert
on
that
at
all.
We
would
probably
see.
A
A
Kubernetes
I'll
add.
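If kube-state-metrics is scraped into the monitoring stack, one place such failures would be visible is its container waiting-state series; a hedged sketch, since the call doesn't confirm which of these are collected or alerted on here:

```promql
# Containers stuck waiting on image pulls or crash loops, by cluster and namespace.
sum by (cluster, namespace, reason) (
  kube_pod_container_status_waiting_reason{
    reason=~"ImagePullBackOff|ErrImagePull|CrashLoopBackOff"
  }
)
```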