From YouTube: Firedrill Git Service has saturated its HPA
Description
Members of the infrastructure team review how to determine whether the Horizontal Pod Autoscaler has reached its capacity limit, and what to look for when it has.
A: Hey, I just did the initial run of customer staging on the new PoC environment, and it got all the way down to trying to pull the code in. Wow. So that was...
C: All right, so since we have at least a few people here, I'll go ahead and get started. So, as noted in... well, first off, announcements: Thanos is ready. Enjoy, have fun. Okay, so, the agenda: fire drill. Let's do it. There's a Google Doc linked in the issue.
C: So this one's a fun one, because we're in the middle of trying to improve this process. So in the meantime, let's try to poke holes in our current process.
B: No, I would... yeah, it would be interesting to see if there were flat lines anywhere. That would be... You are right. For our joiners: we are starting to walk through the fire drill, and I got volunteered to start blindly looking through dashboards to answer the question of: if we had hit the saturation point, what would we see? It's interesting; I would note that this quota seems to be blank, and that would be a potentially useful thing.
C: Well, I'll give you a hint: no, it's not okay. What that particular table is supposed to show is the CPU-specific quota resources. So for each pod, there'll be a listing of: hey, this pod is at 50% of its CPU resource allocation, for example. No.
C: So far, I'm not sure you are on the right track. The active ReplicaSet is currently the only thing we have in our dashboards that says: this is the count of pods currently running in this environment. In this particular case, you're concentrating your efforts on us-east1-b, so only one of our three clusters, which is not the best desired state, but this does give you our pod counts.
B: Right, I'm going to put on my really dumb manager hat and say: I know I should be looking at the GKE stuff in the GitLab on ops for our config there. I see Craig raising his hand, so maybe you want to...
B: Yeah, right now I'm on GitLab chat, but we would... I mean, we'd want to filter on the right service here, where I thought that one would be.
B: Yeah, so we'd want to highlight that guy, and if this again was the ceiling, then the 90 to 100% range would probably not be good, indeed. But thank you, Craig, for pointing that out. Then the other question, I think you were looking for, Scarborough: we could also look at the configuration in the GitLab deployments config, yep, and find the actual HPA config there, which I do not know off the top of my head. I'm going to go ahead and reveal that I thought I was going around pretty thoroughly to try to find that.
C: I have two questions regarding that. If we are at 100% saturation (this is a very easy question, but kind of dumb), what technically does that mean?
D: I would think that would be one cluster, like you'd have to check each zonal cluster independently. I didn't notice on that dashboard if it was... that dashboard is not broken down by cluster.
E: It depends on... well, at the moment, yes. I'm just thinking that the second we're talking about zonal-scoped clusters, one of them being saturated would actually be a problem. But that's so independent; don't get too distracted, right.
A: I'm going to say: if your load distribution is equitable and working like it should, and one of your clusters is at 100%, the other ones are going to be very close, if not at 100% also. Now, that's not to say there aren't corner cases where that could not be the case, but that's the ideal, I think. If it's not, then the problem isn't necessarily running out of space; it's: why are we either accumulating traffic in one zone, or routing traffic to favor one over the others?
C: So, as part of... let's continue the exercise. Dave's shied away from doing this, but where do we want to go if we want to bump up our maximum allowed pods?
C: That's precisely where we are. So, Craig, I don't know if you want to share your screen, just so we could do a quick overview as to what values are actually configured, because, as Dave quickly showed, we see a lot of values repeated a few times, so I think it might be worth just discussing. At least... yeah, thank you. I just wanted to try.
C: So what Craig is showing is that we've got our HPA configuration for each of these. The minReplicas is obviously the minimum number of replicas that we want to be running, and the max is the maximum number of pods we want to run. It's important to note that what you see for these values is per cluster.
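The shape of the configuration being described is the standard Kubernetes HorizontalPodAutoscaler spec. A minimal sketch, with an illustrative target and made-up replica counts rather than the actual production values:

```yaml
# Hypothetical HPA manifest; minReplicas/maxReplicas are the fields
# discussed above, and each cluster applies its own copy of these values.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-webservice        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-webservice
  minReplicas: 2                 # fewest pods kept running, per cluster
  maxReplicas: 10                # the ceiling; "saturated" means the HPA
                                 # is pinned at this count
  targetCPUUtilizationPercentage: 75
```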
C: So if we want to change this, we need to be aware of a couple of things. One is that the change is going to take place across all clusters, and then we also need to make sure we have enough resources available to us. Right above this, you see where we set our resource limits and requests.
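For context, a requests/limits block in a pod spec looks roughly like the following; the numbers are invented for illustration (only the four-gigabyte figure echoes the per-pod memory mentioned later in the call):

```yaml
# Hypothetical container resources block; the real values live in the
# deployment configuration being shown on screen.
resources:
  requests:
    cpu: "1"       # what the scheduler reserves per pod on a node
    memory: 4Gi
  limits:
    cpu: "2"       # hard cap enforced at runtime
    memory: 4Gi
```

Raising maxReplicas without checking these numbers against node capacity risks pods that the scheduler can never place.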
C: It would be wise to make sure that the node capacity we are using for this workload is sufficient to hold the number of pods that we need to run. So, Craig, I don't know if you want to delve into this, but if you want to pull up our Terraform repo, we could then look at what kind of nodes we're running, what size they are, and guesstimate how many pods we should be able to fit on these nodes. If you'd like to... sure.
C: Web service. All right, check out either the... I don't know what file you have open.
C: Just do a search for web service only, or, excuse me, git, because it's...
E: Web service was good. There we go.
E: So in Terraform we have, I'm assuming, in the variables for gprd... okay, git https. We are running on custom 16-20480s, which is 16 CPU... yeah, 16 CPU and 20 gigs of RAM.
E: But I don't know how many of them we have; that's just a machine type. Where do we define the node count for...
E: So there we go: a max node count of 50.
E: So we have 15 up to 50 nodes per zone, and each of those has 20 gigs of RAM. 20 gigs... gigs? Sorry, yes, gigs. So it's a thousand gigs of RAM per zone, and we allow four gigs per machine... Sidekiq? Oh, that's websockets.
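As a back-of-the-envelope check on the figures just read out: 50 nodes per zone at 20 GiB each gives 1,000 GiB of RAM per zone; if the four-gig figure is the per-pod memory request, that works out to on the order of 250 pods per zone at the ceiling, before accounting for system pods and other overhead.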
E: We don't have a request default level for the web service...
E: This we'd have to go look at in the actual deployed configuration.
C: It's just more like we look for... it's a node selector. Okay, so in this particular case I'm looking at Sidekiq, just because I have it on my screen. We have a value called nodeSelector: we look for a key of type, and we look for a value called sidekiq in this particular case, and we should see...
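The selector being described is a plain nodeSelector on the pod spec. A minimal sketch, using the key and value quoted above (the surrounding structure is standard Kubernetes, not copied from the actual config):

```yaml
# Hypothetical pod spec fragment: the scheduler will only place these
# pods on nodes labeled type=sidekiq, i.e. the dedicated node pool.
spec:
  nodeSelector:
    type: sidekiq
```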
C: Which is, you know, not the most efficient way of running Kubernetes at the moment, but in order to minimize the amount of change that we're introducing as we migrate things over, we're just creating new node pools to avoid the noisy-neighbor effect of our own workloads. I want to revisit that in the future, but that's a future task. I'm trying to finish up an OKR before I get to that process.
A: It tries to quantify that stuff into a common measurement between your available resources and what you're going to run, and compares them and says: hey, it looks like you're going to be over. Sometimes it's fine, sometimes it's not, but breaking things down into these common... you know, for VMware it is very simple to break it down to memory and CPU and compare them, and it sounds like we're doing this all by hand.
B: Well, and I thought... I mean, the other thing would be: it's been a while since I interacted with vCenter and vSphere, but I thought there were warnings on API interactions, and particularly in the UI, anytime you did stuff in the UI. But let's not talk about Kubernetes UIs for now. An API interaction would throw a warning back at you and say you're overprovisioned, or you could configure the API to do so. That's what I thought VMware had for that kind of stuff.
D: Nodes are... that's something more in the realm of Helm, because Helm does make some attempt to render its values files and... right, I mean, correct me if I'm wrong here and this is garbage, but Helm tries to do a pre and post, and generates a diff of current state versus new state, so Helm would be best positioned to maybe try to start teasing out that intelligence.
C: Yeah, to explain a little bit further: prior to us performing a deployment, we run a diff, and we validate that the changes we expect to occur show up inside of that diff as part of our merge review procedure for auto-deploys.
C: We have this little checky thing that says: hey, give me all the diffs, and if we find a configuration change that was not meant to be there, we bail out of the auto-deployment, just to make sure that we're not mixing a config deploy with an auto-deploy. To go into the details of what Cameron is trying to get towards: you were correct, Craig, in that Helm would probably be best positioned to start trying to determine whether or not we may potentially be overprovisioning ourselves.
C: It sounds like, like Dave said, we probably need some sort of tool that is able to coalesce all that information together. So maybe this would be a great improvement to the services thing that we've created. Do we still use that? I know we use it for creating dashboards and alerts, but I don't know if we still have our UI in front of it.
C: All the pieces are there. Yeah, kubectl will tell you when things are wrong: we have the event log, and if a pod is unable to be scheduled, we get the reason, and it could be that a node is out of a specific resource. It'll tell you that we've got 30 nodes available and we can't match any because, say, CPU is out of... or out of CPU availability, rather. So, kubectl.
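For illustration, the kind of scheduler event being referred to looks roughly like this when pulled as YAML (the pod name and node count are invented; the message format follows kube-scheduler's standard FailedScheduling wording):

```yaml
# Hypothetical output fragment from `kubectl get events -o yaml`,
# trimmed to the fields that matter for this diagnosis.
apiVersion: v1
kind: Event
type: Warning
reason: FailedScheduling
involvedObject:
  kind: Pod
  name: gitlab-sidekiq-abc123    # illustrative pod name
message: "0/30 nodes are available: 30 Insufficient cpu."
```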
C: I think the ultimate goal here would be: we set up the necessary alerts, so we know that we're getting saturated beforehand, and we know how to tackle that prior to it becoming a problem.
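A sketch of what such an alert could look like, assuming kube-state-metrics is exporting HPA metrics; the metric names follow kube-state-metrics conventions, while the threshold, duration, and labels are invented:

```yaml
# Hypothetical Prometheus alerting rule: fire while an HPA is running
# at (or near) its configured maximum, i.e. before hard saturation.
groups:
  - name: hpa-saturation
    rules:
      - alert: HPANearMaxReplicas
        expr: kube_hpa_status_current_replicas / kube_hpa_spec_max_replicas > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.hpa }} is above 90% of its max replicas"
```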
C: I understand we blew through that fire drill relatively quickly, so I think, for the remaining time, we should just ask questions and keep whatever conversation we want going. I don't have anything else that I want to cover specifically right now.
D: For the purposes of focus and brevity, should we cut the recording?