From YouTube: 2021-06-09 GitLab.com k8s migration EMEA
Description
Demo of API Service resource tuning
B: Kilograms of radio stuff on my shoulders and a four-hour hike, yeah. I can handle it, right.
A: Whatever. So, I wanted to give a quick overview of where we have been with the tuning of the API service. I guess I'll start from the very beginning.
A: The number of API pods was excruciatingly high, considering the number of workers that are running, compared to what was running on our virtual machines. Since the migration we've eliminated the NGINX issue, so we're now running pretty much at our minimum, last time I looked, and we've also reduced our default node counts. API pod counts are still relatively high, but they're less than what's shown in this "before" state.
A: Our latest attempt at this was targeting our canary stage as well as just cluster B in us-east1. The goal here was to compact the number of pods running, so that we better match what our virtual machines were doing. We were running four; this gives me 16 processes, Pumas, and right now we were usually seeing at most two, or excuse me, two times four, which is eight, so usually around eight.
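As a rough sketch of the process arithmetic being discussed here (the four Puma workers per pod and the 16-process target come from the figures quoted above; the pod counts are only illustrative):

```python
# Rough sketch of the process-count arithmetic discussed above: 4 Puma
# workers per pod, and a target of 16 processes per node to match what the
# virtual machines were running. Pod counts below are illustrative.
PUMA_WORKERS_PER_POD = 4
TARGET_PROCESSES_PER_NODE = 16  # roughly what a VM was running

for pods_per_node in (2, 3, 4):
    processes = pods_per_node * PUMA_WORKERS_PER_POD
    print(f"{pods_per_node} pods/node x {PUMA_WORKERS_PER_POD} workers = "
          f"{processes} Puma processes (VM target: {TARGET_PROCESSES_PER_NODE})")
```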
A: Our latest test was trying to shove in four times four, so 16 processes, and we see in the latest change, earlier today, there is one node, and then there were upwards of four where that was primarily deployed. We saw five for a very short period of time. This chart is showing the number of nodes running more than three of the API pods on a given node, and that very rarely changed.
D: A little bit, at least; we saw this at nighttime at low traffic. We saw a big improvement, right? It's just that, yeah, as soon as we get higher traffic, things break down again.
A: Yeah, ignore big spikes like this; this is probably a deploy or something. But, you know, seven nodes were running more than four pods at some point in time, and that pales in comparison to how many nodes we're running. Okay, this is the week view; let's turn that down a little bit. I mean, we're still running 24 nodes, which is, you know, three to five nodes less than our existing clusters.
A: I think what this really boils down to is that the method by which we are trying to schedule resources is significantly different on Kubernetes versus virtual machines. I outlined in a thread down here kind of the same information I'm indicating here, but our saturation is right at 75 percent.
A: If we look at our... which one of these is memory usage? You know, if I could label my tabs, that'd be great. Memory use for Kubernetes: right now we're at most at 30 percent memory saturation for these nodes, which is horribly inefficient.
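To make that imbalance concrete, a minimal sketch using the two figures quoted above (about 75 percent CPU saturation and at most 30 percent memory saturation); the node shape is purely an assumption for illustration:

```python
# The 75% CPU / ~30% memory figures come from the discussion above; the
# node shape (16 vCPU / 64 GB, a compute-optimized machine) is an assumption
# used only to show how much memory sits idle when CPU is the bottleneck.
node_cpu_cores = 16
node_memory_gb = 64
cpu_saturation = 0.75
memory_saturation = 0.30

idle_memory_gb = node_memory_gb * (1 - memory_saturation)
print(f"CPU in use:    {node_cpu_cores * cpu_saturation:.1f} of {node_cpu_cores} cores")
print(f"Memory in use: {node_memory_gb * memory_saturation:.1f} of {node_memory_gb} GB "
      f"({idle_memory_gb:.0f} GB idle per node)")
```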
A: So I'm curious what opinions others may have, because we have other options, such as modifying the number of workers that we're running inside of our pods; right now it's four. But if we change our instance type, we could trade the amount of CPUs we have against memory. Right now we're running... let me close that tab.
A: Four came about because I didn't know how to perform the necessary evaluation to determine how many workers should be put into a pod, and the theory was that we could just, you know, set it to four and we could run four pods. That gives us the same number, right?
E: I mean, it's surprising that it's four and we're still at that thread contention number of, you know, 80 percent, so something doesn't add up there. I just had one other question: do we know the state of pods before, like, can we tell if a pod is pre-accepting-requests, like pre-readiness?
A: Yes, we should be able to, because this timeline is usually pretty sure.
E: One thing that would be kind of interesting is to know what percentage of the fleet at any stage is initializing, because that is very CPU intensive. Anyone who runs the JDK knows that, and, you know, if we've got a hyperactive autoscaler, a lot of that CPU is possibly going on that. Or at least it'd be interesting to gauge what percentage it is.
D: I had a look at how long it takes for a pod to start up at all, and it seems like it takes around 60 seconds from container start to the Ruby workers accepting traffic, and then they start right away with the same latency as later. So they are not really starting slow and then dropping latencies or something.
D: But if you have to scale up a node, then this gets an additional 140 seconds or something like that. So 60 seconds for container startup, and if you put it on a new node, then I think it again takes something like 100 seconds or so.
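A back-of-the-envelope budget for the timings just mentioned (roughly 60 seconds for a container to start accepting traffic, plus on the order of 100 to 140 extra seconds when a new node has to come up first); only the rough figures from the discussion are used:

```python
# Cold-start budget using the rough timings mentioned above.
CONTAINER_START_S = 60        # container start -> Ruby workers accepting traffic
NODE_SCALE_UP_S = (100, 140)  # extra wait when the pod lands on a brand-new node

print(f"Pod on an existing node: ~{CONTAINER_START_S} s until ready")
print(f"Pod waiting on a new node: ~{CONTAINER_START_S + min(NODE_SCALE_UP_S)}"
      f"-{CONTAINER_START_S + max(NODE_SCALE_UP_S)} s until ready")
```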
D: This is based on both: I looked into event logs, I put this in the issue somewhere, event logs showing that the container was starting up, and also, in one case, that a node first needed to be set up to be able to schedule the pod. And then I looked into Kibana, and then I saw, okay, now I see Ruby starting up and a lot of logs about it setting up, then workers becoming ready, and then I see the first request coming in and being handled.
B: Because I was investigating an incident earlier this week, and, I mean, maybe this was specific to what happened in that case, but we saw that basically every pod is out from the balancer for two minutes. So we get 502s from the readiness probe for about two minutes, which is twice the amount you are telling me.
B: I mean, it was very specific to that one, because there was a question about whether we had an outage or something like that, so we tried to... there were a lot of errors, but then we broke it down by pod name, or hostname, I don't remember which. The point is that it was clearly two minutes: everything was out of the readiness probe for two minutes.
D: Let me find what I have written down so I can show you the numbers, because I don't have them all in my head. But it'll take a moment to find in the issue, because it's very long.
D: With a pod just being created, it takes around 56 seconds to be accepting traffic, but in the case of a node needing to be scheduled, it takes over 166 seconds for a pod to become ready. And if we deploy, then we often see heavy node scaling, right, like adding five more nodes, stuff like that, and so we have, I don't know, maybe 15 pods or something like that waiting for nodes first, then being scheduled, and then starting up. So, I don't know, that's where they are at the beginning.
B: So basically that was what we were looking at, because we were looking at errors at the workers' level, I think. I'm trying to find the details, because maybe it's off topic for this, but the point is that we had clearly two minutes of failure on the readiness probe for each one. I'll find it for you, just because maybe it's completely unrelated, so, just...
D: Just a data point, yeah. But coming back to the performance optimizing, I think if we very often scale up and down, then we of course have issues, and that relates to the size of the pods that we have, right? They are very slow to start up anyway, so we don't have good elasticity to respond to spikes, so having more, smaller pods...
D: Maybe then, I don't know, we wouldn't start up nodes as often; maybe it would be nicer to just respond to finer spikes. But, on the other hand, having small pods is very inefficient, so I'm not sure how to best tune this here, but playing around with workers, and maybe choosing different node sizes as was suggested, maybe that could help here.
E: Can I just ask, do we know, for a pod, like the state diagram: once it goes to ready, does it ever go back? Can it ever go back to not ready and then back to ready, or is it just that it's ready and then shut down, kind of thing? Can it switch between ready and not ready?
A: We send readiness probes, and should a readiness probe fail, the pod will be removed from the service, so it shouldn't accept any new traffic but will still process the existing traffic. And the readiness probes for the web service go through Puma.
A: So if something is failing inside of Puma for any reason at all, like its inability to reach the database as an example, we'll start failing that health check. And, of course, if we start the shutdown procedure, we send a SIGTERM, I believe, and that forces us to return, I believe, a 503, which will fail the readiness probe and pull the pod out of rotation.
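A minimal sketch of that behaviour, not GitLab's actual handler: a readiness endpoint that answers 200 while healthy and 503 once the process has received SIGTERM, so the pod is pulled out of rotation while in-flight requests finish. The port is made up.

```python
# Minimal sketch (not GitLab's code) of readiness flipping to 503 after SIGTERM.
import signal
from http.server import BaseHTTPRequestHandler, HTTPServer

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # from now on, readiness reports 503

signal.signal(signal.SIGTERM, handle_sigterm)

class Readiness(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 while healthy, 503 once we are draining
        self.send_response(503 if shutting_down else 200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Readiness).serve_forever()
```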
E: So I'm just quickly looking at the metrics that we've got. I don't know if this is interesting to anyone or just me, but where's my screen here... So this is for the API service. It's not very nice, we can make this better, but it's effectively what percentage of the pods are not ready, as a percentage of all pods, just kind of answering that question from earlier, like how much of the time are we initializing, and you can see it's up.
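One way a long-term metric like that could be pulled out of Prometheus; the metric and label names below (kube-state-metrics' kube_pod_status_ready and kube_pod_info, a pod-name prefix of gitlab-webservice-api) and the Prometheus URL are assumptions, not the actual setup:

```python
# Sketch of the "fraction of API pods not ready" idea as a Prometheus query.
# Metric names, the pod-name prefix and the Prometheus URL are assumptions.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder
QUERY = (
    'sum(kube_pod_status_ready{condition="false", pod=~"gitlab-webservice-api.*"})'
    ' / count(kube_pod_info{pod=~"gitlab-webservice-api.*"})'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print(f"Fraction of API pods not ready: {float(result[0]['value'][1]):.1%}")
```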
E: I'm gonna see if we can start tracking this, so that we can get a long-term metric that we can start optimizing on. Obviously I'd make it prettier than this, but then we can know, like, you know, there are a few things we can improve: if we can improve the startup time, we could improve it; if we could improve that scaler, and, you know, those things. But the other thing that I just wanted to show is one way of looking at how long the startup time is.
E: Sorry, that's not it. This is what I want.
E: If we do this until 14:15, so can you see what the... 14:20, let's just say, and we do one hour or even half an hour... that time there is the Workhorse startup time according to the readiness, because this is the readiness label, but then from 14:07:55 to 14:09:55...
E: What it's trying to do is stop things like slow-loris attacks, where people kind of feed you slowly and then you saturate your entire fleet, and there could be some delays, but I don't think it's nearly as serious as what you have on the wild internet. But yeah, I'm surprised; I would like to check how that works. So, Alessio, are you, like, 100 percent certain that the readiness check proxies to the Workhorse, sorry, to the Rails readiness check, in Workhorse?
B: Running it in development, basically, it's just proxying the requests upstream and just doing something about formatting. So unless there's something here that is doing the magic, this is just going through.
A: I need to interrupt you, because health checking works slightly differently in Kubernetes. Let me share my screen, if you don't mind. Yeah, sure. So, down here, inside the gitlab-workhorse container, the liveness and readiness checks are executing a script called healthcheck, and inside that script all we're doing is a printf of a GET and sending it to localhost. So we're not even pinging it.
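A paraphrase in Python of the check being described, not the actual chart script: write a bare GET to localhost over a raw TCP socket and treat any answer as healthy. The port (8181, assumed here to be Workhorse's listen port) is a guess.

```python
# Paraphrase (not the actual script) of the "printf a GET to localhost" check.
import socket

def tcp_get_check(host: str = "127.0.0.1", port: int = 8181) -> bool:
    """Send a bare HTTP GET over a raw socket; any reply counts as healthy."""
    try:
        with socket.create_connection((host, port), timeout=2) as sock:
            sock.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
            return bool(sock.recv(1))
    except OSError:
        return False

print("healthy" if tcp_get_check() else "unhealthy")
```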
B: I think it's good, it's better than what we have. If you remember, Angie, you were making that merge request in Workhorse for implementing a proper health check, so that Workhorse can have its own checks plus relying on...
E: So why don't we just put that in? Why is that not just coded in? This is a surprising way to do it, right? Like, why doesn't the health check in Workhorse just respond in that way? As in, when Workhorse is ready, it says it's ready, rather than us going to, like, the root. Another thing that's weird about that is that, in normal cases, that's getting proxied through to the gitlab.com root, right?
E: Obviously it just throws an error, or whatever /dev/tcp does, because that's kind of surprising as well, and then eventually, once it starts getting going, every other call of that will be making a full Rails request to the root, because those will get proxied through to Rails and be generating traffic on, you know, the dashboard.
E: But yeah, I mean, maybe the thing to do is just to have Workhorse return "I'm good" pretty much all the time, except maybe... when you... I don't know if Workhorse has a drain mode, does Workhorse drain itself? Because then the only time it would stop being ready is when you, like, SIGKILL it, and then it's like, I'm no longer good.
E: Instead of the funny health check, we can just use the really boring HTTP health check.
E: I thought that health check did the whole check that everything's good, you know, the dangerous type of health check where it's like, okay, I can talk to the database and, like, everything behind me also says it's good. Which, obviously, if you use that kind of health check and you have a hiccup, then, you know, you bring down your application.
B: Workhorse is still upstreaming the requests.
A: I've got you, Andrew, I just linked the merge request that you were asking about a second ago. So we kind of drifted off topic a little bit, but I guess...
A: So, back to the API and experimenting with tuning: I don't think we should move forward with trying to shove more pods onto our current nodes, because we are at our saturation limits. So I guess I'm looking for an answer to two questions. One: should we look for a different instance type? The dangerous part about that is that we're currently using C2s.
A: We don't have the ability to customize those any further, so if we want more CPUs, we also get a lot more memory, which means we're just wasting a lot more money in that situation as well.
A: The other option is to figure out what further tuning we could do with the number of workers, but I don't know how to properly evaluate that in any way, shape or form. So I don't know what to look for, unless I make that change and observe the behavior, but I would love to have more concrete answers, and maybe a person to talk to, to figure out what the behavior might be. Yeah.
E: So we did a very, very similar thing with, you know, the Sidekiq workers, and then when we switched to Puma as well, and, you know, Jarv and myself and Camille were all kind of involved in that, and maybe one of us needs to kind of sit down on it, because a lot of it was just collecting data, you know, matching up the same requests for the same endpoints and then looking at what it looks like, and then just kind of brute-forcing the data.
E: You know, not aggregating across a large number of different requests, but actually choosing, you know, this merge request, maybe a busy merge request: when we have a tuning like this, on average it takes this long to serve that page, and when it's tuned like that, it takes, you know, 15 percent longer, and then you can compare.
D: I think we also maybe should take a step back again and think about what problems we're trying to solve here, because if we try to get our nodes, our node pools, saturated as well as we can, right, then we need to tune in a way that has the HPA try to hit the CPU target. Maybe that brings us closer to node pool CPU saturation, right, which is what we are aiming for, and this necessarily brings us to a state where we have less headroom for spikes, right?
D: So if we have a deployment or certain CPU spikes, we don't have enough headroom to catch them, because we are very slow at scaling up, because Puma just takes at least a minute to be ready, and node scaling takes even longer. So this will always be a problem; we will never be able to saturate our node fleet in a nice way.
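For reference, the trade-off being described follows from the documented HPA scaling rule, desiredReplicas = ceil(currentReplicas x currentMetric / targetMetric); the numbers below are illustrative only:

```python
# The documented Kubernetes HPA scaling rule, with illustrative numbers:
# a higher CPU target means fewer replicas and therefore less headroom.
import math

def desired_replicas(current_replicas: int, current_cpu_pct: float, target_cpu_pct: float) -> int:
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)

current = 30  # illustrative replica count at 75% average CPU utilisation
for target in (60, 70, 80):
    print(f"target {target}% CPU -> {desired_replicas(current, 75, target)} replicas "
          f"(from {current})")
```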
A: We probably waited too long to roll back that change, but we did roll it back. So we know what value of CPU, with the current value of four workers per pod, induces too many problems on that particular pod, drops our Apdex and increases our error ratios, I guess. So I think we know that portion; it's just a matter of what we do next, to figure out how to go backwards in order to move forwards.
D: If we really put four pods on an API node, then we get into bigger problems. The thing was that we still had a lot of headroom, several cores on each API node, to catch traffic spikes, right, because we are allowed to go over requests if needed, and we had this headroom. But this is really bad, because we really suffer from big spikes all the time. So, I don't know, that's really, really bad.
D: What I would like to look into is maybe tuning further with workers and threads, because having more workers and fewer threads probably helps with contention, and we still have a lot of memory headroom anyway on the nodes, so that shouldn't be a big issue.
A: What I'll do moving forward, because switching the fleet type is going to require a little bit more research, and I think we can move faster if we play with worker counts: what I'll do next is revert our testing, because that's still in place, I haven't reverted it yet, and I'll start playing with worker counts. I think I'll just move the worker count up one at a time. You know, we'll have an odd number.
E: You've got three regions, right? Is it hard to change it to different values in different, sorry, different zones? Very easy, very easy. So why don't we have, like, three, four, and five, right, and we leave it like that for at least 24 hours, and then we go through ELK and we find the most common queries, and then we look at how they break down between those zones. And then, you know, we keep going.
E: So just a warning: you don't want to use metrics for this at all, because in metrics you're using histograms, sorry, and the reason is that in the metrics we're using histograms, and the histograms are buckets from, say, 0.1 seconds to 0.5 seconds, and the next bucket is 0.5 seconds to a second, right, and you can't tell, there's no resolution inside of that. So it's kind of like looking through frosted lenses and trying to see what's happening on the other side.
E: Because you just have these very coarse buckets that you're using for latency, and then the bigger buckets, you go from one second to 10 seconds, it's, you know, really, really blurry. So the first thing is: don't use metrics, don't use histograms for this at all; do it in Kibana.
E: What you want to do is try and find a bunch of very common URLs that are kind of chunky, maybe CPU-bound things, like merge request controller widget stuff. You just basically rank by the most common paths, not controllers but actual paths, right, because what you want is, like, a project that is getting called a lot.
E: So the variance in the latency will be very low, because it's always looking at the same resource, right? And then what you can do is break that up in Kibana by zone, or I think it's a region label, so I'll just call it region, and then take a look at the, sorry, the p50, the p75, the p95 for those, and you'll see that there'll be differences between them, and then we can optimize on that.
E: Does that make sense? And the reason why Kibana doesn't suffer from that is because, when you say the 95th percentile in Kibana, it's literally taking buckets of 100 and choosing the top five, right? So it's accurate, whereas the histograms are very much an estimate in the metrics.
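A small worked example of that point; the request durations are made up, and the bucket boundaries are only an assumption about what a typical latency histogram looks like:

```python
# Coarse histogram buckets vs. exact percentiles from raw samples (made-up data).
import random

random.seed(1)
samples = [random.uniform(0.2, 3.0) for _ in range(1000)]  # fake request durations (s)

# Exact p95: sort the raw durations and take the value 95% of the way through.
exact_p95 = sorted(samples)[int(0.95 * len(samples)) - 1]

# Histogram view: cumulative buckets with assumed boundaries.
buckets = [0.1, 0.5, 1.0, 2.5, 10.0]
counts = [sum(1 for s in samples if s <= b) for b in buckets]
p95_bucket = next(b for b, c in zip(buckets, counts) if c >= 0.95 * len(samples))

print(f"exact p95 from raw samples: {exact_p95:.2f} s")
print(f"the histogram can only say: p95 is somewhere below the {p95_bucket} s boundary")
```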
A: Probably, since that's the last one we deployed to anyway. Like, on clusters B and C we could set the workers to differing numbers, like five and six respectively, and see what happens, and then we could observe our CPU and memory utilization. I'll have to generate a fancy chart in Kibana to make sure that we're not negatively impacting things, or rather, we have the ability to compare the latencies of various routes inside of our logs.
A: So maybe what I'll do is modify the workers first, because I don't know what that's going to do just yet, but seeing what that does, I'll be able to better understand what CPU utilization is going to do. Then I can modify the CPU request appropriately, and I could probably do some sort of fancy math to do the same thing for another bump; that way I could do the worker and the CPU bump at the same time for the other cluster.
D: And by adding more threads we shouldn't see as much of an increase in memory, I would say, but we definitely need to also raise the requests to be able to guarantee enough for our pods there, and that will again influence how many pods fit onto nodes. So we need to also fine-tune and think about how we squeeze pods onto nodes then, but I think with three pods on a node that should work in most cases, because then we still have a lot of headroom on CPU and memory.
A: Yeah, I think if we could figure out a way to get ourselves down to, like Jared mentioned last week, if we get ourselves within a certain amount of the similar usage we had with our VMs, I think we did a pretty good job.
E: What do you think, as a guess? Where do you think the right number of nodes would be, like one and a half?
A: I would love to be running somewhere within 20 percent of the original 36 nodes we had, because we were under-provisioned, right? So if we could, you know, bump up that amount and still have the autoscaling available to us, so we could turn down during the weekends and nighttime, that'd be great.
A: We never really tuned Sidekiq very well in any way, shape or form; we didn't really do any research, we didn't sit here and concentrate as much as we are today with the API. So I think whatever learnings we glean from the API service, we could take to Sidekiq and really do some cost savings there as well. It's just...
D: You have to remember that at nighttime, at low-traffic times, we can automatically reduce the number of nodes drastically, even much more than we do now. I mean, we already hit the limit that we set ourselves, at 30 nodes or something, but we could go much lower. So we would save costs at least at low-traffic times, while we spend more at high-traffic times, but maybe that evens out a little bit.
C: A final bit on this, kind of, where we want to get to: I think the other bit that would be useful, it doesn't have to be an exact resource match, but it'd be good just to understand why it's different. So we already know we were under-provisioned; if there's another good reason why these things are going to behave differently, great, like, if we know that, we can just log that, right, and come back to this stuff later.
A: I think it might be worth the time and effort to write an interesting blog post about what we did to get here, because this is also a topic that I see often in the Kubernetes space, like, how do I tune these requests and such, and the only thing you find on the Googles is how to do the tuning, not how to observe the behavior of your application to decide what to set, how to set it, and get the result that you're looking for. So that could be an excellent blog post.
A: That would be a good end goal to have: to consolidate that massive thing into a blog post of some kind.
C: So should we move on to the next item, for retro?
A: Oh yeah, so, retro: I'm just looking for any further feedback from anyone. If no one chimes in on this by the end of my day today, I'm going to close it. I've created a few issues related to the stuff that we found, and I've started to go through the web epic that, Amy, you have stated is now ready to go, and I've started to populate some of the issues with a little bit more information to make sure that we don't forget certain things.
A: So hopefully we're in an okay state. I think we are; it's just a matter of finishing up writing all the other issues that we need to create. Awesome, thanks for doing that. And of course, you already saw the language changes to both our readiness templates as well as our delivery team epic language, so that we hopefully do a better job with certain things as well.
C: Great, sounds good. Henry, was there anything you wanted to go through on observability, or Andrew?
D: So the issue is that right now we don't have CPU saturation metrics for services in Kubernetes which don't have CPU limits, because our current kube CPU saturation is based on the limit that we set, and for things like the API we didn't set limits.
D: So what I'm currently working on is adding another saturation metric for CPU which is based on requests, because this is helpful anyway: the request is what we estimate we need to guarantee to a container, and if we come close to the request with CPU usage, then I think we need to be aware of that. But we can potentially go over requests, which is not nice for a saturation metric.
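A sketch of that request-based saturation, assuming a CPU request of 2600m (a figure that comes up later in this discussion); the usage numbers are invented, and the only point is that the ratio can exceed 100 percent because pods may burst over their request:

```python
# Request-based CPU saturation: usage divided by the request, not by a limit.
def cpu_saturation_vs_request(usage_millicores: float, request_millicores: float) -> float:
    return usage_millicores / request_millicores

for usage in (1800, 2600, 3200):  # invented usage values
    ratio = cpu_saturation_vs_request(usage, request_millicores=2600)
    print(f"usage {usage}m against a 2600m request -> {ratio:.0%} saturation")
```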
E: I was going to say, oh, do we treat 100 percent as the request?
D: It might be wise at least not to alert on it, but to see that we've reached a point where we miscalculated how much we need to guarantee to certain services, right. Right, if...
A: So if we set our saturation level, if we look at that saturation level today, we're at roughly 80 percent, because that's what the HPA does, and it's never going to change, it's always going to sit that way. But technically we're well under-utilizing Sidekiq as a whole, and our nodes are going to react accordingly because of our HPA and how it scales. So I don't...
D: There are the Workhorse and web service containers in the API pod, right? So we tune the HPA for the web service container to reach, I don't know, 2600 milli-CPUs right now, and this will of course also change how much traffic gets sent to Workhorse. So on Workhorse we will see CPU go up or down, but we are not setting any target for Workhorse, right? So we just don't watch it, and if we don't watch it, then we don't see what is happening with Workhorse.
D: Say we change the CPU target for the web service; I would like to see some kind of metric that shows me, oh, I see Workhorse getting close to its requests unexpectedly because we fiddled with the web service container. That would be a helpful metric, but it's hard to express as a saturation, because we allow it, you know, to go over the limit there, which is not nice. Yeah, we could...
D: ...do it, but this is super hard, because you need to figure out how many web service pods are on this node and how much capacity we have left, and then calculate, out of this, for each of the pods, you know, if there are three pods, then they all share a third of this leftover capacity of the node. I don't know if you can calculate this and believe it.
E: But surely the ultimate thing that would happen in Prometheus is at the node level. So we can figure out what the total requests are, and then we can also say, look, the node is at, like, 95 percent of all CPU gone, and in total it should have been at, say, 16, so it's badly packed, it's at, like, 66 percent, and we know we're actually at 100 percent, then, yeah. Maybe we just need the node metrics; maybe the node metrics are the right thing to use there.
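A sketch of that node-level view; all numbers are illustrative (an assumed 16-core node and three web service pods with 2600m requests), the point being the gap between requested, used, and allocatable CPU:

```python
# Node-level packing view: requested vs. actually used vs. allocatable CPU.
node_allocatable_m = 16000           # assumed 16-core node, in millicores
pod_requests_m = [2600, 2600, 2600]  # three webservice pods (illustrative)
pod_usage_m = [3100, 2900, 3300]     # actual usage, bursting over the requests

requested = sum(pod_requests_m)
used = sum(pod_usage_m)
print(f"requested: {requested}m ({requested / node_allocatable_m:.0%} of the node)")
print(f"used:      {used}m ({used / node_allocatable_m:.0%} of the node)")
print(f"headroom left on the node: {node_allocatable_m - used}m")
```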
A: Yeah, there was a blog post that Cochi had found where it was discovered that with CPU limits it wasn't just a throttle: say 4000 millicores was your limit; if you ran over 4000, you weren't just limited to 4000, the entire process would greatly slow down to squeeze itself under that 4000 limit, and that had a lot of negative implications. I don't know if anything's changed related to that, but that's kind of why we're not setting CPU limits at the moment. Right, okay.
D: Rather, I'm trying to use requests as a limit and then make a saturation out of that, and even if we then sometimes go over it, we have a saturation which goes to 120 percent, which makes the graph look a little bit strange, but you still get some value out of that, right?
E: Well, actually, the way that saturation metrics are designed is that they always get wrapped in a clamp_max of one, so no matter where it goes to, it'll always flat-line at that point, just because otherwise it messes up all the graphs and all the downstream data; it's designed to always be between zero and one. So if it's at a hundred and twenty percent, it shows as a hundred percent.
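The clamping behaviour described there, shown in a couple of lines of Python for illustration:

```python
# A saturation reading is capped at 1.0, so a computed 120% flat-lines at 100%.
def clamp_max(value: float, ceiling: float = 1.0) -> float:
    return min(value, ceiling)

for raw in (0.6, 0.95, 1.2):
    print(f"raw {raw:.0%} -> reported {clamp_max(raw):.0%}")
```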
E: The other thing you could try with that, Henry, is, if you do have little spikes that spike up, instead of using, like, a rate over five minutes, you use a rate over, like, an hour or something like that, like we do with this; there's one Sidekiq saturation metric that we do that with, and so it kind of climbs up very slowly. But if it's kind of pinned there all the time, then you get that; if you just have brief spikes, then they just kind of round out.
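A quick illustration of why the longer window helps; the series is made-up data with one brief spike:

```python
# Same spiky series averaged over a short vs. a long window: brief spikes
# mostly disappear in the long window, while a value pinned high would not.
series = [0.3] * 50 + [1.2] * 3 + [0.3] * 50  # made-up data, one brief spike

def windowed_avg(data, window):
    return [sum(data[max(0, i - window + 1): i + 1]) / min(window, i + 1)
            for i in range(len(data))]

print(f"peak with a short (5-sample) window: {max(windowed_avg(series, 5)):.2f}")
print(f"peak with a long (60-sample) window: {max(windowed_avg(series, 60)):.2f}")
```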
C: Awesome, sounds good. So is there anything else we need to go through today?
C: Awesome, okay! Well, thanks so much for the demos and discussion, and yeah, good luck with the tuning and the saturation metrics. All right, take care, everyone, bye.