From YouTube: Redis rate limiting in k8s retrospective/walkthrough
Description
Walkthrough and retrospective of the work that the Scalability:Projections team did to migrate redis rate limiting from running on VMs to running in kubernetes.
A
Yeah, okay, so just a very brief recap of what we're going to do, for future us and anyone else who is interested: we're going to walk through the work that we did to migrate Redis rate limiting from Redis Sentinel on VMs to Kubernetes, walk through the issues that we hit, and generally discuss. So, yeah.
A
Do you want to start, Bob, or do you want me to start, or... go ahead? Yeah.
B
So the way I think I set up this project at the beginning was to just start with the difference.
C
So the way we tackled that was: we'll do Pre first, then we'll do staging, then we'll do production. Staging is the closest thing that we have that would resemble production in the way of doing things. Because we already had an instance to start from in Pre, we just needed to tweak it a little bit to not be a mixed Redis instance. By a mixed Redis instance I mean some Sentinels on VMs and other Sentinels in Kubernetes, with a replica that we'd fail over.
C
That
was
the
initial
plan
we
had,
but
we
decided
against
that
because
we
didn't
need
it
because
we
didn't
really
care
about
migrating.
The
data
that
only
lives
for
a
few
minutes
and
just
yeah
do
a
clean
switch
in
the
configuration
and
go
that
way.
So
the
first
thing
we
did
was
create
these.
C
These
pre
issues
here
so
update
the
red
escalator
begins
to
be
Standalone
and
then
start
using
it
from
the
from
the
configuration
where
we
did
that
we
got
to
see
a
bunch
of
things
regarding
or
existing
metrics.
So
that's
the
observability
issue
here.
C
Lots
of
things
were
like
not
not
working
because
they
were
relying
on
an
fqdn
label
to
be
present,
so
we
need
to
update
that
to
use
instance,
the
instance
label
instead,
which
is
based
on
IP,
address
Port
name
rather
than
domain
name
the
we
were
doing
that
in
parallel.
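For illustration, the kind of query change involved might look like the sketch below; the metric name and label values are placeholders, not taken from the actual dashboards discussed here.

```python
# Illustrative only: placeholder metric name and label values, assuming a
# standard redis_exporter target scraped by Prometheus.

# On the VMs the series carried a custom fqdn label, so selectors looked like:
OLD_SELECTOR = 'redis_connected_clients{fqdn="redis-ratelimiting-01.example.internal"}'

# In Kubernetes there is no stable FQDN on the target, so selectors switch to
# the instance label Prometheus attaches automatically (scrape target IP:port):
NEW_SELECTOR = 'redis_connected_clients{instance="10.0.12.34:9121"}'
```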
C
In
parallel,
while
we
were
spinning
up
the
well
we're
doing
the
same
work
for
staging
for
staging
spinning
up
a
new
cupboard
like
that,
we
need
to
start
with
a
brand
new
redis
instance.
Spinning
up
a
new
instance
meant
creating
the
designated
note
pool
in
terraform.
Stephanie
are
probably
going
to
be
better.
C
So
if
I
forget
something
interrupt,
spitting
up
the
the
new
now
preparing
the
new
note
pool
and
terraform,
adding
the
new
redis
instance
to
to
a
tanka
deployment
setup
everything
there
was
already
prepared
by
the
Frameworks
group
who
was
working
on
them
container
registry
cache
instance.
So
we
just
used
their
stuff
and
tweaked
it
where
we
needed
it
like
where
we
needed
slightly
different
configuration
and
so
on.
C
So
after
we
done
that,
we
could
just
switch
over
to
use
the
new
redis
instance
in
staging
and
there
we
noticed
some
problems
because
well
nobody
we
couldn't
get
this
right
from
the
first
time.
The
first
problem
we
we
know
this
was
a
different
database
name
and
different
secrets.
So
then
we
needed
to
correct
the
the
database
name
for
Reddit
Sentinel
to
be
this
magic
string,
my
master,
which
is
the
default
database
name
for
redis,
and
then
there
was
the
the
thing
with
Secrets
being
stored
in
gkms.
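For reference, a minimal redis-py sketch of where that master name is used; the hostnames are placeholders, and this is not the application's actual client code.

```python
# Minimal sketch, assuming a Sentinel reachable on the default port 26379.
from redis.sentinel import Sentinel

sentinel = Sentinel([("sentinel.example.internal", 26379)], socket_timeout=0.5)

# "mymaster" is the stock master group name in redis-sentinel.conf, which is
# the value the rate limiting configuration had to be corrected to use.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
master.ping()  # resolves the current master via Sentinel and talks to it
```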
A
There was also the issue here that we had not originally set up External Secrets correctly to start with. There's an External Secrets object in Kubernetes that was not set up correctly.
A
There
was
an
external
Secrets
object
that
one
is
slightly
farther
above,
where
it
says
rotate
British
rate
limited
secret
on
staging,
and
there
was
a
separate
Mr
that
we
had
to
create
for
that
that
once
we
did
that
it
was
there.
There
was
also
also.
A
Oh
no
I
haven't
even
touched.
The
load
balancer
chip
I,
was
also
noting
that
when
we
created
staging,
we
also
forgot
that
we
had
to
create
a
secret
in
Vault,
and
there
was
all
the
back
and
forth
of
where
these
secrets
lived,
because
they
moved
in
between
the
first
setup
of
these
things
in
kubernetes
and
this
one
so
like
for
any
future.
People
make
sure
you
have
all
of
your
secrets
in
place
before
you
do
this.
Otherwise
it
doesn't
work
and.
C
If
I
got
it
right,
we've
got
the
correct
helpers
now
in
the
tanka
deployments,
like
it's
a
helper
method
that
you
can
just
call.
This
is
what
the
secret
called
This
is
the
environment.
This
is
the
namespace
and
then
it
will
build
the
string
to
get
the
secret
from
the
right
place,
but
you
still
have
to
manually
add
it
to
Vault.
So.
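As a rough illustration only: the real helper is a function in the Tanka (Jsonnet) codebase, and the path layout below is an assumption, but the idea is roughly this.

```python
# Hypothetical sketch of what such a helper does; the actual Vault path
# layout is an assumption and may differ.
def vault_secret_path(secret_name: str, environment: str, namespace: str) -> str:
    """Build the path a workload uses to look up its secret in Vault."""
    return f"k8s/{environment}/{namespace}/{secret_name}"

# e.g. vault_secret_path("rate-limiting-redis", "gstg", "redis")
#   -> "k8s/gstg/redis/rate-limiting-redis"
# The secret still has to be written to that path in Vault by hand.
```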
C
So when we did the actual config change, we weren't super careful with staging, because, well, that's what staging is for, so we broke it and had to revert it when we made the configuration change in staging. First we broke it because we forgot about the secrets; we fixed that, and then we could access Redis from a console.
C
But
then,
when
we
proceeded
to
roll
out
to
everywhere
we
were
getting
500
and
that
was
because
the
clients,
so
the
rails
application
couldn't
access
the
the
redis
instance,
because
the
load
balancers
didn't
allow
them
to.
A
Sure. The way that we had set this up was that all of these are running load balancers within Kubernetes, and one of the issues that we ran into was that we have moved some of the things that we're using outside of the same region these Kubernetes clusters are in, specifically things like console, which has moved to a different region for DR purposes, and all of our Kubernetes load balancers were not set up to allow...
A
At
essentially
Global
access,
I'm
trying
to
actually
go
and
find
the
issue,
but
it
doesn't
really
matter
so
what
we
have
done
is
we
spent
some
time
trying
to
debug
exactly
there.
You
go
what
it
is
and
then
we
set
it
up
so
that
it
was
actually
Now
using
The
annotation
for
Global
access,
which
is.
A
Essentially
saying
that
these
we
can
access
these
load
balancers
from
anywhere
within
the
same
VPC
as
part
of
this
work,
we've
actually
made
that
a
default
across
reliability
as
well,
so
hopefully
nobody
else
loses
a
full
day
and
multiple
brain
cells
trying
to
figure
out
what
was
going
on
here.
So.
A
Yes,
it
is
it's
a
safe
default
and
it's
also
Now
the
default
for
redis
as
well.
We
we
made
that
the
default
as
part
of
rolling
this
out.
A
So
hopefully
we
have
saved
this
for
the
future,
but
yeah
it's
the
actual.
Annotation
is
internal
load.
Balancer
allow
Global
access.
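For reference, that annotation goes on the internal LoadBalancer Service. Sketched below as a plain dict with placeholder names; the annotation key is GKE's documented one, everything else is illustrative.

```python
# Sketch only: the Service name and namespace are placeholders.
service_metadata = {
    "name": "redis-ratelimiting",  # placeholder
    "namespace": "redis",          # placeholder
    "annotations": {
        # Lets clients anywhere in the same VPC (for example console nodes in
        # another region) reach the internal load balancer, instead of only
        # clients in the load balancer's own region.
        "networking.gke.io/internal-load-balancer-allow-global-access": "true",
    },
}
```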
C
Cool
so
there
we
had
it
running
in
staging
and
it
was
time
to
prepare
for
production.
We
prepared
a
Readiness
review,
that's
like
a
markdown
document
in
the
Readiness
review
project
and
we
we
could
reuse
a
lot
of
information
that
was
already
collected
for
the
the
wreckage
the
registry
cash
instance,
but
yeah
some
things
were
alone
our
own,
because
this
instance
already
had
quite
a
lot
of
traffic.
So
slight
differences
there
when
that
was
approved,
Stephanie
and
I
started
to
prepare
the
change.
C
So
the
change
issue
with
different
steps
to
roll
it
out
or
nigger's
recommendation.
We
step
back
a
little
bit
and
we
went
from
just
doing
Canary
everything
to
Canary
one
region
at
a
time
for
one
zone
at
a
time.
I,
don't
remember
one
zone
at
the
time.
It.
C
Yeah,
so
when
we
were
doing
that,
yes,
the.
C
So
during
the
rollout
we
did
Canary
and
then
let
it
sit
for
a
day
to
just
see
the
thing
with
traffic
the
connect
like
during
this
this
day,
we
would
effectively
effectively
have
a
split
plane,
split
brain
for
rate,
limiting,
because
traffic
going
to
True
Canary
would
not
count
towards
the
rate
limiters.
C
The
rate
limits,
that's
otherwise
counted
in
the
redness
instance
in
VMS,
but
we
decided
that
that
was
acceptable,
but
we
couldn't
have
that
for
more
than
a
day
with
more
traffic
than
just
Canary,
so
our
plan
for
rolling
out
was
doing
a
single
zone
first
and
then
immediately
proceeding
to
the
entire
the
entire
fleet.
C
That's when we decided: let's not do everything right now, let's do one extra zone before we do the whole thing, which is what we did. And then the three of us, Stephanie, Igor and myself, made the call that we would not proceed, because we were already close to 60% CPU utilization with two zones, and that's about as high as the current CPU utilization gets at peak time on the VMs. So we decided to stop there and investigate further what we do.
A
Yeah
and
just
another
quick
note,
the
difference
in
CPU
between
one
Zone
plus
Canary
and
two
zones
plus
Canary,
was
about
20.
So
we
were
running
at
like
40-ish
percent
with
one
zone
in
the
canary
and
then
when
we
went
to
two
zones
we
were
at
60.
The
logic
was
that
we
would
probably
be
close
to
80
with
three,
and
that
gave
us
very
little
Headroom,
especially
since
this
is
the
fastest
CPU
that
we
have,
that
gcp
has
and
redis
is
single
threaded.
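The back-of-the-envelope projection, assuming each additional zone adds roughly the same CPU increment as the previous one did:

```python
canary_plus_one_zone = 40   # percent CPU, observed
canary_plus_two_zones = 60  # percent CPU, observed

per_zone_increment = canary_plus_two_zones - canary_plus_one_zone    # ~20 points
projected_three_zones = canary_plus_two_zones + per_zone_increment   # ~80 percent
```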
C
Yeah
so
staying
there
would
mean
We've
effectively
moved
the
saturation
the
such
the
moment
of
saturation
closer,
and
we
don't
have
a
horizontal
scalable
horizontally
scalable
solution
for
that.
Yet
if
we
did,
then
this
would
have
been
just
a
case
of
adding
some
more
redis
and
then
we
could
continue.
But
that's
what
we're
going
to
be
working
on
next.
A
So
maybe
I
should
walk
through
briefly
the
CPU
increase
investigation,
which
is
long
so
after
we
got
through
this.
A
You
know
we
rolled
back
everything
in
production
and
left
us
with
this
running
and
staging
and
in
pre-production,
using
the
VMS
and
Matt
and
I
got
on
a
truly
epic
like
four
hour
long
Zoom,
where
we
walked
through
a
lot
of
flame
graphs
and
essentially
came
up
with
you
know.
Igor
during
the
actual
rollout
had
offered
a
thought
that
perhaps
this
was
related
to
redis
networking
and
things
of
that
nature.
A
Matt
and
I
fairly
conclusively
proved
that
that
was
the
case
in
that
someone
I
think
it
was
Philippe.
Actually
yep
also
confirmed
this
we're
using
Calico
in
our
kubernetes
and
because
of
its
hybrid
approach
to
control
traffic,
which
is
you
know,
iptables
and
a
bunch
of
other
things.
There
was
approximately
a
30
percent
increase
in
CPU
time
for
any
anything
that
was
yeah
like
25
to
30
during
incoming
packet
processing
and
packet
processing
as
a
whole.
This
explains,
the
you
know,
explains
the
general
overhead
of
CPU.
A
It's
also,
unfortunately,
something
that
often
comes
whenever
you
add
more
layers
of
abstraction
to
a
system.
They
often
don't
come
free.
In
this
case,
this
was
a
much
higher
hit
than
I
think
anyone
expected
and
the
reason
that
we
discovered
this
during
rate
limiting
and
not
during
some
of
our
previous
work.
Is
that
rate
limiting
is
very
connection.
Heavy
I,
don't
know
if
it
is
the.
C
Short
fast
calls
like
very
many
very
many
calls
that
yeah,
like
every
request,
makes
one.
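To make that traffic pattern concrete, here is a minimal fixed-window counter sketch; it is illustrative only, not the application's actual rate limiting code, and the host, key layout and limits are placeholders. It shows why rate limiting means one short Redis round trip on essentially every request, with counters that only live briefly.

```python
import time
import redis

# Placeholder host; the real instance is reached through the internal load balancer.
r = redis.Redis(host="redis-ratelimiting.example.internal", port=6379)

def allow(client_id: str, limit: int = 600, window_seconds: int = 60) -> bool:
    """Fixed-window rate limit check: one tiny Redis round trip per request."""
    window = int(time.time()) // window_seconds
    key = f"rl:{client_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)                    # count this request
    pipe.expire(key, window_seconds)  # counters only live for a short time
    count, _ = pipe.execute()
    return count <= limit
```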
A
Right
so
lots
and
lots
of
connections
which
meant
that
the
overhead
for
any,
like
this
connection
overhead,
was
hit
rate
limiting
worse
than
it
might
other
redis
instances
in
that.
If
we
were
doing
fewer
connections
that
were
slower,
it
would
not
have
been
quite
as
much
of
a
CPU
increase,
so
For
Better
or
For
Worse
picking.
This
was
a
good,
a
good
candidate
to
discover
this
exact
problem.
Good.
C
One thing that this also makes clear: we wouldn't be able to fit Redis cache or Redis persistent on Kubernetes just like that. We would...
A
I mean, essentially we're going to... We haven't actually completed this work yet, but you can see there that there's the revert back to using Redis rate limiting on VMs in staging and in Pre. We're going to be pivoting to work on how to do horizontal scaling, but I am, for future me and for anyone else who watches this...
C
I think that's also something we should call out: the rollout and failovers and stuff that we did wouldn't work easily on the VMs. We would need to do more fancy things with stopping Chef client, doing a change, starting Chef client on a single machine and then the other machine and then the other machine, while on Kubernetes that was just merging a merge request and then watching the graphs change, yep.
A
We can't get any... so, you know, as a whole it was... The other piece I think that is worth noting...
A
I actually think that worked in our favor, in that we discovered that we could make all these changes and no one noticed, but also we saw the highest traffic, and thus everyone was there watching it and being able to compare. Had we looked at a lower CPU time, it would have been harder to see the difference in the graphs. Not impossible, this is a pretty significant difference, but harder.
C
Yes,
we're
calling
out
that
we
we
needed
to
proceed
immediately.
So
the
fact
that
we
did
this
on
a
because
of
the
split
brain
issue
we
needed
to
proceed
immediately,
but
because
we
were
doing
this
on
a
on
a
high
traffic
moment.
The
problem
surfaced
when
we
were
working
on
it
as
Stephanie
mentioned.
If
we
hadn't
done
this,
but
had
done
this
like
like
in
the
downtime,
then
this
would
have
been
the
problem
of
the
on-call
I.
Think
because
I.
A
Yep. The only other piece that I think is again worth calling out is: I do think that Bob and I did a great job during this thing in handing over the work back and forth and being able to move faster, because there were two of us in two different time zones. So, like, Bob would do a bunch of work during his day, he would hand over things to me, I would update, I'd hand back to him, etc. And I...
C
Because of that, we were all sort of rolled in, right? Like, approving, from just the work getting the approvals before you came online, and then getting the approvals from the SRE on call, and so on. Everything was ready and we just needed to meet and then, yep, get it done, and then get it undone.
A
Definitely. Like, I think we've proved that we can run Redis in Kubernetes with this, just if we also had horizontal scaling. Yeah, cool. Any other last words before I stop this?