From YouTube: 2022-04-06 GitLab.com k8s migration EMEA/AMER
A: Good morning, everyone, and welcome to the April 6th Kubernetes demo meeting. No demos on the agenda today, but we've got two discussion items. Bob, you've got the first one, related to NAT saturation. Do you want to kick us off?
B: Yeah, so that's something that popped up this week in our saturation metrics. We count how many ports we have available in total on a NAT gateway, and that's calculated from the number of IP addresses times sixty-something thousand, to know how many ports we have available. And we're running out.
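A minimal sketch of that capacity arithmetic, assuming a hypothetical gateway with four external IPs and a per-VM minimum port allocation; the 64,512 figure matches Cloud NAT's usable port range of 1024 to 65535 per address, while the other numbers are invented:

```python
# Hypothetical Cloud NAT capacity math; IP count and per-VM allocation are
# assumptions for illustration, not our actual configuration.
NAT_IPS = 4                # external IPs attached to the gateway (assumed)
PORTS_PER_IP = 64_512      # usable ports per IP: 65,536 minus the first 1,024
MIN_PORTS_PER_VM = 1_024   # assumed per-VM minimum port allocation

total_ports = NAT_IPS * PORTS_PER_IP
print(f"{total_ports} ports total -> room for {total_ports // MIN_PORTS_PER_VM} "
      f"VMs at {MIN_PORTS_PER_VM} ports each")
```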
B: The projected date is somewhere in June at 80% confidence, but there are some outliers that already reach above 90%. So I don't know how urgently we need to look into it, because I also don't really know the side effects, what actually happens when we're saturated.
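The June projection is presumably a linear extrapolation of the utilization trend, in the spirit of Prometheus's predict_linear; a rough sketch with made-up samples:

```python
# Linear forecast of port utilization; the samples below are synthetic.
import numpy as np

days = np.array([0, 7, 14, 21, 28])              # observation times (days)
used = np.array([0.55, 0.60, 0.66, 0.71, 0.77])  # fraction of ports in use

slope, intercept = np.polyfit(days, used, 1)     # fit used ~ slope*t + intercept
days_to_full = (1.0 - intercept) / slope
print(f"~{days_to_full:.0f} days from t=0 until 100% utilization at this trend")
```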
A: This was an issue after we completed one of our service migrations into Kubernetes, where we increased the number of both pods and nodes sitting behind our NAT gateway. The issues were difficult to troubleshoot at first because we didn't have the appropriate monitoring in place. We've since improved that, so in the future, if we're approaching saturation on that device, we should have the ability to alert the EOC: hey, something is about to go wrong.
A: I did drop a comment that highlights what starts to happen if we don't address this. Auto-deploy becomes blocked, because every single node needs to pull a new image, and since they're all reaching out, they're all using a port to get out and ask for the Docker container they need. And that has a compounding effect: if those are blocked, Sidekiq, which has webhooks or needs to reach out externally for any reason, starts to get blocked, and Sidekiq is smart enough to retry.
C: I mean, this topic has been discussed for a long time, right. Initially Craig Furman was working on Cloud NAT, and also on strategies for how to deal with this, and we even got an extended IP block from Google just for us to assign. Normally they don't give out a block of IP addresses for a single customer, but we got it from them.
C: Just for that reason, to be able to expand our Cloud NAT, I think, and also to give contiguous IP addresses to our customers for allowlisting, so they can say: okay, GitLab traffic is coming from these known IP addresses. And I think one of the biggest problems here is that for each connection going to the same destination, the same (destination IP, destination port) tuple, we need to open a new port on the Cloud NAT gateway to be able to distinguish those connections.
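In other words, a NAT mapping must be unique per (NAT IP, NAT port, destination IP, destination port), so concurrent flows to one destination each burn a source port, while flows to different destinations can share one. A toy model of that allocation pressure (a sketch, not how Cloud NAT is actually implemented):

```python
# Toy source-port accounting: concurrent flows to the same (dst_ip, dst_port)
# each need their own NAT source port; distinct destinations can reuse ports.
def ports_per_destination(flows):
    usage = {}
    for dst in flows:
        usage[dst] = usage.get(dst, 0) + 1
    return usage

same_dst = [("203.0.113.10", 443)] * 500                         # 500 pods, one registry
spread   = [(f"198.51.100.{i % 250}", 443) for i in range(500)]  # 500 pods, 250 hosts

print(max(ports_per_destination(same_dst).values()))  # 500 ports on one destination
print(max(ports_per_destination(spread).values()))    # only 2 per destination
```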
C: And the big issue is if a lot of, I don't know, runners for instance, a lot of our pods try to reach the same IP and port, like trying to pull an image from our Docker registry.
C: Then we need to have a lot of different ports open on our Cloud NAT gateway, and this is what's really saturating things. So one option is to add more IP addresses to the Cloud NAT gateway to get more capacity, and the other thing we could work on is preventing a lot of pods from trying to reach exactly the same address and port.
C: At the same time, I think the reservation for a NAT port is something like two minutes; it stays reserved for the connection, and I don't know exactly when it's released again. But if we were able to somehow split the traffic that's currently going to just one single destination address and port...
C: dev.registry.gitlab.org, I think, right, to pull the image. We have a lot of connections from a lot of pods open to the same destination IP and port, and that leads the Cloud NAT gateway to use a lot of the ports we have available just to distinguish between the different pods that are all connected to the same address. If they reached out to different addresses, they could share the same source port on the Cloud NAT gateway.
D: This is only for traffic reaching out of the cluster, correct? Right, so it really depends on what the problem is, but even if it's just the registry for downloading images, having a proxy or a mirror inside the cluster would help here, because then you have one connection going outside and everything else is connected to the internal registry.
A: And hypothetically, we could add a step to our deployment procedure that says: hey, now that this image is built, push it to that registry, and then we're never making an outbound connection. We're just saying: hey, talk to this other registry that's local to you. And theoretically our images will download faster, so, hypothetically, deploys will be significantly faster as well.
B: Yes, but consider the time and the resources to do that. So if we need a short-term fix, you mentioned adding some IPs, and then we can probably engage Igor to help with that. But for the long-term solution, and working out the plan for it, I think Delivery would be better placed to handle that. And do you think, because you mentioned adding different gateways per cluster, is that something that's possible, if we communicate the IPs outward to customers so they can allow us? Yeah.
B: And then it would be good to link here what we're going to do in the long term.
A: So, speaking of long term, let's talk about rebuilding clusters. There are many reasons why we want to rebuild clusters. Some of it is due to limitations that we're going to run into; one of those is the NAT saturation, for example. Another one is that the IP address space for our zonal clusters was not set up in the most optimal fashion.
A: So eventually, if we migrate any more workloads, we run the risk of running out of IP addresses for nodes. We'll try to scale up nodes, but we won't have an IP address for them, so Kubernetes will just say: sorry. And then there are other things: Calico has caused some pain, and GKE is phasing Calico out in favor of something called Dataplane V2, which is not something we can just switch to.
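For a sense of where those node-IP ceilings come from, here is some back-of-the-envelope GKE sizing; the CIDR sizes are invented, not our real layout, but the four reserved addresses per subnet and the /24 of pod IPs per node are GCP/GKE defaults:

```python
# Hypothetical GKE IP sizing: the node ceiling is whichever range runs out first.
node_subnet_prefix  = 24   # /24 node subnet: 256 addresses, 4 reserved by GCP
pod_range_prefix    = 16   # /16 secondary range for pod IPs
per_node_pod_prefix = 24   # GKE carves a /24 of pod IPs per node by default

max_nodes_by_subnet    = 2 ** (32 - node_subnet_prefix) - 4             # 252
max_nodes_by_pod_range = 2 ** (per_node_pod_prefix - pod_range_prefix)  # 256

print(min(max_nodes_by_subnet, max_nodes_by_pod_range))  # autoscaler stops here
```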
A: I think, if we needed to in an emergency situation, we could bring down an existing cluster entirely, and then we could build a new, replacement cluster. And, you know, Igor just highlighted the fact that this is still iffy, because we do have other problems that we need to take into consideration.
A: If we want to bring a cluster down entirely, we have to make sure that our other zones have the capacity to take the traffic load from it. The other thing we need to be cognizant of is that we're going to drive up our networking bill temporarily while that one cluster is down, because we're going to have a lot of traffic coming through our front door, HAProxy, and that traffic is going to cross zonal boundaries as it reaches out to the other clusters in our environment.
A: And thirdly, HAProxy is not very well configured, so when you drain a single cluster you end up sending a lot of traffic to Canary, which is not very well tuned. So from a technical standpoint, it would probably be wise to figure out what to do with Canary and HAProxy before we start bringing clusters down.
A: That in itself is kind of a problem, because now CI doesn't have any way to connect to it, and doesn't have any way to say: oh, this environment is for this cluster, even though it's the exact same as another cluster. And that's kind of problematic.
A: Yeah, and we could handle that in two ways: we could build it into our GKE module if we wanted to, or we could assign it as something we pass in when we build the cluster initially, when we consume the module. So we have options for that. It's not a huge blocker; it's just something that we need to be aware of.
A: So if anyone has ideas on networking, I'm all ears; I'll gladly listen to any ideas anyone has, because I do not currently know what to do with this.
A: And then the next item, which I think is going to be the most important, is what to do with all of our static IP addresses. For many of our services, we ask for a static IP that is stored within Terraform, and then we ask for that IP address to get applied via our Kubernetes configurations, and that's scattered across the three k8s-workloads repositories in some way, shape, or form.
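The Kubernetes half of that pattern looks roughly like the sketch below; the names, namespace, and address are hypothetical stand-ins for what Terraform actually reserves:

```python
# Sketch: a LoadBalancer Service pinned to a pre-reserved static address.
# All names and the IP are hypothetical.
from kubernetes import client

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="gitlab-webservice", namespace="gitlab"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        load_balancer_ip="203.0.113.20",  # the address reserved in Terraform
        selector={"app": "webservice"},
        ports=[client.V1ServicePort(port=443, target_port=8181)],
    ),
)
```

Rebuilding a cluster means re-applying every one of these pins, which is why their being scattered across repositories matters.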
A: I imagine if we upgraded the proxy, that would probably solve that situation, because I would imagine this is a long-standing bug that has been fixed by now. Does anyone else have any concerns or thoughts?
B: Regarding running out of IP addresses: have we validated the saturation metric that we recently added for that?
A: So I don't think it's an immediate concern, because the number of nodes that we run doesn't change as sporadically as, say, the NAT device does; it's not as spiky, I guess I could say. But the concern is there, and eventually we will need to rebuild clusters, so it is something that I do want to make sure we keep in mind when we think about this project.
C: I don't know if this is maybe out of scope, because we already have enough issues to tackle here. But what I would really like to see is that we mix our different workloads on the same cluster, just for better resource utilization. Because what we're doing right now is, for each of our workloads we have one dedicated node pool; almost its own cluster, I'd say.
C: So it would be cool to have just a generic number of generically named node pools and generically named clusters, and then just try to assign where we run our workloads in a way so that they're mixed. But this would be a really big change to how we run our clusters and workloads right now, so this is really another big challenge, I think.
D: This is important, Henry, I really... So, we had this conversation in the past, and it may become something that we act upon in the future, which is rethinking the product deployment based on feature category instead of service, so that you run CI everything, Package everything, and so on, and each one has its own deployment.
D: So if you think about it this way, the workload would be substantially different from what we're doing today, because then you have metrics in terms of feature categories, and you no longer have front-end versus API. Maybe Sidekiq can still be different, but that's just a detail. So in that direction, having beefier node pools and clusters that are easy to rebuild and are kind of general purpose is easier, because then, by namespacing, you can just mix the workloads together.
E: I guess the main concern that comes to my mind is isolation. Cgroups aren't perfect, and I especially don't know if we're specifying hard limits on all of them currently; I suspect that we don't, and that means we become a lot more susceptible to crosstalk.
A: I think this is a very large project, because there are a lot of facets to it, and that's one of them: trying to figure out how best to create our node pools. Similar to Sidekiq and the routing mechanism we created to spread that workload as much as possible, we're effectively doing the same thing at the Kubernetes level. And the one thing I'm kind of concerned about is that our metrics are very tied to the assumption that all the pods we care about run on one dedicated node pool.
A: So, for example, the API pods run on an API node pool. That gives us an easy way to look at our node-level metrics. But if we were to start intermingling our workloads between node pools, that becomes complicated, and I don't think we have a solution for that; I think we need to figure out a way to do that.
A: I'd completely enjoy this aspect of the project, because I think it would help us out in various ways. I do think we run too many nodes; looking at a few nodes, you can tell they're quite underutilized as is, and if we could pack them down a little bit further, that helps us through various different cost mechanisms: not only the cost of having an extra node running, but also the logs associated with it, the metrics associated with it, and the networking cost of the actual IP address that node is using.
E
In
terms
of
metrics,
I
guess
it
it
means
we
need
to
rely
more
heavily
on,
like
the
c
advisor
container
level,
metrics,
which
I
don't
know,
what
exactly
we're
using
on
the
kubernetes
side
right
now,
but
basically
looking
looking
more
at
deployments
and
containers
and
looking
less
at
node
level
metrics.
E: The other thing that comes to my mind is priority levels at a process level, if we want to increase utilization.
E: In order to still survive bursts, where we kind of need that burst capacity, we'd want to be able to sacrifice some lower-priority workload and say: oh, this Sidekiq thing is not as important as this web pod, so it's not going to get CPU for the next five seconds while we spin up these pods to serve this bursty workload, right. So I don't know what the built-in Kubernetes story is for that kind of stuff, but I think that's what's needed if we want to increase our utilization without destroying our SLOs.
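The built-in story being asked about here is mostly PriorityClass plus preemption; a minimal sketch of a low class for background work, with hypothetical names and values:

```python
# Sketch of a Kubernetes PriorityClass for best-effort background work;
# the name and value are invented for illustration.
from kubernetes import client

low_priority = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="batch-low"),
    value=1000,  # lower than whatever class the web workload uses
    preemption_policy="PreemptLowerPriority",
    description="Best-effort background work; sacrificed first under pressure",
)
# Pods opt in via spec.priority_class_name = "batch-low".
```

Strictly speaking, PriorityClass governs scheduling, preemption, and eviction order rather than momentary CPU shares; the "no CPU for five seconds" behavior would come from requests and limits.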
C: Yeah, but if we pack tighter, I think we could still easily define our requests in a way that leaves room, right. But right now, even tuning our requests as well as we can, we still leave, I don't know, one third of our nodes unused, because we can't pack things together better: either memory is underutilized or CPU is underutilized.
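That stranding effect is easy to see with invented numbers: requests that exhaust one dimension leave the other idle.

```python
# Toy bin-packing example of one dimension stranding the other (numbers invented).
NODE_CPU, NODE_MEM = 16.0, 64.0         # node shape: cores, GiB

pods = [(2.0, 2.0)] * 8                 # CPU-heavy pods: 2 cores / 2 GiB each
cpu_used = sum(cpu for cpu, _ in pods)  # 16 cores -> CPU is full
mem_used = sum(mem for _, mem in pods)  # 16 GiB  -> memory only 25% used

print(f"CPU {cpu_used / NODE_CPU:.0%} used, memory {mem_used / NODE_MEM:.0%} used")
# Three quarters of the node's memory is stranded; mixing in memory-heavy pods
# from another workload would reclaim it.
```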
E: So if we go slowly, and kind of decrease, decrease, decrease, increase, increase, increase, and, I guess, plan ahead of time, basically ensuring that we have one third of headroom on each of the other two clusters, then combining that with moving slowly should, I think, be fairly safe. So I'm not seeing any strict blockers; it's more of a mess-around-and-find-out kind of situation.
A: Okay. I think the point that you raise is important, though. I think we could at least predict that ahead of time, because we could look at our max pod counts, we could look at our max node counts and such, and determine whether or not we've got that extra capacity. So we could look at it from a configuration standpoint to see if we could survive this type of exercise.
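That pre-check is just arithmetic once the utilization numbers are in hand; a sketch with made-up figures, assuming three identically sized zonal clusters:

```python
# Can the remaining clusters absorb a drained one? Utilization numbers invented.
clusters = {"zone-b": 0.55, "zone-c": 0.60, "zone-d": 0.58}
drained = "zone-b"

moved = clusters[drained] / (len(clusters) - 1)  # drained load split across the rest
for name, load in clusters.items():
    if name != drained:
        projected = load + moved
        print(f"{name}: {projected:.0%} projected,",
              "OK" if projected < 0.90 else "too hot")
```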
A: So I could work on that next, I guess, and then maybe in parallel try to figure out what else we need to potentially prioritize from this particular issue in regards to configurations and such. And obviously we want to test this in staging first, because my biggest fear is that there's going to be some runbook...
E: We can control the ingress on that, right? So once it's brought up, we can kind of avoid traffic being sent there. Sidekiq is on the regional cluster anyway, right, so I'm guessing that's where we're starting.
A: So how does this sound: potentially, I could craft up an issue for investigating what kind of capacity we have in our other zonal clusters, just to make sure that we're not going to drain ourselves, or, you know, kill ourselves. And then maybe we could spin up a change request that goes through the exercise of taking down a cluster nice and cleanly, and bringing a new cluster online nice and cleanly and bootstrapping it.
A: Because I think step one for us should be: let's make sure bootstrapping a cluster works, to the extent that the application will work on it. So I'm thinking just a straight replacement, no tooling improvements made whatsoever, just making sure we can follow our runbooks, as step one. Because without that, it doesn't matter what improvements we make to our tooling to enable us to have a secondary cluster in whatever zone, because we won't have the ability to bring one online.
A: So that might be a good starting point for this epic, I guess.
A: Because that enables us to test our runbooks to start with. If that works, okay; or if we need to make improvements, we can do that at that step, at that point in time. And then I think maybe we could start looking toward making whatever tooling improvements we need, so that we can run two clusters at a time.
C: Good, especially because one of the important aspects here is being able to route traffic away from a cluster and route it back, with everything still working, right. Because this currently isn't possible, and it's really a bummer: we've likely needed to do something like that before, and it caused some outages. So just being able to do this in an emergency would already be great, and that would also help with being able to create a new cluster in this way.
E: Yeah, and what I like about the rebuild-in-place approach is that it disambiguates general issues in the bootstrapping procedure from issues introduced by having a differing cluster name. So once we've worked through any bootstrapping issues and we're confident in that part, we can then focus on the separate piece without having to investigate that difference every time.
A: Pretty seriously, okay. So I will start to formulate this epic to circle around that initial aspect of rebuilding in place, in some way, shape, or form, as safely as possible. I really don't know how to do that safely off the top of my head, because I know there are just so many moving pieces inside of our infrastructure. It's gonna be, it's gonna be fun.
E: It's the drop-down page. So yeah, just wanted to quickly highlight that I've restructured this Redis-on-Kubernetes epic.
E: Previously, it was a huge pile of random issues, slightly grouped, and I've tried to give it a phase-based structure, so that we can work towards specific goals and have something a bit more tangible, not only to work towards, but also to show, and to measure our progress against. And since there are a few folks on this call who have worked on this project, I wanted to highlight it and see if there's any feedback on that.
C: I think it's not having HPA integrated, for instance, but it's a start, and we will continue on that. And it's living in the gitlab-helmfiles repository for now, from where we can deploy it, and we can consider later if we want to integrate this into our GitLab charts, maybe. So I wanted to just list...
C: It's not in Omnibus and not in our charts, so it's nothing that we deliver with GitLab right now. So if we wanted to put this into our Helm charts, then we would also need to adjust Omnibus, I guess, to deliver a Camo proxy with it. We have it in our documentation.