From YouTube: 2022-06-29 GitLab.com k8s migration (EMEA/AMER)
A: The reason for this is that it would enable me to not completely disable canary across all HAProxies. We still want canary to be usable; therefore autodeploy still has a mechanism for which it has a backend to test against. If we completely disable canary, we effectively block Team Delivery's auto-deploy capabilities, and since we don't know the full scope of cluster outages, I don't want to take that type of negative impact to another team.
A: So, in order to be friendly to release managers, I've added the ability where we could disable canary for a target zone. That way, autodeploy still has the test. And secondly, because we'll probably be doing this across all nearly 30 front-end HAProxies, we'll do this slowly. That way, when we start disabling actual zones that are taking the brunt of customer traffic, say cluster B, for example, becomes disabled, it happens gradually.
A: So I simply added a small sleep. So, just to showcase, I'm going to use staging; that way I'm not sitting here banging on production if we do a get-server-state. Let's say we want to look at the SSH backends specifically; hypothetically, everything should be up. Let me see if there's any deployments going on... okay, so there is a deploy going on.
A: But it's past the Kubernetes part, so that's good. So in this case, and Jenny, just in case you're not familiar with this tool, all this is doing is reaching out to our Chef server and gathering a list of all the FE nodes that we have. So in staging, currently we've got two FE nodes that are of type CI. We've got two, or excuse me, three FE nodes that are just our generic FE nodes that intake most of GitLab's traffic.
A: We do have a few others: there's one for Pages, well, a set of them for Pages and a set of them for Registry. They do not run the SSH backend service; they don't intake that traffic, therefore they don't have an SSH backend. We are inherently querying those backends, they're simply not showing it, because there's no data to show. I'm targeting staging and I'm looking specifically for SSH. This is just a blanket regex, so it's looking for all SSH backends.
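
As a rough sketch of what a get-server-state query can boil down to under the hood (the node names, socket path, and filtering here are assumptions for illustration, not the team's actual tooling):

    # Hypothetical: ask each front end's HAProxy admin socket for server state,
    # then filter with a blanket "ssh" regex, as described above.
    for node in fe-01-sv-gstg fe-02-sv-gstg fe-03-sv-gstg; do
      echo "== ${node} =="
      ssh "$node" 'echo "show servers state" | sudo socat stdio /run/haproxy/admin.sock' \
        | grep -E 'ssh'
    done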
A: So there's a convenient set-server-state. My imagination is that what we'll do inside of our procedures for mucking around, when it comes time to take down the clusters, is disable our canary backend explicitly.
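
As a sketch of that explicit disable step, assuming HAProxy's runtime API is what the tool ultimately drives (the backend and server names below are placeholders, not the real ones):

    # Hypothetical: put the canary backend's server into maintenance on one
    # HAProxy node, and bring it back once the cluster work is finished.
    echo "set server canary_web/web-cny-01-sv-gstg state maint" \
      | sudo socat stdio /run/haproxy/admin.sock

    echo "set server canary_web/web-cny-01-sv-gstg state ready" \
      | sudo socat stdio /run/haproxy/admin.sock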
A: But if we do another get-server-state, you know, we're going to query all of our servers. The UX of this particular tool is not great, and I'm not really sure how to make it any better, because we're just using basic bash commands: we're grabbing the information for all the HAProxy nodes, and we're doing a sort and a count. So we see that one FE node says that this canary backend is in maintenance, and we see that two of them are up.
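
The sort-and-count that produces that "one in maintenance, two up" summary is plain bash along these lines (a sketch: field 18 of HAProxy's "show stat" CSV is the status column, everything else here is a placeholder):

    for node in fe-01-sv-gstg fe-02-sv-gstg fe-03-sv-gstg; do
      ssh "$node" 'echo "show stat" | sudo socat stdio /run/haproxy/admin.sock'
    done \
      | awk -F, '$1 ~ /canary/ { print $18 }' \
      | sort | uniq -c
    # Expected shape of output:
    #   1 MAINT
    #   2 UP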
A: Based on what I specified here, where I specified zone B, I can gather that the HAProxy node located in zone B is the one that's in maintenance mode, but that's not explicit here, which is kind of disappointing. But in my procedures I could always add another validation step, just to make sure that people fully understand what the state actually is. We could also probably look at metrics to determine that, hey, these sets of FE nodes, all located in zone B, are no longer taking traffic as well.
A: So now that canary is disabled, the next thing I would probably do is go into here, and, let's say we're again targeting cluster B, we would just go in here and say: hey, you, set yourself to maintenance. And, because I know this is staging, but, like, let's say we're doing this in production, for example, we'll want to slowly roll across all nearly 30 FE nodes; but in staging we're only going to hit one node.
A: We'll say yes to this, and we even output a little message saying: hey, you desired the change to be rolled out slowly, so we modify our knife command to operate one node at a time. There's a concurrency flag on knife, the -C flag, or the --concurrency flag; we simply default that to one in this particular case. And the actual knife command that it sends to the HAProxy administrative socket has a sleep injected in front of that command, so for every proxy node we sleep first and then we set the actual flag.
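
Put together, the slow roll looks roughly like this knife invocation; -C/--concurrency is knife ssh's real flag, while the search query, sleep length, and backend/server names are assumptions:

    # Hypothetical: one HAProxy node at a time, sleeping before each state change.
    knife ssh -C 1 'roles:gstg-base-lb-fe' \
      'sleep 30; echo "set server canary_web/web-cny-01-sv-gstg state maint" | sudo socat stdio /run/haproxy/admin.sock'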
A: Not really, you know, this tool doesn't send any audit logs anywhere. It doesn't send a notification anywhere that it's being used, so not really. Normally this is used inside of change requests, but, you know, there's nothing preventing me from using it here locally. You know, I'm sitting here targeting staging with this, you know, without any sort of issue at all.
A: Locally, maybe, like, script this locally... there's a flag that I could set. This is taking forever. There's a flag that we could set on this tool that effectively is like a force flag, where it doesn't ask me to hit enter to continue, and I could create another bash script that wraps around this tool that ensures that we can just copy and paste a single command.
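
The wrapper being described could be as small as this; everything here, including the tool name, its force flag, and the zone list, is a placeholder sketch rather than the real script:

    #!/usr/bin/env bash
    # Hypothetical wrapper: roll the canary disable across zones with one
    # copy-pasteable command, skipping the interactive "hit enter" prompt.
    set -euo pipefail

    for zone in us-east1-b us-east1-c us-east1-d; do
      ./bin/set-server-state --force --zone "$zone" gstg maint canary
      sleep 300   # let traffic drain from each zone before touching the next
    done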
A: Anyways, I'll get these nodes back up and running the way I need to; canary is the last one.
A: But that's what I wanted to showcase. That's what I've been working on lately, so...
A: As far as next steps go, the last thing that I wanted to make sure that we accomplish is making sure we don't block all the deploys. Right now we've got a situation where, if a cluster is down, our deployer will see that and will bail, and this is intentional, because we don't want any backends to be down, for the most part.
A: So I need to make sure that we'll be okay, like setting some sort of flag that ensures that we could still proceed during a maintenance procedure such as that. We already have an environment variable set today, or that I can set today, that would do the same, but it's very global.
A: It would bypass the check for all of our clusters, but I don't want that to occur; I want that to occur just for the target cluster that we're operating on. I think that'll just be a safer method of doing that.
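
A per-cluster version of that bypass could be a list-valued variable the deployer's health gate consults; this is only a sketch of the idea, with the variable and helper names invented, and cluster_is_healthy standing in for whatever check the deployer actually runs:

    # Hypothetical: skip the "cluster must be healthy" gate only for named clusters.
    SKIP_HEALTH_CHECK_CLUSTERS="${SKIP_HEALTH_CHECK_CLUSTERS:-}"   # e.g. "gstg-us-east1-b"

    cluster_check_skipped() {
      local cluster="$1"
      [[ " ${SKIP_HEALTH_CHECK_CLUSTERS} " == *" ${cluster} "* ]]
    }

    if ! cluster_is_healthy "$CLUSTER" && ! cluster_check_skipped "$CLUSTER"; then
      echo "cluster ${CLUSTER} is unhealthy and not exempted; bailing" >&2
      exit 1
    fi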
A: Yeah, all right, so discussion items. I already talked about number two; I went out of order, my apologies. Item number one, Git SSH: I wanted to bring up a chart, but I failed to do that ahead of time, so my apologies. Leading up to the gitlab-sshd enablement, we used to have an HPA that was leveraged heavily, but our EOCs started getting paged often enough to be concerned.
A: We used to leverage the HPA of the gitlab-sshd, or GitLab Shell, daemon quite heavily. It would scale all the time, but because we were having issues and we couldn't find the root cause, we just told the HPA to not do its job: we effectively set the minimum replica count to a value that was exceedingly high. What that led to was the inability for us to leverage the HPA, and now we're running a lot more pods than we used to nowadays.
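
For context, pinning an HPA that way is just a matter of raising its floor, and restoring it is the reverse; the namespace, HPA name, and numbers below are guesses rather than the production values:

    # Inspect the current HPA state (names are assumptions):
    kubectl -n gitlab get hpa gitlab-gitlab-shell

    # What was effectively done: stop scale-down by raising minReplicas very high.
    kubectl -n gitlab patch hpa gitlab-gitlab-shell --type merge \
      -p '{"spec":{"minReplicas":100}}'

    # What re-enabling the HPA would look like: drop the floor back down.
    kubectl -n gitlab patch hpa gitlab-gitlab-shell --type merge \
      -p '{"spec":{"minReplicas":2}}'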
A: So we used to run roughly 40 pods, spiking upwards of 50 pods, right?
C: I was just looking at... so there was, I think, an investigation going on from Igor and someone else from Source Code about doing some performance comparison, which I think is completed.
A: All right, I'll get an issue fired up to address that. It's low priority at this point; we're just wasting a lot of money, in my opinion. So yeah, I think it's low priority, but it shouldn't be an issue that takes a lengthy amount of time; it'd be one of those quick wins, in my opinion.
A: It would just be time consuming because of the need to perform the research and then make the changes slowly, such that we don't induce an outage. We want to, you know, kind of introduce changes slowly, maybe test them on one particular cluster before we roll it out to all of them, for example. And then this is only impacting production: the change that we made that I described, where we set the minimum replica count, we did not do that on staging, we only did this on production, because production was the only environment that was yelling at our EOCs about stuff. So I'll fire up an issue, and I guess I'll link this.
A: ...about that. Otherwise, okay, so we already talked about number two, so Jenny, let's talk about what's going on with stateful sets, because this...
B: ...is silly, yeah. Oh, I mean, you two both have the context behind it, but yeah, let me just go over it. So yeah, I was working on this issue to basically add a label to a stateful set, and it's not going great. I had this merge request before, just for the pre cluster, just to test things out. You know, we're trying to add a stage: main label; that's the only new thing, the change being:
B: you know, if there's a stage label, please add the stage label, which is main. And then once we run that... so at first, you know, the pipeline that starts running from this fails, because the diff says: hey,
B: this stateful set cannot take any more labels; basically, updates to fields other than these are forbidden. So basically we thought, okay, we're going to delete the stateful set manually, merge the merge request... well, delete the stateful set, orphaning the pods in that step, and then merge the merge request, and then the new stateful set theoretically should pick up the orphaned pods and restart them with the new label.
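
That orphan-then-adopt plan maps onto kubectl's cascade behaviour; a sketch, with the namespace, StatefulSet name, and label selector assumed rather than taken from the recording:

    # Delete only the StatefulSet object, leaving its pods running (orphaned):
    kubectl -n logging delete statefulset fluentd-archiver --cascade=orphan

    # After the relabelled StatefulSet is deployed by the merge request, it
    # should adopt any running pods whose labels match its selector:
    kubectl -n logging get statefulset fluentd-archiver
    kubectl -n logging get pods -l app=fluentd-archiver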
B: But then, when I actually went and did that step, we ran into an issue where it says the new STS couldn't pick up the new pods, because the operation against the pod could not be completed at the time, please try again... like, it's a very ambiguous error that happens. So then we manually went in and deleted them one by one; thankfully, in the pre cluster there were only two.
B: Now, what's even stranger is that, you know, when I took down the first pod, it actually didn't say ready one out of two when the new pod came up from the new stateful set; it said zero out of two, so that, like, made me worry again. But then we said, let's just go ahead and delete the second old pod, and then another one came up, and that's when it turned two out of two, which just might be a visual glitch. I'm not sure why that stateful set behaved that way, but yeah.
B: So basically that's the premise of what happened. I wanted to talk about... like, that's the downside, and because of that, if we're going to do this in the prod cluster, we're going to be doing this to 50 pods, only bringing them down one at a time, which takes around five minutes per pod. That made my change request have 260 minutes for the time to completion (roughly 50 pods at five minutes each), which is less than ideal, right? So yeah, that's why I wanted this discussion. Now, overnight,
B: Graham had suggested... well, he noticed two things, right. He says that the pre cluster is the only one on K8s 1.22, and that might be the reason why the stateful set is acting that way. But then I thought, you know, 1.22 is the latest one on the very stable, like, release line, which kind of doesn't make sense, like why a stateful set would behave so weirdly on a very stable release; but I can do more research into that. And then also that the fluentd archiver can take downtime.
B: So if it's going to take 260 minutes, then we should just delete the STS without orphaning the pods and just do a full redeploy, taking downtime. So yeah: thoughts, opinions?
A: I think Graham pointing out that the pre cluster is slightly newer is just pointing out the fact that maybe there's a change brought into 1.22 that may differ in the way stateful sets behave versus 1.21, which is what we run everywhere else.
A: So one thing I'm kind of curious about, because I know stateful sets like to work in a backwards format: you know, it's always building the first pod, and if there's a change that needs to be introduced, it's going to make that change to the latest pod. So, like, pre, for example, was running two pods, so you have pod zero and pod one.
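
Concretely (names assumed), those two pre pods are the fixed ordinals of the set, which is what makes the per-pod reasoning here possible:

    # StatefulSet pods keep stable ordinal names, so with replicas=2 you see:
    kubectl -n logging get pods -l app=fluentd-archiver
    # NAME                  READY   STATUS
    # fluentd-archiver-0    1/1     Running
    # fluentd-archiver-1    1/1     Running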
A: Not... because the error we got is very generic and doesn't lead us towards any sort of troubleshooting direction, I suspect there was probably a conflict with the name, and Kubernetes didn't know what to do because a pod was already named that. So I suspect that deleting the pods overall is going to be what's required, but it's just a curiosity that's, you know, itching my brain. So I don't know whether or not you want to test that out... test that theory out on ops; I'll leave that to you.
A: Otherwise, I think with the fluentd archiver, you know, fluentd uses that position file, so it knows where it stopped when it was last running. So, like Graham said, we can take downtime; it's just a matter of making sure we limit that downtime as much as possible, so we don't end up making fluentd suffer too much by taking too many resources as it catches up. We also don't want to drown wherever those logs go to, which I think, if I recall, is GCS.
A: So that's probably not an issue for us. So I see two options: if you want to try to figure out whether, you know, stateful sets behave differently on version 1.21, we could repeat the exact same experiment you did already, which I suspect is not going to change anything; or we could go about testing your new procedure, which is just simply deleting all of them and, you know, hitting that deploy button.
B: How about this, yeah: so in the ops cluster... so we have ops, staging, and prod left, and obviously we don't want to test on prod, but on the ops one, how about we test out...
B: Yeah, and then, if that experiment fails, I'm just going to delete all the pods, delete the new stateful set, and then I can redeploy with that pipeline, right, like run the deploy pipeline.
B: Okay, yeah, then I think I'll do that, and then, depending on what comes out of that, go forward with that on staging. I'm guessing it's going to be the whole... the, like, delete the whole thing and then redeploy.
B: But, you know, if we don't have to do that, it'll be ideal, just because 50 pods... it's a lot, yeah.
D: I know. I don't know why it takes around five minutes, but it is what it is, so yeah.