From YouTube: 2021-08-18 GitLab.com k8s migration EMEA
C
Well, let's get it going.
E
Morning — you doing fine? Yeah, yeah, good, good. So, exciting demo today. Let's go back over to you.
C
Yeah, so it took us three attempts so far. We haven't rolled back yet, but we are taking traffic inside of the canary stage on Kubernetes, so I'm just going to show a few — not really anything exciting — dashboards.
C
We can see that, you know, our Apdex is wiggly, but it always is — no biggie. If I go back to seven days, which would take a while to load, it always looks a little okay. The granularity changes, so maybe not seven days, maybe two days.
C
Thank you to Henry for helping me with the MR to review that. So we are now seeing saturation just shy of 80%.
C
Yeah, I feel like we should have a median, or maybe the average instead of the max, but whatever. As you can see, we did fix our metrics. This is when we first implemented everything and the metric just kind of dropped down to zero. I wasn't watching this chart; I was paying attention to only these things up here.
C
I did see the memory saturation, but when I was doing the migration I was like, we could toy with that, I'll revisit it later. But yeah, so our metrics are now fixed, so we have the data necessary. Interesting point about metrics: let me see if I can go back to our main stage, because we did make a mistake yesterday and I accidentally induced an outage because of it when I was attempting to fix the metrics — it might have been this period of time.
C
We have a very similar query to our API, but what was missed was that it wasn't looking for Workhorse specific to the web fleet. So during this time in which Apdex plummeted, what was happening was that we were gathering metrics for Workhorse across all of our services, inside both Kubernetes and our virtual machines. So instead of only taking into account roughly five thousand — well, I guess nearly ten thousand requests, between seven and a half and eleven thousand — we were taking into account over 14,000 requests, because those were, of course, across everything. Oops — I don't know why we don't see the RPS change here, but if we go to staging, I think this is where I went, oh yeah, I'm doing something wrong in our query. I don't know why that same thing didn't show up in production when that change was rolled out, but that highlights an issue with the way we deploy our rules for Prometheus, because I can't create something that targets only production or only staging.
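[A hedged sketch of the kind of fix being described: scoping a Workhorse Apdex-style recording rule to the web fleet with a label selector, rather than aggregating every service that runs Workhorse. The label names (`type`), the 1s bucket, and the rule name are illustrative assumptions, not GitLab's exact production rules.]

```yaml
# Illustrative Prometheus recording rule (assumed labels/names):
# restrict the Apdex ratio to Workhorse metrics from the web fleet
# only, instead of every fleet that exposes Workhorse metrics.
groups:
  - name: web-workhorse-apdex
    rules:
      - record: gitlab:workhorse_http_request_apdex:ratio_rate5m
        expr: >
          sum(rate(gitlab_workhorse_http_request_duration_seconds_bucket{
            le="1", type="web"}[5m]))
          /
          sum(rate(gitlab_workhorse_http_request_duration_seconds_count{
            type="web"}[5m]))
```

[Without a selector like `type="web"`, every fleet's Workhorse requests land in the denominator — which would match the ~14,000 versus ~7,500–11,000 request counts described above.]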
A
A question regarding the testing thing: do you mean how we can test the metrics — so, the dashboard or the metrics themselves?
C
So this is a chart showing all of our canary traffic for web, excluding the health checks on the web exporter, because those are always being hit, and we can see the majority of our traffic is Kubernetes. In fact, if I re-enable this filter, we see some traffic going to our virtual machines, but if you look at what that traffic is — we just recently had a deploy, so all you see here is us starting and stopping the services on our virtual machines. So that's good!
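[A minimal sketch of the kind of query behind such a chart, assuming Workhorse's request counter with GitLab-style `stage`/`type` labels; the exact label names and health-check routes are assumptions.]

```yaml
# Illustrative recording rule (assumed labels and routes): canary web
# RPS with the health-check endpoints that load balancers poll
# constantly filtered out, so real user traffic is visible.
groups:
  - name: canary-web-traffic
    rules:
      - record: gitlab:canary_web_requests:rate5m
        expr: >
          sum by (cluster) (
            rate(gitlab_workhorse_http_requests_total{
              type="web", stage="cny",
              route!~"^/-/(health|readiness|liveness)"}[5m])
          )
```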
C
Which goes nicely into our topic of discussion. We have one readiness review left over from the infrastructure side of things. I mistakenly assigned one of the reviews to someone who was on vacation, so I got two volunteers to help with that. One person has already completed the review; another person is doing that today. So if nothing comes out of that review, that's super exciting. I think our target would be early next week for the next time we shoot for performing the migration.
E
Yeah, that makes sense. I was chatting to Graham this morning; I think, with the release coming up, it fits in well with the timing anyway. So yeah, let's aim for next week.
C
The change request, he's still working on. I'm trying to figure out a way that we can reduce the severity — not severity, but the change level, I guess — because currently it's labeled as a C2, but we could do this as a C3. That way we're not being a blocker for auto-deploys, because it's going to be a multi-day thing, so we don't want it to be a very high C-level change anyway.
E
Cool, yeah, that sounds good. That sounds good, super. So one thing that was, I guess, a question mark — maybe it's too early to tell — but one thing that was a question mark around removing NGINX was whether we would see performance impacts. Is canary an environment where we can get an indication of whether we have impacted performance, or do we need to wait for the main stage for that?
D
I had the same question, because when I looked last time, when we enabled canary, the Apdex for instance looked at least as good as before last week — maybe even better. But now, if I look at it, it's not really a big difference. So I guess we also need to look into logs, right, Skarbek, for really figuring out if maybe some percentile of latencies got better or worse — and we should be able to filter this for canary, I think, to see this.
E
I mean, we can certainly look at it on the main stage. I was just curious because, as far as it stands, I think that was the only real question mark left from the testing Graham had done — whether this... there probably isn't, but there may be some small change. So we can look at that next week; that's fine.
C
Well, that's a good question. I'm just not sure how to find that information, because from a technical standpoint you removed an object that measures things — we removed NGINX — so because we don't have that to measure, because it's gone, I don't know how to say, hey, we improved or degraded performance. I guess we already have Apdex for canary, so we haven't negatively impacted that as far as I could tell. But our true method of determining whether or not we made any sort of improvement would be to look at HAProxy and look at the response times from our back end.
E
That's right. I'm not expecting it to be faster; the main big thing is stability, right? So we should see fewer odd things happening, which is good. Cool — well, we can monitor that next week. That sounds good.
D
I mentioned it in the EMEA reliability discussion.
E
Cool, okay, great. So yeah, let's just keep repeating that, so that as many people as possible know where we're up to. Super exciting stuff.
C
What we had was a lot of OOM-kill events occurring at that same time. So looking at the memory profile during the deploy, we were at our limit. This is one gigabyte, and we were pretty much at our limit, so any pod that was at that limit was going to get killed. But what I also saw was some saturation at the node level. So this is the amount of memory free, and then later Matt Smiley came in behind me and showed me a different chart — somewhere in here.
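[For reference, a minimal sketch of what such a per-pod ceiling looks like in a chart's values, assuming a 1 GiB limit like the one discussed; the request numbers are illustrative, not the production values.]

```yaml
# Illustrative container resources (assumed values): the limit is the
# per-pod ceiling, so a pod sitting at 1Gi is an OOM-kill candidate
# even while the node itself still reports free memory.
resources:
  requests:
    cpu: 500m
    memory: 512Mi   # what the scheduler reserves on the node
  limits:
    memory: 1Gi     # hard ceiling enforced by the kernel OOM killer
```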
C
That's not entirely contradicting what I was saying, but we still had memory available — just not down to zero; we had like three and a half gigs left on those server nodes. So I'm kind of surprised that we had some OOM kills in general, but we did have quite a few.
C
We had 149 events where things were being killed — and of course I can't pull this chart up anymore, because those logs have been rotated out as time has passed — but it's a mixture of either specific client events being killed or the actual process that manages the pod itself, and when the process that manages the pod gets killed, that entire pod gets removed from rotation.
C
If
we
lose
enough
capacity,
we're
not
going
to
be
able
to
serve
those
claim
requests
because
we'll
have
saturated
those
pods
in
various
ways.
We've
seen
in
a
previous
incident
where,
when
the
api
went
down
for
a
period
of
time,
gitlab
shell
ran
up
its
cpu
usage
off
the
charts.
C
In
this
particular
case,
our
memory
usage
went
way
too
high.
So
I
think
get
lab.
Shell
is
one
of
those
workloads
where
we
need
to
figure
out
a
better
way
to
tune
it
because
during
our
normal
usage
of
this
workload,
oh
no,
I
closed
my
tab
during
our
normal
usage
of
this
workload,
we're
greatly
over
provisioned
we're
only
writing
between
two
and
three
pods
on
these
nodes
and
if
you
go
somewhere.
C
This is what I get for not preparing. Let's see — we've got nearly six gigs of RAM that we could allocate, but we're sitting here using less than two. This blue line is how much we use; we're not using very much RAM at all, so a lot of the resources on these nodes are just not being used at all. But during times in which we are suffering, we'll use all of it, and the node itself will start to suffer.
C
Okay, so here's the number of clients that were connected, and the average number of processes, which to an extent correlates to the number of connected clients to GitLab Shell. We know that when we get to a high number, we start to saturate our pods.
E
Did we ever get anything, or did we ever ask the GitLab Shell team about the architecture? Because I feel like this changed at some point in the last six months or so and started causing us more of these — maybe not these sorts of problems, but it started to use up more memory and stuff. Did we ever ask them about that?
C
I asked about GitLab Shell and I didn't receive, like, a solid answer, but it sounds like the answer is technically no — they pass the information directly to Gitaly as necessary, and vice versa.
C
But the one thing I did put a feature request in for was to see if we can't ask both Gitaly and GitLab Shell to log the amount of data being transferred to and from the client. This is a common thing that we log for HTTP requests, but we don't do the same thing for this particular service.
A
These things interact with Gitaly with gRPC calls, right? Yeah, so I remember a conversation I had with Jacob when I was working on some improvement about cloning stuff. I think we were discussing this — now I remember, it was a long time ago — but basically what he told me is that the gRPC architecture is well designed for short and small data transfers.
A
But when you start moving chunks of data, it just becomes a memory hog, because basically it just allocates buffer after buffer, converts things from the internal structure to the wire structure, then serializes them and sends them over the wire. And when you receive, it's the same thing, right: you have to take the packet and convert it back into the memory structure, and things like that. So maybe it can be related to this, because it really depends on which kind of operation we are doing.
C
Yeah, and one of the things Matt Smiley noted was that when the OOM killer came into play, a lot of the time it was killing processes that weren't using a lot of memory, which is kind of problematic, because that means we are serving a lot of requests but there's not much the node can do to help keep itself stable and healthy.
C
So
it
I
know
gitlab
show,
is
in
the
process
of
reworking
how
gitlab
show
operates
entirely.
So
I
don't
want
to
delve
too
deeply
into
you
know,
trying
to
figure
out
what
we
could
do
better
for
that
service
at
the
moment,
because
I
feel
like
in
the
future
that's
going
to
change
significantly
with
them,
introducing
a
demon
versus
the
current
method.
A
We
can
still
counter
your
first
point,
which
is:
maybe
we
are
focusing
on
something
that
is.
This
is
not
affecting
this,
and
now
that
we
are
in
a
major
architecture
refactoring.
Maybe
it's
a
good
time
to
think
also
about
this,
because
maybe
we
just
changed
stuff
and
then
their
underlying
problem
is
still
there.
D
I just had a look at the memory usage of GitLab Shell during that — I think it was on the 12th, right? Yeah. And it looks like a very strange pattern: over, I don't know, two or three hours we had very low usage on this cluster B, I think, that I'm looking at, and also it's very spiky, right? It's going up and down, doubling the amount of usage in between. So I think the pattern of resource usage is very erratic in general for GitLab Shell, right?
C
If our pods were failing during a deployment — which was the case in this particular situation — we were removing 25% of the capacity, thereby allowing any new pods, and any existing pods that had yet to be rotated out during the deploy, to suffer more. And I think because of that we started hitting our memory limits, things were getting killed, and that was just a cascading failure until things got to a stabilization point.
C
So I modified our deployment strategy — we already use this for our web deployments — where, instead of allowing 25% of pods to be taken out of rotation, that value is now zero. So instead of scaling down and up at the same time, we only scale upwards before we start tearing down old pods, and those new pods have to be up, ready, and taking traffic before any older pods get taken out of rotation.
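[A minimal sketch of that strategy change on a Kubernetes Deployment; the surge percentage is an illustrative assumption, not the production value.]

```yaml
# Illustrative Deployment strategy: with maxUnavailable set to 0, the
# deploy surges new pods first and only tears old ones down once the
# replacements are Ready, so capacity never drops below 100% mid-deploy.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # extra pods allowed above the desired count
      maxUnavailable: 0   # never take existing pods out of rotation early
```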
C
We
see
that
we
scaled
up
new
pods
and
then
we
were
doing
a
little
wobbly
effect
during
the
deploy,
and
then
we
went
back
down
to
where
we
were
so.
We
started
at
45
prior
to
the
deploy
we
finished
with
45
pods
and
then
our
hp
is
doing
its
job.
As
you
know,
load
happens.
So
that's
precisely
what
I
want
to
see.
C
In this particular case, when the CPU usage goes up because of load, it will add new pods, and vice versa: when CPU usage goes below a certain threshold, it will remove pods.
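[A hedged sketch of such a CPU-driven autoscaler; the target name, replica counts, and utilization threshold are illustrative assumptions, not the production configuration.]

```yaml
# Illustrative HorizontalPodAutoscaler: adds pods when average CPU
# utilization exceeds the target and removes them when it falls back.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-shell
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-shell
  minReplicas: 2
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```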
C
This
is
the
one
thing
that
we
don't
have
with
virtual
machines,
so
we're
not
able
to
scale
up
our
workload
to
handle
client
requests
instead
of
having
to
wait
for
an
infrastructure
engineer
to
be
like
hey,
our
cpu
usage
is
too
high.
Let's
add
three
nodes
and
they'd
spend
the
next
two
days
trying
to
figure
out
how
to
accomplish
that
task.
E
Back
to
your
point
that
you
made
alessio
what
were
you
talking
about
the
get
up
shell
architecture,
as
opposed
to
our
infra.
A
Yeah,
I
was
talking
about
the
gitlab
shell
architecture,
so
you
said,
if
the,
if
you're
really
rethinking
it,
maybe
it's
worth
just
to
put
also
this
question
on
their
plate
so
that
when
they
are
thinking
about
what
to
do,
they
may
consider
also
this
type
of,
because
I
just
very
very
briefly
just
as
an
idea
right.
So
if,
as
also
as
henry
pointed
out,
the
behavior
memory
behavior
is
erratic.
A
So
probably
we
need
something
which
is
more
within
the
business
logic
that
can
really,
let's
say,
made
a
guess
or
an
estimation
about
the
incoming
request.
How
much
memory
will
it
take?
And
probably
this
is
the
type
of
project
the
type
of
software
where
we
need
to
have
tighter
memory
management
within
the
process
itself.
So
let's
say
something
like
this:
we
get
a
new
client
and
we
know
that
in
we
have
a
memory
limit
because
we're
running
kubernetes,
so
we
need
to
bring
this
information
down
in
the
process.
A
So
we
know
that
we
have
an
upper
limit
of
say
one
gigabyte
in
the
spot
right,
so
the
process
know
how
much
request
is
serving
and
roughly
the
memory
that
is
using.
So
if
there
is
a
gisely
call
that
can
kind
of
estimate
the
amount
of
memory
needed
for
the
incoming
requests,
it
will
make
an
informed
decision
about
serving
it
or
just
putting
in
a
queue,
because
you
don't
want
if
this
is
serving
more
than
one
request.
C
Something to keep in mind is the way GitLab Shell currently works: right now, when you make an SSH connection, GitLab Shell spins up a brand-new process for each new connection. That in itself has its own memory overhead, whereas in the future architecture GitLab Shell becomes its own daemon. It runs the necessary TCP stack to accept an SSH connection, and it's not spinning up a new process; instead it will be a single process that can then manage those connections as necessary, and it'll have its own handle on memory, because it's managing all of the connections instead of a single process that is disconnected from all of the others. So the architecture will be better; it's just a matter of what future impediments or situations we need to be aware of whenever we turn that on. I don't know what the timeline is.
C
I
know
our
helm
chart
now
supports
switching
over
to
the
new
shell
demon,
but
I
don't
know
where
that
team
is
at
with,
but
actually
rolling
that
out.
E
Can we find out? Because I think that would be a great one for us — to plan stuff, but also to know how much we should worry about this stuff or the metrics and things. Even if it's just a comment on this issue we have — especially since we can link them to this video — to see if there's any kind of practical stuff.
C
I'll figure — I think, yeah, I'll do just that, actually.
E
Super, thank you, awesome.
E
Does that give you what you need, Skarbek, in terms of overviewing this thing? Is there anything you need from our side today to help here?
C
For this issue, no. At this point I've kind of completed my investigation; I pulled in a few other people who were curious about this, and they've kind of completed what they wanted to see out of it. If anything, Jacob may have more questions, but I'm going to go ahead and proceed to close this issue. At this point in time, I've already completed an initial corrective action that I think was the most beneficial — I just showed the chart — so I think we're in a good state.
E
Awesome, cool, sounds good — great. And then goals — we've kind of touched on them already: the readiness review and change requests. Is there anything else, and do we need to give anything, provide anything, or help with anything?
C
No — I commented on the CR that Graham has started, to hopefully make it a little easier on ourselves, and then I've got one person that's doing the review — er, performing a readiness review. So those are the two things I know of that are going to be holding us up, but I think we're getting near ready.
E
Nice. We can probably see it — I'm sure it's in the emails — but on the main web rollout, what's the kind of rollout plan? Are we going cluster by cluster, or how are we going to get this out?
C
The current plan that Graham has put together — it looks like we start by introducing the new configuration one cluster at a time. Here, I'll roll through this real quick, since I have the issue up. So we've got a Chef change, and all it does is add, for one cluster, the back end of the web endpoint to that back end, and we do that per cluster.
C
So at this point one set of HAProxy nodes will send three percent of the traffic to that cluster, and we'll get to that point at the end of, I guess, day one — wait an hour... wait one day, yeah. So day one, we do just the initial Chef change across our front-end nodes; after that, we just play with the weights until we reach 100% of the traffic being in Kubernetes.
C
That's the initial plan. This is a little bit intrusive, because it involves stopping Chef on our fe nodes, executing the merge request, and then running Chef. I guess you don't have to do it targeted, because we're doing this one cluster at a time, but I think we can make this easier, and then we could reduce the change level.
E
Okay, awesome. Thanks so much for demoing, Skarbek, and yeah — exciting, great progress. Exciting to see things on canary, and looking forward to the next step.