From YouTube: 2022-03-09 GitLab.com k8s migration EMEA/AMER
A
B
Let's see, we've got two items on the agenda. Henry, how about you go first?
C
If you don't mind? Yeah, yeah, sure. So I don't have anything ready to demo yet, but I wanted to give a short update on where we are with Redis right now. I merged most of the things that you created in your branch, Skarbek, to master. There will be one more MR coming, and this gives us the scripting to be able to test the switch to using DNS names on the VMs, and that should work on parameterized clusters and nodes and hosts.
C
So we can use this later, in the same form, for doing the real switchover, maybe with some rework. What I'm doing now is testing exactly this procedure, to see if I spot any flaws or improvements that I can make there, and then see if we can get to a state where we maybe want to try this in pre, even — but we need to see how comfortable we feel with that. So that's where we are with this right now.
C
A
C
Yeah, so that's what I wanted to report, if there are no other questions.
A
C
I want to test this procedure in our sandbox cluster first, to be really sure that it works right, to find all the possible flaws, and then basically know what could go wrong. Once we feel comfortable with that — and I hope there's not much missing for that — we should see how we can do the same in pre, right: applying those scripts on pre, which should switch over the…
A
C
Yeah, so this is testing whether our automation so far is sane enough and working, so that we can use it in production. Also, what we need to test is to try running this with some load — like something which is doing requests — so we see if we fail, or how much we fail.
E
C
While doing the switchover — because it involves doing failovers and then switching configurations, resetting Sentinels — we need to see how many requests we might lose in between, maybe. This is what we need to test here.
B
That's effectively where I dropped off. Like, one of my notes, and one issue that's assigned to me, was: I would like to get the benchmark process of that environment running again, and perform the failover as the benchmark is running, to see, you know, what's the failure rate of those requests. So that all makes sense.
C
A
C
Yeah, let's do that. So, Skarbek, I'm officially taking this over from you. Okay.
C
B
I don't have a formalized discussion, so I'm just going to try to explain what I've been looking at recently, so that all of us can be aware, and then maybe, if anyone has ideas or thoughts, brain-dump them on me. I'll share my screen, because I don't really have a good way to present this without fumbling over my own words.
B
So, a little while ago we started to see more and more issues where, when production is ready to be deployed, or when someone is doing a feature flag change, we run a simple Prometheus query to determine whether or not our services are healthy. And you can see from the screenshots that Amy provided: the main stage of Sidekiq was not healthy during one run, and apparently two minutes later someone ran the same command and Sidekiq isn't there at all, and our git stage went unhealthy.
B
So I started looking into this, because this is rather concerning — we don't want this to happen. Obviously we do have active work happening on our metrics from the Scalability side of things, so, you know, maybe something regressed and made things worse, or what have you. So I started looking into this, and one of the things we noticed was that our metrics had holes in them, which…
C
B
Good. And this is a metric — so this is the actual measure that we use. It's called the GitLab deployment health service metric, and we've got one for apdex and one for errors. We mash those two together, and if we get a value of one, things are healthy; if we get a value of zero, things are not healthy. And we've got this very complicated…
B
I drew up a chart that does this. Prometheus doesn't provide an easy mechanism to do this, so there's actually some complicated logic — I'm going to save that for outside this conversation — but effectively we're looking at the service ratio for one hour, for five minutes, for six hours and for thirty minutes, and we combine them too. And if we're good, we actually have a value of zero, which we flip to one.
B
If we have a value of one, we flip it to zero, and vice versa, to determine whether a given service is healthy or not. The problem is that these metrics — so the GitLab service errors ratios over six hours and one hour — are quite full of holes: for some reason there's just no metric showing up. If we decompose what those metrics consist of, we do see them dip for various services; they do occasionally dip to a value of zero and back up to one.
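A minimal sketch of the kind of combined rule being described, assuming hypothetical input names for the per-window apdex and error ratios (the real GitLab recording rules are generated and named differently): each input is taken to be 1 when its window is violated and 0 when it is fine, so combining, clamping and flipping gives 1 = healthy, 0 = unhealthy.

```yaml
# Hypothetical sketch only: metric names, labels and windows are assumptions
# based on the description above, not the actual generated GitLab rules.
groups:
  - name: deployment_health_sketch
    rules:
      - record: deployment_health:service:sketch
        # Each *_violated input is assumed to be 1 when the apdex/error ratio
        # for that window is out of budget, 0 otherwise, and all inputs are
        # assumed to carry identical label sets so the additions match.
        # Clamping and flipping yields 1 = healthy, 0 = unhealthy.
        # Note: if any input has a gap, the whole result goes missing, which
        # is exactly the symptom described in the meeting.
        expr: |
          1 - clamp_max(
              service_apdex_violated:ratio_1h
            + service_apdex_violated:ratio_6h
            + service_errors_violated:ratio_5m
            + service_errors_violated:ratio_30m
            , 1)
```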
B
So when looking into this, I think I found somewhere that, well, metrics are missing, so I thought one of the things I could do was: let's just get rid of the metrics that are missing. And Andrew is against doing this, because this picks up errors that are occurring over the course of the lifetime of a given service — even though it's a six-hour window, it's looking at the errors over the course of history.
B
So if errors are increasing over the course of six hours, there could be some other inherent issue that just doesn't show up within a short time span of an hour. So it's wise to keep this metric in place: we're trying to capture everything, all-encompassing, in a given deployment package.
B
What we found in our investigation was, obviously, that, you know, the metrics are full of holes, but we're also getting timeouts — metrics are simply not being… somewhere in here… goodness, where is it?
B
B
So if I just do this: one metric is simply called the completion seconds bucket. This is what feeds into the apdex, for example. And this is already taking a really long time to render, given the amount of metrics that we produce — so much so that, you know, Thanos is crashing on my machine — but we got over 8,000 series returned for just this one metric alone. So the cardinality for this one metric is extraordinarily high.
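To make the cardinality point concrete, one way to see it is to count the series behind the histogram. The metric name below is an assumption inferred from "completion seconds bucket"; the counting rule itself is generic.

```yaml
# Hypothetical helper rule for tracking cardinality; the metric name is an
# assumption based on the discussion above.
groups:
  - name: cardinality_sketch
    rules:
      - record: sidekiq_jobs_completion_seconds_bucket:series_count
        # Counts the distinct time series behind the histogram. Each worker,
        # queue, "le" bucket and pod label combination multiplies the series
        # count, which is how a single metric ends up with thousands of
        # series across the fleet.
        expr: count(sidekiq_jobs_completion_seconds_bucket)
```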
B
Yesterday, when I was looking at this issue, at the time I was looking at it we had over 570 pods running in production. So for this one metric alone, for every single pod, we've got 500 times 8,000 series — that's quite a lot, and I think that's the problem we're suffering from when we try to query Prometheus and try to build our metrics.
B
B
I didn't screenshot it — that's why I can't find it — but the thing is just this: Prometheus has built-in timeouts, and we're simply timing out querying this extravagantly large amount of metrics overall.
B
And this is it: rule evaluation failed, query timeout. And this is happening for all of our Prometheus instances inside of Kubernetes in production today. So this is not really a small problem — this is kind of a large problem, because it's going to impact quite a few things. This bubbles all the way back up to our reporting: the metrics that we use for determining how healthy services are over lengthy periods of time, like looking at a month overview, for example.
B
B
B
I think I calculated that Sidekiq outputs over 10,000 metrics per pod, which is excruciatingly large. So, like I mentioned before, Scalability is already working on metrics as a whole, for various other aspects, but one of the things that we're considering is lowering the number of buckets that we use to capture the completion times for various workers, as well as the buckets that we use for capturing database-related metrics for Redis and Postgres. So shrinking metrics will certainly help us.
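The bucket reduction itself happens where the histograms are defined, in the application. Purely as an illustration of the same "fewer series" idea on the scrape side, here is a hedged sketch of dropping a subset of histogram buckets via metric relabelling; the job name, metric name and bucket boundaries are all assumptions, and dropping buckets reduces the resolution of histogram_quantile over what remains.

```yaml
# Hypothetical scrape-config fragment; names and bucket values are assumed.
scrape_configs:
  - job_name: gitlab-sidekiq
    metric_relabel_configs:
      # Drop a few fine-grained "le" buckets of one histogram at ingestion
      # time. Every dropped bucket removes one series per pod per label set.
      - source_labels: [__name__, le]
        regex: sidekiq_jobs_completion_seconds_bucket;(0\.01|0\.05|0\.1|0\.25)
        action: drop
```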
B
I guess the question I have for anyone on this call — because I don't know how to look into this appropriately — is: are there other things that we could potentially look at doing that don't involve touching metrics? Because that's already being worked on, and hopefully we can be successful in that realm. But we also have Prometheus, where we've got one Prometheus per cluster.
B
I wonder if we need to split out these Prometheus deployments into many, and then have those Prometheuses target specific items. So we might have a Prometheus dedicated to Sidekiq, or a Prometheus dedicated to specific shards of Sidekiq — or, maybe more simplistically, we have one Prometheus dedicated to GitLab and then another Prometheus dedicated to everything else.
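A minimal sketch of what a Sidekiq-only Prometheus could look like with Kubernetes service discovery. The namespace and pod label are assumptions for illustration; the real fleet is configured differently.

```yaml
# Hypothetical config for a Prometheus that scrapes only Sidekiq pods.
scrape_configs:
  - job_name: sidekiq-only
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [gitlab]          # assumed namespace
    relabel_configs:
      # Keep only pods labelled app=sidekiq; all other discovered targets are
      # dropped, so this server carries none of the web/api/git series.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: sidekiq
        action: keep
```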
B
I'm looking at other ways that we might want to start thinking about to alleviate this problem, because I don't see it going away. In fact, I see it getting worse over the course of time, because GitLab is only going to grow. Prometheus has already been a source of contention: you know, we occasionally run into problems with WAL files, and we need to completely trash and restart Prometheus and delete all of its data, so we have to rely on the redundant node to provide us those metrics so that we don't lose any.
B
I don't know — open question. I don't know if I covered the topic very well, but, you know: feedback or questions?
A
B
So that's the thing that I'm looking into right now. This is a blocker for me, because I don't know of another good way to fix the problem for team Delivery — I think our metrics are where we source all of this information.
B
So if we try to build some mechanism inside of release tools that says "ignore the metric if it's empty", that is a very dangerous thing to be doing. If we tell release tools "hey, this metric is missing, even though it should be here", we're not fixing the problem — we're just exposing a problem that we already know very much is a problem today.
B
B
C
I think this is a really complex topic. Yes, for one, I mean, it needs to be fixed at the root, because the whole of engineering is depending on working metrics, right? So we need to find a solution to overloading our Prometheus or Thanos instances with too many metrics or too-high cardinality.
C
One approach I saw mentioned in the issue was that there's an epic for splitting up Prometheus and Grafana, as you mentioned — maybe having specific Prometheuses for different components, or something like that. So there's an epic for this already, but I guess there's some work to be done as an immediate solution, or a patch just for us.
C
The only thing I could think of is that we try to come up with some recording rule which tries to, you know, piece together the metrics that are mostly working and leaves out what has gaps, maybe. But then we wouldn't have coverage for Sidekiq — though maybe we could find a replacement query or recording rule for this one which makes it, you know, not as heavy, and thus working.
C
So this would be patching around the current problem: we would have our own recorded metric for saying "this service is healthy", which could lean on what we have already where it's working, and replace the metrics with gaps with something that's just, you know, not as heavy, maybe. But this would also be some work, and then it's just a patch around a recurring problem. Right, yeah.
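A hedged sketch of what such a Delivery-only patch rule could look like, leaning on short-window ratios that still populate and simply omitting the gap-prone 1h/6h inputs. All names and windows here are assumptions, consistent with the earlier sketch; the real rules are generated, as discussed below.

```yaml
# Hypothetical Delivery-specific patch; input names and windows are assumed.
groups:
  - name: delivery_health_patch_sketch
    rules:
      - record: delivery:deployment_health:service:patched
        # Same flip as the existing rule, but built only from the
        # short-window inputs that are assumed not to show gaps.
        expr: |
          1 - clamp_max(
              service_apdex_violated:ratio_5m
            + service_errors_violated:ratio_30m
            , 1)
```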
A
So I guess, like, in terms of — I mean, thanks for sharing that effort. Are we in agreement that that's, like, the proper solution?
B
I don't know if it's the proper solution; it's one idea that we have that could probably help us out here. Like, our Prometheus right now is just overloaded: you know, it's consuming an enormous amount of RAM, it's consuming an enormous amount of CPU. For some of our zonal clusters we have nodes dedicated to running only Prometheus and effectively nothing else.
B
But there's a lot of work to make that happen. Like, that's not going to be an "oh, let's just deploy it and we're done". That's not a one-day thing: that's going to be a multi-week, multi-month project to get it going, reconfigure things, make sure we didn't break anything, etc. It's not a quick solution — that's my concern.
B
A
Yeah, that makes sense. Okay. And the idea you suggested, Henry — what do you reckon in terms of, like, rough size for that?
C
It's hard to say — I mean, that just came to my mind — but this would just involve getting deep into the rule set that we have for Prometheus and figuring out how we can set up our own recording rule, which would then need to leave out some parts which are not working really reliably, and trying to find replacement queries for those which are working, and patching this together.
C
So that would be — I mean, just working on those recording rules. We'd need some knowledge about how this all fits together and works, because this is some complex generated JSON stuff that we're dealing with. But maybe we can just make an easier manual patch around this instead of trying to, you know, auto-generate a lot of stuff here — that could work. That would be a good discussion, maybe with people like Andrew or maybe Igor, or whoever else is deep into our recording rules.
A
I would probably prefer, like — if it's — it sounds like weeks to months for us to do that, I would say.
C
I don't think it's weeks and months — it's just spending a few days, maybe, looking deeply into our recording rules. This is just working on some kind of code, our Prometheus expressions there, probably, and I think that would be work where we can just say: okay, we try for three days to get this fixed this way, and if not, we just stop here. But it shouldn't take much, much longer, I think.
B
C
B
Like, I don't know what else uses these metrics. I didn't know that feature flag changes were using these metrics — I thought it was just our own Delivery tooling. Like, I thought these metrics were built specifically for Delivery and no one else, so learning that this is used elsewhere has me a bit nervous. If other people are relying on these metrics, then whatever solution we choose with Henry's idea, we need to make sure that we clearly advertise that, hey, these metrics may not be correct because of x, y and z…
B
…for the moment, until we have a better permanent solution. And then, if we implement said solution, I guess a further facet of this idea is: I want to make sure that we don't leave that patch in place for long, because I don't want to be giving people incorrect information, and I want to make sure that we're going to at least get towards the final solution in the long run — which would probably involve that epic that Henry mentioned earlier.
C
I don't think we need to patch the current recording rule — we can just leave it in place, so it keeps working.
C
So we just need to make sure that whoever wants to use it knows that this is kind of incomplete and is patching around a problem. But I guess, if you name it accordingly, not too many people would try to use it, right? "GitLab deployment…"
E
C
…rely on recording rules already. So everything you're relying on is based on recording rules, mostly, and there are several layers of abstraction and JSON generating these — and I think Andrew is the only person who really understands what it is. But yeah.
A
Okay. For now — given that — it's great we have an option. I would rather not have to patch something specifically for Delivery if we can avoid it. Let me go off and see what we can do, because it sounds like this one is a cross-department…
A
…effort — something will come out of it. But let me see what we can actually pull together and figure out, so that we can get an idea of what the real fix that goes in for this is going to be, and when we can maybe expect that, or contribute to that. And then, once we know a bit more, maybe we make a plan for whether we want to actually — like you say, Henry — time-box a few days and see if we can actually put something up for Delivery.
B
So one of the items that Bob proposed reduces the number of metrics we have by a certain amount. I don't know how to evaluate how significant that proposal is until after it's done, unfortunately — and this is just due to my lack of knowledge of Prometheus in this particular case. I would love to see—
B
If we accept that proposal, I would love to see what kind of impact it has, because if we can reduce any metrics that we have, I think that'll be beneficial to us. So I'd love to see us try to see that through first, if possible, and…
A
Yeah, I absolutely agree. And I think on this one, Skarbek, when you say "we" — all right, let's all remember that's Infrastructure. So, like, Bob is actively working on this proposal for reducing — it's in the improvements for error budgets — reducing the number of metrics we're capturing.
A
I see Steve's commenting on the epic Henry shared, on 623, about the options there. So there are a lot of people talking about this and sort of thinking about this. So I think that's — let's see how we can coordinate a bit of a plan to actually fix the problem.
C
That's one question — and how much is this blocking us? I guess it's annoying, because we see these in our production health checks, right, but that's…
A
A
E
C
We could go to the dashboard for the service that we're looking at right at this moment and see if it's looking healthy there — that would be a workaround for now, to be sure: okay, it's still okay. Because there we don't mix all of this together; we have, you know, just the standard apdex dashboards over a shorter time range, and those are working normally. And I think the really bad thing is just if you want to automate, right?
A
So, about one deploy a day — so it might be worth, like: does somebody want to actually just put together a small thing, then, so that a release manager has, like, a quick-step guide of: you run this thing; if you're not seeing all of these metrics, go here, look for this; double-check before you hit the button.
B
I could do that — awesome, I'll work on that, because that's a good thing to have.
A
And then that kind of gets us through this immediate problem. But yeah, like you say, Henry, we won't be able to switch to automated promotions — but it sounds like, from Skarbek's kind of initial overview, that's actually the least of our problems on this metrics one right now. So hopefully we'll actually, like, see a project come together quite quickly.
A
Awesome — thanks for bringing that one up, Skarbek, and for the investigation — actually, like, super, super great. It was a much more interesting investigation than I expected from the issue when I opened it.
B
Okay, well, that's all I had. Any other questions, comments or fun topics anyone wants to bring up?
A
I think probably two fun topics. So, one: I don't know if you've all met Liam — Liam has just joined Platform. It was the week before last, right, Liam? I've lost track of time. So—
D
Yeah, it's been nearly two weeks now — it's flown by, yeah. I don't think I've met the majority of the people on the call, actually; I've seen your names lots, of course, around Slack and on GitLab. Yeah, I joined from the Manage stage, which I was working with for about three and a half years, and I felt as if my time there was coming to an end, and it was fun to come over and see what was going on on the infrastructure side of the business.
D
So I've joined Scalability, working alongside Rachel to spin up a second team there. And so, yeah, I'm excited by the challenges that exist over on the infrastructure side, and there's lots to learn, lots to do and, yeah, lots of opportunities for growth for me and for the company.
A
Yeah, awesome — good to have you over here, Liam. And for those of you on Redis: Liam's going to be sort of starting out getting kind of caught up on the Redis project, so you'll get to know Henry probably first, and then maybe Akamada should come off release management in a couple of weeks or so, and you might be able to get more involved there as well.