Description
Working on understanding why there was downtime in a recent upgrade and figuring out steps to mitigate it. https://gitlab.com/gitlab-org/gitlab/-/issues/225684#note_375956757
B: We're on 12.10, they go here, you just look at the admin panel, yeah, and Postgres 11.7. We're going to update to the most recent 13.0, which is 13.0.10. We'll be monitoring the HAProxy dashboards and some production logs, and running some tests here to see, yeah, when we see any potential issues. We'll be monitoring and logging all the details in this document here, and with that I think we can get started. So here, the instructions... no, at the very beginning.
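Note: for reference, the starting versions can be confirmed from a shell on any Omnibus node; gitlab-rake gitlab:env:info and gitlab-psql are the standard bundled tools, though the exact output varies by version.

    # Show the GitLab version and bundled component versions
    sudo gitlab-rake gitlab:env:info

    # Confirm the bundled PostgreSQL server version
    sudo gitlab-psql -c "SELECT version();"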
B: Yeah, it's like, so many things on my screen. Okay, so first we're going to choose a deployment node and check if it's running. So we're choosing a Sidekiq node, okay. So each deployment, primary site and secondary site, has two Rails nodes, two Sidekiq nodes, a Gitaly, Postgres, and Redis. Did I get everything? Yeah. So we're going to use Sidekiq 2 as the deploy node, and I've already run the status here and can see that it's currently not running, and there are links in the document.
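Note: the status check mentioned here is presumably the standard Omnibus one; a minimal sketch:

    # On the chosen deploy node: list every service Omnibus manages
    # and whether it is up (run/down, uptime, PID)
    sudo gitlab-ctl status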
B: ...a good way of doing this. Well, anyway, after step one, which is stopping the deploy, step two: we make sure that this file exists, skip-auto-reconfigure, and we also want to do that on all nodes, including the primary deploy node. So, all of the nodes here, yeah. I did that before we started recording, just to make sure that it's there on each one.
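Note: creating that marker file is a one-liner per node; this is the documented Omnibus mechanism that stops a package install from running reconfigure automatically.

    # On every node, including the deploy node: an empty file here tells
    # the package's post-install step to skip the automatic reconfigure
    sudo touch /etc/gitlab/skip-auto-reconfigure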
A: Yeah, are we only... so, like, I don't know that much about the downtime, like, the no-downtime deployment strategy, because we could only do downtime deployments, basically, because we had to recompile everything. They took an hour at a minimum, pretty much every time. And yeah, people always got really mad at me because I didn't want to do them after business hours, and they were like, oh, but I think it worked out, and I'm like, well, me too, and I also want to not do this... not at work, so, have fun. Yeah.
A: So we upgrade the other nodes first, before running the database migrations. That's the strategy, right? Or is it generally the deploy node first? I can't...
B: No problem, let's go ahead and update the deploy node, which is up here.
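Note: updating the deploy node presumably means installing the target package version; a sketch assuming a Debian/Ubuntu node running the EE package (the package name and version pin are assumptions based on the versions discussed above).

    # With skip-auto-reconfigure in place, installing the package
    # does not trigger a reconfigure on its own
    sudo apt-get update
    sudo apt-get install gitlab-ee=13.0.10-ee.0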
B: Did we... when it finished... so I think it's still running. Okay, so that normally issues a request. So now we can restart Sidekiq.
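Note: restarting only Sidekiq on that node is a single gitlab-ctl call:

    # Restart just the Sidekiq service; other services keep running
    sudo gitlab-ctl restart sidekiq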
B: So we took a break after updating the first Rails node. We started seeing some pipeline failures, which we've noted in the document. After that I started a new pipeline with the same tests. We're not running anything right now in terms of upgrade commands, and we're seeing failures which look like 500 errors as well, looking at the output here.
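Note: one way to confirm those 500s server-side is to grep the Rails structured log (a sketch; the path is the Omnibus default, and the exact field pattern depends on the log format in this version).

    # Count 500 responses recorded by Rails on an affected node
    sudo grep -c '"status":500' /var/log/gitlab/gitlab-rails/production_json.log

    # Watch for new ones while the test pipeline runs
    sudo tail -f /var/log/gitlab/gitlab-rails/production_json.log | grep '"status":500'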
B: Okay, okay, this one failed again. Okay, well, let's look at the...
B: ...would mean that we still have the locks and everything. It's just, yeah, I don't see anything happening right here, so...
B: Yeah, we could. I just, I don't have any, really; I just have the readiness checks running right now. There's no... that's true. I'm still waiting for these to start, and they take about three minutes.
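Note: the readiness checks here are presumably polling GitLab's built-in health endpoints, which look like this when hit locally on a node:

    # Health probes exposed by GitLab (allowed locally or from the
    # monitoring IP whitelist)
    curl "http://127.0.0.1/-/readiness"
    curl "http://127.0.0.1/-/liveness"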
A: Yeah, I mean, we definitely saw downtime in that reconfigure, so I don't know that this will necessarily make a difference. I'm also not sure what we count, like, what our SLA counts as downtime, and whether or not Geo sync... whether a Geo site not syncing is considered downtime or not. I have no idea; I don't know how these things work, yeah.
B: Okay, good, so that's that step, we've got that. Okay, so we configured Sidekiq. So let's go back to our... or, not back; let's go to Rails 1.
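Note: "configured" here means running reconfigure by hand, since the skip file suppresses the automatic one after the package install:

    # Apply the new package's configuration; with skip-auto-reconfigure
    # present this must be run explicitly on each upgraded node
    sudo gitlab-ctl reconfigure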
A: I see, they seem to... it seems to have stopped getting them on the script, so the script got them for about 26 seconds this time. Okay.
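Note: the script itself isn't shown, but a probe loop along these lines (purely a hypothetical sketch; the URL and one-second interval are assumptions) would surface an error window like that 26-second one.

    # Hypothetical availability probe: print a timestamped line whenever
    # the endpoint returns a non-200, so an outage shows up as a burst
    while true; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "https://gitlab.example.com/-/readiness")
      [ "$code" != "200" ] && echo "$(date -u +%T) HTTP $code"
      sleep 1
    done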
A: I was just going to say, I know for me, historically, when I build zero-downtime upgrade strategies, I almost always do Sidekiq first, just because then you don't have to worry about Rails enqueueing jobs that Sidekiq doesn't understand, and it makes your upgrade strategy a little easier. But that doesn't mean that, if we're not telling people to do that, then... I don't know that we shouldn't... we shouldn't expect that.
B: Okay, so we're over here. Yeah, there are some checks I think I've already done. Let's go to that panel.
B: Okay, so that was reconfigure on the two Rails nodes. Let's see... yeah, it's always a lot to do. Next are the Postgres and Gitaly and Redis nodes, and we do Postgres first. Oh...
B: Well, we have to run the refresh... oh right, so wait for... yeah, I was just going to this step to look at the database replication lag, and it's at zero. So now I'm going to run this on the deploy node on the secondary.
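Note: the replication lag figure can be read straight from the replica's database; this is the standard PostgreSQL streaming-replication query, run on the secondary's Postgres node.

    # Time since the last replayed transaction; close to zero when the
    # replica is caught up with the primary
    sudo gitlab-psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"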