From YouTube: 2022-12-09 Managing Mixed-Version Deployments
Description
Discussing the current mixed-version environment setup and tests. Related to https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8145
A
Okay, so: mixed-version testing. Yesterday we had an incident. I'll just mention a little bit of what's interesting; the incident itself has the specific details, but most generally what happened here was this: we had a failure on the staging environment which led to tests failing. Those tests were then quarantined with an MR, which meant a new package was created that needed to be deployed, right.
A
So here we have our pipeline. We had gone through this already, right? We'd already put the package through staging Canary, we'd already run the tests, and then we had our failure on staging.
A
What this meant is that when we got the new package, we basically needed that new package on staging in order for our tests to begin to pass again. The reason for that is that the staging Canary QA tests at step two are designed, as well as testing the new package and the new functionality, to test mixed-version compatibility.
A
So what does that mean? Mixed-version compatibility was a problem we were previously seeing, because our original pipelines deployed to staging, then to production Canary, and then to production. Now, when we have a canary environment and a main environment, they share a database.
A
So our production Canary and our production are using the same database, and what we had been finding was that occasionally, when we had a mixed-version incompatibility, users (and unfortunately it was usually users on production) were trying to use features where the database had been updated by the production Canary deployment and the code on production wasn't backwards compatible, so users would have problems. That was the original problem we tried to solve through this mixed-version project, and we solved it in two ways.
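The shared-database failure mode being described can be sketched roughly like this. It is a toy model: the function, table, and column names are hypothetical, not GitLab's actual schema or migration code.

```python
# Minimal sketch of the mixed-version hazard described above: Canary and
# main production share one database, so a migration shipped by the Canary
# deployment is immediately visible to the older code still running on
# main production. Table and column names here are illustrative.

def canary_migration(db):
    """The new package's migration renames a column."""
    db["users"] = [{"full_name": row["name"]} for row in db["users"]]

def old_production_code(db, idx):
    """Main production still runs the old code, reading the old column."""
    return db["users"][idx]["name"]

db = {"users": [{"name": "alice"}]}
print(old_production_code(db, 0))   # works before the Canary deploy

canary_migration(db)                # Canary deploys and migrates the DB
try:
    old_production_code(db, 0)      # old code now fails for real users
except KeyError as err:
    print(f"production error: missing column {err}")
```

The point of the sketch is that the breakage hits users routed to the older environment, exactly as described: the migration itself succeeds, and only the old code path fails.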
A
Right, that makes sense. So we got unlucky: it's a known edge case, but it's not a common one. We got unlucky with our incident because what we deployed on staging failed tests, and the recovery for that was a quarantine of the tests in a new package. We deployed that to staging Canary, where it was fine and happy, but the mixed-version tests were still hitting staging, which had the previous version installed, and we don't have a way on our current pipeline to actually say "progress to staging".
B
So yeah, I think that was a really good explanation. I have two comments slash questions. The first one is: we quarantined the test, which kind of tells me that the staging Canary smoke tests and the QA tests that interact with staging...
A
...are pretty much the same tests? Yes, they might actually be the same tests.
B
So I guess what I'm more or less confused about is: when that set of tests got to staging Canary, they should have been quarantined, and my understanding of quarantine is "don't run them".
A
Oh, I see what you mean. I don't know this for sure, but I think what may happen (and perhaps this is where one of the improvements is; this is a super good question, though) is that what the staging Canary tests are actually doing is triggering the staging tests, rather than there being one suite of tests that hits the two environments, which is actually what I thought. Yeah, maybe you're right.
B
Yeah, because once we're quarantining a set of tests, we kind of expected it to quarantine on staging Canary, but it didn't, because the staging QA tests were not quarantined at that point.
B
Why run that on the canary stage? Or, if we run it on the canary stage for staging, then why do we run it again after staging? Okay.
A
Yes, that's a great question, and that one I can definitely answer. On your first question: it's a super great question. I think what we probably want to figure out, or find out from Quality, is a little bit more about that structure. I actually don't know, but your question is a very good one, which is: how are those tests basically linked to the environments?
A
On your second question: we actually don't run those tests after staging anymore. The tests after staging have moved, and they now sit on the post-deployment migration pipeline.
B
Oh.
A
Is that what I understood? So your original question, around how the quarantine didn't end up affecting staging: I think it's a really interesting one, but my understanding is that these tests are hitting both the staging Canary and the staging environments. Basically, yes, there are two suites in there (or there are various different suites), and they are designed to test a new package and also to look specifically for mixed-version problems that would be visible on staging, right.
A
You know, Zeph was super involved in designing these tests and he led all the test designs. So he'll certainly be able to explain, or point us to the documentation on, how things are set up and what we can expect there, yeah.
B
My second point is: it kind of smells like, what's it called, how we use a pick label and then it does different things with the pipeline. That kind of feels similar as a solution to this, but it also feels dangerous...
B
If
we
don't
know
what
it
does
so,
for
example,
I'm
thinking
of
a
label
that,
like
I
don't
know,
say
like
quarantine
tests
as
a
label
and
that
Mr
gets
merged
and
because
it's
like
I,
don't
know
in
an
incident,
it
would
deploy
immediately
to
Canary's,
staging
and
staging
in
order
to
get
those
quarantine
tests
passing
on
both
environments,
but
then,
like,
if
say,
like
the
release,
honors,
who
doesn't
know
that,
then
they
can
like
promote.
Another
package
that
possibly
went
through
have
a
really
weird
staging
deploy
going
at
that
point.
A
I don't know, probably something in that for sure, yeah. Somebody mentioned in the incident channel (I can't remember who it was) that this possibly ties a little bit as well to some work that Engineering Productivity are doing: they're working at the moment to allow only the tests you need to run to be running. So when you do a merge, you wouldn't necessarily run every test; it would detect what you've actually been changing and run the right tests.
A
This is an option, isn't it? Oh, when you said that, you also reminded me of one other thing you mentioned very near the beginning of the call which is relevant to this: we resolved this incident by deploying directly to staging, so deploying the package we had on staging Canary directly to staging.
A
We did take a little bit of risk with that, because it was a package with other changes in it which hadn't fully passed testing. Now, a slightly less risky option, and the one that Myra was sort of helping investigate, was whether we had a package which was a previously tested package with just the quarantine MR on top of it, which I think we probably did have somewhere, because the pick label was used. That would have been another way.
A
If things had been super, super risky, we would have had an option to roll staging back from the version you deployed, so we could have recovered staging that way, right? Because the deploy you did was completely safe there, so that bit wasn't super risky at least, yeah.
A
Exactly, yeah. There was definitely some mixed-version risk, because what you lost was the mixed-version testing, basically, because we stuck the same package version onto both environments. So you did lose a little bit of testing, but overall the risk wasn't massive, and I think, when we'd been blocked for that many hours, it was a decent trade-off.
A
We had the introduction of the mixed-version tests, which Zeph from Quality led, and those two pieces together should be giving us some mixed-version protection.
A
And this is one of the things that now gives us a few extra complexities. So, for example, all of this mixed-version testing is the reason why, when we roll back staging, we then put it back onto the right version, because otherwise our mixed-version testing is testing against the wrong version. What we really need is for staging Canary and staging to end up with the same two versions that will be on production Canary and production, because that's basically what we're testing for.
A
So that's the reason why we roll back staging and then redeploy onto it as well, to get that versioning. It also links a little bit to another issue which we have going, about how much locking we need to put around the post-deploy migrations environments. Again, it's a similar thing, which is: do we have enough protection over the staging version to be able to make the mixed-version testing dependable?
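The invariant being described, that the staging pair must mirror the versions the production pair will have, can be stated as a small check. This is only a sketch; the function name and version strings are made up for illustration.

```python
# Sketch of the version-matching invariant: mixed-version testing on the
# staging pair only predicts production behaviour when both pairs run the
# same (new, old) version split. Version strings are illustrative.

def mixed_version_signal_dependable(staging_canary, staging,
                                    prod_canary, prod):
    return (staging_canary, staging) == (prod_canary, prod)

# Staging mirrors what production will look like mid-deploy: dependable.
print(mixed_version_signal_dependable("15.7.1", "15.7.0",
                                      "15.7.1", "15.7.0"))  # True

# Staging was rolled back without redeploying the right version: the
# pairs diverge, so the mixed-version signal is no longer meaningful.
print(mixed_version_signal_dependable("15.7.1", "15.6.9",
                                      "15.7.1", "15.7.0"))  # False
```

This is why the roll-back-then-redeploy step matters: rolling staging back alone satisfies the tests but silently breaks the invariant the tests exist to check.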
B
Yeah, that actually reminds me, and this is kind of going back to investigating what happened during the 24 hours this was down for, basically. My question would be: this package got deployed to staging, the one with the failing set of tests? Yes?
A
I believe so, yes. I think it does, I think it does. This one was a bit unusual, because I think we knew quite quickly the cause of the original failures, which was down to not having enough resources available in whatever the projects were.
A
The tests couldn't create the repositories they needed, yeah. So Quality spent quite a lot of time cleaning up, which was all good, but that just wasn't enough to restore the stability of this test, basically, which was why it got quarantined. So it was kind of like two efforts to recover from there.
B
I see, yeah. I'm looking at this; it's the QA "reliable" one out of ten? I'll take a closer look at this later. But I guess, for the next time: this was kind of a special case, because we were blocked for so long and we kind of wanted to get packages through, but I think the next time we run into this, the correct path is actually to roll back staging instead of deploying this. Yes.
A
Is this a thing that we can roll back from? Because you're right, actually, yeah. Maybe we need to be rolling back staging a little bit more for these things, because actually, if this had been a quarantine MR for a test that had come from a software change, you're absolutely spot on: we should have brought staging back and got back to the right versions that way.
A
Do you know where we might... actually, I know we were able to do a push, that's fine. It's almost an example, though, for a hot patch. I wonder if we should have hot patched staging. How would you hot patch? So, we have a different process: a whole different pipeline, a whole different tool, where you can basically generate a project link and, effectively, a code patch that bypasses our deployment pipeline and gets applied directly onto an environment.
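The two delivery paths just described, a gated pipeline deploy versus a hot patch that bypasses the gates, can be modelled as a toy sketch. All gate, environment, and function names here are hypothetical, not the actual tooling.

```python
# Toy model of the two paths: a normal deploy must pass every pipeline
# gate first, while a hot patch is applied directly onto an environment,
# bypassing those gates. Names are made up for illustration.

PIPELINE_GATES = ("staging-canary", "qa-tests", "staging")

def normal_deploy(package, environment, gates_passed):
    """A package only lands once the full pipeline has passed."""
    if tuple(gates_passed) != PIPELINE_GATES:
        raise RuntimeError("package blocked: pipeline gates not all green")
    environment["code"] = package

def hot_patch(patch, environment):
    """Applied straight onto the environment, no gates: fast, but it
    skips the usual testing, which is why the process needs rehearsal."""
    environment["code"] = environment["code"] + [patch]

staging = {"code": ["v1"]}
try:
    normal_deploy(["v2"], staging, gates_passed=["staging-canary"])
except RuntimeError as err:
    print(err)                      # the stuck-package situation

hot_patch("quarantine-fix", staging)
print(staging["code"])              # ['v1', 'quarantine-fix']
```

The sketch mirrors the trade-off in the discussion: the hot patch unblocks the environment immediately, at the cost of the guarantees the bypassed gates would have given.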
A
And this is something which is a bit dormant at the moment, and what we need to introduce for our hot patch process is a bit similar to our rollbacks practice.
A
We need a way where we run these semi-regularly or regularly, so that we can actually make sure they're still working as expected, and so that everybody knows it's an option and how to use it. We don't do it on production except for an S1, but actually, if we hadn't been comfortable... say, for example, yesterday the next package headed for staging had like 500 changes in it, and one of those looked super risky, or for whatever reason we decided it looked too risky to miss those tests...
A
Yeah,
we
probably
wouldn't
have
been
to
roll
back
able
to
roll
back,
because
I
said
I,
don't
actually
think
it's
a
new
test.
I
think
it
just
became
unstable,
but
we
should
have
validated
that
and
then
our
final
option
would
be
to
Hot
Patch
onto
staging
and
store
the
test.
That
way.